Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!sri-unix!teknowledge-vaxc!uw-beaver!tektronix!reed!psu-cs!omepd!mipos3!cpocd2!howard
From: howard@cpocd2.UUCP
Newsgroups: sci.bio
Subject: Re: question
Message-ID: <542@cpocd2.UUCP>
Date: Tue, 31-Mar-87 11:48:27 EST
Article-I.D.: cpocd2.542
Posted: Tue Mar 31 11:48:27 1987
Date-Received: Sat, 4-Apr-87 05:18:39 EST
References: <11189@teknowledge-vaxc.ARPA> <978@aecom.UUCP> <3310@udenva.UUCP>
Reply-To: howard@cpocd2.UUCP (Howard A. Landman)
Organization: Intel Corp. ASIC Services Organization, Chandler AZ
Lines: 51
Summary: A bit by any other name

In article <978@aecom.UUCP> werner@aecom.UUCP (Craig Werner) writes:
>	Hence, if a byte is a base pair, that's your answer, although
>only two bits are required to specify a base, ergo a 'byte' could 
>actually be a tetranucleotide, but most sequences are stored as
>letters (ATCG). 

In article <3310@udenva.UUCP> agranok@udenva.UUCP (Alexander Granok) writes:
>The whole argument gets caught up in definitions, here.  I would consider a
>bit to be a base pair, and a byte to be the set of three that encodes for one
>amino acid.  Instead of eight bits to a byte, there are three.  After all, one
>base pair by itself doesn't do much good.  But, if a base pair is a bit, then
>what is a nucleotide?  I guess it all depends on what you mean by
>"information."

Most of us use the standard definition, in which a "bit" is enough information
to answer one yes/no question.  A base pair is 2 bits because there are 4
possibilities, not 2, and it takes two yes/no questions to pick one of four.
Craig correctly points out that this means 4 base pairs could be stored in an
(8 bit) byte.  He also correctly points out that most nucleotide sequences,
when stored in machine-readable form, use one byte per base pair.  This makes
it easier to search for subsequences, reverse sequences, and find places where
genes overlap (as they do in some viruses).  But the information content is no
more than 2 bits per base pair; the other 6 bits are redundancy, and a
compression program could easily squeeze such a file to about a quarter of its
size.
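
To make the "4 base pairs per byte" point concrete, here is a minimal sketch
in Python.  The particular 2-bit code (A=0, C=1, G=2, T=3) and the function
names pack/unpack are my own assumptions for illustration; any fixed
assignment of the four bases to the four 2-bit values would do.

```python
# Hypothetical 2-bit code for the four bases (an assumption, not a standard).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"  # index in this string == 2-bit code


def pack(seq):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte is read the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from bytes produced by pack()."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

The sequence length has to be kept separately, because the last byte may be
padded.  A file of one-letter-per-byte sequence shrinks to a quarter of its
size this way, at the cost of making substring searches less convenient.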

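It is possible to store protein sequences using one byte per amino acid, and in
that case calling an amino acid a "byte" would be partly right.  Here again,
though, the real information content is less than 5 bits per amino acid, since
there are only 20 of them and 5 bits can distinguish 32 possibilities.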
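
The bit counts above can be checked with a few lines of Python.  This is only
a sketch: it assumes every symbol is equally likely, which gives an upper
bound (unequal frequencies would lower the real figure), and the function
names are mine.

```python
import math


def bits_per_symbol(alphabet_size):
    """Information per symbol, in bits, if all symbols are equally likely."""
    return math.log2(alphabet_size)


def whole_bits_needed(alphabet_size):
    """Smallest whole number of bits that can label every symbol."""
    return math.ceil(math.log2(alphabet_size))


print(bits_per_symbol(4))     # four bases
print(bits_per_symbol(20))    # twenty standard amino acids
print(whole_bits_needed(20))  # fixed-width code for an amino acid
```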

>"How many amino acids (words in the language of proteins) are encoded for on
>the human chromosomes?"

Since three bases code for one amino acid, the simplistic answer is N/3 where
N is the number of base pairs.  In reality things are messier: (1) there are
long stretches of DNA that seem to be doing nothing, (2) there are various
initiation and termination sequences that don't actually code for proteins,
(3) in some organisms a single stretch of DNA/RNA can code for more than one
protein by being read in overlapping reading frames (of which there are only
three per strand).
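
For anyone wondering where the N/3 and the limit of three come from, here is
a small Python sketch.  It just chops a sequence into triplets starting at
offset 0, 1, or 2; the function name and the sample sequence are made up for
illustration, and it ignores everything in points (1) and (2).

```python
def codons(seq, frame=0):
    """Split seq into complete triplets, starting at offset `frame` (0-2)."""
    usable = len(seq) - frame
    end = frame + usable - usable % 3  # drop any incomplete trailing triplet
    return [seq[i:i + 3] for i in range(frame, end, 3)]


seq = "ATGGCCTAA"  # made-up example sequence
for frame in range(3):
    print(frame, codons(seq, frame))
```

A fourth offset would just repeat the first frame shifted by one triplet,
which is why one strand read in one direction has only three distinct frames.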

>or "How many books could these words fill?"  I seem to
>remember Sagan doing something like this on Cosmos.

My recollection is that it was something like "1500 volumes of Encyclopedia
Britannica" for the human gene set, or a wall full of books.  But you should
be able to calculate that from the numbers Craig posted.  Just count the number
of pages in a book, count the number of letters on a typical page, and divide N
by both.
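
As a sketch of that arithmetic in Python: every number below is a placeholder
of mine, not a figure from this thread.  I assume N is about 3 billion base
pairs (a commonly quoted round number for the human genome), one letter per
base pair, 3000 letters per page, and 1000 pages per volume.  Substitute
Craig's N and your own page counts.

```python
def volumes(n_letters, letters_per_page, pages_per_volume):
    """Number of volumes needed to print n_letters letters."""
    letters_per_volume = letters_per_page * pages_per_volume
    return n_letters / letters_per_volume


# All three values are assumptions for illustration only.
N_BASE_PAIRS = 3_000_000_000
LETTERS_PER_PAGE = 3_000
PAGES_PER_VOLUME = 1_000

print(volumes(N_BASE_PAIRS, LETTERS_PER_PAGE, PAGES_PER_VOLUME))
```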
-- 

Copyright (c) 1987 Howard A. Landman.  Transmission of this material
constitutes permission from the intermediary to all recipients to freely
retransmit the material within USENET.  All other rights reserved.