Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!sri-unix!teknowledge-vaxc!uw-beaver!tektronix!reed!psu-cs!omepd!mipos3!cpocd2!howard
From: howard@cpocd2.UUCP
Newsgroups: sci.bio
Subject: Re: question
Message-ID: <542@cpocd2.UUCP>
Date: Tue, 31-Mar-87 11:48:27 EST
Article-I.D.: cpocd2.542
Posted: Tue Mar 31 11:48:27 1987
Date-Received: Sat, 4-Apr-87 05:18:39 EST
References: <11189@teknowledge-vaxc.ARPA> <978@aecom.UUCP> <3310@udenva.UUCP>
Reply-To: howard@cpocd2.UUCP (Howard A. Landman)
Organization: Intel Corp. ASIC Services Organization, Chandler AZ
Lines: 51
Summary: A bit by any other name

In article <978@aecom.UUCP> werner@aecom.UUCP (Craig Werner) writes:
> Hence, if a byte is a base pair, that's your answer, although
>only two bits are required to specify a base, ergo a 'byte' could
>actually be a tetranucleotide, but most sequences are stored as
>letters (ATCG).

In article <3310@udenva.UUCP> agranok@udenva.UUCP (Alexander Granok) writes:
>The whole argument gets caught up in definitions, here.  I would consider a
>bit to be a base pair, and a byte to be the set of three that encodes for one
>amino acid.  Instead of eight bits to a byte, there are three.  After all, one
>base pair by itself doesn't do much good.  But, if a base pair is a bit, then
>what is a nucleotide?  I guess it all depends on what you mean by
>"information."

Most of us use the standard definition, in which a "bit" is enough
information to answer one yes/no question.  A base pair carries 2 bits
because there are 4 possibilities, not 2.  Craig correctly points out that
this means 4 base pairs could be stored in an (8 bit) byte.  He also
correctly points out that most nucleotide sequences, when stored in
machine-readable form, use one byte per base pair.  This makes it easier to
search for subsequences, reverse sequences, and places where genes overlap
(as they do in some viruses).
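
For illustration, the 4-bases-per-byte packing can be sketched in a few
lines of Python.  This is only a sketch under stated assumptions: the
mapping of A/C/G/T onto 0..3 and the names CODE, BASES, pack and unpack are
made up for this example and do not describe any standard sequence file
format.

```python
# Hypothetical 2-bit code for each base; any fixed assignment would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"  # BASES[n] is the base whose code is n


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The caller supplies the length because padding bits look like "A".
    return "".join(bases[:length])
```

With this scheme a 7-base sequence such as "GATTACA" fits in 2 bytes
instead of 7, at the cost of needing the sequence length stored separately
and of making substring searches less convenient than on plain letters.
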
But the information content is no more than 2 bits per base pair; the rest
is redundancy, and a compression program could easily squeeze such a file
down.

It is possible to store protein sequences using one byte per amino acid,
and in that case you would be partly right.  Here again, though, the real
information content is less than 5 bits per amino acid (with 20 amino
acids, about 4.3 bits).

>"How many amino acids (words in the language of proteins) are encoded for on
>the human chromosomes?"

Since three bases code for one amino acid, the simplistic answer is N/3,
where N is the number of base pairs.  In reality things are messier: (1)
there are long stretches of DNA that seem to be doing nothing, (2) there
are various initiation and termination sequences that don't actually code
for proteins, (3) in some organisms a single stretch of DNA/RNA can code
for more than one protein (but never more than three).

>or "How many books could these words fill?"  I seem to
>remember Sagan doing something like this on Cosmos.

My recollection is that it was something like "1500 volumes of Encyclopedia
Britannica" for the human gene set, or a wall full of books.  But you
should be able to calculate that from the numbers Craig posted.  Just count
the number of pages in a book, count the number of letters on a typical
page, and divide N by both.

--
Copyright (c) 1987 Howard A. Landman.  Transmission of this material
constitutes permission from the intermediary to all recipients to freely
retransmit the material within USENET.  All other rights reserved.
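
The arithmetic in the post (bits per symbol, N/3 codons, and "divide N by
both") can be sketched as below.  The function names are invented for this
example, and the page and volume sizes in the usage note are placeholder
guesses, not figures from Craig's article or from Cosmos.

```python
import math


def bits_per_symbol(alphabet_size):
    """Bits needed to pick one symbol from equally likely alternatives."""
    return math.log2(alphabet_size)


def max_codons(n_base_pairs):
    """The simplistic N/3 upper bound on amino acids coded by N base pairs."""
    return n_base_pairs // 3


def volumes_needed(n_letters, letters_per_page, pages_per_volume):
    """Divide the letter count by letters per page, then by pages per volume."""
    return n_letters / letters_per_page / pages_per_volume
```

bits_per_symbol(4) gives 2 bits per base, and bits_per_symbol(20) gives a
little over 4.3 bits per amino acid.  As a purely hypothetical case, taking
N as 3 billion base pairs written one letter each, with 3000 letters per
page and 1000 pages per volume, volumes_needed gives 1000 volumes; plug in
your own counts for a real book.
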