Decodedscience.com

Comparing the Genetic Code of DNA to Binary Code

2015-08-25

This illustrates the structure of DNA, with strands holding the base pairs. Image by U.S. National Library of Medicine.

Does DNA store information in the same way that a computer stores data? What are the similarities or differences?

We need to discuss “What is DNA?” and “What is the genetic code?” to answer these questions – but let’s start with how computers store data.

Computer Data Storage in Binary Codes

The smallest unit of computer data storage is a “bit,” a contraction for “binary digit.”

One bit has a value of either zero or one, just as every digit in base-2 arithmetic is either zero or one.

In base-10 arithmetic, a digit has any value from zero through 9. For example, 11001 in base-2 equals 25 in base-10.
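The conversion above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `binary_to_decimal` is ours and does not come from any library.

```python
def binary_to_decimal(bits: str) -> int:
    """Read the digits left to right: double the running total, then add the next digit."""
    value = 0
    for digit in bits:
        value = value * 2 + int(digit)
    return value


# 11001 in base-2 is 16 + 8 + 0 + 0 + 1
converted = binary_to_decimal("11001")
print(converted)

# Python's built-in int() accepts a base and should agree with the helper.
print(int("11001", 2))
```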

Computers use binary arithmetic, since electrical circuits can “easily” distinguish two values (such as “on” versus “off”).

We usually discuss computer data in larger units. Different computers use many different coding schemes, or binary codes. ASCII is a 7-bit code; this article treats each 7-bit group as one “byte,” although most modern computers store each ASCII character in an 8-bit byte. A 7-bit byte can represent 2x2x2x2x2x2x2 = 2^7 = 128 different values. In ASCII, each byte represents one character, such as ‘7’, ‘a’, or ‘Z’.

Some of these characters have meanings other than letters, numbers or symbols. For example, one character specifies the start of a message; another ends the message.

Unicode is a different character coding scheme; its common UTF-8 encoding uses one to four 8-bit bytes for each character. An 8-bit byte has 2^8 = 256 values. Many computer applications require other binary codes.
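As a small illustration of these coding schemes, the Python sketch below prints the ASCII number and 7-bit pattern for the three sample characters, then counts how many 8-bit bytes UTF-8 needs for a few characters. The sample characters beyond ‘7’, ‘a’ and ‘Z’ are our own choice.

```python
# ASCII assigns each character a number from 0 to 127, which fits in 7 bits.
for ch in "7aZ":
    print(ch, ord(ch), format(ord(ch), "07b"))

ascii_values = 2 ** 7  # number of distinct 7-bit patterns
byte_values = 2 ** 8   # number of distinct 8-bit patterns
print(ascii_values, byte_values)

# UTF-8 is one common encoding of Unicode; it uses one to four 8-bit bytes per character.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["a", "é", "€"]}
print(utf8_lengths)
```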

Bytes also represent machine-level commands inside a computer chip. However, when one byte represents the command, the full instruction usually requires one or more subsequent bytes to represent the data for that command to process.

Size of Computer Files versus the Information of a Computer Program

The length of a computer program, or of a piece of data such as a document, image or video, may run to megabytes or gigabytes (millions or billions of bytes) of information.

The length of a data file may not correspond to the “information value” of that file. For example, a story may include redundant flashbacks, such as a mystery novel in which a suspect repeats previous dialogue verbatim to different lawyers and again in court.

Similarly, programmers might achieve the same result in different computer programs of very different lengths. In the field of computational mathematics, some theorems discuss the shortest program to perform a particular function.
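One rough way to see the gap between file length and information content is lossless compression. The sketch below uses Python's standard `zlib` module on a made-up, highly repetitive text. Compression is only a stand-in here for "information value"; it is our illustration, not a measure the article itself uses.

```python
import zlib

# A made-up sentence repeated many times: a long file with little new information.
repeated = b"The butler was in the library at nine. " * 50

compressed = zlib.compress(repeated)
restored = zlib.decompress(compressed)

print(len(repeated), "bytes before compression")
print(len(compressed), "bytes after compression")
print(restored == repeated)  # nothing was lost
```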

What is DNA?

DNA Spooling from a Cell Nucleus. Image by National Institute on Aging.

Just what is DNA?

DNA is a long organic molecule, mainly made from four base chemicals: adenine (A); guanine (G); cytosine (C); and thymine (T). Only A+T and C+G may form base pairs.

Other chemicals hold the base pairs in place, and serve other functions in the DNA molecule, but most of the information is stored as a sequence of base pairs.

DNA’s geometry is a double spiral, with base pairs held between two long strands. For the purpose of data storage we can think of DNA as a ladder, with the base pairs as rungs, and the strands as the legs of the ladder.

Since there are only two types of base pair, AT or CG, each rung on the ladder might act exactly as a computer bit. For some purposes, it’s enough to report that a segment of DNA contains {AT, AT, AT, AT, CG, CG}.

However, the cell reads the DNA in one direction along one “strand” where it may find any one of the four molecules: A, T, C or G. As Dr. Donald E. Riley notes, genetic sequencing tests also report the base chemicals. So the DNA segment above might be reported as ‘AATTCG’, distinguishing the AT from TA and the CG from GC.

Therefore each base pair represents one of four values as the cell reads through the DNA. That is twice as many values as one computer bit, which means each base pair carries the equivalent of two bits.
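To make the "four values per base" idea concrete, here is a minimal Python sketch that stores each base as two bits. The particular assignment (A=00, C=01, G=10, T=11) is arbitrary and purely our assumption; any one-to-one mapping would work.

```python
# Arbitrary 2-bit code for each base (an assumption for illustration only).
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> str:
    """Turn a string of bases into a string of bits, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


def decode(bits: str) -> str:
    """Turn a string of bits back into bases, reading two bits at a time."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


encoded = encode("AATTCG")
print(encoded, len(encoded), "bits")
print(decode(encoded))
```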

The genome is the complete description of the sequence of bases in one organism’s DNA.

One continuous sequence of DNA is a “chromosome”. Humans have 46 chromosomes, in the form of 23 pairs, in most cells.

The Genetic Code from DNA to RNA to Amino Acids. Copyright image by Mike DeHaan, all rights reserved.

Transcribing DNA to RNA to Amino Acids

The cell gets information from its DNA by transcribing, or copying, some base pairs from DNA to messenger RNA. The cell does not copy all the DNA; we classify the useful sections as genes.

DNA provides the information to assemble amino acids into proteins, through the sequence of base pairs in the genes.

The cell transcribes a protein-coding gene from DNA into RNA, but substitutes uracil (U) for thymine (T). Messenger RNA is a single strand, so the cell reads it one base at a time, in sequence. Each position in the RNA therefore still has four possible values: A, C, G, or U. When RNA bases do pair with another strand, the only possible pairs are AU and CG.

Again, the cell reads only one member of each pair, so there are four possible values at each position.
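The transcription step can be sketched in Python as follows. This is a simplified model under stated assumptions: strand direction is ignored, the two strands are compared position by position, and the function names and sample sequence are ours.

```python
# Pairing rules: A pairs with T (or U in RNA), and C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
DNA_TO_RNA_PARTNER = {"A": "U", "T": "A", "C": "G", "G": "C"}


def template_strand(coding: str) -> str:
    """Return the opposite DNA strand, position by position (direction ignored)."""
    return "".join(DNA_COMPLEMENT[base] for base in coding)


def transcribe_from_template(template: str) -> str:
    """Build the RNA whose bases pair with the template strand."""
    return "".join(DNA_TO_RNA_PARTNER[base] for base in template)


def transcribe(coding: str) -> str:
    """Shortcut: the RNA matches the coding strand with U in place of T."""
    return coding.replace("T", "U")


coding = "ATGGCCTAA"  # made-up sample sequence
rna = transcribe(coding)
print(template_strand(coding))
print(rna)
print(transcribe_from_template(template_strand(coding)) == rna)
```

Both routes give the same RNA, which is why the shortcut of swapping T for U is often used when describing the genetic code.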

The cell reads RNA in groups of three bases, called “codons.” So each codon has one of 4x4x4 = 4^3 = 64 different values. A codon therefore carries exactly as much information as a 6-bit byte, since 2^6 = 64. But there is a catch.
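The count of 64 can be confirmed by listing every possible codon. The Python sketch below does that with the standard library; the variable names are ours.

```python
from itertools import product
from math import log2

# Every ordered triple drawn from the four RNA bases.
codons = ["".join(triple) for triple in product("ACGU", repeat=3)]

codon_count = len(codons)
bits_per_codon = log2(codon_count)

print(codon_count)       # 4 * 4 * 4
print(bits_per_codon)    # bits needed to label that many values
print(codons[:4], "...", codons[-1])
```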

RNA only encodes 20 different amino acids, plus a “stop” signal, and a “start here with one specific amino acid, methionine” signal. Methionine is one of the 20 amino acids.

So each codon of RNA only leads to 21 possible outcomes, rather than 64.

This leads to the question: Should we say the genetic information is the content of the codons, or the resulting amino acid? (Actually, some genes have control functions, and there may be useful information in sections of DNA that we do not classify as genes. Later, we will check both possibilities.)

Translating RNA into Amino Acids and Proteins

The cell builds a protein by placing the amino acid methionine for the first codon, then attaching the amino acid for the second codon, and so on, until it reaches a “stop” codon. The resulting chain of amino acids forms a protein.
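The translation process just described can be sketched in Python. The table below is the standard genetic code written in the usual compact form, with one-letter amino acid codes and `*` for “stop”. The function name `translate` and the sample sequences are ours, and the sketch ignores everything a real cell does beyond reading codons in order.

```python
from itertools import product

# Standard genetic code: codons in U, C, A, G order, first base varying slowest.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Start at the first AUG, read three bases at a time, and halt at a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


distinct_outcomes = len(set(AMINO_ACIDS))  # 20 amino acids plus "stop"
print(distinct_outcomes)
print(translate("AUGGCCUAA"))
print(translate("GGAUGGCCUGGUAAGG"))
```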

This is the Genetic Code Table in its usual format. Image by National Center for Biotechnology Information.

Junk DNA and Regulatory Mechanisms

A large percentage of any organism’s DNA never encodes proteins, because the codons follow a “stop” codon and do not have the next “start” codon. At one time, scientists called these sequences of codons “junk DNA,” since they seemed to be useless.

However, researchers now know that some of these codon sequences regulate how other genes are expressed or repressed. Clearly a regulator gene carries information; it is not “junk”.

Also, the cell may suppress a gene by attaching other molecules to the gene; this keeps the cell from transcribing that gene into RNA.

One such suppression process is methylation, which adds a methyl group to the gene. A methyl group is one carbon atom bonded to three hydrogen atoms. Certainly there is an additional binary “bit” of information at each point where an additional molecule might suppress that gene.

Note that gene suppression plays an important role, especially in multicellular organisms. A pluripotent stem cell’s daughter must suppress some genes, and express others, in order to become a specialized cell.

Specific genes are expressed or repressed from time to time during the life of a cell. Some genetic suppression may last a lifetime and may be inherited if it is incorporated in the reproductive cells (egg or sperm cells). This long-term pattern of suppression is called the epigenome, a layer of information “above” the genome of DNA.

Should we consider the codons found inside a non-coding, never-expressed sequence of DNA as information? Yes, if we want to describe the whole genome. Perhaps not, if we only want to describe the full set of outputs.

Finally, it is possible that the length and placement of each non-coding section of DNA are vitally important. In a cell, DNA folds onto itself somewhat like a ball of rubber bands. The cell can only read what it finds on the outside of that folded bundle, and that “outside” depends on the length and twists of the DNA inside the bundle.

How Much Information Does DNA Encode?

The simplest answer to “How much information does DNA encode?” is “enough data to completely specify an organism’s particular genome and epigenome.” That involves the number of base pairs and the number of possible sites for adding a suppressor. Human DNA has approximately 3 billion base pairs, according to the National Human Genome Research Institute. That means 4^3,000,000,000 possible base sequences.

For simplicity, let’s say that each gene is either suppressed, or not, in the epigenome. That would be a binary choice for each gene. Most humans have between 20,000 and 25,000 genes. Let’s take the average as about 22,500 genes, which adds about 2^22,500 more choices.

The length of DNA varies for different species. Humans, with about 3 billion base pairs, have neither the largest nor smallest genome.

Normally we specify the “amount of information” in bits; so 2^n choices requires n bits. Note that 4^j = (2*2)^j = 2^(2*j).

Therefore the human genome encodes 4^(3 billion) = 2^(6 billion) choices, or 6 billion bits of information. The epigenome encodes at least 2^22,500 choices, or 22,500 bits. The total information is 6,000,022,500 bits, or approximately 6 Gb (gigabits).
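The arithmetic behind that total is short enough to check in Python. The figures are the article's own estimates (3 billion base pairs, 22,500 genes, one on/off bit per gene); the variable names are ours.

```python
BASE_PAIRS = 3_000_000_000   # approximate size of the human genome
GENES = 22_500               # the article's assumed average gene count
BITS_PER_BASE_PAIR = 2       # four values per base pair, and 4 = 2^2

genome_bits = BASE_PAIRS * BITS_PER_BASE_PAIR
epigenome_bits = GENES       # one express/suppress bit per gene
total_bits = genome_bits + epigenome_bits

print(genome_bits)
print(total_bits)

# Spot-check the identity 4^j = 2^(2*j) for small j.
identity_holds = all(4 ** j == 2 ** (2 * j) for j in range(20))
print(identity_holds)
```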

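We usually discuss computer storage in bytes rather than bits. Using the 7-bit ASCII bytes described earlier, 6 Gb would amount to 6/7 = 0.857 GB (gigabytes), or 857 MB (megabytes). With the more common 8-bit byte, the same 6 Gb would be 6/8 = 0.75 GB, or 750 MB.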
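The bits-to-bytes step depends on how many bits we count as one byte. This Python sketch runs the conversion for both the 7-bit grouping used in this article and the 8-bit byte found on most computers. It assumes a megabyte is one million bytes; the function name is ours.

```python
TOTAL_BITS = 6_000_022_500  # genome plus epigenome, from the estimate above


def bits_to_megabytes(bits: int, bits_per_byte: int) -> float:
    """Convert a bit count to megabytes, taking 1 MB as 1,000,000 bytes."""
    return bits / bits_per_byte / 1_000_000


mb_7bit = bits_to_megabytes(TOTAL_BITS, 7)
mb_8bit = bits_to_megabytes(TOTAL_BITS, 8)

print(round(mb_7bit), "MB with 7-bit bytes")
print(round(mb_8bit), "MB with 8-bit bytes")
```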

How Much Information do the Amino Acids Encode?

One might suggest that the genetic information is equally carried by the amino acids produced by the codons. (This still assumes that “junk” DNA also carries exactly that information). There are 21 possible results from each codon. The one “start” codon encodes one amino acid; 60 different codons encode another 19 amino acids; and three codons encode “stop”. The 3 billion base pairs would be grouped into 1 billion codons, and each codon has 21 possible meanings. So that would be 21^(1 billion) sequences of amino acids.

We need to convert 21^(1 billion) to a power of two, since all the other information results are in bits. The conversion factor is ln(21)/ln(2), where “ln” is the natural logarithm function. We have ln(21)/ln(2) = 3.0445/0.6931 = 4.3923 (rounded), according to my calculator. (1 billion) * 4.3923 = 4,392,300,000 bits of information to code amino acids.

So that is a total of 4,392,322,500 bits of information, including the epigenome. In 7-bit ASCII bytes, that would be approximately 627,474,643 bytes, or about 627 MB (megabytes).
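The amino acid estimate can be reproduced in Python using the base-2 logarithm directly. The inputs are the article's own figures (1 billion codons, 21 outcomes per codon, 22,500 epigenome bits, 7-bit bytes); the variable names are ours. Because the code does not round the conversion factor to 4.3923, the bit count differs slightly from the hand calculation.

```python
from math import log2

CODONS = 1_000_000_000    # 3 billion base pairs grouped in threes
OUTCOMES_PER_CODON = 21   # 20 amino acids plus "stop"
EPIGENOME_BITS = 22_500   # one express/suppress bit per gene

bits_per_codon = log2(OUTCOMES_PER_CODON)  # same as ln(21)/ln(2)
amino_acid_bits = CODONS * bits_per_codon
total_bits = amino_acid_bits + EPIGENOME_BITS

# 7-bit bytes, with 1 MB taken as 1,000,000 bytes.
total_megabytes = total_bits / 7 / 1_000_000

print(round(bits_per_codon, 4))
print(round(total_bits))
print(round(total_megabytes, 1), "MB")
```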

Comparing the Genetic Code to Computer Data Storage

Let’s conclude by comparing computer data storage to the genetic code for DNA.

Computers store data in two-valued bits, grouped as bytes of 7 bits (for ASCII) or more. One 7-bit byte holds 2^7=128 unique values.

DNA stores data in four-valued base pairs, which RNA then groups as codons of 3 bases. One codon holds 4^3=2^6=64 unique values.

A sequence of base pairs that convey biological information is called a gene. DNA includes extra information to express or suppress specific genes. Each gene has at least one bit of information for expression or suppression.

Computer files may be measured in megabytes or gigabytes: millions or billions of bytes. One CD-ROM disc may store about 710 MB. Modern solid-state memory and disk drives can store gigabytes.

If we can fully describe one human’s DNA by specifying the full sequence of base pairs, plus a binary flag to express or suppress each gene, then human DNA contains about 6 Gb of information, or 857 MB in 7-bit bytes.

The post Comparing the Genetic Code of DNA to Binary Code appeared first on Decoded Science.