
Byte 1: Biology and Sequencing

Biology Basics

Human function depends upon proteins: our skin, muscles, eyes, and hair, along with virtually every organ, are built largely from proteins. A protein is essentially a chain of amino acids, of which there are 20 standard types (Table 1). Interactions between amino acids at different points in the chain allow it to fold into a specific shape, which permits it to carry out certain functions in the body. Therefore, a protein’s function is dictated by its structure, and its structure is dictated by its amino acid sequence. But what determines the amino acid sequence?

ANALOGY: Say you just got the hottest new chair from IKEA. If there were an error in the instruction manual, or you misread it, you would build your chair wrong! Now, some mistakes may not be the worst: you might still be able to sit in it (IMAGE). This is loosely analogous to a missense mutation: one amino acid in the protein sequence is swapped for another, and the protein often (though not always) keeps some or all of its function. Other mistakes can make the chair no better than a piece of expensive kindling (IMAGE). This is analogous to a nonsense mutation: a premature stop signal cuts the protein short, which usually makes it non-functional.



To understand where proteins come from, we must look to the central dogma, first proposed by Francis Crick in 1958. The central dogma explains how the genetic information in DNA is used to inform protein production. DNA (deoxyribonucleic acid), a vital macromolecule found in the nuclei of our cells, is composed of 2 strands twisted into a double helix. The two strands are complementary sequences of 4 nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). Within the template strand, these 4 nucleotides (aka bases) can appear in any order (e.g. GATCCTCCAT). The complementary strand is so called because its sequence consists of the bases that are complementary to the template sequence. Complementary base pairs are A with T and G with C, so every A in the template strand is paired with a T in the complementary strand, and every G with a C (and vice versa).
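The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the names `COMPLEMENT`, `complement`, and `reverse_complement` are made up for this example. The reverse complement is included because the two strands of real DNA run in opposite directions, so the partner strand is conventionally written reversed.

```python
# Base-pairing rule from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complement written in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    template = "GATCCTCCAT"  # example sequence from the text
    print(complement(template))          # CTAGGAGGTA
    print(reverse_complement(template))  # ATGGAGGATC
```

Taking the complement twice returns the original sequence, which is a quick sanity check that the pairing table is symmetric.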

There are 2 key issues with the flow of genetic information from DNA to proteins: (1) proteins are not made in the nucleus, and (2) DNA cannot leave the nucleus because, if it did, it would be rapidly degraded. So how is the genetic information in DNA sequences transmitted to inform protein synthesis? This is where the central dogma comes into play! The DNA sequence is copied into an RNA transcript, which can leave the nucleus. RNA (ribonucleic acid) is a single-stranded molecule built from ribonucleotides rather than the deoxyribonucleotides of DNA. RNA uses the same bases as DNA (A, C, G), except that uracil (U) takes the place of thymine (T). Ribosomes bind to the RNA transcript in the cytoplasm outside the nucleus and build the protein chain according to the RNA sequence. Three consecutive nucleotides in RNA constitute a codon, and each codon codes for a specific amino acid (a few codons instead signal the ribosome to stop). In short, transcription is the copying of genetic information from DNA into RNA, and translation is the synthesis of a protein using the RNA transcript as a guide.
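Transcription and translation can be sketched as two small Python functions. This is a simplified illustration under stated assumptions: the input DNA is taken to be the coding strand, so the transcript is the same sequence with U in place of T; translation starts at the first base rather than searching for a start codon; and the function names are invented for this example. The codon table is the standard genetic code, written with one-letter amino acid codes and `*` for stop codons.

```python
from itertools import product

# Standard genetic code: codons enumerated in U, C, A, G order,
# one-letter amino acid codes, "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Copy a DNA coding strand into RNA (assumption: T is simply replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the RNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the ribosome releases the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna)             # AUGGCCUAA
    print(translate(mrna))  # MA  (Met, Ala, then stop)
```

Changing a single base in the input shows how sensitive the protein is to the underlying sequence: some changes swap one amino acid for another, while others create an early stop codon and shorten the chain.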

ANALOGY: Think of IKEA (the store) as the cell. The original patent in the head offices of IKEA (the nucleus) would be like the DNA. IKEA can’t just give away its one and only patent to everyone who buys a chair! So how will customers know how to make their furniture? Simple! The patent can be copied into instruction manuals (RNA transcripts) and distributed to customers (ribosomes). Customers can read the instruction manual and make their chairs (proteins), while the patent remains safe and sound in the head offices.



 

Sequencing

As you can probably tell, bioinformatics involves a lot of sequences: DNA sequences, RNA sequences, and protein sequences. Molecular sequences are biology’s first fundamental dataset. In the 1960s, before we had powerful computational tools, sequences were assembled and compared manually. Early analysts had to write sequences on pieces of paper and shift them around like puzzle pieces to complete sequence analysis. With the advent of computers, however, these analysts could turn their manual procedures into programs; they became the world’s first computational biologists. Sequence analysis using computers served as the genesis for the field of bioinformatics.

DNA sequencing is the determination of the nucleotide sequence of a piece of DNA. Frederick Sanger and his colleagues developed the technique of Sanger sequencing in 1977. In this technique, the target DNA is copied many times, resulting in fragments of different lengths. Each fragment ends on a particular nucleotide in the sequence; for example, on the 6th or 8th nucleotide. Labeled “chain terminator” nucleotides (fluorescent in modern versions of the method) allow analysts to identify the last nucleotide of every fragment; ordering the fragments by length then reveals the sequence one base at a time. Next-generation sequencing (NGS) technologies are more recent developments in this field, and they allow for large-scale sequence analysis. NGS decreases the cost and increases the speed of DNA sequencing.
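The read-out logic of Sanger sequencing can be sketched as a toy simulation in Python. This is a deliberately idealized illustration, not a model of the chemistry: it assumes exactly one fragment ends at every position, that each fragment's length and final base are measured without error, and the function names are invented for this example.

```python
import random


def simulate_fragments(sequence: str) -> list[tuple[int, str]]:
    """Idealized reaction: one fragment per position, recorded as (length, last base)."""
    return [(i + 1, base) for i, base in enumerate(sequence)]


def read_sequence(fragments: list[tuple[int, str]]) -> str:
    """Order fragments from shortest to longest and read off their final bases."""
    return "".join(base for _length, base in sorted(fragments))


if __name__ == "__main__":
    fragments = simulate_fragments("GATCCTCCAT")
    random.shuffle(fragments)        # fragments come out of the reaction unordered
    print(read_sequence(fragments))  # GATCCTCCAT
```

Sorting by length plays the role of the size-separation step in the laboratory, and the final base of each fragment stands in for the signal from its chain terminator.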


