Plenty of Room in the DNA: Scientists demonstrate for the first time that DNA is a suitable material for the effective storage of big data

By Annalisa Arcella, PhD














It has been estimated that by the year 2020 about 1.7 megabytes of new information will be created every second for every person. This is faster growth than ever before, and the volume of information may soon outstrip the capacity of current hard drives to store it. We are at the beginning of a digital revolution in which the storage of big data is becoming a concrete problem that needs attention.

To address this problem, researchers at the New York Genome Center devised a strategy based on DNA encoding, called DNA Fountain, which was recently published in the journal Science.

DNA-based nano-engineering

In 1959, the Nobel Prize-winning physicist Richard Feynman argued in a legendary speech to the American Physical Society in Pasadena that the new frontiers of science lie “at the bottom,” in the atomic world, where there is plenty of room, and challenged scientists to write the Encyclopedia Britannica on the head of a pin.

Since the eighties, DNA has been co-opted as a versatile engineering material, with the big advantages of being cheap, biodegradable, flexible, and compactable down to the nanoscale. One nanometer is one billionth of a meter. A sheet of paper is about 100,000 nanometers thick, and one strand of human DNA is only 2.5 nanometers wide.

Scientists use DNA fragments to engineer circuits, program instructions, and assemble nanodevices millions of times smaller than a grain of sand. These nanodevices could create a new generation of organic chips that multiply the power of computers while shrinking their size, and that perform many calculations simultaneously. They will be made of biological materials, and their information will be carried not by binary code but by the code of life: the genetic letters A, T, G, and C.

How much room is in the DNA?

In computing and telecommunication, letters and numbers are translated into binary code, that is, into strings built from two digits: 1 and 0. Using the same principle, DNA-based computing combines four letters, A, T, G, and C, corresponding to adenine, thymine, guanine, and cytosine, the four nucleotides that are the building blocks of DNA. By adopting DNA-based computing, scientists are proposing new methods for transferring data as well as engineering biological logic gates that work like electronic circuits.
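To make the analogy concrete, here is a minimal sketch of how binary data could be rewritten in the four-letter alphabet, two bits per letter. The particular pairing of bit pairs with letters (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration, not a scheme described in this article.

```python
# Illustrative mapping only: which bit pair goes with which letter is an assumption.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Rewrite bytes as a string of A, C, G, T (four letters per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(dna: str) -> bytes:
    """Reverse the mapping: four letters back into one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
encoded = bytes_to_dna(message)   # 3 bytes become 12 letters
decoded = dna_to_bytes(encoded)   # and back to the original bytes
print(encoded, decoded)
```

Because each letter stands for exactly two bits, the round trip is lossless: decoding the letters returns the original bytes.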

So how much information can DNA carry? In information science, a single binary value of 0 or 1 is called a “bit” (binary digit) and is the smallest unit of data. A byte is commonly 8 bits and is the usual unit of measure for digital information. Claude Shannon, the father of information theory, defined the quantity of information produced by a source, for example the quantity in a message, by a formula. In its most basic terms, the average information per symbol, known as Shannon’s entropy, is the minimum average number of binary digits required to encode each symbol of a message.
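Shannon's formula can be tried out directly. The sketch below computes the entropy H = -Σ p·log2(p) from the letter frequencies of a message; the function name and the sample strings are illustrative assumptions. It suggests why a four-letter alphabet tops out at 2 bits per letter: that value is reached only when all four letters are equally likely.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits, from observed symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


four_letters = shannon_entropy("ACGT" * 10)   # four equally likely letters
two_symbols = shannon_entropy("01" * 10)      # two equally likely symbols
print(four_letters, two_symbols)
```

A message that repeats a single letter carries no information per symbol, while evenly mixed binary digits carry one bit each and evenly mixed DNA letters carry two.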

According to this theory, four letters could carry at most 2 bits per nucleotide, and once practical limits on which sequences can be reliably written and read are taken into account, DNA can potentially store about 1.8 bits of data per nucleotide. To put it in perspective, the DNA wrapped in the nucleus of a human cell contains 3.2 billion nucleotides. This means that each individual cell of ours can potentially carry 5.76 billion bits of information, equivalent to about 720 megabytes, or roughly the space needed to store a movie on a hard drive or PC.
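The arithmetic behind this estimate can be checked in a few lines. The sketch assumes decimal units (1 megabyte = 1,000,000 bytes) and uses the rounded figures quoted above; the variable names are illustrative.

```python
NUCLEOTIDES_PER_CELL = 3.2e9   # nucleotides in the DNA of one human cell (figure from the text)
BITS_PER_NUCLEOTIDE = 1.8      # storage limit per nucleotide quoted in the text
BITS_PER_BYTE = 8
BYTES_PER_MEGABYTE = 1e6       # assumption: decimal megabytes

total_bits = NUCLEOTIDES_PER_CELL * BITS_PER_NUCLEOTIDE
total_megabytes = total_bits / BITS_PER_BYTE / BYTES_PER_MEGABYTE

print(f"{total_bits:.3g} bits is about {total_megabytes:.0f} megabytes")
```

With these inputs the total is 5.76 billion bits, which works out to about 720 megabytes.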


Analog – Biological Digital – converters

Yaniv Erlich and Dina Zielinski from the New York Genome Center demonstrated for the first time that the algorithm they devised, DNA Fountain, was able to approach this limit of about 1.8 bits per nucleotide.

As a test, they converted six files into binary code, including a full computer operating system, a computer virus, a French movie from 1895, and a 1948 study by Claude Shannon. The researchers then compressed the files into one master copy, which was split into short strings of the ones and zeros of binary code.

Erlich and Zielinski used the DNA Fountain algorithm to randomly choose strings of code from the master copy and package them into “droplets,” to which they added extra binary tags to help reassemble the six original files at the end of the study. In each droplet, the ones and zeros were mapped to sequences of A, G, T, and C. In total, a digital list of 72,000 DNA strands, each 200 letters long, was generated.
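The droplet idea can be sketched in a simplified form. In fountain codes, the family of methods that gives DNA Fountain its name, each droplet combines a random selection of segments with XOR and carries a tag that lets the decoder work out which segments were chosen. Everything specific below is an assumption made for illustration and not the authors' implementation: the 8-byte segments, the uniform choice of how many segments to combine, the 4-byte seed tag, and all function names.

```python
import random

BASES = "ACGT"  # illustrative mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def split_segments(data: bytes, size: int) -> list[bytes]:
    """Cut the master copy into equal-sized segments, zero-padding the last one."""
    padded = data + bytes(-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def choose_segments(seed: int, n_segments: int) -> list[int]:
    """Pick segment indices from a seed, so a decoder can repeat the same choice."""
    rng = random.Random(seed)
    degree = rng.randint(1, n_segments)  # simplification: uniform degree
    return rng.sample(range(n_segments), degree)


def make_droplet(segments: list[bytes], seed: int) -> bytes:
    """XOR the chosen segments together and prepend the seed as a tag."""
    payload = bytes(len(segments[0]))
    for index in choose_segments(seed, len(segments)):
        payload = bytes(a ^ b for a, b in zip(payload, segments[index]))
    return seed.to_bytes(4, "big") + payload


def to_dna(data: bytes) -> str:
    """Map every pair of bits to one DNA letter."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))


segments = split_segments(b"Plenty of room at the bottom", 8)
strands = [to_dna(make_droplet(segments, seed)) for seed in range(6)]
for strand in strands:
    print(strand)
```

Because the tag alone is enough to regenerate the random choice, a decoder that collects enough droplets can untangle the XOR combinations and rebuild every segment, even if some droplets are lost along the way.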

Since not all ATGC combinations can be reliably synthesized and read back (long runs of the same letter and strands with very high or very low GC content are error-prone), the team screened the DNA strands so that only valid sequences were sent to a California-based startup to be synthesized as a pool of DNA molecules. After a few weeks, Erlich and Zielinski received a speck of DNA encoding the six files. When they decoded the speck using DNA sequencing technology and converted the DNA sequences back to binary code, their six original files were recovered without any error.
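One way to picture the screening step is as a filter that rejects strands with long runs of a single letter or with an unbalanced share of G and C, two properties commonly associated with errors when DNA is written and read. The thresholds and function names below are assumptions chosen for illustration; the article does not state the criteria the team used.

```python
import re

MAX_RUN = 3                     # assumed limit on identical letters in a row
GC_MIN, GC_MAX = 0.45, 0.55     # assumed acceptable share of G and C


def longest_run(strand: str) -> int:
    """Length of the longest stretch of one repeated letter."""
    return max((len(m.group(0)) for m in re.finditer(r"A+|C+|G+|T+", strand)), default=0)


def gc_content(strand: str) -> float:
    """Fraction of the strand made up of G and C."""
    return (strand.count("G") + strand.count("C")) / len(strand)


def is_valid(strand: str) -> bool:
    """Keep a strand only if it passes both screens."""
    if not strand:
        return False
    return longest_run(strand) <= MAX_RUN and GC_MIN <= gc_content(strand) <= GC_MAX


candidates = ["ACGTACGTAC", "AAAACGCGCG", "ATATATATAC"]
accepted = [s for s in candidates if is_valid(s)]
print(accepted)
```

In a droplet-based scheme, a rejected strand costs little: the encoder can simply generate another droplet and test it in turn.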

In this work, about 1.6 bits of data were encoded per nucleotide, a milestone! For the first time, more than 85% of the theoretical limit of 1.8 bits per nucleotide was achieved. This accomplishment is a concrete demonstration that the storage of large-scale data in DNA is both feasible and effective.

However, this approach is still expensive (two megabytes of data cost a whopping $9000), and the process of DNA encoding and decoding is relatively slow. Therefore, this technology is expected to be most suitable for archival storage rather than for fast information processing and over-the-air transfer. Nevertheless, the future of DNA as data storage is bright and clear, and I cannot wait to have all my favourite books, music, and movies at my fingertips.


About the Author: Annalisa Arcella has a PhD in Computational Biophysics, is a passionate biomedical researcher and scientific communicator across Europe and the US, and left academia to pursue a career as a Data Scientist for Digital Marketing. Find Annalisa on LinkedIn.