DNA - THE FUTURE OF DATA STORAGE SOLUTIONS

MADHAN RAGHU

Lead Software Engineer, Central Functions, SG GSC

Digital data has transformed how information is used and accessed. A large amount of data is produced every day and requires high-density storage devices that can retain it for long periods of time. Researchers have demonstrated that DNA is a scalable, random-access, and error-free data storage system. Deoxyribonucleic acid (DNA) offers a rich, sustainable storage medium that is stable for thousands of years, which makes it useful for long-term data storage.

DEMAND FOR DATA STORAGE AND THE EMERGING NEED FOR A NEW ARCHIVE STORAGE IN THE MARKET

The world is projected to generate 180 zettabytes of data by 2025. Traditional storage media such as flash drives and hard drives do not have the data density or cost efficiency to meet this global demand.

Source: www.statista.com

Data generated globally (new and replicated copies) is expected to grow at a 23% compound annual growth rate (CAGR) from 2020 to 2025.
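As a rough check on these figures, the compound-growth relationship can be worked through in a few lines of Python. This is only a sketch: it assumes the 180-zettabyte figure refers to 2025 and that the 23% rate applies uniformly to each of the five years from 2020. The function names are illustrative and do not come from the Statista source.

```python
def project_forward(start_value: float, cagr: float, years: int) -> float:
    """Value after `years` of growth at a constant compound annual rate."""
    return start_value * (1.0 + cagr) ** years


def project_backward(final_value: float, cagr: float, years: int) -> float:
    """Starting value implied by a final value and a constant compound annual rate."""
    return final_value / (1.0 + cagr) ** years


# Assumption: 180 ZB is the 2025 figure and 23% holds for all five years.
implied_2020_zb = project_backward(180.0, 0.23, 5)
print(f"Implied 2020 volume: {implied_2020_zb:.1f} ZB")
```

Under these assumptions the implied 2020 volume is roughly 64 zettabytes, so the projection amounts to the annual data volume nearly tripling in five years.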

Many cloud-based storage architectures keep multiple copies of data to increase the likelihood that it will be recovered intact and to avoid waiting while storage devices perform error recovery. It is often quicker to simply retrieve the data from another device or database that contains a copy.

Today’s storage media (magnetic, semiconductor, etc.) can retain data for decades if properly cared for. However, like all physical assets, they wear out and degrade over time. Therefore, their status needs to be checked and monitored regularly to ensure that the stored data remains intact.

Despite ongoing improvements in media scaling, key challenges remain for today’s storage technologies when considered for zettabyte scale and long storage duration.

HOW TO USE DNA AS DATA STORAGE

DNA data storage is the process of encoding binary data into synthetic strands of DNA. To store a binary digital file in DNA, the bits (binary digits) are converted from 1s and 0s into the letters A, C, G, and T. These letters represent the four nucleotides that make up DNA: adenine, cytosine, guanine, and thymine.

The physical storage medium is a synthesized chain of DNA containing the As, Cs, Gs, and Ts in a sequence corresponding to the order of the bits in the digital file. To recover the data, the chain of DNA is sequenced and the order of As, Cs, Gs, and Ts is decoded back into the original digital sequence.
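The conversion described above can be illustrated with a minimal Python sketch. It assumes the simplest possible mapping, two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). This particular table and the function names are illustrative assumptions. Practical encoding schemes are more elaborate, because they also have to respect the constraints of synthesis and sequencing.

```python
# Assumed mapping for illustration only: two bits per base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Convert a string of A/C/G/T back to the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

With this mapping each byte becomes four bases, so the two-byte input above becomes an eight-base strand.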

DURABILITY

DNA is the molecule of choice for information storage in biological systems. It can remain intact for thousands of years at room temperature in a dry atmosphere. DNA is an extremely stable molecule with a half-life of over 500 years. If stored in cold conditions, DNA is capable of remaining intact for hundreds of thousands of years.

MAINTENANCE

Today’s storage media must undergo periodic fixity checks to ensure that their data remains readable. Due to the durable nature and other properties of DNA, we expect its maintenance at rest to be much simpler than that of legacy storage solutions.

ENERGY EFFICIENCY AND SUSTAINABILITY

Compared to today’s datacenters and storage technologies, data stored in DNA consumes minimal to no resources while at rest. While current datacenters use a significant amount of power and land, these requirements will be negligible with DNA data storage. Finally, due to its durability and density, disposal of DNA should have much less environmental impact than the disposal of obsolete tape drives or HDDs.

DNA DATA STORAGE PIPELINE

To store data in DNA, the original digital data is encoded (mapped from 1s and 0s to sequences of DNA bases), then synthesized (written), and stored. When the stored data is needed again, the DNA molecules are sequenced (read) and decoded (re-mapped from DNA bases back to 1s and 0s).

CODING

The basic concept of coding for DNA data storage is the conversion of the ones and zeros of the original digital data into sequences of bases (ACGT) that comprise DNA molecules. Coding methods are closely coupled to the synthesis and sequencing methods used.

SYNTHESIS (WRITING)

Synthesis is where DNA manufacturing occurs. Through a series of chemical steps, the DNA molecules specified by the encoding step are assembled in ways that mirror the “bits-to-bases” or other encoding method used.

STORAGE

After synthesis, the DNA is encapsulated for long-term preservation and deposited in a library where pools of DNA are stored. There are several types of encapsulation, including sealing DNA in capsules or mixing it with chemicals that help preserve it.

RETRIEVAL

Once the stored data is needed, the encoded DNA is retrieved from its library and prepared for sequencing. Often, this process also includes making copies of the molecules, both for sequencing methods that are molecule intensive and for cases where additional copies serve distribution or further storage needs.

SEQUENCING

Sequencing is the process of determining the identity and order of DNA bases (ACGT) in a DNA segment. Various sequencing methods are used today (e.g., sequencing by synthesis (SBS), nanopore sequencing). These use various methods (e.g., optical, pH-based, electrical) to detect the actual bases in the DNA strands being read.

DECODING

Decoding involves mapping the bases in a string of sequenced DNA back to digital data. Importantly, it involves performing error correction to recover from any errors that may have occurred during synthesis, preservation, and sequencing.
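One simple way to picture the error-correction step is a majority vote across several reads of the same strand. The Python sketch below is an illustration under stated assumptions, not the method of any particular system: it assumes all reads have the same length and differ only by substitutions, and the function name and sample reads are hypothetical. Insertions and deletions would require aligning the reads first, and practical schemes generally rely on dedicated error-correcting codes rather than plain voting.

```python
from collections import Counter
from typing import Sequence


def consensus(reads: Sequence[str]) -> str:
    """Return the per-position majority base across equal-length reads."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch only handles reads of equal length")
    # zip(*reads) yields one column of bases per strand position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Hypothetical reads of one strand; two of them contain a substitution error.
reads = ["CAGACGGC", "CAGTCGGC", "CAGACGGC", "CTGACGGC", "CAGACGGC"]
print(consensus(reads))  # CAGACGGC
```

The recovered strand can then be mapped back to bits in the usual way.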

CHALLENGES IN DNA DATA STORAGE

Given its unique characteristics compared with traditional media, DNA is a promising medium for digital data storage. However, there is still a long way to go before it can be applied commercially. The challenges span several areas, including high cost, low throughput, limited access to stored data, the short length of synthesized strands, and the error rate in synthesis and sequencing.

The use of DNA in data storage is much more expensive than traditional media such as tape, disk, and HDD (hard disk drive). Currently, encoding and decoding data costs almost $15,000 per megabyte (MB).
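To put that figure in perspective, the short Python calculation below scales it to larger volumes. It is a back-of-the-envelope sketch that assumes the cost grows linearly with volume and uses the decimal units from the glossary (1 GB = 1,000 MB, 1 TB = 1,000 GB). The constant and function names are illustrative.

```python
COST_PER_MB_USD = 15_000  # approximate figure quoted in the text
MB_PER_GB = 1_000         # decimal units, as in the glossary
MB_PER_TB = 1_000_000


def cost_usd(megabytes: int) -> int:
    """Cost in US dollars, assuming it scales linearly with data volume."""
    return megabytes * COST_PER_MB_USD


print(f"1 GB: ${cost_usd(MB_PER_GB):,}")  # 1 GB: $15,000,000
print(f"1 TB: ${cost_usd(MB_PER_TB):,}")  # 1 TB: $15,000,000,000
```

At this price a single terabyte would cost on the order of 15 billion dollars, which is why reductions in synthesis cost matter so much for this field.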

With current approaches, only bulk access is available for DNA data storage. The entire DNA pool must be sorted, sequenced, and decoded even when only a single byte needs to be read. In addition, DNA synthesis and sequencing are not yet perfect. Insertion, deletion, substitution, and other errors can occur during both steps, at a rate of about 1% per nucleotide.
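The effect of a per-nucleotide error rate of about 1% can be estimated with a short Python calculation. This sketch assumes errors occur independently at each position, which is a simplification, and the strand lengths are illustrative values rather than figures from the text.

```python
def error_free_probability(strand_length: int, per_base_error: float = 0.01) -> float:
    """Probability that a strand contains no errors, assuming independent errors."""
    return (1.0 - per_base_error) ** strand_length


def expected_errors(strand_length: int, per_base_error: float = 0.01) -> float:
    """Expected number of erroneous bases in a strand."""
    return strand_length * per_base_error


for length in (50, 150, 300):  # illustrative strand lengths
    p = error_free_probability(length)
    print(f"{length} bases: {p:.1%} error-free, {expected_errors(length):.1f} errors expected")
```

Under these assumptions only about 22% of 150-base strands would be read back without a single error, which is why decoding has to include error correction.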

MARKET FINDINGS

Research and development efforts in this field are under way in academia, private industry, and the public sector, with several cross-sector partnerships.

The science behind storing data in DNA has been proven.

Researchers have demonstrated that DNA is a scalable, random-access, and error-free data storage system. Advancements in next-generation sequencing have enabled rapid and error-free readout of data stored in DNA. As the data storage crisis worsens in the coming years, DNA will be used to store vast amounts of data in a highly dense medium.

The first commercial DNA storage company called Catalog is poised to take orders in 2019.

Catalog is building a proprietary DNA data storage machine in partnership with Cambridge Consultants that will synthesize 1 TB of data per day at a cost of a few thousand dollars. This will revolutionize our approach to archival data storage and pave the way for further advancements in the field.

https://www.catalogdna.com/

Technology that utilizes engineered biological enzymes to synthesize DNA fragments will radically decrease costs and propel the field forward.

Biology-inspired engineering approaches to synthesizing DNA will be the catalyst that drives down the cost of DNA synthesis. Several independent groups are developing DNA synthesis technologies that utilize enzymes to construct novel DNA sequences. This second-generation synthesis technology already has commercial developers and is also actively being developed in academia. This technology is the first breakthrough in DNA synthesis in decades and should lead to significant reductions in costs and facilitate development of new technologies that support storing data in DNA.

VISION

Moving forward, the potential for DNA-based storage is nearly limitless.

One example is a piece of 3D-printed plastic embedded with strands of DNA that contain the object files for the plastic object being printed. As the plastic passes through the printer, it can release the DNA to recreate the file in a circular process.

DNA-based data storage could also be a way to make forensic discoveries about inanimate objects that don’t have their own genetic material. Suppose an airplane is coated with a material that contains DNA holding the full instructions for building that portion of the plane. If something goes awry, scientists can analyze the stored DNA and retrieve the information.

Catalog is developing a different coding system that relies on pre-synthesized blocks of DNA that can be stitched together to encode data. This approach can radically reduce the need to synthesize DNA fragments continuously from scratch and allow for faster and cheaper encoding of information. Catalog says that it aims to encode 1 TB of data per day in DNA. To achieve this goal, it has partnered with the UK firm Cambridge Consultants to build the world’s first DNA storage device, which will be approximately the size of a school bus.

CONCLUSION

Information is being digitized on a massive scale, by servers in datacenters, by mobile devices, and by networks of sensors everywhere around us. Artificial intelligence techniques and ubiquitous processing power are making it possible to mine this massive ocean of data; however, integral to harnessing this data as knowledge is the ability to store it for long periods of time.

DNA, nature’s data storage medium, enters this picture at a time when synthesis and sequencing technologies for advanced medical and scientific applications are enabling the manipulation of synthetic DNA in ways previously unimagined. There are credible predictions that, by 2030, DNA synthesis could reach a cost of $1/terabyte and that DNA sequencing may reach similar levels. The scale of DNA data storage is unprecedented. The durability of DNA and the uniformity of its molecular structure are ideally suited to long-term archival storage. Finally, DNA is an inherently environmentally friendly medium in terms of power, space, and sustainability, and will place significantly lower burdens on our fragile ecosystem than legacy storage technologies. Thus, as the technologies of DNA data storage advance, DNA as a data storage medium will be a golden opportunity in this era of big data.

GLOSSARY

Data Volumes

Unit	Value	Example
Kilobytes (KB)	1,000 bytes	A paragraph of a text document
Megabytes (MB)	1,000 Kilobytes	A small novel
Gigabytes (GB)	1,000 Megabytes	Beethoven’s 5th Symphony
Terabytes (TB)	1,000 Gigabytes	All the X-rays in a large hospital
Petabytes (PB)	1,000 Terabytes	Half the contents of all US academic research libraries
Exabytes (EB)	1,000 Petabytes	About one fifth of the words people have ever spoken
Zettabytes (ZB)	1,000 Exabytes	As much information as there are grains of sand on all the world’s beaches
Yottabytes (YB)	1,000 Zettabytes	As much information as there are atoms in 7,000 human bodies

Computer Technology Terminology

Areal density	The number of bits a medium can store per unit area
Binary	A language that forms the basis of all computer information, in which there can only be two values: either 0 or 1. Each binary digit is known as a bit.
bit	A single binary digit: either 0 or 1. Bits can be combined to represent different values, e.g., two bits can represent four values: 00, 01, 10, 11. The values these bits represent depend upon the coding notation used.
byte	A basic unit of memory that is usually 8 binary digits, or bits, in length.
string	A sequence of characters or symbols.

Molecular Biology Terminology

DNA	Deoxyribonucleic acid (DNA) is a biological molecule composed of repeating units, called nucleotides, predominantly arranged within a double helix formation. The order of nucleotides, each of which can contain one of four possible bases – adenine (A), thymine (T), guanine (G) and cytosine (C) – forms the genetic code present within nearly all lifeforms.
DNA sequence	The order of nucleotides in DNA.
sequencing	Determining the order of nucleotides in a DNA molecule. Essentially, the reading of the DNA sequence.

REFERENCES

1. New Research Topics. Semiconductor Research Corporation. Retrievable at: https://www.src.org/src/new-research-topics/

2. Data Storage Supply and Demand Worldwide, from 2009 to 2020 (in exabytes). Statista. Retrievable at: https://www.statista.com/statistics/751749/worldwidedata-storage-capacity-and-demand/

3. The Next Generation of Information Storage. iGEM. Retrievable at: http://igem.org/Team:Edinburgh_UG

4. New Research Topics. Semiconductor Research Corporation. Retrievable at: https://www.src.org/src/new-research-topics/

5. The Synthetic Biology Industry: Annual Industry Growth Update. SynBioBeta. Retrievable at: https://synbiobeta.com/wp-content/uploads/sites/4/2017/03/Synthetic-BiologyIndustry-Annual-Growth-Update-2017.pd

6. Palluk, S.; Arlow, D.H.; de Rond, T.; Barthel, S.; Kang, J.S.; Bector, R.; Baghdassarian, H.M.; Truong, A.; Kim, P.W.; Singh, A.K.; Hillson, N.J.; Keasling, J.D. June 18, 2018. De novo DNA synthesis using polymerase-nucleotide conjugates. Nature Biotechnology. Retrievable at: https://www.nature.com/articles/nbt.4173

7. October 2, 2018. DNA Script Announces World’s First Enzymatic Synthesis of a High-Purity 150-Nucleotide Strand. DNA Script. Retrievable at: http://www.dnascript.co/pr9

8. History. Molecular Assemblies. Retrievable at: http://molecularassemblies.com/about/story/

9. SmidgION. Oxford Nanopore Technologies. Retrievable at: https://nanoporetech.com/products/smidgion

10. NSF Program Solicitation 17-557. National Science Foundation. Retrievable at: https://www.nsf.gov/pubs/2017/nsf17557/nsf17557.htm

11. Nuclera. Retrievable at: https://www.nuclera.com

12. Genes. Twist Bioscience. Retrievable at: https://www.twistbioscience.com/products/gen
