The New York Times reports today that scientists reading human genomes are generating so much data that they must use snail mail instead of the Internet to send the DNA readouts around the globe.
BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day.
BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.
“It sounds like an analog solution in a digital age,” conceded Sifei He, the head of cloud computing for BGI, formerly known as the Beijing Genomics Institute. But for now, he said, there is no better way.
The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law, which describes the rate at which computing gets faster and cheaper.
The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.
We’ve been talking about the oncoming rush of biomedical data for a while. A human genome consists of some 2.9 billion base pairs, which at two bits per base fits in around 725 megabytes. Two thousand genomes a day at 725 MB each comes to 1,450,000 MB, or 1.45 terabytes. That’s a lot of data for one entity to transmit in a day’s time. Some researchers believe a genome can be losslessly compressed to approximately 4 megabytes. In that compressed form, 2,000 genomes would total around 8,000 MB, or just 8 gigabytes, which is easily doable for a major institution.
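The back-of-the-envelope numbers above are easy to check with a few lines of Python. This is only a sketch under stated assumptions: it takes two bits per base (which is what produces the 725 MB figure), uses decimal megabytes, and the 100 Mbit/s link speed is a purely illustrative guess, not a number from the article or from BGI.

```python
# Rough size and transfer-time estimates for the figures quoted above.
BASE_PAIRS = 2_900_000_000   # approximate human genome length used in the post
BITS_PER_BASE = 2            # assumption: A, C, G, T packed into 2 bits each
GENOMES_PER_DAY = 2_000      # BGI's reported daily output
MB = 1_000_000               # decimal megabyte, in bytes


def packed_genome_mb(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Size of one genome in MB when each base takes bits_per_base bits."""
    return base_pairs * bits_per_base / 8 / MB


def daily_volume_mb(per_genome_mb, genomes=GENOMES_PER_DAY):
    """Total MB produced per day at a given per-genome size."""
    return per_genome_mb * genomes


def transfer_hours(volume_mb, link_mbit_per_s):
    """Hours needed to push volume_mb through a link of the given speed."""
    seconds = volume_mb * 8 / link_mbit_per_s
    return seconds / 3600


if __name__ == "__main__":
    link = 100  # hypothetical sustained link speed in Mbit/s
    for label, size_mb in [("2-bit packed", packed_genome_mb()),
                           ("4 MB compressed", 4)]:
        total = daily_volume_mb(size_mb)
        hours = transfer_hours(total, link)
        print(f"{label}: {total:,.0f} MB/day, {hours:.1f} h at {link} Mbit/s")
```

At the assumed link speed, a day of packed genomes takes longer than a day to send, while the 4 MB version goes out in minutes. Swap in a different `link` value to see how sensitive the conclusion is to bandwidth.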
Interested to know more.