Sequencing the DNA of one million people will create datasets of unprecedented size; Intel, Broad add AI, improved analytics, reference architectures, cloud infrastructure to help carry the load
Intel Corp. and the Broad Institute have announced a five-year, $25 million collaboration designed to help researchers work more conveniently and efficiently with data sets that double about every eight months and could, conceivably, grow to occupy half the storage capacity currently available anywhere on the Internet.
Intel and the MIT/Harvard-affiliated Broad Institute announced the formation of the Intel-Broad Center for Genomic Data Engineering, a project designed to gather and optimize best practices in hardware, software and analytics programming to make analysis of genomic data more efficient. “The size of genomic datasets doubles about every eight months, and, as it does, the challenge of acquiring, processing, storing, and analyzing this information increases as well,” said Eric Banks, a senior director of the Broad Institute’s Data Sciences and Data Engineering Group.
Over the course of the five-year collaboration, Intel and Broad will revamp the analytics and software libraries used for genomic analysis and build high-speed infrastructure that lets researchers reach extremely large databases online, with the compute capacity and storage needed to analyze the data, Broad Institute Chief Data Officer Anthony Philippakis said in a YouTube announcement video.
“We hope to create a model for other industries to break down barriers and speed up research and discovery that relies on complex and distributed datasets,” Philippakis says.
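To get a feel for what the eight-month doubling rate quoted above implies, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the doubling period stays constant, which the article does not claim, and the function name `growth_factor` is made up for this example.

```python
def growth_factor(months: float, doubling_period_months: float = 8.0) -> float:
    """Return how many times larger a dataset becomes after `months`,
    assuming (hypothetically) a constant doubling period."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Print the multiplier after one, two and five years of steady doubling.
    for years in (1, 2, 5):
        print(f"after {years} year(s): about {growth_factor(12 * years):.1f}x")
```

Under that assumption, a dataset would grow roughly eightfold in two years and more than 180-fold over the five years of the collaboration.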
A completely different category of “big” data
It might seem surprising that researchers need so much help just to keep up with the volume of data, in an age when high-capacity storage, hyperscale datacenters and an unprecedented number of supercomputers in both scientific and commercial laboratories have turned compute-intensive analytics into an everyday resource for businesses and even consumers. Genomic data, however, is big on a scale that conventional big data just can’t touch.
A 2015 analysis of the growth of data from human genome sequencing concluded that, by 2025, sequenced human genomic data will rival or exceed the amount of bandwidth that would be required if Twitter grew to the point of circulating 15 billion Tweets in a single year (at an estimated 17 petabytes per year) and YouTube to 900 million hours of video per year. By 2025, human genetic sequencing data may occupy between 2 exabytes and 40 exabytes of storage per year, according to the analysis, which was led by University of Illinois at Urbana-Champaign researcher Zachary D. Stephens and published in PLOS Biology.
In 2015, Intel estimated a single patient’s genomic sequence requires about 1TB of storage space per year, meaning 1,000 patients would generate 1PB of data.
That is an enormous load to move on and off disk for analysis, but it’s a drop in the bucket compared with what lies ahead. The estimate was part of Intel’s support for a massive effort by the National Institutes of Health to sequence and put online the genomic information of one million people. The goal is to provide enough data to make it possible to analyze a patient’s condition and genomic background, devise an effective, safe treatment designed for his or her particular biochemistry, and provide a cure, all within a single 24-hour period. Intel created the Collaborative Cancer Cloud to hold that data online and to provide analytics, processing power and tools to researchers.
Sequencing the genome of every living person would require 7.3 zettabytes of storage, half of all the storage estimated to be attached to the Internet during 2016, according to a presentation at BioData World Congress in November outside of London by Michael J. McManus, Ph.D., senior health and life sciences solution architect for Intel.
“To put it mildly,” McManus wrote, “sequencing everyone comes with a significant set of challenges. The amount of data generated by sequencing everyone borders on the unimaginable, and therefore the question of storage of this data becomes critical.”
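The storage figures above are consistent with one another, as a rough back-of-the-envelope check shows. The sketch below takes Intel's estimate of about 1 TB per sequenced person at face value (ignoring its "per year" qualifier), uses decimal units, and assumes a world population of roughly 7.3 billion. The population figure and the helper name `storage_bytes` are assumptions made for this illustration and do not come from Intel or McManus.

```python
# Decimal (SI) storage units, in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21


def storage_bytes(people: int, bytes_per_genome: int = TB) -> int:
    """Total storage if every person needs `bytes_per_genome` bytes
    (default: Intel's roughly 1 TB estimate, treated as a flat figure)."""
    return people * bytes_per_genome


if __name__ == "__main__":
    print(storage_bytes(1_000) / PB, "PB for 1,000 patients")
    print(storage_bytes(1_000_000) / EB, "EB for one million people")
    # Assumed world population of about 7.3 billion (not stated in the article).
    print(storage_bytes(7_300_000_000) / ZB, "ZB for everyone")
```

On those assumptions, 1,000 patients come to 1 PB, the one million people in the NIH effort come to about 1 EB, and the whole population lands at the 7.3 zettabytes McManus cites.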
Processing that volume of data is also a problem, one on which Intel has been collaborating with the Broad Institute for several years.
In 2014 the two announced a version of Broad’s Genome Analysis Toolkit (GATK) optimized for Intel’s Advanced Vector Extensions (AVX) instruction set, which can accelerate analysis on Xeon-powered servers. Intel and Broad were able to increase performance by three to five times, according to a 2014 announcement.
The two are taking another crack at the GATK, with plans to codify Broad’s best-practice hardware recommendations into blueprints for on-premises, public cloud or hybrid configurations for GATK analyses.
The two also announced plans to revamp and simplify the execution of genome analytics by optimizing the analysis software that does the work, including GATK, Cromwell and GenomicsDB, for Intel-based hardware.
The third initiative the two announced was a plan to encourage pharmaceutical companies, academic researchers and healthcare providers to participate in ongoing genomic research — increasing the number of hands as well as the number of petabytes available to do the work.
“Intel and Broad share the common vision of harnessing the power of genomic data and making it widely accessible for research around the world to yield important discoveries,” Diane Bryant, executive vice president and general manager of the Data Center Group for Intel, is quoted as saying in the announcement. “Through the use of artificial intelligence, we are confident we can solve the massive data challenges facing the industry.”
Source: Intel YouTube channel
Source: Michael J. McManus, presentation at BioData World Congress in November
itpeernetwork.intel.com/sequencing-everyone-fantastic-goal-huge-implications/
itpeernetwork.intel.com/insights-biodata-world-congress-2016/
DNA Sequencing Data Grows at Astronomical Rates
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Stephens, et al, July, 2015, PLOS Biology
The post Intel, Broad Institute Aim to Streamline Effort of Analytics in Genomic Research appeared first on Go Parallel.