Laura Schmitt, CS @ ILLINOIS
Note: this was the feature article in Click! Magazine, 2016, volume II.
When the first human genome was mapped more than a decade ago, it heralded the beginning of a potentially revolutionary era in healthcare. Precision medicine—using an individual’s specific genetic profile to help prevent, diagnose, and treat disease—seemed to be on the horizon. Thanks to rapidly decreasing costs of sequencing and analyzing genomes, researchers were hopeful that advances in precision medicine would quickly accelerate.
There has been some progress. For example, scientists have mapped gene mutations to specific diseases and disorders, including cystic fibrosis, melanoma, and lung and breast cancers. And patients with certain types of cancers routinely undergo molecular testing as part of their care, enabling physicians to prescribe drugs that target only cancerous cells once they understand a tumor’s genetic makeup. The result? Improved survival rates and fewer adverse side effects for patients.
However, even proponents of precision medicine still await the widespread breakthrough that many predicted. One of the barriers is that many diseases don’t correlate directly to a single gene. For example, there are at least 65 genetic variations that increase the risk of developing Type 2 diabetes. Some researchers believe that if they can collect and analyze more genome data, then they may be able to identify every possible mutation that could be a factor in many diseases.
The Obama administration’s Precision Medicine Initiative is beginning to collect the medical records and genomic data of one million American volunteers. At the same time, massive amounts of already collected genomic data are being publicly released to researchers. For example, the National Cancer Institute’s Genomic Data Commons database, which contains data from 12,000 cancer patients, will enable researchers to access clinical information about cancer tumors and the efficacy of specific treatments.
This proliferation of genomic Big Data presents a unique set of challenges when it comes to acquiring, storing, distributing, and analyzing the data. According to CS Associate Professor Saurabh Sinha, who studied the issue with researchers from Cold Spring Harbor Lab and fellow Illinois faculty, genomic information from sequencing different organisms and a number of humans is already at the petabyte scale—a petabyte is 1 million gigabytes. By 2025, genomic data will, by their projection, reach the exabyte scale, or billions of gigabytes. To put this in perspective, genomic data may surpass YouTube and come close to matching astronomy as the reigning Big Data source.
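The jump from petabytes to exabytes can be made concrete with a little arithmetic. The following is a minimal sketch, assuming decimal (SI) storage units; the constant and function names are illustrative and do not come from the study.

```python
# Decimal (SI) storage units, expressed in gigabytes.
# Assumption: 1 PB = 10**6 GB and 1 EB = 10**9 GB, matching the
# "1 million gigabytes" and "billions of gigabytes" figures in the text.
GB_PER_PETABYTE = 10**6
GB_PER_EXABYTE = 10**9


def petabytes_to_gigabytes(petabytes):
    """Convert a size in petabytes to gigabytes."""
    return petabytes * GB_PER_PETABYTE


def exabytes_to_petabytes(exabytes):
    """Convert a size in exabytes to petabytes."""
    return exabytes * (GB_PER_EXABYTE // GB_PER_PETABYTE)


if __name__ == "__main__":
    print(petabytes_to_gigabytes(1))  # gigabytes in one petabyte
    print(exabytes_to_petabytes(1))   # petabytes in one exabyte
```

In other words, moving from the petabyte scale to the exabyte scale is roughly a thousandfold increase in the amount of data to be stored and analyzed.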
“The DNA sequence in itself is not particularly useful for realizing all the great possibilities that genomics technology promises,” said Sinha, when his team’s study was published in July 2015. “The sequence data have to be analyzed through sophisticated and often computationally intensive algorithms, which find patterns in the data and make connections between those data and various other types of biological information, before they can lead to biologically or clinically important insights.”
Several CS @ ILLINOIS faculty are developing the computational analysis tools and methods that will not only make sense of the data, but will also provide the necessary security and privacy safeguards.
Creating a knowledge engine for genomics
A relatively new way for scientists to identify genes involved in human disease, genome-wide association studies have discovered some genetic variations that contribute to macular degeneration, Type 2 diabetes, Parkinson’s disease, obesity, and heart disorders. However, these studies haven’t resulted in a slew of new therapies.
“The big lesson doing these genome-wide association studies for the last 10 years is they’re not as powerful in revealing the underlying source of biological conditions as we would have expected,” noted Sinha, who is also a member of the U of I Carl Woese Institute for Genomic Biology. “The realization is that we need to tie in these studies’ data and analyses with what we already know about biology at the molecular level. The genomics community has generated vast bodies of knowledge that could be incorporated here, but other than some crude methods, there’s no systematic way to do that now.”
Sinha and CS Professor Jiawei Han hope to change that. Funded by a $9.3 million grant from the National Institutes of Health’s Big Data to Knowledge (BD2K) initiative, they have partnered with fellow Illinois faculty and Mayo Clinic scientists to create a revolutionary cloud-based analytical tool that enables biomedical and clinical researchers to extract knowledge from genomic data. Their Knowledge Engine for Genomics (KnowEnG) tool will be uniquely powerful in its integration of many disparate sources of gene-related data, and its scalable design will be able to accommodate the continued growth of genomic community knowledge.
When it’s complete, researchers will access KnowEnG via an intuitive web-based user interface, through which they can analyze their own gene-based data sets in the context of the entire body of previously published gene-related data (aka the Knowledge Network). Currently, the team is testing KnowEnG’s usefulness and functionality through three projects: pharmacogenomics of breast cancer, identification of gene regulatory modules underlying behavioral patterns, and genome-based prediction of microorganisms’ ability to synthesize novel drugs.
A part of precision medicine, pharmacogenomics is the study of how a person's unique genetic makeup influences his or her response to medications. So far, Sinha has developed a novel computational method that integrates the public knowledge of gene expression, genotype, and drug response data in diseased cells. His Gene Expression in the Middle (GENMi) tool discovered specific proteins, or transcription factors, that regulate which genes are turned on or off, thus influencing the effectiveness of certain drugs. GENMi is the first technique to assess regulatory associations with drug response.
In another area, Sinha and his colleagues are collaborating with NCSA on developing the computing infrastructure necessary to make their tools widely accessible to biologists and clinicians through a user-friendly web-based portal, rather than simply being posted on GitHub or a research group’s web site. The back end of KnowEnG, which contains the massive body of community knowledge, will reside in the cloud.
Mining data to extract medical knowledge
CS Professor ChengXiang Zhai uses his data mining and natural language processing expertise to develop text information systems that enable knowledge discovery from vast amounts of information. “The goal of my research in the health domain is to empower patients or doctors to improve decision making,” Zhai said. “We can use advanced technology to reduce medical costs and improve diagnosis and treatments.”
One area of health costs and outcomes that Zhai is addressing is side effects of prescribed medication. According to the Food and Drug Administration, adverse drug reactions are the fourth leading cause of death among hospital patients in the United States. And experts estimate that the cost of treating these reactions is nearly $136 billion.
Zhai and his PhD student Sheng Wang created a novel tool that analyzed patient discussions about drug side effects found on Internet-based health forums. They used advanced data mining techniques to separate the vocabulary so they could automatically identify the drug names and whether the patient was describing disease symptoms or suspected drug side effects.
Their SideEffectPTM software was able to discover the side effect symptoms of many drugs in an unsupervised way—a major improvement over conventional methods, which require a health professional to first annotate or label the mined data so the software can use machine learning techniques to become familiar with the medical terminology.
“Our approach successfully identified side effects for certain drugs,” Zhai noted. “We even discovered a few side effects that had not been reported to the FDA yet. In addition, we demonstrated that our analysis method found some drug-drug interactions.”
In the future, Zhai and his team want to expand their data collection and mining to include hospital patient medical records and patient genomic data rather than just online health forums to discover knowledge about variations of side effects in different genetic groups. Someday, he envisions his techniques being incorporated in a hospital system, where doctors could look for any known or suspected side effects associated with drug interactions for each individual patient before prescribing medicine.
In a separate collaboration with IBM, Zhai analyzed patient medical records to predict the onset of congestive heart failure (CHF). Specifically, the team analyzed unstructured text data found in the clinical notes section of the records. By using natural language processing techniques, the team could extract useful signals from the notes, especially mentions of symptoms. The newly discovered symptoms could then be added to the known symptoms of CHF.
“We showed that these additional symptoms can improve the accuracy of predicting the onset of CHF by 10 percent,” Zhai said. “Before this experiment, there was a predictor for CHF by taking data from the structured fields, not the clinical notes. Our result is a specific example of the general benefit of exploiting Big Data: combining all relevant data to a problem from all sources in a predictive model can often improve prediction accuracy.”
Ensuring data security and patient privacy
While the collection and sharing of genomic data have many benefits, they also create potential challenges. According to CS Professor Carl Gunter, genomic Big Data has major personal privacy implications because it is unique to each individual and requires additional safeguards compared with data typically found in electronic health records, like body temperature, weight, or blood pressure. “Genomic data changes little over a lifetime and may have value that lasts for decades,” he said. “This long-lasting value means that holding and using genomic data over extended periods of time is likely.”
Gunter and his students are developing techniques that guarantee the security and privacy of genomic data while ensuring that biomedical researchers or clinicians have appropriate access to the data. They created the Controlled Functional Encryption (C-FE) tool, which improves on conventional methods like functional encryption or secure two-party computation because it doesn’t require any direct interaction between the researcher and the genome donor. In addition, C-FE is more secure and computationally efficient.
The C-FE tool is ideally suited for a situation where a cancer patient’s physician wants to search on a nationwide scale for similarly diagnosed patients with similar genetic makeup in the hopes of finding therapies that worked or didn’t work. While many schemes have been proposed for this scenario, they are designed to compare two genomes, not hundreds or thousands. Gunter’s C-FE scheme is scalable, so it efficiently supports genomic profile comparisons among very large populations, costing between $140 and $1,400 depending on the number of genetic variations being sequenced.
In another research project, Gunter and his students created a framework for properly handling genomic data—from understanding security and privacy requirements to designing a threat model that identifies the types of possible attacks at each step in obtaining, storing, and analyzing the data. Gunter’s framework deals head on with the issue of re-identification.
One of the most serious forms of attack, re-identification occurs when an unauthorized party tries to recover the identities of donors by looking at the published human genomes. A successful re-identification attack could seriously harm the donor through potential employment discrimination, denial of life insurance, or inappropriate marketing.
Although genomic repositories have removed obvious personal information like name and date of birth, hackers could still potentially identify individuals. “DNA is intrinsically identifiable, so if a hacker had access to your data, they could infer phenotype information—a person’s observable characteristics like eye, hair, and skin color,” Gunter explained. “There’s a chance they determine the shape of your face, and in the future, they could possibly identify people quickly from DNA data.”
How new sequence alignment tools can impact precision medicine
The human microbiome is known to have a huge impact on health. For example, the gut microbiome (i.e., the bacteria living in the intestine) differs between thin and obese people, and can affect whether a person develops Type 2 diabetes, how he or she responds to drugs, etc. No two people have the same microbiome, so understanding and characterizing human microbiomes has immediate relevance to precision medicine.
The challenge in analyzing microbiomes is that the data consists of millions to billions of short DNA fragments. “We need to be able to look at each fragment, which might only have 100 nucleotides, and figure out what gene and what species it is,” said CS Professor Tandy Warnow. “Fundamentally, the challenge is to be able to place a short fragment inside the Tree of Life. But because the fragment is very short and also has errors (due to sequencing technologies), this is a very challenging problem.”
To address this problem, Warnow and CS Professor William D. Gropp are designing new statistical methods and implementing them for supercomputers, to ensure that the best accuracy can be obtained with high speed.
According to Warnow, one of the grand challenges in analyzing microbiome data is the need for new multiple sequence alignment (MSA) software tools that are more accurate and capable of handling larger data sets than conventional methods. Biologists and other researchers use MSA tools to analyze and find similarities in genomic data like DNA, RNA, and proteins, which can have millions of sequences.
These alignments are also used for constructing evolutionary trees, predicting protein structures and functions, understanding how humans migrated across the globe, etc. Thus, multiple sequence alignment methods are needed not only for microbiome analysis, but also for many other challenging scientific problems.
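To give a feel for what "aligning" sequences means, here is a minimal sketch of the classic Needleman-Wunsch algorithm, which aligns just two sequences. It is a textbook illustration, not Warnow's method: multiple sequence alignment tools handle thousands to millions of sequences with far more sophisticated techniques. The function name and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of two sequences.

    Returns (score, aligned_a, aligned_b), where '-' marks a gap.
    The scoring values are illustrative defaults, not tuned parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

Even this two-sequence version fills in a table whose size is the product of the two sequence lengths, which hints at why aligning very large collections of sequences calls for new methods and substantial computing power.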
However, there’s a catch: in addition to their vast size, the genomic data are sometimes fragmented or incomplete, which makes it very difficult to construct an accurate alignment. “There’s this whole cascade of inferences that begins with accurate multiple sequence alignments—errors here have consequences downstream,” explained Warnow. “What was happening was people were giving up on their data, thinking it was too big or too heterogeneous to be analyzed accurately.”
Warnow has developed a set of new MSA tools that can accurately and quickly align large sequence datasets. An international team of researchers used Warnow’s alignment method PASTA to align the DNA of 1,200 plant species and 13,000 gene families as part of the One Thousand Plant Transcriptomes Project (1kP), which revealed new knowledge about the evolution of plant life on Earth and identified new algae proteins in the sequence data that may someday even have applications in medicine.
“We give biologists and researchers better methods, allowing them to get good accuracy quickly,” said Warnow. “Therefore they get better biological discoveries downstream.”
Saturday, October 29, 2016 - 14:15