What do you get when you mix data from a genome-wide association study of schizophrenia with electronic medical information about blood cancer?
If you are Steven McCarroll, assistant professor of genetics at Harvard Medical School, you discover that warning signs of some blood cancers are already present in the blood, in the form of precancerous mutations, long before the cancer shows itself.
The work illustrates the power of using—and repurposing—big data, which was the theme of ReSourcing Big Data, a symposium held at the Joseph B. Martin Conference Center at HMS.
McCarroll, for instance, had set out to look for schizophrenia risk genes, but the data told a different story, one suggesting that early detection—very early detection—of diseases such as cancer might be possible.
Underscoring this point, another team at HMS simultaneously made the same discovery using data about Type 2 diabetes and cardiovascular disease.
“We used different large data sets and found the same thing,” said McCarroll.
McCarroll’s presentation was one of eight made during the March 23 daylong symposium that explored the creation, use and reuse of big data from many angles.
“An important part of the big data revolution—or evolution, if you will—has been the provocative questions that have been raised about medical research, patient privacy, security, analysis and more,” said symposium moderator Eva Guinan, HMS professor of radiation oncology at Dana-Farber Cancer Institute. “It’s just not about the data, but about lots of things.”
Sponsored by Harvard Catalyst’s Reactor Program, the event supported the program’s goal to accelerate clinical and translational research, said HMS Dean Jeffrey Flier in his opening remarks.
“Innovations in how we use big data will be absolutely critical for this acceleration to happen,” he said.
Moreover, Flier emphasized the importance of harnessing big data across medicine and collaborating across disciplines to make sense of it.
“The potential is no longer limited to the genomics and informatics communities,” he said. “This is an area that all investigators and clinicians must become conversant with in the coming decades.”
In the field of psychiatry, for example, big data has been transformative, said speaker Steven Hyman, director of the Stanley Center for Psychiatric Research at the Broad Institute of Harvard and MIT.
Since the 1950s, no new drugs or drug targets have emerged for psychiatric disorders such as schizophrenia, bipolar disorder or autism, but recent large-scale studies have enabled the discovery of gene variants that are associated with these disorders, providing researchers with a better understanding of these diseases.
“What mattered was technology,” said Hyman, referring to the dropping cost of genomic sequencing and to advances in computing that enable analysis of data at scales never before possible.
For instance, the search for genes associated with schizophrenia risk came up empty until the genes of 10,000 schizophrenia patients were analyzed. The number of schizophrenia-risk genes has since reached 108, based on data from 37,000 patients.
Similar large-scale studies are underway for autism, attention-deficit disorder and bipolar disorder. For schizophrenia, plans are in place to sequence 100,000 patients in the coming years.
“The scale is amazing,” said Hyman. “But when I showed the data to a Google executive, he said, ‘How can you learn anything with such tiny data sets?’ That puts big data into perspective.”
These advances also point to another trend: a wider range of collaborations. Nontraditional collaborators may get involved, including patients and nonmedical experts such as mathematicians and even gamers, said speaker Stephen Friend, president of Sage Bionetworks.
“The person who generates the data is great, but the best insights might not come from them,” he said.
Friend presented innovative approaches to working with big data that could enable new kinds of collaborations. The projects open up biomedical research to new approaches, such as using competition as an incentive, fostering transparency and openness, and using crowdsourcing to address challenges, such as driving new diagnostic criteria for amyotrophic lateral sclerosis (ALS).
Engaging patients as partners will also be important, said Friend. He provided examples of disease-oriented smartphone apps that allow patients to record and share data about their conditions. Such apps make data sharing so easy that data sets emerge organically.
“That’s when you can really start to solve problems,” Friend said. “When you don’t have to write a study to get the information; that’s when things begin to crack open.”
Examples of these advances came from two vastly different approaches. Paul Avillach, assistant professor of pediatrics and a member of the Center for Biomedical Informatics at HMS, demonstrated the power of tranSMART, a web-based, open-source technological platform that allows users to integrate big data from multiple sources and then mine it to generate new hypotheses.
For instance, with a few clicks, Avillach searched genomic data from more than 2,700 autistic patients and found 53 who also had epilepsy. As he dragged and dropped a few more fields into his search tool, he located six tissue samples from those patients, along with information about whom to contact to access them.
With a few more clicks, Avillach said, he could also identify which patients had given consent to use their data, based on an amalgam of consent forms for patients who were involved in more than one study.
“The idea is to integrate all of the data, allowing an investigator to touch the data without writing a single line of code,” he said.
At the other end of the spectrum, Sally Okun, a registered nurse and vice president of advocacy for PatientsLikeMe, described the rapid growth of the organization’s online community of over 320,000 patients.
The network represents 2,300 conditions and contains tens of millions of unique data points, including what is often described as “soft and squishy information” that the team curated for research purposes.
In 2011, an analysis of data reported on the website by ALS patients predicted the results of a clinical trial before the trial ended. The finding, published in Nature, suggested that information that patients share online could play a role in clinical research.
Okun and several other presenters said patients, for the most part, are inclined to share their data for use in research that will help others.
“We should be encouraging collaboration with patients,” Okun said. “Big data from claims databases will not be nearly as rich as data from humans living every day with illness.”
‘Big Data is Messy’
The final speaker of the day, Tariq Khokhar, data scientist at the World Bank, discussed the use of big data in the world of economics. In 2010, the World Bank began making the data it collects open, available for free and readily searchable online, even though the sale of that data used to be a source of revenue.
Users include journalists and policymakers but also medical researchers. For instance, detailed records of mobile phone calls, which reveal who called whom, when and from where, can be used to estimate population sizes, population movement trends and even income.
“Epidemiologists can combine this with disease data to forecast the spread of disease,” said Khokhar.
The promise of big data comes with a caveat, however.
“Big data is messy,” Khokhar said. “Dealing with it appropriately is the discipline you want to cultivate.”
Other symposium speakers included Joanne Waldstreicher, chief medical officer at Johnson & Johnson, and Marsha Wilcox, scientific director of Janssen Pharmaceutical Companies of Johnson & Johnson. They presented their efforts to increase data transparency by sharing clinical trial data.
Johnson & Johnson, in collaboration with the Yale Open Data Access (YODA) Project, has already shared data from multiple trials with several independent researchers.
“There are many risks and challenges to sharing data,” said Waldstreicher. “But rich science and discovery will come if you put clinical trial data into the mix of data available for secondary analysis.”
The day following the symposium, investigators and big data owners discussed potential collaborative opportunities for reusing their data sets. Topics covered included the Human Oral Microbiome Database, led by Floyd Dewhirst of Harvard School of Dental Medicine and the Forsyth Institute; National Sleep Research Resource, led by Susan Redline, the Peter C. Farrell Professor of Sleep Medicine at Brigham and Women’s Hospital; OPTICS Project: Open Translational Science in Schizophrenia, by Marsha Wilcox of Janssen Pharmaceutical Companies of Johnson & Johnson, and Michelle Williams; and Alternate Approaches to Explore Multi-Dimensional Clinical Spaces, led by Stephen Friend of Sage Bionetworks.