Featured Program: Cancer Data Sciences

April 13, 2017

Big Cancer Data Moves from Descriptive to Predictive

Franziska Michor, PhD

In this age of big data, all areas of medicine can benefit from robust statistical analysis. The field of cancer research is no exception. Franziska Michor, PhD (DFCI) leads the DF/HCC Cancer Data Sciences Program, which seeks to help harness the power of big data. The Program co-leaders are Xihong Lin, PhD (HSPH) and Peter Park, PhD (HMS). 

As program leader, Michor points out that the field of data science in cancer is evolving. “I think we are trying to become more and more predictive,” she said. “Previously, biostatistics, which has always been involved in testing hypotheses, was much less predictive than it is becoming today. In addition to describing what we see, we are now coming up with methods that will help us predict outcomes and also to determine which treatment a certain patient should get.”

According to Michor, the challenges facing the field of cancer data sciences include not only processing and correctly interpreting data, but also collecting the right data in the right format. “Sequencing samples taken from multiple time points from multiple disease sites in the same patient - that's very hard to produce.” She added, “Overall, we hope that the efforts of program members will ultimately improve precision medicine. These approaches take full advantage of patients' specific cancer genotype so we can identify optimum combination treatment, potentially including immunotherapy.”

Technological Advances Bolster Cancer Data

Program member Gad Getz, PhD (MGH), directs the Cancer Genome Computational Analysis group at the Broad Institute of MIT and Harvard and the Bioinformatics program at the Massachusetts General Hospital Department of Pathology and Cancer Center. He says his research combines computational analysis with a “wet” lab that validates the computational results.

“The field is evolving very rapidly,” Getz said. “Ten years ago we generated the first Next-gen sequencing data, and in the last decade we have gone from generating data from a single tumor to generating data from many thousands of patients, whole exomes, and whole genomes.”

Xihong Lin, PhD

According to program co-leader Lin, a major challenge right now is collecting and processing the massive amounts of data being generated, in large part because sequencing costs have fallen. At this time, the cost of sequencing a single human genome is a little over $1,000, but this cost may drop to as low as $100 [1]. Much of Lin’s work, therefore, focuses on developing methods that make analyses of these massive amounts of data statistically more reliable and that identify whether a result is real or due to chance.

Lin is involved in several collaborations, one of which is with David C. Christiani, MD, MPH (HSPH) in the area of lung cancer. Lin and her colleagues have used genome-wide association analysis to identify genetic markers associated with lung cancer and to tease out which part of lung cancer risk is caused by genetics and which part is caused by smoking.

“We developed a strategy to integrate the genetics data with the environmental (i.e., smoking) data, then tried to understand the biological pathway and how genetics affects lung cancer in those who smoke. Notably, we found out that many lung cancers develop through biological pathways and not completely through environmental exposure to smoking,” she said [2].

“It used to be that we would only look at one environmental factor at a time, but now with the technology available we can evaluate a spectrum of environmental factors simultaneously and work out exposure and whether this interplays with the genetics involved in a cancer,” she said.

To that point, Lin and program member Francesca Dominici, PhD (HSPH) are also heading a large P01 grant called “Statistical Informatics for Cancer Research.” It is designed to address gaps and barriers arising in the analysis of large and complex data from observational studies in cancer research.

Peter Park, who co-leads the program with Lin, specializes in analyzing genomic rearrangements and how a specific type of variant plays a role in tumorigenic processes. Whole genome sequencing, he says, is a “relatively new area, but a rapidly growing one.” According to Park, efforts are now underway through NIH-sponsored and other programs to sequence tens of thousands of genomes. One such program in which he participates is the International Cancer Genome Consortium, a multinational effort to analyze 2,500 whole genomes.

“Until recently,” said Park, “only a small portion of the genome has been investigated. We now have the technology to examine the entire genome.”

Park points out, however, that while they are learning a great deal from analyzing patients’ genomes, no hospital currently has a single database from which you can get genomic and clinical data. “Efforts are underway to do so but this will require many different institutions to work together,” he added. 

Cancer Data Sciences and Its Clinical Applications

Peter Park, PhD

According to Park, the Cancer Data Sciences Program will play an increasingly important role in cancer research, and clinical trials will be conducted quite differently in the future. “I think close collaboration between the program and clinicians will be central going forward. We will all need to work together to help incorporate genomic information into patients’ treatment.”

Michor’s research involves creating mathematical models of cancer evolution and evaluating how heterogeneous populations of cells evolve over time. Her research also helps predict how cancers will respond to different treatments and tries to determine the best approach to treat a heterogeneous population or to prevent a certain outcome, such as metastases or relapses after originally effective treatment. 

“We use mathematical models to evaluate different treatment strategies, pick out the one that is predicted to be best, and then we test it in a clinical trial,” Michor says. 

Cancer Data Sciences Goes Micro

Gad Getz, PhD

Getz is trying to understand what processes occur during the development of mutations. These could include events such as spontaneous deamination of methylated CpGs or defects in DNA repair. “We can use the mutation data to find new signatures that are associated with different processes.” He is also studying germline mutations, how they can confer cancer risk, and particularly the interplay between germline and somatic mutations. “Typically you have a single hit in the germline and then you have another hit in the somatic DNA, the other allele of the gene. For example, in BRCA1 you could be born with a BRCA1 pathogenic event and then during life you could get another mutation that deletes or disrupts the other wild-type allele and therefore lose all the BRCA1 functionality, which could cause cancer.”

Getz is working to identify the drivers of cancer, its subtypes, and their association with clinical parameters. “We want to find, for example, the genes that cause cancer - that give it the advantage to grow faster or die slower,” he said. “To do that, we will need to study many patients and look for genes that are mutated more than you would expect by chance. A statistical model can then be built of the mutations that occur just by chance, which are randomly scattered across the genome at different rates, versus those genes that have more mutations than we would expect by chance.” 
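The idea of flagging genes that are "mutated more than you would expect by chance" can be illustrated with a minimal sketch. It is not the method used by Getz's group, which the article does not describe. It assumes a single uniform background mutation rate (real background rates vary across the genome, as Getz notes), a Poisson model for chance mutations, and a Bonferroni correction for testing several genes. The gene names, lengths, counts, cohort size, and rate are all hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if k <= lam:
        # Most of the probability mass lies at or above k, so subtract
        # the (small) lower tail from 1.
        lower = sum(
            math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
            for i in range(k)
        )
        return max(0.0, 1.0 - lower)
    # k > lam: terms shrink from here on, so sum the upper tail directly.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # pmf(i) = pmf(i - 1) * lam / i
    return min(total, 1.0)


# Hypothetical inputs, chosen only for illustration.
BACKGROUND_RATE = 1e-6  # assumed chance mutations per base per patient
N_PATIENTS = 500
GENES = {
    # name: (coding length in bases, mutations observed across the cohort)
    "GENE_A": (3_000, 12),
    "GENE_B": (100_000, 55),
    "GENE_C": (2_000, 1),
}


def significantly_mutated(genes, rate, n_patients, alpha=0.05):
    """Return genes whose mutation count exceeds the chance expectation."""
    threshold = alpha / len(genes)  # Bonferroni correction
    hits = []
    for name, (length, observed) in genes.items():
        expected = rate * length * n_patients  # mean count under chance
        p_value = poisson_upper_tail(observed, expected)
        if p_value < threshold:
            hits.append(name)
    return hits


print(significantly_mutated(GENES, BACKGROUND_RATE, N_PATIENTS))
```

In this toy cohort, GENE_B carries the most mutations but is not flagged, because a gene that long is expected to collect about 50 mutations by chance alone. GENE_A is flagged because 12 mutations far exceed its expected 1.5. This is why the comparison has to be against a background model rather than raw counts.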

Getz says, “Part of my research is attempting to identify all events that happen when normal cells change into tumor cells, along with the events needed to happen for relapses, resistance, metastases, etc.” According to Getz, this effort is similar to when the Periodic Table of Elements was completed. “You do it once in the history of humanity—you complete the table of all genes that cause cancer, and that’s it.”  The listing of changes that take place would be the basis for understanding the pathways and processes that participate in cancer and developing targeted therapies directed towards them.

Rafael Irizarry, PhD (DFCI) is professor of Applied Statistics at Harvard T.H. Chan School of Public Health and a member of the Cancer Data Sciences Program. His work involves methods to quantify RNA sequencing (RNA-seq) data into measures of gene expression at the transcript level. “We’ve detected some errors that can arise due to experimental protocol and sequence composition that can alter the signal,” Irizarry said. “This is something that researchers didn't realize at first. Now we've discovered this and we're developing solutions to deal with that bias.” 

Irizarry is also developing methodology to eliminate statistical bias from single-cell RNA-seq data, which turns out to be more challenging than analyzing bulk RNA-seq data. “We're just getting started and the data that comes out is very noisy and has a lot of systematic error,” he said. Single-cell RNA-seq examines the transcriptomes of individual cells and can detect cellular heterogeneity, identify new cell types, and characterize tumor microevolution.

The Future of Big Data 

Lin adds that an important mission for her is to train the next generation of biostatisticians and data scientists. “I work very closely with students and post docs and I feel that is a very significant part of my career. It’s also so rewarding seeing them be successful in their own career.” This need to train the next generation of data scientists was also emphasized by many program members.

Irizarry has developed several Massive Open Online Courses (MOOCs) in the area of data sciences that are hosted on HarvardX. “It's something that I'm excited about because it allows you to teach so many people,” he said. Irizarry explained that in medical sciences many people become data scientists, but often by necessity, not by choice. They come to the lab, find that they need to analyze data to answer their question, and may have no training to do so. “So you have all these people in labs, post docs in particular, who all of a sudden have to analyze data and they're not trained for it.”

Irizarry’s most popular course to date has been the most basic one he offers: “Statistics and R for the Life Sciences.” Currently, he is working on an even more basic introduction, likely to be called “Introduction to Data Sciences,” that should be available this summer. “There’s a lot of demand for information on the basics of data science,” he noted, “especially at the introductory level.”

The Cancer Data Sciences Program has more than 60 members who participate in research that includes clinical trial design and statistical evaluation, bioinformatics pipelines for data analysis, and mathematical models, as well as in other researchers’ projects. Since 2015, DF/HCC has offered its members a series of nodal awards to foster collaboration between discipline-based programs, pairing a basic science program with a population science program such as Cancer Data Sciences. Michor notes that this program is one of the most collaborative programs of DF/HCC, both between programs and across institutions.

-By Emma Hitt Nichols, PhD


  1. Illumina.com.  Illumina Introduces the NovaSeq Series—a New Architecture Designed to Usher in the $100 Genome. https://www.illumina.com/company/news-center/press-releases/press-release-details.html?newsid=2236383. Accessed March 20, 2017.
  2. Huang YT, Lin X, Liu Y, Chirieac LR, et al. Cigarette smoking increases copy number alterations in nonsmall-cell lung cancer. Proc Natl Acad Sci U S A. 2011;108(39):16345-16350.