The National Cancer Institute (NCI) Genomic Data Commons (GDC) is an innovative data repository and analysis system that will contribute to cancer research by allowing large amounts of cancer data to be imported, standardized, harmonized, and analyzed with state-of-the-art methods.

Developed over two years by Chief Research Informatics Officer Bob Grossman, PhD, and his team at the Center for Data Intensive Science (CDIS), with contributions from the CRI Bioinformatics Core, the GDC brings together genetic and clinical cancer data from multiple sources on a unified platform. At the time of its launch, it already contained approximately 4.1 petabytes (that’s 4.1 million gigabytes) of data from NCI-supported programs, including several of the world’s largest cancer genomics databases such as The Cancer Genome Atlas (TCGA) and TARGET, its pediatric equivalent.

The GDC was publicly launched at the University of Chicago on June 6, 2016. The occasion was marked by a visit from U.S. Vice President Joe Biden, who announced the project as part of the National Cancer Moonshot Initiative. By democratizing access to data and analysis tools and making it easier for researchers to collaborate, the GDC will contribute significantly to the Moonshot’s goals of improving the prevention and diagnosis of cancer and making more treatments available to more patients.

The GDC seeks to overcome several barriers that have hindered cancer researchers from making discoveries based on genomic data. First, this type of research requires large sample sizes, so the sheer volume of data can be a technological hurdle, demanding advanced computer hardware and long download times. In addition, these data were formerly available only piecemeal: different research groups had access to different datasets and used various analysis methods whose results were not easy to unify or compare between studies. The GDC addresses these problems in three ways: it brings together and standardizes many datasets, provides a secure analysis environment so researchers don’t need to download data to their own hard drives, and offers a unified set of analysis pipelines so data can be directly compared. In doing so, it acts as a centralized knowledge network that will continue to grow as new datasets are submitted and standardized.

Reflecting the collaborative work between informatics groups at the University, the CDIS team tapped the CRI’s bioinformatics expertise as they built the GDC. CRI bioinformatician Tzuni Garcia, PhD, was part of the early development process, evaluating several types of database technology under consideration for the project. He also helped analyze incoming metadata sets and wrote specialized software used in the process of determining how to import and unify data from different sources. CRI bioinformatician Kyle Hernandez, PhD, has been involved in later stages of the project. He assisted with the development of annotation, filtering, and formatting pipelines for somatic mutation datasets, including creating custom databases and filtration tools and converting data formats. Hernandez is now analyzing and summarizing TCGA somatic mutation data for a peer-reviewed publication.

The CRI is committed to advancing the field of precision medicine (also called personalized medicine), in which treatments are matched to patients based on individual characteristics like the genetic profile of a patient or of their tumor. From preliminary research to clinical care, we’ve contributed to several projects of this kind, including 1200 Patients, the GAIN study, and the Department of Pathology’s Clinical Genomics Laboratory. The GDC promises to be a robust and valuable tool for precision medicine, as well as an important asset in the fight against cancer.