NCI DI-Cubed Data Integration – CENTER FOR RESEARCH INFORMATICS

The Center for Research Informatics (CRI) has partnered with Leidos Biomedical Research, Inc. to contribute to a data integration project sponsored by the National Cancer Institute (NCI). This pilot project draws on our expertise in data harmonization to develop a process for integrating medical imaging resources into the International Neuroblastoma Risk Group (INRG) Data Commons.

This project is part of a larger NCI initiative, the Data Integration and Imaging Informatics (DI-cubed) Project. The DI-cubed Project is an effort to convert data from various clinical studies into a standardized format, and to demonstrate how this standardization allows data from multiple studies to be combined, producing larger and more useful cohorts and making it easier to share data between institutions. In addition to standardizing data from different sources, the project will bring together data from different domains, such as clinical data, imaging data, and genomic data.
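As a rough illustration of why a standardized format makes studies combinable, the sketch below renames each study's fields to a shared dictionary and normalizes one coded value before pooling the records. The study layouts, field names, and code list are hypothetical and are not taken from the DI-cubed Project or the INRG data dictionary.

```python
# Hypothetical per-study field maps: source column name -> shared dictionary name.
STUDY_A_MAP = {"pt_id": "patient_id", "gender": "sex", "age_mo": "age_months"}
STUDY_B_MAP = {"subject": "patient_id", "sex": "sex", "age_months": "age_months"}

# Hypothetical value normalization for one coded element.
SEX_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def harmonize(record, field_map):
    """Rename a record's fields to the shared dictionary and normalize codes."""
    out = {}
    for source_field, value in record.items():
        target = field_map.get(source_field)
        if target is None:
            continue  # field is not part of the shared dictionary; drop it
        if target == "sex":
            value = SEX_CODES[str(value).strip().lower()]
        out[target] = value
    return out


def combine(studies):
    """Pool records from several (records, field_map) pairs into one cohort."""
    combined = []
    for records, field_map in studies:
        combined.extend(harmonize(record, field_map) for record in records)
    return combined
```

Once every record uses the same field names and codes, the pooled list can be queried as a single cohort regardless of which study each patient came from.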

The CRI’s contribution to the DI-cubed Project leverages our work on the INRG Data Commons, an international cancer research resource that is now the largest collection of neuroblastoma patient information in the world. Due to the relative rarity of pediatric cancers, access to sufficiently large patient cohorts is a particular challenge for researchers. The INRG Data Commons includes more than 20,000 patients and more than 30 data elements, which have been collected from clinical trials and harmonized using a standard data dictionary. With only about 800 new cases of neuroblastoma diagnosed per year in the United States, access to such a large and rich collection of data about this cancer is invaluable for research, and the INRG Data Commons has contributed to more than a dozen high-impact publications in this disease area. In addition to developing the technical aspects of the Data Commons, the CRI continues to establish groundbreaking data governance practices and data sharing agreements.

Our contribution to the DI-cubed Project will demonstrate the feasibility of linking image data to clinical data in a commons environment and serving this information to researchers in real time, with the INRG Data Commons serving as the paradigm system. For the pilot study, we will extend the functionality of the INRG Data Commons by adding a category of radiology images known as MIBG scans. MIBG scans are among the most important imaging modalities for neuroblastoma patients, and easy access to these scans would enhance the usefulness of the INRG Data Commons for researchers by allowing them to perform image analysis and other studies that combine the images with clinical information. Researchers can already access clinical and phenotypic data, genomic data from TARGET, and biospecimen availability information from the INRG Data Commons. When this pilot is complete, they will also have access to MIBG imaging data for the same patients in two ways. First, they will be able to see from within the cohort discovery tool whether patients have imaging data available, allowing them to use availability of scans as one of their criteria for cohort creation. Second, if they request a data set that requires MIBG images, a mechanism will exist for the scans to be extracted and delivered with the clinical data (although this delivery is beyond the scope of the initial pilot study).

To bring this to fruition, the CRI technical team has partnered with the Children’s Oncology Group to obtain de-identified MIBG images and associated reports, which we are storing locally within our secure, HIPAA-compliant infrastructure. Storing the images locally will ensure that they will be available for fast viewing and download alongside the corresponding clinical data. Our team will then link the images to patients currently in the INRG database, which will require developing and incorporating Observational Medical Outcomes Partnership (OMOP) standards for image storage and access. Digital Imaging and Communications in Medicine (DICOM) data elements are standard for images, and these will be mapped to corresponding OMOP elements wherever possible. Where necessary, the data model will be extended to incorporate new elements.
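The mapping step described above might look something like the following minimal sketch. The DICOM attribute keywords (PatientID, StudyDate, Modality, StudyInstanceUID) are standard, but the OMOP-style target names are illustrative assumptions, not the data model the project adopted. Attributes with no mapping are returned separately, since those are the candidates for extending the model.

```python
from datetime import date, datetime

# Hypothetical mapping from DICOM attribute keywords to OMOP-style column names.
# The target names are placeholders chosen for illustration only.
DICOM_TO_OMOP = {
    "PatientID": "person_source_value",
    "StudyDate": "image_occurrence_date",
    "Modality": "modality_source_value",
    "StudyInstanceUID": "image_study_uid",
}


def map_dicom_header(header):
    """Split a DICOM header dict into mapped fields and unmapped extras."""
    mapped, unmapped = {}, {}
    for keyword, value in header.items():
        target = DICOM_TO_OMOP.get(keyword)
        if target is None:
            # No corresponding element yet: a candidate for a model extension.
            unmapped[keyword] = value
        elif keyword == "StudyDate":
            # DICOM dates (value representation DA) are written as YYYYMMDD.
            mapped[target] = datetime.strptime(value, "%Y%m%d").date()
        else:
            mapped[target] = value
    return mapped, unmapped
```

For example, a header containing a `SliceThickness` attribute would come back with that attribute in the unmapped dictionary, flagging it for review rather than silently discarding it.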

Once the data is standardized and the images are linked to patients, image availability will be added to the cohort discovery tool, and a mechanism to deliver images and reports as part of the data fulfillment process will be created. Further, CRI scientific software engineers will develop natural language processing (NLP) pipelines to extract meaning from the image reports, and this information will be added to the Data Commons as well.
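To make the cohort discovery step concrete, here is a minimal sketch of how image availability could act as one more filter criterion once images are linked to patients. The `Patient` fields, the row layout, and the function names are hypothetical; the actual cohort discovery tool is not described at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: str
    age_at_diagnosis_days: int
    has_mibg_scan: bool


def flag_availability(clinical_rows, patients_with_images):
    """Attach an availability flag based on which patient IDs have linked images."""
    return [
        Patient(
            patient_id=row["patient_id"],
            age_at_diagnosis_days=row["age_days"],
            has_mibg_scan=row["patient_id"] in patients_with_images,
        )
        for row in clinical_rows
    ]


def build_cohort(patients, max_age_days=None, require_mibg=False):
    """Return patients that satisfy the clinical and imaging criteria."""
    cohort = []
    for patient in patients:
        if max_age_days is not None and patient.age_at_diagnosis_days > max_age_days:
            continue
        if require_mibg and not patient.has_mibg_scan:
            continue
        cohort.append(patient)
    return cohort
```

A researcher who needs scans for image analysis would set `require_mibg=True`, while one interested only in clinical variables would leave it off and see the full cohort.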

Our partner on this project, Leidos Biomedical, is the operations and technical support contractor for the federally funded Frederick National Laboratory for Cancer Research, the only national laboratory exclusively focused on biomedical research. Their work includes the management and execution of projects to support drug discovery and development for the National Cancer Institute and the National Institute of Allergy and Infectious Diseases. They have chosen to contract with the CRI for this important project because of our leadership and past successes in data standardization efforts and our development of the INRG Data Commons.

This project is a natural fit for our team. At the CRI, data harmonization is one of our deepest areas of experience and expertise. Our work in this area began with the Clinical Research Data Warehouse (CRDW), a repository of clinical data that brings together a broad range of internal and external data sources, including Epic electronic medical records (EMR), the Centricity billing system, the Cancer Registry, the National Death Registry, LabVantage, and REDCap. The consistent standards and procedures for data harmonization developed by our CRDW team mean that these disparate data are integrated and research-ready for our users.

Beyond the University of Chicago, our commitment to standardized data models and good data governance practices has also made possible some of our most fruitful and impactful collaborations. Our participation in data-sharing initiatives with other institutions, such as CAPriCORN and the NIH-funded All of Us Research Program, requires expertise in data standardization and use of shared data models. To this end, we have built data transformation pipelines to populate and maintain instances of OMOP and Informatics for Integrating Biology & the Bedside (i2b2). We have aligned ourselves with other large academic medical centers in the development of multi-institutional collaborative health care data warehouses, and are deeply committed to shifting the paradigm in healthcare data management toward portability and interoperability.

Recently, our work in this area has enabled us to extend the impact of the INRG Data Commons by playing a key role in the development of the Pediatric Cancer Data Commons (PCDC). The PCDC initiative is leveraging the successes of the INRG Data Commons to develop similar international resources for rhabdomyosarcoma, pediatric acute leukemia, and other pediatric cancers. The standards, processes, and algorithms that are developed for the PCDC will be made available to the international research community, and the use of common data elements will make it possible to compare results across studies. In the future, standardized data elements can be used prospectively as new trials are developed, greatly enhancing the value of the data collected. This effort, headquartered at the CRI and working in collaboration with research institutions and consortiums throughout the United States and Europe, is building the future of how pediatric cancer research can be conducted.

Ultimately, this project will have multiple benefits for the cancer research community, as it will both serve the goals of the DI-cubed Project and enable us to make the PCDC a more robust and useful resource. This proof-of-concept study will demonstrate linking images and clinical data for at least 500 patients, but there are likely to be many more patients with MIBG images that can be linked. Following a successful pilot, we hope to source and include other forms of radiology images as well, such as CT scans, MRI scans, and plain films. We also hope to include scans from histology slides alongside a tool for visualizing these scans within a web interface.
