What is this?

The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) includes a repository system and website designed to make the raw data of protein crystallography more widely available. Our focus is on identifying, cataloging and providing the metadata for each dataset, which can be used to reprocess the original diffraction data. The intent behind this project is to make the resulting three-dimensional structures more reproducible and easier to modify and improve as processing methods advance.

The website at proteindiffraction.org provides basic browsing and search functionality. It tracks data gathered by various projects and laboratories, including the Center for Structural Genomics of Infectious Diseases and the Seattle Structural Genomics Center for Infectious Disease, among others. The final planned service will be open to any data submissions related to structures deposited with the Protein Data Bank, and will allow more comprehensive search and exploration of the diffraction images.

Why do we need to store all this data?

A typical set of diffraction images used to create a protein structure may be around 10 GB in size, which means that storing the data for all current PDB deposits would require disk space on the order of a petabyte. While this is a lot of space, with today's technology it is neither impractical nor particularly demanding. A more challenging task is to organize this data and provide easy ways to download it and to search the semantic metadata of datasets and their related structures. If this can be achieved, there are multiple benefits for research. For example:

  • When the input data needed to reprocess a protein structure is available, the structure can be refined and improved as technology for processing diffraction images advances.
  • Reproducing published results is the only way to reliably detect errors, and potentially fraud, in existing structures.
  • If good-quality datasets are gathered before the resulting structures are published, we can prevent the loss of diffraction data collected by structural genomics and other programs that close before being completed.
  • Availability of large amounts of raw data enables new types of analyses that are not possible with single datasets, such as studies of diffuse diffraction effects. It would also be the starting point for new diffraction analysis algorithms and hardware that require large training sets or reference data.
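
The petabyte estimate given above can be checked with a short back-of-the-envelope calculation. The sketch below is only illustrative: the figure of 10 GB per dataset comes from the text, while the count of 100,000 deposits is an assumed order of magnitude, not an official PDB figure, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the storage estimate.
GB_PER_DATASET = 10          # from the text: a typical dataset is around 10 GB
ASSUMED_DEPOSITS = 100_000   # assumption: order-of-magnitude count, not an official PDB figure


def total_storage_pb(n_datasets, gb_per_dataset=GB_PER_DATASET):
    """Return total storage in petabytes, using decimal units (1 PB = 1,000,000 GB)."""
    return n_datasets * gb_per_dataset / 1_000_000


estimate = total_storage_pb(ASSUMED_DEPOSITS)
print(f"{estimate:.1f} PB")  # prints "1.0 PB"
```

The calculation assumes one dataset per deposit. Structures solved from several datasets, or from unusually large ones, would raise the total, but the result stays on the order of a petabyte.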

Usage Policies

Data downloaded from IRRMC may be freely used under the Creative Commons CC0 license (Public Domain Dedication Waiver). IRRMC strongly urges users who download data to credit the source by citing the dataset's DOI in any publications or derived data that make use of the downloaded data.