The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) includes a repository system and website designed to make the raw data of protein crystallography more widely available. Our focus is on identifying, cataloging, and providing dataset metadata that could be used to reprocess the original diffraction data. The intent behind this project is to make the resulting three-dimensional structures more reproducible and easier to modify and improve as processing methods advance.
The website at proteindiffraction.org provides basic browsing and search functionality. It tracks data gathered by various projects and laboratories, including the Center for Structural Genomics of Infectious Diseases and the Seattle Structural Genomics Center for Infectious Disease, among others. The final planned service will be open to any data submissions related to structures deposited with the Protein Data Bank (PDB), and will allow more comprehensive search and exploration of the diffraction images.
A typical set of diffraction images used to create a protein structure may be around 10 GB in size, which means that storing the data for all current PDB deposits would require disk space on the order of a petabyte. While this is a lot of space, with today's technology it is neither impractical nor particularly demanding. A more challenging task is to organize this data, provide easy ways to download it, and make the semantic metadata of datasets and their related structures searchable. If this can be achieved, there are multiple benefits for research.
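The petabyte figure for storage quoted above can be checked with a short back-of-envelope calculation. The sketch below is illustrative only: the count of 100,000 deposits is an assumed round number chosen to match the order of magnitude, not a figure taken from the PDB, and the function and constant names are hypothetical. It also assumes one dataset per deposit and decimal units (1 PB = 1,000,000 GB).

```python
# Back-of-envelope storage estimate for archiving raw diffraction images.
# All names here are illustrative; ASSUMED_DEPOSITS is an assumption, not a PDB statistic.

GB_PER_DATASET = 10          # typical dataset size quoted in the text
ASSUMED_DEPOSITS = 100_000   # assumed round, order-of-magnitude count of PDB deposits
GB_PER_PB = 1_000_000        # decimal units: 1 PB = 10^6 GB


def estimate_petabytes(n_deposits: int, gb_per_dataset: float = GB_PER_DATASET) -> float:
    """Return total storage in petabytes, assuming one dataset per deposit."""
    return n_deposits * gb_per_dataset / GB_PER_PB


if __name__ == "__main__":
    total_pb = estimate_petabytes(ASSUMED_DEPOSITS)
    print(f"{ASSUMED_DEPOSITS:,} deposits x {GB_PER_DATASET} GB = {total_pb:.1f} PB")
```

Under these assumptions the total comes to about one petabyte. Doubling the deposit count or the typical dataset size only doubles the result, so the estimate stays within the same order of magnitude.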
Data downloaded from IRRMC may be freely used under the Creative Commons license CC0 (Public Domain Dedication Waiver). IRRMC strongly urges users who download data to credit the source by citing its DOI in any publications or derived data that make use of the downloaded data.