Tale

U-Store-It

The SBGrid Data Bank provides an affordable and sustainable way to preserve and share structural biology data

Published March 28, 2016

Evidence of the Higgs boson appears as a bump on a histogram resulting from the analysis of data from millions of detectors at the Large Hadron Collider. What if all that raw data vanished, leaving nothing but the histogram? The physics community would reel.

Yet in structural biology, raw data frequently goes missing. Scientists dutifully store models of proteins in the Protein Data Bank, but the X-ray diffraction data used to derive those macromolecular structures isn’t easily accessible, if at all. One reason is that, until recently, there was no clear place to put it.

Now, however, structural biologists can publish their data in the SBGrid Data Bank (SBDB; data.sbgrid.org), a system designed to preserve and disseminate primary experimental datasets that support scientific publications. More than a repository, the system is a databank that is integrated into the SBGrid software distribution so that datasets are immediately available to anyone in the community. The SBDB is currently a pilot grid, but it will mature into a sustainable system over the coming three years with support from the Helmsley Trust Biomedical Research Infrastructure. This project is a collaboration between SBGrid, Dataverse, and Globus, and the National Data Service has selected it as a pilot activity.

“Diffraction datasets are often misplaced,” says SBGrid Founder and Director Piotr Sliz. “Those datasets are captured at state-of-the-art synchrotron facilities and are the ultimate outcome of long-term biological experiments. Publication of this primary data will support reproducibility, model validation, training and methods development.”

The Making of a Databank

The de facto approach to data management for structural biology is, with a few exceptions, every lab for itself. With terabyte drives selling for about $50 on Amazon, storage is cheap, so why not? The fact is, the simplicity stops there. Beyond storage, data needs to be archived, curated, and validated. To be valuable to the wider scientific community, it also needs to be shared. “The cost for storing data is small compared to the cost of validating it to make sure it is all OK,” says methods developer Tom Terwilliger, a structural biologist at Los Alamos National Laboratory.

The Protein Data Bank is the go-to repository for storing models, so it might also seem a natural place to store raw data. But the storage of diffraction data presents distinct challenges. “You have to worry about volume and distribution and also about making the data available to the community in a useful form,” says Sliz.

In 2015, in response to input from the structural biology community, SBGrid began piloting the SBDB to address these challenges. SBGrid has a long history of data management that stretches back to 2002, when it started managing diffraction data for structural biology laboratories at Harvard and affiliated institutions. In 2012, the Consortium also prototyped a system to move diffraction data sets between Harvard and two synchrotrons. The work created a foundation of experience with data annotation, replication, and preservation at SBGrid. “We’ve been worrying about data for quite a while,” says Sliz.

Organized display of data collections at SBDB. (a) Graphical view of Laboratory and Institutional Collections within the SBDB; (b) PV structure viewer, displaying a published model with links to its two primary deposited datasets.

The team completed the SBDB pilot towards the end of 2015 and described the results in a paper published in Nature Communications (Nat Commun. 2016; 7:10882). The pilot SBDB is already storing over 170 datasets from 55 labs, with data collected at 11 synchrotron facilities. While most of the contents are X-ray diffraction datasets, SBDB also holds MicroED, lattice light-sheet microscopy, and molecular dynamics datasets. The primary storage for the grid is provided by a 100-terabyte system that is replicated to other institutions, with projections that the current system could store approximately 20,000 datasets.
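As a rough check on those storage figures, the short Python sketch below divides the stated capacity by the projected number of datasets. It assumes decimal terabytes and ignores replication and filesystem overhead, neither of which the article specifies; the function name is illustrative only.

```python
TB = 10**12  # bytes in a decimal terabyte (assumption: TB, not TiB)
GB = 10**9   # bytes in a decimal gigabyte


def average_dataset_size_gb(capacity_tb: float, n_datasets: int) -> float:
    """Return the average space available per dataset, in gigabytes."""
    if n_datasets <= 0:
        raise ValueError("n_datasets must be positive")
    return capacity_tb * TB / n_datasets / GB


if __name__ == "__main__":
    # Figures quoted above: 100 TB of primary storage, about 20,000 datasets.
    print(average_dataset_size_gb(100, 20_000))
```

Under these assumptions the projection works out to roughly 5 GB per dataset on average.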

In this pilot data bank, data can be published quickly and accessed by SBGrid members as if it were local. Demonstrations of this capability are up and running at Yale University and Harvard University. “The datasets are part of the SBGrid computing environment,” says Sliz. “In contrast to a website repository, SBDB provides direct access to datasets from within computing environments.”

Sustainable Bytes

In the next phase of development, SBGrid is working to move the pilot implementation to an SBDB platform that can be managed, maintained and upgraded without incurring crippling costs. “As a community, we need to worry about the costs of supporting research infrastructures, and aim to preserve precious funding resources for experimental research,” says Sliz.

To accomplish this goal, SBGrid partnered with Mercè Crosas, the Director of Data Science at the Institute for Quantitative Social Science at Harvard University. Crosas and her team have over a decade of experience developing the Dataverse project, a fully featured repository for sharing large datasets in the social sciences. “The collaboration is good because it brings together two strengths: subject expertise about structural biology data from SBGrid, and from our perspective, expertise in data management,” says Crosas.

Mercè Crosas, Chief Data Science and Technology Officer at the Institute for Quantitative Social Science (IQSS) at Harvard University and co-PI for the Dataverse project for data sharing and archiving. Learn more about the Dataverse project at dataverse.org.

The Dataverse framework is an open-source solution backed by a community of developers who support and expand its features. It follows best practices for data sharing and publishing to ensure that data is discoverable and can be cited both to and from journal articles and, in the case of structural biology, the PDB. “We need to make sure that long-term access to the data is sustainable over time,” says Crosas.

In addition to sustainability, the SBDB will support a publication workflow that makes it possible for scientists to self-publish their data in the SBDB when they submit a paper to a journal and a model to the PDB. All three will be linked together through citations.

The planned workflow streamlines the data publication process by removing the initial gate of data validation prior to submission. Instead, SBGrid is developing tools for post-deposition analysis and hopes to engage members of the community in the process.

Currently, SBDB automatically processes all X-ray datasets using XIA2, an expert data processing system. Dataset pages present the results. Some challenging datasets, however, might require manual processing and unique expertise for validation. “We want the community to be able to access all primary data as soon as possible and in parallel drive automatic evaluation to eventually catch up with cases that require more expertise,” says Sliz.

Currently, 84 percent of datasets in the pilot SBDB can be reprocessed to yield statistics similar to those reported in the corresponding publications. “That’s encouraging,” says Sliz.
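One way to picture such a comparison is a tolerance check between the statistics reported in a publication and those obtained from automated reprocessing. The Python sketch below is illustrative only: the choice of statistics, the tolerance values, and the sample numbers are all assumptions, not the criteria SBGrid uses.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class MergingStats:
    """A few merging statistics; the choice of fields is an assumption."""
    resolution: float    # high-resolution limit, in angstroms
    completeness: float  # percent
    r_merge: float       # fraction, e.g. 0.08


def is_similar(reported: MergingStats, reprocessed: MergingStats,
               res_tol: float = 0.2, comp_tol: float = 5.0,
               rmerge_tol: float = 0.05) -> bool:
    """True if every statistic agrees within its (hypothetical) tolerance."""
    return (
        abs(reported.resolution - reprocessed.resolution) <= res_tol
        and abs(reported.completeness - reprocessed.completeness) <= comp_tol
        and abs(reported.r_merge - reprocessed.r_merge) <= rmerge_tol
    )


def fraction_reproduced(
    pairs: Iterable[Tuple[MergingStats, MergingStats]]
) -> float:
    """Fraction of (reported, reprocessed) pairs judged similar."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no datasets to compare")
    return sum(is_similar(rep, new) for rep, new in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up example values, not taken from any real deposition.
    examples = [
        (MergingStats(2.0, 99.0, 0.08), MergingStats(2.1, 98.5, 0.09)),
        (MergingStats(1.8, 97.0, 0.06), MergingStats(2.6, 80.0, 0.20)),
    ]
    print(f"{fraction_reproduced(examples):.0%} of datasets reproduced")
```

Datasets that fall outside the tolerances would be the natural candidates for the manual processing and expert validation described above.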

This phase of the SBDB project, which is supported by funding from the Helmsley Charitable Trust, will take about three years.

Ultimately, what the SBDB is providing the community is not storage or data management, but a platform for doing science as it was meant to be done. “The SBDB makes it possible to re-determine any structure from the beginning with ever-improving tools, leading to ever-improving structures in the PDB,” says Terwilliger.

As SBDB becomes established, the SBGrid team is already thinking about expansion beyond X-ray diffraction data. Indeed, the data collection already includes MicroED, lattice light-sheet microscopy, and molecular dynamics datasets. “This platform is applicable to other data-heavy fields of biomedical research,” says Sliz. “The same mechanism could be used to distribute other large datasets across the scientific community.”

-- Elizabeth Dougherty

