An experiment

Over the last several months, as we've worked with a large collaborative network of volunteers to save government information, we've learned about the value, vulnerability, and variety of government data. As we've described elsewhere, we're now planning a meeting where data advocates, librarians, federal data producers, and researchers will gather to begin thinking through new approaches to safeguarding copies of federal data.

In order to further inform that conversation, we've begun an experiment with a few brave volunteer librarians to see what we can learn from one model of saving data.

This experiment is not designed to offer a solution to the problem of backing up federal information, but is instead an attempt to further understand the problem.

This experiment is based on our current understanding of the landscape of federal information. When we talk about federal data, we are, in fact, talking about data stored on plain HTML webpages, in visualized and embedded content, on FTP servers, in databases available only through query interfaces, and as files that conform more closely to what we might normally think of as research datasets.

In order to provide for long-term backup and re-use of data, one goal might be to turn the data from webpages and query interfaces into datasets that include appropriate metadata and enough contextual information for future researchers to make use of them. In some cases, making a dataset will be as simple as downloading the relevant files from data.gov.

This looks like a full dataset from data.gov
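In that simplest case, the download itself can be scripted. Here's a minimal sketch using data.gov's catalog API (data.gov's catalog runs on CKAN, whose package_show call returns a dataset's metadata and its list of files); the dataset identifier and output directory here are placeholders, not a real dataset:

```python
import json
import pathlib
import urllib.request

# The dataset id below is a placeholder -- substitute a real
# identifier from catalog.data.gov.
DATASET_ID = "example-agency-air-quality-2016"
API_URL = "https://catalog.data.gov/api/3/action/package_show?id=" + DATASET_ID

with urllib.request.urlopen(API_URL) as response:
    package = json.load(response)["result"]

outdir = pathlib.Path("backup") / DATASET_ID
outdir.mkdir(parents=True, exist_ok=True)

# Keep the catalog metadata next to the files, for future context.
(outdir / "metadata.json").write_text(json.dumps(package, indent=2))

# Download each file ("resource") listed in the dataset.
for resource in package["resources"]:
    url = resource.get("url")
    if not url:
        continue
    filename = url.rsplit("/", 1)[-1] or resource["id"]
    urllib.request.urlretrieve(url, outdir / filename)
```

Note that the sketch saves the catalog metadata alongside the files: a copy without its metadata is much harder to re-use.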

Other sorts of data might require some compiling, where context from webpages needs to be combined with data scraped from within a query interface to create a dataset that can be usefully backed up and stored.
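What that compiling looks like in practice will vary from system to system, but here is a rough sketch of the pattern, with an entirely hypothetical query endpoint and context page standing in for a real agency tool:

```python
import json
import pathlib
import urllib.request
from datetime import date

# Both URLs are hypothetical stand-ins for a real agency query tool
# and the webpage that documents it.
QUERY_URL = "https://example.gov/tool/export?state=PA&year=2016&format=csv"
CONTEXT_URL = "https://example.gov/tool/about.html"

outdir = pathlib.Path("backup/example-query-dataset")
outdir.mkdir(parents=True, exist_ok=True)

# 1. Pull the data itself out of the query interface.
with urllib.request.urlopen(QUERY_URL) as response:
    (outdir / "pa-2016.csv").write_bytes(response.read())

# 2. Capture the webpage context that makes the numbers interpretable.
with urllib.request.urlopen(CONTEXT_URL) as response:
    (outdir / "about.html").write_bytes(response.read())

# 3. Record what was gathered, from where, and when.
manifest = {
    "retrieved": date.today().isoformat(),
    "source_query": QUERY_URL,
    "source_context": CONTEXT_URL,
    "files": ["pa-2016.csv", "about.html"],
}
(outdir / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

The manifest is the beginning of the descriptive work discussed below: a record of where each piece came from and when it was captured.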

The experiment

Here is what we've asked of these libraries:

  1. Identify a designated community for whom you are saving data.
    1. Pick a subset of data that would be useful to that community and that you'll try to save. For the purpose of this experiment, we are focused on data rather than webpages.
  2. In order to back up data that will be meaningful to people in the future, you'll need to decide how to chunk the data you're saving into pieces, and how to describe those pieces. Where does one dataset end and another begin? How much information do you need about the data you're backing up to make your copy re-usable and citable in the future? What files, webpages, or additional material complement the dataset you've identified? These are the key questions you'll address as you create a model of the data you'll be saving so that they can be re-used.
  3. Gather the necessary files you've identified so that the federal data are effectively "backed up" in your system, and made available to the public.
    1. For libraries with data repositories, we hope they'll make the data available through those repositories; for libraries without them, we can offer space in the datarefuge storage for their data.
  4. For each dataset you've backed up, create a data.json file to share with the datarefuge project for inclusion in our instance of CKAN (that is, in the datarefuge.org catalog). Make whatever changes to the standard format are necessary to point both to the original and to your copy; a sketch of what such a file might contain follows this list.
  5. Look for ways to include public involvement, advocacy, and education in this process. Are there ways that some of this work could be done by volunteers or citizen scientists?
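To make step 4 more concrete, here is a sketch of building such a data.json record. The field names follow the Project Open Data metadata schema that data.json files are generally built on; the agency, dataset, and repository URLs are all hypothetical. The change to the standard format shows up in the distribution list, which points to both the original and the backup copy:

```python
import json

# Hypothetical example: a dataset originally published by an agency,
# with a backup copy now held in a library repository.
record = {
    "title": "Example Air Quality Measurements, 2016",
    "description": "Annual air quality measurements; backed-up copy "
                   "of the original agency release.",
    "keyword": ["air quality", "environment"],
    "modified": "2016-12-31",
    "publisher": {"@type": "org:Organization", "name": "Example Agency"},
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Data Refuge Project",
        "hasEmail": "mailto:contact@example.edu",
    },
    "identifier": "example-air-quality-2016",
    "accessLevel": "public",
    # Point to both the original and the backup copy.
    "distribution": [
        {
            "@type": "dcat:Distribution",
            "title": "Original (agency)",
            "downloadURL": "https://example.gov/data/air-quality-2016.csv",
            "mediaType": "text/csv",
        },
        {
            "@type": "dcat:Distribution",
            "title": "Backup copy (library repository)",
            "downloadURL": "https://repository.example.edu/air-quality-2016.csv",
            "mediaType": "text/csv",
        },
    ],
}

with open("data.json", "w") as f:
    json.dump(record, f, indent=2)
```

A file like this can then be harvested into the datarefuge.org catalog alongside records from other participating libraries.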

We hope that, through this experiment, we'll learn a few things:

  • What are the challenges of turning government websites into datasets for backup and re-use?
  • How might these processes fit into library workflows?

There will, of course, remain a number of open questions that will need to be addressed through continued collaboration and experimentation. These include:

  • Can we find ways to share our work with the data producers at agencies so that future re-use and discovery are enhanced?
  • How can we ensure that our system comprehensively addresses the data systems of the federal government?
  • How frequently should data be backed up, and what commitment will institutions make if they attempt this system?
  • What kinds of funding models and storage architectures will work best?
  • What kind of advocacy work will be most successful in continuing support for these efforts?