Data Refuge: Update on Rescued Data

The Data Refuge Stories project has really taken off and is doing great work (check out the revamped website!). But a question we're frequently asked is: what happened to the data collected during Data Rescue events?

From January to May 2017, over 400 datasets from 33 agencies were collected at about 50 data rescue events. The workflow* for harvesting and preparing data for datarefuge.org was developed mostly by Delphine Khanna and Rachel Appel from Temple University Libraries, Laurie Allen from UPenn Libraries, and Justin Schell from University of Michigan Libraries. Many others at DataRescue Philly also contributed. Trust was a central concern for us in archiving these data. While our workflow isn't the best method, it includes multiple validation points and a documented chain of custody.

The workflow worked like this: URLs identified as having data that wouldn't be picked up by the End of Term Harvest web crawlers were added to an app developed by a member of the Environmental Data Governance Initiative (EDGI), with which we worked closely in supporting data rescue events. In the Research phase, data rescuers would select URLs and document as much information as they could about the data. Each URL would then move to the Harvest phase, when rescuers would find ways to extract the data. Next, people with some familiarity with scientific or government data would Check the harvested data against the original and sign off on what was there. Finally, datasets would be "Bagged" into BagIt packages (zipped directories with checksum manifests), added to the datarefuge.org repository, and given some metadata.
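For the curious, here's roughly what that final "Bagging" step looks like in code. This is a minimal sketch using the Library of Congress's Python bagit library, not our actual scripts; the directory name and metadata values are made up for illustration.

```python
# Minimal sketch of the "Bag" step, using the Library of Congress's
# bagit library (pip install bagit). The directory and metadata values
# below are illustrative, not taken from the actual Data Refuge scripts.
import bagit

# make_bag() restructures the directory in place, adding checksum
# manifests and a bag-info.txt containing the metadata we pass in.
bag = bagit.make_bag(
    "harvested/example-noaa-dataset",  # hypothetical harvest directory
    {
        "Source-Organization": "Data Refuge",
        "External-Description": "Dataset harvested at a data rescue event",
    },
)

# Anyone downstream can verify the chain of custody: validate() raises
# bagit.BagValidationError if any file is missing or has changed.
bag.validate()
```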

Since this workflow involved putting everything into zipped bags, it's hard to know what we actually have or how useful it is. To find out, and to make whatever is there more useful, we hired some student workers to unzip the data. Yixin Wu, one of our first student workers on this project, unpacked about 60 EPA datasets. Francisco Saldana and Liu Liu are working this summer on NASA and NOAA datasets. This work simply would not get done without these great students.

The process for unpacking these data is deeply tedious. First, the students check the original link in the record to verify the metadata. After fixing any errors, they download the zip file and extract its contents. They inspect the files to make sure they are what they should be and are in usable formats. Then they upload the individual files back into the record.
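To give a flavor of the mechanical part of that step, here is a rough sketch in Python. The paths and the list of "usable" formats are hypothetical, and the real inspection of the extracted files is done by hand.

```python
# Rough sketch of the download-and-extract step. Paths and the set of
# "usable" extensions are hypothetical; a human does the real inspection.
import zipfile
from pathlib import Path

zip_path = Path("downloads/example-dataset.zip")  # hypothetical download
out_dir = Path("unpacked/example-dataset")

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(out_dir)

# Flag anything in a format a typical user couldn't easily open.
usable = {".csv", ".txt", ".json", ".xml", ".pdf", ".png", ".nc"}
for path in sorted(out_dir.rglob("*")):
    if path.is_file() and path.suffix.lower() not in usable:
        print(f"inspect by hand: {path}")
```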

Next, they gather more contextual information to help users of the data. They look up the original URL in the Internet Archive and add a link to the archived version of the site where the data originated to the record. They also go into the app used by most data rescue events and gather information about how the data were copied. That information might include the methods used to capture the data, any problems the participant encountered, or anything else that might help someone using the data from datarefuge.org understand the copied data. That information is entered into a PDF that is also added to the item record.
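The Internet Archive lookup can be done through the Wayback Machine's public availability API. A small sketch, with a made-up original URL:

```python
# Small sketch of the Internet Archive lookup, via the Wayback Machine's
# public availability API. The original URL here is made up.
import requests

original_url = "https://www.example.gov/some/dataset/page"  # hypothetical

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": original_url},
    timeout=30,
)
resp.raise_for_status()

closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print("archived version:", closest["url"])  # link added to the record
else:
    print("no archived snapshot found for", original_url)
```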

Here's an example of a completed, unpacked dataset from NOAA - take a look at those PNG files!

[Image: 2yr_agincourtreef_gbr_20142015.png]

So far, our student workers have unpacked about 100 datasets. They've uncovered just a few duplicate datasets, datasets that were too messy to be usable, and datasets that were no longer available at their original location. We do not have metrics on how many, if any, of these datasets are being used. We only know about data that goes missing from government websites if someone reports it to us or the media (though EDGI is doing some great work monitoring changes to government websites). Most of what we've heard has gone missing was actually moved to an archival site, through the work of the National Archives and Records Administration or by some other means. Other missing data were retained by the Internet Archive. There are, of course, a few things we know have been taken down or are slated for removal that are unfortunately not in our repository and, as far as we know, were not rescued by any other group.

A couple of universities have approached us about mirroring the rescued data at datarefuge.org. This is still in progress, and we're working through the logistics, but these will be great partnerships for safeguarding this work.

If you have data and want to store it somewhere, we recommend DataLumos, and we're happy to talk about other options with you.

* For many reasons, we retired the workflow used at data rescue events in late May 2017 and are no longer collecting data. The biggest reasons were about sustainability. Our small team of people who supported data rescue events (many of which took place on weekends and evenings) could not continue to give as much time to training event hosts beforehand and being on call to troubleshoot during the events. Inquiries about hosting data rescue events had also begun to wane, showing a natural slowdown in engagement. Ultimately, though, we chose to retire the workflow because this method of data rescue, while still valuable, is not the most efficient or sustainable way to ensure access to these valuable data. We continue to carry the many lessons from data rescue events and that workflow into our thinking about other ways to safeguard government data through new partnerships.