From our community: Webinar updates, survey, & May meeting prep

Were you unable to attend the live webinar update on Thursday? If so, never fear! Everything is posted here, segmented into three parts (with most of the technical glitches edited out).

We had two gracious presenters: John Chodacki from California Digital Library, and Aaron Brenner from University of Pittsburgh Libraries. Each is engaged in experiments that are helping to advance our collective understanding of the challenges related to the long-term preservation and access of federal data.

John discussed efforts to build a systematic means of backing up the data made accessible through data.gov, and Aaron and colleagues are exploring what it means to back up data that is of interest to a particular community within their institution. Hear what they have to say in the Part 1 video below.

There has been continued interest in expectations for the May Meeting. We are using the Libraries+ Network Survey as a means of casting a wide net and gathering as many ideas as possible about challenges and desired outcomes. The Part 2 video below outlines preliminary feedback received to date. MAKE YOUR VOICE HEARD! And know that this meeting is one node among many in the conversations happening at conferences, across agencies, and in working groups.

Part 3 below contains additional feedback from attendees on how we might think further about the challenges, as well as other aspects of federal data use that may be less familiar to some of us but are important to keep on the table as discussions continue.

In addition, notes are available online as are the slides. We invite you to share the work you are doing and we appreciate your efforts. Stay tuned...

Tell us what you think! Libraries+ Network Survey

There is much to be learned from our Libraries+ Network colleagues, who have deep expertise and rich experience related to the access and preservation of federal data beyond federal servers.

We want to hear from you: Tell us what you think!

On May 8 & 9 a diverse group of people will come together to share perspectives, ideas, projects, and experiences to design possible pathways forward. However, before we can get to that point, we need to hear from our entire community. There are so many of us with experience, recommendations, and resources that could be the catalyst needed to propel our collective conversations forward.

If you are unable to attend the May meeting, we invite you to participate today in three ways:

  1. Take this community survey
    Tell us how you would define the problem, what outcomes you'd like to see from the meeting, and what readings or videos you'd recommend
     
  2. Register for the April 19 Pre-meeting Webinar
    Hear from colleagues, find out about survey results, and learn about the plans for the May meeting
     
  3. Author a post on the Libraries+ Network blog
    Share your thoughts on the challenges and/or possible futures, or tell us about a federal data-related project or event that has informed your thinking. (Contact kimeke[at]upenn.edu)

Your help in thoughtfully framing the problem to be solved is invaluable. Thank you in advance for your contributions and your interest. We look forward to posting the survey results after April 18!

 

 

Stronger together: the case for cross-sector collaboration in identifying and preserving at-risk data

Written by
Matthew S. Mayernik, Robert R. Downs, Ruth Duerr, Sophie Hou, Natalie Meyers, Nancy Ritchey, Andrea Thomer, Lynn Yarmey

Cross-posted from Earth Science Information Partners

Introduction

In the past few months, a range of grassroots initiatives to duplicate US government agency data have gained significant momentum. These initiatives are inspired by recent reports that scientific data and documentation have been removed from government websites, and by concerns over US budget proposals that would slash science budgets [1]. National media outlets have reported on numerous "data rescue," "data refuge," and "guerrilla archiving" events that have taken place around the US and in Canada during the past few months [2]. Many of these events have focused on creating copies of Earth science data generated and held by US federal agencies. These activities have attracted hundreds of volunteers who have spent considerable time and energy duplicating federal data.

Early connections have been made between the rescue volunteers and the federally-funded data community; these conversations have highlighted some of the different perspectives and opportunities regarding agency data. The two goals of this document are, first, to provide the perspective of Earth science data centers holding US federal agency data on this issue and, second, to provide guidance for groups who are organizing or taking part in data rescue events. This paper is not a how-to document and does not take a position on the political aspects of these efforts. Given the extent of the US government's data holdings in the Earth sciences and other domains, it is inevitable that any grassroots data rescue effort will have to make strategic choices about how to invest its resources. This document is intended to describe considerations for data rescue activities in relation to the day-to-day work of existing federal and federally-funded Earth science data archiving organizations.

The authors use the ‘data rescue’ terminology throughout this text to connect with the stated goals of the grassroots ‘data rescue’ communities, though we do wish to push back on the assumption that the data being targeted by these efforts are necessarily in need of ‘rescue.’ As we discuss below, many of these data are, in fact, well managed and safe, though sometimes in ways that are less-than-obvious to someone new to the domain. We look forward to working with these communities to develop a shared sense of risk for federal data. 

Our context

As data managers, data curators, etc. (whom we will refer to in general as “data professionals”), we work with researchers, agency personnel, and the cross-agency data community to understand requirements, identify critical metadata, standardize practices, and disseminate our work. We have strong communities-of-practice that emphasize accountability, sustainable solutions, efficiencies of scale, and the development of shared solutions to data challenges. Like librarians and archivists, we know our collections well and we are proud of our work, knowledge, and communities [3]. We recognize the value of data preservation, description, and long-term reuse, and in many cases have spent our careers improving our infrastructure, relationships, and workflows to better support the research community, agencies, and broader public.

We see our data as being intertwined with metadata, domain and technical requirements, provenance, and users. With environmental data in particular, simply capturing a website rarely captures the data itself, as each URL may point to a different part of the data package. Data may be held as files on an FTP site, as records within a database of some form, or even on offline storage media such as tape. Capturing data files or the contents of a database in isolation will rarely be sufficient to enable subsequent use of the data, since the contextual information needed to understand the format and meaning of those data is often contained in a series of metadata records, web pages, or other forms of documentation. The exact nature of the parts and pieces that comprise a complete data set will vary from center to center and sometimes from data set to data set. All of these pieces and their connections have been optimized over time to meet defined needs based on the best knowledge and resources available at the time.

We see data ‘risk’ as involving technical, metadata, policy, and resource considerations. In recent public conversations, “at risk” has been used to imply that data may be deleted or become inaccessible to the public now or in the future. From a public perspective, this would be visible as broken links [4] or the removal of a particular portal. From a data center perspective, however, these seemingly ‘lost’ data may still be well preserved and even accessible through professionally-managed federal infrastructure, since data management systems are usually detached and insulated from changes visible on the web. As of this writing, it is unclear what currently open and publicly available US federal government data are actually “at risk”. In the US, laws and budgets have been proposed that would phase out the EPA and substantially reduce the budgets of other environmental agencies [5], but the implications of these proposals for the data stewarded by federal agencies are unknown. It is certainly true that holding multiple copies of a data set at multiple organizations is central to successful data archiving, but the legal precedent for actually deleting data is not clear. Regardless, data may be characterized as “at risk” for many reasons, with risk factors including obsolete technology or data formats, lack of metadata, lack of expertise to interpret the data, and lack of funding to maintain the data [6]. It is important to be clear about which risk factors are being used to motivate data rescue events; being unclear about this can lead to confusion or misinformation. Wired.com, for example, published two news articles within a week of each other that respectively characterized NASA data as being a) in need of saving and b) not actually at much risk of being lost in their current homes [7].

Data center background

Numerous federal data centers, staffed with data professionals, infrastructure specialists, and often researchers themselves, offer usable, trustworthy data along with data preservation services. Data centers, archives, and digital repositories provide valuable services to support the long-term value and use of data by a particular domain or community [8]. Most data centers holding US agency data have some form of preservation plan that at least involves distributing multiple copies of the data across different geographic locations. Some of these centers, such as the NOAA National Centers for Environmental Information (NCEI), also have federal legal mandates for archiving particular data. US federal data centers are also in many cases part of national or international data networks. At least nine US agency data centers, and a number of other federally-funded data centers, are members of the World Data System (WDS), an international federation that promotes and supports trustworthy data services [9]. Becoming a WDS member involves undergoing a certification process to validate the data center’s procedures for effectively stewarding data over time. As another example, through the Federal Big Data program, NASA, NOAA, and other agencies have started making copies of their most popular very large data sets available through a variety of cloud providers.

In addition, the National Archives and Records Administration (NARA) manages the archives of many federal agencies, including each presidential administration’s websites and the data and documents hosted on those sites [10]. Each White House website also includes subsites from different government agencies and committees that are likely relevant to scientists (e.g., the Office of Science & Technology Policy (OSTP)). Unfortunately, the archiving process results in many broken links, and the appearance that documents and data are disappearing with each new administration.

It must be made clear, however, that data generated by federal agencies are not uniformly managed, and not all federal data resources are housed in formal data centers. There are vast quantities of data held by researchers in the federal government, academia, and industry that have not been deposited into any repository. There are many reasons for this, which we will not discuss here, as there is substantial literature on the topic [11]. What can happen in these cases is that, as researchers approach retirement, they start thinking about their legacy and typically drop off boxes of materials at their library, archive, or favorite data repository. Data centers may be woefully unprepared to do anything with this largesse, especially if the researcher is not available to answer questions.

In other cases, a repository may have been around since before the digital age. Many data centers maintain a legacy library and/or archive full of data in analog form, e.g., as maps, prints, or books, and consequently not fully available to the community for use [12]. These legacy collections on outdated media (7-track and 9-track tapes, paper tape, floppy disks, etc.) need to be migrated to modern media and data formats, since the technologies for reading the old media are obsolete and the media themselves are often degrading. Moreover, these data often predate the ASCII and Unicode eras and may need considerable bit-level manipulation to be translated into something usable by today’s technologies. These are data at considerable risk of being unusable by anybody unfamiliar with the original data collection effort.

Government data centers are typically happy to work with anybody interested in accessing or using their data. Some details on how to engage with data centers are described below.

Recommendations

This section outlines recommendations on how emerging grassroots data rescue initiatives can productively partner and collaborate with current data center services.

  • Confirm the current risk level of data sets in line for rescue. For instance, though a data set may appear to have vanished from a previously reliable access point, it may a) have been moved to another location, or b) be duplicated in other locations that are less popular, less widely known, or require different methods of access (for instance, through an API rather than a web browser). Do a bit of research on the data set, its creators, and its managers to confirm that it really is “at risk” as it might first appear.
     
  • Interact with the data center
    • Contact the data center before rescuing their data at large scale for preservation or access purposes. Data center personnel will be able to give pointers for what data under their purview may be “at risk.”
    • If you already have a list of data sets or resources that somebody has declared may be “at risk,” contact the relevant agencies for help in reviewing the list, getting connected to appropriate data center contacts, and facilitating documentation of data rescue activities.
    • Data center personnel may be able to guide you to the best mechanism for accessing the data (and associated metadata) from their facility. Hacking/scraping web pages may not be the most efficient way to download data and capture associated metadata, for example.
    • Let the data center know about your plans before asking dozens of volunteers to hit their web sites with large numbers of downloads. Contacting the center first will allow them to potentially provision additional web server capacity at specific times.
    • Log-ins – Data systems may require log-in for data access in many legitimate cases, e.g., to protect sensitive or legally-restricted data, to gather usage metrics, or to communicate data updates to the appropriate user communities. Not all conditions for access are ill-intentioned. Sometimes sensitive data need to be protected (for instance, data that reveal the location of endangered animals, rare specimens, or rare artifacts need to be hidden from potential poachers), and data containing information about people need to be carefully managed to mediate or prevent the sharing of personal data.
    • In some cases, technically bypassing log-ins or firewalls can be illegal. Data center staff can tell you why something requires a log-in, and how to access such data in compliance with appropriate policies.
    • If you are having difficulty contacting center staff, send a message to the data center's Help desk or User Services email address, which may be accessible from the FAQ or Support link on their website. In the absence of a timely response from the data center's Help desk or User Services staff, a written request should be sent to the data center's director.
       
  • Gather all associated metadata and keep them with the data. Data center personnel can help you identify all of these parts and pieces and how to access them.
    • Metadata might consist of information structured in a web page, an XML file, a database, or other mechanism, and might include documents, images, or maps.
    • Gather, maintain and use all persistent identifiers (PIDs) associated with data. Many government data repositories assign Digital Object Identifiers (DOIs) or other kinds of identifiers to data sets following cross-agency standards [13]. PIDs enable persistent location, identification, and citation of particular data, which is critical to tracing their provenance and usage.
    • Provenance / chain of custody – All data must be traceable back to their original sources and must have a demonstrable chain of custody, including validation mechanisms such as checksums (a minimal checksum sketch appears after this list).
       
  • Syncing efforts
    • Plan for maintenance and versioning. Many federal data sets change over time, with new data being added, or values being changed as errors are identified and fixed. Creating snapshots of data may exacerbate problems related to authoritative versioning and communication of changes.
    • If you allow users to access the rescued data
      • Link the rescued data back to the original source, using PIDs if possible.
      • Provide usage/download metrics back to the original source. Diverting traffic from the original data center to another data location actively hurts the data center, as they rely on usage metrics to understand community needs, determine priorities, and demonstrate the value of their services to the scientific communities, the general public, and their funders.
      • Identify who is going to provide human services for these data, e.g. answer questions, provide help to users in understanding what data sets actually represent, and help people interpret data correctly.
    • Security - Government web sites have to meet legal requirements for information security, e.g. keeping their files safe from hackers [14]. Adopting this requirement for rescue efforts will help ensure that users know they are getting real uncorrupted files.
       
  • Contribute expertise and effort in rescuing legacy data
    • Multiple international organizations and collaborative working groups have been working on the rescue of legacy data for decades [15]. As one example, the Data Rescue Interest Group within the Research Data Alliance (RDA- Data Rescue IG) currently is working on guidelines for the rescue of legacy data [16]. These initiatives would benefit tremendously from additional attention, effort, and resources.
    • If volunteers are participating in data rescue events as concerned citizens, there are many opportunities to contribute to these ongoing efforts to rescue legacy data. A very valuable contribution would be to hold events where people participate in citizen science-based data rescue efforts, e.g. http://weatherwizards.org/, https://www.oldweather.org/. Contributions of funds and technical expertise could also have significant impact on these efforts.
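
As a concrete illustration of the checksum and chain-of-custody point above, here is a minimal sketch in Python of how a rescue effort might record fixity information alongside downloaded files so that their integrity and origin can be demonstrated later. The directory layout, file paths, and URL in the usage note are hypothetical, not part of any existing workflow.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(data_dir, source_url, manifest_path="manifest.json"):
    """Write a simple fixity manifest: one checksum entry per rescued file.

    The source_url documents where the files were retrieved from, which is
    part of establishing provenance / chain of custody for the copy.
    """
    entries = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            entries.append({
                "file": str(path),
                "sha256": sha256sum(path),
                "source_url": source_url,
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
            })
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=2)


# Hypothetical usage: files downloaded into ./rescued_data from an agency site.
# record_fixity("rescued_data", "https://www.example.gov/some/dataset")
```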

Conclusion - Working Together

Data management, curation, and preservation efforts are chronically under-resourced and overlooked, and we all care about data safety, accuracy, use, and preservation. The trustworthiness of data is critically intertwined with the factors described above, e.g., metadata, provenance, transparency, security, and community [17]. If those factors are not taken into account, rescued data will be of no use regardless of how many times they are duplicated. There are many data professionals and other stakeholders in the data management community collaborating formally and informally to provide stewardship, identify at-risk data, curate at-risk data, and mitigate the chances of data becoming “at risk.” Grassroots data rescue efforts like DataRefuge and others have brought together an energetic, diverse community of passionate citizens and professionals with valuable skills and expertise. Initial connections between DataRefuge and broader communities such as the Research Data Alliance and ESIP have shown value, and point out important gaps and opportunities moving forward. The more we can work together to preserve the data that matter to us all, the more effective and sustainable our work will be.

References

[1] Varinsky, Dana. (2017). Scientists across the US are scrambling to save government research in 'Data Rescue' events. Business Insider, Feb. 11, 2017. http://www.businessinsider.com/data-rescue-government-data-preservation-efforts-2017-2

Science News Staff. (2017). A grim budget day for U.S. science: analysis and reaction to Trump's plan. Science, Mar. 16, 2017. https://doi.org/10.1126/science.aal0923

[2] See for example: Dennis, Brady. Scientists are frantically copying U.S. climate data, fearing it might vanish under Trump. The Washington Post, Dec. 13, 2016. https://www.washingtonpost.com/news/energy-environment/wp/2016/12/13/scientists-are-frantically-copying-u-s-climate-data-fearing-it-might-vanish-under-trump/

Temple, James. Climate data preservation efforts mount as Trump takes office. MIT Technology Review, Jan. 20, 2017. https://www.technologyreview.com/s/603402/climate-data-preservation-efforts-mount-as-trump-takes-office/

Khan, Amina. Fearing climate change databases may be threatened in Trump era, UCLA scientists work to protect them. Los Angeles Times, Jan. 21, 2017. http://www.latimes.com/science/sciencenow/la-sci-sn-climate-change-data-20170121-story.html

Harmon, Amy. Activists rush to save government science data — If they can find it. New York Times, March 6, 2017. https://www.nytimes.com/2017/03/06/science/donald-trump-data-rescue-science.html

[3] Yarmey, K. and Yarmey, L. (2013). All in the Family: A Dinner Table Conversation about Libraries, Archives, Data, and Science. Archive Journal, Issue 3. http://www.archivejournal.net/issue/3/archives-remixed/all-in-the-family-a-dinner-table-conversation-about-libraries-archives-data-and-science/

[4] Herrmann, Victoria. (2017). I am an Arctic researcher. Donald Trump is deleting my citations. The Guardian, Mar. 28, 2017. https://www.theguardian.com/commentisfree/2017/mar/28/arctic-researcher-donald-trump-deleting-my-citations

[5] US H.R.861 - To terminate the Environmental Protection Agency. Introduced Feb. 3, 2017. https://www.congress.gov/bill/115th-congress/house-bill/861/all-actions

[6] Anderson, William L., Faundeen, John L., Greenberg, Jane, & Taylor, Fraser. (2011). Metadata for data rescue and data at risk. In Conference on Ensuring Long-Term Preservation in Adding Value to Scientific and Technical Data. http://hdl.handle.net/2152/20056

Downs, Robert R. & Chen, Robert S. (2017). Curation of scientific data at risk of loss: Data rescue and dissemination. In Johnston, Lisa (Ed). Curating Research Data. Volume One, Practical Strategies for Your Digital Repository. Association of College and Research Libraries. http://dx.doi.org/10.7916/D8W09BMQ

Griffin, R.E. (2015). When are old data new data? GeoResJ, 6: 92–97. http://dx.doi.org/10.1016/j.grj.2015.02.004

Ryan, H. (2014). Occam’s razor and file format endangerment factors. Proceedings of the 11th International Conference on Digital Preservation (iPres), October 6-10, 2014: Melbourne, Australia (pp. 179-188). https://www.nla.gov.au/sites/default/files/ipres2014-proceedings-version_1.pdf

Thompson, C.A., Robertson, W. D., & Greenberg, J. (2014). Where have all the scientific data gone? LIS perspective on the data-at-risk predicament. College & Research Libraries, 75(6), 842-861. https://doi.org/10.5860/crl.75.6.842

[7] Molteni, Megan. Diehard coders just rescued NASA's Earth science data. Wired, Feb. 13, 2017. https://www.wired.com/2017/02/diehard-coders-just-saved-nasas-earth-science-data/

Molteni, Megan. Old-guard archivists keep federal data safer than you think. Wired, Feb. 19, 2017. https://www.wired.com/2017/02/army-old-guard-archivers-federal-data-safer-think/

[8] See e.g. Ramapriyan, H.K., Pfister, R. and Weinstein, B. (2010). An overview of the EOS data distribution systems. In Land Remote Sensing and Global Environmental Change (pp. 183-202). Springer New York. http://doi.org/10.1007/978-1-4419-6749-7_9

[9] https://www.icsu-wds.org/community/membership/regular-members

[10] https://www.archives.gov/presidential-libraries/archived-websites

[11] Douglass, K., Allard, S., Tenopir, C., Wu, L., Frame, M. (2013). Managing scientific data as public assets: Data sharing practices and policies among full-time government employees. Journal of the Association for Information Science and Technology, 65(2): 251–262. https://doi.org/10.1002/asi.22988

Tenopir, C., et al. (2015). Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS One, 10(8): e0134826. https://doi.org/10.1371/journal.pone.0134826

[12] US Geological Survey. 2016 Data at Risk Project. https://www.fort.usgs.gov/ldi/2016-data-at-risk-project

National Oceanic and Atmospheric Administration. Climate Database Modernization Program. https://www.ncdc.noaa.gov/climate-information/research-programs/climate-database-modernization-program

[13] Earth Science Information Partners (ESIP). 2011. Interagency Data Stewardship/Citations/provider guidelines. http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines

[14] https://www.dhs.gov/fisma

[15] E.g. the International Environmental Data Rescue Organization (IEDRO, http://iedro.org/).

See also: Tan, L. S., S. Burton, R. Crouthamel, A. van Engelen, R. Hutchinson, L. Nicodemus, T. C. Peterson, F. Rahimzadeh. (2004). Guidelines on Climate Data Rescue. WMO/TD No. 1210. Ed. by P. Llansó and H. Kontongomde. Geneva, Switzerland: World Meteorological Organization. http://www.wmo.int/pages/prog/wcp/wcdmp/documents/WCDMP-55.pdf.

[16] Guidelines to the Rescue of Data At Risk, https://www.rd-alliance.org/guidelines-rescue-data-risk

See also the Interest Group’s home page: Research Data Alliance Data Rescue Interest Group. https://www.rd-alliance.org/groups/data-rescue.html

[17] Yakel, E., Faniel, I., Kriesberg, A., & Yoon, A. (2013). Trust in Digital Repositories. International Journal of Digital Curation, 8(1). https://doi.org/10.2218/ijdc.v8i1.251

Yoon, A. (2017). Data reusers' trust development. Journal of the Association for Information Science and Technology, 68(4): 946-956. https://doi.org/10.1002/asi.23730

 

 

Bridging DataRescue Events with Libraries+

We've recently had an influx of librarians interested in hosting DataRescue Events, including people wanting to have events connected to library conferences. We're so excited about the enthusiasm these events have inspired. However, the DataRescue Event workflow used by so many wonderful events is imperfect, and we'd love to leverage the expertise of these librarians to bridge the work of events to the work going into Libraries+.

One of the most wonderful things about DataRescue Events is how they engage communities. "Hackathon" style events draw crowds and press that libraries are not used to seeing - and it's wonderful! People who attend and organize events rightly feel that their work there is a meaningful way to take action, and they learn a lot from the experience. We absolutely want to keep this invaluable component of these events alive while also making their work even more meaningful by employing better practices, increasing awareness of the broader issues, and taking advantage of the expertise of librarian organizers and participants.

Working within the standard DataRescue web and data archiving workflow, there are some tweaks and additions you might consider. First, as the workflow is currently organized, Describing comes at the end. Any data curation librarian will tell you that describing and documentation should happen throughout the data life cycle - starting at the beginning. One way you might tweak the workflow is to expand the Research piece of it by having participants add significantly more contextual information to the record for the data. You can also employ the DataRescuePDX workflow to a similar end.

Another easy and important way to connect the standard DataRescue activities to the work of the Libraries+ Network is to add a Long Trail path that considers the problems of preserving federal information and how it might be done in more sustainable ways. Insights from this activity can be shared with us - even posted on our blog - to help inform the discussion at the May meeting and beyond. More about what this might look like is coming in a subsequent blog post soon!

Some events have already tried different workflows and activities in the name of DataRescue. The second DataRescueNH in Dover, NH taught web archiving skills to attendees. DataRescuePhilly and DataRescueDC held teach-ins and panels to connect the archiving activities to the larger context of the problem. Virginia Tech held a small informational DataRescue event. DataRescuePDX employed a completely different workflow focused on creating metadata for datasets so they'll be more discoverable, harvestable, and usable. DataRescue@UCSD will employ the UCSD Library Digital Collections to make data identified by their scientists available. Many of the events being planned for Endangered Data Week could also be considered DataRescue, since they improve skills and understanding related to vulnerable digital information. The way we see it, there isn't just one way to do a DataRescue Event - it's all about what your organization can bring to the table and what's best for your community. Whatever you're thinking about doing, we'd love to chat and help you have the best event possible!

An experiment

Over the course of the last several months, as we've worked with a giant collaborative network of volunteers to save government information, we've learned about the value, vulnerability and variety of government data. As we've described elsewhere, we're now working on planning a meeting where data advocates, librarians, federal data producers, and researchers will gather to begin thinking through new approaches to safeguarding copies of federal data.

In order to further inform that conversation, we've begun an experiment with a few brave volunteer librarians to see what we can learn from one model of saving data.

This experiment is not designed to offer a solution to the problem of backing up federal information, but is instead an attempt to further understand the problem.

This experiment is based on our current understanding of the landscape of federal information. When we talk about federal data, we are, in fact, talking about data stored on plain HTML webpages, in visualized and embedded content, on FTP servers, in databases available only through query interfaces, and as files that conform more closely to what we might normally think of as research datasets.

In order to provide for long term backup and re-use of data, one goal might be to turn the data from webpages and query pages into datasets that include appropriate metadata and enough contextual information for future researchers to make use of them. In some cases, making a dataset will be as simple as downloading the relevant files from data.gov.
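
For that simple case, here is a minimal sketch of what "downloading the relevant files from data.gov" might look like in practice. It assumes the data.gov catalog's CKAN-style API at catalog.data.gov and uses a made-up dataset identifier; the details of any real rescue would differ and should be confirmed against the catalog itself.

```python
import requests
from pathlib import Path

CATALOG_API = "https://catalog.data.gov/api/3/action/package_show"


def download_datagov_dataset(dataset_id, out_dir="rescued_data"):
    """Fetch a dataset's catalog metadata and download each listed resource,
    keeping the metadata alongside the files for future context."""
    resp = requests.get(CATALOG_API, params={"id": dataset_id}, timeout=60)
    resp.raise_for_status()
    package = resp.json()["result"]

    out = Path(out_dir) / dataset_id
    out.mkdir(parents=True, exist_ok=True)

    # Keep the catalog metadata record with the files.
    (out / "package_metadata.json").write_text(resp.text, encoding="utf-8")

    for resource in package.get("resources", []):
        url = resource.get("url")
        if not url:
            continue
        filename = url.rsplit("/", 1)[-1] or resource.get("id", "resource")
        r = requests.get(url, timeout=300)
        r.raise_for_status()
        (out / filename).write_bytes(r.content)


# Hypothetical usage with a made-up dataset identifier:
# download_datagov_dataset("example-agency-example-dataset")
```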

This looks like a full dataset from data.gov

Other sorts of data might require some compiling, where context from webpages needs to be combined with data scraped from within a query interface to create a dataset that can be usefully backed up and stored.
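
As a rough illustration of that compiling step, here is a minimal sketch of pairing the contextual webpage with an export pulled from a query interface so the pieces travel together as one package. All URLs and paths are hypothetical placeholders.

```python
import requests
from pathlib import Path


def compile_dataset(context_url, query_url, out_dir):
    """Save the contextual webpage and the query-interface export side by side,
    plus a small README linking the pieces back to their sources."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The webpage that explains what the data mean (units, methods, caveats).
    context = requests.get(context_url, timeout=60)
    context.raise_for_status()
    (out / "context.html").write_text(context.text, encoding="utf-8")

    # The data themselves, e.g. a CSV export exposed by a query interface.
    data = requests.get(query_url, timeout=300)
    data.raise_for_status()
    (out / "data.csv").write_bytes(data.content)

    (out / "README.txt").write_text(
        f"Context page: {context_url}\nQuery export: {query_url}\n",
        encoding="utf-8",
    )


# Hypothetical usage:
# compile_dataset(
#     "https://www.example.gov/program/measurements/about",
#     "https://www.example.gov/query?format=csv&year=2016",
#     "rescued_data/example-measurements",
# )
```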

 

The experiment.

This is what we've asked of these libraries.

  1. Identify a designated community for whom you are saving data.
    1. Pick a subset of data that would be useful for that community that you'll try to save.
      For the purpose of this experiment, we are focused on data rather than webpages.
  2. In order to back up data that will be meaningful to people in the future, you'll need to decide how to chunk up the data you're saving into pieces, and how to describe those pieces. Where does one dataset end and another begin? How much information do you need about the data you'll be backing up to make your copy re-usable and citable in the future? What files, webpages, or additional material complement the dataset you've identified? These key questions are ones you will address as you create a model of the data you'll be saving so that they can be re-used.
     
  3. Gather the necessary files you've identified so that the federal data are effectively "backed up" in your system, and made available to the public.
    1. For those libraries with data repositories, we hope they'll make the data available through those repositories, while for libraries without repositories, we can offer space in the datarefuge storage for their data.
  4. For each dataset you've backed up, create a data.json file to contribute to the datarefuge project's instance of CKAN (that is, the datarefuge.org catalog). Make whatever changes to the standard format are necessary to point both to the original and to your copy. (A minimal data.json sketch appears after this list.)
     
  5. Look for ways to include public involvement, advocacy and education in this process. Are there ways that some of this work could be done by volunteers, by citizen scientists?
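
To make step 4 above more concrete, here is a minimal sketch of producing a data.json record for one backed-up dataset, following the general shape of the Project Open Data metadata schema that federal catalogs use. Every title, identifier, and URL below is a placeholder, and the exact fields the datarefuge CKAN instance expects should be confirmed with the project.

```python
import json

# Placeholder values; every name, identifier, and URL here is hypothetical.
dataset_record = {
    "title": "Example Agency Air Quality Measurements",
    "description": "Backup copy of an example federal dataset, rescued on 2017-04-01.",
    "keyword": ["air quality", "environment", "data rescue"],
    "modified": "2017-04-01",
    "publisher": {"name": "Example Federal Agency"},
    "contactPoint": {
        "fn": "Your Library Data Services",
        "hasEmail": "mailto:data-services@example.edu",
    },
    "identifier": "example-agency-air-quality-2017",
    "accessLevel": "public",
    "landingPage": "https://www.example.gov/original/dataset/page",  # the original source
    "distribution": [
        {
            "downloadURL": "https://repository.example.edu/backup/air-quality.csv",  # your copy
            "mediaType": "text/csv",
        }
    ],
}

# Write a data.json file containing this one dataset entry.
with open("data.json", "w") as f:
    json.dump({"dataset": [dataset_record]}, f, indent=2)
```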

We hope that, through this experiment, we'll learn a few things:

  • What are the challenges to making government sites into datasets for backup and re-use?
  • How might these processes fit into library workflow?

There will, of course, remain a number of open questions that will need to be solved through continued collaboration and experimentation. These include:

  • Can we find ways to share our work with the data producers at agencies so that future re-use and discovery is enhanced?
  • How can we ensure that our system comprehensively addresses the data systems of the federal government?
  • How frequently should data be backed up, and what commitment will institutions make if they attempt this system?
  • What kinds of funding models and storage architecture will work best?
  • What kind of advocacy work will be most successful in continuing support for these efforts?

From DataRefuge to Libraries+

There have been nearly 30 DataRescue Events as of this writing. Can you believe it? We're so proud and inspired by all the hard work of event organizers and attendees. You are amazing.

As you know, DataRefuge grew quickly from grassroots efforts. As we've grown, we've been able to get a better view of the underlying problem that made these efforts necessary. We've discussed this in many places before, but to bring it home, I'll summarize again: Government information is not archived systematically or particularly well.

This problem isn't confined to the government; it's true of most born-digital information. We take for granted that information on the web will remain on the web - especially if it's being maintained by a trusted source like the federal government. But it's never a good idea to keep all of your eggs in one basket.

This leaves a simple-sounding solution: We need to put our government information eggs into some more baskets. This is where Climate Mirror and DataRescue Events come in. These basket-weaving brigades have done an amazing job collecting government information. However, in our view, these methods aren't the best way to ensure government information is archived going forward. Chickens keep laying eggs. So do other birds. And lizards. And spiders.

To address the issue of sustainability, we've been envisioning a reboot of the Federal Depository Library Program wherein libraries take on the responsibility of archiving the data and information of specific agencies in a distributed, coordinated way. And the more we think about the problem, the more complex we see that it is. People in the library community have been thinking about this problem for years, as have people in the open data community, people within government agencies, and researchers from a variety of disciplines.

And yet the problem remains. Clearly this is not a problem we at DataRefuge are going to solve alone. We need to bring together the voices, viewpoints, and knowledge of these different communities. That's the aim of the blossoming Libraries+ Network, where representatives from all these communities will join forces to map out this problem and start to envision realistic solutions at the kickoff meeting May 8-9 this year.

We really can't say enough about how amazing the work of DataRescuers and Climate Mirrorists has been. DataRefuge will continue to stand with the efforts of DataRescue and support the work of Storytelling going forward. We're also excited to move forward alongside the Libraries+ Network. However you're able to participate, we hope you'll join us.

Originally posted on the PPEH Blog at http://www.ppehlab.org/blogposts/2017/3/9/datarefuge-update-quo-vadimus

A rare opportunity to make a long-term difference

This post is authored by James A. Jacobs, Librarian Emeritus, University of California San Diego &
James R. Jacobs, Federal Government Information Librarian, Stanford University.

This moment in history provides us with a rare opportunity to go beyond short-term data rescue and set the much needed foundation for the long-term future of preservation of government information.

Awareness of risk. At the moment, more people than ever are aware of the risk of relying solely on the government to preserve its own information. This was not true even six months ago. This awareness goes far beyond government information librarians and archivists. It includes the communities that use government information (our Designated Communities!) and the government employees who devote their careers to creating this information. It includes our colleagues, our professional organizations, and library managers.

This awareness is documented in the many stories in the popular press this year about massive “data rescue” projects drawing literally hundreds of volunteers. It is also demonstrated by the number of people nominating seeds (URLs) for the current End of Term harvest and the number of seeds nominated. These have increased by nearly an order of magnitude or more over 2012.

EOT year    Nominators    Seeds nominated
2008        26            457
2012        31            1,476
2016        >392          11,377

Awareness of need for planning. But beyond the numbers, more people are learning first-hand that rescuing information at the end of its life-cycle can be difficult, incomplete, and subject to error and even loss. It is clear that last minute rescue is essential in early 2017. But it is also clear that, in the future, efficient and effective preservation requires planning. This means that government agencies need to plan for the preservation of their own information and they need to do so at the beginning of the life-cycle of that information — even before it is actually created.

Opportunity to create demonstrable value. This awareness provides libraries with the opportunity to lead a movement to change government information policies that affect long-term preservation of and access to government information. By promoting this change, libraries will be laying the groundwork for the long-term preservation of information that their communities value highly. This provides an exceptional opportunity to work with motivated and inspired user communities toward a common goal. This is good news at a time when librarians are eager to demonstrate the value of libraries.

A model exists. And there is more good news. The model for a long-term government information policy not only exists, but libraries are already very familiar with it. In 2010, federal granting agencies such as NSF, the National Institutes of Health, and the Department of Energy began requiring researchers who receive federal grants to develop Data Management Plans (DMPs) for the data collected and analyzed during the research process. Thus, data gathered at government expense by researchers must have a plan for archiving those data and making them available to other researchers. The requirements for DMPs have driven a small revolution of data management in libraries.

Ironically, there is no similar requirement for government agencies to develop a plan for the long-term management of information they gather and produce. There are, of course, a variety of requirements for managing government “Records” but there are several problems with the existing regulations.

Gaps in existing regulations. The Federal Records Act and related laws and regulations cover only a portion of the huge amount of information gathered and created by the government. In the past, it was relatively easy to distinguish between “publications” and “Records,” but in the age of digital information, databases, and transactional e-government it is much more difficult to do so. Official executive agency “Records Schedules,” which are approved by the National Archives and Records Administration (NARA), define only a subset of the information gathered and created by an agency as Records suitable for deposit with NARA. (It must be noted that NARA cannot guarantee that it will provide online access even to born-digital Records deposited with it.) Further, the implementation of those Records Schedules is subject to interpretation by executive agency political appointees who may not always have preservation as their highest priority. This can make huge swaths of valuable information ineligible for deposit with NARA as Records.

Government data, documents, and publications that are not deemed official Records have no long-term preservation plan at all. In the paper-and-ink world, many agency publications that did not qualify as Records were printed by or sent to the Government Publishing Office (GPO) and deposited in Federal Depository Library Program (FDLP) libraries around the country (currently 1,147 libraries). Unfortunately, a perfect storm of policies and procedures has blocked FDLP libraries from preserving this huge class of government information. A 1983 court decision (INS v. Chadha, 462 U.S. 919, 952) makes it impossible to require agencies to deposit documents with GPO or the FDLP. The 1980 Paperwork Reduction Act (44 U.S.C. §§ 3501–3521) and the Office of Management and Budget (OMB)’s Circular A-130 have made it more difficult to distribute government information to FDLP libraries. The shift to born-digital information has decentralized publishing and distribution, and virtually eliminated best practices of metadata creation and standardization. GPO’s own Dissemination and Distribution Policy has further (and severely) limited the information it will distribute to FDLP libraries. Together, this “perfect storm” has reduced the deposit of this class of at-risk government information into FDLP libraries by ninety percent over the last twenty years.

The Solution: Information Management Plans. To plug the gaps in existing regulations, government agencies should be required to treat their own information with as much care as data gathered by researchers with government funding. What is needed is a new regulation that requires agencies to have Information Management Plans (IMPs) for all the information they collect, aggregate, and create.

We have proposed to the OMB a modification to their policy OMB Circular A-130: Managing Information as a Strategic Resource that would require every government agency to have an Information Management Plan.

Every government agency must have an “Information Management Plan” for the information it creates, collects, processes, or disseminates. The Information Management Plan must specify how the agency’s public information will be preserved for the long-term including its final deposit in a reputable, trusted, government (e.g., NARA, GPO, etc.) and/or non-government digital repository to guarantee free public access to it.

Many Benefits! We believe that such a requirement would provide many benefits for agencies, libraries, archives, and the general public. We think it would do more to enhance long-term public access to government information than changes to Title 44 of the US Code (which codified the “free use of government publications”) could do.

  • It would make it possible to preserve information continuously without the need for hasty last-minute rescue efforts.
  • It would make it easier to identify and select information and preserve it outside of government control.
  • It would result in digital objects that are easier to preserve accurately and securely.
  • It would make it easy for government agencies to collaborate with digital repositories and designated communities outside the government for the long-term preservation of their information.
  • The scale of the resulting digital preservation infrastructure would provide an easy path for shared Succession Plans for Trusted Digital Repositories (TDRs) (Audit And Certification Of Trustworthy Digital Repositories [ISO Standard 16363]).

IMPs would provide these benefits through the practical response of vendors that provide software to government agencies. Those vendors would have an enormous market for flexible software solutions for creating digital government information and records that fit the different needs of different agencies for database management, document creation, content management systems, email, and so forth. At the same time, such software would make it easy for agencies to output preservable digital objects, along with an accurate inventory of them, ready for deposit as Submission Information Packages (SIPs) into TDRs.

Your advice?

We believe this is a reasonable suggestion with a good precedent (the DMPs), but we would appreciate hearing your opinions. Is A‑130 the best target for such a regulation? What is the best way to propose, promote, and obtain such a new policy? What is the best wording for such a proposed policy?

Summary

We believe we have a singular opportunity of awareness and support for the preservation of government information. We believe that this is an opportunity, not just to preserve government information, but also to demonstrate the leadership of librarians and archivists and the value of libraries and archives.

(This is the second of two posts about setting long-term goals. The first post is A Long-Term Goal For Creating A Digital Government-Information Library Infrastructure.)

Authors:

James A. Jacobs, Librarian Emeritus, University of California San Diego
James R. Jacobs, Federal Government Information Librarian, Stanford University

A Long-Term Goal For Creating A Digital Government-Information Library Infrastructure

This post is authored by James A. Jacobs, Librarian Emeritus, University of California San Diego &
James R. Jacobs, Federal Government Information Librarian, Stanford University.

Now that so many have done so much good work to rescue so much data, it is time to reflect on our long-term goals. This is the first of two posts that suggest some steps to take. The second post is A rare opportunity to make a long-term difference.

The amount of data rescue work that has already been done by DataRefuge, ClimateMirror, Environmental Data and Governance Initiative (EDGI) projects, and the End of Term crawl (EOT) 2016 is truly remarkable. In a very practical sense, however, this is only the first stage in a long process. We still have a lot of work to do to make all the captured digital content (web pages, data, PDFs, videos, etc.) discoverable, understandable, and usable. We believe that the next step is to articulate a long-term goal to guide the next tasks.

Of course, we do already have broad goals, but up to now those goals have by necessity been more short-term than long-term. The short-term goals that have driven so much action have been either implicit (“rescue data!”) or explicit (“to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations” [EOT]). These have been sufficient to draw librarian, scientist, hacker, and public volunteers who have accomplished a lot! But, as the EOT folks will remind us, most of this work is volunteer work.

The next stages will require more resources and long-term commitments. Notable next tasks include: creating metadata, identifying and acquiring DataRefuge’s uncrawlable data, and doing Quality Assurance (QA) work on content that has been acquired. This work has begun. The University of North Texas, for example, has created a pilot crowdsourcing project to catalog a cache of EOT PDFs and is looking for volunteers. This upcoming work is essential in order to make content we rescue and acquire discoverable and usable and to ensure that the content is preserved for the long-term.

As we look to the long-term, we turn to the two main international standards for long-term preservation: OAIS (Reference Model For An Open Archival Information System) and TDR (Audit And Certification Of Trustworthy Digital Repositories). Using the terminology of those standards our current actions have focused on “ingest.” Now we have to focus on the other functions of a TDR: management, preservation, access, and use. We might say that what we have been doing is Data Rescue but what we will do next is Data Preservation which includes discovery, access and use.

Given that, here is our suggestion for a long-term goal:

Create a digital government-information library infrastructure in which libraries collectively provide services for collections that are selected, acquired, organized, and preserved for specific Designated Communities (DCs).

Adopting this goal will not slow down or interrupt existing efforts. It focuses on “Designated Communities” and the life-cycle of information, and in doing so it will help prioritize our actions and attract libraries to participate in the next-stage activities. It will also make long-term participation easier and more effective by helping participants understand where their activities lead, what the outcomes will be, and what benefits they will get tomorrow by investing their resources in these activities today.

How does simply adopting a goal do all that?

First, by expressing the long-term goal in the language of OAIS and TDR it assures participants that today’s activities will ensure long-term access to information that is important to their communities.

Second, by putting the focus on the users of the information it demonstrates to our local communities that we are doing this for them. This will help make it practical to invest needed resources in the necessary work. The goal focuses on users of information by explicitly saying that our actions have been and will be designed to provide content and services for specific user groups (Designated Communities in OAIS terminology).

Third, by focusing on an infrastructure rather than isolated projects, it provides an opportunity for libraries to benefit more by participating than by not participating.

The key to delivering these benefits lies in the concept of Designated Communities. In the paper-and-ink world, libraries were limited in who they could serve. “Users” had to be local; they had to be able to walk into our buildings. It was difficult and expensive to share either collections or services, so we limited both to members of our funding institution or a geographically-local community. In the digital world, we no longer have to operate under those constraints. This means that we can build collections for Designated Communities that are defined by discipline or subject or by how a community uses digital information. This is a big change from defining a community by its institutional affiliation or by its members’ geographical proximity to an institution or to each other.

This means that each participating institution can benefit from the contributions of all participating institutions. To use a simple example, if ten libraries each invested the cost of developing collections and services for two DCs, all ten libraries (and their local/institutional communities) would get the benefits of twenty specific collections and services. There are more than one thousand Federal Depository Library Program (FDLP) libraries.

Even more importantly, this model means that the information-users will get better collections of the information they need and will get services that are tailored to how they look for, select, and use that information.

This approach may seem unconventional to government information specialists who are familiar with agency-based collections and services. The digital world allows us to combine the benefits of agency-based acquisitions with DC-based collections and services.

This means that we can still use the agency-based model for much of our work while simultaneously providing collections for DCs. For example, it is probably always more efficient and effective to identify, select, and acquire information by focusing on the output of an agency. It is certainly easier to ensure comprehensiveness with this approach. It is often easier to create metadata and do QA for a single agency at a time. Information content can likewise be stored and managed using the same agency-based approach, and information stored by agency can be viewed and served (through use of metadata and APIs) as a single “virtual” collection for a Designated Community. Any given document, dataset, or database may show up in the collections of several DCs, and any given “virtual” collection can easily contain content from many agencies.

For example, consider how this approach would affect a Designated Community of economists. A collection built to serve economists would include information from multiple agencies (e.g., Commerce, Council of Economic Advisors, CBO, GAO, NEC, USDA, ITA, etc.). When one library built such a collection and provided services for it, every library with economists would be better able to serve its community of economists. And every economist at every institution would be able to more easily find and use the information she needs. The same advantages would hold for DCs based on kind of use (e.g., document-based reading; computational textual analysis; GIS; numeric data analysis; re-purposing and combining datasets; etc.).

Summary

We believe that adopting this goal will have several benefits. It will help attract more libraries to participate in the essential work that needs to be done after information is captured. It will provide a clear path for planning the long-term preservation of the information acquired. It will provide better collections and services to more users more efficiently and effectively than could be done by individual libraries working on their own. It will demonstrate the value of libraries to our local user-communities, our parent institutions, and funding agencies.

Authors:

James A. Jacobs, Librarian Emeritus, University of California San Diego
James R. Jacobs, Federal Government Information Librarian, Stanford University

Recording: Latest Lessons Learned

Laurie Allen, Assistant Director for Digital Scholarship at Penn Libraries, walks us through an overview of what our colleagues engaged in data rescue events have learned and how academic research libraries can complement those efforts.

Data rescue events are a bottom-up strategy to get as much data as we can while working with people with a wide variety of skill sets (not necessarily library-related skill sets) during a limited time frame.

We are proposing that research libraries can complement this with a top-down strategy. Librarians know how government agency data are organized and what types of information researchers need, and they can target the work of downloading data sets to be conducted as part of their routine work.

We are seeking a few research libraries who would be willing to commit to specific agencies and collaborate on a shared workflow as a pilot.

Emerging Ideas for Ways Libraries Can Contribute

After the initial webinar on Monday, we heard constructive feedback and good questions from colleagues at a number of libraries. The scoop: Some of us have more resources than others — and finding the best way to contribute quickly and effectively isn’t obvious. In the spirit of reflecting back the four levels of data rescue effort we are hearing about, below is an approach that strives to balance flexibility with interest and resources in a way that remains true to the principle of “systematically grounded action.”

Here’s an overview:

Here’s the cycle:

 

Here are some details…

 

We look forward to hearing what you think. Thank you for all you are doing!

You're Invited

We hope you can join a collaborative project that leverages the talent and energy of librarians in addressing a wicked problem: Preserving born-digital government data.

Given the successes of the #DataRefuge project to rescue climate and environmental data, librarians have started to connect, ask, act, and contemplate collective action for more types of data. Let’s figure it out together!

Join us for a 30-minute webinar to kick-off collaborations:
Monday, February 6 @ 12:15 pm ET

Recording available

To join the collaboration, fill out this form, which is connected to this US Federal Agencies Coordination spreadsheet, on behalf of your library or archives.

Here are some background documents that led us to reach out to ARL to convene people and energy toward positive action:

Leveraging Libraries (pdf, 1/27/17)

Libraries Network Overview (pdf, 2/1/17)

Chain of Custody (github, 2/1/17)

Hope to see you online!