Engaging in Small Data Rescue

The following post is authored by Anna E. Kijas, a Senior Digital Scholarship Librarian at Boston College Libraries.

In late January 2017, conversations began on the Music Library Association (MLA) listserv about data rescue. These conversations were primarily inquiries to see if anyone on the MLA-L was aware of data archiving or rescue efforts underway for vulnerable performing arts and music data on government websites. As a digital scholarship librarian whose disciplinary background is grounded in musicology and librarianship, I eagerly joined the conversation and reached out to colleagues in the MLA and beyond, including the Society of American Archivists (SAA), in order to start an informal environmental scan of existing data rescue initiatives and to strategize about and identify efforts in which we (or our colleagues and institutions) could participate. A small group of us convened virtually and at the 2017 MLA conference to discuss our concerns, possible courses of action, and outreach to organizations and people. The idea of engaging in data rescue of performing arts and music data felt a bit overwhelming, and we had many questions, such as:

  • Where do we start?!

  • Which federal agencies produce or host data that is vulnerable?

  • What should be archived?

  • Is anyone already archiving or rescuing this data?

  • Who is archiving social media accounts for these agencies?

We were able to answer some of these questions during these months of conversations, or at least make decisions about what kind of data rescue we could engage in. We agreed that we did not wish to undertake any type of data rescue without first inquiring about existing guidelines and workflows already in place at agencies, such as the U.S. National Archives and Records Administration (NARA), nor did we want to duplicate efforts already underway by other institutions or people. Duplication is not necessarily a bad thing; however, as Andrew Battista and Stephen Balogh write in a recent Libraries+ Network post entitled “The Challenge of Rescuing Federal Data: Thoughts and Lessons,” “given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.” As Battista and Balogh also found, it is difficult to identify whether an institution has already preserved federal data because there is no central clearinghouse or data store for all data rescue efforts. So, we began an informal environmental survey and identified several organized efforts to preserve federal government data or websites, reviewing their scope, criteria and selection, methodology, and participation type. I’ve grouped these efforts into categories based on two activity types: a) Website Crawling and b) Data Archiving and Distribution. Within these categories, the efforts differed in who could suggest or participate in the data rescue: public participation, internal staff only, or a combination of the two in which staff made primary selection decisions with feedback from the public.

Website Crawling

In the first category of Website Crawling, we reviewed the following initiatives:

1.     The Nomination Tool, developed by the University of North Texas for the End of Term Presidential Harvest project (a collaboration among the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office), enables librarians to contribute metadata about federal government websites in the Legislative, Executive, or Judicial branches for a focused crawl; the project has preserved U.S. government websites since 2008. It is broken down into a comprehensive and a prioritized crawl and provides criteria for the types of sites that are in scope.

2.     The Library of Congress Web Archives (LOCWA), established in 2000, captures web content from U.S. federal government websites as well as non-U.S. government websites, including, but not limited to, “foreign government, candidates for political office, political commentary, political parties, media, religious organizations, support groups, tributes and memorials, advocacy groups, educational and research institutions, creative expressions (cartoons, poetry, etc.), and blogs.” Staff in Library Services and the Law Library make selections for this archive.

3.    The Federal Depository Library Program (FDLP) is a program run by the U.S. Government Publishing Office to harvest and archive a selection of U.S. government websites using the Archive-It service from the Internet Archive.

4.     The Internet Archive Wayback Machine enables anyone to capture a snapshot of a webpage by crawling the site. Some institutions and organizations subscribe to Archive-It, a service developed by the Internet Archive that helps organizations harvest and preserve digital content.

While the Nomination Tool invites anyone to nominate content that will be reviewed and selected for crawling (during identified timeframes), selection for the LOC Web Archives is guided primarily by internal staff. Content for the FDLP is also selected by internal staff, but they welcome recommendations via email or through an online form.

Data Archiving and Distribution

In the second category, Data Archiving and Distribution, we identified the DataRefuge initiative that grew out of the Penn Program in the Environmental Humanities (PPEH LAB), “a collective of scholars, students, artists, scientists, and educators whose mission is to generate local and global awareness and engagement in the emergent area of the environmental humanities.” The aim of this initiative is to rescue federal climate and environmental data, going beyond website crawling to create “trustworthy copies of federal climate and environmental data” through data archiving, distribution, and replication. This group launched a series of Data Rescue events, which have been hosted across the United States and Canada since December 2016. There are, of course, open data repositories where scholars can deposit their own data, as well as individual institutional repositories.

In addition, we came across initiatives that were community driven, for example Project_ARCC, established in 2015 as “a community of archivists taking action on climate change.” This initiative facilitates conversation, resource sharing, and action amongst archivists (and non-archivists) who work with collections that may be impacted by climate change, and it promotes collections that can educate and raise public awareness about climate change. Another initiative, which can be viewed as a community call to action, was Endangered Data Week: April 17-21, 2017 was designated Endangered Data Week in order to raise awareness about vulnerable or at-risk datasets. Organizations, institutions, and individuals played an active role in determining the types of events and strategies they would undertake in order to contribute to this effort.

Aim & Strategy

The goal for us was to identify initiatives and projects focused specifically on archiving performing arts and music data from federal government websites that may be vulnerable or at-risk. During our environmental scan and outreach to individuals and institutions that work with or hold performing arts and music data, we identified the National Endowment for the Humanities (NEH), the National Endowment for the Arts (NEA), and the Institute of Museum and Library Services (IMLS) as the primary agencies for potential data rescue efforts. It is important to note that the budget blueprint issued by the current administration in March 2017 called for eliminating funding for all three of these agencies, and since our initial exploration a new FY2018 budget has been issued that again proposes their elimination. As mentioned earlier, there are several initiatives that crawl federal government websites, including those of the NEH, NEA, and IMLS, but it became apparent to us that no one else had yet “claimed” data from the NEH and IMLS aggregated in Data.gov. We initially discussed claiming data from the NEA as well, but determined that the ICPSR at the University of Michigan already archives NEA datasets in its repository, and UMass Amherst Special Collections & University Archives makes NEA publications on arts and arts management available online.

NEH and IMLS data found on the Data.gov domain is not considered big data, nor does it require the complex archiving workflows that climate change or scientific data do, but it is nevertheless valuable and important for us to archive. The IMLS data documents the distribution of grants since 1996, including those focused on administration, assessment, innovation, leadership, library services, museum services, preservation, professional development, and programming. The NEH data contains administrative information documenting the agency’s grant-funding activities from its inception in 1965 through the present. The activities documented in the IMLS and NEH data can be used by individuals, organizations, and communities for different purposes, including to demonstrate impact and provide justification as to why funding for the arts, humanities, and cultural heritage organizations (i.e. archives, libraries, and museums) is important to the citizens of this country. This is what makes this data valuable. It documents the work that people, communities, and institutions around the country, from McGrath, Alaska, and Erie, Pennsylvania, to New York, New York, can accomplish with federal funding, which in many cases would otherwise be impossible.

Small Data Rescue at Boston College

I reached out to several Penn librarians working with DataRefuge to discuss our ideas and possibilities for archiving and claiming data. These conversations reinforced our decision to claim the NEH and IMLS datasets aggregated in Data.gov and to ingest the data into the shared CKAN repository maintained by DataRefuge, again reinforcing the community aspect of data rescue and making the data easier to locate by other colleagues. After speaking with colleagues at the MLA and SAA, several librarians and staff at Boston College agreed to host a data rescue event during Endangered Data Week modeled after the DataRefuge initiative. Our efforts were twofold: first, we wished to identify and rescue IMLS and NEH data by putting it into a shared CKAN repository and, second, we wanted to take this opportunity to do additional outreach to Boston College faculty and staff to encourage them to deposit their grant funded data into our institutional repository or instance of Dataverse.

To prepare for the data rescue event I met with Jesse Martinez (Library Applications Developer) to discuss pulling the data and metadata records for the NEH and IMLS datasets via the API. He was able to figure out the API query structure and pulled JSON formatted results using the following API queries:

  1. NEH
    https://catalog.data.gov/api/action/package_search?fq=(organization:%22neh-gov%22+AND+(type:dataset))
  2. IMLS
    https://catalog.data.gov/api/action/package_search?fq=(organization:%22imls-gov%22+AND+(type:dataset))

After the results were pulled, he parsed the API query results to get the dataset URIs. In the span of about 24 hours, Jesse put together a few Python scripts in Jupyter Notebook to parse and fetch each dataset. Each script parses the JSON file to find each dataset URI, then downloads and saves the dataset to a directory generated alongside the JSON file. If a dataset is available in multiple formats, each one is downloaded to the same directory, and a log file of each transaction is generated. All of the queries, Python scripts, and additional instructions are available on our BC Digital Scholarship GitHub page.
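The parsing step can be sketched as follows, assuming the standard shape of a CKAN `package_search` response (`result.results[].resources[]`); the sample dataset below is invented for illustration, and the team's actual scripts live on the BC Digital Scholarship GitHub page:

```python
import json

def extract_resource_urls(api_response_text):
    """Parse a CKAN package_search response and collect, for each dataset,
    its title plus the (format, URL) pair of every downloadable resource."""
    payload = json.loads(api_response_text)
    datasets = []
    for package in payload["result"]["results"]:
        resources = [(r.get("format", ""), r.get("url", ""))
                     for r in package.get("resources", [])]
        datasets.append({"title": package.get("title", ""),
                         "resources": resources})
    return datasets

# A trimmed, invented response in the package_search shape:
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{
        "title": "NEH Grants 1966-1970",
        "resources": [
            {"format": "CSV", "url": "https://example.gov/neh_1966_1970.csv"},
            {"format": "XML", "url": "https://example.gov/neh_1966_1970.xml"},
        ]}]}})

for dataset in extract_resource_urls(sample):
    for fmt, url in dataset["resources"]:
        print(dataset["title"], fmt, url)
```

A downloader would then fetch each URL and write it to a per-dataset directory, logging each transaction, as the scripts described above do.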

Using the documentation developed by DataRefuge as a guide, I drafted a minimal data rescue workflow to use with the IMLS datasets and NEH datasets at our data rescue event. The workflow was fairly straightforward. We broke out into small teams and identified two broader tasks. The first task was to review the metadata records and datasets parsed via the API, check each file for integrity, and create a metadata record in the DataRefuge CKAN repository.

The steps to create the record and upload files are fairly straightforward. We added the following information in the metadata record: title, description, tags, license, organization, source, maintainer, and maintainer email. Each value was taken directly from the original IMLS and NEH dataset records found on Data.gov.
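That field mapping can be sketched as a small function; the field names follow CKAN's `package_create` action, while the organization name and the sample record below are hypothetical:

```python
import json

def build_ckan_record(source_record):
    """Map fields copied from a Data.gov dataset record onto the fields
    CKAN's package_create action expects."""
    return {
        # CKAN dataset names must be URL-safe slugs
        "name": source_record["title"].lower().replace(" ", "-"),
        "title": source_record["title"],
        "notes": source_record.get("description", ""),
        "tags": [{"name": t} for t in source_record.get("tags", [])],
        "license_id": source_record.get("license", "notspecified"),
        "owner_org": "datarefuge",  # hypothetical organization slug
        "url": source_record.get("source", ""),
        "maintainer": source_record.get("maintainer", ""),
        "maintainer_email": source_record.get("maintainer_email", ""),
    }

# Invented example record for illustration:
record = build_ckan_record({
    "title": "IMLS Grants Awarded",
    "description": "Grants distributed by IMLS since 1996.",
    "tags": ["imls", "grants"],
    "license": "cc-zero",
    "source": "https://catalog.data.gov/dataset/example",
    "maintainer": "IMLS",
    "maintainer_email": "data@imls.gov",
})
print(json.dumps(record, indent=2))
```

In practice we entered these values through the CKAN web UI rather than the API, but the mapping from the Data.gov record to the CKAN record is the same.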


Once the metadata was entered, the corresponding datasets (files) were added to the record. Again, we used the original file names as provided by the agency. File formats are automatically detected in CKAN, but you can also select the format from a dropdown list.


The second task was to review each IMLS record for links to a landing page, which pointed to external pages and data. Instead of including this metadata and data in CKAN, several colleagues went through each of the 78 landing pages and searched for each URL in the Internet Archive Wayback Machine. If a page was not found, the URL was submitted to the Wayback Machine for crawling and then reviewed to make sure that additional resources linked from the initial page were also captured.
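This lookup-and-capture loop can be sketched against the Wayback Machine's public availability API and its "Save Page Now" endpoint; this is a minimal sketch of the checks we performed by hand, not the exact process we used:

```python
import json
from urllib.parse import quote

AVAILABILITY_API = "https://archive.org/wayback/available?url="
SAVE_ENDPOINT = "https://web.archive.org/save/"

def availability_query(url):
    """Build the availability-API request URL for a given page."""
    return AVAILABILITY_API + quote(url, safe="")

def closest_snapshot(response_text):
    """Return the URL of the closest archived snapshot, or None if the
    page has never been captured (shape from the availability API)."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def save_request(url):
    """Build the 'Save Page Now' URL that triggers a fresh crawl."""
    return SAVE_ENDPOINT + url

# Invented example response in the availability-API shape:
captured = json.dumps({"archived_snapshots": {"closest": {
    "available": True,
    "url": "http://web.archive.org/web/2017/https://www.imls.gov/",
    "timestamp": "20170401000000"}}})
never_captured = json.dumps({"archived_snapshots": {}})

print(closest_snapshot(captured))       # snapshot URL exists
print(closest_snapshot(never_captured)) # None: submit via save_request()
```

A page with no snapshot would be submitted through the save endpoint, and the resulting capture reviewed for linked resources, mirroring the manual steps above.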


For example, one of the landing pages takes you to a Public Libraries Survey (PLS) Data and Reports page with downloadable data that has a snapshot in the Wayback Machine. All of the metadata and files were reviewed in CKAN before they were made public. For the most part, the CKAN UI was easy to use and fairly intuitive. The only issues we encountered were with larger files, which could not be uploaded automatically, and with deleting a file that was duplicated in a record, which gave us a server error. The file-size issues have since been addressed; however, deleting a file from CKAN still generates a server error and will require further investigation.

As identified earlier, there are a number of different initiatives underway, many of which are focused on rescuing federal datasets or scientific datasets. Our small data rescue efforts were focused primarily on archiving data about NEH and IMLS activity to ensure that the impact and significance of these agency activities are documented and preserved. The rescued IMLS datasets can be found here and the NEH datasets can be found here. I also submitted information for the datasets to the US Federal Agency Coordination spreadsheet maintained by the University of Pennsylvania Libraries in an effort to document the work and make it possible for other librarians to find out whether this data has been archived. The questions and issues raised by Andrew Battista, Stephen Balogh, and Margaret Janz in their posts on the Libraries+ Network about communicating outside of silos, duplication of labor, institutional collection priorities, and data stores are also those that we discussed and considered during our travel down the path to small data rescue. These are important conversations that need to happen in and outside of our libraries in order to ensure that our efforts are happening collaboratively and meeting collective/community expectations.

There were many people involved in the initial conversations and brainstorming. I’d like to specifically thank my colleagues who participated in rescuing multiple datasets, primarily Sarah Melton, Chelcie Rowell, Kelly Webster, Ben Florin, Jesse Martinez, Sarah DeLorme, and Julia Hughes. I would also like to acknowledge colleagues who were consistently part of the ongoing discussion and helped shape our efforts, including Jason Imbesi, Scott W. Schwartz, Elizabeth Surles, Kimberly Eke, Sarah Wipperman, Margaret Janz, and Laurie Allen.