Engaging in Small Data Rescue

The following post is authored by Anna E. Kijas, a Senior Digital Scholarship Librarian at Boston College Libraries.

In late January 2017, conversations began on the Music Library Association (MLA) listserv about data rescue. These conversations were primarily inquiries to see if anyone on the MLA-L was aware of data archiving or rescue efforts underway for vulnerable performing arts and music data on government websites. As a digital scholarship librarian whose disciplinary background is grounded in musicology and librarianship, I eagerly joined the conversation and reached out to colleagues in the MLA and beyond, including the Society of American Archivists (SAA), in order to start an informal environmental scan of existing data rescue initiatives, as well as to strategize and identify efforts in which we (or our colleagues and institutions) could participate. A small group of us convened virtually, as well as at the 2017 MLA conference, to discuss our concerns, possible courses of action, and outreach to organizations and people. The idea of engaging in data rescue of performing arts and music data felt a bit overwhelming and we had many questions, such as:

  • Where do we start?!

  • Which federal agencies produce or host data that is vulnerable?

  • What should be archived?

  • Is anyone already archiving or rescuing this data?

  • Who is archiving social media accounts for these agencies?

We were able to answer some of these questions during these months of conversations or at least make decisions about what kind of data rescue we could engage in. We agreed that we did not wish to undertake any type of data rescue without first inquiring about existing guidelines and workflows already in place at agencies, such as those at the U.S. National Archives and Records Administration (NARA), nor did we wish to duplicate efforts already underway by other institutions or people. Duplication is not a bad thing; however, as Andrew Battista and Stephen Balogh write in a recent Libraries+ Network post entitled “The Challenge of Rescuing Federal Data: Thoughts and Lessons,” “given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.” As Battista and Balogh also found, it is difficult to identify whether an institution has already preserved federal data because there is no central clearinghouse or data store for all data rescue efforts. So, we began an informal environmental survey and identified several organized efforts to preserve federal government data or websites, reviewing their scope, criteria and selection, methodology, and participation type. I’ve grouped these efforts into categories based on two activity types: a) Website Crawling and b) Data Archiving and Distribution. Within these categories, there was a difference in who could suggest or participate in the data rescue, which included: public participation, internal staff only, or a combination of internal staff and public participation in which staff made primary selection decisions with feedback from the public.

Website Crawling

In the first category of Website Crawling, we reviewed the following initiatives:

1.     Nomination Tool, an End of Term Presidential Harvest project, was developed by the University of North Texas and designed by the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office. It enables librarians to contribute metadata about federal government websites in the Legislative, Executive, or Judicial branches for a focused crawl, and the project has preserved U.S. government websites since 2008. The project is divided into a comprehensive crawl and a prioritized crawl and provides criteria for the type of sites that are in scope.

2.     The Library of Congress Web Archives (LOCWA), established in 2000, also captures web content from websites, including U.S. federal government websites as well as non-U.S. government websites, including, but not limited to “foreign government, candidates for political office, political commentary, political parties, media, religious organizations, support groups, tributes and memorials, advocacy groups, educational and research institutions, creative expressions (cartoons, poetry, etc.), and blogs.” Staff in Library Services and the Law Library make selections for this archive.

3.    The Federal Depository Library Program (FDLP) is a program run by the U.S. Government Publishing Office to harvest and archive a selection of U.S. government websites using the Archive-It service from the Internet Archive.

4.     The Internet Archive Wayback Machine enables anyone to capture a snapshot of a webpage by crawling the site. Some institutions and organizations subscribe to Archive-It, a service developed by the Internet Archive that helps organizations harvest and preserve digital content.

While the Nomination Tool invites anyone to nominate content that will be reviewed and selected for crawling (during identified timeframes), selection for the LOC Web Archives is guided primarily by internal staff. Content for the FDLP is also selected by internal staff, but they welcome recommendations via email or through an online form.

Data Archiving and Distribution

In the second category, Data Archiving and Distribution, we identified the DataRefuge initiative that grew out of the Penn Program in the Environmental Humanities (PPEH LAB), “a collective of scholars, students, artists, scientists, and educators whose mission is to generate local and global awareness and engagement in the emergent area of the environmental humanities.” The aim of this initiative is to rescue federal climate and environmental data, going beyond the crawling of websites, and to create “trustworthy copies of federal climate and environmental data” through data archiving, distribution, and replication. This group launched a series of events known as Data Rescue events, which have been hosted across the United States and Canada since December 2016. There are of course open data repositories where scholars can deposit their own data, as well as individual institutional repositories.

In addition, we came across initiatives that were community driven, for example Project_ARCC, established in 2015 as “a community of archivists taking action on climate change.” This initiative facilitates conversation, resource sharing, and action amongst archivists (and non-archivists) who work with collections that may be impacted by climate change, and it promotes collections that can educate and raise public awareness about climate change. Another initiative, which can be viewed as a community call to action, was Endangered Data Week. The week of April 17-21, 2017 was designated as Endangered Data Week in order to raise awareness about vulnerable or at-risk datasets. Organizations, institutions, and individuals played an active role in determining the type of events and strategies they would undertake in order to contribute to this effort.

Aim & Strategy

The goal for us was to identify initiatives and projects that were focused specifically on archiving performing arts and music data from federal government websites that may be vulnerable or at-risk. During our environmental scan and outreach to individuals and institutions that work with or hold performing arts and music data, we identified the National Endowment for the Humanities (NEH), the National Endowment for the Arts (NEA), and the Institute of Museum and Library Services (IMLS) as the primary agencies for potential data rescue efforts. It is important to note that all three of these agencies were to have their funding eliminated in the budget blueprint issued by the current administration in March 2017 and since our initial exploration a new budget has been issued for FY2018 which proposes an elimination of these agencies. As mentioned earlier, there are several initiatives that crawl federal government websites, including the NEH, NEA, and IMLS, but it became apparent to us that no one else had yet “claimed” data from the NEH and IMLS aggregated in Data.gov. In addition, we initially discussed claiming data from the NEA, but determined that the ICPSR at University of Michigan already archives NEA datasets in their repository and UMass Amherst Special Collections & University Archives makes NEA publications on arts and arts management available online.

NEH and IMLS data found on the Data.gov domain is not considered big data, nor does it require us to follow archiving workflows as complex as those for climate change or science data, but it is nevertheless valuable and important for us to archive. The IMLS data documents the distribution of grants since 1996, including those focused on administration, assessment, innovation, leadership, library services, museum services, preservation, professional development, and programming. The NEH data contains administrative information documenting the agency’s activities in grant funding from its inception in 1965 through the present. The activities documented in the IMLS and NEH data can be used by individuals, organizations, and communities for different purposes, including to demonstrate impact and provide justification as to why funding for the arts, humanities, and cultural heritage organizations (i.e. archives, libraries, and museums) is important to the citizens of this country. This is what makes this data valuable. It documents the work that people, communities, and institutions around the country, from McGrath, Alaska, to Erie, Pennsylvania, to New York, New York, can accomplish with federal funding, which in many cases would be otherwise impossible.

Small Data Rescue at Boston College

I reached out to several Penn librarians working with DataRefuge to discuss our ideas and possibilities for archiving and claiming data. These conversations reinforced our decision to claim the NEH and IMLS datasets aggregated in Data.gov and to ingest the data into the shared CKAN repository maintained by DataRefuge, again reinforcing the community aspect of data rescue and making the data easier to locate by other colleagues. After speaking with colleagues at the MLA and SAA, several librarians and staff at Boston College agreed to host a data rescue event during Endangered Data Week modeled after the DataRefuge initiative. Our efforts were twofold: first, we wished to identify and rescue IMLS and NEH data by putting it into a shared CKAN repository and, second, we wanted to take this opportunity to do additional outreach to Boston College faculty and staff to encourage them to deposit their grant funded data into our institutional repository or instance of Dataverse.

To prepare for the data rescue event I met with Jesse Martinez (Library Applications Developer) to discuss pulling the data and metadata records for the NEH and IMLS datasets via the API. He was able to figure out the API query structure and pulled JSON formatted results using the following API queries:

  1. NEH
    https://catalog.data.gov/api/action/package_search?fq=(organization:%22neh-gov%22+AND+(type:dataset))
  2. IMLS
    https://catalog.data.gov/api/action/package_search?fq=(organization:%22imls-gov%22+AND+(type:dataset))

After the results were pulled, he parsed the API query results to get the dataset URIs. In the span of about 24 hours, Jesse put together a few Python scripts using Jupyter Notebook to parse and fetch each dataset. Each script will parse the JSON file to find each dataset URI, download, and save the dataset to a generated directory where it found the JSON file. If there are multiple dataset formats then each one will be downloaded to the same directory. A log file of each transaction is also generated. All of the queries, Python scripts, and additional instructions are available on our BC Digital Scholarship GitHub page.
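For readers who want a feel for that parse-and-fetch step, here is a minimal sketch. It is not the script Jesse wrote (those are on the GitHub page mentioned above); it only assumes the standard CKAN `package_search` response shape, in which datasets sit under `result.results` and each has a `resources` list with a `url`. The function names, the per-dataset subdirectory, and the log file name are illustrative assumptions.

```python
import json
import logging
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def dataset_resources(search_response):
    """Yield (dataset_name, resource_url) pairs from a package_search response."""
    for package in search_response.get("result", {}).get("results", []):
        for resource in package.get("resources", []):
            url = resource.get("url")
            if url:
                yield package.get("name", "unnamed"), url


def local_name(url):
    """Derive a file name from the URL path, with a fallback for bare paths."""
    return Path(urlparse(url).path).name or "download"


def rescue(json_path):
    """Download every resource listed in a saved API response, next to that file."""
    json_path = Path(json_path)
    # Assumption: one log file per JSON file, stored alongside it.
    logging.basicConfig(filename=str(json_path.with_suffix(".log")), level=logging.INFO)
    response = json.loads(json_path.read_text(encoding="utf-8"))
    for name, url in dataset_resources(response):
        # Assumption: one subdirectory per dataset, holding all of its formats.
        target_dir = json_path.parent / name
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / local_name(url)
        try:
            urllib.request.urlretrieve(url, target)
            logging.info("saved %s -> %s", url, target)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            logging.error("failed %s: %s", url, err)
```

Calling `rescue("neh.json")` on a saved copy of the NEH query result would then attempt each download and record the outcome in `neh.log`; the file name here is only an example.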

Using the documentation developed by DataRefuge as a guide, I drafted a minimal data rescue workflow to use with the IMLS datasets and NEH datasets at our data rescue event. The workflow was fairly straightforward. We broke out into small teams and identified two broader tasks. The first task was to review the metadata records and datasets parsed via the API, check each file for integrity, and create a metadata record in the DataRefuge CKAN repository.

The steps to create the record and upload files are simple. We added the following information in the metadata record: title, description, tags, license, organization, source, maintainer, and maintainer email. Each value was taken directly from the original IMLS and NEH dataset records found on Data.gov.
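At our event these fields were entered by hand in the CKAN web interface. For anyone who would rather script the step, the sketch below shows how the same eight fields might be mapped onto the dictionary that CKAN's `package_create` action accepts. The helper names and the `owner_org` value are hypothetical, and the assumption that CKAN's `url` field corresponds to "source" should be checked against the target repository.

```python
import re


def slugify(title):
    """Build a CKAN dataset name: lowercase ASCII letters, digits, '-' or '_'."""
    return re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")[:100]


def ckan_payload(source, owner_org):
    """Map a Data.gov package dict onto the fields we filled in by hand."""
    return {
        "name": slugify(source["title"]),
        "title": source["title"],
        "notes": source.get("notes", ""),  # description
        "tags": [{"name": tag["name"]} for tag in source.get("tags", [])],
        "license_id": source.get("license_id", ""),
        "owner_org": owner_org,  # organization in the target CKAN (assumed id)
        "url": source.get("url", ""),  # source (assumed mapping)
        "maintainer": source.get("maintainer", ""),
        "maintainer_email": source.get("maintainer_email", ""),
    }
```

The resulting dictionary would be sent as JSON to the repository's `package_create` endpoint with an API key; that request is left out here because it depends on the repository's configuration.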

 

Once the metadata was entered, the corresponding datasets (files) were added to the record. Again, we used the original file names as provided by the agency. File formats are automatically detected in CKAN, but you can also identify the format from a dropdown list.

 

The second task was to review each IMLS record for links to a landing page, which pointed to external pages and data. Instead of including these metadata and data in CKAN, several colleagues went through each of the 78 landing pages and searched for each URL in the Internet Archive Wayback Machine. If the page was not found, the URL was crawled in Wayback Machine and reviewed to make sure that additional resources were also captured if they were linked from the initial page.
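We did this lookup by hand, but the check lends itself to a small script. The sketch below uses the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`), which returns the closest snapshot for a URL when one exists. The function names are illustrative, and the sketch only reports whether a snapshot exists; it does not trigger a new crawl or verify linked resources.

```python
import json
import urllib.request
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the availability lookup URL for one landing page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(api_response):
    """Return the closest snapshot URL, or None when nothing is archived."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url):
    """Ask the Wayback Machine whether a landing page already has a snapshot."""
    with urllib.request.urlopen(availability_query(page_url), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Looping `check` over the list of landing page URLs would give a quick list of pages that still need to be captured and reviewed manually.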

 

For example, one of the landing pages takes you to a Public Libraries Survey (PLS) Data and Reports page with downloadable data that has a snapshot in Wayback Machine. All of the metadata and files were reviewed in CKAN before they were made public. For the most part, the CKAN UI was easy to use and fairly intuitive. We encountered only two issues: larger files could not be uploaded automatically, and deleting a file that was duplicated in a record produced a server error. At this time, the issues concerning file size have been addressed; however, deleting a file from CKAN still generates a server error and will require further investigation.

As identified earlier, there are a number of different initiatives underway, many of which are focused on rescuing federal datasets or scientific datasets. Our small data rescue efforts were focused primarily on archiving data about NEH and IMLS activity to ensure that the impact and significance of these agency activities are documented and preserved. The rescued IMLS datasets can be found here and the NEH datasets can be found here. I also submitted information for the datasets to the US Federal Agency Coordination spreadsheet maintained by the University of Pennsylvania Libraries in an effort to document the work and make it possible for other librarians to find out whether this data has been archived. The questions and issues raised by Andrew Battista, Stephen Balogh, and Margaret Janz in their posts on the Libraries+ Network about communicating outside of silos, duplication of labor, institutional collection priorities, and data stores are also those that we discussed and considered during our travel down the path to small data rescue. These are important conversations that need to happen in and outside of our libraries in order to ensure that our efforts are happening collaboratively and meeting collective/community expectations.

There were many people involved in the initial conversations and brainstorming. I’d like to specifically thank my colleagues who participated in rescuing multiple datasets, primarily Sarah Melton, Chelcie Rowell, Kelly Webster, Ben Florin, Jesse Martinez, Sarah DeLorme, and Julia Hughes. I would also like to acknowledge colleagues who were consistently part of the ongoing discussion and helped shape our efforts, including Jason Imbesi, Scott W. Schwartz, Elizabeth Surles, Kimberly Eke, Sarah Wipperman, Margaret Janz, and Laurie Allen.

 

The Challenge of Rescuing Federal Data: Thoughts and Lessons

The following is a guest post co-authored by Andrew Battista, Librarian for Geospatial Information Systems Services at The Wagner School of Public Service, NYU Libraries and Stephen Balogh, Data Services Specialist, NYU Libraries originally posted on Data Dispatch.

Recently, we experienced another panic over the ideological attack on scientific data. Rumors circulated that the EPA website, including all of the data it hosts, would be taken down. This appears to be just a rumor, for now. The current presidential administration notwithstanding, efforts to rescue data underscore what many people in the library community have known all along: even if federal data won’t vanish into thin air, much of it is poorly organized, partially documented, and effectively undiscoverable. Libraries can improve access to government data and should develop workflows for preserving federal data to make it more accessible.

Data rescue efforts began in January 2017, and over the past few months many institutions hosted hack-a-thon style events to scrape data and develop strategies for preservation. The Environmental Data & Governance Initiative (EDGI) developed a data rescue toolkit, which apportioned the challenge of saving data by distinct federal agency. The efforts of Data Refuge, a group based at Penn seeking to establish best practices for data rescue and preservation, have been written about in a number of places, including this blog.

We’ve had a number of conversations at NYU and with other members of the library community about the implications of preserving federal data and providing access to it. The efforts, while important, call attention to a problem of organization that is very large in scope and likely cannot be solved in full by libraries.

Also a metaphor for preserving federal data

Thus far, the divide-and-conquer model has postulated that individual institutions can “claim” a specific federal agency, do a deep dive to root around its websites, download data, and then mark the agency off a list as “preserved.” The process raises many questions, for libraries and for the data refuge movement. What does it mean to “claim” a federal agency? How can one institution reasonably develop a “chain of custody” for an agency’s comprehensive collection of data (and how do we define chain of custody)?

How do we avoid duplicated labor? Overlap is inevitable and isn’t necessarily a bad thing, but given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.

These questions suggest even more questions about communication. How do we know when a given institution has preserved federal data, and at what point do we feel ready as a community to acknowledge that preservation has sufficiently taken place? Further, do we expect institutions to communicate that a piece of data has been published, and if so, by what means? What does preservation mean, especially in an environment where data is changing frequently, and what is the standard for discovery? Is it sufficient for one person or institution to download a file and save it? And when an institution claims that it has “rescued” data from a government agency, what commitment does it have to keep up with data refreshes on a regular basis?

An example of an attempt to engage with these issues is Stanford University’s recent decision to preserve the Housing and Urban Development spatial datasets, since they were directly attacked by Republican lawmakers. Early in the Spring 2017 semester, Stanford downloaded all of HUD’s spatial data, created metadata records for them, and loaded them into their spatial discovery environment (EarthWorks).

A HUD dataset preserved in Stanford’s Spatial Data Repository and digital collections

We can see from the timestamp on their metadata record that the files were added on March 24, 2017. Stanford’s collection process is very robust and implies a level of curation and preservation that is impressive. As colleagues, we know that by adding a file, Stanford has committed to preserving it in its institutional repository, presenting original FGDC or ISO 19139 metadata records, and publishing their newly created records to OpenGeoMetadata, a consortium of shared geospatial metadata records. Furthermore, we know that all records are discoverable at the layer level, which suggests a granularity in description and access that often is not present at many other sources, including Data.gov.

However, if I had not had conversations with colleagues who work at Stanford, I wouldn’t have realized they preserved the files at all and likely would’ve tried to make records for NYU’s Spatial Data Repository. Even as they exist, it’s difficult for me to know that these files were in fact saved as part of the Data Refuge effort. Furthermore, Stanford has made no public claim or long-term “chain of custody” agreement for HUD data, simply because no standards for doing so currently exist.

Maybe it wouldn’t be the worst thing for NYU to add these files to our repository, but it seems unnecessary, given the magnitude of federal data to be preserved. However, some redundancy is a part of the goals that Data Refuge imagines:

Data collected as part of the #DataRefuge initiative will be stored in multiple, trusted locations to help ensure continued accessibility. […]DataRefuge acknowledges–and in fact draws attention to–the fact that there are no guarantees of perfectly safe information. But there are ways that we can create safe and trustworthy copies. DataRefuge is thus also a project to develop the best methods, practices, and protocols to do so.

Each institution has specific curatorial needs and responsibilities, which imply choices about providing access to materials in library collections. These practices seldom coalesce with data management and publishing practices from those who work with federal agencies. There has to be some flexibility between community efforts to preserve data, individual institutions and their respective curation practices.

“That’s Where the Librarians Come In”

NYU imagines a model that dovetails with the Data Refuge effort in which individual institutions build upon their own strengths and existing infrastructure. We took as a directive some advice that Kimberly Eke at Penn circulated, including this sample protocol. We quickly began to realize that no approach is perfect, but we wanted to develop a pilot process for collecting data and bringing it into our permanent geospatial data holdings. The remainder of this post is a narrative of that experience in order to demonstrate some of the choices we made, assumptions we started with, and strategies we deployed to preserve federal data. Our goal is to preserve a small subset of data in a way that benefits our users and also meets the standards of the Data Refuge movement.

We began by collecting the entirety of publicly accessible metadata from Data.gov, using the underlying CKAN data catalog API. This provided us with approximately 150,000 metadata records, stored as individual JSON files. Anyone who has worked with Data.gov metadata knows that it’s messy and inconsistent but is also a good starting place to develop better records. Furthermore, the concept of Data.gov serves as an effective registry or checklist (this global metadata vault could be another starting place); it’s not the only source of government data, nor is it necessarily authoritative. However, it is a good point of departure, a relatively centralized list of items that exist in a form that we can work with.
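A harvest like this can be sketched as a simple paging loop over the CKAN `package_search` action, which accepts `start` and `rows` parameters and reports a total `count`. The code below is an illustration under those assumptions, not our actual harvesting script; the page size, the one-file-per-record naming by `id`, and the function names are ours for the example.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlencode

API = "https://catalog.data.gov/api/action/package_search"


def fetch_page(start, rows):
    """Fetch one page of search results and return its 'result' object."""
    query = urlencode({"start": start, "rows": rows})
    with urllib.request.urlopen(API + "?" + query, timeout=60) as resp:
        return json.load(resp)["result"]


def harvest(out_dir, rows=1000, fetch=fetch_page):
    """Page through the catalog, saving each record as its own JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    start = saved = 0
    while True:
        result = fetch(start, rows)
        records = result["results"]
        if not records:
            break
        for record in records:
            path = out_dir / (record["id"] + ".json")
            path.write_text(json.dumps(record), encoding="utf-8")
            saved += 1
        start += len(records)
        if start >= result["count"]:
            break
    return saved
```

Passing `fetch` as a parameter keeps the loop easy to try against canned pages before pointing it at the live catalog.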

Since NYU Libraries already has a robust spatial data infrastructure and has established workflows for accessioning GIS data, we began by reducing the set of Data.gov records to those which are likely to represent spatial data. We did this by searching only for files that meet the following conditions:

  • Record contains at least one download resource with a ‘format’ field that contains any of {‘shapefile’, ‘geojson’, ‘kml’, ‘kmz’}
  • Record contains at least one resource with a ‘url’ field that contains any of {‘shapefile’, ‘geojson’, ‘kml’, [‘original’ followed by ‘.zip’]}
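Expressed as code, the two conditions might look like the sketch below. Two assumptions are worth flagging: the conditions are treated as alternatives (a record qualifies if any resource satisfies either one), and matching is case-insensitive. The names are illustrative.

```python
import re

SPATIAL_FORMATS = ("shapefile", "geojson", "kml", "kmz")
# 'original' followed (anywhere later in the URL) by '.zip', per the second condition.
SPATIAL_URL = re.compile(r"shapefile|geojson|kml|original.*\.zip", re.IGNORECASE)


def is_likely_spatial(record):
    """Return True when any resource looks like spatial data by format or URL."""
    for resource in record.get("resources", []):
        fmt = (resource.get("format") or "").lower()
        url = resource.get("url") or ""
        if any(token in fmt for token in SPATIAL_FORMATS) or SPATIAL_URL.search(url):
            return True
    return False
```

Applied to the harvested JSON files, a filter such as `[r for r in records if is_likely_spatial(r)]` would produce the candidate set.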

That search generated 6,353 records that are extremely likely to contain geospatial data. From that search we derived a subset of records and then transformed them into a .CSV file.
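Flattening the nested JSON into a spreadsheet-friendly table could be done along these lines. The column choice is an assumption made for illustration; the field names (`organization.title`, `metadata_modified`, `resources[].format`) are those found in standard CKAN package records.

```python
import csv

FIELDS = ["id", "title", "organization", "formats", "metadata_modified"]


def to_row(record):
    """Reduce one catalog record to a flat row for review in a spreadsheet."""
    org = record.get("organization") or {}
    formats = {(r.get("format") or "").strip() for r in record.get("resources", [])}
    return {
        "id": record.get("id", ""),
        "title": record.get("title", ""),
        "organization": org.get("title", ""),
        "formats": "; ".join(sorted(formats - {""})),
        "metadata_modified": record.get("metadata_modified", ""),
    }


def write_csv(records, path):
    """Write the flattened rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(to_row(r) for r in records)
```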

The next step was to filter down and look for meaningful patterns. We first filtered out all records that were not from federal sources, divided categories into like agencies, and started exploring them. Ultimately, we decided to rescue data from the Department of Agriculture, Forest Service. This agency seems to be a good test case for a number of the challenges that we’ve identified. We isolated 136 records and organized them here (click to view spreadsheet). However, we quickly realized that a sizable chunk of the records had already somehow become inactive or defunct after we had downloaded them (shaded in pink), perhaps because they had been superseded by another record. For example, this record is probably meant to represent the same data as this record. We can’t know for sure, which means we immediately had to decide what to do with potential gaps. We forged ahead with the records that were “live” in Data.gov.

About Metadata Cleaning

There are some limitations to the metadata in Data.gov that required our team to make a series of subjective decisions:

  1. Not everything in Data.gov points to an actual dataset. Often, records can point to other portals or clearinghouses of data that are not represented within Data.gov. We ultimately decided to omit these records from our data rescue effort, even if they point to a webpage, API, or geoservice that does contain some kind of data.
  2. The approach to establishing order on Data.gov is inconsistent. Most crucially for us, there is not a one-to-one correlation between a record and an individual layer of geospatial data. This happens frequently on federal sites. For instance, the record for the U.S. Forest Service Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic actually contains eight distinct shapefile layers that correspond to the different regions of coverage. NYU’s collection practice dictates that each of these layers be represented by a distinct record, but in the Data.gov catalog, they are condensed into a single record. 
  3. Not all data providers publish records for data on Data.gov consistently. Many agencies point to some element of their data that exists, but when you leave the Data.gov catalog environment and go to the source URL listed in the resources section of the record, you’ll find even more data. We had to make decisions about whether or not (and how) we would include this kind of data.
  4. It’s very common that single Data.gov metadata records remain intact, but the data that they represent changes. The Forest Service is a good example of this, as files are frequently refreshed and maintained within the USDA Forestry geodata clearinghouse. We did not make any effort in either of these cases to track down other sets of data that the Data.gov metadata records gesture toward (at least not at this time).

Relatedly, we did not make attempts to provide original records for different formats of what appeared to be the same data. In the case of the Forest Service, many of the records contained both a shapefile and a geodatabase, as well as other original metadata files. Our general approach was to save the shapefile and publish it in our collection environment, then bundle up all other “data objects” associated with a discrete Data.gov record and include them in the preservation environment of our Spatial Data Repository.
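When bundling the remaining data objects for preservation, a checksum manifest gives later users a way to verify that nothing has changed. The sketch below shows one way to produce such a manifest; the choice of SHA-256 and the "digest, two spaces, relative path" line format are assumptions for illustration, not a description of our repository's internals.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large geodatabases do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir):
    """Return one 'digest  relative/path' line per file in the bundle."""
    bundle_dir = Path(bundle_dir)
    files = sorted(p for p in bundle_dir.rglob("*") if p.is_file())
    return "\n".join(
        sha256_of(p) + "  " + p.relative_to(bundle_dir).as_posix() for p in files
    )
```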

Finally, we realized that the quality of the metadata itself varies widely. We found that it’s a good starting place for creating metadata for discovery, even if we agree that a Data.gov record is an arbitrary way to describe a single piece of data. However, we had to clean the Data.gov records to adhere to the GeoBlacklight standard and our own internal cataloging practices. Here’s a snapshot of the metadata in process.

Sample Record
{
dc_identifier_s: "http://hdl.handle.net/2451/12345",
dc_title_s: "2017 Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic - Region 1",
dc_description_s: "This polygon layer depicts aerial retardant avoidance areas for hydrographic feature data. Aerial retardant avoidance area for hydrographic feature data are based on high resolution National Hydrographic Dataset (NHD) produced by USGS and available from the USFS Enterprise Data Warehouse. Forests and/or regions have had the opportunity to modify the default NHD water representation (300ft buffer from all water features) for their areas of interest to accurately represent aerial fire retardant avoidance areas as described in the 2011 Record of Decision for the Nationwide Aerial Application of Fire Retardant on National Forest System Land EIS. These changes have been integrated into this dataset depicting aerial fire retardant avoidance areas for hydrographic features.The following process was used to develop the hydrographic areas to be avoided by aerial fire retardant. Using the FCODE attribute, streams/rivers/waterbodies are categorized into perennial and intermittent/ephemeral types. Linear features (streams & rivers) FCODES 46003 and 46006 and polygonal features (lakes and other waterbody) FCODES 39001, 39005, 39006, 43612, 43614, 46601 are considered intermittentt/ephemeral features. All other FCODES are considered to be perennial features. Underground and covered water features (e.g., pipelines) are excluded. Initially, all intermittent/ephemeral and perennial features were buffered by 300 feet by the Forest/Region units. Subsequently, Forest/Region units may have extended these buffers locally based on their requirements. The resulting avoidance areas may have overlapping features due to the buffering processes.The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000-scale and exists at that scale for the whole country. 
This high-resolution NHD, generally developed at 1:24,000/1:12,000 scale, adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Local resolution NHD is being developed where partners and data exist. The NHD contains reach codes for networked features, flow direction, names, and centerline representations for areal water bodies. Reaches are also defined on waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria established by the Federal Geographic Data Committee.This layer was preserved from the Data.gov catalog as part of the Data Refuge effort (www.datarefuge.org) and is a representation of catalog item 77781e81-17d6-4f91-a2df-7dfc7cb33eef. Some modifications to the metadata have been made. Refer to the checksum manifest for a list of all original data objects associated with this item. Refer to the documentation for original metadata and information on the data.",
dc_rights_s: "Public",
dct_provenance_s: "NYU",
dct_references_s: "{"http://schema.org/url":"http://hdl.handle.net/2451/12345","http://schema.org/downloadUrl":"https://archive.nyu.edu/bitstream/2451/12345/2/nyu_2451_12345.zip","http://www.opengis.net/def/serviceType/ogc/wfs":"https://maps-public.geo.nyu.edu/geoserver/sdr/wfs","http://www.opengis.net/def/serviceType/ogc/wms":"https://maps-public.geo.nyu.edu/geoserver/sdr/wms"}",
layer_id_s: "sdr:nyu_2451_12345",
layer_slug_s: "nyu_2451_12345",
layer_geom_type_s: "Polygon",
layer_modified_dt: "2017-5-2T19:45:8Z",
dc_format_s: "Shapefile",
dc_language_s: "English",
dc_type_s: "Dataset",
dc_publisher_s: [
"United States. Department of Agriculture"
],
dc_creator_sm: [ ],
dc_subject_sm: [
"Forest management",
"Hydrography",
"Fire prevention",
"Emergency management"
],
dct_isPartOf_sm: "Data.gov Rescue",
dct_issued_s: "04-01-2017",
dct_temporal_sm: [
"2017"
],
dct_spatial_sm: [
"United States of America"
],
dc_relation_sm: [
"http://sws.geonames.org/6252001/about/rdf"
],
solr_geom: "ENVELOPE(-170.1769013405, -64.5665435791, 71.6032483233, 24.7073204053)",
solr_year_i: 2017,
geoblacklight_version: "1.0"
}

Some of the revisions to the metadata are small and reflect choices that we make at NYU (these are highlighted in red). For instance, the titles were changed to reflect a date-title-area convention that we already use. Other fields (like Publisher) are authority controlled and were easy to change, while others, like format and provenance, were easy to add. For those unfamiliar with the GeoBlacklight standard, refer to the project schema pages and related documentation. Many of the metadata enhancements are system requirements for items to be discovered within our Spatial Data Repository. Subjects presented more of a problem, as these are drawn from an informal tagging system on Data.gov. We used an elaborate process of finding and replacing to remediate these subjects into the LCSH Authority, which connects the items we collect into our larger library discovery environment.
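The find-and-replace step for subjects can be pictured as a simple lookup table. The following is only a sketch of the idea, not the process we actually ran: the tag strings on the left are hypothetical examples of informal Data.gov tags, and the function name is invented for illustration. The headings on the right are the ones that appear in the sample record.

```python
# Hypothetical mapping from informal Data.gov tags (keys, invented for
# illustration) to the controlled headings used in the sample record.
TAG_TO_HEADING = {
    "hydrography": "Hydrography",
    "fire": "Fire prevention",
    "forest management": "Forest management",
    "emergency": "Emergency management",
}


def remediate_subjects(tags, mapping=TAG_TO_HEADING):
    """Return (controlled headings, tags that still need manual review)."""
    subjects, unmapped = [], []
    for tag in tags:
        # Normalize case and whitespace before looking the tag up.
        key = " ".join(tag.lower().split())
        heading = mapping.get(key)
        if heading is None:
            unmapped.append(tag)
        elif heading not in subjects:
            subjects.append(heading)
    return subjects, unmapped
```

Keeping the unmapped tags in a separate list makes it easy to see which subjects still need a person to choose a heading.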

The most significant changes are in the descriptions. We preserved the essence of the original Data.gov description, but we cleaned up the prose a little and added a way to trace the item that we are preserving back to its original representation in Data.gov. In the aforementioned cases, in which a single Data.gov record contains more than one shapefile, we generated an entirely new record and referenced it to the original Data.gov UUID. For example:

Sample Record
{
dc_identifier_s: "http://hdl.handle.net/2451/12346",
dc_title_s: "2017 Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic - Region 2",
dc_description_s: "This polygon layer depicts aerial retardant avoidance areas for hydrographic feature data. Aerial retardant avoidance area for hydrographic feature data are based on high resolution National Hydrographic Dataset (NHD) produced by USGS and available from the USFS Enterprise Data Warehouse. Forests and/or regions have had the opportunity to modify the default NHD water representation (300ft buffer from all water features) for their areas of interest to accurately represent aerial fire retardant avoidance areas as described in the 2011 Record of Decision for the Nationwide Aerial Application of Fire Retardant on National Forest System Land EIS. These changes have been integrated into this dataset depicting aerial fire retardant avoidance areas for hydrographic features.The following process was used to develop the hydrographic areas to be avoided by aerial fire retardant. Using the FCODE attribute, streams/rivers/waterbodies are categorized into perennial and intermittent/ephemeral types. Linear features (streams & rivers) FCODES 46003 and 46006 and polygonal features (lakes and other waterbody) FCODES 39001, 39005, 39006, 43612, 43614, 46601 are considered intermittentt/ephemeral features. All other FCODES are considered to be perennial features. Underground and covered water features (e.g., pipelines) are excluded. Initially, all intermittent/ephemeral and perennial features were buffered by 300 feet by the Forest/Region units. Subsequently, Forest/Region units may have extended these buffers locally based on their requirements. The resulting avoidance areas may have overlapping features due to the buffering processes.The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000-scale and exists at that scale for the whole country. 
This high-resolution NHD, generally developed at 1:24,000/1:12,000 scale, adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Local resolution NHD is being developed where partners and data exist. The NHD contains reach codes for networked features, flow direction, names, and centerline representations for areal water bodies. Reaches are also defined on waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria established by the Federal Geographic Data Committee.This layer does not have a discrete representation in Data.gov; rather, it is a data object represented on the record 77781e81-17d6-4f91-a2df-7dfc7cb33eef. Some modifications to the metadata have been made. Refer to the checksum manifest for a list of all original data objects associated with this item. Refer to the documentation for original metadata and information on the data.",
dc_rights_s: "Public",
dct_provenance_s: "NYU",
dct_references_s: "{"http://schema.org/url":"http://hdl.handle.net/2451/12345","http://schema.org/downloadUrl":"https://archive.nyu.edu/bitstream/2451/12345/2/nyu_2451_12345.zip","http://www.opengis.net/def/serviceType/ogc/wfs":"https://maps-public.geo.nyu.edu/geoserver/sdr/wfs","http://www.opengis.net/def/serviceType/ogc/wms":"https://maps-public.geo.nyu.edu/geoserver/sdr/wms"}",
layer_id_s: "sdr:nyu_2451_12345",
layer_slug_s: "nyu_2451_12345",
layer_geom_type_s: "Polygon",
layer_modified_dt: "2017-5-2T19:45:8Z",
dc_format_s: "Shapefile",
dc_language_s: "English",
dc_type_s: "Dataset",
dc_publisher_s: [
"United States. Department of Agriculture"
],
dc_creator_sm: [ ],
dc_subject_sm: [
"Forest management",
"Hydrography",
"Fire prevention",
"Emergency management"
],
dct_isPartOf_sm: "Data.gov Rescue",
dct_issued_s: "04-01-2017",
dct_temporal_sm: [
"2017"
],
dct_spatial_sm: [
"United States of America"
],
dc_relation_sm: [
"http://sws.geonames.org/6252001/about/rdf"
],
solr_geom: "ENVELOPE(-170.1769013405, -64.5665435791, 71.6032483233, 24.7073204053)",
solr_year_i: 2017,
geoblacklight_version: "1.0"
}

Adding these descriptions to the metadata field not only identifies our work with the Data Refuge movement, but also allows anyone who discovers this data to trace it back to its presentation in its original context. Still, it’s important to emphasize that this process inevitably means that not all data associated with the Forest Service has been rescued by NYU.

In all, by narrowing down to one agency and then doing a search that is likely to yield spatial data only, we ended up identifying 71 records of interest but ultimately publishing 90 individual records to represent this data (see our final spreadsheet). Note that we are still in the process of importing these records into our discovery environment.

Future Directions: Publishing Checksums

The difficulty libraries have in representing precisely and accurately which datasets, or components of datasets, have been preserved is a serious impediment to embarking on a distributed repository / data-rescue project. Further, libraries need to know if data objects have been preserved and where they reside. To return to the earlier example, how is New York University to know that a particular government dataset has already been “rescued” and is being preserved (either via a publicly accessible repository interface, or not)?

Moreover, even if there is a venue for institutions to discuss which government datasets fall within their collection priorities (e.g. “New York University cares about federal forestry data, and therefore will be responsible for the stewardship of that data”), it’s not clear that there is a good strategy for representing the myriad ways in which the data might exist in its “rescued” form. Perhaps the institution that elects to preserve a dataset wants to make a few curatorial decisions in order to better contextualize the data with the rest of the institution’s offerings (as we did with the Forest Service data). These types of decisions are not abnormal in the context of library accessioning.

The problem comes when data processing practices of an institution, which are often idiosyncratic and filled with “local” decisions to a certain degree, start to inhibit the ability for individuals to identify a copy of a dataset in the capacity of a copy. There is a potential tension between preservation –– preserving the original file structure, naming conventions, and even level of dissemination of government data products –– and discovery, where libraries often make decisions about the most useful way for users to find relevant data that are in conflict with the decisions exhibited in the source files.

For the purposes of mitigating the problem sketched above, we propose a data store that can be drawn upon by all members of the library / data-rescue community, whereby the arbitrary or locally-specific mappings and organizational decisions can be related back to original checksums of individual, atomic, files. File checksums would be unique identifiers in such a datastore, and given a checksum, this service would display “claims” about institutions that hold the corresponding file, and the context in which that file is accessible.
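To make the proposal a little more concrete, here is a minimal in-memory sketch of such a data store. Everything in it is an assumption made for illustration: the class and field names are hypothetical, and a real service would need persistence, authentication, and an agreed checksum algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One institution's statement that it holds a file (hypothetical fields)."""
    institution: str
    document_id: str  # local identifier of the package or record containing the file
    access_url: str   # where, and in what context, the copy can be reached


class ChecksumStore:
    """Maps a file checksum to every claim made about that file."""

    def __init__(self):
        self._claims = defaultdict(set)

    def add_claim(self, checksum, claim):
        # Checksums are compared case-insensitively; duplicate claims collapse.
        self._claims[checksum.lower()].add(claim)

    def lookup(self, checksum):
        """Return all claims for a checksum, or an empty list if none exist."""
        found = self._claims.get(checksum.lower(), ())
        return sorted(found, key=lambda c: (c.institution, c.document_id))
```

Because the checksum is the key, two institutions that package the same file in different ways would still show up side by side under a single lookup.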

Consider this as an example:

  • New York University, as part of an intentional data rescue effort, decides to focus on collecting and preserving data from the U.S. Forest Service.
  • The documents and data from Forest Service are accessible through many venues:
    • They (or some subset) are linked to from a Data.gov record
    • They (or some subset) are linked to directly from the FSGeodata Clearinghouse
    • They are available directly from a geoservices or FTP endpoint maintained by the Forest Service (such as here).
  • NYU wants a way to grab all of the documents from the Forest Service that it is aware of and make those documents available in an online repository. The question is, if NYU has made organizational and curatorial decisions about the presentation of documents rescued, how can it be represented (to others) that the files in the repository are indeed preserved copies of other datasets? If, for instance, Purdue University comes along and wants to verify that everything on the Forest Service’s site is preserved somewhere, it now becomes more difficult to do so, particularly since those documents never possessed a canonical or authoritative ID in the first place, and even could have been downloaded originally from various source URLs.

Imagine instead that as NYU accessions documents, restructuring them and adding metadata, they not only create checksum manifests (similar, if not identical, to the ones created by default by BagIt), but also deposit those manifests to a centralized data store in such a form that the data store could now relate essential information:

The file with checksum 8a53c3c191cd27e3472b3e717e3c2d7d979084b74ace0d1e86042b11b56f2797 appears as a component of the document instituton_a_9876... held by New York University.
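A rough sketch of how a manifest might be produced and turned into statements like the one above follows. It assumes SHA-256 as the checksum algorithm (our choice for the example, not something fixed by the proposal) and writes lines in the checksum-then-path layout that BagIt payload manifests use. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir):
    """Return one '<checksum>  <relative/path>' line per file in the package."""
    root = Path(package_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [f"{sha256_of(p)}  {p.relative_to(root).as_posix()}" for p in files]


def describe(manifest_line, document_id, institution):
    """Render a manifest line as a human-readable claim."""
    checksum, _path = manifest_line.split(None, 1)
    return (
        f"The file with checksum {checksum} appears as a component of "
        f"the document {document_id} held by {institution}."
    )
```

An institution would run something like `build_manifest` once per accessioned package and send the resulting lines, plus its name and local document identifier, to the shared store.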

Assuming all checksums are computed at the lowest possible level on files rescued from Federal agencies (i.e., always unzip archives, or otherwise get to an atomic file before computing a checksum), such a service could use archival manifest data as a way to signal to other institutions if a file has been preserved, regardless of whether or not it exists as a smaller component of a different intellectual entity –– and it could even communicate additional data about where to find these preserved copies. In the example of the dataset mentioned above, the original Data.gov record represents 8 distinct resources, including a Shapefile, a geodatabase, an XML metadata document, an HTML file that links to an API, and more. For the sake of preservation, we could package all of these items, generate checksums for each, and then take a further step in contributing our manifest to this hypothetical datastore. Then, as other institutions look to save other data objects, they could search against this datastore and find not merely checksums of items at the package level, but actually at the package component level, allowing them to evaluate which portion or percentage of data has been preserved.
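The two ideas in this paragraph, checksumming at the atomic level and measuring how much of a dataset has been preserved, could be sketched as follows. This is an illustration under stated assumptions: it handles only zip archives, uses SHA-256, and the function names are hypothetical. Note that any file built on the zip format (some office documents, for example) would also be opened up, which a real workflow would need to decide how to treat.

```python
import hashlib
import io
import zipfile


def atomic_checksums(name, data):
    """Yield (path, sha256) for each atomic file, descending into zip archives."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Recurse so that archives nested inside archives are unpacked too.
                yield from atomic_checksums(
                    f"{name}/{info.filename}", archive.read(info)
                )
    else:
        yield name, hashlib.sha256(data).hexdigest()


def coverage(source_checksums, preserved_checksums):
    """Return the fraction of source files whose checksum is already preserved."""
    wanted = set(source_checksums)
    if not wanted:
        return 0.0
    return len(wanted & set(preserved_checksums)) / len(wanted)
```

An institution considering a rescue could compute the atomic checksums of what the agency currently serves, pass them to `coverage` along with the checksums returned by the shared store, and see what share of the files still lacks a preserved copy.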

A system such as the one sketched above could efficiently communicate preservation priorities to a community of practice, and even find use for more general collection-development priorities of a library. Other work in this field, particularly that regarding IPFS, could tie in nicely –– but unlike IPFS, this would provide a way to identify content that exists within file archives, and would not necessitate any new infrastructure for hosting material. All it would require is for an institution to contribute checksum manifests and a small amount of accompanying metadata to a central datastore.

Principles

Even though our rescue of the Forest Service data is still in process, we have learned a lot about the challenges associated with this project. We’re very interested in learning about how other institutions are handling the process of rescuing federal data and look forward to more discussions at the event in Washington D.C. on May 8.

Communication across communities: Why isn't this working?

One of the lessons we've learned in working on the Data Refuge project is that librarians aren't the only people who have, for years, been discussing how to solve the problems associated with so much of our most important governmental information being available only digitally and online. Data professionals within government agencies, other government workers, people in the open data community, archivists, and researchers across disciplines have all been grappling with these challenges within their own communities. In hearing all these voices, we realized we need all of these perspectives to come together to solve this problem; thus the Libraries+ Network was ignited.

Many of these groups have at some point acknowledged that they needed voices and expertise from other communities; however, we have all either failed to talk to each other at all, or failed to create long-term productive collaborations.

Why is communicating across communities so difficult? This question looms large in so many aspects of my professional life. It seems functional communication and collaboration is yet another problem none of us have so far solved. Working with Data Refuge and Libraries+ has really brought the issue into focus for me and, as is the theme of Libraries+, I can see the problem with much better clarity, although I may not have the solution.

Cylinders of excellence

I've thought through a number of metaphors to explain the problem; bear with me while I go through them.

The idea of silos is somewhat apt, except it's more like we're in towers (ivory or otherwise); we can kind of see each other, depending on where our windows are, and we all see some of the landscape. We can holler to each other about what we see, but when we hear the hollering we only get some of the message, and it's a bit garbled from traveling so far. We need walkie talkies. We need binoculars.

We've also used the idea of having our hands on different parts of the elephant quite a bit. This metaphor also works pretty well, except elephants aren't so big that we wouldn't be able to say "Hey, this feels leathery" or "This feels hard and smooth" or "This is definitely a tail" to each other. Eventually someone would describe the trunk and we'd all be on the same page. The problem isn't so much that we can't hear or aren't listening; it's that we're actually speaking different languages to each other. The "cold round thing" you're describing might be totally different from how I would describe a tusk, and I'll keep imagining a snowball or a plate. Jargon is a huge obstacle that most of us are aware of but never seem to try to reconcile. Our translations are not as good as we think they are, if we attempt them at all. Most of the time it feels like we just get bogged down in the differences of our semantics and not the similarities in our meaning.

These metaphors run through my brain and quickly morph together into one of my favorite children's stories, Two Monsters, by David McKee. This story is about, believe it or not, two monsters who live on opposite sides of a mountain. They talk to each other through a hole in the mountain and are friends until one of them comments about the beautiful sunset they can see. 

Scene from Two Monsters, by David McKee, (c) 1985.

The monsters proceed to get in a huge argument and (spoilers!) throw rocks at each other over the mountain until they wear it down, the mountain is no more, and they can see what the other was trying to describe.

We need to get rid of our mountains. We have to be able to really listen to each other, and avoid filtering what we hear through our preconceived ideas about the problem. We have to be open to being wrong and open to being right with an asterisk. We have to stop being defensive when someone has a different way of doing things. We have to stop feeling like calling on other communities to support our weak spots means we are weak. We have to work together for real -- because that is how we're strongest.

I'm really excited for our May Meeting next week where so many voices will come together to describe their piece of the elephant. We're all going to need to get outside of our boxes and take a peek into others. But that's starting to be too many metaphors, isn't it?