The Challenge of Rescuing Federal Data: Thoughts and Lessons

The following is a guest post co-authored by Andrew Battista, Librarian for Geospatial Information Systems Services at The Wagner School of Public Service, NYU Libraries, and Stephen Balogh, Data Services Specialist, NYU Libraries. It was originally posted on Data Dispatch.

Recently, we experienced another panic over the ideological attack on scientific data: rumors circulated that the EPA website, including all of the data it hosts, would be taken down. For now, this appears to be just a rumor. The current presidential administration notwithstanding, efforts to rescue data underscore what many people in the library community have known all along: even if federal data won’t vanish into thin air, much of it is poorly organized, partially documented, and effectively undiscoverable. Libraries can improve access to government data, and they should develop workflows for preserving federal data to make it more accessible.

Data rescue efforts began in January 2017, and over the past few months many institutions have hosted hackathon-style events to scrape data and develop strategies for preservation. The Environmental Data & Governance Initiative (EDGI) developed a data rescue toolkit that apportions the challenge of saving data by federal agency. The efforts of Data Refuge, a group based at Penn that seeks to establish best practices for data rescue and preservation, have been written about in a number of places, including this blog.

We’ve had a number of conversations at NYU and with other members of the library community about the implications of preserving federal data and providing access to it. The efforts, while important, call attention to a problem of organization that is very large in scope and likely cannot be solved in full by libraries.

Also a metaphor for preserving federal data


Thus far, the divide-and-conquer model has postulated that individual institutions can “claim” a specific federal agency, do a deep dive to root around its websites, download data, and then mark the agency off a list as “preserved.” The process raises many questions, for libraries and for the data refuge movement. What does it mean to “claim” a federal agency? How can one institution reasonably develop a “chain of custody” for an agency’s comprehensive collection of data (and how do we define chain of custody)?

How do we avoid duplicated labor? Overlap is inevitable and isn’t necessarily a bad thing, but given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.

These questions suggest even more questions about communication. How do we know when a given institution has preserved federal data, and at what point do we feel ready as a community to acknowledge that preservation has sufficiently taken place? Further, do we expect institutions to communicate that a piece of data has been published, and if so, by what means? What does preservation mean, especially in an environment where data is changing frequently, and what is the standard for discovery? Is it sufficient for one person or institution to download a file and save it? And when an institution claims that it has “rescued” data from a government agency, what commitment does it have to keep up with data refreshes on a regular basis?

An example of an attempt to engage with these issues is Stanford University’s recent decision to preserve the Housing and Urban Development spatial datasets, since they were directly attacked by Republican lawmakers. Early in the Spring 2017 semester, Stanford downloaded all of HUD’s spatial data, created metadata records for them, and loaded them into their spatial discovery environment (EarthWorks).

A HUD dataset preserved in Stanford’s Spatial Data Repository and digital collections


We can see from the timestamp on their metadata record that the files were added on March 24, 2017. Stanford’s collection process is very robust and implies an impressive level of curation and preservation. As colleagues, we know that by adding a file, Stanford has committed to preserving it in its institutional repository, presenting original FGDC or ISO 19139 metadata records, and publishing its newly created records to OpenGeoMetadata, a consortium of shared geospatial metadata records. Furthermore, we know that all records are discoverable at the layer level, which implies a granularity in description and access that is often not present in other sources, including Data.gov.

However, if I had not had conversations with colleagues who work at Stanford, I wouldn’t have realized they had preserved the files at all, and I likely would’ve tried to make records for NYU’s Spatial Data Repository. Even now, it’s difficult for me to know that these files were in fact saved as part of the Data Refuge effort. Furthermore, Stanford has made no public claim or long-term “chain of custody” agreement for HUD data, simply because no standards for doing so currently exist.

Maybe it wouldn’t be the worst thing for NYU to add these files to our repository, but it seems unnecessary, given the magnitude of federal data to be preserved. However, some redundancy is a part of the goals that Data Refuge imagines:

Data collected as part of the #DataRefuge initiative will be stored in multiple, trusted locations to help ensure continued accessibility. […] DataRefuge acknowledges – and in fact draws attention to – the fact that there are no guarantees of perfectly safe information. But there are ways that we can create safe and trustworthy copies. DataRefuge is thus also a project to develop the best methods, practices, and protocols to do so.

Each institution has specific curatorial needs and responsibilities, which imply choices about providing access to materials in library collections. These practices seldom align with the data management and publishing practices of those who work with federal agencies. There has to be some flexibility among community efforts to preserve data, individual institutions, and their respective curation practices.

“That’s Where the Librarians Come In”

NYU imagines a model that dovetails with the Data Refuge effort in which individual institutions build upon their own strengths and existing infrastructure. We took as a directive some advice that Kimberly Eke at Penn circulated, including this sample protocol. We quickly began to realize that no approach is perfect, but we wanted to develop a pilot process for collecting data and bringing it into our permanent geospatial data holdings. The remainder of this post is a narrative of that experience in order to demonstrate some of the choices we made, assumptions we started with, and strategies we deployed to preserve federal data. Our goal is to preserve a small subset of data in a way that benefits our users and also meets the standards of the Data Refuge movement.

We began by collecting the entirety of publicly accessible metadata from Data.gov, using the underlying CKAN data catalog API. This provided us with approximately 150,000 metadata records, stored as individual JSON files. Anyone who has worked with Data.gov metadata knows that it’s messy and inconsistent, but it is also a good starting place for developing better records. Furthermore, Data.gov serves as an effective registry or checklist (this global metadata vault could be another starting place); it’s not the only source of government data, nor is it necessarily authoritative. However, it is a good point of departure, a relatively centralized list of items that exist in a form we can work with.
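
Below is a minimal sketch of how such a harvest might work, paging through the standard CKAN package_search action on catalog.data.gov and writing one JSON file per record. The page size, output directory, and pause are illustrative choices, and the endpoint’s exact limits should be verified against the live catalog.

import json
import pathlib
import time

import requests

API = "https://catalog.data.gov/api/3/action/package_search"
OUT = pathlib.Path("datagov_metadata")
OUT.mkdir(exist_ok=True)

start, page_size = 0, 1000
while True:
    resp = requests.get(API, params={"rows": page_size, "start": start}, timeout=60)
    resp.raise_for_status()
    result = resp.json()["result"]
    for record in result["results"]:
        # One JSON file per dataset record, named by its CKAN identifier.
        (OUT / f"{record['id']}.json").write_text(json.dumps(record, indent=2))
    start += page_size
    if start >= result["count"]:
        break
    time.sleep(1)  # be polite to the catalog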

Since NYU Libraries already has a robust spatial data infrastructure and has established workflows for accessioning GIS data, we began by reducing the set of Data.gov records to those which are likely to represent spatial data. We did this by searching only for files that meet the following conditions:

  • Record contains at least one download resource with a ‘format’ field that contains any of {‘shapefile’, ‘geojson’, ‘kml’, ‘kmz’}
  • Record contains at least one resource with a ‘url’ field that contains any of {‘shapefile’, ‘geojson’, ‘kml’, [‘original’ followed by ‘.zip’]}

That search generated 6,353 records that are extremely likely to contain geospatial data, which we then transformed into a CSV for review.
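
The filter itself can be expressed compactly. The sketch below applies the two conditions listed above to the harvested JSON files and writes the survivors to a CSV; the field names (‘resources’, ‘format’, ‘url’, ‘organization’) follow the CKAN package schema, and the ‘original’ + ‘.zip’ test is an approximation of the second condition.

import csv
import json
import pathlib

FORMAT_HINTS = {"shapefile", "geojson", "kml", "kmz"}
URL_HINTS = {"shapefile", "geojson", "kml"}

def looks_spatial(record):
    # Keep a record if any attached resource looks like a spatial file.
    for res in record.get("resources", []):
        fmt = (res.get("format") or "").lower()
        url = (res.get("url") or "").lower()
        if any(hint in fmt for hint in FORMAT_HINTS):
            return True
        if any(hint in url for hint in URL_HINTS):
            return True
        if "original" in url and url.endswith(".zip"):
            return True
    return False

rows = []
for path in pathlib.Path("datagov_metadata").glob("*.json"):
    record = json.loads(path.read_text())
    if looks_spatial(record):
        rows.append({
            "id": record.get("id"),
            "title": record.get("title"),
            "organization": (record.get("organization") or {}).get("title"),
            "num_resources": len(record.get("resources", [])),
        })

with open("spatial_candidates.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "title", "organization", "num_resources"])
    writer.writeheader()
    writer.writerows(rows)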

The next step was to filter down and look for meaningful patterns. We first filtered out all records that were not from federal sources, divided the records into groups by agency, and started exploring them. Ultimately, we decided to rescue data from the Department of Agriculture, Forest Service. This agency seems to be a good test case for a number of the challenges that we’ve identified. We isolated 136 records and organized them here (click to view spreadsheet). However, we quickly realized that a sizable chunk of the records had somehow become inactive or defunct after we had downloaded them (shaded in pink), perhaps because they had been superseded by another record. For example, this record is probably meant to represent the same data as this record. We can’t know for sure, which means we immediately had to decide what to do with potential gaps. We forged ahead with the records that were still “live” in Data.gov.

About Metadata Cleaning

There are some limitations to the metadata in Data.gov that required our team to make a series of subjective decisions:

  1. Not everything in Data.gov points to an actual dataset. Often, records can point to other portals or clearinghouses of data that are not represented within Data.gov. We ultimately decided to omit these records from our data rescue effort, even if they point to a webpage, API, or geoservice that does contain some kind of data.
  2. The approach to establishing order on Data.gov is inconsistent. Most crucially for us, there is not a one-to-one correlation between a record and an individual layer of geospatial data. This happens frequently on federal sites. For instance, the record for the U.S. Forest Service Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic actually contains eight distinct shapefile layers that correspond to the different regions of coverage. NYU’s collection practice dictates that each of these layers be represented by a distinct record, but in the Data.gov catalog, they are condensed into a single record. 
  3. Not all data providers publish records for data on Data.gov consistently. Many agencies point to some element of their data that exists, but when you leave the Data.gov catalog environment and go to the source URL listed in the resources section of the record, you’ll find even more data. We had to make decisions about whether or not (and how) we would include this kind of data.
  4. It’s very common for a Data.gov metadata record to remain intact while the data it represents changes. The Forest Service is a good example of this, as files are frequently refreshed and maintained within the USDA Forestry geodata clearinghouse. We did not make any effort in these cases to track down other sets of data that the Data.gov metadata records gesture toward (at least not at this time).

Relatedly, we did not attempt to create separate records for different formats of what appeared to be the same data. In the case of the Forest Service, many of the records contained both a shapefile and a geodatabase, as well as other original metadata files. Our general approach was to save the shapefile and publish it in our collection environment, then bundle up all other “data objects” associated with a discrete Data.gov record and include them in the preservation environment of our Spatial Data Repository.
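
As a rough illustration of that bundling step, the sketch below downloads every resource attached to a record and zips them into a single preservation package per Data.gov identifier. The staging directory, the naive file naming, and the lack of retry logic are simplifications for illustration, not our production workflow.

import pathlib
import zipfile

import requests

def bundle_record(record, staging="staging"):
    """Download every resource of a Data.gov record and zip them for preservation."""
    item_dir = pathlib.Path(staging) / record["id"]
    item_dir.mkdir(parents=True, exist_ok=True)
    for res in record.get("resources", []):
        url = res.get("url")
        if not url:
            continue
        # Use the last path segment as a file name; a real workflow needs collision handling.
        name = url.rstrip("/").split("/")[-1] or "resource"
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        (item_dir / name).write_bytes(resp.content)
    bundle = item_dir.with_suffix(".zip")
    with zipfile.ZipFile(bundle, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(item_dir.iterdir()):
            zf.write(path, arcname=path.name)
    return bundle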

Finally, we realized that the quality of the metadata itself varies widely. It is a good starting place for creating discovery metadata, even if we agree that a Data.gov record is an arbitrary way to describe a single piece of data. However, we had to clean the Data.gov records to adhere to the GeoBlacklight standard and our own internal cataloging practices. Here’s a snapshot of the metadata in process.

Sample Record
{
dc_identifier_s: "http://hdl.handle.net/2451/12345",
dc_title_s: "2017 Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic - Region 1",
dc_description_s: "This polygon layer depicts aerial retardant avoidance areas for hydrographic feature data. Aerial retardant avoidance area for hydrographic feature data are based on high resolution National Hydrographic Dataset (NHD) produced by USGS and available from the USFS Enterprise Data Warehouse. Forests and/or regions have had the opportunity to modify the default NHD water representation (300ft buffer from all water features) for their areas of interest to accurately represent aerial fire retardant avoidance areas as described in the 2011 Record of Decision for the Nationwide Aerial Application of Fire Retardant on National Forest System Land EIS. These changes have been integrated into this dataset depicting aerial fire retardant avoidance areas for hydrographic features.The following process was used to develop the hydrographic areas to be avoided by aerial fire retardant. Using the FCODE attribute, streams/rivers/waterbodies are categorized into perennial and intermittent/ephemeral types. Linear features (streams & rivers) FCODES 46003 and 46006 and polygonal features (lakes and other waterbody) FCODES 39001, 39005, 39006, 43612, 43614, 46601 are considered intermittentt/ephemeral features. All other FCODES are considered to be perennial features. Underground and covered water features (e.g., pipelines) are excluded. Initially, all intermittent/ephemeral and perennial features were buffered by 300 feet by the Forest/Region units. Subsequently, Forest/Region units may have extended these buffers locally based on their requirements. The resulting avoidance areas may have overlapping features due to the buffering processes.The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000-scale and exists at that scale for the whole country. This high-resolution NHD, generally developed at 1:24,000/1:12,000 scale, adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Local resolution NHD is being developed where partners and data exist. The NHD contains reach codes for networked features, flow direction, names, and centerline representations for areal water bodies. Reaches are also defined on waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria established by the Federal Geographic Data Committee.This layer was preserved from the Data.gov catalog as part of the Data Refuge effort (www.datarefuge.org) and is a representation of catalog item 77781e81-17d6-4f91-a2df-7dfc7cb33eef. Some modifications to the metadata have been made. Refer to the checksum manifest for a list of all original data objects associated with this item. Refer to the documentation for original metadata and information on the data.",
dc_rights_s: "Public",
dct_provenance_s: "NYU",
dct_references_s: "{"http://schema.org/url":"http://hdl.handle.net/2451/12345","http://schema.org/downloadUrl":"https://archive.nyu.edu/bitstream/2451/12345/2/nyu_2451_12345.zip","http://www.opengis.net/def/serviceType/ogc/wfs":"https://maps-public.geo.nyu.edu/geoserver/sdr/wfs","http://www.opengis.net/def/serviceType/ogc/wms":"https://maps-public.geo.nyu.edu/geoserver/sdr/wms"}",
layer_id_s: "sdr:nyu_2451_12345",
layer_slug_s: "nyu_2451_12345",
layer_geom_type_s: "Polygon",
layer_modified_dt: "2017-5-2T19:45:8Z",
dc_format_s: "Shapefile",
dc_language_s: "English",
dc_type_s: "Dataset",
dc_publisher_s: [
"United States. Department of Agriculture"
],
dc_creator_sm: [ ],
dc_subject_sm: [
"Forest management",
"Hydrography",
"Fire prevention",
"Emergency management"
],
dct_isPartOf_sm: "Data.gov Rescue",
dct_issued_s: "04-01-2017",
dct_temporal_sm: [
"2017"
],
dct_spatial_sm: [
"United States of America"
],
dc_relation_sm: [
"http://sws.geonames.org/6252001/about/rdf"
],
solr_geom: "ENVELOPE(-170.1769013405, -64.5665435791, 71.6032483233, 24.7073204053)",
solr_year_i: 2017,
geoblacklight_version: "1.0"
}

Some of the revisions to the metadata are small and reflect choices that we make at NYU (these are highlighted in red). For instance, the titles were changed to reflect a date-title-area convention that we already use. Other fields (like Publisher) are authority controlled and were easy to change, while others, like format and provenance, were easy to add. For those unfamiliar with the GeoBlacklight standard, refer to the project schema pages and related documentation. Many of the metadata enhancements are system requirements for items to be discovered within our Spatial Data Repository. Subjects presented more of a problem, as these are drawn from an informal tagging system on Data.gov. We used an elaborate process of finding and replacing to remediate these subjects into LCSH authority headings, which connects the items we collect to our larger library discovery environment.
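
The subject remediation amounts to a lookup from informal Data.gov tags to controlled headings. A toy version of that find-and-replace step is sketched below; the mappings shown are illustrative examples only, not the actual table we maintain.

# Map informal Data.gov tags to LCSH headings via a hand-maintained lookup table.
LCSH_LOOKUP = {
    "fire": "Fire prevention",
    "hydrograph": "Hydrography",
    "forest": "Forest management",
    "emergency": "Emergency management",
}

def remediate_subjects(datagov_tags):
    """Return a sorted, de-duplicated list of LCSH headings for a record's tags."""
    headings = set()
    for tag in datagov_tags:
        for fragment, heading in LCSH_LOOKUP.items():
            if fragment in tag.lower():
                headings.add(heading)
    return sorted(headings)

print(remediate_subjects(["Aerial Fire Retardant", "hydrographic", "forests"]))
# ['Fire prevention', 'Forest management', 'Hydrography']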

The most significant changes are in the descriptions. We preserved the essence of the original Data.gov description, yet we cleaned up the prose a bit and added a way to trace the item that we are preserving back to its original representation in Data.gov. In the aforementioned cases, in which a single Data.gov record contains more than one shapefile, we generated an entirely new record and referenced it to the original Data.gov UUID. For example:

Sample Record
{
dc_identifier_s: "http://hdl.handle.net/2451/12346",
dc_title_s: "2017 Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic - Region 2",
dc_description_s: "This polygon layer depicts aerial retardant avoidance areas for hydrographic feature data. Aerial retardant avoidance area for hydrographic feature data are based on high resolution National Hydrographic Dataset (NHD) produced by USGS and available from the USFS Enterprise Data Warehouse. Forests and/or regions have had the opportunity to modify the default NHD water representation (300ft buffer from all water features) for their areas of interest to accurately represent aerial fire retardant avoidance areas as described in the 2011 Record of Decision for the Nationwide Aerial Application of Fire Retardant on National Forest System Land EIS. These changes have been integrated into this dataset depicting aerial fire retardant avoidance areas for hydrographic features.The following process was used to develop the hydrographic areas to be avoided by aerial fire retardant. Using the FCODE attribute, streams/rivers/waterbodies are categorized into perennial and intermittent/ephemeral types. Linear features (streams & rivers) FCODES 46003 and 46006 and polygonal features (lakes and other waterbody) FCODES 39001, 39005, 39006, 43612, 43614, 46601 are considered intermittentt/ephemeral features. All other FCODES are considered to be perennial features. Underground and covered water features (e.g., pipelines) are excluded. Initially, all intermittent/ephemeral and perennial features were buffered by 300 feet by the Forest/Region units. Subsequently, Forest/Region units may have extended these buffers locally based on their requirements. The resulting avoidance areas may have overlapping features due to the buffering processes.The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000-scale and exists at that scale for the whole country. This high-resolution NHD, generally developed at 1:24,000/1:12,000 scale, adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Local resolution NHD is being developed where partners and data exist. The NHD contains reach codes for networked features, flow direction, names, and centerline representations for areal water bodies. Reaches are also defined on waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria established by the Federal Geographic Data Committee.This layer does not have a discrete representation in Data.gov; rather, it is a data object represented on the record 77781e81-17d6-4f91-a2df-7dfc7cb33eef. Some modifications to the metadata have been made. Refer to the checksum manifest for a list of all original data objects associated with this item. Refer to the documentation for original metadata and information on the data.",
dc_rights_s: "Public",
dct_provenance_s: "NYU",
dct_references_s: "{"http://schema.org/url":"http://hdl.handle.net/2451/12345","http://schema.org/downloadUrl":"https://archive.nyu.edu/bitstream/2451/12345/2/nyu_2451_12345.zip","http://www.opengis.net/def/serviceType/ogc/wfs":"https://maps-public.geo.nyu.edu/geoserver/sdr/wfs","http://www.opengis.net/def/serviceType/ogc/wms":"https://maps-public.geo.nyu.edu/geoserver/sdr/wms"}",
layer_id_s: "sdr:nyu_2451_12345",
layer_slug_s: "nyu_2451_12345",
layer_geom_type_s: "Polygon",
layer_modified_dt: "2017-5-2T19:45:8Z",
dc_format_s: "Shapefile",
dc_language_s: "English",
dc_type_s: "Dataset",
dc_publisher_s: [
"United States. Department of Agriculture"
],
dc_creator_sm: [ ],
dc_subject_sm: [
"Forest management",
"Hydrography",
"Fire prevention",
"Emergency management"
],
dct_isPartOf_sm: "Data.gov Rescue",
dct_issued_s: "04-01-2017",
dct_temporal_sm: [
"2017"
],
dct_spatial_sm: [
"United States of America"
],
dc_relation_sm: [
"http://sws.geonames.org/6252001/about/rdf"
],
solr_geom: "ENVELOPE(-170.1769013405, -64.5665435791, 71.6032483233, 24.7073204053)",
solr_year_i: 2017,
geoblacklight_version: "1.0"
}

Adding these descriptions to the metadata not only identifies our work with the Data Refuge movement, but it also allows anyone who discovers this data to trace it back to its presentation in the original context. Still, it’s important to emphasize that this process inevitably means that not all data associated with the Forest Service has been rescued by NYU.

In all, by narrowing down to one agency and then doing a search that is likely to yield spatial data only, we ended up identifying 71 records of interest but ultimately publishing 90 individual records to represent this data (see our final spreadsheet). Note that we are still in the process of importing these records into our discovery environment.

Future Directions: Publishing Checksums

Libraries’ inability to represent precisely and accurately which datasets, or components of datasets, have been preserved is a serious impediment to embarking on a distributed repository / data-rescue project. Further, libraries need to know if data objects have been preserved and where they reside. To return to the earlier example, how is New York University to know that a particular government dataset has already been “rescued” and is being preserved (whether via a publicly accessible repository interface or not)?

Moreover, even if there is a venue for institutions to discuss which government datasets fall within their collection priorities (e.g. “New York University cares about federal forestry data, and therefore will be responsible for the stewardship of that data”), it’s not clear that there is a good strategy for representing the myriad ways in which the data might exist in its “rescued” form. Perhaps the institution that elects to preserve a dataset wants to make a few curatorial decisions in order to better contextualize the data with the rest of the institution’s offerings (as we did with the Forest Service data). These types of decisions are not abnormal in the context of library accessioning.

The problem comes when the data processing practices of an institution, which are often idiosyncratic and filled with “local” decisions, start to inhibit the ability of individuals to recognize a copy of a dataset as a copy. There is a potential tension between preservation (keeping the original file structure, naming conventions, and even level of dissemination of government data products) and discovery, where libraries often make decisions about the most useful way for users to find relevant data that conflict with the decisions exhibited in the source files.

To mitigate the problem sketched above, we propose a data store that can be drawn upon by all members of the library / data-rescue community, whereby arbitrary or locally specific mappings and organizational decisions can be related back to the original checksums of individual, atomic files. File checksums would be the unique identifiers in such a data store; given a checksum, the service would display “claims” about institutions that hold the corresponding file and the context in which that file is accessible.

Consider this as an example:

  • New York University, as part of an intentional data rescue effort, decides to focus on collecting and preserving data from the U.S. Forest Service.
  • The documents and data from Forest Service are accessible through many venues:
    • They (or some subset) are linked to from a Data.gov record
    • They (or some subset) are linked to directly from the FSGeodata Clearinghouse
    • They are available directly from a geoservices or FTP endpoint maintained by the Forest Service (such as here).
  • NYU wants a way to grab all of the documents from the Forest Service that it is aware of and make those documents available in an online repository. The question is, if NYU has made organizational and curatorial decisions about the presentation of documents rescued, how can it be represented (to others) that the files in the repository are indeed preserved copies of other datasets? If, for instance, Purdue University comes along and wants to verify that everything on the Forest Service’s site is preserved somewhere, it now becomes more difficult to do so, particularly since those documents never possessed a canonical or authoritative ID in the first place, and even could have been downloaded originally from various source URLs.

Imagine instead that as NYU accessions documents, restructuring them and adding metadata, it not only creates checksum manifests (similar to, if not identical to, the ones created by default by BagIt), but also deposits those manifests to a centralized data store in such a form that the data store could relate essential information:

The file with checksum 8a53c3c191cd27e3472b3e717e3c2d7d979084b74ace0d1e86042b11b56f2797 appears as a component of the document institution_a_9876... held by New York University.

Assuming all checksums are computed at the lowest possible level on files rescued from Federal agencies (i.e., always unzip archives, or otherwise get to an atomic file before computing a checksum), such a service could use archival manifest data as a way to signal to other institutions if a file has been preserved, regardless of whether or not it exists as a smaller component of a different intellectual entity –– and it could even communicate additional data about where to find these preserved copies. In the example of the dataset mentioned above, the original Data.gov record represents 8 distinct resources, including a Shapefile, a geodatabase, an XML metadata document, an HTML file that links to an API, and more. For the sake of preservation, we could package all of these items, generate checksums for each, and then take a further step in contributing our manifest to this hypothetical datastore. Then, as other institutions look to save other data objects, they could search against this datastore and find not merely checksums of items at the package level, but actually at the package component level, allowing them to evaluate which portion or percentage of data has been preserved.
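
A minimal sketch of this idea appears below: unzip a rescued package down to its atomic files, compute a SHA-256 for each, and wrap the result in a “claim” record that a shared data store could index by checksum. The claim fields and the data store itself are hypothetical; the manifest loosely mirrors a BagIt manifest-sha256.txt, and the package name reuses the sample record above.

import hashlib
import json
import pathlib
import zipfile

def sha256(path, chunk_size=1 << 20):
    # Stream the file so large rasters and geodatabases don't exhaust memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest_for(package_zip, workdir="unpacked"):
    """Unzip a rescued package and checksum every atomic file inside it."""
    target = pathlib.Path(workdir) / pathlib.Path(package_zip).stem
    with zipfile.ZipFile(package_zip) as zf:
        zf.extractall(target)
    return {
        str(p.relative_to(target)): sha256(p)
        for p in target.rglob("*") if p.is_file()
    }

manifest = manifest_for("nyu_2451_12345.zip")  # assumes the package exists locally
claim = {
    "institution": "New York University",
    "item": "nyu_2451_12345",                  # local identifier for the preserved package
    "source": "data.gov:77781e81-17d6-4f91-a2df-7dfc7cb33eef",
    "access_url": "http://hdl.handle.net/2451/12345",
    "files": manifest,                          # per-file checksums the datastore would index
}
print(json.dumps(claim, indent=2))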

A system such as the one sketched above could efficiently communicate preservation priorities to a community of practice, and even find use for more general collection-development priorities of a library. Other work in this field, particularly that regarding IPFS, could tie in nicely –– but unlike IPFS, this would provide a way to identify content that exists within file archives, and would not necessitate any new infrastructure for hosting material. All it would require is for an institution to contribute checksum manifests and a small amount of accompanying metadata to a central datastore.

Principles

Even though our rescue of the Forest Service data is still in process, we have learned a lot about the challenges associated with this project. We’re very interested in learning about how other institutions are handling the process of rescuing federal data and look forward to more discussions at the event in Washington D.C. on May 8.

Communication across communities: Why isn't this working?

One of the lessons we've learned in working on the Data Refuge project is that librarians aren't the only people who have, for years, been discussing how to solve the problems that arise when so much of our most important governmental information is available only digitally and online. Data professionals within government agencies, other government workers, people in the open data community, archivists, and researchers across disciplines have all been grappling with these challenges within their own communities. In hearing all these voices, we realized we need all of these perspectives to come together to solve this problem; thus the Libraries+ Network was ignited.

Many of these groups have at some point acknowledged that they needed voices and expertise from other communities; however, we have all either failed to talk to each other at all or failed to create long-term productive collaborations.

Why is communicating across communities so difficult? This question looms large in so many aspects of my professional life. It seems functional communication and collaboration are yet another problem none of us has solved so far. Working with Data Refuge and Libraries+ has really brought the issue into focus for me; as is the theme of Libraries+, I can now see the problem with much better clarity, even if I don't yet have the solution.

Cylinders of excellence


I've thought through a number of metaphors to explain the problem; bear with me while I go through them.

The idea of silos is somewhat apt, except it's more like we're in towers (ivory or otherwise); we can kind of see each other, depending on where our windows are, and we all see some of the landscape. We can holler to each other about what we see, but when we hear the hollering we only get some of the message, and it's a bit garbled from traveling so far. We need walkie talkies. We need binoculars.

We've also used the idea of having our hands on different parts of the elephant quite a bit. This metaphor also works pretty well, except elephants aren't so big that we wouldn't be able to say "Hey, this feels leathery" or "This feels hard and smooth" or "This is definitely a tail" to each other. Eventually someone would describe the trunk and we'd all be on the same page. The problem isn't so much that we can't hear or aren't listening; it's that we're actually speaking different languages to each other. The "cold round thing" you're describing might be totally different from how I would describe a tusk, and I'll keep imagining a snowball or a plate. Jargon is a huge obstacle most of us are aware of but never seem to try to reconcile. Our translations are not as good as we think they are, if we attempt them at all. Most of the time it feels like we get bogged down in the differences of our semantics and not the similarities in our meaning.



These metaphors run through my brain and quickly morph together into one of my favorite children's stories, Two Monsters, by David McKee. This story is about, believe it or not, two monsters who live on opposite sides of a mountain. They talk to each other through a hole in the mountain and are friends until one of them comments about the beautiful sunset they can see. 

Scene from Two Monsters, by David McKee, (c) 1985.



The monsters proceed to get in a huge argument and (spoilers!) throw rocks at each other over the mountain until they wear it down, the mountain is no more, and they can see what the other was trying to describe.

We need to get rid of our mountains. We have to be able to really listen to each other, and avoid filtering what we hear through our preconceived ideas about the problem. We have to be open to being wrong and open to being right with an asterisk. We have to stop being defensive when someone has a different way of doing things. We have to stop feeling like calling on other communities to support our weak spots means we are weak. We have to work together for real -- because that is how we're strongest.

I'm really excited for our May Meeting next week where so many voices will come together to describe their piece of the elephant. We're all going to need to get outside of our boxes and take a peek into others. But that's starting to be too many metaphors, isn't it?
 

On the Preservation of and Access to NOAA’s Open Data

By Dr. Edward J. Kearns, NOAA Chief Data Officer
Ed.Kearns@noaa.gov

Recent articles in the popular press and across various social media platforms have raised concerns over the continued preservation and utilization of federal data holdings, particularly NOAA’s climate-related data. These concerns have produced a number of coordinated efforts to download and store significant volumes of NOAA’s data outside of the federal data systems. While I do not share those same concerns about preservation, as NOAA’s new Chief Data Officer I recognize that the essential idea that enables these efforts, easy public access to all of NOAA’s open data, is a laudable one that NOAA’s data stewards are striving to achieve. Let’s talk about open data access first, and I’ll come back to those concerns related to preservation later.

NOAA employs many strategies to make its open data available to all users, as quickly and easily as possible. Data are served directly from NOAA’s federal data systems to consumers through a variety of technical methods, and some data are distributed by NOAA’s partners and cooperators, including those in the commercial weather enterprise and environmental data communities.  The demand for NOAA’s data often exceeds the government’s ability to provide them routinely at a sufficient scale and timeliness to meet that demand. And NOAA’s data holdings and the demand for them (see Figure 1) continue to grow at a rapid pace.

Figure 1. The annual volume and types of data delivered from NOAA’s archives at the National Centers for Environmental Information. This is just a subset of the total amount of data accessed from NOAA. (Figure courtesy of Tim Owen and Ken Casey, NOAA/NCEI)


How can NOAA find a scalable, and affordable, solution to this public open data access challenge? We are currently experimenting with new public-private partnerships and cloud-based access technologies. NOAA’s Big Data Project (BDP, see www.noaa.gov/big-data-project) was established in April 2015 through 3-year, extendable Cooperative Research And Development Agreements (CRADAs) between NOAA and Amazon Web Services (AWS), Google, IBM, Microsoft, and the Open Commons Consortium. Through these agreements, NOAA aims to:

  • discover ways to “work smarter” through partnerships with industry and academia,
  • leverage the value inherent in NOAA’s data to broaden use and reduce costs,
  • unleash the power of industry’s modern cloud platforms and related technologies, and
  • create opportunities to advance the US economy using federal data.

For the duration of these BDP CRADAs, each Collaborator has agreed to store the original data from NOAA and make them freely available to all, while they may seek other ways of monetizing those data, including the provision of new services and value-added information products. While all of NOAA’s open data are available to the Collaborators, they choose the particular datasets in which they wish to invest their time and resources, and they often partner with third parties that are interested as well. As you can imagine, the Collaborators’ cloud platforms offer significant advancements in scale, processing, analytics, and tools for the users of NOAA’s data.

While over a dozen datasets are at some level of delivery via the BDP, NOAA’s NEXRAD weather radar data were among the first data to be made publicly available (see Ansari et al., in press, for details). NOAA transferred the complete NEXRAD Level II historical archive (approximately 300 TB) from its internal systems to those CRADA Collaborators that wished to receive them. AWS was the first to make those data freely available, and AWS and NOAA found after a year that:

  • weather radar data utilization has doubled by volume, compared to prior years,
  • thousands of distinct users per month are accessing NOAA data on AWS,
  • loads have decreased by 50% on NOAA’s internal data ordering systems,
  • ...all at no net cost to the US taxpayer.

The costs of hosting the NOAA data on AWS are underwritten by those users that use the data on the AWS platform, instead of simply downloading them to a different system. By using the data on AWS instead of having to extract them from the NOAA systems, the level of data services has significantly increased and the time required to develop new information products has drastically decreased.  Other NOAA datasets under consideration for BDP delivery include fisheries catch data, integrated water resources information, numerical weather prediction model output, advanced severe weather products, marine genomics data, and new geostationary satellite data.
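
For a sense of what this looks like in practice, the sketch below anonymously lists a few NEXRAD Level II objects from the public AWS bucket. The bucket name (noaa-nexrad-level2) and the YYYY/MM/DD/STATION key layout reflect how the archive is commonly documented on AWS’s open data registry; both should be verified before relying on them.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access: no AWS credentials are required for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="noaa-nexrad-level2",
    Prefix="2016/06/01/KTLX/",   # one day of scans from the Oklahoma City radar
)
for obj in resp.get("Contents", [])[:5]:
    print(obj["Key"], obj["Size"])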

An upcoming challenge for NOAA is to take the lessons learned by industry and the federal government during these CRADA activities and develop a sustainable partnership model with defined levels of service on which both the federal government and industry can agree, and depend. The ultimate goal is to provide full and open utilization of all of NOAA’s data, at a scale and rate that is largely determined and underwritten by the needs of the user community, instead of solely by taxpayers’ funds.

Now that I’ve described briefly how NOAA is exploring better data access and utilization through these public-private partnerships, let’s go back to the question of preservation. Archive and long-term preservation are widely accepted as inherently governmental responsibilities, and NOAA follows laws, regulations, and policies related to archive and data management to uphold those responsibilities.  Throughout its history, NOAA has remained committed to the collection, preservation, and dissemination of environmental data in service to the Nation, in support of the US economy, and in cooperation with our international partners.

So, are NOAA’s data at greater risk for loss now? No. NOAA’s archive systems are well established, and NOAA’s data and data management practices are governed by federal laws and regulations. Oversight of federal data management is provided by the National Archives and Records Administration (NARA) and the Office of Management and Budget (OMB). A sampling of relevant laws and regulations, including the Federal Records Act, can be found at the end of this blog post. Executive orders and policies clarify how these laws should be carried out by NOAA and other agencies, and some of these are also listed.

I am sometimes asked if NOAA’s data in its archives can be easily deleted. No, they can’t: data may not be removed without significant effort and public deliberation. It is also unlawful to tamper with, damage, delete, vandalize, or in any way alter formal federal records, including NOAA’s environmental data and its archives. Data disposition schedules and defined NOAA processes, which prescribe public notice and comment periods by which NOAA may propose to remove data from its archives, help us meet the intended outcome of well-executed and efficient data preservation. Such removal has been rare.

What about authentication? While anyone is welcome to download and copy NOAA’s open data, the uncoordinated proliferation of data stores may actually introduce future issues with the trust of those data. The trust of any data is associated with the quality, stewardship, provenance, and authority associated with them. The value of NOAA’s data archives includes not just the simple existence of the data themselves, but the continuous investment of NOAA’s experts’ efforts toward the sustained quality and usability of the data. The integrity and accuracy of data that are stored on non-federal systems and are not stewarded by NOAA’s scientists cannot always be easily verified beyond file-level distribution. NOAA is currently exploring best practices and technologies that may allow the authentication of its data throughout the wider data ecosystem, and welcomes interested parties in academia and industry to join in this exploration.

With these challenges and opportunities facing NOAA, I am certainly excited to step into the role of NOAA’s CDO. I look forward to working with the wider open data community to discover new, more effective methods of making NOAA’s open data available for everyone’s use, while ensuring the integrity and preservation of those data.

Dr. Edward J. Kearns

A sampling of laws, regulations, and policies relevant to NOAA’s open data and data preservation:

1.     Federal Records Act of 1950, 44 U.S.C. §§ 2101 et seq., 3101 et seq., 3301 et seq.

The Federal Records Act establishes the framework for records management programs in Federal Agencies, including the National Weather Records Center (NWRC), established in 1951 and now NOAA’s National Centers for Environmental Information (NCEI). NCEI is charged with archiving and servicing U.S. weather and climate records.

The Act specifically amended the Federal Property and Administrative Services Act of 1949 to provide for Agency Records Centers run by the Archivist of the General Services Administration: “The Archivist may establish, maintain, and operate records centers and centralized microfilming services for Federal agencies.” 44 U.S.C. § 2907. It also allows for Records Centers run by federal agencies, following approval: “When the head of a Federal agency determines that such action may affect substantial economies or increased operating efficiency, [s]he shall provide for the transfer of records to a records center maintained and operated by the Archivist, or, when approved by the Archivist, to a center maintained and operated by the head of the Federal agency.” 44 U.S.C. § 3103.

2.     NARA, Records Management, 36 C.F.R. § 1220.1-1239.6.  The National Archive and Records Administration’s (NARA) mission is to safeguard and preserve the records of the U.S. government, ensuring that its citizens can discover, use, and learn from the country’s documentary heritage. The NARA regulations on records management specify policies for Federal agencies’ records management programs relating to proper records creation and maintenance, adequate documentation, and records disposition. They are the implementing authority for the Federal Records Act. NARA standards for Records Management apply to all federal records, regardless of where they are stored.

3.     Office of Management and Budget, Revision of Circular No. A-130, Transmittal 4,  Management of Federal Information Resources (Nov. 30, 2000)
Revised Circular No. A-130 provides uniform government-wide information resources management policies as required by the Paperwork Reduction Act of 1980, amended by the Paperwork Reduction Act of 1995, 44 U.S.C. § 3501 et seq. This Transmittal Memorandum contains updated guidance on the "Security of Federal Automated Information Systems.” Under the Circular, “Agencies must plan in an integrated manner for managing information throughout its life cycle.”

4.      Office of Management and Budget, Circular No. A-16 Revised, Coordination of Geographic Information and Related Spatial Data Activities (Aug. 10, 2002).
Revised Circular No. A-16 “provides direction for federal agencies that produce, maintain or use spatial data either directly or indirectly in the fulfillment of their mission. This Circular establishes a coordinated approach to electronically develop the National Spatial Data Infrastructure and establishes the Federal Geographic Data Committee.” Spatial data is defined as “information about places or geography, and has traditionally been shown on maps.” The Circular also “describes the management and reporting requirements of Federal agencies in the acquisition, maintenance, distribution, use, and preservation of spatial data by the Federal Government” including the preparation, maintenance, publication and implementation of a strategy for advancing geographic information and related spatial data activities.

5.     Executive Office of the President, Office of Science and Technology Policy, Memorandum: Increasing Access to the Results of Federally Funded Scientific Research (Feb. 22, 2013).
This memorandum “directs each Federal agency with over $100 million in annual conduct of research and development expenditures to develop a plan to support increased public access to the results of research funded by the Federal Government. This includes any results published in peer-reviewed scholarly publications that are based on research that directly arises from Federal funds, as defined in relevant OMB circulars (e.g., A-21 and A-11). It is preferred that agencies work together, where appropriate, to develop these plans.”

6.     White House, M-13-13: Memorandum on Open Data Policy – Managing Information as an Asset (May 9, 2013)
White House Memorandum M-13-13 “establishes a framework to help institutionalize the principles of effective information management at each stage of the information’s life cycle to promote interoperability and openness” in accordance with Exec. Or. 13642, Making Open and Machine Readable the New Default for Government Information. The Memorandum “requires agencies to collect or create information in a way that supports downstream information processing and dissemination activities” including “machine-readable and open formats, data standards, and common core and extensible metadata for all new information creation and collection efforts.”

7.     NOAA Administrative Order NAO 205-1: NOAA Records Management Program (2010)
NOAA Administrative Order (NAO) 205-1 enables NOAA “to carry out an effective records management program in compliance with the Federal Records Act and other relevant legal authorities[.]” Under the order “NOAA Program Officials have the primary responsibility for creating, maintaining, protecting, and disposing of records of their program area.” This duty includes, but is not limited to, documentation of the creation of records, implementation of record protection policies, establishment of a records management system, and cooperation with the NOAA Records Management Officer in requests for information regarding the management of records.

8.     NOAA Administrative Order NAO 212-15: Management of Environmental Data and Information (Issued: 1991, Effective: 2010)
NOAA Administrative Order (NAO) 212-15 establishes the Administration’s Environmental Data Management Policy. The “NAO applies to all NOAA environmental data and to the personnel and organizations that manage these data, unless exempted by statutory or regulatory authority.” Under the Order, data management “consists of two major activities conducted in coordination: data management services and data stewardship. They constitute a comprehensive end-to-end process including movement of data and information from the observing system sensors to the data user. This process includes the acquisition, quality control, metadata cataloging, validation, reprocessing, storage, retrieval, dissemination, and archival of data.” This end-to-end data management lifecycle helps to achieve the NOAA policy objective requiring that “[e]nvironmental data will be visible, accessible and independently understandable to users[.]”