On the Preservation of and Access to NOAA’s Open Data

By Dr. Edward J. Kearns, NOAA Chief Data Officer

Recent articles in the popular press and across various social media platforms have raised concerns over the continued preservation and utilization of federal data holdings, particularly NOAA’s climate-related data.  These concerns have produced a number of coordinated efforts to download and store significant volumes of NOAA’s data outside of the federal data systems. While I do not share those same concerns about preservation, as NOAA’s new Chief Data Officer I recognize that the essential idea that enables these efforts --  easy public access to all of NOAA’s open data -- is a laudable one that NOAA’s data stewards are striving to achieve. Let’s talk about open data access first, and I’ll come back to those concerns related to preservation later.

NOAA employs many strategies to make its open data available to all users, as quickly and easily as possible. Data are served directly from NOAA’s federal data systems to consumers through a variety of technical methods, and some data are distributed by NOAA’s partners and cooperators, including those in the commercial weather enterprise and environmental data communities.  The demand for NOAA’s data often exceeds the government’s ability to provide them routinely at a sufficient scale and timeliness to meet that demand. And NOAA’s data holdings and the demand for them (see Figure 1) continue to grow at a rapid pace.

Figure 1. The annual volume and types of data delivered from NOAA’s archives at the National Centers for Environmental Information. This is just a subset of the total amount of data accessed from NOAA. (Figure courtesy of Tim Owen and Ken Casey, NOAA/NCEI)

How can NOAA find a scalable and affordable solution to this public open data access challenge? We are currently experimenting with new public-private partnerships and cloud-based access technologies.  NOAA’s Big Data Project (BDP, see www.noaa.gov/big-data-project) was established in April 2015 through 3-year, extendable Cooperative Research and Development Agreements (CRADAs) between NOAA and Amazon Web Services (AWS), Google, IBM, Microsoft and the Open Commons Consortium in order to:

●      discover ways for NOAA to “work smarter” through partnerships with industry and academia,

●      leverage the value inherent in NOAA’s data to broaden use and reduce costs,

●      unleash the power of industry’s modern cloud platforms and related technologies,

●      create opportunities to advance the US economy using federal data.

For the duration of these BDP CRADAs, each Collaborator has agreed to store the original data from NOAA and make them freely available to all, while remaining free to seek other ways of monetizing those data, including the provision of new services and value-added information products. While all of NOAA’s open data are available to the Collaborators, they choose the particular datasets in which they wish to invest their time and resources, and will often partner with third parties that are interested as well. As you can imagine, the Collaborators’ cloud platforms offer significant advancements in scale, processing, analytics, and tools for the users of NOAA’s data.

While over a dozen datasets are at some level of delivery via the BDP, NOAA’s NEXRAD weather radar data were among the first data to be made publicly available (see Ansari et al., in press, for details).  NOAA transferred the complete NEXRAD Level II historical archive (approximately 300 TB) from its internal systems to those CRADA Collaborators that wished to receive it. AWS was the first to make those data freely available, and AWS and NOAA found after a year that:

●      weather radar data utilization has doubled by volume, compared to prior years,

●      thousands of distinct users per month are accessing NOAA data on AWS,

●      and loads have decreased by 50% on NOAA’s internal data ordering systems,

●      ...all at no net cost to the US taxpayer  

The costs of hosting the NOAA data on AWS are underwritten by those users who use the data on the AWS platform, rather than simply downloading them to a different system. Because users can work with the data on AWS instead of having to extract them from the NOAA systems, the level of data services has significantly increased and the time required to develop new information products has drastically decreased.  Other NOAA datasets under consideration for BDP delivery include fisheries catch data, integrated water resources information, numerical weather prediction model output, advanced severe weather products, marine genomics data, and new geostationary satellite data.

An upcoming challenge for NOAA is to take the lessons learned by industry and the federal government during these CRADA activities and develop a sustainable partnership model with defined levels of service on which both the federal government and industry can agree and depend. The ultimate goal is to provide full and open utilization of all of NOAA’s data, at a scale and rate that is largely determined and underwritten by the needs of the user community, instead of solely by taxpayers’ funds.

Now that I’ve described briefly how NOAA is exploring better data access and utilization through these public-private partnerships, let’s go back to the question of preservation. Archive and long-term preservation are widely accepted as inherently governmental responsibilities, and NOAA follows laws, regulations, and policies related to archive and data management to uphold those responsibilities.  Throughout its history, NOAA has remained committed to the collection, preservation, and dissemination of environmental data in service to the Nation, in support of the US economy, and in cooperation with our international partners.

So, are NOAA’s data at greater risk for loss now? No. NOAA’s archive systems are well established, and NOAA’s data and data management practices are governed by federal laws and regulations.  Oversight of federal data management is provided by the National Archives and Records Administration (NARA) and the Office of Management and Budget (OMB). A sampling of relevant laws and regulations, including the Federal Records Act, can be found at the end of this blog post.  Executive orders and policies clarify how these laws should be carried out by NOAA and other agencies, and some of these are also listed.

I am sometimes asked if NOAA’s data in its archives can be easily deleted. No, they can’t, since data may not be removed without significant effort and public deliberation. It is also unlawful to tamper with, damage, delete, vandalize, or in any way alter formal federal records, including NOAA’s environmental data and its archives. NOAA may propose to remove data from its archives only through data disposition schedules and defined NOAA processes, which prescribe public notice and comment periods and help us to meet the intended outcome of well-executed and efficient data preservation. Such removal has been rare.

What about authentication? While anyone is welcome to download and copy NOAA’s open data, the uncoordinated proliferation of data stores may actually introduce future issues with trust in those data. Trust in any data depends on the quality, stewardship, provenance and authority associated with them. The value of NOAA’s data archives includes not just the simple existence of the data themselves, but the continuous investment of NOAA’s experts’ efforts towards the sustained quality and usability of the data. The integrity and accuracy of data that are stored on non-federal systems and are not stewarded by NOAA’s scientists cannot always be easily verified beyond file-level distribution. NOAA is currently exploring best practices and technologies that may allow the authentication of its data throughout the wider data ecosystem, and welcomes interested parties in academia and industry to join in this exploration.
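To make the idea of file-level verification concrete, here is a minimal sketch in Python. It assumes that a trusted reference digest (here SHA-256) is available for each file from an authoritative source; that assumption, and the helper names `sha256_of_file` and `matches_reference`, are purely illustrative and do not describe any existing NOAA service or the approaches NOAA is exploring.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, reference_digest):
    """True if the local copy hashes to the (hypothetical) trusted reference digest."""
    # Normalize whitespace and case so digests copied from a text file compare cleanly.
    return sha256_of_file(path) == reference_digest.strip().lower()
```

A matching digest shows only that a copy is byte-for-byte identical to the file that was hashed. It says nothing about the quality, stewardship, or provenance of the data, which is why file-level checks alone do not settle the question of trust.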

With these challenges and opportunities facing NOAA, I am certainly excited to step into the role of NOAA’s CDO.  I look forward to working with the wider open data community to discover new, more effective methods of making NOAA’s open data available for everyone’s use, while ensuring the integrity and preservation of those data.

Dr. Edward J. Kearns



