The May meeting is about ensuring long-term preservation and access of federal data beyond federal servers.
What do you think are the key issues to address?
Coordination of efforts, Collaboration, Access, Preservation, Discoverability & Metadata, Long-term structure of federal information, Scoping/Scaling, Costs
Coordination of efforts
I think one of the key issues to address is cataloging and indexing the data sets. One of the most challenging parts of accessing federal data is simply trying to determine which data sets are going to contain the data relevant to your query. With the mountains of information available, it's important to ensure that it's easy to locate and understand the contents of the data sets.
Improving access to open federal data
Long-term, ongoing preservation of web content (beyond "emergency" efforts during transition of government).
Ensuring there are multiple copies of data from websites in the same way there are copies of print documents in libraries across the country.
This is a huge job and not just for one community. We need to figure out how to leverage the labor / expertise of all communities and figure out how to work together to find a solution.
acquisition, preservation, and access
Ability to find and use the data that has been archived
Key issues to address include: (1) the long-term organizational structure of federal web and data archiving efforts as they move past the grassroots stage; (2) how to identify what has already been archived elsewhere & what is held in partnerships with institution(s); and (3) the scope of the problem.
Commitment to open source, decentralized, and open data solutions, so that preservation is a process that can be made accessible to the many, not the few.
Collaborative efforts - who is working on what?
Partnerships with federal agencies, identification and adoption of existing data best practices, participation in and collaboration with the broader data community
Creating and maintaining a network of "preservers"
Content scope is a big one: what is in scope? Based on what: context, authority, level of granularity? I see a spectrum of approaches, with very specific on one end (work with scientists to zero in on and prioritize a critical area or areas of scientific inquiry, say climate change, and make sure it is captured and preserved) and very broad on the other (design a flexible structure that engages a broad constituency and enables wide preservation with lightweight standards, that builds in a tolerance for ambiguity/messiness and enables later cycles of refinement, though this could have some of the same pitfalls as web archiving).
Cost - who pays for long term preservation/access/support costs?
What is the current state of long-term preservation and access to these data (i.e., what is currently being done? who is doing it?). How does preservation and access of federal data relate (or not) to other existing federal government and library stewardship initiatives, e.g. FDLP, FDLP-LOCKSS? How big is the territory here, and how feasible is it to consider "federal data" as the scope? What kinds of existing library infrastructure (people, repositories, protocols, formats, projects) might be brought to bear? Same question for non-library infrastructure, especially from allied domains/communities (data intensive researchers, civic data, data journalism)
working out a distributed network for authentic backup and access that is stable in the long term; locating federal policies that have an impact and advocating for more openness. Plans for immediate shelter if/when departments are shuttered or budgets are significantly cut.
Can more clearly explained directories be set up (regarding EDGI / data refuge initiatives), so more people can more easily organize events? More young people (high school kids) should be brought into these projects.
advocating for the ongoing creation & publication of federal data in forms that can be versioned, managed, preserved, and served by digital libraries; motivating & coordinating long-term commitments to this work across a variety of interested sectors; adapting & extending the spirit of public service while overcoming the 'unfunded mandate' challenges that have plagued the FDLP
Reaching vulnerable data sets across federal agencies (esp civil rights data)
1) Information accessibility & dissemination (can we organize and create optimal platforms for sharing all of the data collected in EOT and data rescue events?) 2) The use of FOIA and other government transparency laws, and preventing the passage of laws that pull back on data collection or stifle government transparency 3) Continued funding and support of government data collection programs and practices
How to build a robust, resilient infrastructure in terms of data preservation and access/discovery. How to build a network that is distributed but yet effective.
Federal data needs to be preserved and discoverable. Ideally, from my perspective, preservation via a decentralized data archiving network based in libraries is the goal. In addition, to facilitate discovery and risk assessment, a comprehensive metadata discovery system (like Svalbard (https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab) or similar) will make it easy to both discover data sets and ID data sets that are at risk.
The immediate issues are: 1) finding one (or multiple) long-term archival home(s) for the data being collected in DataRescue, 2) building a list/database of sites/datasets/areas of interest to focus on moving forward
How we can have a community-based approach to this problem
Building collaborative digital infrastructure that deals with the entire lifecycle, and working on policy initiatives to require federal agencies to have Information Management Plans (IMPs) that facilitate the collection, description, access and preservation of federal data (and by "data" I mean both numeric data sets and digital files (pdfs, images, etc.) that are "data" in a broad sense).
While Data Rescue events serve as an initial effort to "grab and stash" web data that are not systematically archived, they are not a sustainable solution for making government data and web content available over the long term. Many of the web pages and existing data were not immediately understandable and accessible in the original form they appeared online; data were not properly documented and websites were often sprawling and poorly organized, illustrating a broader data management problem at the agency level.
Part of the solution needs to be addressing the problem farther up in the process, at data creation and dissemination by government agencies. A long-term solution could involve libraries within ARL working directly with federal agencies to improve data management and website organization, including curation of data outputs, improved site mapping, archiving old sites, linking to new content, and examining domain consistency. It is also important to connect with existing repositories that currently archive government data to determine the best ways to systematically capture content that currently only exists on the agency websites, along with the larger, curated datasets traditionally preserved by archival repositories. There are precedents in the realm of government publications for individual libraries "adopting" certain agencies or sub-agencies to provide curation of and access to their outputs. One starting point could be a pilot with a certain agency partnering with an ARL library.
There is also a need to better define the scope of this work, its importance, and relation to existing government archiving efforts, as there have been arguments that DataRescue events are replicating work handled by "old guard" government archiving efforts (Molteni, Feb 19, 2017 Wired). Clear documentation about what information from agencies is already systematically archived, what information exists primarily (if not entirely) on ephemeral websites, and what information is most needed for research, policy decisions, and other uses will help guide all of these efforts towards common goals of access and preservation. To start, ARL libraries may want to aggregate lists of federal datasets or information that have already disappeared from websites. Some organizations, such as Environmental Data & Governance Initiative (EDGI) and the Sunlight Foundation, as well as individual librarians have already been tracking this information and would be good partners in this effort.
Similarly, it is important to document what the Internet Archive (IA) has and has not captured and how long information is made available through the Wayback Machine. While automatic crawlers capture a portion of government websites, we found in this event that important information can exist many, many layers deep in a site. This content is not automatically crawled by IA, and our efforts nominated many of these "deep" websites for capture. However, it is unclear whether these nomination efforts were for a single snapshot, or whether the nominated sites will be added to lists that are automatically and periodically crawled. Additionally, there seem to be opportunities for the use or development of extensions or scripts that map websites and can help guide nominators (or automated crawling scripts) to previously un-saved portions of websites. Questions about how much should be preserved, how long snapshots should be kept, and what information does NOT need to be preserved also need to be answered before sustainable, long-term solutions for this work can be realized.
I think the biggest challenge will be coming up with the resources to create the necessary metadata
Developing a framework to assess risk of data loss from federal servers; creating communication channels with those in federal agencies who are responsible for federal data management
How to coordinate the many related but disconnected efforts emerging nationally, both data rescue efforts and other projects like PEGI, EDGI, etc.
ensuring long term access (LOCKSS-type issues); versioning to ensure long-term, sustained access; supporting materials beyond the data, i.e., publications
Determining who is already doing what in the area of permanent public access and identifying the gaps.
So many! How to develop trust with researchers, activists, NGOs, etc if data are not stored in the places people are used to visiting? How do we move past ideas of vulnerability while still being responsive to current events (current administration or future)? How to not lose the community engagement that we developed over 35+ Data Rescue events? How to deal with legacy data and data that is constantly being updated/changing?
1) the ideal relationships between the Federal agencies responsible for archiving or collecting or providing access to Federal data and the library community. (There are other issues which I’m sure others will suggest.) In an ideal world, would the private sector *duplicate* Federal collections, for safety’s sake, would Federal agencies not be in this business, would Federal agencies be doing something different? Are the preservation and access roles separable? How does the need for access to materials change over time?
2) increased awareness of the current regulatory environment for the preservation of federal digital content by federal agencies (Federal Records Act, Presidential Records Act, Federal Paperwork Reduction Act, Title 44 U.S.C. §§ 1901-1916 for GPO, etc.), and the scope of preservation efforts by federal agencies now.
3) clarification of language, particularly around risk. Loss of online access is often described as a loss of data, but these are two different issues. (Federal data is not at great risk of inappropriate destruction. Online access may not be maintained by originating or preserving agencies, however.)
Coordination across institutions; reproducibility and standards of data; forensic transparency of identity of data products, perhaps using checksums, hashes, etc.
Three key challenges for digital preservation are:
1. Identifying what content exists. The hidden, harder-to-identify content is most at risk of loss. I urge consideration of industrial-strength automated techniques to identify content to capture. There may be opportunities using the end of term crawl. Instead of concentrating on what was collected, look at where crawls hint at failure to collect. Without this, the effort won’t scale. Human bespoke work is expensive.
2. An incomplete or naïve view of what it takes to preserve content in unfavorable or hostile conditions (for example the current attack on climate change data). One copy and two backups with simple hash checks is an easily attacked target for actors who wish to “disappear” any particular content. The LOCKSS system was engineered against the threats to government information.
3. Libraries have limited resources; it is difficult for them to invest in the future, even the near term future.
How do we ensure that the data we capture from government sources remain authoritative, trusted and usable by the widest possible audience? How do we ensure the preservation of at-risk data in ways that connect into and respect existing workflows and safeguards?
Scope, roles/types of agents, most significant risks, laying ground for longer term management.
Scope of the problem: Is this group focused on long-term preservation and access of:
1) All federally produced data,
2) Data that might be abandoned or even suppressed because of changes in government policy, and/or
3) Data collection that might not happen in the future because of government policy.
In addition to long-term preservation and access of federal data, another consideration is ensuring the long-term usability of data.
Selection criteria, consistent metadata, cost of preservation, definition of sufficient assurance in preservation of this category of data (how many copies are enough, etc.), how to ensure that this isn't just a surge in energy from participants that dissipates as interest wanes
I'm most interested in the social aspects of this challenge; in other words, creating the environment and political space for preservation and access to occur. How do we better tell stories of how data is used to improve lives? What infrastructure is needed to advocate for continued collection of data? What can different data communities learn from each other (such as social vs. hard sciences)? How do we ensure that efforts are not duplicated? How do we raise awareness about what activities are helpful and what are not (for example, everyone and their mother rushing to download a data set when a news story suggests it will disappear)? How should we work to inform the work of data journalists so that data is well utilized? What is/are the best rapid response mechanism(s) for threats to federal data?
coordination and collaboration among supporting orgs, groups and individual actors