Workflow

This page describes the workflow we use for the DataRefuge project, both at in-person events and when people work remotely. It explains the process that a URL/dataset goes through from the time a Seeder/Sorter identifies it as “uncrawlable” until it is made available as a record in the datarefuge.org CKAN data catalog. The process involves several distinct stages and is designed to maximize smooth hand-offs, so that each phase is handled by someone with expertise in the area they’re tackling while the data is tracked for security at every step.

Seeders/Sorters

Seeders and Sorters canvass the resources of a given government agency, identifying important URLs. They determine whether those URLs can be crawled by the Internet Archive’s web crawler. If the URLs are crawlable, the Seeders/Sorters nominate them to the End of Term (EOT) project; otherwise they add them to the Uncrawlable spreadsheet using the project’s Chrome Extension.
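
As a rough illustration of what “uncrawlable” can mean in practice, the Python sketch below flags URLs that a standard web crawler is likely to miss (FTP links, pages that hide their data behind forms or JavaScript). It is only a heuristic, not the project’s official tooling; the hint strings and example URL are assumptions, and the Chrome Extension remains the way Seeders/Sorters actually nominate URLs.

```python
# Rough heuristic for spotting URLs that a standard web crawler may miss.
# Illustrative only: the hint strings and example URL below are assumptions.
import requests
from urllib.parse import urlparse

DYNAMIC_HINTS = ("<form", "javascript:void", "window.location")  # crude signals

def looks_uncrawlable(url):
    """Return True if the URL probably needs manual harvesting."""
    if urlparse(url).scheme == "ftp":
        return True  # crawlers generally skip FTP servers
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException:
        return True  # unreachable pages need a human look anyway
    if "text/html" not in resp.headers.get("Content-Type", ""):
        return False  # a direct file link is usually crawlable
    body = resp.text.lower()
    return any(hint in body for hint in DYNAMIC_HINTS)

if __name__ == "__main__":
    print(looks_uncrawlable("https://www.example.gov/data-portal"))  # placeholder URL
```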

Researchers

Researchers inspect the “uncrawlable” list to confirm that the Seeders’ assessments were correct (that is, that the URL/dataset is indeed uncrawlable) and investigate how the dataset could best be harvested. Research.md describes this process in more detail. We recommend that Researchers and Harvesters (see below) work together in pairs, as much communication is needed between the two roles. In some cases, the same person will fulfill both roles.

Harvesters

Harvesters take the “uncrawlable” data and figure out how to actually capture it, based on the recommendations of the Researchers. This is a complex task that can require substantial technical expertise and calls for different techniques for different datasets. Harvesters should see the included Harvesting Toolkit for more details and tools.
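
As a minimal sketch of the simplest case (a list of direct file URLs), the Python below downloads each file and records a SHA-256 checksum in a manifest. The destination directory, manifest format, and example URL are assumptions; real harvests often need site-specific scraping or API calls covered in the Harvesting Toolkit.

```python
# Minimal harvesting sketch: download a list of file URLs into a local
# directory and record a SHA-256 checksum for each file in a manifest.
import csv
import hashlib
import os
import requests

def harvest(urls, dest_dir="harvest"):
    os.makedirs(dest_dir, exist_ok=True)
    manifest_path = os.path.join(dest_dir, "manifest.csv")
    with open(manifest_path, "w", newline="") as manifest:
        writer = csv.writer(manifest)
        writer.writerow(["url", "filename", "sha256"])
        for url in urls:
            filename = url.rstrip("/").split("/")[-1] or "index.html"
            local_path = os.path.join(dest_dir, filename)
            resp = requests.get(url, stream=True, timeout=60)
            resp.raise_for_status()
            digest = hashlib.sha256()
            with open(local_path, "wb") as out:
                for chunk in resp.iter_content(chunk_size=8192):
                    out.write(chunk)
                    digest.update(chunk)
            writer.writerow([url, filename, digest.hexdigest()])

if __name__ == "__main__":
    harvest(["https://www.example.gov/datasets/sample.csv"])  # placeholder URL
```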

Checkers

Checkers inspect a harvested dataset and make sure that it is complete. The main question Checkers need to answer is: “Will the bag make sense to a scientist?” Checkers need an in-depth understanding of harvesting goals and of potential content variations across datasets.
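
For the mechanical half of that check, a sketch like the one below can re-hash each harvested file against the manifest written in the harvesting sketch above. The file layout is an assumption carried over from that sketch, and judging whether the data “makes sense to a scientist” still requires a human reviewer.

```python
# Completeness check sketch: re-hash each harvested file and compare it to the
# manifest written at harvest time (see the Harvesters sketch above).
import csv
import hashlib
import os

def check_harvest(dest_dir="harvest"):
    problems = []
    with open(os.path.join(dest_dir, "manifest.csv"), newline="") as manifest:
        for row in csv.DictReader(manifest):
            path = os.path.join(dest_dir, row["filename"])
            if not os.path.exists(path):
                problems.append(f"missing file: {row['filename']}")
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    digest.update(chunk)
            if digest.hexdigest() != row["sha256"]:
                problems.append(f"checksum mismatch: {row['filename']}")
    return problems

if __name__ == "__main__":
    print(check_harvest() or "harvest matches its manifest")
```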

Baggers

Baggers do some quality assurance on the dataset to make sure the content is correct and corresponds to what was described in the spreadsheet. Then they package the data into a BagIt package (or “bag”), which includes basic technical metadata, and upload it to its final DataRefuge destination.
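
A minimal sketch of the packaging step follows, assuming the checked data sits in a local “harvest” directory and using the Library of Congress bagit-python library (pip install bagit). The metadata values shown are placeholders, not the project’s required fields.

```python
# BagIt packaging sketch using the bagit-python library. make_bag() rewrites
# the directory in place, moving the payload into data/ and writing manifests.
import bagit

bag = bagit.make_bag(
    "harvest",                                # directory of checked data
    {
        "Source-Organization": "DataRefuge",  # placeholder metadata values
        "External-Description": "Harvested copy of an uncrawlable dataset.",
        "Contact-Name": "Your Name",
    },
)
print(bag.is_valid())  # verifies payload files against the bag manifests
```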

Describers

Describers create a descriptive record in the DataRefuge CKAN repository for each bag. They then link the record to the bag and make the record public.
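
One way to script this step is sketched below, assuming the repository exposes the standard CKAN Action API (package_create). The base URL, API key, organization, and bag location are placeholders, and the DataRefuge instance may require additional fields.

```python
# Sketch of creating a public CKAN record that links to a bag, using the
# standard CKAN Action API. All values below are placeholders.
import requests

CKAN_URL = "https://www.datarefuge.org"   # assumed CKAN base URL
API_KEY = "YOUR-CKAN-API-KEY"             # placeholder

record = {
    "name": "example-dataset-id",         # URL slug for the record
    "title": "Example Agency Dataset",
    "notes": "Harvested copy of an uncrawlable dataset.",
    "owner_org": "datarefuge",            # placeholder organization
    "private": False,                     # make the record public
    "resources": [
        {"url": "https://storage.example.org/bags/example-dataset-id.zip",
         "name": "BagIt package"}         # link the record to the bag
    ],
}

resp = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    json=record,
    headers={"Authorization": API_KEY},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["result"]["id"])
```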

Tracks

1. Web archiving:

Contribute to the Internet Archive’s End of Term archive.

In this track you will be working through federal websites to identify important data and figure out what we need to do with it. You should have familiarity with a web browser and attention to detail.
Goals: secure websites from offices of DOE, EPA, DOI, etc.

To join this track:

2. Data archiving:

Contribute to DataRefuge.org, a CKAN instance.

You should EITHER have deep domain knowledge of scientific datasets, OR be a librarian, OR be a skilled technologist with a programming language of your choice (e.g. Python, JavaScript, C, Java), knowledge of the command line (bash, shell, PowerShell), and experience working with structured data (e.g. JSON formatting). Experience in front-end web development is a plus.
Goals: archive data from USGS, other datasets waiting in the queue, etc.

3. Storytelling:

You should have a penchant for developing compelling narratives and/or social media strategies.

Goals: tell/write stories about the importance of climate and environmental data to our everyday lives. Share this work on social media.

4. Onward:

For newbies and veterans of EDGI alike.
Goals: discuss and spec out the next 3 to 6 months of tech development plans as we move beyond collection; discuss security, resiliency, and redundancy.

FAQs

How can we know which data to target so we don’t replicate the work at other data rescue events?

We gathered information from experts, scientists, and community members about particularly valuable and vulnerable data before the event. For those events that would like help identifying areas to focus on, we aim to provide lists of the most important datasets and sources that we’ve identified so that each event can tackle a piece of the larger set without too much duplication. These lists are being compiled here. However, understanding the data that is most valuable and vulnerable within your own community can be a really important aspect of your Data Rescue event.

Where will downloaded data go?

If your institution can’t host the data your Data Rescue event downloads, we are developing a repository using Amazon Web Services integrated with CKAN, an open source data catalog, that will be available to DataRescue events for storing copies of data and making them accessible. Contact us via the EDGI website to learn more.

Will you be providing best practices for creating reliable copies?

Yes! And we welcome your collaboration. Generally, we recommend that materials that can be captured through web crawling and the activities of the End of Term Harvest be captured in that way. We will rely in part on the toolkit developed after the event at the University of Toronto, as well as on locally developed code to seed the harvester. Data that does not make sense in the Internet Archive will be added to the open data catalog mentioned above.