Your Data Can Live Forever (if people can find it): How to Plan for Data Reuse with Data Rescue as a case study

This activity is designed to help you understand what someone outside your research project (or you in 5-10 years) would need to know about your data in order to build on your work. We are going to use a dataset from the Data Rescue initiative, a dataset that needs rescuing, as a case study.

For beginners

Format

This is a two-part exercise. First we will research a dataset and try to answer some basic questions about it. Then we will do a short writing/thinking exercise. It is best done with a partner or small group, but can also be done alone.

Target Audience

Open science project leads, graduate students, and early-career researchers looking to make their data reusable.

Materials

Introduction

Data reuse saves time and accelerates the pace of scientific discovery. Ensuring your data is reusable and findable today also helps ensure that it will not disappear tomorrow. Don't make dark data. By making your data open and available to others, you make it possible for future researchers to answer questions that haven’t yet been asked. Thinking about data reuse in advance, and documenting your plan, saves you time by helping you design research processes and workflows early in the project. Finally, this documentation makes it easier to defend your research. Remember second grade, when your teacher told you to “show your work”? We are going to show our work using standard, machine-readable metadata.

What is metadata?

In a nutshell, metadata is descriptive, standardized, machine-readable information about a dataset. It makes the dataset useful, reusable, and discoverable. Good metadata is discoverable by search engines and uses open standards (today we focus on the data.json format, which is used by the USA's Data.gov). When choosing how to create metadata for your work, consult associations like the Research Data Alliance to find out what the standard format is for your field. Standards exist! Use them!

A good JSON metadata file should answer basic questions about a dataset using standard field codes. Standard field codes are a way of answering those questions so that the answers map to multiple database types and are easily understood by humans and machines. Here's a relatively short data.json file:

{
  "title": "VA National Formulary",
  "maintainer": "Don Lees",
  "maintainer_email": "Don.Lees@va.gov",
  "notes": "The VA National Formulary is a listing of products (drugs and supplies) that must be available for prescription at all VA facilities, and cannot be made non-formulary by a Veteran Integrated Service Network (VISN) or individual medical center. Regarding chemical or biological entities that by law must be submitted to the United States (U.S.) Food and Drug Administration (FDA) for pre-marketing approval, only those entities that actually have been approved by FDA using New Drug Application (NDA), Abbreviated New Drug Application (ANDA), or biologics license, may be added to the VA National Formulary.",
  "license_id": "cc-zero",
  "landingPage": "URL for landing page goes here",
  "id": "ff9ae098-eccc-41d8-bfcd-5e8ed047db05",
  "doi": "publication or data doi",
  "isPartOf": "larger project?",
  "tags": "FDA",
  "organization": {
    "description": "",
    "title": "Department of Veterans Affairs",
    "name": "va-gov",
    "is_organization": true,
    "image_url": "https://raw.githubusercontent.com/GSA/logo/master/va.png",
    "type": "organization",
    "id": ""
  }
}
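
Because metadata like the record above is meant to be machine readable, it is worth checking that a file actually parses and that the key fields are filled in. The following is a minimal sketch in Python, not an official Data.gov validator: the list of expected fields is an assumption drawn from the example in this handout, and the function name missing_fields is hypothetical.

```python
import json

# Assumption: these are the fields this exercise treats as essential.
# They come from the example record in this handout, not from an official schema.
EXPECTED_FIELDS = ["title", "maintainer", "maintainer_email", "notes", "license_id"]


def missing_fields(raw_json: str) -> list:
    """Return the expected fields that are absent or empty in a metadata record."""
    record = json.loads(raw_json)  # raises json.JSONDecodeError if the text is not valid JSON
    missing = []
    for name in EXPECTED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# A shortened copy of the example record, with the description left blank on purpose.
example = """
{
  "title": "VA National Formulary",
  "maintainer": "Don Lees",
  "maintainer_email": "Don.Lees@va.gov",
  "notes": "",
  "license_id": "cc-zero"
}
"""

print(missing_fields(example))  # ['notes']
```

An empty list means every expected field has a value. It does not mean the values are good; a human still has to judge whether the title and notes are clear.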

Here's what some of those fields mean:

  • Title: Human-readable name of the data set. Should be in plain English and include sufficient detail to facilitate search and discovery.
  • Maintainer: Contact person’s name for the data set.
  • Maintainer_email: Contact person’s email address.
  • Notes: Human-readable description of the dataset (e.g., an abstract) with sufficient detail to enable a user to quickly understand whether it is of interest.
  • Organization title: The publishing organization or entity.
  • License_id: The license or non-license (i.e. Public Domain) status with which the dataset has been published.
  • landingPage: A dataset’s human-friendly hub or landing page that users can be directed to for all resources tied to the dataset. Who created the data set? What agency, people, missions, or experiments does it relate to?
  • id: Universally unique identifier (UUID), for example "ff9ae098-eccc-41d8-bfcd-5e8ed047db05"
  • doi: Is there a publication associated with this dataset? List the DOI here.

When it comes to making your work reusable, “the devil is in the details”. We are going to make some high level metadata for a dataset, then run through how we might create metadata for our own datasets. Upon completion of this exercise, you will have a detailed data reuse plan which you can save as a README or text file to store with your data files so others can understand and reuse your data. Extra bonus feature: it provides an outline for the “Methodology” section of any publications that arise from this data.

    Steps to Complete

    1. Let's do some Data Rescue research

      Let's work as one group to try to create a high-level data.json file for the datasets associated with CATS. Yeah. That's a thing.

      It's a thing on the International Space Station that shoots lasers at clouds! A high-level data.json file should contain the information about a dataset that humans are good at finding. Scrapers can capture the detailed information on the datasets; humans, on the other hand, are good at finding information like data reuse license status and publication DOIs.

      Is this listed on Data.gov? Or are any of the datasets linked from here listed?

      Is this data reusable? What licensing or copyright information is provided?

      Are there publications that use these data? DOIs?
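
One way to record the group's answers to the questions above is to collect them into a small record and write it out as JSON. This is only a sketch under stated assumptions: build_record and the listed_on_data_gov field are hypothetical names made up for this exercise, and the values shown are placeholders, not findings about any real dataset.

```python
import json


def build_record(title, listed_on_data_gov=None, license_id=None, dois=None):
    """Assemble a high-level metadata record; unanswered questions stay visible."""
    return {
        "title": title,
        # Hypothetical field for this exercise: True, False, or "unknown".
        "listed_on_data_gov": "unknown" if listed_on_data_gov is None else listed_on_data_gov,
        "license_id": license_id or "unknown",
        # DOIs of publications that use the data; an empty list means none found yet.
        "doi": list(dois or []),
    }


# Placeholder values only; replace them with what your group actually finds.
record = build_record("Example dataset title")
print(json.dumps(record, indent=2))
```

Writing "unknown" rather than leaving a field out makes the gaps obvious to the next person who picks up the rescue work.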

    2. Breakout Session: Identifying Roles to Show us Your Metadata!

      Break into groups of 2-5 people. Identify one volunteer to be the "Researcher" and describe their research data set for this exercise. This person will need to be fairly familiar with how and why the data was collected.

      Identify a note taker to record responses to questions from the group about the data set.

    3. Breakout Session: Show us Your Metadata!

      Using the Data Reuse Plan Template as a guide, members of the group ask questions of the "Researcher" about her or his data set while the note taker records responses. The note taker can (and is encouraged to) ask questions too. As you ask questions, think about how you would (or if you could) respond to a similar question about your data set.

      If you have time, upon completion of the worksheet, review your responses and make sure they would be clear to someone viewing your data set for the first time. You are writing this for someone you have never met. Avoid jargon and abbreviations where possible.

    4. Review & Discuss

      Review the following questions and be prepared to share out your responses with the larger group.

      1. Which parts of the template were particularly challenging? Why? What research best practices could you put into place to make it easier?
      2. If you weren't able to provide some of the information in the worksheet, is there a way you can get it? If not, is there something you could have done differently during your research project to collect that information?
      3. Are there pieces of information missing from this worksheet that would help someone understand your data and make it easier to reuse?

    Glossary

    Open Data

    Data that is made easily and freely available for anyone to access, use, and share without restrictions, the possible exception being a requirement of attribution.

    Metadata

    Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data). For example, metadata for a photograph may include the name of the photographer, when and where it was taken, as well as the type of camera and settings used to take the photograph. For a dataset, the metadata should be machine readable and stick to the standard format agreed upon by the field of research.

    Licensing

    A license gives explicit permissions for the use of something. This is particularly important if you want to make your data open, as some jurisdictions assign copyrights to data sets, which limits their use. There are several types of licenses that are in common use for data. You can read more about them here: http://www.dcc.ac.uk/resources/how-guides/license-research-data.

    Naming Conventions

    These are a set of predefined rules for the naming and structure of folders, files, field names, etc. (e.g., all files begin with a date, location, and project name). Naming conventions help provide context to a data set and ensure that all members of a team follow the same standard of data collection and management.
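
A naming convention is easiest to follow when the names are generated and checked by a script rather than typed by hand. Below is a minimal sketch in Python. It assumes one particular convention (date, location, and project name joined by underscores, all lowercase); the pattern, the function names, and the sample values are illustrative assumptions, not a standard.

```python
import re
from datetime import date

# Assumed convention: YYYY-MM-DD_location_project.ext, all lowercase.
NAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_[a-z0-9-]+_[a-z0-9-]+\.[a-z0-9]+$")


def slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_name(day: date, location: str, project: str, ext: str) -> str:
    """Build a file name that follows the assumed convention."""
    return f"{day.isoformat()}_{slug(location)}_{slug(project)}.{ext.lower().lstrip('.')}"


def follows_convention(filename: str) -> bool:
    """Check whether an existing file name matches the assumed convention."""
    return bool(NAME_PATTERN.match(filename))


name = make_name(date(2017, 3, 4), "Lake Site A", "Water Quality", "csv")
print(name)                                       # 2017-03-04_lake-site-a_water-quality.csv
print(follows_convention(name))                   # True
print(follows_convention("final data v2.csv"))    # False
```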

    Permanent Identifiers

    A permanent identifier (or PID) is a set of numbers and/or characters, frequently in the form of a URL, that points to the location of a resource. PIDs are set up in such a way that even though the storage location of the resource may change over time (e.g. moving data from one university server to another), the PID will always point to the correct location. A DOI (Digital Object Identifier) is a commonly used type of PID.
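
To illustrate how a PID can take the form of a URL, the sketch below turns a DOI written in any of several common ways into a resolvable link through the doi.org resolver. The function name doi_to_url is hypothetical, and the DOI in the example is a made-up placeholder, not a real dataset.

```python
# Common ways a DOI is written in papers and metadata records.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Normalize a DOI string into a link that goes through the doi.org resolver."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        # Every DOI begins with the directory indicator "10."
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


# Placeholder DOI for illustration only.
print(doi_to_url("doi:10.1234/example-dataset"))  # https://doi.org/10.1234/example-dataset
```

The link stays the same even if the data moves, because the resolver, not the link, keeps track of the current storage location.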

    Follow-up Resources & Materials

    You may find it useful to review this handout early on in the planning stages of your project to help design the workflows of your project.

    The following resources provide more information on documenting your data and on research best practices that make documentation easier.