Welcome to Mozilla Science Lab's Open Data Primers!

Creating a Data_README

In the world of computer programming, a file that comes with a piece of software or code and contains critical information about its origins and how to use it is called a README file. The README’s all-caps title emphasizes how important it is to read the file carefully before you use the code or software. We can borrow this documentation convention when sharing data: bundling a README file with a dataset ensures that new users have the information they need to use the data responsibly and effectively. README files are usually plain text files stored in the top-level directory or folder alongside the rest of your data files so that they can be found easily. (For more information on different types of data documentation files, see the Additional Resources section at the end of this primer.)

We can think of our DATA_README file as a data reuse plan. A DATA_README is the highest level of metadata for a dataset -- it should be written for humans and contain all the information a person might need to understand how the data were collected, processed, analyzed, licensed, and presented. A useful DATA_README will answer five questions: what, who, when, where, and how. A great DATA_README is iterative. You can create the file as you start data collection and add to it as your project progresses. Click here to walk through an exercise on creating a great DATA_README.

Here are some Dos and Don’ts for a great DATA_README:

Do include a breakdown of naming conventions that will apply for all files, worksheets, and variables used in the dataset.
Do include names and contact information (when appropriate) for the people who collected and analyzed the data, supervised the project, and will maintain the dataset.
Do include licensing information and use a standard license.
Do include information documenting the equipment, software, and other tools used.
Do include an overview of the experimental/project design.
Do explain how data were collected, processed, analyzed, exported, and/or presented.
Do explain which file formats come from which steps in the experiment/project.

Don’t ask users to “email authors about reuse” instead of picking a license. It’s better to pick a restrictive license than to use no license at all.
Don’t include personal information about data authors or maintainers without their consent.
Don’t use abbreviations, acronyms, or code names without defining them.
Don’t make the reader do too much guesswork: if there are known issues or quirks in the data, note them in the README.
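
One way to keep a DATA_README iterative is to start from a skeleton and fill it in as the project progresses. The Python sketch below is only an illustration of that idea: the section names, the prompts, and the helper functions `missing_sections` and `build_data_readme` are hypothetical choices that mirror the checklist above, not a standard layout. Sections that have not been written yet are marked with a TODO so they are easy to spot.

```python
# Hypothetical section names and prompts that mirror the Dos above.
# They are assumptions for illustration, not a standard DATA_README layout.
SECTIONS = [
    ("WHAT", "Overview of the experimental/project design and the dataset."),
    ("WHO", "People who collected, analyzed, supervised, and maintain the data."),
    ("WHEN", "Dates of data collection and of the latest update."),
    ("WHERE", "Where the data were collected and where the files are stored."),
    ("HOW", "Equipment, software, and steps used to collect and process the data."),
    ("NAMING CONVENTIONS", "Rules for naming files, worksheets, and variables."),
    ("FILE FORMATS", "Which file formats come from which steps of the project."),
    ("LICENSE", "Name of the standard license that applies."),
    ("KNOWN ISSUES", "Known quirks or problems in the data."),
]


def missing_sections(answers):
    """Return the names of sections that have no text yet, in checklist order."""
    return [name for name, _ in SECTIONS if not answers.get(name, "").strip()]


def build_data_readme(title, answers):
    """Return plain-text README content; empty sections become TODO prompts."""
    lines = [title, "=" * len(title), ""]
    for name, prompt in SECTIONS:
        text = answers.get(name, "").strip()
        lines.append(name)
        lines.append(text if text else f"TODO: {prompt}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder answers; replace them with details from your own project.
    draft = {
        "WHAT": "Placeholder description of the project and dataset.",
        "LICENSE": "CC0 1.0",
    }
    print(build_data_readme("DATA_README", draft))
    print("Still to write:", ", ".join(missing_sections(draft)))
```

Running the script early in a project prints a draft with most sections marked TODO; rerunning it as you add answers shows what is still missing before you share the data.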

A DATA_README is intended to be read by humans. Another best practice is to create machine-readable metadata for your dataset. Machine-readable metadata allows your data to be read, understood, and placed in its correct context by computers. Creating a machine-readable companion file will help make your data indexable, searchable, and discoverable. Detailed instructions on how to do this are outside the scope of this primer, but if you're interested, you can read more about machine-readable metadata at Project Open Data.
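
To give a rough sense of what a machine-readable companion file might look like, here is a minimal Python sketch that assembles a few metadata fields and serializes them as JSON. The field names (`title`, `description`, `keywords`, `license`, `contact`, `modified`) and the functions `build_metadata` and `to_json` are assumptions made for illustration; they are not taken from any particular metadata schema, so check the schema your repository or catalog expects before adopting them.

```python
import json
from datetime import date


def build_metadata(title, description, keywords, license_name, contact_name, modified):
    """Collect dataset metadata in a plain dictionary.

    The field names are illustrative assumptions, not a published schema.
    """
    if not title.strip():
        raise ValueError("A dataset title is required.")
    return {
        "title": title,
        "description": description,
        # Remove duplicate keywords and sort them for stable output.
        "keywords": sorted(set(keywords)),
        "license": license_name,
        "contact": {"name": contact_name},
        # ISO 8601 dates (YYYY-MM-DD) are easy for software to parse.
        "modified": modified.isoformat(),
    }


def to_json(metadata):
    """Serialize the metadata as indented JSON text."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values; replace them with details from your own dataset.
    example = build_metadata(
        title="Example dataset",
        description="Placeholder description of what the dataset contains.",
        keywords=["water", "temperature", "water"],
        license_name="CC0 1.0",
        contact_name="Data Maintainer",
        modified=date(2020, 1, 15),
    )
    print(to_json(example))
```

The printed JSON could be saved next to the DATA_README so that people read one file and software reads the other.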