Welcome to Mozilla Science Lab's Open Data Primers!


File Formats FTW (For the Win)!

Though it’s not the most thrilling of topics, choosing a file format for data sharing is very, very important. You want to select a format that is accessible to as many people as possible. Avoid proprietary formats if you can! Make sure your file is readable by computers so no human is forced to re-enter your data into another document to use it. For example, the ever-popular PDF can be opened on many systems, but it makes it hard for other software programs to “read” the contents and extract data for analysis and re-use.

Most numeric and textual data can be communicated in text-based forms, like the comma-separated values (CSV) format, which can be opened with any text editor. This is the best possible format to enable wide and easy reuse. For other sustainable format types, refer to the Best Practices for File Formats guide distributed by Stanford Libraries.
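
As a rough illustration of how little is needed to share data this way, here is a minimal Python sketch that writes a small table to CSV text and reads it back using the standard library's csv module. The column names, values, and helper function names are invented for this example and are not part of the primer.

```python
import csv
import io

# Hypothetical example rows; the column names and values are invented.
rows = [
    {"site_id": "A1", "sample_date": "2015-06-01", "temperature_c": 18.4},
    {"site_id": "B2", "sample_date": "2015-06-02", "temperature_c": 19.1},
]


def to_csv_text(records):
    """Return the records as CSV text with a single header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Read CSV text back into a list of dictionaries.

    Every value comes back as a string, so numbers need converting.
    """
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv_text(rows)
print(csv_text)
```

The printed text is plain enough to open in any text editor, and any program with a CSV reader can load it without retyping anything.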

Here are a few other data “do’s” and “don’ts” for formatting spreadsheet data so that it stays easy for any machine to read:

  • Do keep raw data separate for easiest processing and readability.
  • Do use consistent naming conventions for files, worksheets, and variables.

  • Don’t include data visualizations or summaries in the same worksheet as raw data.
  • Don’t include multiple tables in the same worksheet.
  • Don’t put special characters or values in field names; these make it more difficult for other machines to cleanly open or load the file.
  • Don’t use formatting, like font or colored text, to add meaning to your data or identify key values; machines won’t read this formatting when running analysis and that information will be lost.
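
The advice about field names can be checked automatically before a file is shared. The sketch below is one possible approach in Python, under an assumed convention that is not stated in this primer: a "safe" field name starts with a letter and contains only letters, digits, and underscores. The function name and sample header are hypothetical.

```python
import re

# Assumed convention (not from the primer): a "safe" field name starts with
# a letter and contains only ASCII letters, digits, and underscores.
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def problem_field_names(header):
    """Return the field names that break the assumed naming convention."""
    return [name for name in header if not SAFE_NAME.fullmatch(name)]


# Hypothetical header row from a spreadsheet export.
header = ["site_id", "Temp (°C)", "sample date", "2015_total", "ph"]
print(problem_field_names(header))
```

Names flagged by a check like this can then be renamed by hand, for example turning a name with spaces into one joined by underscores. Your own project may prefer a different convention; the point is to apply one consistently.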