Welcome to Mozilla Science Lab's Open Data Primers!


Where to Find Data

Where you find data will depend a lot on the type of work you’re doing, the type of questions you’re asking, and the conventions of your field. However, there are some universal good starting points.

1. Other researchers

This is the old-fashioned way, but it still works in a lot of situations! It used to be that the only way to get research data produced by others was to contact the researchers directly and ask for it. This is changing because many funders and journals now require data to be shared at the time of publication.

Despite the move towards sharing the data underlying publications, the majority of research data is still held by the original producers. In these cases, you may have to navigate more interpersonal dynamics in order to get your hands on data. The data producer may place conditions on its use, like sharing authorship on any products, or approving how the results of your analysis are interpreted. Although this is not necessarily the ideal circumstance for data sharing, working directly with data producers can also result in very positive collaborative relationships, deep insights into the data, and network building. We advise having a frank discussion about modern conventions of data licensing when talking to collaborators and data producers.

2. Mine the literature

Sometimes, raw data is included as a supplement (or supporting information) accompanying a research paper. Supplementary data is generally hosted on the publisher's website along with the publication. In other cases, usually for large or complex datasets, a paper will link to data published in a centralized or institutional data repository. There are a variety of cool tools for pulling data out of published literature - an approach that’s particularly useful if you’re interested in doing a meta-analysis on a topic that has been studied since before data sharing was common. For example:

  • Tabulizer and Tabula can be used for pulling data out of PDF tables.
  • Tools like WebPlotDigitizer can be used to pull data out of figures and graphs.
  • ContentMine can pull text-based information out of the literature.

3. Search for existing statistics

While many people use the terms “data” and “statistics” interchangeably, there is an important distinction between the two. “Data” is the raw source, which, when summarized and interpreted, becomes “statistics”. It follows that if you find a statistic that relates to your topic and track down its source, you may just find the data set you’re looking for. For example, if you’re looking for North American statistics produced by government agencies, good places to start are Statistics Canada or USA.gov.
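The distinction between data and statistics can be illustrated with a minimal Python sketch. The measurements below are made up purely for illustration, and the summarize helper is a hypothetical name, not part of any tool mentioned in this primer. The point is only that a statistic is a summary computed from raw observations, so the raw observations must exist somewhere upstream of any statistic you find.

```python
import statistics

# Hypothetical raw data: one value per observation (invented for illustration).
raw_heights_cm = [150.0, 160.0, 170.0, 180.0]


def summarize(values):
    """Reduce raw observations to a few summary statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# The "statistics" are what usually gets published; the list above is the "data".
summary = summarize(raw_heights_cm)
print(summary)
```

Going from the list to the summary is easy, but going the other way is impossible, which is why tracking a statistic back to its source is worth the effort.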

4. Research data repositories

In Primer 3, you learned how to find a repository that is appropriate for your data. If you’re looking for similar data produced by others, the repository where you’d likely deposit your own data is a great place to start! Common, searchable, interdisciplinary repositories include Dryad and ICPSR, but there are also popular subject-specific repositories. For example, GenBank specifically handles genetic data. Data in repositories like these undergo quality control to ensure they’re accompanied by appropriate metadata, so that future researchers can use the data appropriately. There are also repository services like Figshare and Zenodo, which host a lot of data but don’t enforce metadata standards, making it a little trickier to discover relevant data sets there.
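Many repositories can also be searched programmatically. As a rough sketch, the Python below builds a search URL for Zenodo and pulls record titles out of a JSON response. It assumes Zenodo's public REST search endpoint at https://zenodo.org/api/records accepts the q, type, and size query parameters and returns results under hits.hits, so check the current API documentation before relying on it. The function names are hypothetical, and the code deliberately makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint for Zenodo's public record search (verify against current docs).
ZENODO_SEARCH = "https://zenodo.org/api/records"


def build_search_url(query, size=10):
    """Build a search URL restricted to records of type 'dataset'."""
    params = {"q": query, "type": "dataset", "size": size}
    return f"{ZENODO_SEARCH}?{urlencode(params)}"


def titles_from_response(payload):
    """Extract record titles from a decoded JSON search response.

    Assumes the response nests results as payload["hits"]["hits"], each with
    a "metadata" dict; missing keys are tolerated and yield empty strings.
    """
    hits = payload.get("hits", {}).get("hits", [])
    return [hit.get("metadata", {}).get("title", "") for hit in hits]


url = build_search_url("soil moisture", size=5)
print(url)
```

To run a real search, fetch the URL with any HTTP client, decode the JSON body, and pass the resulting dictionary to titles_from_response.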

5. Public data repositories

Finally, there are vast amounts of data produced by governments, non-profit organizations, and other groups outside the academic research community. Often, these data are publicly available due to government requirements. How these data are stored and made available varies by field, country, and organization. Several countries have their own government data repository named on the “Data.gov” pattern, such as Data.gov in the U.S., Data.gov.uk in the UK, and Data.gov.au in Australia. Individual agencies and organizations, like the World Wildlife Fund, may also produce their own data products. A good way to get an understanding of the breadth and depth of public data available in your field is to work with a librarian who specializes in your discipline or in research data in general.