Open Data Training
Guide 3: Sharing and Publishing Data
This training module is a very quick introduction to open data for newcomers to the topic, and for those who know a bit but want to know more. This material was produced by Mozilla Science Lab, a program to encourage the use of open source practices and web technologies to do better science.
Total Time to Complete
About 1 hour, including transitions.
- Choose appropriate file formats for sharing
- Identify interdisciplinary and disciplinary repositories
- Workshop introductions, and an introduction to our topics ( minutes)
- File formats: What is a proprietary file format? (10 minutes)
- Choosing a repository (35 minutes)
- Additional Resources
- Familiarity with Open Data Training Primer 3
- Close review of Instructor Guides and all supporting materials for this module
Topic 1: Introductions and Discussion about Open Data
- Instructor (3 minutes)
- Explain your background, how you became involved in open data.
- Why this training, why Mozilla? (1 minute)
  - Introduce the training series and how it was created (collaboration, sprints, output of fellows program)
  - Structure of the session, content exploration through activities
  - Why MSL and Mozilla are involved, your relationship to Mozilla
- Why open data now? (1 minute)
  - More data than ever (define types of data here)
  - Pressure from funders, who want more impact from data
  - The web as a sharing and collaboration tool
- Instructor (3 minutes)
Topic 2: File Formats FTW!
What is a proprietary file format?
- Define ‘proprietary’ file formats for the group.
- Ask: What barriers to sharing does proprietary formatting present?
- The types of proprietary files will differ by field. Ask the students which proprietary formats are common in their field.
- Participants spend 2 minutes thinking of a definition and, if possible, supporting examples.
- Working with the students, identify which file formats are preferred for archiving data in their field, and which proprietary formats need conversion prior to sharing.
Buffer the information loss: many proprietary file formats can encode information in ways that are lost when converting to open file formats. For example, if an MS Excel file encodes information by cell color, that information is lost when the file is converted to .CSV. Strategize with students: how do you buffer against information loss when changing your files to appropriate archival formats?
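One strategy to show students is to move color-encoded information into an explicit column before exporting. The following is a minimal sketch, not a definitive workflow: the sample rows, the `fill_color` field, the `COLOR_MEANINGS` mapping, and the `quality_flag` column name are all hypothetical, and the rows are hard-coded rather than read from a real spreadsheet.

```python
import csv
import io

# Hypothetical spreadsheet rows: each record carries the fill color that
# someone used to flag the row. Hard-coded so the sketch is self-contained.
rows = [
    {"site": "A", "count": 12, "fill_color": "yellow"},
    {"site": "B", "count": 7, "fill_color": None},
    {"site": "C", "count": 30, "fill_color": "red"},
]

# Assumed meaning of each color. This mapping also belongs in the README.
COLOR_MEANINGS = {
    "yellow": "needs_review",
    "red": "known_error",
    None: "ok",
}


def to_csv_with_flags(rows):
    """Return CSV text in which cell color becomes a 'quality_flag' column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["site", "count", "quality_flag"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "site": row["site"],
            "count": row["count"],
            "quality_flag": COLOR_MEANINGS[row["fill_color"]],
        })
    return buffer.getvalue()


csv_text = to_csv_with_flags(rows)
print(csv_text)
```

The design choice worth discussing is that the meaning of the color, not the color itself, ends up in the archival file, so the CSV stays readable without the original software.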
Topic 3: Choosing a repository
How will you decide where to share your data? There is an element of ‘knowing your audience’ here: some fields may have particular conventions, so if you know your learners’ backgrounds, this will help guide how you approach this topic. Check with a librarian at your institutional or public library.
Many libraries these days, particularly at research institutions, have librarians trained in research data services. Start there to see if you can get customized help to find a place to share your data. Librarians love to help.
For unknown or mixed groups, you’ll want to show them how to search Re3Data.org (http://www.re3data.org).
Re3Data is a global catalog of online data repositories. You can browse the catalog by subject area, country, or type of content (e.g. images, code, audiovisual). For each repository, it also identifies what access restrictions may be in place and what permanent identifiers the repository uses, and it links to the repository's policies.
Discuss advantages and disadvantages of generic versus field-specific repositories. For example, how do freely available generic digital repositories such as Figshare or Zenodo compare to discipline-specific repositories? How does either of these compare to institutional repositories on axes of quality control, discoverability, and effort required to share?
Explore the websites for several data repositories your students might encounter. Identify one data repository to ‘deep dive’ into and explore its data submission interface. What information needs to be included? Is there information about its quality control process? How does the repository suggest the data be cited?
Let’s try it!
Take a look at these example data sets from the following repositories:
- University of California San Diego Digital Collections: San Diego County Bee Species List
- Purdue University Research Repository (PURR): Tower Sunflower Polarization
- Dryad: Data from: Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov
- Figshare: Altmetric’s Top 100 Articles
For each data set record consider the following questions:
1. Are the questions in the DATA_README answered by the information displayed in the data set record?
2. Do some repositories provide more or less information than others? What are the differences?
3. Would the information provided in the data set record make it easier for you to discover or use it? Which information?
4. Is there missing or incomplete information that would make it more difficult for you to find or use this data set?
5. Can you easily find the policies in place for each of these repositories? Whether you can find their policies or not, do you have questions about this data set and this repository? What are they?
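If students record what they find in each data set record, a short script can tally which fields each repository leaves blank. This is a minimal sketch under stated assumptions: the field names in `EXPECTED_FIELDS` and both sample records are hypothetical placeholders, not taken from the DATA_README or from any of the repositories listed above.

```python
# Hypothetical checklist of fields a reader might look for in a record.
EXPECTED_FIELDS = [
    "title",
    "creator",
    "description",
    "license",
    "identifier",
    "date_published",
]


def missing_fields(record, expected=EXPECTED_FIELDS):
    """Return the expected fields that are absent or empty in a record."""
    return [
        field for field in expected
        if not str(record.get(field) or "").strip()
    ]


# Made-up notes standing in for what students copy from two record pages.
records = {
    "repository_one": {
        "title": "Example species list",
        "creator": "A. Researcher",
        "description": "Counts by site",
        "license": "CC0",
        "identifier": "",
        "date_published": "2016",
    },
    "repository_two": {
        "title": "Example sensor readings",
        "creator": "B. Researcher",
        "description": None,
        "license": None,
        "identifier": "example-id-123",
        "date_published": "2015",
    },
}

gaps_by_repository = {
    name: missing_fields(record) for name, record in records.items()
}

for name, gaps in gaps_by_repository.items():
    print(name, "is missing:", ", ".join(gaps) if gaps else "nothing")
```

The resulting lists give the group a concrete starting point for questions 2 and 4.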
Topic 4: Resources and Wrap
Provide links to Primer 3, other relevant resources.