To introduce our dataset to a complete stranger who might want to reuse it, we’ll need some really great metadata. Metadata is information about information. What does that mean? Let’s look at an example-- let’s meet some strange data on the internet, and try to get to know it better:
This is a number, a value, and it certainly looks like a data point. It could be the most important value in a data set, the value that proves a groundbreaking theory... or the one that sinks it. But it’s obviously meaningless if we don’t know what it measures. Here’s our data point, with a bit of context:
From this expanded dataset we can begin to make out a story. The first column seems to be date and time, so this is probably part of a time series. In the last field, each entry is tagged as an “earthquake,” and there are locations listed, too-- we might guess that we’re looking at seismic data. It seems that our highlighted value is linked to an earthquake West of Anchor Point, Alaska, but we still don’t know what that “3.2” value actually represents.
A researcher intimately acquainted with this dataset and collection methods might have no trouble reading and understanding it. But if we imagine that we’re newcomers to data and that this is all the information we have, the data set is unusable because we just don’t know enough about it. What happens if we have access to a bit more information?
Finally, labels for every column! These labels are metadata, or information that describes data. This is much better-- at last we see the highlighted value is labeled “mag.” Since we’re talking earthquakes, perhaps that’s an abbreviation for magnitude. Some of the other labels are familiar: latitude and longitude, depth and place. We can infer that his data shows seismic events logged in early May of 2016.
And yet, mysteries remain. For example, what is the unit of measure for depth? And what do “nst,” and “rms” stand for? What scale is used for “mag”? What is “mag type?” Even if we had more familiarity with the terminology, we still don’t know the origins of the data, who collected it, and why and how. Even with labels, we don’t have the whole picture.