Managing Structured Metadata
As indicated above, the main task for the Data Provider readying a dataset for the archive is to document the research that generated the data. This structured metadata is managed in a conceptual model that has been developed by the archive teams following a set of international metadata standards. The conceptual model defines a set of classes, each of which consists of a number of properties that need to be defined for each dataset in the archive.
Overview of the Conceptual Model
The conceptual model can be thought of as a set of database tables that attempt to define the most relevant information about a dataset and how it was produced. Each of these tables is represented as a class and each field in a table as a property of that class. Each class may also have relationships to other classes. A simple example is a Measurement class, which could have properties such as Location, Time, Phenomenon Measured and Instrument. For a given instance of a Measurement (based on a real measurement undertaken) you might populate the properties with values such as Location = (342500, 247500), Time = "2012-01-01 12:00:00", Phenomenon Measured = "N2O flux" and Instrument = "N2O Gas Chromatogram".
A real-world analogy helps to distinguish a class from an instance. For example, the class “car” describes the concept of a vehicle with 4 wheels, an engine etc. Your particular “blue 1982 Ford Escort” is an instance of the car class.
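The class/instance distinction can also be sketched in code. The following is a minimal illustration only, assuming a simplified Measurement class with just the four properties named above; the Python names and types are hypothetical and are not the archive system's actual definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass
class Measurement:
    """Hypothetical, simplified class: the concept of a measurement."""
    location: Tuple[int, int]   # coordinate pair, as in the example above
    time: datetime
    phenomenon_measured: str
    instrument: str


# One instance of the class, populated with the example values from the text.
n2o_measurement = Measurement(
    location=(342500, 247500),
    time=datetime(2012, 1, 1, 12, 0, 0),
    phenomenon_measured="N2O flux",
    instrument="N2O Gas Chromatogram",
)
```

The class is written once; every real measurement you describe becomes a new instance with its own property values.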
A class attempts to describe aspects of such real-world entities as activities, surveys, people, instruments and files. The main classes are those that describe the high-level concepts that we need to capture. Figure 1 shows the main classes and the relationships between them. In this section the classes are described without listing all their properties, so that the structure of the model is easier to understand. Full descriptions of all the classes along with their properties are provided later on in this document.
ADD FIG 1 HERE
Examples of the main classes
Examples of the kind of information that might constitute instances of the main classes are:
- “Activity” class: an N2O field experiment; a research project related to river pollution; a NERC thematic programme (that funds a set of related sub-activities).
- “Dataset” class: a set of data components (and/or supplementary files) produced by that N2O field experiment; a set of data components that are synthesised to generate a greenhouse gas emissions factor.
- “Data Component” classes: flux measurements; meteorological measurements; responses to a farm practice survey; a river-flow simulation.
- “Supplementary File” classes: a digital photograph; a Shapefile describing a set of geometries; a raw data binary file from which a CSV file is extracted to construct a Data Component.
The Data Component class has been developed because a Dataset is typically composed of a number of sub-components. For this community, the standard unit of a Data Component is typically a table that can be encoded into an Excel Worksheet or a CSV (comma-separated values) file.
Types of Data Component
Five types of Data Component have been developed to describe the datasets of this scientific domain. Each Data Component type allows us to document information that relates to the particular concept we are describing. The Data Component sub-classes and their differences are explained in figure 2.
ADD IMAGE HERE
Figure 2. The 5 types of Data Component.
Additional classes
Some additional classes are defined within the conceptual model to help document the details of some of the properties of the main classes. A good example is the Process Chain class, a property of the Dataset class that describes how a set of Data Components are related within a workflow to generate a final result. For example, the outputs of a number of Measurements might be used by a Simulation. The result of the Simulation might be compared with the results from a Literature Review in a Synthesis that generates the final product. This chain of processes, relating a set of Data Components, is sufficiently complex to warrant its own class description in the conceptual model.
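The workflow in this example can be pictured as a small graph of Data Components. The sketch below is illustrative only: the DataComponent class, the upstream function and the component names are hypothetical, and the archive's actual Process Chain class is not assumed to work this way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataComponent:
    """Hypothetical stand-in for any of the Data Component types."""
    name: str
    kind: str                                   # e.g. "Measurement", "Simulation"
    inputs: List["DataComponent"] = field(default_factory=list)


def upstream(component):
    """Return the names of all components feeding into `component`, inputs first."""
    names = []
    for parent in component.inputs:
        for name in upstream(parent) + [parent.name]:
            if name not in names:               # list each component only once
                names.append(name)
    return names


# Two Measurements feed a Simulation; the Simulation and a Literature Review
# feed a Synthesis. (Supplementary File inputs are omitted for brevity.)
flux = DataComponent("flux measurements", "Measurement")
met = DataComponent("meteorological measurements", "Measurement")
sim = DataComponent("emissions simulation", "Simulation", inputs=[flux, met])
review = DataComponent("literature review", "Literature Review")
final = DataComponent("final synthesis", "Synthesis", inputs=[sim, review])

process_chain = upstream(final) + [final.name]
```

Here process_chain lists the components in an order in which every component appears after the components it depends on, which is the information a Process Chain needs to convey.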
Overview of all classes
Figure 3 depicts all the classes for which you might need to provide metadata.
ADD FIG 3 HERE
Figure 3. All the classes defined in the conceptual model, as represented in the archive system.
In total there are 18 classes that the Data Provider may need to know about. However, a number of the classes are special cases that will not apply to all datasets. All boxes with a dashed black border in figure 3 are considered optional and their usage will depend on the type of process/data being described. It is also worth noting that the five blue boxes represent different types of Data Component. In most cases, a maximum of three of these types will actually be used.
Details of the Conceptual Model Classes
This section provides a more detailed view of the classes and their relationships. We recommend reading the following text whilst viewing figure 3 in order to gain an understanding of how the classes work together to describe different aspects of the research and its associated data.
The box in the top-left corner includes Individuals and Organisations. These are defined once but are related to many different classes including the Activity, Dataset and Data Components.
The Activity defines the reason why you did your research. It is typically something like a programme, project or experiment and it may be linked hierarchically to other activities. Once the archive is established, you might find that the Activity already exists in the system, in which case you do not need to define it. However, in most instances you will want to define the Activity yourself.
The Dataset can be thought of as a container for a set of Data Components (such as Measurements, Analyses and Simulations). It doesn’t contain much metadata itself, but it acts as a means of grouping its Data Components together. A Dataset can also contain one or more Supplementary Files.
The Data Component types make up the main content of the archive. These five types (Analysis, Literature Review, Measurement, Simulation and Synthesis) are very similar but have slightly different properties or inputs. It is important to note that the actual data files provided to the archive are all associated as the “Result” of Data Components. Each Data Component will have at least one data file associated with it. All Data Components contain a significant amount of useful metadata such as who did the work, the temporal and spatial extent, the phenomena measured/simulated, a description, links to documentation and links to other classes to describe the “process” that generated the results.
The Measurement (Data Component) is distinct from the others in that it has no inputs. The data is all generated through the Acquisition class. The Acquisition provides details of how the data was collected or measured. This will include a text description but may, where appropriate, include an Instrument Platform Pair. The Instrument Platform Pair links an Instrument to the Platform on which it is deployed. The Instrument contains a description and documentation whilst the Platform has its own documentation as well as a location. In cases where the Platform is mobile (such as on board a boat or aeroplane), the Acquisition may also contain a Mobile Platform Operation to explain how the platform was operated.
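The nesting of these classes may be easier to follow as a sketch. The code below is a hypothetical outline that covers only the properties mentioned in this paragraph; the Python class and field names, and all example values, are assumptions made for illustration and do not reflect the archive system's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instrument:
    description: str
    documentation: str


@dataclass
class Platform:
    documentation: str
    location: str


@dataclass
class InstrumentPlatformPair:
    instrument: Instrument      # the Instrument ...
    platform: Platform          # ... deployed on this Platform


@dataclass
class MobilePlatformOperation:
    description: str            # how the mobile platform was operated


@dataclass
class Acquisition:
    description: str            # always present
    instrument_platform_pair: Optional[InstrumentPlatformPair] = None
    mobile_platform_operation: Optional[MobilePlatformOperation] = None


@dataclass
class Measurement:
    acquisition: Acquisition    # a Measurement has no inputs, only an Acquisition


# Hypothetical example: a fixed (non-mobile) platform, so no
# Mobile Platform Operation is supplied.
measurement = Measurement(
    acquisition=Acquisition(
        description="Chamber samples analysed in the laboratory",
        instrument_platform_pair=InstrumentPlatformPair(
            instrument=Instrument("N2O Gas Chromatogram", "instrument manual"),
            platform=Platform("site description", "field plot"),
        ),
    )
)
```

The two optional fields mirror the text: the Instrument Platform Pair is included only where appropriate, and the Mobile Platform Operation only when the Platform is mobile.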
The Analysis, Literature Review, Simulation and Synthesis (Data Components) all use a Computation to generate the data. The Computation includes information such as documentation, software references, algorithms used and a description of input/output formats. The only difference between these 4 types of Data Component is the number of inputs they include. An Analysis has 1 or more inputs and a Literature Review has zero or more inputs (if zero then it must have Supplementary Files as inputs). A Simulation has zero or more inputs and a Synthesis has 2 or more inputs.
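The input rules in this paragraph can be expressed as a simple check. This is a sketch under stated assumptions: the function and its parameter names are hypothetical, and it assumes that the minimum counts refer to Data Component inputs, with Supplementary File inputs counted separately. The text does not state whether the archive system enforces the rules in this way.

```python
# Minimum number of inputs for each Computation-based Data Component type,
# taken from the paragraph above.
MIN_INPUTS = {
    "Analysis": 1,
    "Literature Review": 0,
    "Simulation": 0,
    "Synthesis": 2,
}


def check_inputs(kind, n_data_component_inputs, n_supplementary_file_inputs=0):
    """Return a list of problems; the list is empty if the inputs are acceptable."""
    if kind not in MIN_INPUTS:
        raise ValueError(f"{kind!r} is not a Computation-based Data Component type")
    problems = []
    minimum = MIN_INPUTS[kind]
    if n_data_component_inputs < minimum:
        problems.append(
            f"{kind} needs at least {minimum} input(s), got {n_data_component_inputs}"
        )
    # Special case: a Literature Review with zero inputs must have
    # Supplementary Files as inputs instead.
    if (
        kind == "Literature Review"
        and n_data_component_inputs == 0
        and n_supplementary_file_inputs == 0
    ):
        problems.append(
            "Literature Review with zero inputs must have Supplementary Files as inputs"
        )
    return problems
```

For example, check_inputs("Synthesis", 1) reports one problem because a Synthesis needs at least two inputs, whereas check_inputs("Literature Review", 0, 1) reports none.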
The other possible component of a Dataset, or input to one of the 4 types of Data Component mentioned in the paragraph above, is the Supplementary File. The Supplementary File is a general class to describe files that (a) are important to include in the archive but (b) cannot be fully described as Data Components and/or are provided in a format not supported by the archive system. They are described in more detail above.