This section provides guidance on the actual structure of CSV output files and the important role of using controlled vocabularies.
Structuring CSV output files
To make output files easy for both computers and people to understand, we promote the use of a very simple file format. The basic constraints are:
- Only include one table per CSV file. This means avoiding nested tables and multiple tables next to, or underneath, each other.
- Remove commas. Since the file will be exported to a comma-separated format it is important to remove, or replace, any commas in the headers or data cells.
- Remove all annotations. CSV format will not keep any software-specific annotations such as Excel comments, diagrams, colours, special formatting or cell formulas.
- Remove any non-ASCII characters. Characters such as mathematical symbols and Greek characters (such as “μ”) may not be translated correctly into your CSV file. Please double-check that all characters look correct in your final output.
- Use the top three rows, and only those rows, to define what is in each column. See the section below for more details.
- Separate out your “compound phenomena” into separate columns. See the section below for more details.
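As a rough illustration of the comma and non-ASCII constraints above, the following Python sketch scans a table held in memory before it is exported. The function name `find_problem_cells` and the sample rows are hypothetical and are not part of any AEDA tool; the check is only a minimal aid and does not replace a visual inspection of the final file.

```python
def find_problem_cells(rows):
    """Return (row_number, column_number, reason) for each offending cell.

    rows is a list of lists of strings; numbering starts at 1.
    """
    problems = []
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if "," in cell:
                problems.append((r, c, "contains a comma"))
            if not cell.isascii():
                problems.append((r, c, "contains a non-ASCII character"))
    return problems


# Illustrative rows only (not real data): one cell has a Greek letter,
# another has a comma.
rows = [
    ["Site", "Concentration (μg/l)"],
    ["Exeter, Devon", "12"],
]
for problem in find_problem_cells(rows):
    print(problem)
```

Reporting the position of each problem, rather than silently replacing characters, leaves the decision about how to rename or re-encode a value with the data provider.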
Table 1 shows example CSV file contents that follow the structure outlined above.
Table 1. An example CSV file that is ready to be archived. The three header rows are discussed in more detail below.
|Year|Country|Crop|Total N applied - N(FERT)|
|Year|Countries|Crops|Applied Nitrogen Rate Excluding Fertiliser Application|
|Units: N/A|Units: N/A|Units: N/A|Units: kt/yr|
Using the AEDA Controlled Vocabulary
The use of a Controlled Vocabulary (CV) gives both Data Providers and End Users the ability to identify and re-use established and standardised terms to describe the phenomena being measured or modelled in their research. AEDA provides the tools to manage a common CV for all data contained within it.
The header rows of your CSV output files (see the example in Table 1 above) should be defined as follows:
- Row 1 should contain your own label to describe the phenomenon in that column.
- Row 2 should contain the “term” defined in the Archive CV.
- Row 3 should contain the units used for the phenomenon described, prefixed with “Units:”. Note that the CV may indicate the preferred units for a given phenomenon. The string “Units: N/A” should be used when there is no applicable unit (for phenomena such as “Country”).
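The three-row header convention above can be checked mechanically before a file is submitted. The Python sketch below is one possible way to do that using the standard `csv` module; the function `check_header_rows` and its optional `known_terms` argument are assumptions made for illustration, and the term list would have to be copied by hand from the Archive vocabulary pages, since no programmatic interface to the CV is described here.

```python
import csv
import io


def check_header_rows(csv_text, known_terms=None):
    """Return a list of problems found in the three header rows of csv_text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 3:
        return ["file has fewer than three header rows"]
    labels, terms, units = rows[0], rows[1], rows[2]
    problems = []
    if not (len(labels) == len(terms) == len(units)):
        problems.append("header rows have different numbers of columns")
    for col, (label, term, unit) in enumerate(zip(labels, terms, units), start=1):
        if not label.strip():
            problems.append(f"column {col}: row 1 label is empty")
        if not term.strip():
            problems.append(f"column {col}: row 2 term is empty")
        elif known_terms is not None and term not in known_terms:
            problems.append(f"column {col}: '{term}' is not in the supplied term list")
        if not unit.startswith("Units:"):
            problems.append(f"column {col}: row 3 should start with 'Units:'")
    return problems


# Header rows taken from Table 1.
table_1 = (
    "Year,Country,Crop,Total N applied - N(FERT)\n"
    "Year,Countries,Crops,Applied Nitrogen Rate Excluding Fertiliser Application\n"
    "Units: N/A,Units: N/A,Units: N/A,Units: kt/yr\n"
)
print(check_header_rows(table_1))
```

An empty list means that no problems were found with the header rows; it says nothing about the data rows beneath them.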
How do I find out what terms are in the CV?
The Controlled Vocabulary can be browsed and searched on the Archive web site (www.environmentdata.org/vocabulary). You can explore related terms and details such as the full definition and preferred units for a term.
In addition to the main subject vocabulary described above, AEDA has two other controlled vocabularies: one for geographic terms and one for taxonomic terms. Both of these have different relationships to other terms than the standard vocabulary does, and so they are managed separately. The geographic terms list is available at www.environmentdata.org/geographicterms and the taxonomic list at www.environmentdata.org/plist/taxon.
What do I use if my phenomenon is not listed in the CV?
If you are describing a phenomenon that is not listed in the CV then we strongly recommend that you propose your term to the vocabulary manager. You should make proposals for new terms by contacting the Archive team (email@example.com). We're working on a more streamlined technical solution for proposing new terms but this has yet to be implemented.
Simplifying the Structure by Separating out “Compound Phenomena”
Whilst working with your datasets it is often useful to describe the phenomenon being measured in a single column/array. An example might be “Applied Nitrogen Rate to Wheat in England 2010”. This saves time and space when working with the data. However, a “compound phenomenon” like this cannot be captured effectively in a Controlled Vocabulary, which in turn makes the data harder for the end user to discover, extract and compare with other similar data.
We strongly recommend that you separate out “compound phenomena” and record each of their components in a separate column in your output files.
For example, “Applied Nitrogen Rate to Wheat in England 2010” can be separated into four columns:
- Applied Nitrogen Rate
- Crop Type
- Country
- Year
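The separation described above can be sketched in Python as follows. The example writes the three header rows from Table 1 followed by one data row per record, so that the year, country and crop each sit in their own column rather than inside a compound label. The function name `write_separated_csv` is hypothetical, and the numeric value in the usage line is a placeholder rather than real data.

```python
import csv
import io

# Three header rows: own label, CV term, units (taken from Table 1).
HEADER_ROWS = [
    ["Year", "Country", "Crop", "Total N applied - N(FERT)"],
    ["Year", "Countries", "Crops",
     "Applied Nitrogen Rate Excluding Fertiliser Application"],
    ["Units: N/A", "Units: N/A", "Units: N/A", "Units: kt/yr"],
]


def write_separated_csv(records):
    """Return CSV text with one column per component of the phenomenon.

    records is an iterable of (year, country, crop, value) tuples.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(HEADER_ROWS)
    for year, country, crop, value in records:
        writer.writerow([year, country, crop, value])
    return buffer.getvalue()


# The value 0.0 is a placeholder, not real data.
print(write_separated_csv([(2010, "England", "Wheat", 0.0)]))
```

Keeping the components as separate fields from the start means further years, countries or crops become additional rows, and the header rows do not need to change.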
Whilst this approach does increase the volume of data provided, it has the following benefits:
- Each column header can refer to a specific term in the CV.
- When the archive system ingests the CSV file it is able to index the data based on the headers.
- End users will be able to understand the outputs because they are structured in a simple tabulated format with clearly defined labels.