This document is intended to be a “living” document. We highly appreciate any comments and suggestions for improvement. What are your experiences with research data management tasks? Can you provide solutions for specific tasks? Your feedback helps us make this document better for you and your colleagues.

Chapter 2 Best Practices

A logical and consistent folder structure, naming conventions, versioning rules as well as provision of metadata for your data files help you and your colleagues find and use the data.

The following rules will save time in the long run and help avoid data duplication. With these recommendations we intend to comply with Hart et al. (2016).

In general, data related to one project should be accessible to all persons involved in the project. Project-relevant files should not be stored permanently on local hard drives or on personal network drives. Project data should therefore be stored on network drives to which all project members have access.

2.1 Folder Structure

A research institute aims at creating knowledge from data. A typical data flow may look like the following: raw data are received or collected from external sources or created by our own measurements. The raw data are processed, i.e. cleaned, aggregated, filtered and analysed. Finally, result data sets are composed from which conclusions are drawn and published. In short: raw data are processed and become result data.

In computer science, the clear distinction between data input, data processing and data output is a well-known and widely used model to describe the structure of an information processing program. It is referred to as the input-process-output (IPO) model. We recommend applying this model to distinguish three categories of data: raw data (input), data in processing, and result data (output).

Using the IPO model has the following benefits:

  • Minimises the risk of files and folders being overwritten or deleted by automated data processing (e.g. scripts).

  • Raw data are protected against accidental overwriting.

  • Helps to keep files and folders clearly organised.

  • Reflects roles and responsibilities of different project team members (e.g. the project manager is mainly interested in results).

  • Helps avoid deep folder structures.

In line with these three categories, we suggest creating three different areas, represented by three different network drives, at the top level of your file server. The first area is for raw data, the second for data in processing and the third for result data.

Within each area, the data are organised by project first, i.e. each project is represented by one folder within each of the network drives:

//server/rawdata
    project-1/
    project-2/
    …
    
//server/processing
    project-1/
    project-2/
    …

//server/results
    project-1/
    project-2/
    …

The sub-folder structure within the project folders of each of these top-level network drives is described in the following sections.

Raw data

As raw data we define data that we receive from a measurement device, a project partner or another external source (e.g. internet download), even if these data were processed externally. Raw data are, for example, analytical measurements from laboratories, filled-in questionnaires from project partners, meteorological data from other institutes, measurements from loggers or sensors, or scans of handwritten sampling protocols. Especially in the environmental sciences, raw data often cannot be reproduced at all, or only at high cost (e.g. rainfall, river discharge measurements). They are therefore of high value. Raw data are often large in terms of file size or number of files (e.g. measurements of devices logging at high temporal resolution). Raw data may already come in a complex, deep folder structure. Raw data are closely related to metadata, such as sensor configurations generated by loggers or email correspondence when receiving data by email from external partners. We acknowledge the high value of raw data by storing them in a dedicated, protected space and by requiring them to be accompanied by metadata.

Raw data are stored in an unmodified state. All modifications of the data are to be done on a copy of the raw data in the “processing” space (see next section). The only modification allowed is the renaming of a raw data file, provided that the renaming is documented in the metadata. Once stored, raw data are to be protected from being accidentally deleted or modified. This is achieved by making the raw data space write-protected ([link] FAQ: How to make a file write protected).
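
The linked FAQ describes the recommended procedure for write protection. As an illustration only, here is a minimal sketch in R (the project path is a hypothetical example; on Windows, Sys.chmod() can only set or clear the read-only attribute):

# Set the read-only attribute for all files below a (hypothetical) raw data
# folder; on Windows this corresponds to the "read-only" file property
raw_files <- list.files("//server/rawdata/test-project", recursive = TRUE,
                        full.names = TRUE)
Sys.chmod(raw_files, mode = "0444", use_umask = FALSE)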

We propose to organise raw data:

  • by project first and

  • by origin (i.e. source or owner) of the data second.

We will create a network folder //server/rawdata in which all files have the read-only property set. We suggest storing raw data by project first and by the organisation that owns (i.e. generated or provided) the data second. This could look like the following:

//server/rawdata

  ORGANISATIONS.txt
  
  PROJECTS.lnk [Symbolic Link to PROJECTS.txt in //server/projects$]
  
  test-project/
    bwb/
      rain/
        METADATA/
        rain.xls
      laboratory/
        METADATA/
        laboratory.xls

    kwb/
      discharge/
        METADATA/
        q01.csv
        q02.csv
        q03.csv

Data Processing

As data in processing we understand data at any stage of processing between raw and final, i.e. all intermediate results, such as different stages of data cleaning, different levels of data aggregation or different types of data visualisation.

We recommend storing data at these stages in their own space on the file system. This space is meant to be a “playground” where researchers are asked to store all of these intermediate results. This is where different approaches, models or scenarios can be tested and where, as a result, different versions of data are available. The data processing space is intended to be used for data only, not e.g. for documents, presentations or images.

Compared to the raw data network drive the data processing network drive is expected to require much more disk space.

In the data processing area the files are stored by project first. Within each project data may be organised by topic and/or data processing step.

//server/processing

test-project/
  01_data-cleaning
    METADATA
    rain_raw.csv
    rain.csv
    quality.csv      
    discharge.csv
  02_modelling
    summer
    winter
    VERSIONS
      v0.1
      v1.0
        summer
        winter
  software

Result Data

By result data we mean clean, aggregated, well-formatted data sets. Result data sets are the basis for interpretation or assessment and for the formulation of research findings. We consider all data that are relevant for the reporting of project results as result data. Result data will very often be spreadsheet data, but they can also comprise other types of data such as figures or diagrams. We propose to prepare result data in the data processing area (see above) and to put symbolic links into the result data folder that point to the corresponding locations in the data processing folder. The idea is that the result area always gives a view onto the “best available” (intermediate) project results at a given point in time. Using symbolic links instead of file copies avoids accidental modification of data in the result data area; modifications are expected to happen in the data processing area only (a sketch for creating such links follows the example below).

Often, result data sets are the result of temporal aggregation. They are consequently smaller in size than raw data sets. There will also be fewer result data sets than there are data sets representing different stages of data processing. For these reasons, the result data space is expected to require much less disk space than the spaces dedicated to raw data and data in processing.

The structure in the result data area should represent the project structure. It could, for example, be organised by work package. When organised by work package, the folder names should not only contain the work package number but also indicate the name of the work package.

//server/projects

  test-project/
    Data-Work Packages
      wp-1_monitoring
      wp-2_modelling
        summer.lnk # symbolic links to last version 
        winter.lnk # in data processing
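
How such links are created depends on the operating system and on the file server; note that Windows .lnk files are shortcuts rather than true symbolic links, and creating links on network shares may require special permissions. As an illustration only, a minimal sketch in R using base file.symlink(), with the folder names taken from the examples above:

# Link the current best model results from the processing area into the
# result data area (paths taken from the examples; adapt as needed)
file.symlink(
  from = "//server/processing/test-project/02_modelling/VERSIONS/v1.0/summer",
  to = "//server/projects/test-project/Data-Work Packages/wp-2_modelling/summer"
)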

Clean Datasets

In a project-driven research institute, almost all data processing steps are closely related to their specific research project. One project may, for example, require preparing rain data to be fed into rainfall-runoff and sewer simulation software. Unprocessed (raw) rain data are received from a rain gauge station and cleaned. The clean rain data are then converted to the format required by the sewer simulation software.

In this example, the clean rain data are a data processing output. They are also the input to further processing and thus the source of even more valuable results.

The specific rain data file that is input to the sewer modelling software is the final (rain data) result. However, the clean rain data that are an intermediate result in the context of one project are themselves already valuable results. They can be used in other projects that require clean rain data for other purposes, e.g. for looking at climate change effects.

We recommend storing clean datasets in their own space. In this space the datasets are organised by topic, not by project:

//server/treasure
  rain/
  flow/
  level/

This increases the visibility of existing clean datasets and reduces the risk that work that has already been done in one project is done again in another project. Often, people start again from the raw data even though somebody has already cleaned those data.

Metadata

We recommend describing the meaning of subfolders in a file README.yaml located in the folder that contains the subfolders.

Example for such a README.yaml file:

rlib:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    R library for packages needed for R training at BWB.
    To use the packages from that folder, use 
    .libPaths(c(.libPaths(), "C:/_UserProgData/rlib"))

rlib_downloads:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    files downloaded by install.packages(). Each file represents a package
    that is installed in the rlib folder.

Restrictions/Conventions:

  • Each top-level folder should represent a project, i.e. should be defined in the top level file PROJECTS.txt.

  • Each possible owner should be defined in the top level file ORGANISATIONS.txt.

  • The naming convention for the organisations is the same as for projects.

2.2 Naming of Files and Folders

Concise and meaningful file and folder names are key to finding your data again. Whenever you have the freedom to name your data files and to structure your project folder, you should do so. Names should be concise and meaningful to you and your colleagues. A colleague who may not be familiar with the project should be able to guess the content of a folder or a file by intuition. Naming conventions are also necessary to avoid read errors during automatic data processing and to prevent errors when working on different operating systems.

Please comply with the following rules:

Rule A: Allowed Characters

The following characters are allowed in file or folder names:

  • upper case letters A-Z,

  • lower case letters a-z,

  • numbers 0-9,

  • underscore _,

  • hyphen -,

  • dot .

If you want to know why some characters are not allowed, please check the FAQ:

Instead of German umlauts and the sharp s (ä, ö, ü, Ä, Ö, Ü, ß) use the following substitutions: ae, oe, ue, Ae, Oe, Ue, ss.
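
Whether a name sticks to this character set can be checked automatically, for example with a regular expression. A minimal sketch in R (the function name and the example names are only illustrations):

# TRUE if a file or folder name contains only the allowed characters
is_valid_name <- function(x) grepl("^[A-Za-z0-9_.-]+$", x)

is_valid_name("rain_2018-07-02.csv")   # TRUE
is_valid_name("Regenhoehe (mm).xlsx")  # FALSE: contains spaces and parentheses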

Rule B: Separation of Words or Parts of Words

Please use underscore _ or hyphen - instead of spaces. Use underscore _ to separate words that carry different types of information:

  • results_today instead of results-today

  • protocol_hauke instead of protocol-hauke

Use hyphen - instead of underscore _ to visually separate the parts of compound words or names:

  • site-1 instead of site_1,

  • dissolved-oxygen instead of dissolved_oxygen,

  • clean-data instead of clean_data.

Use hyphen - (or no separation at all) in dates (i.e. 2018-07-02 or 20180702).

Using hyphens instead of underscores within compound words ensures that these words stay intact when a file or folder name is split at underscores.

For example, splitting the name project-report_example-project-1_v1.0_2018-07-02 at underscores results in the following words, each giving a different type of information on the file or folder (see the sketch after the list):

  • project-report (type of document),
  • example-project-1 (name of related project),
  • v1.0 (version number),
  • 2018-07-02 (version date).
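
In R, such a name can be split at the underscores with strsplit(); a minimal sketch using the example name from above:

# Split the name at underscores to recover the different types of information
parts <- strsplit("project-report_example-project-1_v1.0_2018-07-02", "_")[[1]]
parts
# [1] "project-report"    "example-project-1" "v1.0"              "2018-07-02"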

Rule C: Capitalisation

From a pure data management point of view, it would be best not to use upper case letters in file or folder names at all. This would avoid possible conflicts when exchanging files between operating systems that are case-sensitive with respect to file names (e.g. Unix systems) and those that are not (e.g. Windows systems).

If upper case letters are allowed, it should be decided if and when to use them. With such a rule in place, only one of the following spellings would be allowed, for example:

  • dissolved-oxygen (all lower case),

  • dissolved-Oxygen (attributes lower case, nouns upper case),

  • Dissolved-oxygen (first letter upper case),

  • Dissolved-Oxygen (all parts of compound words upper case).

Rule D: Avoid Long Names

At least on Windows operating systems, very long file paths can cause trouble. When copying or moving a file to a target path that exceeds a length of 260 characters, an error will occur. This is particularly unfortunate when copying or moving many files at once and the process stops before completion. As the length of a file path mainly depends on the lengths of its components, we suggest restricting

  • folder names to no more than 20 characters and

  • file names to no more than 50 characters.

This would allow a file path to contain at most nine subfolder names. The maximum number of subfolders, i.e. the maximum folder depth, should be kept small by following the best practices in Folder Structure. If folder or file names are generated by software (e.g. logger software, modelling software or a reference manager), please check whether the software allows the naming scheme to be modified. If we nevertheless have to deal with deeply nested folder structures and/or very long file or folder names, we should store them in a flat folder hierarchy (a path-length check is sketched below), i.e. not in

\\server\projekte$\department-name\projects\project-name\
  data-work-packages\work-package-one-meaning-the-following\modelling\
  scenario-one-meaning-the-following\results.
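
To find paths that approach the 260-character limit, the path lengths can be checked with a few lines of R. A minimal sketch, assuming a hypothetical project folder:

# List all files and folders below a project folder and report those whose
# full path comes close to the Windows limit of 260 characters
paths <- list.files("//server/processing/test-project", recursive = TRUE,
                    full.names = TRUE, include.dirs = TRUE)
long_paths <- paths[nchar(paths) > 200]
data.frame(n_characters = nchar(long_paths), path = long_paths)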

Rule E: Formatting of Dates and Numbers

When adding date information to file names, please use one of these formats:

  • yyyy-mm-dd (e.g. 2018-06-28)

  • yyyymmdd (e.g. 20180628)

By doing so, file or folder names that differ only in the date will be displayed in chronological order. Using the first form improves the visual distinction of the year, month and day parts of the date. Using hyphens instead of underscores keeps these parts together when splitting the name at underscores (see above).

When using numbers in file or folder names to bring them into a certain order, use leading zeros as required to give all numbers used at one folder level the same length. Otherwise they will not be displayed in the correct numerical order in your file browser (see the sketch below).

Example:

  • 01, 02, 03, etc. if there are 10 to 99 files/folders or

  • 001, 002, 003, etc. if there are 100 to 999 files/folders.
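
Dates and numbers in the recommended formats can be generated automatically, e.g. in R with format() and sprintf(). A minimal sketch (the file name pattern is only an illustration):

format(Sys.Date(), "%Y-%m-%d")      # e.g. "2018-06-28"
format(Sys.Date(), "%Y%m%d")        # e.g. "20180628"
sprintf("%02d", c(1, 2, 10, 99))    # "01" "02" "10" "99"
sprintf("scenario_%03d.csv", 7)     # "scenario_007.csv"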

Rule F: Allowed Words

We recommend defining sets of allowed words in so-called vocabularies. Only words from the vocabularies are then expected to appear in file or folder names. Getting accustomed to the words from the vocabularies and their meanings allows for more precise file searching. This is most important for clearly indicating that a file or folder relates to “special objects”, such as projects, organisations or monitoring sites. At least for projects and organisations we want to define vocabularies in which “official” acronyms are defined for all projects and all organisations from which we expect to receive data (see the chapter on acronyms). Always using the acronyms defined in the vocabularies allows searching for files or folders belonging to one specific project or provided by one specific organisation.

We could also define vocabularies of words describing other properties of a file or folder. We could e.g. decide to always use clean-data instead of data-clean, cleaned-data, data-cleaning, Datenbereinigung, bereinigte-daten, and so on.

Rule G: Order of Words

We could go one step further and define the order in which we expect the words to appear in a file or folder name. Which types of information should go first in the file name? The order of words determines how files are grouped visually when listed by name. If the acronym of the organisation goes first, files are grouped by organisation. If the acronym of the monitoring site goes first, files are grouped by monitoring site. Corresponding rules cannot be set at a global level, i.e. for the whole company or even for a whole project. The requirements will differ depending on the type of information that is to be stored. We recommend defining naming conventions where appropriate and describing them in a metadata file in the folder below which the naming convention applies.

Rule H: Allowed Languages

Do not mix words from different languages within one and the same file or folder name. For example, use regen-ereignis or rain-event instead of regen-event or rain-ereignis.

Within one project, use either only English words or only German words in file or folder names. This restriction may be too strict. However, we should follow this rule at least for the top-level folder structures. It is not nice to see folders AUFTRAEGE (German) and GROUNDWATER (English) within the same parent folder.

2.3 Versioning

Versioning or version control is the way in which different versions and drafts of a document (or file, record, dataset or software code) are managed. Versioning involves naming and distinguishing between a series of drafts that lead to a final or approved version. Versioning “freezes” certain development steps and allows you to disclose an audit trail for the revision and update of drafts and final versions. It is essential for reproducing results that may be based on older data.

Manual

Manual versioning may cost more time and requires some discipline, but it ensures a clean and generally understandable file structure in the long term and provides a quick overview of the current status of development. Manual versioning does not require additional software (except a simple text editor) and is realised by following these simple guidelines:

  • A version is created by copying the current file and pasting it into a subfolder named VERSIONS

  • Each successive draft of a file in the VERSIONS folder is numbered sequentially, e.g. v0.1, v0.2, v0.3, as a suffix at the end of the file name (e.g. filename_v0.1, …v0.2, …v0.3, and so on)

  • Finalised forms (e.g. the presentation was held at a conference, the report was reviewed) receive a new major version number, e.g. v1.0, v2.0 and so on

  • Read-only is applied to each versioned file (to prevent accidental loss of final versions of files)

  • Only files without a version suffix are modified

  • A VERSIONS.txt file is created and kept up to date with a text editor, containing meta information on the purpose of each modification and the person who made it

It is noteworthy that “final” does not necessarily mean final forever. Final forms are subject to modification and it is sometimes questionable whether a final status has been reached. Therefore, it is more important to be able to track the modifications in VERSIONS.txt than to argue about version numbers.

Example:

BestPractices_Workshop.ppt
VERSIONS/
  + VERSIONS.txt
  + BestPractices_Workshop_v0.1.ppt
  + BestPractices_Workshop_v0.2.ppt
  + BestPractices_Workshop_v1.0.ppt

Content of file VERSIONS.txt:

BestPractices_Workshop.ppt
- v1.0: first final version, after review by NAME
- v0.2: after additions by NAME
- v0.1: first draft version, written by NAME
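
The copy, rename and write-protect steps described above can also be scripted. A minimal sketch in R, using the file and folder names from the example (the version number is set manually):

# Copy the working file into the VERSIONS subfolder, add the version suffix
# and write-protect the copy
file_name <- "BestPractices_Workshop.ppt"
dir.create("VERSIONS", showWarnings = FALSE)
version_file <- file.path("VERSIONS", sub("\\.ppt$", "_v1.0.ppt", file_name))
file.copy(file_name, version_file)
Sys.chmod(version_file, mode = "0444", use_umask = FALSE)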

Automatic

Automatic versioning is mandatory in the case of programming.

Versioning is done automatically when version control software such as Git or Subversion is used.

At KWB we currently use the following version control software:

  • Subversion: for internally storing program code (e.g. R scripts/packages) we have a Subversion server, which is accessible from the KWB intranet. However, this requires:

    • the installation of the client software TortoiseSVN and a

    • valid user account (for accessing the server) which is currently provided by the IT department on request

  • Git: for publishing program code (e.g. R packages) externally in our KWB organisation on GitHub. Currently all repositories are public (i.e. visible to everyone), but private repositories can also be used for free, as GitHub recognises KWB as a not-for-profit company, which comes with additional benefits at no cost

Use of version control software is required for programming (e.g. in R, Python, and so on) and can be useful for tracking changes in small text files (e.g. configuration files that run a specific R script with different parameters for scenario analysis).

Drawbacks:

  • Special software (TortoiseSVN), login data for each user on the KWB server and some basic training are required

  • In the case of collaborative coding: sticking to ‘best practices’ for using version control is mandatory, e.g.:

    • timely check-in of code changes to the central server,

    • speaking to each other: avoid two people working on the same program code in one script at the same time, as this leads to conflicts that need to be resolved manually, which can be quite time-consuming. You are much better off avoiding this up front by talking to each other

Advantages:

  • Only one file name per script (file history and code changes are managed either internally on a KWB server when using TortoiseSVN, or externally for code hosted on GitHub)

  • Old versions of scripts can be restored easily

  • Commit messages (i.e. comments added at the time of transferring the code from the local computer to the central version control system, explaining why code changes were made) and built-in diff tools for tracking changes improve reproducibility

Attention: version control software is not designed for versioning raw data and should therefore not be used for that purpose. General thoughts on the topic of ‘data versioning’ are available here: https://github.com/leeper/data-versioning

A presentation with different tools for version control is available here: https://www.fosteropenscience.eu/node/597

2.4 Metadata

Metadata are data about data. It is up to us to define what metadata to store and in what format.

We plan to specify the requirements in more detail when dealing with the test projects. Then, we will also check metadata standards.

Metadata about raw data should always be stored.

Metadata about processed data are also important. However, in case of automated processing with a script, it may be possible to deduce the content of a generated file from the content of the script.

2.4.1 Why Metadata?

Why should we store metadata about data? Metadata are required to

  • interpret raw data, e.g. “Why are the oxygen values so high? Ah, I see, someone was there to clean the sensor!”,

  • gain an overview about available data,

  • know what we are allowed to do with the data.

2.4.2 What Metadata to Store?

General

What information would someone need to find/re-use your data? E.g.

  • Location,

  • Title,

  • Creator name,

  • Description,

  • Date collected?

Metadata about Raw Data

What is the most important information about raw data that we receive?

  • Obtained from whom, when, via whom and which medium, e.g.

    • E-Mail from A to B on 2018-01-25 or

    • USB-Stick given personally from C to D on 2018-01-26

  • Restriction of usage, e.g.

    • only for project x or

    • only within KWB or

    • must not be published! or

    • should be published!!

  • Description of content and format

    • Where were the measurements taken?

    • What methods were used (to take samples, to analyse parameters in the laboratory)?

    • What devices were used?

    • What do the columns of the table (in the database, XLS/CSV file) mean?

    • In what units are the values given?

Metadata about Processed Data

What is the most important information about the data that we produce?

  • Who created the file? If the file was created by a script, what script created the file and who ran the script?

  • When was the file created?

  • What was the input data (e.g. raw data or preprocessed data)?

  • Which methods were applied to generate the output from the input?

  • What was the environment, what were boundary conditions, e.g.

    • versions of software,

    • versions of R packages?

Metadata about Programming Scripts

  • What does the script do?

  • Who wrote the script?

  • How to use the script? Give a short tutorial.

Regarding R programming, we should consider providing scripts in the form of R packages. The R packaging system defines a framework for answering all of the above questions. See how we did this (not yet always with a tutorial) with our packages on GitHub.

2.4.3 Where to Store Metadata?

The two main options of storing metadata are:

  1. together with the data or near to the data i.e. in the same folder in which the data file resides,

  2. in a central file or database.

Unless we have a professional solution (i.e. software for metadata management), we should prefer the first approach, which is simpler and more flexible than the second one.

2.4.4 In What Format to Store Metadata?

Metadata should be stored in a simple plain (i.e. not formatted) text format. This format can be read and written with any text editor on different operating systems and does not require any specific software. And most importantly: it is human-readable.

In the simplest form, metadata can be stored in a plain text file README.txt. As stated above, this file should reside in the same folder as the files that contain the data to be described.

Please keep in mind: it is better to write something than to write nothing. It does not have to be perfect. Just do the best you can at the moment you store the data. Try to write the metadata directly after storing the data; do not wait until you “feel in the mood” to do so. Anyone who later works or plans to work with the data will be grateful to find some information on it.

Better than writing an unstructured text file is to write a structured text file. The so-called YAML format is such a structured format. We want to use this standard, which seems to have established itself in the scientific world. The advantage of a structured format is that reading it can be automated. We aim at collecting all available metadata by automatically browsing for YAML files, reading their content and creating overviews of available files and data.

Once you have written a YAML file you can check the validity of the format with this online validator: https://yamlchecker.com/
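
Reading YAML files can be automated, e.g. with the R package yaml. A minimal sketch that collects all README.yaml files below a hypothetical project folder (assuming the yaml package is installed):

# Browse a project folder for README.yaml files and read their contents
library(yaml)

files <- list.files("//server/processing/test-project", pattern = "^README\\.yaml$",
                    recursive = TRUE, full.names = TRUE)
metadata <- lapply(files, read_yaml)
names(metadata) <- dirname(files)
str(metadata, max.level = 2)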

2.4.5 Metadata Management Tools

Tools for metadata tracking and data standards are:

  • metadata editor, e.g. online editor of GFZ Potsdam

2.4.6 Metadata Standards

We want to use a metadata standard.

Examples of metadata standards are (the institutions that publish their data using the standard are listed in brackets):

Best-practices roadmap:

  1. Check metadata standards, e.g. DataCite (see also: ZALF, GFZ Potsdam)

  2. Define minimum metadata requirements at KWB for raw and processed data.

The ‘best practices for metadata’ will be developed for the test projects that are assessed within the FAKIN project.

We propose to define some special files that contain metadata related to files and folders. To indicate that these files have a special meaning, the file names are all uppercase.

2.4.7 Special Metadata Files

As stated earlier, we want to use consistent, unique identifiers to indicate the belonging of data to a certain project or data owner. We propose to define the identifiers in terms of special metadata files.

2.4.7.1 Metadata File PROJECTS.txt and related files

The project identifiers are defined in a simple YAML file PROJECTS.yml in the //server/projects$ folder. Only the identifiers defined in this file are expected to appear as top-level folder names in the project folder structure within this network drive.

Possible content of PROJECTS.yml:

flusshygiene:
  department: suw
  short-name: Flusshygiene
  long-name: >
    Hygienically relevant microorganisms and pathogens in multifunctional water
    bodies and hydrologic circles - Sustainable management of different types of
    water bodies in Germany
  financing: 
    funder: bmbf
    sponsor: bwb
ogre:
  department: suw
  short-name: OGRE
  long-name: >
    Relevance of trace organic substances in stormwater runoff of Berlin
  financing:
    funder: uep-2 
    sponsor: Veolia
optiwells-2:
  department: grw
  short-name: OPTIWELLS 2
  long-name:
  type: sponsored
reliable-sewer:
  department: suw
  short-name: RELIABLE_SEWER
  long-name:
  type: contracted
smartplant:
  department: wwt
  short-name: Smartplant
  type: sponsored
...

In the file PROJECTS.yml each entry represents a project, identified by its acronym. The project acronyms appear in alphabetical order. Each entry should at least contain information on the department, a short/long name or title of the project and the type of project (funded, sponsored, contracted). Additional information, such as the year in which the project started, could be given.
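
This convention can be checked automatically, for example by comparing the top-level folder names of a network drive with the acronyms defined in PROJECTS.yml. A minimal sketch in R (paths as in the examples above; requires the yaml package):

# Read the project definitions and list top-level folders that are not
# defined as projects
library(yaml)

projects <- read_yaml("//server/projects$/PROJECTS.yml")
folders <- list.dirs("//server/rawdata", full.names = FALSE, recursive = FALSE)
setdiff(folders, names(projects))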

It could also be useful to define three letter codes for projects. These codes could e.g. be used in tables or diagrams in which different projects are compared and in which space may be limited.

The department acronyms could be defined in a file DEPARTMENTS.yml:

grw:
  short-name: Groundwater
  head: Hella Schwarzmüller
suw:
  short-name: Urban Systems
  head: Pascale Rouault 
wwt:
  short-name: Process Innovation
  head: Ulf Miehe

The funder acronyms could be defined in a file FUNDERS.yml:

bmbf: German Federal Ministry of Education and Research (BMBF)
uep-2: >
  Umweltentlastungsprogramm des Landes Berlin, co-financed by the European Union
  (UEP II)

2.4.7.2 Metadata File ORGANISATIONS.txt

It is very important to know the origin or owner of data; this is an essential piece of metadata. Therefore we define unique identifiers for the owners of the data that we use. The acronyms are defined in a special file ORGANISATIONS.txt. Possible content of this file:

bwb: Berliner Wasserbetriebe
kwb: Kompetenzzentrum Wasser Berlin
uba: Umweltbundesamt

2.5 Data Processing

Data are often inconsistent, incomplete, incorrect, or misspelled. Data cleaning is essential.

For data cleaning you may use a GUI (Graphical User Interface) based tool like OpenRefine http://openrefine.org/ or choose a programmatic approach.

In the following we describe how data can be imported into the R programming environment, which can be used for data cleaning, aggregation and visualisation (Grolemund and Wickham 2017).

2.5.1 Logger Devices

The R package kwb.logger (Sonnenberg 2018) helps to import raw data from loggers used in different KWB projects into the software R (R Core Team 2017), which is used for data processing (e.g. data cleaning, aggregation and visualisation).

For details on which loggers are currently supported by the R package kwb.logger, please check the documentation website.

2.5.2 Spreadsheets

General recommendations for working with Excel spreadsheets are given in the FAQ section.

2.5.2.1 Import Data From One Excel File

  • Save the original file in the rawdata zone.

2.5.2.2 Import Data From Many Excel Files

2.5.2.2.1 Files Are In the Same Format

Import Excel files of the same format by

  • defining a function that is able to read the data from that file
  • calling this function in a loop for each file to import (see the sketch below).
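
A minimal sketch in R, using the readxl package (the folder, file pattern and sheet layout are hypothetical and have to be adapted to the actual files):

# Read all Excel files with the same layout from one folder
library(readxl)

read_rain_file <- function(file) {
  # adapt sheet and number of rows to skip to the actual file format
  read_excel(file, sheet = 1, skip = 1)
}

files <- list.files("//server/rawdata/test-project/bwb/rain",
                    pattern = "\\.xlsx?$", full.names = TRUE)
rain_data <- lapply(files, read_rain_file)
names(rain_data) <- basename(files)
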
2.5.2.2.2 Files Are In Different Formats

We developed a general approach for importing data from many Excel files whose formats (e.g. more than one table area within one sheet, differing numbers of header rows) differ from file to file.

2.6 Data Publishing and Sharing


Figure 2.1: Modern life context for the ten simple rules (Boland, Karczewski, and Tatonetti 2017)

“This figure provides a framework for understanding how the “Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing” (Boland, Karczewski, and Tatonetti 2017) can be translated into easily understood modern life concepts.

Rule 1 is Open-Source Software. The openness is signified by a window to a room filled with algorithms that are represented by gears.

Rule 2 involves making the source data available whenever possible. Source data can be very useful for researchers. However, data are often housed in institutions and are not publicly accessible. These files are often stored externally; therefore, we depict this as a shed or storehouse of data, which, if possible, should be provided to research collaborators.

Rule 3 is to “use multiple platforms to share research products.” This increases the chances that other researchers will find and be able to utilize your research product—this is represented by multiple locations (i.e., shed and house).

Rule 4 involves the need to secure all necessary permissions a priori. Many datasets have data use agreements that restrict usage. These restrictions can sometimes prevent researchers from performing certain types of analyses or publishing in certain journals (e.g., journals that require all data to be openly accessible); therefore, we represent this rule as a key that can lock or unlock the door of your research.

Rule 5 discusses the privacy issues that surround source data. Researchers need to understand what they can and cannot do (i.e., the privacy rules) with their data. Privacy often requires allowing certain users to have access to sections of data while restricting access to other sections of data. Researchers need to understand what can and cannot be revealed about their data (i.e., when to open and close the curtains).

Rule 6 is to facilitate reproducibility whenever possible. Since communication is the forte of reproducibility, we depicted it as two researchers sharing a giant scroll, because data documentation is required and is often substantial.

Rule 7 is to “think global.” We conceptualize this as a cloud. This cloud allows the research property (i.e., the house and shed) to be accessed across large distances.

Rule 8 is to publicize your work. Think of it as “shouting from the rooftops.” Publicizing is critical for enabling other researchers to access your research product.

Rule 9 is to “stay realistic.” It is important for researchers to “stay grounded” and resist the urge to overstate the claims made by their research.

Rule 10 is to be engaged, and this is depicted as a person waving an “I heart research” sign. It is vitally important to stay engaged and enthusiastic about one’s research. This enables you to draw others to care about your research."

— (Boland, Karczewski, and Tatonetti 2017)

Recommended literature:

2.6.1 Repositories

Repositories for permanently depositing data are, for example:

Repositories for publishing program code are:

However, neither offers long-term data preservation by default, but using GitHub it is possible to make code citable by linking it with Zenodo (see: https://guides.github.com/activities/citable-code/).

We are currently using the following three repositories for publishing program code (mainly R packages):

Proposal: define a company-wide QMS policy (“top-down”) for publishing program code

The above workflow was established “bottom-up” (i.e. by Michael Rustler and Hauke Sonnenberg) with the idea in mind to make the code as open as possible (e.g. by choosing the permissive MIT license as the default for all of our public R packages).

However, up to now there is no company wide strategy (“top-down”) defined yet that would legitimate this “bottom-up” approach. This creates uncertainty (e.g. what can be published?), so that much more code than necessary is labelled as “private”. To reduce this uncertainty the following QMS policy is proposed, which should be discussed and agreed on in one of the next KWB management meetings:

  • Sponsored projects (e.g. funded by BMBF, EU): source code will be published by default at https://github.com/kwb-r in public repositories (i.e. accessible to everyone) under the permissive MIT license, provided that the source code does not:

    • contain security critical paths (e.g. to our company server) or

    • contain confidential data.

    Code should be developed in such a way that both of the criteria defined above (security critical paths, confidential data) are considered. Making the code openly available also decreases the burden of installing it (e.g. not every student needs to get an “access token” to install private repositories, as required for “contract” projects, see below).

  • Contract projects (e.g. funded by BWB, Veolia): source code will be published by default in private repositories at https://github.com/kwb-r, unless the funder pre-defines a specific repository. Access to the source code is thus restricted to the KWB researchers and students working in the contract project. Project partners and funders can access the source code only if they receive an “access token” from the KWB project team.

A blog post by Bosman and Kramer (2016) provides results of a large survey carried out in 2015 among more than 15,000 researchers. Insights can be gained on:

  • Which scholarly communication tools are used and

  • Are there disciplinary differences in usage?

They finally summarise: “Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU-projects and -institutions. The fact that Zenodo is a data-sharing platform that is available to anyone (thus not just for EU project data) might not be widely known yet.”

2.6.2 ORCID

Problem:

"Two large challenges that researchers face today are discovery and evaluation. We are overwhelmed by the volume of new research works, and traditional discovery tools are no longer sufficient. We are spending considerable amounts of time optimizing the impact—and discoverability—of our research work so as to support grant applications and promotions, and the traditional measures for this are not enough. — (Fenner and Haak 2014)

Solution:

"Open Researcher & Contributor ID (ORCID) is an international, interdisciplinary, open and not-for-profit organization created to solve the researcher name ambiguity problem for the benefit of all stakeholders. ORCIDwas built with the goal of becoming the universally accepted unique identifier for researchers:

  1. ORCID is a community-driven organization

  2. ORCID is not limited by discipline, institution, or geography

  3. ORCID is an inclusive and transparently governed not-for profit organization

  4. ORCID data and source code are available under recognized open licenses

  5. the ORCID iD is part of institutional, publisher, and funding agency infrastructures.

Furthermore, ORCID recognizes that existing researcher and identifier schemes serve specific communities, and is working to link with, rather than replace, existing infrastructures."

(Fenner and Haak 2014)

2.6.3 Licenses

“In most countries in the world, creative work is protected by copyright laws. International conventions, and primarily the Berne Convention of 1886, protect the copyright of creators even across international borders for 50 years after the death of the creator. This means that copying and using the creative work is limited by conditions set by the creator, or another copyright holder. For example, in many cases musical recordings may not be copied and further distributed without the permission of the musician, or of the production company that has acquired the copyright from the musician. Facts about the universe that are discovered through research are not subject to copyright, but the collection, aggregation, analysis and interpretation of research data may be considered creative work, and could be protected by copyright laws. Thus, the consumption of research publications is governed by copyright law. Furthermore, even data sharing is often governed by copyright laws, because the compilation of data to be shared often requires a creative effort. Another case of research-relevant copyrighted products is software that is developed in the course of research. In all of these cases, if license terms are not explicitly specified, the work is considered to be protected as ‘all rights reserved’. This means that no one but the creator of the work can use the work unencumbered. For software this means that copying and further distribution of the software is prohibited. Even running the software may be restricted. The exact selection of a license is beyond the scope of this section, but depends on your intentions and goals with regard to the software”

(Rokem and Chirigati 2018)

Recommended literature:

2.6.4 File Formats

“Scientific data is saved in a myriad of file formats. A typical file format might include a file header, describing the layout of the data on disk, metadata associated with the data, and the data itself, often stored in binary format. In some cases (e.g., CSV (or comma-separated value) files), data will be stored as text. The danger of proliferation of file formats in scientific data lies in the need to build and maintain separate software tools to read, write and process all these data formats. This makes interoperability between different practitioners more difficult, and limits the value of data sharing, because access to the data in the files remains limited.”

(Rokem and Chirigati 2018)

Table 2.1: Suitability of file formats for long-term preservation (Kaden and Kleineberg 2018)

           More than ten years        Up to ten years                         Not suitable
Text       PDF/A, TXT, ASC, XML       PDF, RTF, HTML, DOCX, PPTX, ODT, LATEX  DOC, PPT
Data       CSV                        XLSX, ODS                               XLS
Pictures   TIFF, PNG, JPG 2000, SVG   GIF, BMP, JPEG                          INDD, EPS
Audio      WAV                        MP3, MP4
Video      Motion JPG 2000, MOV       MP4                                     WMV

2.6.5 Data Exchange Standards

WaterML2:

“…is a new data exchange standard in Hydrology which can basically be used to exchange many kinds of hydro-meteorological observations and measurements. WaterML2 has been initiated and designed over a period of several years by a group of major national and international organizations from public and private sector, such as CSIRO, CUAHSI, USGS, BOM, NOAA, KISTERS and others. WaterML2 has been developed within the OGC Hydrology Domain Working group which has a mandate by the WMO, too.”

WaterML2

ODM2: an information model and supporting software ecosystem for feature-based earth observations.

References

Boland, Mary Regina, Konrad J. Karczewski, and Nicholas P. Tatonetti. 2017. “Ten Simple Rules to Enable Multi-Site Collaborations Through Data Sharing.” PLOS Computational Biology 13 (1). Public Library of Science (PLoS): e1005278. https://doi.org/10.1371/journal.pcbi.1005278.

Bosman, Jeroen, and Bianca Kramer. 2016. “GitHub and More: Sharing Data & Code.” Blog. Innovations in Scholarly Communication - Changing Research Workflows. https://101innovations.wordpress.com/2016/10/09/github-and-more-sharing-data-code/.

Fenner, Martin, and Laure Haak. 2014. “Unique Identifiers for Researchers.” In Opening Science: The Evolving Guide on How the Internet Is Changing Research, Collaboration and Scholarly Publishing, edited by Sönke Bartling and Sascha Friesike, 293–96. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-00026-8_21.

“Forschungslizenzen.” 2019. http://forschungslizenzen.de/.

Friesike, Sascha. 2014. “Creative Commons Licences.” In Opening Science: The Evolving Guide on How the Internet Is Changing Research, Collaboration and Scholarly Publishing, edited by Sönke Bartling and Sascha Friesike, 287–88. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-00026-8_19.

Grolemund, Garrett, and Hadley Wickham, eds. 2017. R for Data Science. 1st ed. Sebastopol, CA: O’Reilly Media. https://r4ds.had.co.nz.

Hart, Edmund M., Pauline Barmby, David LeBauer, François Michonneau, Sarah Mount, Patrick Mulrooney, Timothée Poisot, Kara H. Woo, Naupaka B. Zimmerman, and Jeffrey W. Hollister. 2016. “Ten Simple Rules for Digital Data Storage.” Edited by Scott Markel. PLOS Computational Biology 12 (10). Public Library of Science (PLoS): e1005097. https://doi.org/10.1371/journal.pcbi.1005097.

Kaden, Ben, and Michael Kleineberg. 2018. “Guidelines Zur Veröffentlichung Dissertationsbezogener Forschungsdaten.” https://doi.org/10.18452/18811.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rokem, Ariel, and Fernando Chirigati. 2018. “Glossary.” In The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences, edited by Justin Kitzes, Daniel Turek, and Fatma Deniz. Oakland, CA: University of California Press. https://www.practicereproducibleresearch.org/core-chapters/7-glossary.html.

Sonnenberg, Hauke. 2018. “Kwb.logger (V 0.2.0).” https://doi.org/10.5281/zenodo.1289425.

Stodden, Victoria. 2014. “Intellectual Property and Computational Science.” In Opening Science: The Evolving Guide on How the Internet Is Changing Research, Collaboration and Scholarly Publishing, edited by Sönke Bartling and Sascha Friesike, 225–35. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-00026-8_15.