This document is intended to be a “living” document. We highly appreciate any comments and suggestions for improvement. What are your experiences with research data management tasks? Can you provide solutions for specific tasks? We want your feedback to make the document better for you and your colleagues. To add an annotation, select some text and then use the pop-up menu. To see the annotations of others, use the control in the upper right-hand corner of the page.

Chapter 3 Test projects

The three following case studies are tested in detail within FAKIN (i.e. proposed best practices will be applied to these case studies and cross-checked to determine whether their application is useful).

3.1 Geogenic Salination

Two types of datasets are handled in the Geosalz project: spreadsheets of mostly hydrochemical laboratory analyses and archived paper files. Accordingly, the raw data and processing folders are divided into two parts, labor and archive.

Data processing

  • Archive data

    • Import into Endnote DB

    • Getting data values out of pictures

  • Lab data

    • Manual pre-processing: data cleaning required due to heterogeneous formats

    • Automated workflow: data import, data aggregation and export with R (a minimal sketch follows this list)
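
The automated part of this workflow can be sketched as follows in R. The folder names correspond to the processing structure shown below, while the column names (site, parameter, value) are illustrative assumptions rather than the actual Geosalz data model.

library(dplyr)
library(readr)

# 1. import all pre-cleaned csv files from the processing folder
files <- list.files("precleaned-data", pattern = "\\.csv$", full.names = TRUE)
lab_data <- bind_rows(
  lapply(setNames(files, basename(files)), read_csv, show_col_types = FALSE),
  .id = "source_file"
)

# 2. aggregate, e.g. median value per sampling site and parameter (assumed columns)
aggregated <- lab_data %>%
  group_by(site, parameter) %>%
  summarise(median_value = median(value, na.rm = TRUE), n = n(), .groups = "drop")

# 3. export the aggregated table to the cleaned-data folder
write_csv(aggregated, "cleaned-data/lab_data_aggregated.csv")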

Folder structure

//server/rawdata
  geosalz
    BWB_archiv
    BWB_labor 
    README.yml
    
    
//server/processing
  geosalz
    archiv
    labor 
      README.yml 
      documents/
      precleaned-data/
      cleaned-data/
      figures/
      <rawdata.lnk> (link to “//server/rawdata/geosalz/BWB_labor”)
      <results.lnk> (link to “//server/results/geosalz/report/”)
    
//server/results
  geosalz
    admin
    reports
      final_report.docx
      <processing.lnk> (link to “//server/processing/geosalz/labor/cleaned_data/v1.0”)
      README.yml 

Folder names indicate the owner of the data, here BWB. The README.yml gives information on the licensing of the data; in this case, the data is for restricted use only. The BWB_labor folder contains the following:

The METADATA.yml comprises information on the origin of each file. In this case, the data was received by email; thus each email is exported as a txt file (select - export as…) and copied to the METADATA.yml. The METADATA.yml also contains the email text itself, which may provide additional meta information. The METADATA.yml makes clear when and from whom the data was sent and who received it. A hyperlink can be inserted that links directly to the corresponding processing folder. No further subfolders are required.
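
A minimal sketch of how such a METADATA.yml entry could be written from R is given below; the file name, e-mail addresses, dates and field names are purely illustrative assumptions and do not represent a prescribed FAKIN schema.

library(yaml)

# illustrative provenance entry for one file received by email
metadata <- list(
  "2017-05-10_labordaten.xlsx" = list(
    received_from = "sender@bwb.example",
    received_by   = "recipient@kompetenz-wasser.example",
    received_on   = "2017-05-10",
    source        = "email (exported as txt, see email_2017-05-10.txt)",
    email_text    = "Dear colleagues, attached please find the lab data ...",
    processing    = "file://server/processing/geosalz/labor"
  )
)

write_yaml(metadata, "METADATA.yml")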

Workflow for creating the above folder structure

  1. Define project acronym geosalz (add in PROJECTS.yml)

  2. Create initial folder structure on //servername/rawdata/geosalz

  3. Create initial folder structure on //servername/processing/geosalz with one subfolder for each task/work package, i.e. archiv and labor

  4. Create a README.yml for each task describing the folder contents (see the sketch after this list)

  5. Link relevant results to //server/projekte/AUFTRAEGE/Rahmenvertrag GRW-WV/Data and documents/Versalzung
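
A minimal R sketch of steps 2 to 4 is given below; the server paths follow the folder structure shown above, and the README.yml fields are illustrative assumptions.

library(yaml)

project <- "geosalz"
tasks   <- c("archiv", "labor")

# step 2: initial raw data folder
dir.create(file.path("//server/rawdata", project), recursive = TRUE, showWarnings = FALSE)

# steps 3 and 4: one processing subfolder per task, each with a README.yml
for (task in tasks) {
  task_dir <- file.path("//server/processing", project, task)
  dir.create(task_dir, recursive = TRUE, showWarnings = FALSE)
  write_yaml(
    list(project = project,
         task = task,
         description = "describe the folder contents here",
         license = "restricted use only"),
    file.path(task_dir, "README.yml")
  )
}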

Lessons learnt:

  • Manual data preparation is indispensable for heterogeneous data sets

  • Naming conventions for large, heterogeneous data sets are not always in compliance with best practices

3.2 LCA Modelling

Challenge:

The LCA modelling software Umberto can produce large raw data output files (csv files > 300 MB) that are sometimes even too big for EXCEL 2010 (> 1 million rows) but need to be aggregated (e.g. grouped by specific criteria). This aggregation was usually performed manually within EXCEL in cases where the model output data was below EXCEL's 1 million row limit.

Workflow improvement developed within FAKIN:

The open source R package kwb.umberto was programmed to automate:

  • importing the Umberto model results,

  • aggregating the data according to the user's needs and finally

  • exporting the aggregated results to a results.xlsx EXCEL spreadsheet.

This results.xlsx EXCEL spreadsheet is referenced by another EXCEL spreadsheet, figures.xlsx, which contains the figure templates and simply links to results.xlsx in order to update the predefined figures.
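
The general import-aggregate-export pattern can be sketched as follows with standard R packages. This is not the actual kwb.umberto API; the file names and the column names (process, material, amount) are illustrative assumptions.

library(dplyr)
library(readr)
library(openxlsx)

# import the large Umberto csv export (too big for EXCEL 2010)
umberto_raw <- read_csv("umberto_export.csv", show_col_types = FALSE)

# aggregate, e.g. total amounts grouped by process and material (assumed columns)
umberto_agg <- umberto_raw %>%
  group_by(process, material) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE), .groups = "drop")

# export the aggregated results; figures.xlsx links to this file
write.xlsx(umberto_agg, "results.xlsx")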

This workflow replaces the time-consuming and error-prone data aggregation formerly performed manually in EXCEL, whilst still enabling users to adapt the figures to their needs without coding knowledge.

3.3 Pilot Plants

Challenges:

The output of (on-line) monitoring technologies is often difficult to interpret and also inconvenient to handle, as the output formats of different devices (in one water treatment scheme) can vary strongly. Furthermore, frequent reporting and documentation of the treatment performance via (on-line) monitoring can be time-consuming for the personnel and requires advanced software solutions. An alternative to commercial (and often expensive) software solutions is provided by tools based on the open source software R (R Core Team 2017). The free software approach allows any R programmer to produce customized tools for each individual end-user.

Thus, an automated reporting tool was developed within the AQUANES project to enable an integrative assessment of the different monitoring devices and their integration with water quality data obtained from laboratory analysis at four different pilot plant sites, in order to:

  • Increase the reliability and reproducibility of handling large amounts of data by reducing the likelihood of human error in complex systems and by increasing the transparency of the data processing.

  • Promote the use of customized R tools for different end-users such as utilities, consultants and other research teams.

Therefore, the open source R package aquanes.report (Rustler 2018) was programmed, which is able to:

  • import operational and lab data for each pilot site,

  • perform temporal aggregation (e.g. 5 min, 1 h, 24 h median values; a minimal sketch of this step follows the list),

  • visualise raw or aggregated data, either interactively in a web browser or

  • by creating a standardised report (e.g. monthly) in html, pdf or docx format.
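
The temporal aggregation step can be sketched as follows. This is not the actual aquanes.report implementation; the data frame operational_data with columns timestamp, parameter and value is an illustrative assumption.

library(dplyr)
library(lubridate)

# median values per time interval and parameter (assumed columns)
aggregate_median <- function(data, width = "1 hour") {
  data %>%
    mutate(interval = floor_date(timestamp, unit = width)) %>%
    group_by(interval, parameter) %>%
    summarise(median_value = median(value, na.rm = TRUE), .groups = "drop")
}

# e.g. 5-minute, hourly and daily median values from raw operational data
five_min <- aggregate_median(operational_data, "5 mins")
hourly   <- aggregate_median(operational_data, "1 hour")
daily    <- aggregate_median(operational_data, "1 day")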

For the four pilot plant sites, the data (operational and lab data) to be imported into the R tool came from various sources at different temporal resolutions, as detailed below:

  • Haridwar: operational data is stored by Autarcon in a MySQL database (temporal resolution: ~ 2-3 minutes, i.e. ~ 0.7 million data points per month), which is accessible from the web and thus could easily be imported into R. Lab data was initially provided by Autarcon in an unstructured format, which could not be imported into R automatically. However, after agreeing on a standardised EXCEL spreadsheet format (e.g. one spreadsheet per site, one sheet per parameter and additional sheets providing metadata for parameters and sites), it was possible to integrate the lab data into the R tool (a sketch of reading such a standardised file follows this list).

  • Basel Lange-Erlen: operational data is provided by the water supplier in EXCEL spreadsheets on a weekly basis for each site (i.e. “Rein” and “Wiese”) with a temporal resolution of 5 minutes (i.e. ~ 0.5 million data points per month). Lab data are provided by the water supplier in a single comma-separated csv file, which is exported from a database. Since the file is exported from a database, the structure of the lab data is standardised and could easily be imported into the R tool.

  • Berlin-Schönerlinde: operational data from the WEDECO pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes when it is updated by BWB, making automated import using R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.

  • Berlin-Tiefwerder: operational data from the PENTAIR pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes when it is updated by BWB, making automated import using R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.
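
A sketch of importing a lab-data spreadsheet in the agreed standardised format is given below; the file name and the metadata sheet names ("parameters" and "sites") are illustrative assumptions.

library(readxl)
library(dplyr)

file <- "haridwar_lab_data.xlsx"

# metadata sheets describing parameters and sites
parameters <- read_excel(file, sheet = "parameters")
sites      <- read_excel(file, sheet = "sites")

# read one sheet per parameter and combine into one long table
param_sheets <- setdiff(excel_sheets(file), c("parameters", "sites"))
lab_data <- bind_rows(
  lapply(setNames(param_sheets, param_sheets),
         function(s) read_excel(file, sheet = s)),
  .id = "parameter"
)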

The high temporal resolution (~ seconds) of the operational data for both Berlin pilot plants resulted in large data volumes (~ 10 million data points per month), which required a large effort to optimise the performance of the R reporting tool in order to enable the visualisation of the pilot plants' raw data for the test operation period (~ 18 months) on computers with limited RAM resources (~ 8-12 GB).
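
One possible approach to such an optimisation is sketched below, assuming the SCADA export is available as a csv file with columns timestamp, parameter and value (illustrative assumptions); data.table's fread() and grouped aggregation keep both import time and memory use low.

library(data.table)

# fast, memory-efficient import of a large SCADA export (assumed csv file)
operational <- fread("scada_export.csv")

# assumed columns: timestamp (ISO format), parameter, value
operational[, timestamp := as.POSIXct(timestamp, tz = "UTC")]
operational[, interval  := cut(timestamp, breaks = "1 hour")]

# median per hourly interval and parameter, computed by reference
aggregated <- operational[, .(median_value = median(value, na.rm = TRUE)),
                          by = .(interval, parameter)]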

The R tool is used regularly by KWB (for the sites Berlin-Schönerlinde and Berlin-Tiefwerder) for assessing the pilot plants' operational performance interactively. In addition, for advanced assessments, only the data import and aggregation routines are used and combined with R scripts developed by KWB students.

For the other two pilot plant sites, Haridwar and Basel Lange-Erlen, the AQUANES project partners use the automated R reporting tool in a similar way.

References

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rustler, Michael. 2018. “Aquanes.report (V.0.5.0).” https://doi.org/10.5281/zenodo.1244103.