Chapter 3 Test projects
The three following case-studies are tested in detail within FAKIN (i.e. proposed best-practices will be applied for this case studies and cross-checked whether their application is useful.
3.1 Geogenic Salination
Two types of datasets are handled in the Geosalz project, spread sheets of mostly
hydrochemical laboratory analysis and archived paper files. Accordingly, the raw data
and processing
folders are divided in two parts labor
and archive
.
Data processing
Archive data
Import into Endnote DB
Getting data values out of pictures
Lab data
Manual pre-processing: due to heterogeneous formats -> data-cleaning
Automated workflow: data-import, data aggregation and export with R
Folder structure
//server/rawdata
geosalz
BWB_archiv
BWB_labor
README.yml
//server/processing
geosalz
archiv
labor
README.yml
documents/
precleaned-data/
cleaned-data/
figures/
<rawdata.ink> (link to “//server/rawdata/geosalz/BWB_labor”)
<results.ink> (link to “//server/results/geosalz/report/”)
//server/results
geosalz
admin
reports
final_report.docx
<processing.ink> (link to “//server/processing/geosalz/labor/cleaned_data/v1.0”)
README.yml
Folder names indicate the owner of the data, here BWB. The README.yml
gives
information on licensing of the data. In this case for restricted use only.
The BWB_labor
folder contains:
The METADATA.yml
comprises information on the origin of each file. In this case
data was received by email, thus each email is exported as a txt file (select
-
export as…
) and copied to the METADATA.yml
. The METADATA.yml
also contains
the email text itself, which may also provide meta information. The METADATA.yml
makes clear when and from whom the data was send and who received it.
A hyperlink can be inserted that directly links to the corresponding processing
folder. No further subfolders are required.
Workflow for creating above folder structure
Define project acronym
geosalz
(add inPROJECTS.yml
)Create initial folder structure on
//servername/rawdata/geosalz
Create initial folder structure on
//servername/processing/geosalz
with one subfolder for each task/work package, i.e.Create a
README.yml
for each task describing the folder contentsLink relevant results to
//server/projekte/AUFTRAEGE/Rahmenvertrag GRW-WV/Data and documents/Versalzung
Lessons learnt:
Manual data preparation for heterogeneous data sets indispensable
Naming conventions for large heterogeneous data sets not always in compliance with best-practices
3.2 LCA Modelling
Challenge:
The LCA modelling software Umberto can produce large raw data output files (> 300 MB csv files) that sometimes are even to big for EXCEL 2010 (> 1 millions) but need to be aggregated (e.g. grouped by specific criteria). This was usually performed manually within EXCEL in case that model output data was below EXCEL`s 1 million row limit.
Workflow improvement developed within FAKIN:
An open source R package kwb.umberto was programmed for automating:
data import the Umberto model results,
performing data aggregating to the user needs and finally
exporting the aggregated results in an
results.xlsx
EXCEL spreadsheet.
This results.xlsx
EXCEL spreadsheet is referenced by another EXCEL
spreatsheet figues.xlsx
(which contains the figure templates and just links
to the results.xlsx
in order to update the predefined figures).
This workflow now reduces the time consuming and error-prone formerly manually performed data aggregation in EXCEL, whilst still enabling the users to adapt the figures to their needs without coding knowledge.
3.3 Pilot Plants
Challenges:
The output of (on-line) monitoring technologies is often difficult to interpret and also inconvenient to handle as the output formats of different devices (in one water treatment scheme) can vary strongly. Furthermore, frequent reporting and documentation of the treatment performance via (on-line) monitoring can be time consuming for the personnel and requires advanced software solutions. An alternative to commercial (and often expensive) software solutions are tools which are based on the open software R (R Core Team 2017). The free software approach allows any R programmer to produce customized tools for each individual end-user.
Thus an automated reporting tool is developed within the AQUANES project for enabling an integrative assessment of the different monitoring devices and integration with water quality data obtained from analysis in the laboratory for four different pilot plant sites in order to:
Increase the reliability and reproducibility of handling large amounts of data by reducing the likeliness in human error in complex systems and by increasing the transparency of the data processing.
Promote the use of customized R tools for different end-user such as utilities, consultants and other research teams.
Therefore the open source R package aquanes.report (Rustler 2018) was programmed, which is able to:
import operational and lab data for each pilot site,
performs temporal aggregation (e.g. 5 min, 1 h, 24h median values),
visualises raw or aggregated data either interactively in a web browser or by
creating a standardised report (e.g. monthly) in html, pdf or docx
For the four different pilot plant sites the data (operational and lab data) for being imported into the R tool came from various sources at different temporal resolutions, which are detailed below:
Haridwar: operational data stored by Autarcon in mySQL database (temporal resolution: ~ 2-3 minutes, i.e. ~ 0.7 million data points per month), which is accessible from the web and thus could be easily imported into R. Lab data was provided by Autarcon initially in a unstructured format, which was impossible to be automatically imported into R. However, after agreeing on a standardised EXCEL spreadsheet format (e.g. one spreadsheet per site, one sheet per parameter and additional sheets providing metadata for parameters and sites) it was possible to integrate the lab data into the R tool.
Basel Lange-Erlen: operational data is provided by the water supplier in EXCEL spreadsheets on a weekly basis for each site (i.e. “Rein” and “Wiese”) with a temporal resolution of 5 minutes (i.e. ~ 0.5 million data points per month). Lab data are provided by the water supplier in a single comma separated csv file, which is exported from a database. Thus the structure of the lab data was standardised and could be easily imported into the R tool.
Berlin-Schönerlinde: operational data from the WEDECO pilot plant are collected using a SCADA system (~ temporal resolution: seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes in case it is updated by BWB, making an automated importing using R impossible without adapting the import functions. Thus lab data were not integrated in the R tool for this site.
Berlin-Tiefwerder: operational data from the PENTAIR pilot plant are collected using a SCADA system (~ temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes in case it is updated by BWB, making an automated importing using R impossible without adapting the import functions. Thus lab data were not integrated in the R tool for this site.
The R tool is used by KWB (for the sites Berlin-Schönerlinde and Berlin-Tiefwerder) regulary for assessing the pilot plants` operational performance interactively. In addition for an advanced assessment only the data importing and aggregation routines and combined with R scripts developed by KWB students.
For the other two pilot plant sites Haridwar and Basel Lange-Erlen the AQUANES project partners use the automated R reporting tool in a similar way.
References
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rustler, Michael. 2018. “Aquanes.report (V.0.5.0).” https://doi.org/10.5281/zenodo.1244103.