vignettes/introduction.Rmd
R scripts often read input files and create output files. The files are specified by their paths on the file system. Each file path that is hard-coded in an R script represents a dependency of the script: the script will fail if a file that it accesses is removed or moved on the file system.
One solution to this problem is never to change paths on the file server. This is not a good solution. Very often, directory structures are not well designed from the start. It is good practice to revise existing directory structures regularly and to optimise them so that they comply with current requirements.
So the aim should be to write the R script in such a way that changes in file paths can easily be applied to the script. We propose the following to achieve this:
Let’s start with an example. Assume that you have reviewed a script and found the following hard-coded file paths:
paths <- c(
"//very/long/path/to/the/projects/project-1/wp-1/input/file 1.csv",
"//very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv",
"//very/long/path/to/the/projects/project-1/wp-1/analysis/summary.pdf",
"//very/long/path/to/the/projects/project-1/wp 2/input/köpenick_dirty.csv",
"//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv",
"//very/long/path/to/the/projects/project-2/Daten/file-1.csv",
"//very/long/path/to/the/projects/project-2/Grafiken/file 1.png",
"//very/long/path/to/the/projects/project-2/Berichte/bericht-1.doc",
"//very/long/path/to/the/projects/project-2/Berichte/bericht-2.doc"
)
We propose to give a symbolic name to each of the different directories that are involved. These are:
unique(dirname(paths))
#> [1] "//very/long/path/to/the/projects/project-1/wp-1/input"
#> [2] "//very/long/path/to/the/projects/project-1/wp-1/analysis"
#> [3] "//very/long/path/to/the/projects/project-1/wp 2/input"
#> [4] "//very/long/path/to/the/projects/project-1/wp 2/output"
#> [5] "//very/long/path/to/the/projects/project-2/Daten"
#> [6] "//very/long/path/to/the/projects/project-2/Grafiken"
#> [7] "//very/long/path/to/the/projects/project-2/Berichte"
Define these directory paths in a list (dirs_1 in the following) and give each directory a short but meaningful name:
dirs_1 <- list(
p1.1_input = "//very/long/path/to/the/projects/project-1/wp-1/input",
p1.1_analysis = "//very/long/path/to/the/projects/project-1/wp-1/analysis",
p1.2_input = "//very/long/path/to/the/projects/project-1/wp 2/input",
p1.2_output = "//very/long/path/to/the/projects/project-1/wp 2/output",
p2_data = "//very/long/path/to/the/projects/project-2/Daten",
p2_images = "//very/long/path/to/the/projects/project-2/Grafiken",
p2_reports = "//very/long/path/to/the/projects/project-2/Berichte"
)
Whenever you use a file in the script, create the full file path by calling file.path(), as in the following example:
# Instead of:
# "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"
# Compose the full path from the directory path and the file name:
file.path(dirs_1$p1.2_output, "koepenick_clean.csv")
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"
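A minimal sketch of how such a composed path might be used when writing and reading a file. The list dirs_tmp, the temporary directory and the small data frame are made up for illustration, so that the example runs without access to the real file server:

```r
# Hypothetical stand-in for the real directory (assumption: any writable
# directory will do, so a folder below tempdir() is used here)
dirs_tmp <- list(p1.2_output = file.path(tempdir(), "wp 2", "output"))
dir.create(dirs_tmp$p1.2_output, recursive = TRUE, showWarnings = FALSE)

# Compose the full path where the file is used; the file name stays visible
file_clean <- file.path(dirs_tmp$p1.2_output, "koepenick_clean.csv")

# Write a small, made-up data frame and read it back
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.7)), file_clean,
          row.names = FALSE)
result <- read.csv(file_clean)
```

If the directory moves, only the entry in the list has to change. The calls to write.csv() and read.csv() stay as they are.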
We prefer to do it this way instead of defining the full file paths in a list, e.g. files, and then accessing them with e.g. files$p1.2_output_koepenick_clean. One reason is that a file name that changes then needs to be changed in one place only (in the call to file.path). Otherwise we would need to change the list files as well.
The list of directory paths contains a lot of duplication. This not only makes it difficult to read but also makes it harder to adapt the paths if the path to a common base directory changes. If, for example, the directory //very/long/path/to/the/projects were moved somewhere else, we would need to change it seven times in the list. This would be tedious and error-prone, because a replacement is easily missed. Therefore, we suggest defining common base paths as entries of their own in the list and using placeholders of the form <name> in the paths where they occur. This could look as follows:
dirs_2 <- list(
projects = "//very/long/path/to/the/projects",
p1.1_input = "<projects>/project-1/wp-1/input",
p1.1_analysis = "<projects>/project-1/wp-1/analysis",
p1.2_input = "<projects>/project-1/wp 2/input",
p1.2_output = "<projects>/project-1/wp 2/output",
p2_data = "<projects>/project-2/Daten",
p2_images = "<projects>/project-2/Grafiken",
p2_reports = "<projects>/project-2/Berichte"
)
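To illustrate what a placeholder stands for, the following sketch substitutes it by hand with base R's gsub(). The reduced list dirs_part is only an excerpt of the list above, used here as an assumption to keep the example short:

```r
# Excerpt of the list above (two entries are enough for the illustration)
dirs_part <- list(
  projects = "//very/long/path/to/the/projects",
  p1.2_output = "<projects>/project-1/wp 2/output"
)

# Replace the placeholder literally (fixed = TRUE: no regular expression)
full_path <- gsub("<projects>", dirs_part$projects, dirs_part$p1.2_output,
                  fixed = TRUE)
full_path
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output"
```

Doing this by hand for every entry would be tedious, but it shows the idea: the base path is stored once and inserted wherever its placeholder occurs.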
We could go one step further and define list entries for each project (p1 and p2 in the following) as well:
dirs_3 <- list(
projects = "//very/long/path/to/the/projects",
p1 = "<projects>/project-1",
p2 = "<projects>/project-2",
p1.1_input = "<p1>/wp-1/input",
p1.1_analysis = "<p1>/wp-1/analysis",
p1.2_input = "<p1>/wp 2/input",
p1.2_output = "<p1>/wp 2/output",
p2_data = "<p2>/Daten",
p2_images = "<p2>/Grafiken",
p2_reports = "<p2>/Berichte"
)
Finally, we could even define entries for each directory representing a work package (wp) of project 1 (p1.1 and p1.2 in the following):
dirs_4 <- list(
projects = "//very/long/path/to/the/projects",
p1 = "<projects>/project-1",
p2 = "<projects>/project-2",
p1.1 = "<p1>/wp-1",
p1.2 = "<p1>/wp 2",
p1.1_input = "<p1.1>/input",
p1.1_analysis = "<p1.1>/analysis",
p1.2_input = "<p1.2>/input",
p1.2_output = "<p1.2>/output",
p2_data = "<p2>/Daten",
p2_images = "<p2>/Grafiken",
p2_reports = "<p2>/Berichte"
)
The above list dirs_4 may look a bit overcomplicated, but it represents a path definition that is free of redundancies. This has the advantage that, if any directory path changes, only one entry of the list needs to be changed. Also, this condensed form of the path definition reveals the directory structure better than the very first version dirs_1 that we defined above.
Of course, we cannot use this version containing <placeholders> directly in calls to file.path as we did with dirs_1. We need a means to create the full paths as defined in dirs_1 back from the condensed path definitions in dirs_4.
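To make the idea concrete, here is a rough base R sketch of how such a replacement could work. The function resolve_sketch() and its argument max_passes are hypothetical names chosen for this illustration; the sketch is not the implementation of any package function:

```r
# Hypothetical helper: replace <name> placeholders by the list entries of
# that name, pass by pass, until a pass changes nothing any more
resolve_sketch <- function(dirs, max_passes = 10L) {
  for (pass in seq_len(max_passes)) {
    before <- dirs
    for (name in names(dirs)) {
      placeholder <- paste0("<", name, ">")
      value <- dirs[[name]]
      # fixed = TRUE: match literally, so "." in "p1.1" is not a wildcard
      dirs <- lapply(dirs, function(path) {
        gsub(placeholder, value, path, fixed = TRUE)
      })
    }
    # Nothing changed in this pass: all known placeholders are replaced
    if (identical(dirs, before)) {
      return(dirs)
    }
  }
  # Reached e.g. if an entry refers to itself
  stop("Placeholders could not be resolved within ", max_passes, " passes")
}

# Excerpt of the condensed definition, used as example input
dirs_short <- list(
  projects = "//very/long/path/to/the/projects",
  p1 = "<projects>/project-1",
  p1.2 = "<p1>/wp 2",
  p1.2_output = "<p1.2>/output"
)

resolved <- resolve_sketch(dirs_short)
resolved$p1.2_output
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output"
```

Repeating the passes until nothing changes means that the order of the list entries does not matter. Placeholders without a matching entry are simply left as they are in this sketch.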
This is what the function resolve() from our package kwb.utils does. You may use this function directly around the list assignment as in the following:
dirs_5 <- kwb.utils::resolve(list(
projects = "//very/long/path/to/the/projects",
p1 = "<projects>/project-1",
p2 = "<projects>/project-2",
p1.1 = "<p1>/wp-1",
p1.2 = "<p1>/wp 2",
p1.1_input = "<p1.1>/input",
p1.1_analysis = "<p1.1>/analysis",
p1.2_input = "<p1.2>/input",
p1.2_output = "<p1.2>/output",
p2_data = "<p2>/Daten",
p2_images = "<p2>/Grafiken",
p2_reports = "<p2>/Berichte"
))
The resolve() function recursively replaces all placeholders with their corresponding values. As a result, the list now contains the full paths, just as they were defined in the first version dirs_1 of the list. We verify this with the following comparison: