Working with File Paths in R Scripts

In many cases, R scripts read input files and create output files. The files are specified by their paths on the file system. Each file path that is hard-coded in an R script represents a dependency of the script. The script will fail if files that it accesses are removed or moved on the file system.

One solution to this problem is never to change paths on the file server. This is not a good solution. Very often, existing directory structures were not well designed to begin with. It is good practice to review existing directory structures regularly and to optimise them so that they comply with current requirements.

So the aim should be to write the R script in such a way that changes in file paths can easily be applied to the script. We propose the following to achieve this:

Define the paths to all required directories in one place, preferably at the top of the script.
Hold all path definitions in a list of named elements.
Avoid duplication by replacing common parts of paths with placeholders.

Define example paths

Let’s start with an example. Assume that you have reviewed a script and that you have found the following hard-coded file paths:

paths <- c(
  "//very/long/path/to/the/projects/project-1/wp-1/input/file 1.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/analysis/summary.pdf",
  "//very/long/path/to/the/projects/project-1/wp 2/input/köpenick_dirty.csv",
  "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv",
  "//very/long/path/to/the/projects/project-2/Daten/file-1.csv",
  "//very/long/path/to/the/projects/project-2/Grafiken/file 1.png",
  "//very/long/path/to/the/projects/project-2/Berichte/bericht-1.doc",
  "//very/long/path/to/the/projects/project-2/Berichte/bericht-2.doc"
)

We propose to give a symbolic name to each of the different directories that are involved. These are:

unique(dirname(paths))
#> [1] "//very/long/path/to/the/projects/project-1/wp-1/input"   
#> [2] "//very/long/path/to/the/projects/project-1/wp-1/analysis"
#> [3] "//very/long/path/to/the/projects/project-1/wp 2/input"   
#> [4] "//very/long/path/to/the/projects/project-1/wp 2/output"  
#> [5] "//very/long/path/to/the/projects/project-2/Daten"        
#> [6] "//very/long/path/to/the/projects/project-2/Grafiken"     
#> [7] "//very/long/path/to/the/projects/project-2/Berichte"

Define directory paths in one place

Define these directory paths in a list (dirs_1 in the following) and give each directory a short but meaningful name:

dirs_1 <- list(
  p1.1_input = "//very/long/path/to/the/projects/project-1/wp-1/input",
  p1.1_analysis = "//very/long/path/to/the/projects/project-1/wp-1/analysis",
  p1.2_input = "//very/long/path/to/the/projects/project-1/wp 2/input",
  p1.2_output = "//very/long/path/to/the/projects/project-1/wp 2/output",
  p2_data = "//very/long/path/to/the/projects/project-2/Daten",
  p2_images = "//very/long/path/to/the/projects/project-2/Grafiken",
  p2_reports = "//very/long/path/to/the/projects/project-2/Berichte"
)

Whenever you use a file in the script, create the full file path by calling file.path(), as in the following example:

# Instead of:
# "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"

# Compose the full path from the directory path and the file name:
file.path(dirs_1$p1.2_output, "koepenick_clean.csv")
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"

We prefer to do it this way instead of defining the full file path in a list, e.g. files, and then accessing it with e.g. files$p1.2_output_koepenick_clean, for two reasons:

We would need to find a meaningful name for the list element, and we would most probably end up with a name that looks very similar to the file name itself.
If only the file name changes, we need to change it in only one place (where we call file.path()). Otherwise, we would need to change the list files as well.
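
The pattern described above can be sketched as follows. The directory names, the file names and the use of tempdir() as a stand-in for the real base directory are assumptions made only to keep the example runnable; they are not taken from the original script.

```r
# Hypothetical directory list; tempdir() stands in for the real base path
base <- file.path(tempdir(), "path-demo")

dirs <- list(
  input = file.path(base, "input"),
  output = file.path(base, "output")
)

# Create the directories so that the example can run anywhere
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Provide a small input file (stands in for a file that already exists)
write.csv(
  data.frame(id = 1:3, value = c(10, 20, 30)),
  file = file.path(dirs$input, "file-1.csv"),
  row.names = FALSE
)

# The file names appear only here, where the files are actually used
data <- read.csv(file.path(dirs$input, "file-1.csv"))
data$value <- data$value * 2
write.csv(
  data,
  file = file.path(dirs$output, "file-1_doubled.csv"),
  row.names = FALSE
)
```

If the input directory moves, only the entry dirs$input needs to be adapted. If a file is renamed, only the corresponding call to file.path() changes.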

Replace common parts of paths with placeholders

The list of directory paths contains a lot of duplication. This not only makes it difficult to read but also makes it harder to adapt the paths if the path to a common base directory changes. If, for example, the path

//very/long/path/to/the/projects

were moved somewhere else, we would need to change it seven times in the list. This would be tedious and error-prone, because a replacement could easily be missed. Therefore, we suggest defining common base paths as separate entries in the list and using placeholders of the form <name> in the paths where they occur. This could look as follows:

dirs_2 <- list(
  projects = "//very/long/path/to/the/projects",
  p1.1_input = "<projects>/project-1/wp-1/input",
  p1.1_analysis = "<projects>/project-1/wp-1/analysis",
  p1.2_input = "<projects>/project-1/wp 2/input",
  p1.2_output = "<projects>/project-1/wp 2/output",
  p2_data = "<projects>/project-2/Daten",
  p2_images = "<projects>/project-2/Grafiken",
  p2_reports = "<projects>/project-2/Berichte"
)

We could go one step further and define list entries for each project (p1 and p2 in the following) as well:

dirs_3 <- list(
  projects = "//very/long/path/to/the/projects",
  p1 = "<projects>/project-1",
  p2 = "<projects>/project-2",
  p1.1_input = "<p1>/wp-1/input",
  p1.1_analysis = "<p1>/wp-1/analysis",
  p1.2_input = "<p1>/wp 2/input",
  p1.2_output = "<p1>/wp 2/output",
  p2_data = "<p2>/Daten",
  p2_images = "<p2>/Grafiken",
  p2_reports = "<p2>/Berichte"
)

Finally, we could even define entries for each directory representing a work package (wp) of project 1 (p1.1 and p1.2 in the following):

dirs_4 <- list(
  projects = "//very/long/path/to/the/projects",
  p1 = "<projects>/project-1",
  p2 = "<projects>/project-2",
  p1.1 = "<p1>/wp-1",
  p1.2 = "<p1>/wp 2",
  p1.1_input = "<p1.1>/input",
  p1.1_analysis = "<p1.1>/analysis",
  p1.2_input = "<p1.2>/input",
  p1.2_output = "<p1.2>/output",
  p2_data = "<p2>/Daten",
  p2_images = "<p2>/Grafiken",
  p2_reports = "<p2>/Berichte"
)

The above list dirs_4 may look a bit overcomplicated, but it represents a path definition that is free of redundancies. This has the advantage that, if any directory path changes, only one entry of the list needs to be changed. Also, this condensed form of the path definition reveals the directory structure better than the very first version, dirs_1, that we defined above.

Of course, we cannot use this version containing <placeholders> directly in calls to file.path() as we did with dirs_1. We need a way to restore the full paths, as defined in dirs_1, from the condensed path definitions in dirs_4.

This is what the function resolve() from our package kwb.utils does. You may wrap the list definition directly in a call to this function, as in the following:

dirs_5 <- kwb.utils::resolve(list(
  projects = "//very/long/path/to/the/projects",
  p1 = "<projects>/project-1",
  p2 = "<projects>/project-2",
  p1.1 = "<p1>/wp-1",
  p1.2 = "<p1>/wp 2",
  p1.1_input = "<p1.1>/input",
  p1.1_analysis = "<p1.1>/analysis",
  p1.2_input = "<p1.2>/input",
  p1.2_output = "<p1.2>/output",
  p2_data = "<p2>/Daten",
  p2_images = "<p2>/Grafiken",
  p2_reports = "<p2>/Berichte"
))

The resolve() function recursively replaces all placeholders with their corresponding values. As a result, the list now contains the full paths, just as they were defined in the first version dirs_1 of the list. The following comparison confirms this:

identical(dirs_5[names(dirs_1)], dirs_1)
#> [1] TRUE
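
To illustrate the idea of recursive placeholder replacement, here is a minimal sketch in base R. The function resolve_sketch() and its argument max_passes are hypothetical names introduced for this illustration. This is not the implementation used in kwb.utils, and it assumes a flat list of named character strings.

```r
# Hypothetical helper, NOT the implementation of kwb.utils::resolve().
# Repeatedly replaces "<name>" with the value of the list element "name"
# until no entry changes any more.
resolve_sketch <- function(x, max_passes = 20L) {
  for (pass in seq_len(max_passes)) {
    before <- x
    for (name in names(x)) {
      placeholder <- paste0("<", name, ">")
      # fixed = TRUE: match the placeholder literally, not as a regular expression
      x <- lapply(x, function(value) {
        gsub(placeholder, x[[name]], value, fixed = TRUE)
      })
    }
    # Nothing changed in this pass: all resolvable placeholders are resolved
    if (identical(x, before)) {
      return(x)
    }
  }
  stop("Placeholders could not be resolved (circular definition?)")
}

dirs <- resolve_sketch(list(
  projects = "//very/long/path/to/the/projects",
  p1 = "<projects>/project-1",
  p1.2 = "<p1>/wp 2",
  p1.2_output = "<p1.2>/output"
))

file.path(dirs$p1.2_output, "koepenick_clean.csv")
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"
```

The loop stops as soon as a full pass leaves the list unchanged. The limit on the number of passes is a simple guard against entries that refer to themselves.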

Hauke Sonnenberg

2020-01-10
