Compression of Paths

See the vignette Working with File Paths in R Scripts for an introduction to how we suggest handling file paths in R scripts. In this vignette I recapitulate what the functions that deal with path dictionaries do. I extracted these functions from the package kwb.fakin into this package.

Provide Example File Paths

We start by loading some example paths:

# Read and print example (file) paths
writeLines(paths <- kwb.pathdict:::example_paths())
#> //very/long/path/to/the/projects/project-1/wp-1/input/file 1.csv
#> //very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv
#> //very/long/path/to/the/projects/project-1/wp-1/analysis/summary.pdf
#> //very/long/path/to/the/projects/project-1/wp 2/input/köpenick_dirty.csv
#> //very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv
#> //very/long/path/to/the/projects/project-2/Daten/file-1.csv
#> //very/long/path/to/the/projects/project-2/Grafiken/file 1.png
#> //very/long/path/to/the/projects/project-2/Berichte/bericht-1.doc
#> //very/long/path/to/the/projects/project-2/Berichte/bericht-2.doc

# Manually define some (folder) paths
directories <- c(
  "short/path/to/directory", 
  "short/path/to/directory", 
  "short/path/to/directory", 
  "longer/path/to/another/directory", 
  "longer/path/to/another/directory"
)

Compression of Folder Paths

Assume you have a folder with a long path (either due to a lot of levels or due to long folder names). This long path is repeated for each file in the folder. The package kwb.pathdict provides a compress() function that may be used to introduce short names for directory paths.

short_dirs <- kwb.pathdict:::compress(directories)

as.character(short_dirs)
#> [1] "<a1>" "<a1>" "<a1>" "<a2>" "<a2>"

attributes(kwb.pathdict:::compress_one_by_one(paths, n = 3))
#> Splitting paths ... ok. (0.00s) 
#> Elapsed: 0.0 s
#> Splitting paths ... ok. (0.00s) 
#> Elapsed: 0.0 s
#> Splitting paths ... ok. (0.00s) 
#> Elapsed: 0.0 s
#> NULL

Functions compress_paths() and sorted_importance()

The function compress_paths() takes a vector of full file paths as input. The paths are “compressed” by replacing all directory paths with placeholders <a1>, <a2>, <a3>, and so on, as shown in the following:

# Compress the paths
paths_1 <- kwb.pathdict:::compress_paths(paths)

# Show the compressed paths, the dictionary and the total size
kwb.pathdict:::print_compressed(paths_1)
#> Compressed paths:
#> <a1>/file 1.csv
#> <a1>/file-2.csv
#> <a3>/summary.pdf
#> <a5>/köpenick_dirty.csv
#> <a4>/koepenick_clean.csv
#> <a7>/file-1.csv
#> <a6>/file 1.png
#> <a2>/bericht-1.doc
#> <a2>/bericht-2.doc
#> 
#> Dictionary:
#> a1 = //very/long/path/to/the/projects/project-1/wp-1/input
#> a2 = //very/long/path/to/the/projects/project-2/Berichte
#> a3 = //very/long/path/to/the/projects/project-1/wp-1/analysis
#> a4 = //very/long/path/to/the/projects/project-1/wp 2/output
#> a5 = //very/long/path/to/the/projects/project-1/wp 2/input
#> a6 = //very/long/path/to/the/projects/project-2/Grafiken
#> a7 = //very/long/path/to/the/projects/project-2/Daten
#> 
#> Size:  2848 bytes

Equal directories are replaced by equal placeholders. The placeholders and their “values” are stored in a so-called dictionary. The dictionary can be seen as a lookup table that maps short names to long names. The dictionary is returned in the attribute dict of what compress_paths() returns:

(dictionary_1 <- kwb.utils::getAttribute(paths_1, "dict"))
#> $a1
#> [1] "//very/long/path/to/the/projects/project-1/wp-1/input"
#> 
#> $a2
#> [1] "//very/long/path/to/the/projects/project-2/Berichte"
#> 
#> $a3
#> [1] "//very/long/path/to/the/projects/project-1/wp-1/analysis"
#> 
#> $a4
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/output"
#> 
#> $a5
#> [1] "//very/long/path/to/the/projects/project-1/wp 2/input"
#> 
#> $a6
#> [1] "//very/long/path/to/the/projects/project-2/Grafiken"
#> 
#> $a7
#> [1] "//very/long/path/to/the/projects/project-2/Daten"

The dictionary is sorted by the importance of the strings (paths) that have been given to compress() so that the most important paths get the first positions in the dictionary.

The placeholder <a1> is given to the most “important” directory path, the placeholder <a2> to the second most “important”, and so on.

By “importance” we mean the product of the frequency of the directory path (its number of occurrences) and its string length. It is calculated internally by the function sorted_importance(). This function returns the product of string frequency and string length for the different strings given in its first argument.

kwb.pathdict:::named_vector_to_data_frame(
  kwb.pathdict:::sorted_importance(dirname(paths))
)
#>                                                       name value
#> 1    //very/long/path/to/the/projects/project-1/wp-1/input   106
#> 2      //very/long/path/to/the/projects/project-2/Berichte   102
#> 3 //very/long/path/to/the/projects/project-1/wp-1/analysis    56
#> 4   //very/long/path/to/the/projects/project-1/wp 2/output    54
#> 5    //very/long/path/to/the/projects/project-1/wp 2/input    53
#> 6      //very/long/path/to/the/projects/project-2/Grafiken    51
#> 7         //very/long/path/to/the/projects/project-2/Daten    48

For example, the importance of the path //very/long/path/to/the/projects/project-1/wp-1/input is 106 because it is 53 characters long and it appears 2 times in the vector paths:

# Paths of directories
dir_paths <- dirname(paths)

# Frequency of the first path
(path_frequency <- unname(table(dir_paths)[dir_paths[1]]))
#> [1] 2

# Length of the first path (in number of characters)
(path_length <- nchar(dir_paths[1]))
#> [1] 53

# Importance of the first path
path_frequency * path_length
#> [1] 106
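
The same calculation can be done for all distinct directory paths at once. The following is a minimal base R sketch of the idea, not the implementation of sorted_importance(). The helper name importance_sketch() and the vector dirs are made up for illustration.

```r
# Hypothetical helper (not part of kwb.pathdict): importance of each distinct
# string = number of occurrences * number of characters, sorted decreasingly
importance_sketch <- function(x) {
  freq <- table(x)
  importance <- as.integer(freq) * nchar(names(freq))
  sort(stats::setNames(importance, names(freq)), decreasing = TRUE)
}

# Made-up directory paths: "a/bb" occurs three times, "cccc/dd/e" once
dirs <- c("a/bb", "a/bb", "a/bb", "cccc/dd/e")

importance_sketch(dirs)
```

Here "a/bb" gets the importance 3 * 4 = 12 and "cccc/dd/e" the importance 1 * 9 = 9, so the short but frequent path comes first.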

Just as the original file paths contain common directory paths, the paths in the dictionary also contain common base paths. These common sub-paths can themselves be replaced by further placeholders. To achieve this, we set the argument maxdepth of compress_paths() to a value greater than one:

# Compress the full file paths by replacing directory paths by placeholders
paths_2 <- kwb.pathdict:::compress_paths(paths, maxdepth = 2)

# Show the short paths (without attribute "dict")
structure(paths_2, dict = NULL)
#> [1] "<a1>/file 1.csv"          "<a1>/file-2.csv"         
#> [3] "<a3>/summary.pdf"         "<a5>/köpenick_dirty.csv" 
#> [5] "<a4>/koepenick_clean.csv" "<a7>/file-1.csv"         
#> [7] "<a6>/file 1.png"          "<a2>/bericht-1.doc"      
#> [9] "<a2>/bericht-2.doc"

# The new paths are identical to the paths created with maxdepth = 1
identical(structure(paths_2, dict = NULL), structure(paths_1, dict = NULL))
#> [1] TRUE

# However, the "dictionary" paths are themselves shortened by new placeholders
kwb.pathdict:::print_dict(compressed = paths_2)
#> a1 = <b3>/input
#> a2 = <b1>/Berichte
#> a3 = <b3>/analysis
#> a4 = <b2>/output
#> a5 = <b2>/input
#> a6 = <b1>/Grafiken
#> a7 = <b1>/Daten
#> b1 = //very/long/path/to/the/projects/project-2
#> b2 = //very/long/path/to/the/projects/project-1/wp 2
#> b3 = //very/long/path/to/the/projects/project-1/wp-1
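
To make the idea of a second dictionary level more tangible, here is a simplified base R sketch. It only assumes that the parent directories of the dictionary entries are factored out once. It does not order the new placeholders by importance as the package does, and all object names and paths are made up for illustration.

```r
# Made-up dictionary of directory paths (first-level placeholders a1, a2, a3)
dict_a <- c(
  a1 = "base/project-1/wp-1/input",
  a2 = "base/project-1/wp-1/analysis",
  a3 = "base/project-2/Daten"
)

# Parent directories of the dictionary entries and their distinct values
parents <- dirname(dict_a)
unique_parents <- unique(parents)

# Second-level placeholders b1, b2, ... for the distinct parent directories
keys_b <- paste0("b", seq_along(unique_parents))
dict_b <- stats::setNames(unique_parents, keys_b)

# Rewrite each first-level entry as <parent placeholder>/<last path segment>
dict_a_short <- stats::setNames(
  paste0("<", keys_b[match(parents, unique_parents)], ">/", basename(dict_a)),
  names(dict_a)
)

# Combined two-level dictionary
c(dict_a_short, dict_b)
```

In this sketch a1 and a2 share the parent placeholder <b1>, while a3 refers to <b2>.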

The following code section demonstrates how increasing maxdepth step by step compresses the dictionary even further:

kwb.pathdict:::print_dict(paths, 3)
#> a1 = <b3>/input
#> a2 = <b1>/Berichte
#> a3 = <b3>/analysis
#> a4 = <b2>/output
#> a5 = <b2>/input
#> a6 = <b1>/Grafiken
#> a7 = <b1>/Daten
#> b1 = <c2>/project-2
#> b2 = <c1>/wp 2
#> b3 = <c1>/wp-1
#> c1 = //very/long/path/to/the/projects/project-1
#> c2 = //very/long/path/to/the/projects
kwb.pathdict:::print_dict(paths, 4)
#> a1 = <b3>/input
#> a2 = <b1>/Berichte
#> a3 = <b3>/analysis
#> a4 = <b2>/output
#> a5 = <b2>/input
#> a6 = <b1>/Grafiken
#> a7 = <b1>/Daten
#> b1 = <c2>/project-2
#> b2 = <c1>/wp 2
#> b3 = <c1>/wp-1
#> c1 = <c2>/project-1
#> c2 = <d1>/projects
#> d1 = //very/long/path/to/the
kwb.pathdict:::print_dict(paths, 5)
#> a1 = <b3>/input
#> a2 = <b1>/Berichte
#> a3 = <b3>/analysis
#> a4 = <b2>/output
#> a5 = <b2>/input
#> a6 = <b1>/Grafiken
#> a7 = <b1>/Daten
#> b1 = <c2>/project-2
#> b2 = <c1>/wp 2
#> b3 = <c1>/wp-1
#> c1 = <c2>/project-1
#> c2 = <d1>/projects
#> d1 = <e1>/the
#> e1 = //very/long/path/to

The dictionary is required to convert the short names back to the original paths using the resolve() function of the kwb.utils package:

# Reproduce the original file paths from the compressed paths and the dictionary
(reproduced_paths <- kwb.utils::resolve(paths_1, dictionary_1))
#> [1] "//very/long/path/to/the/projects/project-1/wp-1/input/file 1.csv"          
#> [2] "//very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv"          
#> [3] "//very/long/path/to/the/projects/project-1/wp-1/analysis/summary.pdf"      
#> [4] "//very/long/path/to/the/projects/project-1/wp 2/input/köpenick_dirty.csv"  
#> [5] "//very/long/path/to/the/projects/project-1/wp 2/output/koepenick_clean.csv"
#> [6] "//very/long/path/to/the/projects/project-2/Daten/file-1.csv"               
#> [7] "//very/long/path/to/the/projects/project-2/Grafiken/file 1.png"            
#> [8] "//very/long/path/to/the/projects/project-2/Berichte/bericht-1.doc"         
#> [9] "//very/long/path/to/the/projects/project-2/Berichte/bericht-2.doc"

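# Check if we really get back the original paths. Note that paths_1 holds the
# compressed paths (the original paths are in "paths"), so this comparison
# returns FALSE
identical(paths_1, reproduced_paths)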
#> [1] FALSE
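
How placeholders can be resolved, including dictionaries whose entries refer to further placeholders (maxdepth > 1), is sketched below in base R. This is not the implementation of kwb.utils::resolve(). The function name resolve_sketch() and the example data are assumptions made for illustration only.

```r
# Hypothetical helper: replace placeholders <key> by their dictionary values
# until nothing changes any more (assumes there are no circular references)
resolve_sketch <- function(x, dict) {
  repeat {
    resolved <- x
    for (key in names(dict)) {
      resolved <- gsub(
        paste0("<", key, ">"), dict[[key]], resolved, fixed = TRUE
      )
    }
    if (identical(resolved, x)) {
      return(x)
    }
    x <- resolved
  }
}

# Made-up two-level dictionary and compressed paths
dict <- list(a1 = "<b1>/input", a2 = "<b1>/output", b1 = "base")
x <- c("<a1>/file.csv", "<a2>/out.csv")

resolve_sketch(x, dict)
```

The loop first expands <a1> and <a2>, then the <b1> that they introduce, and stops when a further pass leaves the paths unchanged.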

OK, I understand what happened, but what can I use this for? The idea was to compress very big path lists in an effective but comprehensible and human-readable way. If we compare the size of the original path vector with the size of the compressed path vector including the dictionary, we see that the idea does not work at all for the given example paths. The size even grows with each additional level of compression:

# Object sizes in bytes for maxdepth = 1 to 5
sapply(1:5, kwb.pathdict:::dictionary_size, x = paths)
#> [1] 2848 3264 3584 3712 3888

However, with big file lists containing thousands of paths, we get a completely different picture:

set.seed(12059)
many_paths <- kwb.pathdict::random_paths(8)

# Size of path vector
object.size(many_paths)
#> 48848 bytes

# Size of compressed path vectors including dictionary
kwb.pathdict:::dictionary_size(many_paths, 1)
#> 42480 bytes
kwb.pathdict:::dictionary_size(many_paths, 2)
#> 40248 bytes
kwb.pathdict:::dictionary_size(many_paths, 3)
#> 40424 bytes
kwb.pathdict:::dictionary_size(many_paths, 4)
#> 40616 bytes
kwb.pathdict:::dictionary_size(many_paths, 5)
#> 40792 bytes

# What do the dictionaries look like?
kwb.pathdict:::print_dict(many_paths, 1)
#> a1 = plane/position/again/oxygen/nothing/cover/thought/surface
#> a2 = plane/position/again/oxygen/nothing/cover/thought/search
#> a3 = plane/position/again/oxygen/against/chick/whole/chart
#> a4 = plane/position/again/oxygen/nothing/cover/continue/element
#> a5 = plane/position/again/oxygen/nothing/cover/thought/quotient
#> a6 = plane/position/again/oxygen/nothing/produce/excite/provide
#> a7 = plane/position/again/thank/occur/fresh/invent/except
#> a8 = plane/position/again/oxygen/against/chick/indicate/enemy
#> a9 = plane/position/again/oxygen/nothing/cover/happy/protect
#> aA = plane/position/again/oxygen/wonder/apple/motion/among
#> aB = plane/position/again/oxygen/wonder/apple/motion/their
#> aC = plane/position/again/oxygen/wonder/apple/floor
#> aD = plane/position/again/oxygen/nothing/produce/excite
#> aE = plane/position/again/oxygen/nothing/cover/gather/mouth
#> aF = plane/position/again/oxygen/against/chick/indicate
#> a10 = plane/position/again/oxygen/toward/region/bright/thousand
#> a11 = plane/position/again/thank/occur/children/suggest
#> a12 = plane/position/again/oxygen/against/chick/whole/invent
#> a13 = plane/position/again/oxygen/wonder/apple/motion/paper
#> a14 = plane/position/again/thank/occur/fresh/invent/garden
#> a15 = plane/position/again/oxygen/nothing/cover/continue/subject
#> a16 = plane/position/again/oxygen/nothing/produce/every
#> a17 = plane/position/again/oxygen/nothing/produce/every/prepare
#> a18 = plane/position/again/oxygen/nothing/produce/every/whether
#> a19 = plane/position/again/oxygen/nothing/cover/happy
#> a1A = plane/position/again/oxygen/against/chick
#> a1B = plane/position/again/oxygen/wonder/apple/motion/north
#> a1C = plane/position/again/thank/occur/children/suggest/quotient
#> a1D = plane/position/again/oxygen/nothing/produce
#> a1E = plane/position/again/oxygen/toward/region/bright
#> a1F = plane/position/again/oxygen/nothing/cover/thought/necessary
#> a20 = plane/position/again/oxygen/toward/region/climb
#> a21 = plane/position/again/oxygen/wonder/apple/motion
#> a22 = plane/position/again/oxygen/wonder/apple/spread/heart
#> a23 = plane/position/again/oxygen/toward/region
#> a24 = plane/position/again/oxygen/toward
#> a25 = plane/position/again/thank/occur/fresh/invent/whose
#> a26 = plane/position/again/oxygen/nothing/cover/gather
#> a27 = plane/position/again/oxygen/against/chick/whole
#> a28 = plane/position/again/oxygen/against/chick/indicate/clear
#> a29 = plane/position/again/oxygen/nothing/cover/thought
#> a2A = plane/position/again/thank/occur
#> a2B = plane/position/again/oxygen/nothing/cover
#> a2C = plane/position/again/oxygen/wonder/apple
#> a2D = plane/position/again/thank/occur/children/suggest/triangle
#> a2E = plane/position/indicate/color
#> a2F = plane/position/again/thank/occur/fresh
#> a30 = plane/position/again/oxygen/nothing/cover/continue/month
#> a31 = plane/position/again/oxygen/toward/region/bright/second
#> a32 = plane/position/again/oxygen/nothing
#> a33 = plane/position/again/oxygen/toward/region/claim
#> a34 = plane/position/again/oxygen
#> a35 = plane/position/again/thank
#> a36 = plane/position/again/oxygen/wonder
#> a37 = plane/position/again/character
#> a38 = plane/position/again/oxygen/nothing/cover/thought/dollar
#> a39 = plane/position/again/oxygen/nothing/cover/happy/receive
#> a3A = plane/position/again/oxygen/nothing/cover/continue
#> a3B = plane/position/again/oxygen/wonder/apple/spread
#> a3C = plane/position/again/thank/occur/fresh/flower
#> a3D = plane/position/again/thank/occur/fresh/invent
#> a3E = plane/position/again/thank/occur/ready/proper
#> a3F = plane/position/again/thank/occur/children
#> a40 = plane/position/again
#> a41 = plane/position/plain
#> a42 = plane/position/again/thank/occur/ready
#> a43 = plane/position/indicate
#> a44 = plane/position
kwb.pathdict:::print_dict(many_paths, 2)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = plane/position/again/oxygen/against
#> b2 = plane
kwb.pathdict:::print_dict(many_paths, 3)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = plane/position/again/oxygen
kwb.pathdict:::print_dict(many_paths, 4)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = <d1>/oxygen
#> d1 = plane/position/again
kwb.pathdict:::print_dict(many_paths, 5)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = <d1>/oxygen
#> d1 = <e1>/again
#> e1 = plane/position

There seems to be a problem. There are unexpected duplicates!

for (m in 1:5) {
  message("m = ", m)
  kwb.pathdict:::print_dict(m = m, c(
    "a/b/c/d/e1/file1.txt", 
    "a/b/c/d/e2/file2.txt"
  ))
}
#> m = 1
#> a1 = a/b/c/d/e1
#> a2 = a/b/c/d/e2
#> m = 2
#> a1 = <b1>/e1
#> a2 = <b1>/e2
#> b1 = a/b/c/d
#> m = 3
#> a1 = <b1>/e1
#> a2 = <b1>/e2
#> b1 = <c1>/d
#> c1 = a/b/c
#> m = 4
#> a1 = <b1>/e1
#> a2 = <b1>/e2
#> b1 = <c1>/d
#> c1 = <d1>/c
#> d1 = a/b
#> m = 5
#> a1 = <b1>/e1
#> a2 = <b1>/e2
#> b1 = <c1>/d
#> c1 = <d1>/c
#> d1 = <e1>/b
#> e1 = a

The function compress_paths() is a high-level function that internally calls the helper function compress(). That function is demonstrated next.

Function compress()

The function compress() is a helper function that is called internally by the function compress_paths(). It replaces all distinct values in a vector with a short term that is formatted as a “placeholder”. It returns the vector of placeholders together with a “dictionary” in the attribute dict, i.e. a list that contains the original values as values and the short terms as keys:

compressed <- kwb.pathdict:::compress(c("abc", "abc", "defghi"))

kwb.utils::getAttribute(compressed, "dict")
#> $a1
#> [1] "abc"
#> 
#> $a2
#> [1] "defghi"

x1 <- c(rep("short", 1), rep("very much longer", 1))
x2 <- c(rep("short", 5), rep("very much longer", 1))

kwb.utils::getAttribute(kwb.pathdict:::compress(x1), "dict")
#> $a1
#> [1] "very much longer"
#> 
#> $a2
#> [1] "short"
kwb.utils::getAttribute(kwb.pathdict:::compress(x2), "dict")
#> $a1
#> [1] "short"
#> 
#> $a2
#> [1] "very much longer"

The elements in the dictionary are ordered by “importance”, i.e. the product of frequency and length of the strings (see function sorted_importance() above).
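
The behaviour shown above can be imitated with a few lines of base R. The following is only a sketch under simplifying assumptions, not the code of compress(): the name compress_sketch() is made up, and the placeholders are numbered decimally (a1, a2, ...), whereas the dictionaries printed above suggest a different numbering for larger dictionaries.

```r
# Hypothetical helper: replace distinct values by placeholders <a1>, <a2>, ...
# ordered by importance (frequency * string length)
compress_sketch <- function(x) {
  freq <- table(x)
  importance <- as.integer(freq) * nchar(names(freq))

  # Distinct values, most important first
  values <- names(freq)[order(-importance)]
  keys <- paste0("a", seq_along(values))

  # Vector of placeholders with the dictionary in attribute "dict"
  structure(
    paste0("<", keys[match(x, values)], ">"),
    dict = stats::setNames(as.list(values), keys)
  )
}

x1 <- c(rep("short", 1), rep("very much longer", 1))
x2 <- c(rep("short", 5), rep("very much longer", 1))

attr(compress_sketch(x1), "dict")
attr(compress_sketch(x2), "dict")
```

As in the example above, "very much longer" (importance 16) comes first for x1, whereas "short" (importance 5 * 5 = 25) comes first for x2.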

Two further compress functions are described next.

compress_one_by_one()

What does this function do?

out <- capture.output(result <- kwb.pathdict:::compress_one_by_one(x1))
result
#> [[1]]
#> [1] "short" "<A>"  
#> 
#> [[2]]
#> [1] "<B>" "<A>"
#> 
#> [[3]]
#> [1] "<B>" "<C>"
#> 
#> [[4]]
#> [1] "<D>" "<C>"
#> 
#> [[5]]
#> [1] "<D>" "<E>"
#> 
#> [[6]]
#> [1] "<F>" "<E>"
#> 
#> [[7]]
#> [1] "<F>" "<G>"
#> 
#> [[8]]
#> [1] "<H>" "<G>"
#> 
#> [[9]]
#> [1] "<H>" "<I>"
#> 
#> [[10]]
#> [1] "<J>" "<I>"

compress_with_dictionary()

The compress_with_dictionary() function expects a character matrix and a dictionary, i.e. a list of values that are assigned to short keywords, as input. Values that are contained in the dictionary are replaced by their short key. Values that are not contained are replaced with what is given in fill.value. But why? I think that this is just to treat empty values in the matrix. For all non-empty values it is assumed that a corresponding entry is contained in the dictionary.

(subdirs_1 <- matrix(c("abc", "def", "ghi", "jkl"), 2, 2))
#>      [,1]  [,2] 
#> [1,] "abc" "ghi"
#> [2,] "def" "jkl"
(subdirs_2 <- matrix(c("abc", "def", "ghi", ""), 2, 2))
#>      [,1]  [,2] 
#> [1,] "abc" "ghi"
#> [2,] "def" ""

kwb.pathdict:::compress_with_dictionary(
  subdirs_1,
  dict = list(a = "abc", d = "def", g = "ghi", k = "klm"),
  fill.value = "_filled_"
)
#>      [,1] [,2]      
#> [1,] "a"  "g"       
#> [2,] "d"  "_filled_"

kwb.pathdict:::compress_with_dictionary(
  subdirs_2,
  dict = list(a = "abc", d = "def", g = "ghi", j = "jkl"),
  fill.value = "_filled_"
)
#>      [,1] [,2]      
#> [1,] "a"  "g"       
#> [2,] "d"  "_filled_"
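
The lookup with a fill value can be sketched in base R as follows. This is an assumption about the principle, not the code of compress_with_dictionary(). The function name compress_matrix_sketch() is made up for illustration.

```r
# Hypothetical helper: look up each matrix element in the dictionary values
# and return the corresponding key, or fill.value if there is no match
compress_matrix_sketch <- function(x, dict, fill.value = "") {
  keys <- names(dict)[match(x, unlist(dict))]
  keys[is.na(keys)] <- fill.value

  # Restore the matrix shape (elements are processed column by column)
  matrix(keys, nrow = nrow(x), ncol = ncol(x))
}

subdirs <- matrix(c("abc", "def", "ghi", ""), 2, 2)

compress_matrix_sketch(
  subdirs,
  dict = list(a = "abc", d = "def", g = "ghi"),
  fill.value = "_filled_"
)
```

The empty string in the second row and second column has no dictionary entry and is therefore replaced by "_filled_".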

Hauke Sonnenberg