See the vignette Working with File Paths in R Scripts for an introduction to how we suggest handling file paths in R scripts. In this vignette I recapitulate what the functions do that deal with path dictionaries and that I extracted from the package kwb.fakin into this package.

Compression of Folder Paths

Assume you have a folder with a long path (either because of many levels or because of long folder names). This long path is repeated for each file in the folder. The package kwb.pathdict provides the function compress_paths(), which can be used to introduce short names for directory paths.

Functions compress_paths() and sorted_importance()

The function compress_paths() takes a vector of full file paths as input. The paths are “compressed” by replacing each directory path with a placeholder (<a1>, <a2>, <a3>, and so on), as shown in the following:

Equal directories are replaced by equal placeholders. The placeholders and their “values” are stored in a so-called dictionary. The dictionary can be seen as a lookup table that maps short names to long names. It is returned in the attribute dict of the object that compress_paths() returns:
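The idea can be sketched in a few lines of base R. This is only an illustration, not the implementation in kwb.pathdict: the example paths are hypothetical, and the placeholders are assigned in order of first occurrence instead of by importance.

```r
# Hypothetical example paths (not taken from the package)
paths <- c(
  "//very/long/path/to/the/projects/project-1/wp-1/input/file-1.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/output/result.csv"
)

# Split each path into its directory part and its file name
dirs <- sub("/[^/]*$", "", paths)
files <- sub("^.*/", "", paths)

# One key per distinct directory, numbered in hexadecimal: a1, a2, ...
unique_dirs <- unique(dirs)
keys <- sprintf("a%X", seq_along(unique_dirs))

# The dictionary maps short keys to long directory paths
dict <- stats::setNames(as.list(unique_dirs), keys)

# Replace each directory with its placeholder and attach the dictionary
compressed <- paste0("<", keys[match(dirs, unique_dirs)], ">/", files)
attr(compressed, "dict") <- dict

compressed
attr(compressed, "dict")
```

The two files in the input folder share the placeholder <a1>, and the file in the output folder gets <a2>.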

The dictionary is sorted by the importance of the strings (paths) that have been given to compress(), so that the most important paths come first in the dictionary.

The placeholder <a1> is given to the most “important” directory path, the placeholder <a2> to the second most “important”, and so on.

By “importance” we mean the product of the frequency of the directory path (its number of occurrences) and its string length. It is calculated internally by the function sorted_importance(), which returns the product of string frequency and string length for each distinct string given in its first argument.

For example, the importance of the path //very/long/path/to/the/projects/project-1/wp-1/input is 106 because it is 53 characters long and it appears 2 times in the vector paths:
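The arithmetic can be checked with base R alone. The vector dirs below is a hypothetical stand-in for the directory parts of the example paths. The calculation follows the definition given above (frequency times string length) and does not call sorted_importance() itself.

```r
# Hypothetical directory paths; the first one occurs twice
dirs <- c(
  "//very/long/path/to/the/projects/project-1/wp-1/input",
  "//very/long/path/to/the/projects/project-1/wp-1/input",
  "//very/long/path/to/the/projects/project-1/wp-1/output"
)

# Number of occurrences of each distinct string
frequency <- table(dirs)

# Importance = frequency * string length, most important first
importance <- sort(
  stats::setNames(
    as.integer(frequency) * nchar(names(frequency)),
    names(frequency)
  ),
  decreasing = TRUE
)

importance[["//very/long/path/to/the/projects/project-1/wp-1/input"]]
#> [1] 106
```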

Just as the original file paths contain common directory paths, the paths in the dictionary also contain common base paths. These common sub-paths can themselves be replaced by further placeholders. To do so, we set the argument maxdepth of compress_paths():

The following code section demonstrates how increasing maxdepth step by step compresses the dictionary even further:
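To illustrate what one additional level of compression means, the following base R sketch replaces the parent directory of each dictionary value with a second-level placeholder. It is a simplified illustration under my own assumptions and not the algorithm that compress_paths() applies. The dictionary and the key names b1, b2 are hypothetical.

```r
# Hypothetical first-level dictionary
dict <- list(
  a1 = "//very/long/path/to/the/projects/project-1/wp-1/input",
  a2 = "//very/long/path/to/the/projects/project-1/wp-1/output",
  a3 = "//very/long/path/to/the/projects/project-2"
)

# Split each dictionary value into parent directory and last segment
values <- unlist(dict)
parents <- sub("/[^/]*$", "", values)
leaves <- sub("^.*/", "", values)

# One second-level key per distinct parent directory
unique_parents <- unique(parents)
keys <- sprintf("b%X", seq_along(unique_parents))

# First-level entries now refer to second-level placeholders
level_1 <- stats::setNames(
  as.list(paste0("<", keys[match(parents, unique_parents)], ">/", leaves)),
  names(dict)
)

# Second-level entries hold the parent directories
level_2 <- stats::setNames(as.list(unique_parents), keys)

deeper_dict <- c(level_1, level_2)
deeper_dict
```

Here a1 and a2 both become short entries that refer to <b1>, so their long common parent path is stored only once.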

The dictionary is required to convert the short names back to the original paths using the resolve() function of the kwb.utils package:
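What resolving means can be sketched without any package. The helper resolve_placeholders() below is hypothetical and written only for this illustration. It is not the resolve() function of kwb.utils, and it assumes that the dictionary contains no circular references.

```r
# Hypothetical helper: substitute placeholders until nothing changes.
# Assumes that the dictionary has no circular references.
resolve_placeholders <- function(x, dict) {
  repeat {
    previous <- x
    for (key in names(dict)) {
      x <- gsub(paste0("<", key, ">"), dict[[key]], x, fixed = TRUE)
    }
    if (identical(x, previous)) break
  }
  x
}

# Hypothetical two-level dictionary
dict <- list(
  a1 = "<b1>/input",
  a2 = "<b1>/output",
  b1 = "//very/long/path/to/the/projects/project-1/wp-1"
)

resolved <- resolve_placeholders(c("<a1>/file-1.csv", "<a2>/result.csv"), dict)
resolved
```

Placeholders are expanded repeatedly, so nested references such as <a1> pointing to <b1> are resolved as well.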

OK, I understand what happened, but what can I use this for? The idea was to compress very big path lists in an effective but comprehensible and human-readable way. If we compare the size of the original path vector with the size of the compressed path vector including the dictionary, we see that the idea does not work at all for the given example paths:
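Such a comparison can be set up as follows. This is a sketch with hypothetical paths and a hand-made compression, not the original code of this vignette. The resulting numbers depend on the input and on the R version, so no output is shown.

```r
# Hypothetical example paths
paths <- c(
  "//very/long/path/to/the/projects/project-1/wp-1/input/file-1.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/input/file-2.csv",
  "//very/long/path/to/the/projects/project-1/wp-1/output/result.csv"
)

# Hand-made compression: one placeholder per distinct directory
dirs <- sub("/[^/]*$", "", paths)
files <- sub("^.*/", "", paths)
unique_dirs <- unique(dirs)
keys <- sprintf("a%X", seq_along(unique_dirs))
dict <- stats::setNames(as.list(unique_dirs), keys)
compressed <- paste0("<", keys[match(dirs, unique_dirs)], ">/", files)

# Sizes in bytes: original vector versus compressed vector plus dictionary
sizes <- c(
  original = as.numeric(utils::object.size(paths)),
  compressed = as.numeric(utils::object.size(compressed)) +
    as.numeric(utils::object.size(dict))
)

sizes
```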

However, using big file lists with thousands of paths, we get a completely different picture:

set.seed(12059)
many_paths <- kwb.pathdict::random_paths(8)

# Size of path vector
object.size(many_paths)
#> 48848 bytes

# Size of compressed path vectors including dictionary
kwb.pathdict:::dictionary_size(many_paths, 1)
#> 42480 bytes
kwb.pathdict:::dictionary_size(many_paths, 2)
#> 40248 bytes
kwb.pathdict:::dictionary_size(many_paths, 3)
#> 40424 bytes
kwb.pathdict:::dictionary_size(many_paths, 4)
#> 40616 bytes
kwb.pathdict:::dictionary_size(many_paths, 5)
#> 40792 bytes

# What do the dictionaries look like?
kwb.pathdict:::print_dict(many_paths, 1)
#> a1 = plane/position/again/oxygen/nothing/cover/thought/surface
#> a2 = plane/position/again/oxygen/nothing/cover/thought/search
#> a3 = plane/position/again/oxygen/against/chick/whole/chart
#> a4 = plane/position/again/oxygen/nothing/cover/continue/element
#> a5 = plane/position/again/oxygen/nothing/cover/thought/quotient
#> a6 = plane/position/again/oxygen/nothing/produce/excite/provide
#> a7 = plane/position/again/thank/occur/fresh/invent/except
#> a8 = plane/position/again/oxygen/against/chick/indicate/enemy
#> a9 = plane/position/again/oxygen/nothing/cover/happy/protect
#> aA = plane/position/again/oxygen/wonder/apple/motion/among
#> aB = plane/position/again/oxygen/wonder/apple/motion/their
#> aC = plane/position/again/oxygen/wonder/apple/floor
#> aD = plane/position/again/oxygen/nothing/produce/excite
#> aE = plane/position/again/oxygen/nothing/cover/gather/mouth
#> aF = plane/position/again/oxygen/against/chick/indicate
#> a10 = plane/position/again/oxygen/toward/region/bright/thousand
#> a11 = plane/position/again/thank/occur/children/suggest
#> a12 = plane/position/again/oxygen/against/chick/whole/invent
#> a13 = plane/position/again/oxygen/wonder/apple/motion/paper
#> a14 = plane/position/again/thank/occur/fresh/invent/garden
#> a15 = plane/position/again/oxygen/nothing/cover/continue/subject
#> a16 = plane/position/again/oxygen/nothing/produce/every
#> a17 = plane/position/again/oxygen/nothing/produce/every/prepare
#> a18 = plane/position/again/oxygen/nothing/produce/every/whether
#> a19 = plane/position/again/oxygen/nothing/cover/happy
#> a1A = plane/position/again/oxygen/against/chick
#> a1B = plane/position/again/oxygen/wonder/apple/motion/north
#> a1C = plane/position/again/thank/occur/children/suggest/quotient
#> a1D = plane/position/again/oxygen/nothing/produce
#> a1E = plane/position/again/oxygen/toward/region/bright
#> a1F = plane/position/again/oxygen/nothing/cover/thought/necessary
#> a20 = plane/position/again/oxygen/toward/region/climb
#> a21 = plane/position/again/oxygen/wonder/apple/motion
#> a22 = plane/position/again/oxygen/wonder/apple/spread/heart
#> a23 = plane/position/again/oxygen/toward/region
#> a24 = plane/position/again/oxygen/toward
#> a25 = plane/position/again/thank/occur/fresh/invent/whose
#> a26 = plane/position/again/oxygen/nothing/cover/gather
#> a27 = plane/position/again/oxygen/against/chick/whole
#> a28 = plane/position/again/oxygen/against/chick/indicate/clear
#> a29 = plane/position/again/oxygen/nothing/cover/thought
#> a2A = plane/position/again/thank/occur
#> a2B = plane/position/again/oxygen/nothing/cover
#> a2C = plane/position/again/oxygen/wonder/apple
#> a2D = plane/position/again/thank/occur/children/suggest/triangle
#> a2E = plane/position/indicate/color
#> a2F = plane/position/again/thank/occur/fresh
#> a30 = plane/position/again/oxygen/nothing/cover/continue/month
#> a31 = plane/position/again/oxygen/toward/region/bright/second
#> a32 = plane/position/again/oxygen/nothing
#> a33 = plane/position/again/oxygen/toward/region/claim
#> a34 = plane/position/again/oxygen
#> a35 = plane/position/again/thank
#> a36 = plane/position/again/oxygen/wonder
#> a37 = plane/position/again/character
#> a38 = plane/position/again/oxygen/nothing/cover/thought/dollar
#> a39 = plane/position/again/oxygen/nothing/cover/happy/receive
#> a3A = plane/position/again/oxygen/nothing/cover/continue
#> a3B = plane/position/again/oxygen/wonder/apple/spread
#> a3C = plane/position/again/thank/occur/fresh/flower
#> a3D = plane/position/again/thank/occur/fresh/invent
#> a3E = plane/position/again/thank/occur/ready/proper
#> a3F = plane/position/again/thank/occur/children
#> a40 = plane/position/again
#> a41 = plane/position/plain
#> a42 = plane/position/again/thank/occur/ready
#> a43 = plane/position/indicate
#> a44 = plane/position
kwb.pathdict:::print_dict(many_paths, 2)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = plane/position/again/oxygen/against
#> b2 = plane
kwb.pathdict:::print_dict(many_paths, 3)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = plane/position/again/oxygen
kwb.pathdict:::print_dict(many_paths, 4)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = <d1>/oxygen
#> d1 = plane/position/again
kwb.pathdict:::print_dict(many_paths, 5)
#> a1 = <a29>/surface
#> a2 = <a29>/search
#> a3 = <a27>/chart
#> a4 = <a3A>/element
#> a5 = <a29>/quotient
#> a6 = <aD>/provide
#> a7 = <a3D>/except
#> a8 = <aF>/enemy
#> a9 = <a19>/protect
#> aA = <a21>/among
#> aB = <a21>/their
#> aC = <a2C>/floor
#> aD = <a1D>/excite
#> aE = <a26>/mouth
#> aF = <a1A>/indicate
#> a10 = <a1E>/thousand
#> a11 = <a3F>/suggest
#> a12 = <a27>/invent
#> a13 = <a21>/paper
#> a14 = <a3D>/garden
#> a15 = <a3A>/subject
#> a16 = <a1D>/every
#> a17 = <a16>/prepare
#> a18 = <a16>/whether
#> a19 = <a2B>/happy
#> a1A = <b1>/chick
#> a1B = <a21>/north
#> a1C = <a11>/quotient
#> a1D = <a32>/produce
#> a1E = <a23>/bright
#> a1F = <a29>/necessary
#> a20 = <a23>/climb
#> a21 = <a2C>/motion
#> a22 = <a3B>/heart
#> a23 = <a24>/region
#> a24 = <a34>/toward
#> a25 = <a3D>/whose
#> a26 = <a2B>/gather
#> a27 = <a1A>/whole
#> a28 = <aF>/clear
#> a29 = <a2B>/thought
#> a2A = <a35>/occur
#> a2B = <a32>/cover
#> a2C = <a36>/apple
#> a2D = <a11>/triangle
#> a2E = <a43>/color
#> a2F = <a2A>/fresh
#> a30 = <a3A>/month
#> a31 = <a1E>/second
#> a32 = <a34>/nothing
#> a33 = <a23>/claim
#> a34 = <a40>/oxygen
#> a35 = <a40>/thank
#> a36 = <a34>/wonder
#> a37 = <a40>/character
#> a38 = <a29>/dollar
#> a39 = <a19>/receive
#> a3A = <a2B>/continue
#> a3B = <a2C>/spread
#> a3C = <a2F>/flower
#> a3D = <a2F>/invent
#> a3E = <a42>/proper
#> a3F = <a2A>/children
#> a40 = <a44>/again
#> a41 = <a44>/plain
#> a42 = <a2A>/ready
#> a43 = <a44>/indicate
#> a44 = <b2>/position
#> b1 = <c1>/against
#> b2 = plane
#> c1 = <d1>/oxygen
#> d1 = <e1>/again
#> e1 = plane/position

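There seems to be a problem: the dictionaries contain unexpected duplicates! For example, with a maximum depth of 3 the entry c1 is plane/position/again/oxygen, which is exactly the path that a34 already stands for.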

The function compress_paths() is a high-level function that internally calls the helper function compress(). That function is demonstrated next.

Function compress()

The function compress() is a helper function that is called internally by the function compress_paths(). It replaces all distinct values in a vector with a short term that is formatted as a “placeholder”. It returns a “dictionary”, i.e. a list that contains the original values as values and the short terms as keys.

The elements in the dictionary are ordered by “importance”, i.e. the product of frequency and length of the strings (see function sorted_importance() above).
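The behaviour described above can be approximated in base R as follows. The function compress_sketch() is a hypothetical stand-in written for this illustration, and its return format (placeholders with the dictionary attached as an attribute) is my own choice, not necessarily that of compress(). It assumes a character vector without missing values.

```r
# Hypothetical stand-in for the behaviour described above
compress_sketch <- function(x) {
  # Importance of each distinct value: frequency * string length
  frequency <- table(x)
  importance <- as.integer(frequency) * nchar(names(frequency))

  # Distinct values, most important first
  values <- names(frequency)[order(importance, decreasing = TRUE)]

  # Keys numbered in hexadecimal: a1, a2, ..., a9, aA, ...
  keys <- sprintf("a%X", seq_along(values))

  # Placeholders replace the values; the dictionary is attached
  structure(
    paste0("<", keys[match(x, values)], ">"),
    dict = stats::setNames(as.list(values), keys)
  )
}

x <- c("alpha", "a-much-longer-value", "alpha", "alpha")
result <- compress_sketch(x)

result
attr(result, "dict")
```

In this example the long value occurs only once but has an importance of 19 (1 x 19 characters), which beats the 15 (3 x 5 characters) of "alpha", so the long value receives the key a1.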

There are three more compress functions that are described next.

compress_with_dictionary()

The compress_with_dictionary() function expects a character matrix and a dictionary (i.e. a list of values that are assigned to short keys) as input. Values that are contained in the dictionary are replaced by their short key. Values that are not contained in the dictionary are replaced with what is given in fill.value. But why? I think that this is just to handle empty values in the matrix. For all non-empty values it is assumed that a corresponding entry is contained in the dictionary.
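A minimal base R sketch of this lookup could look as follows. The function name compress_with_dictionary_sketch(), its default for fill.value and the decision to return bare keys without angle brackets are assumptions made for this illustration, not the package's actual code.

```r
# Hypothetical sketch: replace matrix values by their dictionary keys
compress_with_dictionary_sketch <- function(x, dict, fill.value = "") {
  # Look up each matrix element among the dictionary values
  values <- unlist(dict)
  keys <- names(dict)[match(x, values)]

  # Elements without a dictionary entry (e.g. empty strings) are filled
  keys[is.na(keys)] <- fill.value

  # Restore the shape of the input matrix
  matrix(keys, nrow = nrow(x), ncol = ncol(x), dimnames = dimnames(x))
}

# Hypothetical input: path segments by row, padded with an empty string
m <- matrix(c("projects", "data", "projects", ""), nrow = 2, byrow = TRUE)
dict <- list(a1 = "projects", a2 = "data")

compressed_matrix <- compress_with_dictionary_sketch(m, dict)
compressed_matrix
```

The empty cell in the second row has no dictionary entry and therefore receives fill.value.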