This vignette describes the functions that have not yet been described in other vignettes.

to_dictionary(), use_dictionary()

The function to_dictionary() creates a “dictionary” for a given vector of input strings. Each unique input string is given a short name by which it can be looked up in the returned dictionary. The dictionary is a list with the assigned short names as keys and the unique input strings as values. The entries are sorted in decreasing order of the “importance” of the corresponding input string, i.e. the product of its frequency and its string length.
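
The ranking by “importance” can be illustrated in base R. The following is only a minimal sketch, not the implementation of to_dictionary(): the helper name sketch_dictionary(), the key scheme (prefix "a" plus a running number) and the example strings are assumptions made for illustration.

```r
# Hypothetical helper that mimics the idea behind to_dictionary()
sketch_dictionary <- function(x, prefix = "a") {
  # Frequency of each unique input string
  freq <- table(x)
  strings <- names(freq)

  # "Importance" = frequency * string length
  importance <- as.integer(freq) * nchar(strings)

  # Most important strings first
  strings <- strings[order(importance, decreasing = TRUE)]

  # Short names as keys, unique strings as values (key scheme is assumed)
  stats::setNames(as.list(strings), paste0(prefix, seq_along(strings)))
}

x <- c(
  "short", "a/very/long/path", "short",
  "short", "a/very/long/path", "mid/path"
)

dict <- sketch_dictionary(x)

# Importance: "a/very/long/path" = 2 * 16 = 32, "short" = 3 * 5 = 15,
# "mid/path" = 1 * 8 = 8
names(dict)                 # "a1" "a2" "a3"
unlist(dict, use.names = FALSE)
# "a/very/long/path" "short" "mid/path"
```

Note that the most frequent string ("short") does not come first here, because the longer path saves more characters overall.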

Once you have defined a dictionary you can replace the original strings with placeholders that correspond to the short names in the dictionary:
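
A minimal base R sketch of such a replacement is shown here. It does not call use_dictionary(); the helper to_placeholder_sketch(), the placeholder format (key wrapped in angle brackets) and the example dictionary are assumptions made for illustration.

```r
# Example dictionary: short names (keys) -> original strings (values)
dict <- list(a1 = "a/very/long/path", a2 = "short")

# Assumed placeholder format: the key wrapped in angle brackets
to_placeholder_sketch <- function(key) paste0("<", key, ">")

x <- c("short", "a/very/long/path", "short", "unknown")

# Position of each string among the dictionary values (NA if absent)
index <- match(x, unlist(dict))
found <- !is.na(index)

# Replace strings found in the dictionary, keep all others unchanged
y <- x
y[found] <- to_placeholder_sketch(names(dict)[index[found]])

y
# "<a2>" "<a1>" "<a2>" "unknown"
```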

The short paths can be resolved back to the original paths using the resolve() function from the kwb.utils package:
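
To show what resolving means without depending on kwb.utils, here is a simplified base R sketch. The helper resolve_sketch() is hypothetical and is not kwb.utils::resolve(); it assumes placeholders of the form <key> and performs a single substitution pass, so placeholders nested inside dictionary values are not handled.

```r
# Example dictionary and shortened strings (assumed <key> placeholder format)
dict <- list(a1 = "a/very/long/path", a2 = "short")
y <- c("<a2>", "<a1>/file.txt", "unknown")

# Hypothetical helper: substitute each placeholder by its dictionary value
resolve_sketch <- function(x, dict) {
  for (key in names(dict)) {
    placeholder <- paste0("<", key, ">")
    # fixed = TRUE: treat the placeholder as a literal string, not a regex
    x <- gsub(placeholder, dict[[key]], x, fixed = TRUE)
  }
  x
}

resolved <- resolve_sketch(y, dict)

resolved
# "short" "a/very/long/path/file.txt" "unknown"
```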

get_dictionary_one_by_one()

#> Distribution of path depths:
#> n_levels
#> 2 3 4 5 
#> 1 3 2 5 
#> i = 1, n = 11...
#> utils::head(y):
#> liquid 
#>     66 
#> i = 2, n = 11...
#> utils::head(y):
#> x
#>       liquid/oxygen     liquid/consider liquid/electric.png 
#>                  78                  60                  19 
#> i = 3, n = 10...
#> utils::head(y):
#> x
#>       liquid/oxygen/sharp  liquid/consider/industry liquid/consider/child.png 
#>                        76                        72                        25 
#>   liquid/oxygen/stand.doc   liquid/oxygen/train.png 
#>                        23                        23 
#> i = 4, n = 7...
#> utils::head(y):
#> x
#>     liquid/consider/industry/quite         liquid/oxygen/sharp/phrase 
#>                                 60                                 52 
#> liquid/consider/industry/power.pdf    liquid/oxygen/sharp/written.doc 
#>                                 34                                 31 
#>     liquid/oxygen/sharp/experiment 
#>                                 30 
#> i = 5, n = 5...
#> utils::head(y):
#> x
#> liquid/oxygen/sharp/experiment/continue.xls 
#>                                          43 
#>    liquid/consider/industry/quite/drink.png 
#>                                          40 
#>    liquid/consider/industry/quite/wheel.pdf 
#>                                          40 
#>        liquid/oxygen/sharp/phrase/discuss.R 
#>                                          36 
#>        liquid/oxygen/sharp/phrase/occur.jpg 
#>                                          36 
#>                                          path score length count score2
#> 1                         liquid/oxygen/sharp    76     19     4     60
#> 2                               liquid/oxygen    78     13     6     54
#> 3              liquid/consider/industry/quite    60     30     2     52
#> 4 liquid/oxygen/sharp/experiment/continue.xls    43     43     1     39
#> 5                                      liquid    66      6    11     22
#>  i key score count length                path score2
#>  1  p1    76     4     19 liquid/oxygen/sharp     60
#>                                          path score length count score2
#> 1                               liquid/oxygen    78     13     6     54
#> 2              liquid/consider/industry/quite    60     30     2     52
#> 3 liquid/oxygen/sharp/experiment/continue.xls    43     28     1     24
#> 4                                      liquid    66      6    11     22
#>  i key score count length          path score2
#>  2  p2    78     6     13 liquid/oxygen     54
#>                                          path score length count score2
#> 1              liquid/consider/industry/quite    60     30     2     52
#> 2                                      liquid    66      6    11     22
#> 3 liquid/oxygen/sharp/experiment/continue.xls    43     19     1     15
#>  i key score count length                           path score2
#>  3  p3    60     2     30 liquid/consider/industry/quite     52
#>                                          path score length count score2
#> 1                                      liquid    66      6    11     22
#> 2 liquid/oxygen/sharp/experiment/continue.xls    43     19     1     15
#>  i key score count length   path score2
#>  4  p4    66    11      6 liquid     22
#>                                          path score length count score2
#> 1 liquid/oxygen/sharp/experiment/continue.xls    43     17     1     13
#>  i key score count length                                        path score2
#>  5  p5    43     1     17 liquid/oxygen/sharp/experiment/continue.xls     13
result
#> $p1
#> [1] "liquid/oxygen/sharp"
#> 
#> $p2
#> [1] "liquid/oxygen"
#> 
#> $p3
#> [1] "liquid/consider/industry/quite"
#> 
#> $p4
#> [1] "liquid"
#> 
#> $p5
#> [1] "liquid/oxygen/sharp/experiment/continue.xls"

Subfolder Frequency Functions

At the start of get_dictionary_one_by_one(), the function get_subdir_frequencies() is called. It takes a vector of path strings as input. For each possible depth in the path tree, all distinct sub-paths are determined and ordered by their “importance”, i.e. by the product of frequency and path length in characters. By default, only the most important path per depth is returned:

To return all different paths, set first.only = FALSE:
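
The per-depth ranking described above can be sketched in base R. This is a minimal illustration, not the implementation of get_subdir_frequencies(); the function name subdir_frequencies_sketch(), its argument first_only (mirroring first.only) and the example paths are hypothetical.

```r
# Hypothetical helper: rank sub-paths per depth by frequency * nchar
subdir_frequencies_sketch <- function(paths, first_only = TRUE) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  max_depth <- max(lengths(parts))

  lapply(seq_len(max_depth), function(depth) {
    # Consider only paths that have at least `depth` segments
    deep_enough <- parts[lengths(parts) >= depth]

    # Sub-path consisting of the first `depth` segments
    sub_paths <- vapply(
      deep_enough,
      function(p) paste(p[seq_len(depth)], collapse = "/"),
      character(1)
    )

    # Score = frequency * number of characters, most important first
    freq <- table(sub_paths)
    score <- stats::setNames(as.integer(freq) * nchar(names(freq)), names(freq))
    score <- sort(score, decreasing = TRUE)

    if (first_only) score[1] else score
  })
}

paths <- c(
  "data/raw/a.csv", "data/raw/b.csv", "data/raw/c.csv",
  "data/out/report.pdf", "docs/readme.md"
)

# Most important sub-path per depth:
# depth 1: "data" (4 * 4 = 16), depth 2: "data/raw" (3 * 8 = 24),
# depth 3: "data/out/report.pdf" (1 * 19 = 19)
top <- subdir_frequencies_sketch(paths)

# All distinct sub-paths per depth
all_paths <- subdir_frequencies_sketch(paths, first_only = FALSE)
```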

The function to_frequency_data() is called next within get_dictionary_one_by_one(). It converts the list returned by get_subdir_frequencies() to a data frame.
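
A conversion of this kind might look as follows. This sketch is not the code of to_frequency_data(): it assumes that the input is a list of named score vectors (one per depth) and that the count can be recovered as score divided by path length; the helper name and the example values are hypothetical.

```r
# Assumed input: one named score vector per depth (name = path)
frequency_list <- list(
  c("data" = 16L),
  c("data/raw" = 24L),
  c("data/out/report.pdf" = 19L)
)

# Hypothetical helper: flatten the list into a data frame
to_frequency_data_sketch <- function(x) {
  scores <- unlist(x)
  paths <- names(scores)
  path_lengths <- nchar(paths)

  data.frame(
    path = paths,
    score = unname(scores),
    length = path_lengths,
    # score = length * count, so the count is the integer quotient
    count = unname(scores) %/% path_lengths,
    stringsAsFactors = FALSE
  )
}

frequency_data <- to_frequency_data_sketch(frequency_list)

frequency_data$count
# 4 3 1
```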

The function rescore_and_reorder_frequency_data() takes a data frame with columns length and count as input. It calculates score2 = (length - placeholder_size) * count and orders the data frame decreasingly by this score.
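
The formula can be checked against the numbers printed in the get_dictionary_one_by_one() output above. The following base R sketch is not the package function; the helper name rescore_sketch() is hypothetical, and the default placeholder_size of 4 is an assumption corresponding to a placeholder such as <p1>.

```r
# Hypothetical helper implementing score2 = (length - placeholder_size) * count
rescore_sketch <- function(frequency_data, placeholder_size = 4L) {
  frequency_data$score2 <-
    (frequency_data$length - placeholder_size) * frequency_data$count

  # Order decreasingly by the new score
  frequency_data[order(frequency_data$score2, decreasing = TRUE), , drop = FALSE]
}

# Lengths and counts taken from the output shown above
frequency_data <- data.frame(
  path = c("liquid", "liquid/oxygen", "liquid/oxygen/sharp"),
  length = c(6L, 13L, 19L),
  count = c(11L, 6L, 4L),
  stringsAsFactors = FALSE
)

rescored <- rescore_sketch(frequency_data)

rescored$path
# "liquid/oxygen/sharp" "liquid/oxygen" "liquid"
rescored$score2
# 60 54 22
```

Although "liquid" occurs most often, replacing it saves only two characters per occurrence, so it ends up last.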

get_subdirs_by_frequency()

The function get_subdirs_by_frequency() is used by a FAKIN script that is not yet part of a package. It takes the following inputs:

  1. subdirs: subdirectory matrix
  2. cumid: cumulative identifier, created with kwb.pathdict:::to_cumulative_id()
  3. freqinfo: one-row data frame with columns depth, n.x, n.Freq

Functions that still need to be described

kwb.pathdict:::log_result_if
#> function (dbg, x, y) 
#> {
#>     if (dbg) {
#>         kwb.utils::catLines(c("\n### x:", x))
#>         kwb.utils::catLines(c("\n### y:", y))
#>         cat("\n### str(dict):\n")
#>         utils::str(kwb.utils::getAttribute(y, "dict"))
#>     }
#> }
#> <bytecode: 0x1bb6368>
#> <environment: namespace:kwb.pathdict>
kwb.pathdict:::lookup_in_dictionary
#> function (x, dict) 
#> {
#>     ready <- x %in% to_placeholder(names(dict))
#>     out <- x
#>     out[!ready] <- to_placeholder(names(dict[match(x[!ready], 
#>         dict)]))
#>     out
#> }
#> <bytecode: 0x1b18718>
#> <environment: namespace:kwb.pathdict>
kwb.pathdict:::print_path_frequencies
#> function (x, maxchar = 80) 
#> {
#>     x$path <- substr(x$path, 1, maxchar)
#>     print(x)
#> }
#> <bytecode: 0x50a9a98>
#> <environment: namespace:kwb.pathdict>
kwb.pathdict:::replace_subdirs
#> function (s, r, p) 
#> {
#>     selected <- starts_with_parts(s, r)
#>     cols <- seq(length(r) + 1, ncol(s))
#>     fillright <- matrix(nrow = sum(selected), ncol = length(r) - 
#>         1)
#>     s[selected, ] <- cbind(p, s[selected, cols, drop = FALSE], 
#>         fillright)
#>     maxcol <- max(which(apply(s, 2, function(x) sum(!is.na(x))) > 
#>         0))
#>     s[, seq_len(maxcol)]
#> }
#> <bytecode: 0x46018c0>
#> <environment: namespace:kwb.pathdict>
kwb.pathdict:::starts_with_parts
#> function (parts, elements) 
#> {
#>     stopifnot(is.list(parts) || is.matrix(parts))
#>     stopifnot(all(!is.na(elements)))
#>     length_out <- if (is.list(parts)) 
#>         length(parts)
#>     else nrow(parts)
#>     selected_at_level <- lapply(seq_along(elements), function(i) {
#>         if (is.list(parts)) {
#>             sapply(parts, "[", i) == elements[i]
#>         }
#>         else {
#>             !is.na(parts[, i]) & (parts[, i] == elements[i])
#>         }
#>     })
#>     Reduce(`&`, selected_at_level, init = rep(TRUE, length_out))
#> }
#> <bytecode: 0x4394ba0>
#> <environment: namespace:kwb.pathdict>
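
As a starting point for a description, the following sketch applies starts_with_parts() to a small subdirectory matrix. The function body is copied from the listing above so that the example runs without the package; the example matrix is made up for illustration.

```r
# Copy of the function body listed above
starts_with_parts <- function(parts, elements) {
  stopifnot(is.list(parts) || is.matrix(parts))
  stopifnot(all(!is.na(elements)))
  length_out <- if (is.list(parts)) length(parts) else nrow(parts)
  selected_at_level <- lapply(seq_along(elements), function(i) {
    if (is.list(parts)) {
      sapply(parts, "[", i) == elements[i]
    } else {
      !is.na(parts[, i]) & (parts[, i] == elements[i])
    }
  })
  Reduce(`&`, selected_at_level, init = rep(TRUE, length_out))
}

# One row per path, one column per folder level (NA = level not present)
subdirs <- matrix(
  c(
    "liquid", "oxygen", "sharp",
    "liquid", "oxygen", NA,
    "liquid", "consider", "child.png"
  ),
  ncol = 3, byrow = TRUE
)

# Which rows start with the segments "liquid", "oxygen"?
selected <- starts_with_parts(subdirs, c("liquid", "oxygen"))

selected
# TRUE TRUE FALSE
```
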
kwb.pathdict:::update_frequency_data_length
#> function (frequency_data, winner, key) 
#> {
#>     get_column <- kwb.utils::selectColumns
#>     winner_length <- get_column(winner, "length")
#>     winner_path <- get_column(winner, "path")
#>     data_length <- get_column(frequency_data, "length")
#>     data_path <- get_column(frequency_data, "path")
#>     shortage <- winner_length - nchar(to_placeholder(key))
#>     matching <- (substr(data_path, 1, winner_length) == winner_path)
#>     frequency_data$length[matching] <- data_length[matching] - 
#>         shortage
#>     frequency_data
#> }
#> <bytecode: 0x517e1c8>
#> <environment: namespace:kwb.pathdict>
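
The effect of update_frequency_data_length() can be traced with the numbers from the get_dictionary_one_by_one() output above. The following is a simplified base R re-statement, not the package code: it replaces kwb.utils::selectColumns() with `$` and assumes that to_placeholder() wraps the key in angle brackets; the helper name update_length_sketch() is hypothetical.

```r
# Hypothetical base R re-statement of the listing above
update_length_sketch <- function(frequency_data, winner, key) {
  placeholder <- paste0("<", key, ">")  # assumed placeholder format

  # Number of characters saved by replacing the winner with its placeholder
  shortage <- winner$length - nchar(placeholder)

  # Paths that start with the winning path
  matching <- substr(frequency_data$path, 1, winner$length) == winner$path

  frequency_data$length[matching] <- frequency_data$length[matching] - shortage
  frequency_data
}

frequency_data <- data.frame(
  path = c("liquid/oxygen", "liquid/oxygen/sharp/experiment/continue.xls"),
  length = c(13L, 43L),
  stringsAsFactors = FALSE
)

winner <- data.frame(
  path = "liquid/oxygen/sharp",
  length = 19L,
  stringsAsFactors = FALSE
)

updated <- update_length_sketch(frequency_data, winner, key = "p1")

# Shortage is 19 - 4 = 15; only the second path starts with the winner
updated$length
# 13 28
```

This reproduces the change from length 43 to length 28 seen for liquid/oxygen/sharp/experiment/continue.xls after key p1 was assigned.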