Understanding Encoding
Hauke Sonnenberg
2022-06-09
Source: vignettes/understanding_encoding.Rmd
Read this first:
Escaping from character encoding hell in R on Windows
Introduction
In our document on best practices in research data management we recommend sticking to a very basic set of characters when naming files and folders.
You may ask, why? You may never have had any problems when working with files containing spaces or special characters. If this is the case for you, you are lucky. You then most probably
do not exchange files between computers with different operating systems (e.g. Windows vs. Linux) and/or with different regional settings (e.g. German vs. English vs. French vs. Bulgarian),
do not automate tasks by programming.
Example Sessions
What does R tell us about Encoding?
?Encoding
We learn that R offers a function Encoding() as well as the functions enc2native() and enc2utf8(). How do these functions work?
Let’s have a look at the Examples given in the R documentation:
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)
#> [1] "unknown"
Encoding(x) <- "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"
### Marking the string as "bytes" makes R print its non-ASCII bytes as hex escapes
Encoding(xx) <- "bytes"
xx # will be encoded in hex
#> [1] "fa\\xc3\\xa7ile"
cat("xx = ", xx, "\n", sep = "")
#> xx = fa\xc3\xa7ile
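The documentation example above does not call enc2utf8() or enc2native() directly. The following is a minimal sketch of what they do, assuming the same Latin-1 string as above; the variable names x_utf8 and x_native are only illustrative.

```r
# A string whose bytes are Latin-1, declared as such
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# enc2utf8() re-encodes to UTF-8, taking the declared encoding into account
x_utf8 <- enc2utf8(x)

Encoding(x_utf8)   # "UTF-8"
charToRaw(x)       # 66 61 e7 69 6c 65
charToRaw(x_utf8)  # 66 61 c3 a7 69 6c 65

# enc2native() converts to the encoding of the current locale,
# so its result depends on the system it runs on
x_native <- enc2native(x_utf8)
```

The single Latin-1 byte e7 becomes the two UTF-8 bytes c3 a7, while the ASCII letters keep their one-byte codes.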
Writing to and Reading from Files
Let's write a line containing German special characters to a text file. We use the function writeText() from our kwb.utils package. This function does nothing more than call writeLines(), but it additionally prints a message and returns the path to the file:
text <- "Schöne Grüße"
test_file <- kwb.utils::writeText(text, tempfile(fileext = ".txt"))
#> Writing '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmp0Fakga/file4011477a2af9.txt' ... ok.
And now read the line back with readLines():
readLines(test_file)
#> [1] "Schöne Grüße"
Ok, no problem so far, because we used the same system to write and read the file.
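What could go differently when the file comes from another system? The following is a hedged sketch that simulates a file written in Latin-1 and reads it back with an explicitly declared encoding. The file name latin1_file and the simulation via iconv() and writeBin() are assumptions made for illustration, not part of the original session.

```r
text <- "Sch\u00f6ne Gr\u00fc\u00dfe"  # UTF-8 string, written with Unicode escapes

# Simulate a file written on a Latin-1 system: one byte per special character
latin1_bytes <- charToRaw(iconv(text, from = "UTF-8", to = "latin1"))
latin1_file <- tempfile(fileext = ".txt")
writeBin(c(latin1_bytes, as.raw(0x0a)), latin1_file)

# Declare the encoding when reading: readLines() then marks the strings
# as Latin-1 instead of assuming the native encoding of the session
text_read <- readLines(latin1_file, encoding = "latin1")
Encoding(text_read)

# After conversion to UTF-8 the bytes match those of the original string
identical(charToRaw(enc2utf8(text_read)), charToRaw(text))
```

Without the encoding argument, a session running in a UTF-8 locale would treat the Latin-1 bytes as if they were UTF-8, which they are not.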
Let’s have a look at the file byte-by-byte:
# Open the file in binary mode ("rb"): readChar() is meant for binary connections
con <- file(test_file, "rb")
sapply(seq_len(nchar(text, "bytes")), function(i) {
  readChar(con, 1, useBytes = TRUE)
})
#> [1] "S" "c" "h" "\xc3" "\xb6" "n" "e" " " "G" "r"
#> [11] "\xc3" "\xbc" "\xc3" "\x9f" "e"
close(con)
Wow, that is interesting. I did not know that! The file is encoded in UTF-8: plain ASCII characters are stored as one byte each, and only the special characters (here ö, ü and ß) are stored as two bytes each!
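The difference between characters and bytes can also be checked without touching a file. This is a small sketch using nchar() with its type argument; the variable names chars and bytes_per_char are only illustrative.

```r
text <- "Sch\u00f6ne Gr\u00fc\u00dfe"  # UTF-8 string, written with Unicode escapes

nchar(text, type = "chars")  # 12 characters
nchar(text, type = "bytes")  # 15 bytes

# Number of bytes that each single character occupies in UTF-8
chars <- strsplit(text, split = "")[[1]]
bytes_per_char <- vapply(chars, nchar, integer(1), type = "bytes")
bytes_per_char
```

The three special characters account for the three extra bytes.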
Now read the file again, this time as a vector of raw bytes:
(raw_bytes <- readBin(test_file, "raw", 100))
#> [1] 53 63 68 c3 b6 6e 65 20 47 72 c3 bc c3 9f 65 0a
Conversion Between Numeric Codes and Characters
What characters do the numeric codes given as hexadecimal numbers in raw_bytes represent?
Let’s first see what the hexadecimal codes are in the decimal system:
as.integer(raw_bytes)
#> [1] 83 99 104 195 182 110 101 32 71 114 195 188 195 159 101 10
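The conversion between the two number systems can be made explicit for single codes. A short sketch, using a few of the codes from above as example values:

```r
# Hexadecimal strings to decimal integers
hex_codes <- c("53", "c3", "b6", "0a")
decimal_codes <- strtoi(hex_codes, base = 16L)
decimal_codes  # 83 195 182 10

# Decimal integers back to raw bytes (printed in hexadecimal)
as.raw(decimal_codes)  # 53 c3 b6 0a
```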
And now let’s convert the codes to characters:
rawToChar(raw_bytes)
#> [1] "Schöne Grüße\n"
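The function rawToChar() seems to know how to interpret the sequence of byte codes. Strictly speaking, it only copies the bytes into a string; R then displays them according to the native encoding of the session, which is UTF-8 here.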
Note that the last character (10 in the decimal and 0a in the hexadecimal system) represents the newline character \n.
What happens if I ask rawToChar() to convert the first and second byte representing the ö character separately?
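The code for this step is not shown here. A minimal sketch of what it could look like follows, assuming the two bytes c3 and b6 from raw_bytes[4:5]; the names o_umlaut_bytes, first_byte and second_byte are hypothetical.

```r
# The two bytes that together encode the letter o-umlaut in UTF-8
o_umlaut_bytes <- as.raw(c(0xc3, 0xb6))

# Convert each byte on its own
first_byte <- rawToChar(o_umlaut_bytes[1])
second_byte <- rawToChar(o_umlaut_bytes[2])

# Neither single byte is a valid UTF-8 string ...
validUTF8(c(first_byte, second_byte))  # FALSE FALSE

# ... but the two bytes together are
validUTF8(rawToChar(o_umlaut_bytes))   # TRUE
```

A single byte of a two-byte sequence cannot be displayed as a character, which is why R falls back to showing it as a hex escape.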
And now together:
rawToChar(raw_bytes[4:5])
#> [1] "ö"
There is an argument multiple to rawToChar(). It is FALSE by default. What happens if we set it to TRUE?
rawToChar(raw_bytes[4:5], multiple = TRUE)
#> [1] "\xc3" "\xb6"
As the documentation says, setting multiple to TRUE returns the single characters instead of a single string.
How does this look if we convert the whole string?
(characters_1 <- rawToChar(raw_bytes, multiple = TRUE))
#> [1] "S" "c" "h" "\xc3" "\xb6" "n" "e" " " "G" "r"
#> [11] "\xc3" "\xbc" "\xc3" "\x9f" "e" "\n"
Is this the same as splitting the original string into single characters?
strsplit(text, split = "")[[1]]
#> [1] "S" "c" "h" "ö" "n" "e" " " "G" "r" "ü" "ß" "e"
No, the German special characters are shown here as one character each instead of two. But we can achieve the same by setting the argument useBytes to TRUE in strsplit().
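The call with useBytes set to TRUE is not shown above. A hedged sketch of it, with a check that the pieces correspond to the bytes of the string (bytes_split and bytes_joined are illustrative names):

```r
text <- "Sch\u00f6ne Gr\u00fc\u00dfe"  # UTF-8 string, written with Unicode escapes

# Split byte-wise instead of character-wise
bytes_split <- strsplit(text, split = "", useBytes = TRUE)[[1]]
length(bytes_split)  # 15, one element per byte

# Each element holds exactly one byte of the original string
bytes_joined <- unlist(lapply(bytes_split, charToRaw))
identical(bytes_joined, charToRaw(text))
```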
Replace Special Characters with ASCII Characters
raw_bytes
#> [1] 53 63 68 c3 b6 6e 65 20 47 72 c3 bc c3 9f 65 0a
rawToChar(raw_bytes)
#> [1] "Schöne Grüße\n"
gsub("\xc3\xb6", "oe", rawToChar(raw_bytes))
#> [1] "Schoene Grüße\n"
gsub("\xc3\xb6", "oe", rawToChar(raw_bytes), useBytes = TRUE)
#> [1] "Schoene Grüße\n"
Ok, it seems that we can replace special characters if we know their byte codes (here: c3 and b6 for letter ö).
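The same idea can be extended to several characters at once. The following is only a sketch under stated assumptions: the function to_ascii() and the replacement table are hypothetical, cover only the lowercase German special characters, and are not part of kwb.utils or kwb.fakin.

```r
# Lowercase German special characters (as Unicode escapes) and ASCII substitutes
special <- c("\u00e4", "\u00f6", "\u00fc", "\u00df")
ascii <- c("ae", "oe", "ue", "ss")

# Replace each special character in turn (hypothetical helper)
to_ascii <- function(x) {
  for (i in seq_along(special)) {
    x <- gsub(special[i], ascii[i], x, fixed = TRUE)
  }
  x
}

to_ascii("Sch\u00f6ne Gr\u00fc\u00dfe")  # "Schoene Gruesse"
```

Writing the special characters as \u escapes keeps the script itself free of non-ASCII characters, so it does not depend on the encoding in which the script file is saved.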
It will be helpful to have a function that shows the special characters and the corresponding byte codes:
### This function needs to be checked!
kwb.fakin::get_special_character_info(text)
#>   special bytes          context
#> 1    \xc3    c3 Sch [ ö ] ne Grü
#> 2    \xb6    b6  öne Gr [ ü ] ße
#> 3    \xc3    c3   ne Grü [ ß ] e
#> 4    \xbc    bc Sch [ ö ] ne Grü
#> 5    \xc3    c3  öne Gr [ ü ] ße
#> 6    \x9f    9f   ne Grü [ ß ] e
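For comparison, here is a minimal sketch of such a helper in base R. It is not the implementation of kwb.fakin::get_special_character_info(); the function name show_special_bytes() is hypothetical, and "special" is assumed to mean any character that occupies more than one byte in UTF-8.

```r
# List each multi-byte character of a string together with its UTF-8 bytes
show_special_bytes <- function(x) {
  chars <- strsplit(enc2utf8(x), split = "")[[1]]
  is_special <- nchar(chars, type = "bytes") > 1L
  data.frame(
    character = chars[is_special],
    bytes = vapply(
      chars[is_special],
      function(ch) paste(as.character(charToRaw(ch)), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

info <- show_special_bytes("Sch\u00f6ne Gr\u00fc\u00dfe")
info
```

This version reports one row per special character, with both bytes of the character in the same row.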