Tidy up strings with the R package metan
Getting started
In this quick tip, I show you how to tidy character strings with the package metan. If the package is not yet installed, install the released version from CRAN with:
install.packages("metan")
For the latest release notes, see the package's NEWS file.
Then, load it with:
library(metan)
metan's function tidy_strings() can be used to tidy up character strings by putting all words in upper case, replacing any space, tabulation, or punctuation character with _ (underscore), and putting _ between lower and upper case (and, as the examples below show, between letters and digits).
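To make these rules concrete, here is a rough base R sketch of the same idea. It is not metan's implementation: tidy_sketch() is a hypothetical helper, and the letter-to-digit rule is an assumption inferred from the examples in this post rather than from the package source.

```r
# Hypothetical helper, NOT part of metan: approximates the rules described above
tidy_sketch <- function(x, sep = "_") {
  # put the separator between a lower-case and an upper-case letter ("EnvGen")
  x <- gsub("([[:lower:]])([[:upper:]])", paste0("\\1", sep, "\\2"), x)
  # assumption: also put the separator between a letter and a digit ("Env1")
  x <- gsub("([[:alpha:]])([[:digit:]])", paste0("\\1", sep, "\\2"), x)
  # replace any run of spaces, tabs, or punctuation with the separator
  x <- gsub("[[:space:][:punct:]]+", sep, x)
  # finally, put everything in upper case
  toupper(x)
}

tidy_sketch(c("Env1", "env 1", "env.1"))
# [1] "ENV_1" "ENV_1" "ENV_1"
```

For real work, prefer tidy_strings() itself; the sketch only illustrates the order in which such rules can be applied.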
A simple example
Suppose that we have a character vector, say, str = c("Env1", "env 1", "env.1"). By definition, str should represent a single level in plant breeding trials, e.g., environment 1, but in fact it has three levels.
str <- c("Env1", "env 1", "env.1")
levels(factor(str))
# [1] "env 1" "env.1" "Env1"
Bad idea!
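To see why this hurts in practice, consider a minimal sketch with made-up yields (the data frame trial and its values are hypothetical). Any summary by environment is split across the three spurious levels:

```r
# Hypothetical yields for three plots that all belong to environment 1
trial <- data.frame(
  env   = c("Env1", "env 1", "env.1"),
  yield = c(300, 310, 320),
  stringsAsFactors = FALSE
)

# One mean per distinct label: three groups instead of one
means_raw <- tapply(trial$yield, trial$env, mean)
length(means_raw)
# [1] 3
```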
We can use tidy_strings() to tidy up this string as follows:
tidy_strings(str)
# [1] "ENV_1" "ENV_1" "ENV_1"
Great! We now have the single level we should have had from the start.
More examples
All of the following will be translated into "ENV_1".
messy_env <- c("ENV 1", "Env 1", "Env1", "env1", "Env.1", "Env_1")
tidy_strings(messy_env)
# [1] "ENV_1" "ENV_1" "ENV_1" "ENV_1" "ENV_1" "ENV_1"
All of the following will be translated into "GEN_*".
messy_gen <- c("GEN1", "gen 2", "Gen.3", "gen-4", "Gen_5", "GEN_6")
tidy_strings(messy_gen)
# [1] "GEN_1" "GEN_2" "GEN_3" "GEN_4" "GEN_5" "GEN_6"
All of the following will be translated into "ENV_GEN".
messy_int <- c("EnvGen", "Env_Gen", "env gen", "Env Gen", "ENV.GEN", "ENV_GEN")
tidy_strings(messy_int)
# [1] "ENV_GEN" "ENV_GEN" "ENV_GEN" "ENV_GEN" "ENV_GEN" "ENV_GEN"
Tidy up a whole data frame
We can also tidy up the strings of a whole data frame. By default, the separator character is _. To change this default, use the argument sep.
library(tibble)
df <- tibble(Env = messy_env,
             gen = messy_gen,
             Env_Gen = interaction(Env, gen),
             y = rnorm(6, 300, 10))
df
# # A tibble: 6 x 4
#   Env   gen   Env_Gen         y
#   <chr> <chr> <fct>       <dbl>
# 1 ENV 1 GEN1  ENV 1.GEN1   304.
# 2 Env 1 gen 2 Env 1.gen 2  308.
# 3 Env1  Gen.3 Env1.Gen.3   301.
# 4 env1  gen-4 env1.gen-4   295.
# 5 Env.1 Gen_5 Env.1.Gen_5  294.
# 6 Env_1 GEN_6 Env_1.GEN_6  303.
tidy_strings(df, sep = "")
# # A tibble: 6 x 4
#   Env   gen   Env_Gen      y
#   <chr> <chr> <chr>    <dbl>
# 1 ENV1  GEN1  ENV1GEN1  304.
# 2 ENV1  GEN2  ENV1GEN2  308.
# 3 ENV1  GEN3  ENV1GEN3  301.
# 4 ENV1  GEN4  ENV1GEN4  295.
# 5 ENV1  GEN5  ENV1GEN5  294.
# 6 ENV1  GEN6  ENV1GEN6  303.
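For intuition, the whole-data-frame behaviour can be approximated in base R by applying a cleaning function to every character column. This is only a sketch under assumptions: clean_label() and df_sketch are hypothetical names, the cleaning rules are a rough approximation rather than metan's code, and unlike the output above it leaves factor columns untouched.

```r
# Hypothetical helper, NOT metan's code: rough approximation of the tidy rules
clean_label <- function(x, sep = "_") {
  x <- gsub("([[:lower:]])([[:upper:]])", paste0("\\1", sep, "\\2"), x)
  x <- gsub("([[:alpha:]])([[:digit:]])", paste0("\\1", sep, "\\2"), x)
  toupper(gsub("[[:space:][:punct:]]+", sep, x))
}

# Small hypothetical data frame with two messy label columns
df_sketch <- data.frame(
  Env = c("ENV 1", "env1"),
  gen = c("gen 2", "Gen.3"),
  y   = c(304, 308),
  stringsAsFactors = FALSE
)

# Find the character columns and clean only those; numeric columns are kept as-is
is_text <- vapply(df_sketch, is.character, logical(1))
df_sketch[is_text] <- lapply(df_sketch[is_text], clean_label, sep = "")

df_sketch$Env
# [1] "ENV1" "ENV1"
df_sketch$gen
# [1] "GEN2" "GEN3"
```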
To tidy up only selected variables, simply pass their names. Here, only Env is tidied, and all column names are also put in upper case:
tidy_strings(df, Env) %>%
  colnames_to_upper()
# # A tibble: 6 x 4
#   ENV   GEN   ENV_GEN         Y
#   <chr> <chr> <fct>       <dbl>
# 1 ENV_1 GEN1  ENV 1.GEN1   304.
# 2 ENV_1 gen 2 Env 1.gen 2  308.
# 3 ENV_1 Gen.3 Env1.Gen.3   301.
# 4 ENV_1 gen-4 env1.gen-4   295.
# 5 ENV_1 Gen_5 Env.1.Gen_5  294.
# 6 ENV_1 GEN_6 Env_1.GEN_6  303.