In the following, we explain how to use a lama-dictionary (see Creating lama-dictionaries) to translate data frame variables, atomic vectors or factors. The main functions are:
* lama_translate() and lama_translate_(): Assign new labels to variable values and turn them into ordered factors (if to_factor = TRUE).
* lama_translate_all(): Apply lama_translate() to all columns of a data frame for which there is a corresponding translation.
* lama_to_factor() and lama_to_factor_(): Similar to lama_translate() and lama_translate_(), except that the variables already hold the right values (character or factor) and only need to be turned into factor variables with the factor levels given in the corresponding translations.
* lama_to_factor_all(): Apply lama_to_factor() to all columns of a data frame for which there is a corresponding translation.
Let df be a data frame with the following structure:
df <- data.frame(
  pupil_id = rep(1:4, each = 3),
  subject = rep(c("eng", "mat", "gym"), 4),
  level = factor(
    c("a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a"),
    levels = c("a", "b")
  ),
  result = c(1, 2, 2, NA, 2, NA, 1, 0, 1, 2, 3, NA),
  stringsAsFactors = FALSE
)
df
#>    pupil_id subject level result
#> 1         1     eng     a      1
#> 2         1     mat     a      2
#> 3         1     gym     a      2
#> 4         2     eng     b     NA
#> 5         2     mat     b      2
#> 6         2     gym     b     NA
#> 7         3     eng     b      1
#> 8         3     mat     b      0
#> 9         3     gym     b      1
#> 10        4     eng     a      2
#> 11        4     mat     a      3
#> 12        4     gym     a     NA
The column subject (character) contains the subject codes and the column level (factor) holds the level of the courses (basic and advanced) the pupils were tested in. The column result (numeric) contains the test results: 1 and 2 are positive, 3 and 4 are negative, NA means that the pupil missed the test and 0 means that something else went wrong.
We want to use the following lama-dictionary in order to translate the data frame variables:
library(labelmachine)
dict <- new_lama_dictionary(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics"),
  lev = c(b = "Basic", a = "Advanced"),
  result = c(
    "1" = "Good",
    "2" = "Passed",
    "3" = "Not passed",
    "4" = "Not passed",
    NA_ = "Missed",
    "0" = NA
  )
)
dict
#>
#> --- lama_dictionary ---
#> Variable 'sub':
#>           eng           mat           gym
#>     "English" "Mathematics"  "Gymnastics"
#>
#> Variable 'lev':
#>          b          a
#>    "Basic" "Advanced"
#>
#> Variable 'result':
#>            1            2            3            4          NA_            0
#>       "Good"     "Passed" "Not passed" "Not passed"     "Missed"           NA
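To make the role of the named vectors concrete, here is a minimal base R sketch of the lookup that the translation result describes. It only illustrates the idea and is not the package's implementation: the helper name translate_sketch is hypothetical, and the treatment of the key NA_ as the label for missing values is inferred from the dictionary above.

```r
# Hypothetical helper (not part of labelmachine): look up labels in a
# named character vector, using the key "NA_" for missing values.
translate_sketch <- function(x, translation) {
  keys <- ifelse(is.na(x), "NA_", as.character(x))
  unname(translation[keys])
}

result_translation <- c(
  "1" = "Good", "2" = "Passed", "3" = "Not passed", "4" = "Not passed",
  NA_ = "Missed", "0" = NA
)
result <- c(1, 2, 2, NA, 2, NA, 1, 0, 1, 2, 3, NA)

labels <- translate_sketch(result, result_translation)
labels[1:4]
#> [1] "Good"   "Passed" "Passed" "Missed"
```

Note that the value 0 is mapped to NA, while an original NA is mapped to the label "Missed".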
The function lama_translate() uses non-standard evaluation: we pass in expressions, which are parsed, so the quotes around column and translation names can be omitted:
df_new <- lama_translate(
  .data = df,
  dictionary = dict,
  subject_new = sub(subject),
  level = lev(level),
  result = result(result),
  keep_order = c(FALSE, TRUE, FALSE),
  to_factor = c(TRUE, TRUE, FALSE)
)
str(df_new)
#> 'data.frame': 12 obs. of 5 variables:
#> $ pupil_id : int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "eng" "mat" "gym" "eng" ...
#> $ level : Factor w/ 2 levels "Advanced","Basic": 1 1 1 2 2 2 2 2 2 1 ...
#> $ result : chr "Good" "Passed" "Passed" "Missed" ...
#> $ subject_new: Factor w/ 3 levels "English","Mathematics",..: 1 2 3 1 2 3 1 2 3 1 ...
The arguments .data and dictionary define which data frame should be translated and which lama-dictionary should be used for the translation. The argument keep_order defines for each given translation whether the original ordering of the variable should be kept (the ordering of the variable in the data frame df) or whether the ordering given in the translation should be used. The argument to_factor defines for each translation whether the resulting labeled variable should be a factor variable (to_factor = TRUE) or a plain character variable (to_factor = FALSE). Besides the arguments .data, dictionary, keep_order and to_factor, all other arguments are label assignments. The name of each argument (left-hand side of the assignment) defines the column name under which the labeled variable is stored. The right-hand side defines which column should be labeled (the name inside the brackets) and which translation should be used (the function name to the left of the brackets). Hence, the statement above does the following:
* subject_new = sub(subject): The column subject in the data frame df is translated using the translation sub and the resulting factor is stored under the column name subject_new. Since the first entry in keep_order is FALSE, the ordering given in the translation sub is used for the labels. Since the first entry in to_factor is TRUE, the resulting variable is a factor variable.
* level = lev(level): The column level in the data frame df is translated using the translation lev and then overwritten by the resulting factor. Since the second entry in keep_order is TRUE, the labeled variable has the same ordering as the original column. Since the second entry in to_factor is TRUE, the resulting variable is a factor variable.
* result = result(result): The column result in the data frame df is translated using the translation result and then overwritten by the resulting labeled variable. Since the third entry in keep_order is FALSE, the ordering given in the translation is used for the labels. Since the third entry in to_factor is FALSE, the resulting variable is a plain character variable.

There are several abbreviations that save some typing:
* result_new = result is the same as result_new = result(result).
* lev(level) is the same as level = lev(level).
* result is the same as result = result(result)
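The abbreviation rules can be checked on a small example. The following sketch uses a reduced stand-in data frame and dictionary (df_small and dict_small are illustrative names, not objects from this article) and compares the full-length assignments with their abbreviated forms. According to the rules above, both calls should give the same result.

```r
library(labelmachine)

# Small stand-in dictionary and data, mirroring the example above
dict_small <- new_lama_dictionary(
  lev = c(b = "Basic", a = "Advanced"),
  result = c("1" = "Good", "2" = "Passed", NA_ = "Missed")
)
df_small <- data.frame(
  level = factor(c("a", "b", "b"), levels = c("a", "b")),
  result = c(1, 2, NA)
)

# Full-length assignments
df_full <- lama_translate(
  df_small,
  dict_small,
  level = lev(level),
  result = result(result)
)

# Abbreviated forms of the same two assignments
df_short <- lama_translate(df_small, dict_small, lev(level), result)

# Compare both results
identical(df_full, df_short)
```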
The function lama_translate_() is the standard evaluation variant of lama_translate(): instead of expressions, we pass in character strings holding the names of the translations and columns we want to use:
df_new <- lama_translate_(
  .data = df,
  dictionary = dict,
  translation = c("sub", "lev", "result"),
  col = c("subject", "level", "result"),
  col_new = c("subject_new", "level", "result"),
  keep_order = c(FALSE, TRUE, FALSE),
  to_factor = c(TRUE, TRUE, FALSE)
)
str(df_new)
#> 'data.frame': 12 obs. of 5 variables:
#> $ pupil_id : int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "eng" "mat" "gym" "eng" ...
#> $ level : Factor w/ 2 levels "Advanced","Basic": 1 1 1 2 2 2 2 2 2 1 ...
#> $ result : chr "Good" "Passed" "Passed" "Missed" ...
#> $ subject_new: Factor w/ 3 levels "English","Mathematics",..: 1 2 3 1 2 3 1 2 3 1 ...
The arguments .data and dictionary define which data frame should be translated and which lama-dictionary should be used for the translation. The argument keep_order defines for each given translation whether the original ordering of the variable should be kept (the ordering of the variable in the data frame df) or whether the ordering given in the translation should be used. The result is the same as before, when we used lama_translate().
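The effect of keep_order on the column level can be reproduced with a few lines of base R. This is only a sketch of the behaviour visible in the str() output above for a factor column, not the package's own code, and the variable names are illustrative.

```r
lev <- c(b = "Basic", a = "Advanced")
level <- factor(c("a", "a", "b", "b"), levels = c("a", "b"))

# Replace each code by its label
labels <- unname(lev[as.character(level)])

# keep_order = TRUE: labels follow the original level order (a, b)
kept <- factor(labels, levels = unname(lev[levels(level)]))
levels(kept)
#> [1] "Advanced" "Basic"

# keep_order = FALSE: labels follow the order given in the translation (b, a)
by_translation <- factor(labels, levels = unname(lev))
levels(by_translation)
#> [1] "Basic"    "Advanced"
```

The labels themselves are identical in both cases; only the order of the factor levels differs.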
The function lama_translate_all() is an extension of lama_translate(), which tries to automatically translate as many columns in the data frame .data as possible. Therefore, the names of the columns which should be translated must match the names of the translations which should be used:
df_new <- lama_translate_all(
  .data = df,
  dictionary = dict,
  prefix = "new_",
  fn_colname = toupper,
  suffix = "_labeled",
  keep_order = TRUE
)
str(df_new)
#> 'data.frame': 12 obs. of 5 variables:
#> $ pupil_id : int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "eng" "mat" "gym" "eng" ...
#> $ level : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 1 ...
#> $ result : num 1 2 2 NA 2 NA 1 0 1 2 ...
#> $ new_RESULT_labeled: Factor w/ 4 levels "Good","Passed",..: 1 2 2 4 2 4 1 NA 1 2 ...
In the above example, only the column name result matches a translation name, so only this column is translated and stored under the column name new_RESULT_labeled. The name of each new column is derived from the old column name (e.g. result) by adding the strings given in the arguments prefix and suffix at the beginning and at the end of the column name. Before this string concatenation, the original column name can be transformed into another string by the string transformation function fn_colname. In our case, fn_colname is the function toupper, which turns the column name result into the upper-case string RESULT. Contrary to lama_translate(), the argument keep_order is just a single boolean flag. It defines whether the original ordering of all translated variables should be kept (keep_order = TRUE) or whether the ordering in the translation vector should be used. As with lama_translate(), it is possible to pass the argument to_factor = FALSE to lama_translate_all() so that all resulting labeled variables are stored as plain character vectors.
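The column matching and naming rule described above can be sketched in base R. The snippet below is an illustration of the rule only (it does not call the package); the vectors columns and translations simply list the column names of df and the translation names of dict.

```r
prefix <- "new_"
suffix <- "_labeled"
fn_colname <- toupper

columns <- c("pupil_id", "subject", "level", "result")
translations <- c("sub", "lev", "result")

# Only columns whose name equals a translation name are picked up
matched <- intersect(columns, translations)
matched
#> [1] "result"

# New column name: prefix + transformed column name + suffix
paste0(prefix, fn_colname(matched), suffix)
#> [1] "new_RESULT_labeled"
```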
So far, we have only translated variables in data frames, but lama_translate() and lama_translate_() can also be used to translate atomic vectors (character, logical, numeric) and factors.
Using lama_translate():
Using lama_translate_():
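A minimal sketch of both variants applied to a character vector follows. It assumes that, for vectors, the translation name is passed as the third argument (unquoted for lama_translate(), as a string for lama_translate_()); please check ?lama_translate for the exact signature. The names dict_vec and subject_codes are illustrative.

```r
library(labelmachine)

dict_vec <- new_lama_dictionary(
  sub = c(eng = "English", mat = "Mathematics", gym = "Gymnastics")
)
subject_codes <- c("eng", "mat", "gym", "mat")

# Non-standard evaluation: the translation name is passed unquoted
# (assumed calling convention for vectors)
labels_nse <- lama_translate(subject_codes, dict_vec, sub)

# Standard evaluation: the translation name is passed as a string
# (assumed calling convention for vectors)
labels_se <- lama_translate_(subject_codes, dict_vec, "sub")

# Both variants are expected to give the same labeled vector
identical(labels_nse, labels_se)
```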
Sometimes, you already have labeled variables (character or factor variables, maybe produced by lama_translate() with the argument to_factor = FALSE) and you want to turn them into factor variables with a desired ordering. In this case, the functions lama_to_factor(), lama_to_factor_() and lama_to_factor_all() are the right choice.
Let df_non_factor be a data frame holding the right labels, but no factor variables (created with lama_translate_all() using to_factor = FALSE):
dict_new <- lama_rename(dict, subject = sub, level = lev)
df_non_factor <- lama_translate_all(df, dict_new, to_factor = FALSE)
str(df_non_factor)
#> 'data.frame': 12 obs. of 4 variables:
#> $ pupil_id: int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "English" "Mathematics" "Gymnastics" "English" ...
#> $ level : chr "Advanced" "Advanced" "Advanced" "Basic" ...
#> $ result : chr "Good" "Passed" "Passed" "Missed" ...
Turning variables into factors with lama_to_factor():
df_factor <- lama_to_factor(
  .data = df_non_factor,
  dictionary = dict,
  subject_new = sub(subject),
  level = lev(level),
  result = result(result)
)
str(df_factor)
#> 'data.frame': 12 obs. of 5 variables:
#> $ pupil_id : int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "English" "Mathematics" "Gymnastics" "English" ...
#> $ level : Factor w/ 2 levels "Basic","Advanced": 2 2 2 1 1 1 1 1 1 2 ...
#> $ result : Factor w/ 4 levels "Good","Passed",..: 1 2 2 4 2 4 1 NA 1 2 ...
#> $ subject_new: Factor w/ 3 levels "English","Mathematics",..: 1 2 3 1 2 3 1 2 3 1 ...
The function lama_to_factor() allows the same abbreviations as lama_translate(). It can also be used on factor variables, and it has a keep_order argument like lama_translate(). Furthermore, the functions lama_to_factor() and lama_to_factor_() can both be applied to atomic vectors or plain factors, as in the case of lama_translate().
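What lama_to_factor() does to an already labeled column can be sketched in base R: the values stay as they are, and only the factor levels are taken from the translation. This is an illustration consistent with the str() output of level above, not the package's implementation.

```r
lev <- c(b = "Basic", a = "Advanced")
level_labels <- c("Advanced", "Advanced", "Basic", "Basic")

# The values are already labels; only the level order comes from the translation
level_factor <- factor(level_labels, levels = unique(unname(lev)))
levels(level_factor)
#> [1] "Basic"    "Advanced"
as.integer(level_factor)
#> [1] 2 2 1 1
```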
Turning variables in a data frame into factors with lama_to_factor_():
df_factor <- lama_to_factor_(
  .data = df_non_factor,
  dictionary = dict,
  translation = c("sub", "lev", "result"),
  col = c("subject", "level", "result")
)
str(df_factor)
#> 'data.frame': 12 obs. of 4 variables:
#> $ pupil_id: int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : Factor w/ 3 levels "English","Mathematics",..: 1 2 3 1 2 3 1 2 3 1 ...
#> $ level : Factor w/ 2 levels "Basic","Advanced": 2 2 2 1 1 1 1 1 1 2 ...
#> $ result : Factor w/ 4 levels "Good","Passed",..: 1 2 2 4 2 4 1 NA 1 2 ...
Since the argument col_new was omitted, the columns subject, level and result were overwritten.
Turning all possible variables in a data frame into factors with lama_to_factor_all():
df_factor <- lama_to_factor_all(
  .data = df_non_factor,
  dictionary = dict
)
str(df_factor)
#> 'data.frame': 12 obs. of 4 variables:
#> $ pupil_id: int 1 1 1 2 2 2 3 3 3 4 ...
#> $ subject : chr "English" "Mathematics" "Gymnastics" "English" ...
#> $ level : chr "Advanced" "Advanced" "Advanced" "Basic" ...
#> $ result : Factor w/ 4 levels "Good","Passed",..: 1 2 2 4 2 4 1 NA 1 2 ...
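Since the arguments prefix, suffix and fn_colname were omitted, the converted columns keep their names and are overwritten. Here only result was turned into a factor, because dict (unlike dict_new) contains no translations named subject or level.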