Strings and Data Frames

by Karl-Kuno Kunze

This post addresses a common recommendation when it comes to adding lines to data frames.

Adding lines to data frames might not be straightforward

Let us create a very small data frame with two cities and their inhabitants (in millions):

myDF <- data.frame(City = c("Berlin","London"), 
              Inhabitants  = c(3.375, 8.308)); myDF
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308

Now, add one line with the function rbind():

myDF <- rbind(myDF, 
              c(City = "Paris", 
         Inhabitants = 2.244)); myDF
## Warning in `[<-.factor`(`*tmp*`, ri, value = "Paris"): invalid factor
## level, NA generated
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308
## 3   <NA>       2.244

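Here, a default feature of R comes to light: strings are converted to factors when they are put into a data frame. (This was the default before R 4.0.0; from R 4.0.0 on, the default is stringsAsFactors = FALSE, and the warning above only appears if factors are requested explicitly.) A factor is a categorical variable with a fixed set of possible values, its levels, like eye color. R does not let you assign a string to a factor variable unless that string is already one of its levels, so the new city becomes NA.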
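To make the role of the levels concrete, here is a minimal sketch outside of any data frame. The factor is built explicitly with factor() so that the behavior does not depend on the R version; the names city, city_bad and city_ok are only illustrative.

```r
# Build the factor explicitly so the sketch behaves the same in any R version.
city <- factor(c("Berlin", "London"))
levels(city)                      # the only values this factor accepts

# Assigning a value that is not a level triggers the warning and stores NA.
city_bad <- city
city_bad[3] <- "Paris"            # warning: invalid factor level, NA generated
is.na(city_bad[3])

# Registering the level first makes the same assignment legal.
city_ok <- city
levels(city_ok) <- c(levels(city_ok), "Paris")
city_ok[3] <- "Paris"
as.character(city_ok)
```

Extending the levels by hand works, but it is tedious for more than a few values, which is why the rest of this post looks for a more convenient way.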

Common advice:

Often, you may hear the advice to suppress this default behavior of R through the option stringsAsFactors = FALSE when creating the data frame, as seen below.

myDF <- data.frame(City = c("Berlin","London"), 
              Inhabitants  = c(3.375, 8.308),
              stringsAsFactors = FALSE); myDF
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308

Now, R does not convert strings to factors.

If we now try the above addition of a line, we find the following:

myDF <- rbind(myDF, 
              c(City = "Paris", 
         Inhabitants = 2.244)); myDF
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308
## 3  Paris       2.244

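The line is added without a warning. Note, however, that c() coerces its arguments to a single type, so "Paris" and 2.244 both arrive as strings, and the column Inhabitants silently becomes a character vector as well.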

str(myDF$City)
##  chr [1:3] "Berlin" "London" "Paris"
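It is worth checking the other column as well. The sketch below, which uses the illustrative names cities, new_row, coerced and typed, suggests why: c() can only hold one type, so the number is turned into a string before rbind() ever sees it. Passing the new line as a one-row data frame is one way to keep the types apart.

```r
cities <- data.frame(City = c("Berlin", "London"),
                     Inhabitants = c(3.375, 8.308),
                     stringsAsFactors = FALSE)

# c() coerces mixed input to a single type, here character.
new_row <- c(City = "Paris", Inhabitants = 2.244)
class(new_row)

# Binding the character vector turns the numeric column into character.
coerced <- rbind(cities, new_row)
class(coerced$Inhabitants)

# A one-row data frame keeps each column's own type.
typed <- rbind(cities,
               data.frame(City = "Paris", Inhabitants = 2.244,
                          stringsAsFactors = FALSE))
class(typed$Inhabitants)
```

With the character column, a call such as mean(coerced$Inhabitants) would no longer give a number, so the printed data frame looks right while the data are not.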

The more data-scientific approach

The way we saw above avoids the warning, but it deprives us of a structure that may be in the data. In addition, some functions in R rely on factors. Therefore, we might try again with factors, i.e. without the option stringsAsFactors = FALSE (in R 4.0.0 and later, pass stringsAsFactors = TRUE instead to reproduce the output below).

myDF <- data.frame(City = c("Berlin","London"), 
              Inhabitants  = c(3.375, 8.308)); myDF
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308

If we rewrite the code for adding a line like this, i.e. replace c() by data.frame(), everything runs smoothly again and we preserve the factor:

myDF <- rbind(myDF, 
              data.frame(City = "Paris", 
                  Inhabitants = 2.244)); myDF
##     City Inhabitants
## 1 Berlin       3.375
## 2 London       8.308
## 3  Paris       2.244
str(myDF$City)
##  Factor w/ 3 levels "Berlin","London",..: 1 2 3
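As a version-independent check, the following sketch repeats the last step with stringsAsFactors = TRUE spelled out, so that it should behave the same before and after R 4.0.0. The names cities and paris are only illustrative.

```r
# Request factors explicitly instead of relying on the version-dependent default.
cities <- data.frame(City = c("Berlin", "London"),
                     Inhabitants = c(3.375, 8.308),
                     stringsAsFactors = TRUE)
paris <- data.frame(City = "Paris", Inhabitants = 2.244,
                    stringsAsFactors = TRUE)

# rbind() on two data frames merges the factor levels of matching columns.
cities <- rbind(cities, paris)

levels(cities$City)               # levels now include the new city
nlevels(cities$City)
is.numeric(cities$Inhabitants)    # the numeric column keeps its type
```

Because both arguments are data frames, each column is combined with its counterpart of the same type, which is what keeps City a factor and Inhabitants numeric.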