Preparing Data with the stringr Package

by Karl-Kuno Kunze

At the beginning of most data science projects, data must be prepared for the task at hand. Very often only parts of the entries in a column are needed, or entries must be reformatted, especially dates and times. For this task you need good tools to manipulate text.

Even base R comes packed with many functions for that purpose. However, their syntax is partly inconsistent, they lag somewhat behind the abilities of languages like Python, or they require rather complex code. The package stringr by Hadley Wickham comes in handy to fill the gap.

Let me walk you through some simple examples that use regular expressions. You can find a nice tool to learn and play around with them right here. You may also consult the Wiki.

Preparation

First, we load the package:

# load package
library(stringr)

This is our test string (German for: kick-off of the European soccer championship, June 10 at 21.00 CET):

# string
myText <- "Anpfiff EM: 10. Juni um 21.00 Uhr MEZ."
myText
## [1] "Anpfiff EM: 10. Juni um 21.00 Uhr MEZ."

Find and extract strings

Does the string contain a certain pattern?

str_detect(myText, "MEZ")
## [1] TRUE
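
str_detect() is vectorised over its first argument, so the same call can be used to filter a whole column of entries. A minimal sketch, assuming a small made-up vector `kickoffs` that is not part of the example above:

```r
library(stringr)

# hypothetical entries, made up for illustration
kickoffs <- c("10. Juni um 21.00 Uhr MEZ",
              "11. Juni um 15.00 Uhr",
              "12. Juni um 18.00 Uhr MEZ")

# one logical value per element: TRUE, FALSE, TRUE
hasMEZ <- str_detect(kickoffs, "MEZ")
hasMEZ

# keep only the entries that contain the pattern
withMEZ <- kickoffs[hasMEZ]
withMEZ
```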

You may also extract values that match a pattern. In this example we would like to extract the time. For our purposes, a time consists of two pairs of digits, separated by a period and followed by ‘ Uhr’ (German for o’clock):

# Extract by pattern
str_extract(myText, "[0-9]{2}\\.[0-9]{2} Uhr")
## [1] "21.00 Uhr"
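
A note on the period: in a regular expression an unescaped `.` matches any single character, so a literal period has to be written as `\\.` inside an R string. The following sketch shows the difference; the second string with a colon is made up for illustration:

```r
library(stringr)

myText <- "Anpfiff EM: 10. Juni um 21.00 Uhr MEZ."
colonText <- "Beginn 21:00 Uhr"   # hypothetical string, not from the example above

# unescaped ".": any character between the digit pairs is accepted
loose <- str_extract(colonText, "[0-9]{2}.[0-9]{2} Uhr")
loose                              # "21:00 Uhr"

# escaped "\\.": only a literal period is accepted
strictPattern <- "[0-9]{2}\\.[0-9]{2} Uhr"
strictColon <- str_extract(colonText, strictPattern)
strictColon                        # NA, because there is no period
strictPeriod <- str_extract(myText, strictPattern)
strictPeriod                       # "21.00 Uhr"
```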

Replace strings

You may also replace strings. Simple things first: replace the name of the month by its number.

# Replace
str_replace(myText, " Juni","06.")
## [1] "Anpfiff EM: 10.06. um 21.00 Uhr MEZ."

Be careful when doing multiple replacements. For example, here:

# Careful:
str_replace(myText, c(" Juni", " MEZ"), c("06.", " Mittel-Europäische Zeit"))
## [1] "Anpfiff EM: 10.06. um 21.00 Uhr MEZ."                      
## [2] "Anpfiff EM: 10. Juni um 21.00 Uhr Mittel-Europäische Zeit."

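Both rules have been applied, but each one to a different copy of the string. This happens because str_replace() is vectorised over pattern and replacement, so it returns one result per pattern.

If we prefer to apply both replacements to the same string, it is better to write: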

# Both replacements applied to one string:
str_replace_all(myText, c(" Juni" = "06.", " MEZ" = " Mittel-Europäische Zeit"))
## [1] "Anpfiff EM: 10.06. um 21.00 Uhr Mittel-Europäische Zeit."

The replacements are passed as a named vector, with the patterns as names and the replacements as values. An extremely powerful concept, as you can see.
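
There is a second difference between the two functions that is easy to overlook: str_replace() changes only the first match in each string, whereas str_replace_all() changes every match. A small sketch, assuming a made-up string `dates` with three occurrences of the pattern:

```r
library(stringr)

# hypothetical string with three dates, made up for illustration
dates <- "10.06., 11.06. und 12.06."

# only the first "06." is replaced
firstOnly <- str_replace(dates, "06\\.", " Juni")
firstOnly    # "10. Juni, 11.06. und 12.06."

# every "06." is replaced
allMatches <- str_replace_all(dates, "06\\.", " Juni")
allMatches   # "10. Juni, 11. Juni und 12. Juni"
```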

Functions for stringr: 101

For your first steps with stringr you may want to check out the following functions in addition to the ones above:

  • str_length()
  • str_locate()
  • str_match()
  • str_split()
  • str_sub()
  • str_trim()
  • str_wrap()
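
To give a first impression, here is a short sketch that applies the functions listed above to the test string. The patterns, positions and the padded string are chosen for illustration only:

```r
library(stringr)

myText <- "Anpfiff EM: 10. Juni um 21.00 Uhr MEZ."

# number of characters
len <- str_length(myText)                       # 38

# start and end position of the first match (a matrix with columns start, end)
loc <- str_locate(myText, "Juni")               # start 17, end 20

# substring by position
month <- str_sub(myText, loc[1, "start"], loc[1, "end"])   # "Juni"

# full match plus one column per capture group: "21.00 Uhr", "21", "00"
timeParts <- str_match(myText, "([0-9]{2})\\.([0-9]{2}) Uhr")

# split at blanks; the result is a list with one character vector per input string
words <- str_split(myText, " ")[[1]]

# remove leading and trailing whitespace
trimmed <- str_trim("   21.00 Uhr  ")           # "21.00 Uhr"

# insert line breaks so that lines are at most about 20 characters wide
cat(str_wrap(myText, width = 20), "\n")
```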

Summary

With the package stringr you have a well-stocked toolbox that turns cumbersome operations on data into a pleasure. Have fun!


Cover image by Rainer Sturm / pixelio.de.