Select Columns

This lesson introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for a particular analysis.

Summary

In this lesson, you'll learn:

How we can select or remove columns with the select function.
What the function starts_with, ends_with, contains, and matches can do to help to pick columns by name patterns.
How regular expressions make it easy to create complex patterns.
That we can select columns by their data type or index in the tibble.
How we can use lists to select sets of columns that we require on a regular basis.
How we can make a selection by excluding columns.

You can find all the code examples from this lesson in a single script in the WordLens GitHub repository.

The `select` command

The function select() is the designated tool to select columns with dplyr. By passing different things to the function, we can efficiently define the set of columns in the resulting data frame.

By Name

The intuitive way to select the columns we need is by listing their names. We can pass one or more column names to the select function. In the case of two or more, we use commas to separate the names:

tweets |> 
  select(screen_name, text) |>
  colnames()
  
# [1] "screen_name" "text"

By Column Name Pattern

Names Starting with a String

Occasionally, R we want to select columns based on a pattern of their names. Take the tweets data set as an example. Here, the variables that indicate if a tweet is a quote or a retweet have the prefix is_. We leverage this with the helper function starts_with:

tweets |> 
  select(starts_with("is_")) |>
  colnames()

# [1] "screen_name" "text"

Names Ending with a String

Similar to start_with, the function ends_with looks for a string at the end of a column name. For example, all columns that contain numerical information about a tweet, such as the number of retweets, end with the suffix _count. We can take advantage of that in case we wish to select all these columns efficiently:

tweets |> 
  select(ends_with("_count")) |>
  colnames()
  
# [1] "retweet_count"  "favorite_count" "quote_count"    "reply_count"

Names with a String Anywhere

To complete the picture, we can also search for a string somewhere in a column name. The contains function does exactly that:

tweets |> 
  select(contains("screen")) |>
  colnames()

# [1] "screen_name"             "in_reply_to_screen_name"

More Complex Matches

Sometimes, merely matching strings in column names may not suffice. You may need to deal with more complicated patterns, such as those involving wildcards or requiring symbols to appear in a specific order in a column name. Regular expressions can be an excellent solution for such complex scenarios, though they can be difficult to grasp. If you frequently encounter such complex cases, it is recommended that you familiarize yourself with the fundamentals of regular expressions. Personally, I rarely require them, but if I do, I search for the expression on the internet using a reliable search engine like Google. Or I ask ChatGPT.

A simple example is to look for columns whose names contain either the one or the other of two strings:

tweets |> 
  select(matches("is|count")) |>
  colnames()

# [1] "retweet_count"   "favorite_count"  "is_quote_status" "is_retweet"      "quote_count"     "reply_count"

Combinations of patterns

We can combine the functions that look for strings in column names to create more specific pattern searches. The example below uses the & operator to connect two functions with a logical and. This means, both expressions must evaluate to TRUE for the column to be selected:

tweets |> 
  select(contains("retweet") & where(is.numeric)) |> 
  colnames()

# [1] "retweet_count"

By Data Type

Another flexible way to select columns is by their data type. Say we want to select all numeric columns because we wish to calculate the mean value across all of them in the next step of the pipeline. There is a shortcut for this, using the where() function together with is.numeric:

tweets |> 
  select(where(is.numeric)) |>
  colnames()

# [1] "retweet_count"  "favorite_count" "quote_count"    "reply_count"

Of course, there are functions for all other data types as well:

tweets %>% 
  select(where(is.logical))

tweets %>% 
  select(where(is.character))

tweets %>% 
  select(where(is.factor))

tweets %>% 
  select(where(is.list))

# The package lubridate provides a function to check for date (without time) ...
tweets %>% 
  select(where(lubridate::is.Date))

# ... and one for date with time
tweets %>% 
  select(where(lubridate::is.POSIXct))

By Position

Another way we can address columns is by their position or index. We pass a number to the select function, which will be interpreted as an index of a column in a tibble. Alternatively, we can pass a range.

# Select first column
tweets |> 
  select(1)
  
# Select last column
tweets |> 
  select(last_col())

# Select last second last column 
tweets |> 
  select(last_col(2))

# Select a range of columns
tweets |> 
  select(2:6)

# Select everything but the last two columns
tweets |> 
  select(1:last_col(2))

By Set Affiliation

We can select all columns from a list of columns names with all_of:

cols <- c("screen_name", "text", "retweet_count")

tweets |> 
  select(all_of(cols)) |> 
  colnames()
  
# [1] "screen_name" "text" "retweet_count"

If we include a column name in the cols list that is not present in the tweets tibble, this results in an error. An alternative here would be the any_of function, which selects only those columns that are present in the tibble:

# Define a super set of columns and use any_of
cols_superset <- c("screen_name", "text", "retweet_count", "title")

tweets |>
  select(any_of(cols_superset)) |> 
  colnames()
  
# [1] "screen_name" "text" "retweet_count"

Exclude Columns

Instead of specifying which columns to include (white list), we can define which columns we wish to exclude (black list). We do this by prefixing the selection with an exclamation mark !:

tweets |> 
  select(!text)

The following example select all columns that end with "_count", but we want to exclude the two columns quote_count and reply_count. We do this by listing their names with a prefixed ! :

tweets |> 
  select(ends_with("_count") & !quote_count & !reply_count) |> 
  colnames()

# [1] "retweet_count"  "favorite_count"

Note that instead of the exclamation mark, you can also use the minus operator -. Both do the same thing, and you will encounter both in the wild.

# This is the same as above
tweets |> 
  select(-text)

The select command