Select Columns
This lesson introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for a particular analysis.
The select
command
select
commandThe function select()
is the designated tool to select columns with dplyr. By passing different things to the function, we can efficiently define the set of columns in the resulting data frame.
By Name
The intuitive way to select the columns we need is by listing their names. We can pass one or more column names to the select
function. In the case of two or more, we use commas to separate the names:
tweets |>
select(screen_name, text) |>
colnames()
# [1] "screen_name" "text"
By Column Name Pattern
Names Starting with a String
Occasionally, R we want to select columns based on a pattern of their names. Take the tweets data set as an example. Here, the variables that indicate if a tweet is a quote or a retweet have the prefix is_
. We leverage this with the helper function starts_with
:
tweets |>
select(starts_with("is_")) |>
colnames()
# [1] "screen_name" "text"
Names Ending with a String
Similar to start_with
, the function ends_with
looks for a string at the end of a column name. For example, all columns that contain numerical information about a tweet, such as the number of retweets, end with the suffix _count
. We can take advantage of that in case we wish to select all these columns efficiently:
tweets |>
select(ends_with("_count")) |>
colnames()
# [1] "retweet_count" "favorite_count" "quote_count" "reply_count"
Names with a String Anywhere
To complete the picture, we can also search for a string somewhere in a column name. The contains
function does exactly that:
tweets |>
select(contains("screen")) |>
colnames()
# [1] "screen_name" "in_reply_to_screen_name"
More Complex Matches
Sometimes, merely matching strings in column names may not suffice. You may need to deal with more complicated patterns, such as those involving wildcards or requiring symbols to appear in a specific order in a column name. Regular expressions can be an excellent solution for such complex scenarios, though they can be difficult to grasp. If you frequently encounter such complex cases, it is recommended that you familiarize yourself with the fundamentals of regular expressions. Personally, I rarely require them, but if I do, I search for the expression on the internet using a reliable search engine like Google. Or I ask ChatGPT.
A simple example is to look for columns whose names contain either the one or the other of two strings:
tweets |>
select(matches("is|count")) |>
colnames()
# [1] "retweet_count" "favorite_count" "is_quote_status" "is_retweet" "quote_count" "reply_count"
Combinations of patterns
We can combine the functions that look for strings in column names to create more specific pattern searches. The example below uses the &
operator to connect two functions with a logical and
. This means, both expressions must evaluate to TRUE
for the column to be selected:
tweets |>
select(contains("retweet") & where(is.numeric)) |>
colnames()
# [1] "retweet_count"
By Data Type
Another flexible way to select columns is by their data type. Say we want to select all numeric columns because we wish to calculate the mean value across all of them in the next step of the pipeline. There is a shortcut for this, using the where()
function together with is.numeric
:
tweets |>
select(where(is.numeric)) |>
colnames()
# [1] "retweet_count" "favorite_count" "quote_count" "reply_count"
Of course, there are functions for all other data types as well:
tweets %>%
select(where(is.logical))
tweets %>%
select(where(is.character))
tweets %>%
select(where(is.factor))
tweets %>%
select(where(is.list))
# The package lubridate provides a function to check for date (without time) ...
tweets %>%
select(where(lubridate::is.Date))
# ... and one for date with time
tweets %>%
select(where(lubridate::is.POSIXct))
By Position
Another way we can address columns is by their position or index. We pass a number to the select function, which will be interpreted as an index of a column in a tibble. Alternatively, we can pass a range.
# Select first column
tweets |>
select(1)
# Select last column
tweets |>
select(last_col())
# Select last second last column
tweets |>
select(last_col(2))
# Select a range of columns
tweets |>
select(2:6)
# Select everything but the last two columns
tweets |>
select(1:last_col(2))
By Set Affiliation
We can select all columns from a list of columns names with all_of
:
cols <- c("screen_name", "text", "retweet_count")
tweets |>
select(all_of(cols)) |>
colnames()
# [1] "screen_name" "text" "retweet_count"
If we include a column name in the cols
list that is not present in the tweets
tibble, this results in an error. An alternative here would be the any_of
function, which selects only those columns that are present in the tibble:
# Define a super set of columns and use any_of
cols_superset <- c("screen_name", "text", "retweet_count", "title")
tweets |>
select(any_of(cols_superset)) |>
colnames()
# [1] "screen_name" "text" "retweet_count"
Exclude Columns
Instead of specifying which columns to include (white list), we can define which columns we wish to exclude (black list). We do this by prefixing the selection with an exclamation mark !
:
tweets |>
select(!text)
The following example select all columns that end with "_count
", but we want to exclude the two columns quote_count
and reply_count
. We do this by listing their names with a prefixed !
:
tweets |>
select(ends_with("_count") & !quote_count & !reply_count) |>
colnames()
# [1] "retweet_count" "favorite_count"
Note that instead of the exclamation mark, you can also use the minus operator -
. Both do the same thing, and you will encounter both in the wild.
# This is the same as above
tweets |>
select(-text)
Further Reading
I highly recommend the book "R for Data Science" as an introduction to the topics of this course's first part. In particular, there is a small section on selecting columns in chapter "4 Data Transformation".
Last updated