Select Columns
This lesson introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for a particular analysis.
Last updated
This lesson introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for a particular analysis.
Last updated
select
commandThe function select()
is the designated tool to select columns with . By passing different things to the function, we can efficiently define the set of columns in the resulting data frame.
The intuitive way to select the columns we need is by listing their names. We can pass one or more column names to the select
function. In the case of two or more, we use commas to separate the names:
Occasionally, R we want to select columns based on a pattern of their names. Take the tweets data set as an example. Here, the variables that indicate if a tweet is a quote or a retweet have the prefix is_
. We leverage this with the helper function starts_with
:
Similar to start_with
, the function ends_with
looks for a string at the end of a column name. For example, all columns that contain numerical information about a tweet, such as the number of retweets, end with the suffix _count
. We can take advantage of that in case we wish to select all these columns efficiently:
To complete the picture, we can also search for a string somewhere in a column name. The contains
function does exactly that:
A simple example is to look for columns whose names contain either the one or the other of two strings:
We can combine the functions that look for strings in column names to create more specific pattern searches. The example below uses the &
operator to connect two functions with a logical and
. This means, both expressions must evaluate to TRUE
for the column to be selected:
Another flexible way to select columns is by their data type. Say we want to select all numeric columns because we wish to calculate the mean value across all of them in the next step of the pipeline. There is a shortcut for this, using the where()
function together with is.numeric
:
Of course, there are functions for all other data types as well:
Another way we can address columns is by their position or index. We pass a number to the select function, which will be interpreted as an index of a column in a tibble. Alternatively, we can pass a range.
We can select all columns from a list of columns names with all_of
:
If we include a column name in the cols
list that is not present in the tweets
tibble, this results in an error. An alternative here would be the any_of
function, which selects only those columns that are present in the tibble:
Instead of specifying which columns to include (white list), we can define which columns we wish to exclude (black list). We do this by prefixing the selection with an exclamation mark !
:
The following example select all columns that end with "_count
", but we want to exclude the two columns quote_count
and reply_count
. We do this by listing their names with a prefixed !
:
Note that instead of the exclamation mark, you can also use the minus operator -
. Both do the same thing, and you will encounter both in the wild.
Sometimes, merely matching strings in column names may not suffice. You may need to deal with more complicated patterns, such as those involving wildcards or requiring symbols to appear in a specific order in a column name. Regular expressions can be an excellent solution for such complex scenarios, though they can be difficult to grasp. If you frequently encounter such complex cases, it is recommended that you familiarize yourself with the fundamentals of regular expressions. Personally, I rarely require them, but if I do, I search for the expression on the internet using a reliable search engine like Google. Or I ask .
I highly recommend the book "R for Data Science" as an introduction to the topics of this course's first part. In particular, there is a small section on selecting columns in chapter "".