Joining Data Sets
A special case of adding a new column is when the column already exists in a different data set. Then we can apply the join-operator.
Last updated
A special case of adding a new column is when the column already exists in a different data set. Then we can apply the join-operator.
Last updated
Data analysis often involves working with multiple data sets. In many cases, we need to combine or merge data sets to gain insights or answer questions. This process is known as joining data sets. In this lesson, we will discuss why we need to join data sets, the different types of joins that exist, and how to implement joins using R and the Tidyverse.
There are four types of joins that we can use to combine data sets: inner join, left join, right join, and full join. In this lesson, we will illustrate them using the example of our politician's tweets. We want to enrich our data with some meta information about each politician, such as their political party, resort, age, or gender. For this, we first load the metadata from Excel into a tibble meta
:
Note that the function clean_names
automatically fixes column names for us. In Excel, people tend to use speaking names with spaces in them. The function clean_names
replaces those with underscores and resolves capital letters and more.
An inner join returns only the rows that have matching values in both data sets. In other words, it only includes the observations that are common to both data sets. This type of join is useful when we want to exclude observations that do not have a match in the other data set. For example, if we take the data set with metadata about our Twitter users, we would effectively filter our tweets if we joined both using an inner join:
A left join returns all the rows from the left data set and the matching rows from the right data set. If there is no match in the right data set, it will return NA values for the columns from that data set. This type of join is useful when we want to include all the observations from one data set and match them with the relevant observations from the other data set.
A right join returns all the rows from the right data set and the matching rows from the left data set. If there is no match in the left data set, it will return NA values for the columns from that data set. This type of join is similar to the left join, but with the roles of the data sets reversed.
A full join returns all the rows from both data sets. If there is no match in one data set, it will return NA values for the columns from that data set. This type of join is useful when we want to combine all the observations from both data sets. However, it can also result in numerous missing values if there are many observations without a match.