Joining Data Sets

A special case of adding a new column is when the column already exists in a different data set. Then we can apply the join-operator.

Data analysis often involves working with multiple data sets. In many cases, we need to combine or merge data sets to gain insights or answer questions. This process is known as joining data sets. In this lesson, we will discuss why we need to join data sets, the different types of joins that exist, and how to implement joins using R and the Tidyverse.

There are four types of joins that we can use to combine data sets: inner join, left join, right join, and full join. In this lesson, we will illustrate them using the example of our politician's tweets. We want to enrich our data with some meta information about each politician, such as their political party, resort, age, or gender. For this, we first load the metadata from Excel into a tibble meta:

library(tidyverse)
library(readxl)
library(janitor)

# Load the meta data from Excel
meta <- 
  read_excel("data/tweets_metadata.xlsx") |> 
  clean_names() # make column names nicer

Note that the function clean_names automatically fixes column names for us. In Excel, people tend to use speaking names with spaces in them. The function clean_names replaces those with underscores and resolves capital letters and more.

Inner Join

An inner join returns only the rows that have matching values in both data sets. In other words, it only includes the observations that are common to both data sets. This type of join is useful when we want to exclude observations that do not have a match in the other data set. For example, if we take the data set with metadata about our Twitter users, we would effectively filter our tweets if we joined both using an inner join:

tweets |> 
  inner_join(meta, by = join_by(screen_name == user_screenname), keep = TRUE) |>
  distinct(screen_name, user_screenname, party)

Left and Right Join

A left join returns all the rows from the left data set and the matching rows from the right data set. If there is no match in the right data set, it will return NA values for the columns from that data set. This type of join is useful when we want to include all the observations from one data set and match them with the relevant observations from the other data set.

# Keep all tweets, regardless of whether we have metadata for the user
tweets |> 
  left_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

A right join returns all the rows from the right data set and the matching rows from the left data set. If there is no match in the left data set, it will return NA values for the columns from that data set. This type of join is similar to the left join, but with the roles of the data sets reversed.

# Keep all users from the meta data set, even if there are no tweets
tweets |> 
  right_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

Full Join

A full join returns all the rows from both data sets. If there is no match in one data set, it will return NA values for the columns from that data set. This type of join is useful when we want to combine all the observations from both data sets. However, it can also result in numerous missing values if there are many observations without a match.

# Keep all rows from both data sets
tweets |> 
  full_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

PreviousRename Columns NextSummarize Rows

Last updated 2 years ago