The WordLens-Project
  • The WordLens-Project
  • Course Overview
  • Part 1: Transform and Visualize Data
    • 1 Working Environment
    • 2 R and the Tidyverse
    • 3 Data Loading
      • Tabular Data
      • Tidy Data
      • Exploring New Data
    • 4 Data Transformation
      • Select Columns
      • Filter Rows
      • Sort Rows
      • Add Or Change Columns
        • Calculate New Columns
        • Change Data Types
        • Rename Columns
        • Joining Data Sets
      • Summarize Rows
    • 5 Data Visualization
      • Pleas for Visualization
      • Fast and Simple Plots
      • Grammar of Graphics
  • Part 2: Rule-Based NLP
    • 6 Unstructured Data
    • 7 Searching Text
    • 8 Tokenizing Text
      • Filter or Sample Data
      • Clean and Normalize Text
      • Split Text Into Tokens
      • Removing Stop Words
      • Enrich Tokens
    • 9 Topic Classification
      • Deductive
      • Inductive
    • 10 Sentiment Analysis
    • 11 Text Classification
    • 12 Word Pairs and N-Grams
  • Part 3: NLP with Machine Learning
    • 13 Text Embeddings
    • 14 Part-Of-Speech
    • 15 Named Entities
    • 16 Syntactic Dependency
    • 17 Similarity
    • 18 Sentiment
    • 19 Text Classification
    • 20 Transformers
    • 21 Training a Model
    • 22 Large Language Models
  • Appendix
  • Resources
Powered by GitBook
On this page
  • Inner Join
  • Left and Right Join
  • Full Join
  1. Part 1: Transform and Visualize Data
  2. 4 Data Transformation
  3. Add Or Change Columns

Joining Data Sets

A special case of adding a new column is when the column already exists in a different data set. Then we can apply the join-operator.

PreviousRename ColumnsNextSummarize Rows

Last updated 2 years ago

Data analysis often involves working with multiple data sets. In many cases, we need to combine or merge data sets to gain insights or answer questions. This process is known as joining data sets. In this lesson, we will discuss why we need to join data sets, the different types of joins that exist, and how to implement joins using R and the Tidyverse.

There are four types of joins that we can use to combine data sets: inner join, left join, right join, and full join. In this lesson, we will illustrate them using the example of our politician's tweets. We want to enrich our data with some meta information about each politician, such as their political party, resort, age, or gender. For this, we first load the metadata from Excel into a tibble meta:

library(tidyverse)
library(readxl)
library(janitor)

# Load the meta data from Excel
meta <- 
  read_excel("data/tweets_metadata.xlsx") |> 
  clean_names() # make column names nicer

Note that the function clean_names automatically fixes column names for us. In Excel, people tend to use speaking names with spaces in them. The function clean_names replaces those with underscores and resolves capital letters and more.

Inner Join

An inner join returns only the rows that have matching values in both data sets. In other words, it only includes the observations that are common to both data sets. This type of join is useful when we want to exclude observations that do not have a match in the other data set. For example, if we take the data set with metadata about our Twitter users, we would effectively filter our tweets if we joined both using an inner join:

tweets |> 
  inner_join(meta, by = join_by(screen_name == user_screenname), keep = TRUE) |>
  distinct(screen_name, user_screenname, party)

Left and Right Join

A left join returns all the rows from the left data set and the matching rows from the right data set. If there is no match in the right data set, it will return NA values for the columns from that data set. This type of join is useful when we want to include all the observations from one data set and match them with the relevant observations from the other data set.

# Keep all tweets, regardless of whether we have metadata for the user
tweets |> 
  left_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

A right join returns all the rows from the right data set and the matching rows from the left data set. If there is no match in the left data set, it will return NA values for the columns from that data set. This type of join is similar to the left join, but with the roles of the data sets reversed.

# Keep all users from the meta data set, even if there are no tweets
tweets |> 
  right_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

Full Join

A full join returns all the rows from both data sets. If there is no match in one data set, it will return NA values for the columns from that data set. This type of join is useful when we want to combine all the observations from both data sets. However, it can also result in numerous missing values if there are many observations without a match.

# Keep all rows from both data sets
tweets |> 
  full_join(meta, by = join_by(screen_name == user_screenname)) |> 
  distinct(screen_name, party)

The INNER JOIN removes all rows that do not have matches in both tables.
The LEFT JOIN keeps all rows from the left table and fills missing matches with "NA".
The RIGHT JOIN keeps all rows from the right table and fills missing matches with "NA".
The FULL JOIN keeps all rows from both tables and fills missing matches with "NA" on both sides.
Drawing
Drawing
Drawing
Drawing