7 Searching Text

Searching for keywords or patterns is one way to make immediate use of text data, without further processing or imposing a structure.

Summary

In this lesson, you'll learn:

  • Different ways to use filter with text data, with a focus on the function str_detect.

  • What a regular expression is and how it can be used to create flexible and powerful text patterns to search and filter text.

  • How we can use str_extract and str_extract_all to not only filter text, but also find the matches and add them to a new column.

  • How we can remove matches from a text with str_remove and str_remove_all.

  • If we would rather not remove but replace something, we can use str_replace or str_replace_all.

You can find the examples from this lesson on GitHub in part_2/1_searching_text.R

In this part of the project, we begin analyzing text data or data that contains text columns. We start with simple ways to search in text, without any prior processing to impose a structure. This is the objective of this lesson. As we progress through this section, we'll discover how tokenization provides a new structure that enables us to use tools for structured data analysis, such as filtering, sorting, or grouping. Leveraging this structure, we employ various natural language processing (NLP) techniques, including sentiment analysis, text classification, and topic identification.

(Figure: Different ways to search in text, visualized.)

Search Using Keywords

Searching for exact matches of a specific word in a text can prove very useful. For example, we might be interested in whether a politician from our tweets data set has ever mentioned ChatGPT in one of their status updates. We can find out using the str_detect function from the {stringr} package:

tweets |> 
  filter(str_detect(str_to_lower(text), "chatgpt")) |> 
  select(screen_name, text)

This code will return all tweets where "chatgpt" has been mentioned. It's important to note that we first convert the tweet text to lowercase using the str_to_lower function to ensure that we account for various capitalizations like "chatGPT" or "ChatGpt". By comparing everything in lowercase, we eliminate this potential error source. We will make use of this principle most of the time when searching in text.
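The same principle is easy to try on a small standalone sketch (the example texts below are made up). Lowercasing before matching catches all capitalizations; alternatively, stringr's regex() modifier with ignore_case = TRUE makes the pattern itself case-insensitive, leaving the text untouched:

```r
library(stringr)

texts <- c("Have you tried ChatGPT?", "chatGPT is everywhere", "No AI here")

# Lowercase the text before matching, as in the pipeline above
str_detect(str_to_lower(texts), "chatgpt")
#> [1]  TRUE  TRUE FALSE

# Equivalent: make the pattern itself case-insensitive
str_detect(texts, regex("chatgpt", ignore_case = TRUE))
#> [1]  TRUE  TRUE FALSE
```

Both variants yield the same logical vector, so which one you use is mostly a matter of taste; the regex() variant has the advantage of not altering the text column.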

Special cases of exact matches are when we want to match a word at the beginning or at the end of a text. The following uses str_starts with negate = TRUE to keep all tweets that are not retweets, and should yield the same result as filtering on !is_retweet:

tweets |> 
  filter(str_starts(text, "RT", negate = TRUE)) |> 
  select(screen_name, text)

We can use the equivalent function str_ends to search for a word at the end of a text:

tweets |> 
  filter(str_ends(text, "#SPD")) |> 
  select(screen_name, text)

Search Using Regular Expressions

In the previous section, we learned that the str_detect function together with str_starts and str_ends can be used to search for a string within a text column. However, the power of these functions goes beyond simple string searches. In fact, we can use regular expressions to create more complex search patterns. Regular expressions allow us to formulate advanced search patterns that can be used with str_detect. In the upcoming section, we will cover the key concepts of regular expressions and provide some examples using tweets.

Regular expressions, or regex, are a handy tool for finding and working with text patterns. They might seem a bit tricky at first, but they're powerful once you get the hang of them. By the way: With ChatGPT, you'll have an easier time building and understanding regular expressions.

Here are a couple of simple yet powerful regex examples:

  • To search for multiple keywords in a text, we can simply connect them using the pipe symbol. This regex looks for three words at once: chatgpt|gpt3|gpt4. Note that the alternatives are not wrapped in square brackets; brackets would create a character class, which matches single characters rather than whole words.

  • This one finds URLs in text: https?://\S+. Keep in mind that in an R string literal, each backslash must be doubled: "https?://\\S+".

  • Occasionally, we wish to find sequences of multiple white spaces and replace them with only one. This expression can help us find them: "\s{2,}". This pattern matches white spaces, which in regex are represented by \s, and the curly braces quantify how many we look for: here, at least two and arbitrarily many.

  • To find an email address in a text, you can use this regex pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} This pattern will match any combination of letters, numbers, and special characters before the '@' symbol, followed by a domain name and a two or more letter top-level domain (like .com, .org, etc.).

  • To verify if a text is a valid U.S. phone number (in the format xxx-xxx-xxxx), you can use this regex pattern: \d{3}-\d{3}-\d{4}. This pattern checks for three digits, followed by a hyphen, another three digits, another hyphen, and finally four digits. You can see how we can build flexible search patterns with regular expressions.
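The patterns above can be tried directly in R. Here is a small sketch on made-up example strings; note again that in an R string literal every backslash in a regex must be doubled (\\s, \\d, \\S):

```r
library(stringr)

samples <- c(
  "Check https://example.com for details",
  "Mail me at jane.doe@example.org",
  "Call 555-123-4567 today",
  "Too   many    spaces"
)

# Extract the first URL in each string (NA where there is none)
str_extract(samples, "https?://\\S+")

# Extract email addresses
str_extract(samples, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")

# Detect U.S. phone numbers in the format xxx-xxx-xxxx
str_detect(samples, "\\d{3}-\\d{3}-\\d{4}")
#> [1] FALSE FALSE  TRUE FALSE

# Collapse runs of white space into a single space
str_replace_all(samples[4], "\\s{2,}", " ")
#> [1] "Too many spaces"
```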

In this course, we are not aiming to become proficient in building regular expressions, so we will keep the introduction to the topic short. I want you to add them to your arsenal of tools and have a basic understanding of what they can be used for. When you need to match a specific pattern, I recommend you search the internet or, even better, ask ChatGPT. Additionally, there are numerous cheat sheets available, which can be of great help.

Extract Matches

We might not only be interested in whether a keyword or pattern is part of a text, but also what the exact value was. We can achieve this by extracting the match into a column. For most of the functions we saw, such as str_detect, there is an equivalent to extract the match. Consider the following search for multiple keywords:

tweets |> 
  mutate(text = str_to_lower(text)) |> 
  filter(str_detect(text, "chatgpt|gpt3|gpt4|openai|künstliche intelligenz"))

With str_extract and str_extract_all, we can create a new column with matches:

# Extracting the keyword matches (only the first match per tweet)
tweets |> 
  mutate(text = str_to_lower(text)) |> 
  mutate(keyword_matches = str_extract(text, "chatgpt|gpt3|gpt4|openai|künstliche intelligenz")) |>
  filter(!is.na(keyword_matches)) |> 
  select(keyword_matches, text)
  
# Extracting the keyword matches (all matches per tweet)
tweets |> 
  mutate(text = str_to_lower(text)) |> 
  mutate(keyword_matches = str_extract_all(text, "chatgpt|gpt3|gpt4|openai|künstliche intelligenz")) |>
  unnest_longer(keyword_matches) |> 
  select(keyword_matches, text)

In the second statement with str_extract_all, we add the unnest_longer function to our pipeline. This function explodes a list of values into separate rows. The str_extract_all function returns a list of matches because it extracts all potential matches in the text. In contrast, the str_extract function only returns the first match in the text, and we don't need to deal with a list result.
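The difference is easy to see on a single made-up string: str_extract returns one character value per input, while str_extract_all returns a list with all matches per input, which is why the list needs to be unnested afterwards:

```r
library(stringr)

s <- "gpt4 beats gpt3, says chatgpt"

# Only the first match, as a plain character vector
str_extract(s, "chatgpt|gpt3|gpt4")
#> [1] "gpt4"

# All matches, as a list with one element per input string
str_extract_all(s, "chatgpt|gpt3|gpt4")
#> [[1]]
#> [1] "gpt4"    "gpt3"    "chatgpt"
```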

In addition to extracting the matches, we might also want to replace values. There are corresponding functions like str_replace and str_replace_all for that. We will utilize them soon when we clean and tokenize our text.
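The replace functions work analogously to the extract functions. As a small preview (with a made-up example string), str_replace changes only the first match, while str_replace_all changes all of them:

```r
library(stringr)

s <- "ChatGPT,  chatgpt   and CHATGPT"

# Collapse runs of white space into a single space (the "\\s{2,}" pattern from above)
str_replace_all(s, "\\s{2,}", " ")
#> [1] "ChatGPT, chatgpt and CHATGPT"

# str_replace changes only the first match, str_replace_all every match
str_replace(str_to_lower(s), "chatgpt", "ai")
#> [1] "ai,  chatgpt   and chatgpt"
str_replace_all(str_to_lower(s), "chatgpt", "ai")
#> [1] "ai,  ai   and ai"
```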

Replace Matches

This section is coming soon.

Further Reading

For a general introduction to working with strings (character sequences, i.e., text) in R and the Tidyverse, read the chapter "15 Strings" from the book "R for Data Science".

To learn more about regular expressions, I recommend you read the chapter "16 Regular Expressions" from the book "R for Data Science".
