7 Searching Text
Searching for keywords or patterns in text is one way to make immediate use of the data, without further processing and imposing a structure.
In this part of the project, we begin analyzing text data or data that contains text columns. We start with simple ways to search in text, without any prior processing to impose a structure. This is the objective of this lesson. As we progress through this section, we'll discover how tokenization provides a new structure that enables us to use tools for structured data analysis, such as filtering, sorting, or grouping. Leveraging this structure, we then apply various natural language processing (NLP) techniques.
Searching for exact matches of a specific word in a text can prove very useful. For example, we might be interested in whether a politician from our tweets data set has ever mentioned ChatGPT in one of their status updates. We can find out using the str_detect function from the {stringr} package:
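A minimal sketch of such a search, assuming a tibble named tweets with a text column. The data frame, the column name, and the sample rows are illustrative assumptions, not the course data:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the tweets data set
tweets <- tibble(
  text = c(
    "ChatGPT is changing how we work.",
    "Budget talks continue today.",
    "Tried chatGPT for the first time!"
  )
)

# Lowercase the text first, then keep the rows that contain "chatgpt"
chatgpt_tweets <- tweets %>%
  filter(str_detect(str_to_lower(text), "chatgpt"))

chatgpt_tweets
```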
This code will return all tweets where "chatgpt" has been mentioned. It's important to note that we first convert the tweet text to lowercase using the str_to_lower function to ensure that we account for various capitalizations like "chatGPT" or "ChatGpt". By comparing everything in lowercase, we eliminate this potential error source. We will make use of this principle most of the time when searching in text.
Special cases of exact matches arise when we want to match a word at the beginning or at the end of a text. The following should identify all tweets that are retweets and yield the same result as filtering on is_retweet:
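A possible sketch with str_starts, assuming that the text of a retweet begins with the marker "RT @". That convention, the is_retweet values, and the sample rows are assumptions made for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical sample: two retweets and one original tweet
tweets <- tibble(
  text = c(
    "RT @someone: Great speech today",
    "Great speech today",
    "RT @another: Worth reading"
  ),
  is_retweet = c(TRUE, FALSE, TRUE)
)

# Keep tweets whose text starts with the retweet marker
retweets <- tweets %>%
  filter(str_starts(text, "RT @"))

retweets
```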
We can use the equivalent function str_ends to search for a word at the end of a text:
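For instance, a sketch that finds tweets ending with a particular hashtag. The hashtag and the sample rows are hypothetical:

```r
library(dplyr)
library(stringr)

# Hypothetical sample tweets
tweets <- tibble(
  text = c(
    "Big news on regulation #AI",
    "#AI is everywhere these days",
    "No hashtag here"
  )
)

# Compare in lowercase and keep tweets whose text ends with "#ai"
ends_with_tag <- tweets %>%
  filter(str_ends(str_to_lower(text), "#ai"))

ends_with_tag
```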
In the previous section, we learned that the str_detect function, together with str_starts and str_ends, can be used to search for a string within a text column. However, the power of these functions goes beyond simple string searches. In fact, we can use regular expressions to create more complex search patterns. Regular expressions allow us to formulate advanced search patterns that can be used with str_detect. In the upcoming section, we will cover the key concepts of regular expressions and provide some examples using tweets.
Regular expressions, or regex, are a handy tool for finding and working with text patterns. They might seem a bit tricky at first, but they're powerful once you get the hang of them. By the way: With ChatGPT, you'll have an easier time building and understanding regular expressions.
Here are a couple of simple yet powerful regex examples:
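To search for multiple keywords in a text, we can simply connect them using the pipe symbol. This regex looks for three words at once: "chatgpt|gpt3|gpt4". Note that the alternatives must not be wrapped in square brackets, because square brackets define a set of single characters rather than a choice between words.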
This one finds URLs in text: "https?://\S+"
Occasionally, we wish to find sequences of multiple white spaces and replace them with only one. This expression can help us find them: "\s{2,}". This pattern matches white space characters, which in regex are represented by \s, and the curly braces quantify how many we look for. Here it is at least two, with no upper limit.
To find an email address in a text, you can use this regex pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
This pattern will match any combination of letters, numbers, and special characters before the '@' symbol, followed by a domain name and a two or more letter top-level domain (like .com, .org, etc.).
To find a U.S. phone number (in the format xxx-xxx-xxxx) in a text, you can use this regex pattern: \d{3}-\d{3}-\d{4}. This pattern checks for three digits, followed by a hyphen, another three digits, another hyphen, and finally four digits. To verify that the entire text is a phone number and nothing else, anchor the pattern with ^ at the start and $ at the end. You can see how we can build flexible search patterns with regular expressions.
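The patterns above can be tried out with str_detect. The following is a sketch on made-up example strings. One detail worth knowing: inside an R string literal, every backslash of a regex has to be doubled (for example "\\s"), or the pattern can be written as a raw string such as r"(\s{2,})" in R 4.0 or later.

```r
library(stringr)

# Made-up example strings
texts <- c(
  "Tried GPT4 today, details at https://example.com/post",
  "Too   many    spaces",
  "Contact: jane.doe@example.org or 555-123-4567"
)

# The regex patterns from the list above, with backslashes doubled for R
keyword_pattern <- "chatgpt|gpt3|gpt4"
url_pattern     <- "https?://\\S+"
space_pattern   <- "\\s{2,}"
email_pattern   <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
phone_pattern   <- "\\d{3}-\\d{3}-\\d{4}"

has_keyword <- str_detect(str_to_lower(texts), keyword_pattern)
has_url     <- str_detect(texts, url_pattern)
has_spaces  <- str_detect(texts, space_pattern)
has_email   <- str_detect(texts, email_pattern)
has_phone   <- str_detect(texts, phone_pattern)

# Anchoring with ^ and $ requires the whole string to be a phone number
is_phone <- str_detect(
  c("555-123-4567", "call 555-123-4567"),
  "^\\d{3}-\\d{3}-\\d{4}$"
)
```

Each result is a logical vector with one entry per input string, so it can be used directly inside filter() on a text column.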
We might not only be interested in whether a keyword or pattern is part of a text, but also what the exact value was. We can achieve this by extracting the match into a column. For most of the functions we saw, such as str_detect, there is an equivalent to extract the match. Consider the following search for multiple keywords:
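A search for multiple keywords could look like the following sketch, again assuming a hypothetical tweets tibble with a text column and invented sample rows:

```r
library(dplyr)
library(stringr)

# Hypothetical sample tweets
tweets <- tibble(
  text = c(
    "ChatGPT will transform education",
    "Nothing about language models here",
    "Comparing GPT3 and GPT4 side by side"
  )
)

# Keep tweets that mention at least one of the three keywords
keyword_tweets <- tweets %>%
  filter(str_detect(str_to_lower(text), "chatgpt|gpt3|gpt4"))

keyword_tweets
```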
With str_extract and str_extract_all, we can create a new column with matches:
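A sketch of both variants, using a hypothetical tweets tibble and a new column named keyword (both names are assumptions):

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical sample tweets
tweets <- tibble(
  text = c(
    "ChatGPT will transform education",
    "Nothing about language models here",
    "Comparing GPT3 and GPT4 side by side"
  )
)

pattern <- "chatgpt|gpt3|gpt4"

# First statement: str_extract keeps only the first match per tweet
# (NA where nothing matches)
first_match <- tweets %>%
  mutate(keyword = str_extract(str_to_lower(text), pattern))

# Second statement: str_extract_all returns a list column with every match,
# which unnest_longer spreads over one row per match
all_matches <- tweets %>%
  mutate(keyword = str_extract_all(str_to_lower(text), pattern)) %>%
  unnest_longer(keyword)
```

With the default settings of unnest_longer in current tidyr versions, tweets without any match have an empty list element and are dropped from the second result.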
In the second statement with str_extract_all, we add the unnest_longer function to our pipeline. This function explodes a list of values into separate rows. The str_extract_all function returns a list of matches because it extracts all potential matches in the text. In contrast, the str_extract function only returns the first match in the text, and we don't need to deal with a list result.
In addition to extracting the matches, we might also want to replace values. There are corresponding functions like str_replace and str_replace_all for that. We will utilize them soon when we clean and tokenize our text.
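As a small illustration of the difference between the two functions, here is a sketch that collapses runs of white space in a made-up string:

```r
library(stringr)

messy <- "Too   many    spaces"

# str_replace only replaces the first match
first_only <- str_replace(messy, "\\s{2,}", " ")

# str_replace_all replaces every match
all_runs <- str_replace_all(messy, "\\s{2,}", " ")

first_only  # "Too many    spaces"
all_runs    # "Too many spaces"
```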
This section is coming soon.
In this course, we are not aiming to become proficient in building regular expressions, and we will therefore keep the introduction to the topic short. I want you to add them to your arsenal of tools and have a basic understanding of what they can be used for. When you need to match a specific pattern, I recommend you search the internet or, even better, ask ChatGPT. Additionally, there are numerous cheat sheets available, which can be of great help.
For a general introduction to working with strings (character sequences, i.e., text) in R and the Tidyverse, read the chapter "" from the book "".
To learn more about regular expressions, I recommend you read the chapter "" from the book "".