Filter or Sample Data
Before tokenizing the text and thus multiplying our rows, we should make sure we are only looking at the relevant data.
Filtering
The process of tokenizing text involves breaking down a given piece of text into individual words or tokens, which are then arranged into rows. Depending on the size of the original text, this can result in a substantial increase in the amount of data being analyzed. For example, a text with 200 words may become 200 individual rows after tokenization.
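The section does not prescribe a particular tokenizer, but a minimal sketch of this step, assuming the {tidytext} package and a small example tibble with hypothetical column names, could look like this:

```r
library(dplyr)
library(tidytext)

# A tiny example corpus; `id` and `text` are hypothetical column names.
tweets <- tibble(
  id = 1:2,
  text = c("Tokenization splits text into words",
           "each word then becomes its own row")
)

# unnest_tokens() puts one token per row, multiplying the row count.
tweet_tokens <- tweets |>
  unnest_tokens(word, text)
```

After this step, tweet_tokens has one row per word rather than one row per tweet.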
To avoid working with irrelevant data, it is important to filter out any records that are not directly related to the question our analysis is trying to answer. In the example given, if our analysis focuses on identifying popular topics in German-language tweets from 2023, we should filter the data to include only records that meet these criteria.
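A sketch of such a filter, where the column names lang (a language code) and created_at (a date-time) are assumptions about the schema rather than names given in the example:

```r
library(dplyr)
library(lubridate)

# Keep only German-language tweets from 2023; `lang` and
# `created_at` are assumed column names, adjust to the actual data.
german_tweets_2023 <- tweets |>
  filter(
    lang == "de",
    year(created_at) == 2023
  )
```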
In addition to removing unnecessary rows, it is recommended to keep only the relevant columns for further processing during exploratory data analysis. This means selecting and retaining only the columns that the analysis actually needs. In the given example, five such columns were selected.
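Continuing the sketch above, the five column names here are placeholders, not the actual columns from the example:

```r
library(dplyr)

# select() keeps only the listed columns; the five names here are
# placeholders for whichever columns the analysis actually needs.
tweets_small <- german_tweets_2023 |>
  select(id, created_at, user, text, retweet_count)
```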
It's worth noting that the selection of columns is not a one-time process and can be modified at any stage of the analysis. Additional columns can be added, and the code can be re-run. The process of exploratory data analysis is iterative and cyclical, meaning that the steps may need to be revisited and modified as new insights are gained.
Sampling
In some cases, it may not be possible to narrow down a large dataset to a smaller subset before performing further analysis. This can occur when we are unsure of what exactly we are looking for, and we need to perform additional analysis that relies on the text being tokenized. Alternatively, we may need to use the entire dataset for our analysis.
Working with large datasets on a laptop can pose a challenge because of limited resources such as memory and computing power. This can make complex analysis time-consuming or not feasible at all. One solution to this problem is to perform the analysis on a cluster of computers, although this can involve additional effort and costs.
An alternative approach that enables analysis on a laptop is to use sampling techniques to reduce the size of the dataset and thereby decrease the workload. In R and the Tidyverse, the slice_sample function can be used for this purpose: it draws random rows from a data frame and returns a new, smaller tibble that is more manageable on a laptop. If we require a different kind of subset, we can use any of {dplyr}'s other slice_* functions.
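A minimal sketch, assuming the filtered tibble from the earlier example:

```r
library(dplyr)

set.seed(42)  # makes the random draw reproducible

# Keep a random 10% of rows...
tweets_sample <- tweets_small |>
  slice_sample(prop = 0.10)

# ...or a fixed number of rows.
tweets_1k <- tweets_small |>
  slice_sample(n = 1000)

# Other slice_* helpers for different subsets:
#   slice_head(n = ...)        first rows
#   slice_tail(n = ...)        last rows
#   slice_min() / slice_max()  rows with smallest / largest values
```

Because slice_sample draws rows at random, calling set.seed() beforehand is what makes the sample reproducible across runs.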