The WordLens-Project

Exploring New Data


Last updated 2 years ago

You can find the complete example code for this part of the lesson in the GitHub repository.

Speaking Tibbles

Simply calling the name of a Tibble in the console reveals a lot of its information:

tweets

# # A tibble: 63,489 x 22
#    id         screen_name text  retweet_count favorite_count is_quote_status is_retweet retweeted_status_id retweeted_user
#    <chr>      <chr>       <chr>         <dbl>          <dbl> <lgl>           <lgl>      <chr>               <chr>         
#  1 164225194~ cem_oezdem~ "RT ~            23              0 FALSE           TRUE       1642158717113081856 BriHasselmann 
#  2 164225194~ cem_oezdem~ "RT ~            23              0 FALSE           TRUE       1642158717113081856 BriHasselmann 
#  3 164241681~ W_Schmidt_  "RT ~         10659              0 FALSE           TRUE       1641976869460275201 aakashg0      
#  4 164241681~ W_Schmidt_  "RT ~         10659              0 FALSE           TRUE       1641976869460275201 aakashg0      
#  5 164208262~ lisapaus    "@Ha~             1              3 FALSE           FALSE      NA                  NA            
#  6 164208262~ lisapaus    "@Ha~             1              3 FALSE           FALSE      NA                  NA            
#  7 164209011~ lisapaus    "@an~             1             11 FALSE           FALSE      NA                  NA            
#  8 164253274~ Wissing     "\U0~             6             25 FALSE           FALSE      NA                  NA            
#  9 164253273~ Wissing     "Wir~             9             93 FALSE           FALSE      NA                  NA            
# 10 164246387~ Wissing     "Wes~            15            154 FALSE           FALSE      NA                  NA            
# # i 63,479 more rows
# # i 13 more variables: lang <chr>, hashtags <list>, urls <list>, user_mentions <list>, photos <list>, source <chr>,
# #   insert_timestamp <chr>, created_at <dttm>, quote_count <dbl>, reply_count <dbl>, in_reply_to_screen_name <chr>,
# #   in_reply_to_status_id <chr>, quoted_status_id <chr>
# # i Use `print(n = ...)` to see more rows

In addition to a preview of the first ten rows, a tibble also displays its total number of rows and columns. In this case, there are 63,489 rows and 22 columns. Below the preview, the tibble lists the names and data types of the columns that did not fit on the screen. The preview itself is cut off after a few rows to avoid flooding the console with text.

Try this: instead of using read_csv, load the CSV file with the read.csv function from base R. Now, enter the name of the data frame into the console and press Enter. What is the difference in the output? Which one do you prefer?
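The comparison can also be tried without the tweet data. The following sketch writes a small, made-up CSV file to a temporary location and loads it both ways. The file contents and the variable names are assumptions for illustration only, not part of the lesson data:

```r
library(readr)

# Hypothetical toy data, written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "screen_name,retweet_count,is_retweet",
    "cem_oezdemir,23,TRUE",
    "W_Schmidt_,10659,TRUE",
    "lisapaus,1,FALSE"
  ),
  csv_path
)

# Base R returns a plain data frame
tweets_base <- read.csv(csv_path)

# readr returns a tibble
tweets_tidy <- read_csv(csv_path, show_col_types = FALSE)

# Compare the classes of both objects
class(tweets_base)
class(tweets_tidy)

# Compare the console output: only the tibble prints
# its dimensions and the data type of each column
tweets_base
tweets_tidy
```

With only three rows the difference is small. With tens of thousands of rows, a plain data frame prints far more rows than a tibble's ten-row preview.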

Determining the number of rows and columns

A tibble reports its dimensions automatically when we call its name. We can also determine these values explicitly if we need them for further calculations in our script.

We can use ncol to get the number of columns and nrow for the rows:

ncol(tweets)

# [1] 22

nrow(tweets)

# [1] 63489

The dim function gives us both at once:

dim(tweets)

# [1] 63489    22

These functions are helpful if we need to know the size of our data frame later on in our analysis. For example, we could calculate the number of cells:

cols <- ncol(tweets)

rows <- nrow(tweets)

cells <- cols * rows

cells

# [1] 1396758

If we prefer the number of rows as a tibble instead of a vector, we can use count:

tweets |> 
  count()
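To make the result of count visible, here is a minimal sketch on a made-up tibble. The name toy_tweets and its values are assumptions for illustration:

```r
library(dplyr)

# Hypothetical toy data standing in for the tweets tibble
toy_tweets <- tibble(
  screen_name = c("lisapaus", "Wissing", "Wissing"),
  retweet_count = c(1, 6, 9)
)

# count() without arguments returns a 1 x 1 tibble
# whose single column is named n
row_count <- toy_tweets |>
  count()

row_count

# pull() turns the column back into a plain vector if needed
row_count |>
  pull(n)
```

Because the result is a tibble, it can be passed on to other tidyverse functions with the pipe.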

We can also use nrow and ncol as columns in a new tibble and create another calculated column based on them:

tibble(
  number_cols = ncol(tweets), 
  number_rows = nrow(tweets),
  number_cells = number_cols * number_rows
)

# # A tibble: 1 x 3
#   number_cols number_rows number_cells
#         <int>       <int>        <int>
# 1          22       63489      1396758

Data Types

The quickest and, in my opinion, the easiest way to get an overview of the data types in a tibble is to use the glimpse function:

glimpse(tweets)

# Or, with the pipe
tweets |> 
  glimpse()
  
# Rows: 63,489
# Columns: 22
# $ id                      <chr> "1642251945732653057", "1642251945732653057", "1642416819003707397", "164241681900370739~
# $ screen_name             <chr> "cem_oezdemir", "cem_oezdemir", "W_Schmidt_", "W_Schmidt_", "lisapaus", "lisapaus", "lis~
# $ text                    <chr> "RT @BriHasselmann: Endlich! Wir sind drangeblieben. Die #Tierhaltungskennzeichnung komm~
# $ retweet_count           <dbl> 23, 23, 10659, 10659, 1, 1, 1, 6, 9, 15, 15, 56, 56, 29, 1, 1, 1, 1, 10, 10, 13, 26, 26,~
# $ favorite_count          <dbl> 0, 0, 0, 0, 3, 3, 11, 25, 93, 154, 154, 448, 448, 0, 1, 2, 2, 2, 58, 58, 103, 266, 266, ~
# ...

In addition to the column names, we also get the data type and a preview of the first values contained in each column. glimpse also tells us the number of rows and columns.

If we notice that a data type has been detected incorrectly, we can either adjust our loading process and set the correct data type while reading, or modify the column afterwards using mutate, which we will learn in the lesson 4 Data Transformation.
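Both options could look roughly like this. The sketch uses a small, made-up CSV file. The column values and the variable names typed, guessed, and fixed are assumptions for illustration:

```r
library(readr)
library(dplyr)

# Hypothetical CSV file in which the id column looks like a number
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,retweet_count",
    "1642251945732653057,23",
    "1642416819003707397,10659"
  ),
  csv_path
)

# Option 1: set the data type while reading.
# Columns not mentioned in cols() are still guessed.
typed <- read_csv(
  csv_path,
  col_types = cols(id = col_character())
)

# Option 2: let readr guess and change a column afterwards.
guessed <- read_csv(csv_path, show_col_types = FALSE)

fixed <- guessed |>
  mutate(retweet_count = as.integer(retweet_count))

glimpse(typed)
glimpse(fixed)
```

For very long ids such as these, the first option is the safer one: a double cannot store all 19 digits exactly, so the original id can no longer be recovered once it has been read as a number.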

Column names

If you're not yet familiar with a dataset, a function that lists the column names can be helpful:

tweets |> 
  colnames()
  
# [1] "id"                      "screen_name"             "text"                    "retweet_count"          
# [5] "favorite_count"          "is_quote_status"         "is_retweet"              "retweeted_status_id"    
# [9] "retweeted_user"          "lang"                    "hashtags"                "urls"            
# ...

This information is also provided by glimpse, as shown above. Using colnames, we can get the column names as a vector, which can be useful for further work.
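As a minimal sketch of such further work, the vector of column names can be filtered and then used to select columns. The toy tibble and the "_count" suffix rule are assumptions for illustration:

```r
library(dplyr)

# Hypothetical toy data standing in for the tweets tibble
toy_tweets <- tibble(
  screen_name = c("lisapaus", "Wissing"),
  retweet_count = c(1, 6),
  favorite_count = c(3, 25)
)

# colnames() returns a character vector
column_names <- toy_tweets |>
  colnames()

# A vector can be filtered like any other vector,
# here: keep only the names ending in "_count"
count_columns <- column_names[endsWith(column_names, "_count")]

count_columns

# all_of() selects exactly the columns named in the vector
toy_tweets |>
  select(all_of(count_columns))
```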

Show samples

To get a first impression of the data, it is often sufficient to display a few rows. This can be done using different functions:

# Show the first few lines (6 by default)
tweets |>
  head()

# Show the first 20 lines
tweets |> 
  head(n = 20)

# Show the last 20 lines
tweets |> 
  tail(n = 20)

# Print more lines
tweets |> 
  print(n = 100)

If you would rather not display the top or bottom rows but a random selection, you can use slice_sample:

# Get 10 random rows
tweets |> 
  slice_sample(n = 10)
  
# Get 1% random rows
tweets |> 
  slice_sample(prop = 0.01)
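A random selection changes with every call. If the same sample should appear again, for example in a report, the random number generator can be seeded first with set.seed from base R. A minimal sketch on made-up data (toy_tweets and the seed value 42 are arbitrary assumptions):

```r
library(dplyr)

# Hypothetical toy data with 100 rows
toy_tweets <- tibble(id = 1:100)

# The same seed before each call yields the same rows
set.seed(42)
first_draw <- toy_tweets |>
  slice_sample(n = 10)

set.seed(42)
second_draw <- toy_tweets |>
  slice_sample(n = 10)

identical(first_draw, second_draw)
```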

Frequencies

We have already learned about count for counting rows above. The function is also suitable for counting rows grouped by a variable in the data:

# How many tweets per screen name?
tweets |> 
  count(screen_name)

We can also sort by frequency:

# How many tweets (1 row = 1 tweet) per screen name, most frequent first?
tweets |> 
  count(screen_name, sort = TRUE)

With the janitor package, we get a function to quickly determine the frequencies, both absolute and percentage, for nominally scaled variables:

library(janitor)

tweets |>
  tabyl(is_retweet)

Here, we need to sort the result ourselves using arrange, which sorts in ascending order by default:

tweets |> 
  tabyl(is_retweet) |> 
  arrange(percent)

If we add a second variable, tabyl creates a cross-tabulation with the absolute frequencies of each combination:

tweets |>
  tabyl(screen_name, is_quote_status)
  
#     screen_name FALSE TRUE
#        ABaerbock   503   10
#  ABaerbockArchiv  2811  397
#    Bundeskanzler   960   70
#        c_lindner  5020  683
#     cem_oezdemir  4627  568
# ...

The skimr package

The skimr package provides the function skim, which calculates helpful summary statistics for each column. If we call the function as is, it prints a large result to the console:

library(skimr)
skim(tweets)

# -- Data Summary ------------------------
#                            Values
# Name                       tweets
# Number of rows             63489 
# Number of columns          22    
# ... hundreds more rows

A nice feature of skim is that it returns a tibble, so we can process the result further and print only the information we are interested in:

# Show only missing values and complete rate for each column 
# and sort by complete rate
tweets |> 
  skim() |> 
  select(skim_variable, n_missing, complete_rate) |> 
  arrange(-complete_rate) |> 
  print(n = 22)
  
# # A tibble: 22 x 3
#    skim_variable           n_missing complete_rate
#    <chr>                       <int>         <dbl>
#  1 created_at                      0         1    
#  2 id                              0         1    
#  3 screen_name                     0         1    
#  4 text                            0         1    
#  5 lang                            0         1    
#  ...
# 21 in_reply_to_status_id       49903         0.214
# 22 quoted_status_id            53792         0.153
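To see what these two columns mean, they can be recomputed by hand. The following sketch does not use skimr. It assumes that complete_rate is the share of non-missing values, which matches the output above (for quoted_status_id, 1 - 53792 / 63489 is about 0.153). The toy tibble and the name missing_overview are made up for illustration:

```r
library(tibble)

# Hypothetical toy data with some missing values
toy_tweets <- tibble(
  id = c("1", "2", "3", "4"),
  quoted_status_id = c(NA, "9", NA, NA)
)

# is.na() marks every missing cell, colSums() counts them per column
missing_overview <- tibble(
  skim_variable = colnames(toy_tweets),
  n_missing = unname(colSums(is.na(toy_tweets))),
  # Assumption: complete rate = share of non-missing values
  complete_rate = 1 - n_missing / nrow(toy_tweets)
)

missing_overview
```

skim reports many more statistics than these two, so this is only meant to make the numbers easier to interpret, not to replace the package.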
