Exploring New Data
You can find the complete example code for this part of the lesson in the GitHub repository here.
Speaking Tibbles
Simply calling the name of a tibble in the console reveals a lot of information about it:
tweets
# # A tibble: 63,489 x 22
# id screen_name text retweet_count favorite_count is_quote_status is_retweet retweeted_status_id retweeted_user
# <chr> <chr> <chr> <dbl> <dbl> <lgl> <lgl> <chr> <chr>
# 1 164225194~ cem_oezdem~ "RT ~ 23 0 FALSE TRUE 1642158717113081856 BriHasselmann
# 2 164225194~ cem_oezdem~ "RT ~ 23 0 FALSE TRUE 1642158717113081856 BriHasselmann
# 3 164241681~ W_Schmidt_ "RT ~ 10659 0 FALSE TRUE 1641976869460275201 aakashg0
# 4 164241681~ W_Schmidt_ "RT ~ 10659 0 FALSE TRUE 1641976869460275201 aakashg0
# 5 164208262~ lisapaus "@Ha~ 1 3 FALSE FALSE NA NA
# 6 164208262~ lisapaus "@Ha~ 1 3 FALSE FALSE NA NA
# 7 164209011~ lisapaus "@an~ 1 11 FALSE FALSE NA NA
# 8 164253274~ Wissing "\U0~ 6 25 FALSE FALSE NA NA
# 9 164253273~ Wissing "Wir~ 9 93 FALSE FALSE NA NA
# 10 164246387~ Wissing "Wes~ 15 154 FALSE FALSE NA NA
# # i 63,479 more rows
# # i 13 more variables: lang <chr>, hashtags <list>, urls <list>, user_mentions <list>, photos <list>, source <chr>,
# # insert_timestamp <chr>, created_at <dttm>, quote_count <dbl>, reply_count <dbl>, in_reply_to_screen_name <chr>,
# # in_reply_to_status_id <chr>, quoted_status_id <chr>
# # i Use `print(n = ...)` to see more rows
In addition to a preview of the first few rows, a tibble also displays the total number of rows and columns. In this case, there are 63,489 rows and 22 columns. Below the preview, there is a comma-separated list of the names and data types of the columns that did not fit into the console. If this list gets too long, it is truncated after a few lines to avoid overloading the console with text.
Try this: instead of using read_csv, load the CSV file with the read.csv function from base R. Now, enter the name of the data frame into the console and press Enter. What is the difference in the output? Which one do you prefer?
Determining the number of rows and columns
A tibble automatically tells us its dimensions when we call its name. We can also determine these values explicitly if we need them for further calculations in our script.
We can use ncol to get the number of columns and nrow for the number of rows:
ncol(tweets)
# [1] 22
nrow(tweets)
# [1] 63489
The dim function gives us both at once:
dim(tweets)
# [1] 63489 22
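Because dim returns an integer vector with the number of rows first and the number of columns second, the individual values can be picked out by position. A minimal sketch, using a small made-up tibble named toy as a stand-in for tweets:

```r
library(tibble)

# Made-up stand-in for the tweets data (3 rows, 2 columns)
toy <- tibble(
  id = c("a", "b", "c"),
  retweet_count = c(23, 10659, 1)
)

dims <- dim(toy)

# Position 1 holds the number of rows, position 2 the number of columns
n_rows <- dims[1]
n_cols <- dims[2]

n_rows
# [1] 3
n_cols
# [1] 2
```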
These functions are helpful if we need to know the size of our data frame later on in our analysis. For example, we could calculate the number of cells:
cols <- ncol(tweets)
rows <- nrow(tweets)
cells <- cols * rows
cells
# [1] 1396758
If we prefer the number of rows as a tibble instead of a vector, we can use count:
tweets |>
  count()
We can also use nrow and ncol as columns in a new tibble and create another calculated column based on them:
tibble(
  number_cols = ncol(tweets),
  number_rows = nrow(tweets),
  number_cells = number_cols * number_rows
)
# # A tibble: 1 x 3
# number_cols number_rows number_cells
# <int> <int> <int>
# 1 22 63489 1396758
Data Types
The quickest and, in my opinion, easiest way to get an overview of the data types in a tibble is the glimpse function:
glimpse(tweets)
# Or, with the pipe
tweets |>
  glimpse()
# Rows: 63,489
# Columns: 22
# $ id <chr> "1642251945732653057", "1642251945732653057", "1642416819003707397", "164241681900370739~
# $ screen_name <chr> "cem_oezdemir", "cem_oezdemir", "W_Schmidt_", "W_Schmidt_", "lisapaus", "lisapaus", "lis~
# $ text <chr> "RT @BriHasselmann: Endlich! Wir sind drangeblieben. Die #Tierhaltungskennzeichnung komm~
# $ retweet_count <dbl> 23, 23, 10659, 10659, 1, 1, 1, 6, 9, 15, 15, 56, 56, 29, 1, 1, 1, 1, 10, 10, 13, 26, 26,~
# $ favorite_count <dbl> 0, 0, 0, 0, 3, 3, 11, 25, 93, 154, 154, 448, 448, 0, 1, 2, 2, 2, 58, 58, 103, 266, 266, ~
# ...
In addition to the column names, we also get the data type and a preview of the first values contained in each column. glimpse also tells us the number of rows and columns.
If we notice that a data type has been detected incorrectly, we can either adjust our loading process and correct the data type directly while reading, or modify the column afterwards using mutate, which we will cover in lesson 4, Data Transformation.
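Correcting a type while reading could look roughly like the following sketch. It assumes the readr package; the two-line CSV file is made up for illustration and only borrows two column names from the tweets data. Declaring id as character matters here because IDs this long cannot be represented exactly as a number.

```r
library(readr)

# Made-up two-line CSV written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,retweet_count", "1642251945732653057,23"), csv_path)

# Declare the column types explicitly instead of relying on type guessing
typed <- read_csv(
  csv_path,
  col_types = cols(
    id = col_character(),
    retweet_count = col_double()
  )
)

class(typed$id)
# [1] "character"
```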
Column names
If you're not yet familiar with a dataset, a function that lists the column names can be helpful:
tweets |>
  colnames()
# [1] "id" "screen_name" "text" "retweet_count"
# [5] "favorite_count" "is_quote_status" "is_retweet" "retweeted_status_id"
# [9] "retweeted_user" "lang" "hashtags" "urls"
# ...
This information is also provided by glimpse, as shown above. Using colnames, we get the column names as a character vector, which can be useful for further work.
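As a small illustration of such further work, the vector of names can be searched with base R tools. The sketch below uses a made-up one-row tibble named toy that borrows a few column names from the tweets data:

```r
library(tibble)

# Made-up stand-in with a few of the column names from the tweets data
toy <- tibble(
  id = "1",
  retweet_count = 23,
  favorite_count = 0,
  lang = "de"
)

cols <- colnames(toy)

# Check whether a particular column exists
has_lang <- "lang" %in% cols

# Find all columns whose name ends in "_count"
count_cols <- grep("_count$", cols, value = TRUE)
count_cols
# [1] "retweet_count"  "favorite_count"
```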
Show samples
To get a first impression of the data, it is often sufficient to display a few rows. This can be done using different functions:
# Show the first few lines (6 by default)
tweets |>
  head()
# Show the first 20 lines
tweets |>
  head(n = 20)
# Show the last 20 lines
tweets |>
  tail(n = 20)
# Print more lines
tweets |>
  print(n = 100)
If you would rather display a random selection instead of the top or bottom rows, you can use slice_sample:
# Get 10 random rows
tweets |>
  slice_sample(n = 10)
# Get a random 1% of the rows
tweets |>
  slice_sample(prop = 0.01)
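A random sample changes with every call. If the same rows should come back each time, for example so that a script produces the same output on every run, the random number generator can be seeded with set.seed from base R beforehand. A minimal sketch with a made-up tibble named toy; the seed value 42 is arbitrary:

```r
library(dplyr)

# Made-up stand-in for the tweets data
toy <- tibble(row_id = 1:100)

# Seeding the random number generator makes the draw repeatable
set.seed(42)
first_draw <- toy |>
  slice_sample(n = 10)

set.seed(42)
second_draw <- toy |>
  slice_sample(n = 10)

# Both draws contain the same rows in the same order
identical(first_draw$row_id, second_draw$row_id)
# [1] TRUE
```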
Frequencies
We have already used count above to count all rows. The function is also suitable for counting rows grouped by a variable in the data:
# How many tweets per screen name?
tweets |>
  count(screen_name)
We can also sort by frequency:
# How many tweets (1 row = 1 tweet) per screen name, most frequent first?
tweets |>
  count(screen_name, sort = TRUE)
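It can help to read count as a shorthand for grouping, summarising, and sorting. The following sketch spells out the longer form next to the short one, using a made-up tibble named toy with a handful of screen names:

```r
library(dplyr)

# Made-up stand-in for the tweets data
toy <- tibble(
  screen_name = c(
    "Wissing", "lisapaus", "Wissing",
    "Wissing", "lisapaus", "W_Schmidt_"
  )
)

# Short form
short_form <- toy |>
  count(screen_name, sort = TRUE)

# Longer form that produces the same counts
long_form <- toy |>
  group_by(screen_name) |>
  summarise(n = n()) |>
  arrange(desc(n))

short_form
```

Both results list Wissing with 3 rows, lisapaus with 2, and W_Schmidt_ with 1.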
The janitor package provides the tabyl function, which quickly determines both the absolute and the relative frequencies of nominally scaled variables:
library(janitor)
tweets |>
  tabyl(is_retweet)
Here, we need to sort manually using arrange:
tweets |>
  tabyl(is_retweet) |>
  arrange(percent)
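arrange sorts in ascending order by default. To put the most frequent value first, the column can be wrapped in desc from dplyr. A minimal sketch with a made-up tibble named toy:

```r
library(dplyr)
library(janitor)

# Made-up stand-in for the tweets data: one retweet, three original tweets
toy <- tibble(is_retweet = c(TRUE, FALSE, FALSE, FALSE))

# Most frequent value first
freq <- toy |>
  tabyl(is_retweet) |>
  arrange(desc(percent))

freq
```

The percent column holds proportions between 0 and 1, so in this toy example FALSE comes first with 0.75, followed by TRUE with 0.25.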
If we add a second variable, tabyl creates a cross-tabulation with the absolute frequencies of each combination:
tweets |>
  tabyl(screen_name, is_quote_status)
# screen_name FALSE TRUE
# ABaerbock 503 10
# ABaerbockArchiv 2811 397
# Bundeskanzler 960 70
# c_lindner 5020 683
# cem_oezdemir 4627 568
# ...
The skimr package
The skimr package provides us with the function skim, which calculates some helpful statistics for each column. If we call the function as is, it prints a large result to the console:
library(skimr)
skim(tweets)
# -- Data Summary ------------------------
# Values
# Name tweets
# Number of rows 63489
# Number of columns 22
# ... hundreds more rows
A nice feature of skim is that it returns a tibble, which we can process further to print only the information we are interested in:
# Show only missing values and complete rate for each column
# and sort by complete rate
tweets |>
  skim() |>
  select(skim_variable, n_missing, complete_rate) |>
  arrange(-complete_rate) |>
  print(n = 22)
# # A tibble: 22 x 3
# skim_variable n_missing complete_rate
# <chr> <int> <dbl>
# 1 created_at 0 1
# 2 id 0 1
# 3 screen_name 0 1
# 4 text 0 1
# 5 lang 0 1
# ...
# 21 in_reply_to_status_id 49903 0.214
# 22 quoted_status_id 53792 0.153
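To see what these two columns mean, the same figures can be computed by hand: n_missing is the number of NA values in a column, and complete_rate is the share of rows that are not missing. The sketch below does this with dplyr and tidyr instead of skimr, on a made-up tibble named toy; it is meant as an illustration of the numbers, not as a description of how skim works internally.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the tweets data: 4 rows, 3 of them without a quoted status
toy <- tibble(
  id = c("1", "2", "3", "4"),
  quoted_status_id = c(NA, "9", NA, NA)
)

missing_overview <- toy |>
  # Count the NA values in every column
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  # One row per column instead of one column per column
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  # Share of rows that are not missing
  mutate(complete_rate = 1 - n_missing / nrow(toy)) |>
  arrange(desc(complete_rate))

missing_overview
```

In this toy example, id has a complete rate of 1 and quoted_status_id a complete rate of 0.25.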