You can find the complete example code for this part of the lesson in the GitHub repository.
Speaking Tibbles
Simply typing the name of a tibble in the console reveals a lot of information about it:
tweets
# # A tibble: 63,489 x 22
# id screen_name text retweet_count favorite_count is_quote_status is_retweet retweeted_status_id retweeted_user
# <chr> <chr> <chr> <dbl> <dbl> <lgl> <lgl> <chr> <chr>
# 1 164225194~ cem_oezdem~ "RT ~ 23 0 FALSE TRUE 1642158717113081856 BriHasselmann
# 2 164225194~ cem_oezdem~ "RT ~ 23 0 FALSE TRUE 1642158717113081856 BriHasselmann
# 3 164241681~ W_Schmidt_ "RT ~ 10659 0 FALSE TRUE 1641976869460275201 aakashg0
# 4 164241681~ W_Schmidt_ "RT ~ 10659 0 FALSE TRUE 1641976869460275201 aakashg0
# 5 164208262~ lisapaus "@Ha~ 1 3 FALSE FALSE NA NA
# 6 164208262~ lisapaus "@Ha~ 1 3 FALSE FALSE NA NA
# 7 164209011~ lisapaus "@an~ 1 11 FALSE FALSE NA NA
# 8 164253274~ Wissing "\U0~ 6 25 FALSE FALSE NA NA
# 9 164253273~ Wissing "Wir~ 9 93 FALSE FALSE NA NA
# 10 164246387~ Wissing "Wes~ 15 154 FALSE FALSE NA NA
# # i 63,479 more rows
# # i 13 more variables: lang <chr>, hashtags <list>, urls <list>, user_mentions <list>, photos <list>, source <chr>,
# # insert_timestamp <chr>, created_at <dttm>, quote_count <dbl>, reply_count <dbl>, in_reply_to_screen_name <chr>,
# # in_reply_to_status_id <chr>, quoted_status_id <chr>
# # i Use `print(n = ...)` to see more rows
In addition to the preview of the first few rows, a tibble also displays the total number of rows and columns. In this case, there are 63,489 rows and 22 columns. Columns that do not fit into the width of the console are listed below the preview as a comma-separated list of column names and their data types. This list is truncated after a few lines to avoid overloading the console with text.
Try this: instead of using read_csv, load the CSV file with the read.csv function from base R. Now, enter the name of the data frame into the console and press Enter. What is the difference in the output? Which one do you prefer?
Determining the number of rows and columns
A tibble automatically shows its dimensions when we print it by typing its name. We can also determine these values explicitly if we need them for further calculations in our script.
We can use ncol to get the number of columns and nrow for the rows:
ncol(tweets)
# [1] 22
nrow(tweets)
# [1] 63489
The dim function gives us both at once:
dim(tweets)
# [1] 63489 22
These functions are helpful if we need to know the size of our data frame later on in our analysis. For example, we could calculate the number of cells:
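A minimal sketch of this calculation follows. Since the full tweets data is not reproduced here, it uses a small stand-in tibble; the name tweets_demo and its values are hypothetical and serve only as an illustration.

```r
library(tibble)

# Hypothetical stand-in for the `tweets` tibble (3 rows, 3 columns)
tweets_demo <- tibble(
  screen_name = c("lisapaus", "Wissing", "Wissing"),
  retweet_count = c(1, 6, 9),
  is_retweet = c(FALSE, FALSE, FALSE)
)

# Number of cells = number of rows * number of columns
n_cells <- nrow(tweets_demo) * ncol(tweets_demo)
n_cells

# Equivalent: multiply both entries of dim()
prod(dim(tweets_demo))
```

Applied to the full tweets tibble, the same expression would give 63489 * 22 = 1,396,758 cells.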
The glimpse function prints a compact overview with one line per column. In addition to the column names, we also get the data type and a preview of the first values contained in each column. glimpse also tells us the number of rows and columns.
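A call to glimpse could look like the following sketch. The tibble tweets_demo is a hypothetical stand-in for tweets, with made-up id values, so the printed overview will be much shorter than for the real data.

```r
library(dplyr)  # provides glimpse() and tibble()

# Hypothetical stand-in for the `tweets` tibble
tweets_demo <- tibble(
  id = c("1001", "1002", "1003"),
  screen_name = c("lisapaus", "Wissing", "Wissing"),
  retweet_count = c(1, 6, 9),
  is_retweet = c(FALSE, FALSE, FALSE)
)

# One line per column: name, data type, and the first values
glimpse(tweets_demo)
```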
If we notice that a data type has been incorrectly detected, we can adjust our loading process and set the correct data type directly while reading. Alternatively, we can modify the column afterwards using mutate, which we will cover in lesson 4, Data Transformation.
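As a sketch of the first option, read_csv accepts a col_types argument that sets the type of individual columns while reading. The example writes a tiny temporary CSV file so that it runs on its own; the file, the chosen columns, and the object name tweets_typed are assumptions for illustration and not the lesson's actual loading code.

```r
library(readr)

# Write a tiny demo file so the example is self-contained
# (hypothetical data, not the lesson's CSV file)
demo_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "retweeted_status_id,retweet_count",
    "1642158717113081856,23",
    "1641976869460275201,10659"
  ),
  demo_file
)

# Long numeric IDs would typically be guessed as numbers,
# so we explicitly request a character column instead
tweets_typed <- read_csv(
  demo_file,
  col_types = cols(
    retweeted_status_id = col_character(),
    retweet_count = col_integer()
  )
)

tweets_typed
```

Columns that are not mentioned in cols() are still guessed by readr, so only the problematic columns need to be listed.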
Column names
If you're not yet familiar with a dataset, a function that lists the column names can be helpful. The colnames function returns the column names as a character vector, which can be useful for further work.
This information is also provided by glimpse, together with the data types.
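The following sketch shows colnames and two simple ways to work with the resulting vector. As before, tweets_demo is a hypothetical stand-in for the tweets tibble.

```r
library(tibble)

# Hypothetical stand-in for the `tweets` tibble
tweets_demo <- tibble(
  screen_name = c("lisapaus", "Wissing"),
  is_quote_status = c(FALSE, FALSE),
  is_retweet = c(FALSE, FALSE)
)

# Column names as a character vector
column_names <- colnames(tweets_demo)
column_names

# Check whether a certain column exists
"screen_name" %in% column_names

# Find all columns whose name starts with "is_"
flag_columns <- column_names[startsWith(column_names, "is_")]
flag_columns
```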
Show samples
To get a first impression of the data, it is often sufficient to display a few rows. This can be done using different functions:
# Show the first few rows (6 by default)
tweets |>
  head()
# Show the first 20 rows
tweets |>
  head(n = 20)
# Show the last 20 rows
tweets |>
  tail(n = 20)
# Print more rows
tweets |>
  print(n = 100)
If you would rather not display the top or bottom rows but a random selection, you can use slice_sample:
# Get 10 random rows
tweets |>
  slice_sample(n = 10)
# Get a random 1% of the rows
tweets |>
  slice_sample(prop = 0.01)
Frequencies
We have already learned about count for counting rows above. The function is also suitable for counting rows grouped by a variable in the data:
# How many tweets per screen name?
tweets |>
  count(screen_name)
We can also sort by frequency:
# How many tweets (1 row = 1 tweet) per screen name, most frequent first?
tweets |>
  count(screen_name, sort = TRUE)
With the janitor package, we get a function to quickly determine the frequencies, both absolute and percentage, for nominally scaled variables:
library(janitor)
tweets |>
  tabyl(is_retweet)
Here, we need to sort manually using arrange:
tweets |>
  tabyl(is_retweet) |>
  arrange(percent)
If we add a second variable, tabyl creates a cross-tabulation with the absolute frequencies of each combination:
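A two-way tabyl could look like the following sketch. It uses a small hypothetical tibble (tweets_demo with invented values) in place of tweets, and is_quote_status is simply one plausible choice for the second variable.

```r
library(janitor)
library(tibble)

# Hypothetical stand-in for the `tweets` tibble
tweets_demo <- tibble(
  is_retweet = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  is_quote_status = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Rows: values of is_retweet, columns: values of is_quote_status,
# cells: number of tweets with that combination
cross_tab <- tweets_demo |>
  tabyl(is_retweet, is_quote_status)

cross_tab
```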
The skimr package provides the function skim, which calculates some helpful summary statistics for each column. If we call the function as is, it prints a large result to the console:
library(skimr)
skim(tweets)
# -- Data Summary ------------------------
# Values
# Name tweets
# Number of rows 63489
# Number of columns 22
# ... hundreds more rows
A nice feature of skim is that it returns a tibble, so we can process the result further and print only the information we are interested in:
# Show only the number of missing values and the complete rate
# for each column, sorted by complete rate (highest first)
tweets |>
  skim() |>
  select(skim_variable, n_missing, complete_rate) |>
  arrange(desc(complete_rate)) |>
  print(n = 22)
# # A tibble: 22 x 3
# skim_variable n_missing complete_rate
# <chr> <int> <dbl>
# 1 created_at 0 1
# 2 id 0 1
# 3 screen_name 0 1
# 4 text 0 1
# 5 lang 0 1
# ...
# 21 in_reply_to_status_id 49903 0.214
# 22 quoted_status_id 53792 0.153