Tabular Data

In the first step of any data analysis, we need to provide the dataset to our computer. This is also known as loading the dataset. To accomplish this, we need to tell the computer where to find the data and instruct it to load the data into its memory. For fast access to the data in our analysis, we store a reference to the data on a variable.

Data is usually stored in the form of files, and in some cases, it can be stored in a database. If it's stored in a file, it can be in various formats. One of the most commonly used formats is the CSV format, which is based on plain text files. Other common formats for structured data include spreadsheets like Excel or Google Sheets. R even has its proprietary format called the R-Data Source (RDS).

In R, we store structured data in so-called data frames or tibbles in the context of the Tidyverse.

Data Frames and Tibbles

A data frame is a fundamental data structure in R, used for storing and manipulating tabular data. It is essentially a two-dimensional table where each column represents a variable and each row represents an observation. Data frames can store different types of data, such as numerical, categorical, and character data.

Data frames are used in various data analysis tasks, including data cleaning, transformation, and visualization. They provide a convenient and familiar format for working with data, making it easier for users to perform operations on the data and draw insights from it.

A tibble is a modern take on the classic data frame, introduced by the Tidyverse. Tibbles are an enhanced version of data frames, providing several improvements and additional features that make them more user-friendly and efficient for data analysis.

Tibbles provide an improved printing method compared to traditional data frames, making it easier to view and understand the contents of the dataset. Here are some key aspects in which Tibbles offer better printing:

Compact display: When printing a tibble, only the first 10 rows and the columns that fit on the screen are displayed by default. This ensures that you get a concise view of the data without being overwhelmed by many rows and columns, which can happen when printing large data frames.
Truncation: If a column contains long text or character values, Tibbles will automatically truncate the displayed content to fit the screen width, ensuring that the output remains readable and well-organized.
Column data type information: Tibbles include the data type of each column below the column name when printed. This makes it easier to understand the structure of the dataset at a glance and can help identify potential issues related to data types.
Column width adjustment: Tibbles automatically adjust the width of each column based on the content, ensuring that the printed output is visually appealing and easy to read. This is particularly useful when working with datasets containing columns with varying lengths of data.
Color coding: Tibbles use color coding to differentiate between different parts of the printed output, such as column names, data types, and data values. This enhances the readability of the output, making it easier to interpret the dataset.

These improvements in printing make Tibbles a more user-friendly option for working with tabular data in R. Next to better printing, there are some more advantages:

Column data types: Tibbles are stricter when it comes to preserving the data types of columns, resulting in fewer unexpected data type conversions. This makes working with Tibbles more predictable and reduces the chances of errors due to data type inconsistencies.
Non-syntactic column names: Tibbles can handle column names with spaces, special characters, and reserved words without any issues. This makes it easier to work with datasets that have unconventional column names.
Subsetting: When you subset a tibble using single brackets ([ ]), it always returns another tibble, even if the result is a single column. In contrast, subsetting a data frame with a single column using single brackets can return a vector instead of a data frame. This consistent behavior of Tibbles makes them more reliable for subsetting operations.
Lazy evaluation: Tibbles are built with a "lazy" approach to evaluation, meaning that they only evaluate expressions when absolutely necessary. This can lead to improved performance, especially when working with large datasets.
Enhanced compatibility with Tidyverse packages: Since Tibbles are part of the Tidyverse, they are designed to work seamlessly with other Tidyverse packages, such as dplyr and ggplot2. This ensures that you can easily take advantage of the full range of Tidyverse tools when working with Tibbles.

Loading CSV Files with `readr`

To load datasets from CSV files, the tidyverse package provides a package called readr. This package is automatically loaded with the tidyverse. The package provides the function read_csv for CSV files that use a comma as a separator:

orders <- read_csv("data/campusbier/orders.csv")

In some European countries, a semicolon is commonly used as a separator and a comma as a decimal separator (as in Germany). In this case, readr provides the function read_csv2. Alternatively, the function read_delim can be used, and the separator can be manually set using the delim parameter. Learn more about the different read_ functions in the online documentation:

Read a delimited file (including CSV and TSV) into a tibble — read_delim

The base R also offers a similar function for this specific use case. It's called read.csv, but pay attention to the detail: unlike the underscore (_) used in the Tidyverse package, a period (.) is used in the base R function between the two words read and csv. If you are working with the Tidyverse and tibbles (as we do throughout this course), make sure to always use the read_csv function provided by readr, as it returns data as a tibble and also provides some useful additional features.

Loading only specific columns

The read_csv function allows for selecting specific columns while loading the data. I recommend using the select function, which we will cover in the lesson on 4 Data Transformation. However, I'd like to briefly demonstrate how to filter columns directly while loading.

The following code loads the columns order_id and name, as well as all columns whose name starts with "customer":

orders <- read_csv("data/campusbier/orders.csv",
                   col_select = c(order_id, name, starts_with("customer")))

As a result, the resulting Tibble is much narrower and only contains the desired columns:

colnames(orders)

# [1] "order_id"                             
# [2] "name"                                 
# [3] "customer_id"                          
# [4] "customer_accepts_marketing"           
# [5] "customer_accepts_marketing_updated_at"
# [6] "customer_marketing_opt_in_level"
# ...

If you only have one filter condition, you can skip using the c function:

orders <- read_csv("data/campusbier/orders.csv") |>
  select(starts_with("shipping"))

You will learn more about selecting columns with dplyr's select command in Select Columns.

Setting the decimal separator

If we want to change only the decimal separator, for example because we exported data from a German system that uses a comma, we can do so using the locale parameter. In the example below, we also use the pipe symbol | as a separator. Both settings can be adjusted using read_delim:

# Records in hypothetical file sales.csv
# year,month,turnover
# 2022|01|2700,85
# 2022|02|2910,10
# 2022|03|1802,37

turnover <- read_delim("sales.csv", 
                       delim = "|", 
                       locale = locale(decimal_mark = ",")
                       )

The result:

turnover

# # A tibble: 3 x 3
#    year month turnover
#   <dbl> <chr>    <dbl>
# 1  2022 01       2701.
# 2  2022 02       2910.
# 3  2022 03       1802.

Setting data types while loading

When loading data with read_csv, some columns' data type may be incorrectly detected. The function examines the first few rows and determines the data type based on them. If it's detected incorrectly, we can explicitly specify the data type for the columns. Similar to selecting columns, we can also specify the data type while loading. Alternatively, we can modify the data type later using mutate.

Sometimes, increasing the value of the guess_max parameter can help. This parameter determines how many values the function considers determining a column's data type. By default, it is limited to 1000 values.

In our orders.csv dataset, some data types have been incorrectly recognized. For example, the order_id column is recognized as a double data type. Even though all values can be considered decimal numbers, an order number is not a number in the sense that it should be used in calculations. It is rather a string that conventionally only consists of numbers (but doesn't have to). The character data type would be more appropriate. We can change it as follows:

orders <- read_csv("data/campusbier/orders.csv",
                   col_types = list("order_id" = col_character()))

# The column order_id is now "chr"
orders |> 
  select(order_id)

We can do the same thing for multiple columns simultaneously. In fact, the two columns with zip codes have also been incorrectly recognized as double:

orders <- read_csv("data/campusbier/orders.csv",
                   col_types = list(
                     "order_id" = col_character(),
                     "app_id" = col_character(),
                     "billing_address_zip"= col_character(),
                     "shipping_address_zip"= col_character()
                   )
)

Loading data from a web server

The CSV file doesn't have to be stored locally on your computer; instead, read_csv can retrieve it directly from a web server via the HTTP protocol. In this case, you pass the URL to the function instead of the local file name. The code below loads the daily updated version of the Covid-19 dataset hosted on the servers of Our World in Data:

covid <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

Loading Excel Files with `readxl`

Read the section "21.2 Excel" from the book "R for Data Science" by Hadley Wickham and Garret Grolemund.

Loading Google Sheets

Read the section "21.3 Google Sheets" from the book "R for Data Science" by Hadley Wickham and Garret Grolemund.

Loading R-Data Source Files (RDS)

Reading data from the internal R-format is a simple as that:

tweets <- readRDS("data/tweets.rds")

Previous3 Data Loading NextTidy Data

Last updated 2 years ago