The WordLens-Project

4 Data Transformation

For data with a clear structure, there is a set of five transformation techniques we need to master. In this lesson, we'll introduce them step by step.


Last updated 2 years ago

Data is the new oil, at least according to the mathematician Clive Humby:

“Data is the new oil. Like oil, data is valuable, but if unrefined, it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity. So, must data be broken down, analysed for it to have value.”

If we take this analogy seriously, data, like oil, needs to be refined and turned into something of value. Two important tools for refining data into a valuable output are data transformation and data visualization, both of which are the main focus of this book. In this part of the book, we first need to learn how to transform data from one form into another so that we can apply visualization techniques later on.

To master data transformation, we have to master five basic transformation techniques that work on structured data. We always start with a given data frame that we want to change into something else. In doing so, we typically want to do one of the following:

  1. Remove variables we don’t currently need (or specify those we do need). You will learn how to do this in Select Columns.

  2. Remove any records we don’t currently need (or specify those we do need). We'll introduce ways to do this in Filter Rows.

  3. Change the order of the records. That's an easy one covered in Sort Rows.

  4. Add new variables we require, but that don’t exist yet. We'll learn the general techniques and look at some examples in Add Or Change Columns.

  5. Summarize many records into one or a few numbers. That's what data analysis is all about, and in the lesson Summarize Rows we'll look at concrete examples.
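To make the five techniques concrete before the individual lessons, here is a minimal sketch in R. It assumes the dplyr package from the Tidyverse is installed and uses its verbs `select()`, `filter()`, `arrange()`, `mutate()`, and `summarize()`; the text above does not name specific functions, so this mapping is an assumption. The `books` data frame and its columns are hypothetical and were invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data frame (not part of the course material)
books <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("fiction", "fiction", "science", "science"),
  pages = c(300, 150, 420, 200),
  price = c(12, 8, 30, 15)
)

# 1. Select columns: keep only the variables we currently need
selected <- books %>% select(title, pages)

# 2. Filter rows: keep only the records that match a condition
long_books <- books %>% filter(pages > 250)

# 3. Sort rows: order the records by page count, longest first
sorted <- books %>% arrange(desc(pages))

# 4. Add or change columns: compute a new variable from existing ones
with_ratio <- books %>% mutate(price_per_page = price / pages)

# 5. Summarize rows: collapse many records into a few numbers per group
summary_by_genre <- books %>%
  group_by(genre) %>%
  summarize(mean_pages = mean(pages), n_books = n())
```

Each step takes a data frame as input and returns a new data frame, which is why the steps can be chained into a single pipeline once the individual techniques are familiar.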

The following figure illustrates the schematic working of each of the five transformations:

The five types of transformations visually explained.