The WordLens-Project

6 Unstructured Data


Last updated 2 years ago

We generate an enormous amount of data every day, from sources such as social media, e-commerce, and healthcare. The term "Big Data" refers to datasets so large, complex, and diverse that they cannot be processed or analyzed with traditional data processing methods. Analyzing them and extracting useful insights requires more advanced tools and techniques. Let's look more closely at the concept of Big Data and its main aspects.

Structured vs. Unstructured Data

Data can be broadly categorized into two types: structured and unstructured. Structured data is organized in a pre-defined format, usually rows and columns, which makes it easy to query and analyze with traditional data analysis tools. Data stored in a database, an Excel sheet, or a CSV file is structured data. You learned how to work with this kind of data in Part 1 of this course.
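To make "easily queried and analyzed" concrete, here is a minimal sketch in Python that filters, groups, and counts the rows of a small CSV table. The column names and values are made up for illustration, and only the standard library is used.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = """order_id,country,amount
1,DE,20.50
2,FR,5.00
3,DE,12.00
4,US,99.90
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(CSV_TEXT)

# Filter: keep orders above 10 (CSV values are read as text, so convert first).
large_orders = [row for row in rows if float(row["amount"]) > 10]

# Group and count: number of large orders per country.
orders_per_country = Counter(row["country"] for row in large_orders)

print(orders_per_country)
```

Because every row has the same named columns, each step is a single line. Nothing comparable is possible on raw text until a structure has been derived from it.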

Unstructured data, on the other hand, does not follow a pre-defined format and is not organized in a particular way. It comes in forms such as text, images, and videos, often from sources like social media, which makes it challenging to process and analyze. Text is a common type of unstructured data; it includes email messages, documents, social media posts, news articles, customer reviews, and more.

The Challenge of Unstructured Data

Unstructured data poses a significant challenge to organizations because it cannot be analyzed directly with traditional data analysis techniques. To make it meaningful and useful, we first have to impose a structure on it. For text, one way to do this is Natural Language Processing (NLP), a branch of Artificial Intelligence that enables machines to process human language and make sense of unstructured text.

For example, a company that wants to analyze customer reviews to improve its products and services can use NLP to extract the sentiment, keywords, and themes from the text data. This analysis can provide valuable insights into customer behavior and preferences, allowing the company to make data-driven decisions.
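The review example can be sketched with a few simple rules. The following Python snippet is an illustration only: the word lists, the reviews, and the function names are hypothetical, and a real analysis would use far larger lexicons or a trained model.

```python
import re
from collections import Counter

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"great", "love", "fast", "excellent"}
NEGATIVE = {"broken", "slow", "bad", "poor"}
STOP_WORDS = {"the", "a", "is", "was", "and", "but", "i", "it", "this", "very"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Count positive words minus negative words."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

def top_keywords(texts, n=3):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(
        token
        for text in texts
        for token in tokenize(text)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)

reviews = [
    "I love this phone, the battery is great",
    "The battery was broken and the delivery was slow",
    "Fast delivery but the screen is bad",
]

# One structured row per review: an id and a numeric sentiment score.
review_table = [
    {"review_id": i, "sentiment": sentiment_score(text)}
    for i, text in enumerate(reviews, start=1)
]

print(review_table)
print(top_keywords(reviews, n=2))
```

The result is a small table with one row per review, which can then be filtered, sorted, and summarized like any other structured data.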

The 3Vs of Big Data

The three Vs of Big Data (Volume, Velocity, and Variety) highlight the challenges of processing and analyzing Big Data.

Volume refers to the sheer amount of data generated every day. With the rise of social media, e-commerce, and the Internet of Things (IoT), we are generating more data than ever before.

Velocity refers to the speed at which data is generated and needs to be analyzed. Real-time data processing is critical in many industries such as finance, healthcare, and retail, where decisions need to be made quickly based on the latest data.

Variety refers to the diverse sources and formats of Big Data. Data can be generated in various forms, such as text, audio, video, images, and more. This diversity makes it challenging to analyze and process Big Data using traditional data processing methods.

Solving the Challenges of Big Data

To address the challenges of Big Data, several advanced technologies can be used. One is distributed data storage such as the Hadoop Distributed File System (HDFS) or Amazon S3, which spreads large amounts of data across multiple servers and makes it easier to manage and process.

Distributed data processing frameworks such as Hadoop MapReduce or Apache Spark can process data in parallel across multiple servers, which reduces processing time.
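The idea behind parallel processing can be sketched on a single machine. The following Python snippet is not Hadoop or Spark code; it only imitates the map-and-reduce pattern with a thread pool standing in for the servers, and all function names are made up for illustration. It shows the pattern and is not meant to demonstrate a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Map step: count the words in one chunk of documents."""
    counts = Counter()
    for document in chunk:
        counts.update(document.lower().split())
    return counts

def merge_counts(partial_counts):
    """Reduce step: combine the partial counts into one total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def parallel_word_count(documents, n_workers=2):
    """Split the documents into chunks, count each chunk, merge the results."""
    # Every n-th document goes to the same worker.
    chunks = [documents[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return merge_counts(partial_counts)

documents = ["big data is big", "data needs structure", "text is data"]
word_counts = parallel_word_count(documents)

print(word_counts.most_common(3))
```

Each worker only sees its own chunk, and the partial results are small enough to merge in one place. Distributed frameworks apply the same split, process, and merge idea across many machines.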

Additionally, Machine Learning can be used to recognize patterns in the data and provide valuable insights that would be difficult to detect using traditional data processing methods. These technologies, when used together, can help organizations to efficiently process and analyze Big Data, leading to better decision-making and improved business outcomes.

[Figure: Structured data is a clear target for filtering, grouping, counting, etc. Unstructured data like text is not.]
[Figure: Examples of unstructured data and methods we can apply to derive structured data from it.]
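One common way to derive structured data from text is to split it into tokens and store one token per row. The sketch below, in Python with made-up document ids and texts, shows the table shape this produces; the function name is hypothetical.

```python
import re
from collections import Counter

def tokens_to_rows(documents):
    """Turn each text into rows of (doc_id, position, token)."""
    rows = []
    for doc_id, text in documents.items():
        tokens = re.findall(r"\w+", text.lower())
        for position, token in enumerate(tokens, start=1):
            rows.append({"doc_id": doc_id, "position": position, "token": token})
    return rows

# Hypothetical input: two short texts keyed by a document id.
documents = {
    "mail_1": "Please send the invoice.",
    "post_7": "Invoice received, thanks!",
}

rows = tokens_to_rows(documents)

# The token table can now be filtered, grouped, and counted.
token_counts = Counter(row["token"] for row in rows)

for row in rows:
    print(row)
```

Once the text is in this shape, the tools from Part 1 apply again: each token is a row, and the document id and position are ordinary columns.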