Data Preparation [01]: Data Collection Considerations

Google Data Analytics: Prepare Data for Exploration - Week 1 & Week 2


Data Collection Considerations

  • Data types
    • Time frame
    • String
    • Numeric
    • Boolean
    • Ordinal
  • Data sources
    • First-party data: Collected by oneself.
    • Second-party data: Collected directly by another group and then sold.
    • Third-party data: Sold by a provider that didn’t collect the data itself; the data might come from a number of different sources.
  • Data collection method
    • Interviews
    • Observations: Commonly used by scientists
    • Forms
    • Questionnaires
    • Surveys
    • Cookies: Web data collection
  • Data volume
    • Population
    • Sample



Data Format and Structures

Data Format


Structured vs Unstructured



Data Modeling Levels and Techniques

Data Modeling Levels

  • Conceptual data modeling
    • Includes the important entities and the relationships among them.
    • No attribute is specified.
    • No primary key is specified.
  • Logical data modeling
    • Includes all entities and relationships among them.
    • All attributes for each entity are specified.
    • The primary key for each entity is specified.
    • Foreign keys (keys identifying the relationship between different entities) are specified.
    • Normalization occurs at this level.
  • Physical data modeling
    • Specifies all tables and columns.
      • Entity names are now table names.
      • Attributes are now column names.
      • Data type for each column is specified.
    • Foreign keys are used to identify relationships between tables.
    • Denormalization may occur based on user requirements.
    • Physical considerations may cause the physical data model to be quite different from the logical data model.
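As a minimal sketch of what the physical level pins down (table names, column names, data types, primary keys, and foreign keys), the following uses Python's built-in sqlite3 module. The customer and purchase tables, their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical physical model: entities become tables, attributes become
# typed columns, and the relationship becomes a foreign key.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO purchase VALUES (10, 1, 19.99)")

# A purchase that points at a non-existent customer should be rejected.
try:
    conn.execute("INSERT INTO purchase VALUES (11, 99, 5.00)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# The foreign key is what lets the two tables be joined.
rows = conn.execute(
    "SELECT c.name, p.amount "
    "FROM customer c JOIN purchase p ON p.customer_id = c.customer_id"
).fetchall()
print(fk_enforced, rows)
```

A conceptual model of the same example would only say "a customer makes purchases"; the column types and key constraints appear at the logical and physical levels.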

Data Modeling Techniques

  • Entity Relationship Diagrams (ERD)
    • Entities representing objects (or tables in a relational database)
    • Attributes of entities, including data type
    • Relationships between entities/objects (or foreign keys in a database)
  • UML Class Diagrams
    • Class: Equivalent to entities in an ERD
    • Attributes: Equivalent to attributes in an ERD
    • Methods
    • Relationships
      • Between objects: Equivalent to relationships in an ERD
      • Between classes
  • Data Dictionary
    • List of data sets/tables
    • List of attributes/columns of each table with data type
    • Item descriptions
    • Relationships between tables/columns
    • Additional constraints
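One possible way to picture a data dictionary is as a small lookup structure that records, for each table, its columns, data types, descriptions, and constraints. The sketch below is an assumption-laden illustration: the Column class, the customer and purchase tables, and the type_errors helper are all hypothetical names, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One data dictionary entry (hypothetical structure)."""
    name: str
    dtype: type
    description: str
    constraint: str = ""


# Hypothetical data dictionary: table name -> list of column entries.
data_dictionary = {
    "customer": [
        Column("customer_id", int, "Unique customer identifier", "primary key"),
        Column("name", str, "Customer's full name", "not null"),
    ],
    "purchase": [
        Column("purchase_id", int, "Unique purchase identifier", "primary key"),
        Column("customer_id", int, "Customer who made the purchase",
               "foreign key -> customer.customer_id"),
        Column("amount", float, "Purchase total", "not null"),
    ],
}


def type_errors(table, record):
    """Return the names of columns whose value is missing or has the wrong type."""
    return [
        col.name
        for col in data_dictionary[table]
        if not isinstance(record.get(col.name), col.dtype)
    ]


good = type_errors("purchase", {"purchase_id": 10, "customer_id": 1, "amount": 19.99})
bad = type_errors("purchase", {"purchase_id": 10, "customer_id": "1"})
print(good)  # []
print(bad)   # ['customer_id', 'amount']
```

Keeping the dictionary as data rather than prose means the same entries can document a dataset and drive simple checks on incoming records.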

Data Transformation

Data transformation involves:

  • Adding, copying, or replicating data
  • Deleting fields or records
  • Standardizing the names of variables
  • Renaming, moving, or combining columns in a database
  • Joining one set of data with another
  • Saving a file in a different format
  • Converting between long and wide formats (long -> wide, wide -> long)
    • Long data is data where each row contains a single data point for a particular item.
    • Wide data is data where each row contains multiple data points for the particular items identified in the columns.

Wide Table

     A  B  C
I    a  b  c
II   d  e  f
III  g  h  i

Long Table

   F1   F2  F3
1  I    A   a
2  I    B   b
3  I    C   c
4  II   A   d
5  II   B   e
6  II   C   f
7  III  A   g
8  III  B   h
9  III  C   i
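The conversion between the two layouts can be sketched in plain Python, using the same values as the wide and long tables above. The function names wide_to_long and long_to_wide are hypothetical helpers written for this illustration; in practice a library such as pandas would usually do this work.

```python
# Wide layout: one row per item (I, II, III), one column per variable (A, B, C).
wide = {
    "I":   {"A": "a", "B": "b", "C": "c"},
    "II":  {"A": "d", "B": "e", "C": "f"},
    "III": {"A": "g", "B": "h", "C": "i"},
}


def wide_to_long(wide_rows):
    """Emit one (item, column, value) row per data point."""
    return [
        (item, column, value)
        for item, columns in wide_rows.items()
        for column, value in columns.items()
    ]


def long_to_wide(long_rows):
    """Group (item, column, value) rows back into one row per item."""
    result = {}
    for item, column, value in long_rows:
        result.setdefault(item, {})[column] = value
    return result


long_rows = wide_to_long(wide)
for number, row in enumerate(long_rows, start=1):
    print(number, *row)  # first line printed: 1 I A a

round_trip = long_to_wide(long_rows)
```

Three items with three columns each produce nine long rows, and converting back recovers the original wide layout.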

Hands-On Exploration

Kaggle


Data Bias

  • Sample bias: A sample isn’t representative of the population as a whole.
  • Observer bias: Also called experimenter bias or research bias; the tendency for different people to observe things differently.
  • Interpretation bias: The tendency to always interpret ambiguous situations in a positive or negative way.
  • Confirmation bias: The tendency to search for or interpret information in a way that confirms preexisting beliefs.
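Sample bias in particular is easy to demonstrate with a small simulation. The sketch below is illustrative only: the population of weekday and weekend visitors and their spending figures are invented, and the seed is fixed purely so the run is reproducible.

```python
import random

# Hypothetical population (illustrative numbers only):
# 500 weekday visitors who each spend 20 and 500 weekend visitors who each spend 40.
population = [20] * 500 + [40] * 500
population_mean = sum(population) / len(population)

# Biased sample: the first 100 records, which are all weekday visitors.
biased_sample = population[:100]
biased_mean = sum(biased_sample) / len(biased_sample)

# Simple random sample of the same size, drawn without replacement.
rng = random.Random(0)
random_sample = rng.sample(population, k=100)
random_mean = sum(random_sample) / len(random_sample)

print(f"population mean:    {population_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
```

The biased sample sees only one group, so its mean matches that group rather than the population, while the random sample draws from both groups and should land much closer to the population mean.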

Data Ethics

  • Ownership: Individuals own the raw data they provide, and they have primary control over how it’s used, how it’s processed, and how it’s shared.
  • Transaction transparency: All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
  • Consent: An individual’s right to know explicit details about how and why their data will be used before agreeing to provide it.
  • Currency: Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
  • Privacy: Preserving a data subject’s information and activity any time a data transaction occurs.
    • Protection from unauthorized access to our private data
    • Freedom from inappropriate use of our data
    • The right to inspect, update, or correct our data
    • Ability to give consent to use our data
    • Legal right to access the data
  • Openness: Free access, usage, and sharing of data. Open data should:
    • Be available and accessible to the public as a complete dataset
    • Be provided under terms that allow it to be reused and redistributed
    • Allow universal participation so that anyone can use, reuse, and redistribute the data
