Data Types [03]: Pandas DataFrame Basics

7 minute read

Published: February 03, 2022

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R. This post covers the basics of DataFrame, more functions will be explored in practice.

DataFrame Overview

Below is an illustration of the structure of DataFrame. Each column in a DataFrame is a Series, and each row is a record.

drawing

We can use pd.DataFrame() to generate DataFrame objects.

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

where the data can abe ndarray, Iterable, dict, or DataFrame

Example 1: Create DataFrame from ndarray.

import pandas as pd
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
df = pd.DataFrame(A, index = ['row1','row2','row3'], columns = ['col1','col2','col3'])

df
Out[203]: 
      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

Example 2: Create DataFrame from dict.

import pandas as pd
import numpy as np
B = {'col1': [1,4,7], 'col2': [2,5,8], 'col3': [3,6,9]}
df = pd.DataFrame(B, index = ['row1','row2','row3'])

df
Out[203]: 
      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets [].

df['col1']
Out[207]: 
row1    1
row2    4
row3    7
Name: col1, dtype: int64

We can use the method describe() to generate descriptive statistics of the table. Here is a complete list of the methods supported.

df.describe()
Out[208]: 
       col1  col2  col3
count   3.0   3.0   3.0
mean    4.0   5.0   6.0
std     3.0   3.0   3.0
min     1.0   2.0   3.0
25%     2.5   3.5   4.5
50%     4.0   5.0   6.0
75%     5.5   6.5   7.5
max     7.0   8.0   9.0

Tabular Data Operations

To read data from csv, excel, sql, json, parquet, etc., use pd.read_*(), where * represents data types; similarly, to write data to csv, excel, sql, json, parquet, etc., use pd.to_*().

Here is a dataset of passengers’ responses to the survey from an aerospace company about whether they are satisfied with the journey.

airline = pd.read_csv('airline.csv',sep = ',')

The first thing we do after reading the data is to check it. When displaying the data, the first and last 5 rows will be displayed by default.

airline

       Unnamed: 0     id  ... Arrival Delay in Minutes             satisfaction
             0  19556  ...                     44.0                satisfied
             1  90035  ...                      0.0                satisfied
             2  12360  ...                      0.0  neutral or dissatisfied
             3  77959  ...                      6.0                satisfied
             4  36875  ...                     20.0                satisfied
          ...    ...  ...                      ...                      ...
     25971  78463  ...                      0.0  neutral or dissatisfied
     25972  71167  ...                      0.0                satisfied
     25973  37675  ...                      0.0  neutral or dissatisfied
     25974  90086  ...                      0.0                satisfied
     25975  34799  ...                      0.0  neutral or dissatisfied

[25976 rows x 25 columns]

We may want to know the header names of the data and the data types.

airline.columns
Out[221]: 
Index(['Unnamed: 0', 
       'id', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
       'Class', 'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'satisfaction'],
      dtype='object')

airline.dtypes
Out[222]: 
Unnamed: 0                             int64
id                                     int64
Gender                                object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Inflight wifi service                  int64
Departure/Arrival time convenient      int64
Ease of Online booking                 int64
Gate location                          int64
Food and drink                         int64
Online boarding                        int64
Seat comfort                           int64
Inflight entertainment                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Inflight service                       int64
Cleanliness                            int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
satisfaction                          object
dtype: object

df.info() provides more specific information.

airline.info()
Out[223]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 Unnamed: 0                         25976 non-null  int64  
 id                                 25976 non-null  int64  
 Gender                             25976 non-null  object 
 Customer Type                      25976 non-null  object 
 Age                                25976 non-null  int64  
 Type of Travel                     25976 non-null  object 
 Class                              25976 non-null  object 
 Flight Distance                    25976 non-null  int64  
 Inflight wifi service              25976 non-null  int64  
 Departure/Arrival time convenient  25976 non-null  int64  
Ease of Online booking             25976 non-null  int64  
Gate location                      25976 non-null  int64  
Food and drink                     25976 non-null  int64  
Online boarding                    25976 non-null  int64  
Seat comfort                       25976 non-null  int64  
Inflight entertainment             25976 non-null  int64  
On-board service                   25976 non-null  int64  
Leg room service                   25976 non-null  int64  
Baggage handling                   25976 non-null  int64  
Checkin service                    25976 non-null  int64  
Inflight service                   25976 non-null  int64  
Cleanliness                        25976 non-null  int64  
Departure Delay in Minutes         25976 non-null  int64  
Arrival Delay in Minutes           25893 non-null  float64
satisfaction                       25976 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB

When selecting subsets of data, square brackets [] are used.
Inside these brackets, we can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names
Select specific rows and/or columns using iloc when using the positions in the table
We can also assign new values to a selection based on loc/iloc.

Plotting Using DataFrame

Some plot methods are available in DataFrame, we can get a overview of the methods:

[
    method_name
    for method_name in dir(airline.plot)
    if not method_name.startswith("_")
]

Out[225]: 
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']

Better choices would be matplotlib and seaborn.

Summary Statistics

DataFrame.mean(), DataFrame.median(), DataFrame.describe(), …
DataFrame.agg()
DataFrame.groupby()
DataGrame.value_counts(), DataFrame.count()

Reshape Layout

Sort table rows

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

where by should be name or list of names to sort by; key should be a callable function, which apply the key function to the values before sorting.

Example:

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

df.sort_values(by='col1', ascending=False)
Out[243]: 
  col1  col2  col3 col4
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B
3  NaN     8     4    D

df.sort_values(by='col4', key=lambda col: col.str.lower())
Out[244]: 
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

Long to wide table

pivot()

DataFrame.pivot(index=None, columns=None, values=None)

pivot_table()

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

drawing

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated pivot_table() can be used, providing an aggregation function (e.g. mean) on how to combine these values.

Wide to long table

melt()

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

drawing

Combine Multiple Tables

concat()
merge()
- Similar to SQL

Twitter Facebook LinkedIn

Chao Huang

Data Types [03]: Pandas DataFrame Basics

DataFrame Overview

Tabular Data Operations

Plotting Using DataFrame

Summary Statistics

Reshape Layout

Combine Multiple Tables

Comments