Data Types [03]: Pandas DataFrame Basics

7 minute read

Published:

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R. This post covers the basics of DataFrame, more functions will be explored in practice.


DataFrame Overview

Below is an illustration of the structure of DataFrame. Each column in a DataFrame is a Series, and each row is a record.

drawing

We can use pd.DataFrame() to generate DataFrame objects.

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

where the data can abe ndarray, Iterable, dict, or DataFrame

Example 1: Create DataFrame from ndarray.

import pandas as pd
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
df = pd.DataFrame(A, index = ['row1','row2','row3'], columns = ['col1','col2','col3'])

df
Out[203]: 
      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

Example 2: Create DataFrame from dict.

import pandas as pd
import numpy as np
B = {'col1': [1,4,7], 'col2': [2,5,8], 'col3': [3,6,9]}
df = pd.DataFrame(B, index = ['row1','row2','row3'])

df
Out[203]: 
      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets [].

df['col1']
Out[207]: 
row1    1
row2    4
row3    7
Name: col1, dtype: int64

We can use the method describe() to generate descriptive statistics of the table. Here is a complete list of the methods supported.

df.describe()
Out[208]: 
       col1  col2  col3
count   3.0   3.0   3.0
mean    4.0   5.0   6.0
std     3.0   3.0   3.0
min     1.0   2.0   3.0
25%     2.5   3.5   4.5
50%     4.0   5.0   6.0
75%     5.5   6.5   7.5
max     7.0   8.0   9.0

Tabular Data Operations

To read data from csv, excel, sql, json, parquet, etc., use pd.read_*(), where * represents data types; similarly, to write data to csv, excel, sql, json, parquet, etc., use pd.to_*().

Here is a dataset of passengers’ responses to the survey from an aerospace company about whether they are satisfied with the journey.

airline = pd.read_csv('airline.csv',sep = ',')

The first thing we do after reading the data is to check it. When displaying the data, the first and last 5 rows will be displayed by default.

airline

       Unnamed: 0     id  ... Arrival Delay in Minutes             satisfaction
0               0  19556  ...                     44.0                satisfied
1               1  90035  ...                      0.0                satisfied
2               2  12360  ...                      0.0  neutral or dissatisfied
3               3  77959  ...                      6.0                satisfied
4               4  36875  ...                     20.0                satisfied
          ...    ...  ...                      ...                      ...
25971       25971  78463  ...                      0.0  neutral or dissatisfied
25972       25972  71167  ...                      0.0                satisfied
25973       25973  37675  ...                      0.0  neutral or dissatisfied
25974       25974  90086  ...                      0.0                satisfied
25975       25975  34799  ...                      0.0  neutral or dissatisfied

[25976 rows x 25 columns]

We may want to know the header names of the data and the data types.

airline.columns
Out[221]: 
Index(['Unnamed: 0', 
       'id', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
       'Class', 'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'satisfaction'],
      dtype='object')

airline.dtypes
Out[222]: 
Unnamed: 0                             int64
id                                     int64
Gender                                object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Inflight wifi service                  int64
Departure/Arrival time convenient      int64
Ease of Online booking                 int64
Gate location                          int64
Food and drink                         int64
Online boarding                        int64
Seat comfort                           int64
Inflight entertainment                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Inflight service                       int64
Cleanliness                            int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
satisfaction                          object
dtype: object

df.info() provides more specific information.

airline.info()
Out[223]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         25976 non-null  int64  
 1   id                                 25976 non-null  int64  
 2   Gender                             25976 non-null  object 
 3   Customer Type                      25976 non-null  object 
 4   Age                                25976 non-null  int64  
 5   Type of Travel                     25976 non-null  object 
 6   Class                              25976 non-null  object 
 7   Flight Distance                    25976 non-null  int64  
 8   Inflight wifi service              25976 non-null  int64  
 9   Departure/Arrival time convenient  25976 non-null  int64  
 10  Ease of Online booking             25976 non-null  int64  
 11  Gate location                      25976 non-null  int64  
 12  Food and drink                     25976 non-null  int64  
 13  Online boarding                    25976 non-null  int64  
 14  Seat comfort                       25976 non-null  int64  
 15  Inflight entertainment             25976 non-null  int64  
 16  On-board service                   25976 non-null  int64  
 17  Leg room service                   25976 non-null  int64  
 18  Baggage handling                   25976 non-null  int64  
 19  Checkin service                    25976 non-null  int64  
 20  Inflight service                   25976 non-null  int64  
 21  Cleanliness                        25976 non-null  int64  
 22  Departure Delay in Minutes         25976 non-null  int64  
 23  Arrival Delay in Minutes           25893 non-null  float64
 24  satisfaction                       25976 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB
  • When selecting subsets of data, square brackets [] are used.
  • Inside these brackets, we can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
  • Select specific rows and/or columns using loc when using the row and column names
  • Select specific rows and/or columns using iloc when using the positions in the table
  • We can also assign new values to a selection based on loc/iloc.

Plotting Using DataFrame

Some plot methods are available in DataFrame, we can get a overview of the methods:

[
    method_name
    for method_name in dir(airline.plot)
    if not method_name.startswith("_")
]

Out[225]: 
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']

Better choices would be matplotlib and seaborn.


Summary Statistics

  • DataFrame.mean(), DataFrame.median(), DataFrame.describe(), …
  • DataFrame.agg()
  • DataFrame.groupby()
  • DataGrame.value_counts(), DataFrame.count()

Reshape Layout

Sort table rows

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

where by should be name or list of names to sort by; key should be a callable function, which apply the key function to the values before sorting.

Example:

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
df.sort_values(by='col1', ascending=False)
Out[243]: 
  col1  col2  col3 col4
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B
3  NaN     8     4    D

df.sort_values(by='col4', key=lambda col: col.str.lower())
Out[244]: 
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

Long to wide table

  • pivot()
    DataFrame.pivot(index=None, columns=None, values=None)
    
  • pivot_table()
    DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
    

drawing

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated pivot_table() can be used, providing an aggregation function (e.g. mean) on how to combine these values.

Wide to long table

  • melt()
    pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
    

drawing


Combine Multiple Tables

  • concat()
  • merge()
    • Similar to SQL

Comments