Data Types [03]: Pandas DataFrame Basics
Published:
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R. This post covers the basics of DataFrame, more functions will be explored in practice.
DataFrame Overview
Below is an illustration of the structure of DataFrame. Each column in a DataFrame is a Series, and each row is a record.
We can use pd.DataFrame()
to generate DataFrame objects.
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
where the data
can abe ndarray
, Iterable
, dict
, or DataFrame
Example 1: Create DataFrame from ndarray.
import pandas as pd
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
df = pd.DataFrame(A, index = ['row1','row2','row3'], columns = ['col1','col2','col3'])
df
Out[203]:
col1 col2 col3
row1 1 2 3
row2 4 5 6
row3 7 8 9
Example 2: Create DataFrame from dict.
import pandas as pd
import numpy as np
B = {'col1': [1,4,7], 'col2': [2,5,8], 'col3': [3,6,9]}
df = pd.DataFrame(B, index = ['row1','row2','row3'])
df
Out[203]:
col1 col2 col3
row1 1 2 3
row2 4 5 6
row3 7 8 9
When selecting a single column of a pandas DataFrame, the result is a pandas Series
. To select the column, use the column label in between square brackets []
.
df['col1']
Out[207]:
row1 1
row2 4
row3 7
Name: col1, dtype: int64
We can use the method describe()
to generate descriptive statistics of the table. Here is a complete list of the methods supported.
df.describe()
Out[208]:
col1 col2 col3
count 3.0 3.0 3.0
mean 4.0 5.0 6.0
std 3.0 3.0 3.0
min 1.0 2.0 3.0
25% 2.5 3.5 4.5
50% 4.0 5.0 6.0
75% 5.5 6.5 7.5
max 7.0 8.0 9.0
Tabular Data Operations
To read data from csv
, excel
, sql
, json
, parquet
, etc., use pd.read_*()
, where *
represents data types; similarly, to write data to csv
, excel
, sql
, json
, parquet
, etc., use pd.to_*()
.
Here is a dataset of passengers’ responses to the survey from an aerospace company about whether they are satisfied with the journey.
airline = pd.read_csv('airline.csv',sep = ',')
The first thing we do after reading the data is to check it. When displaying the data, the first and last 5 rows will be displayed by default.
airline
Unnamed: 0 id ... Arrival Delay in Minutes satisfaction
0 0 19556 ... 44.0 satisfied
1 1 90035 ... 0.0 satisfied
2 2 12360 ... 0.0 neutral or dissatisfied
3 3 77959 ... 6.0 satisfied
4 4 36875 ... 20.0 satisfied
... ... ... ... ...
25971 25971 78463 ... 0.0 neutral or dissatisfied
25972 25972 71167 ... 0.0 satisfied
25973 25973 37675 ... 0.0 neutral or dissatisfied
25974 25974 90086 ... 0.0 satisfied
25975 25975 34799 ... 0.0 neutral or dissatisfied
[25976 rows x 25 columns]
We may want to know the header names of the data and the data types.
airline.columns
Out[221]:
Index(['Unnamed: 0',
'id', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
'Class', 'Flight Distance', 'Inflight wifi service',
'Departure/Arrival time convenient', 'Ease of Online booking',
'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
'Inflight entertainment', 'On-board service', 'Leg room service',
'Baggage handling', 'Checkin service', 'Inflight service',
'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
'satisfaction'],
dtype='object')
airline.dtypes
Out[222]:
Unnamed: 0 int64
id int64
Gender object
Customer Type object
Age int64
Type of Travel object
Class object
Flight Distance int64
Inflight wifi service int64
Departure/Arrival time convenient int64
Ease of Online booking int64
Gate location int64
Food and drink int64
Online boarding int64
Seat comfort int64
Inflight entertainment int64
On-board service int64
Leg room service int64
Baggage handling int64
Checkin service int64
Inflight service int64
Cleanliness int64
Departure Delay in Minutes int64
Arrival Delay in Minutes float64
satisfaction object
dtype: object
df.info()
provides more specific information.
airline.info()
Out[223]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 25976 non-null int64
1 id 25976 non-null int64
2 Gender 25976 non-null object
3 Customer Type 25976 non-null object
4 Age 25976 non-null int64
5 Type of Travel 25976 non-null object
6 Class 25976 non-null object
7 Flight Distance 25976 non-null int64
8 Inflight wifi service 25976 non-null int64
9 Departure/Arrival time convenient 25976 non-null int64
10 Ease of Online booking 25976 non-null int64
11 Gate location 25976 non-null int64
12 Food and drink 25976 non-null int64
13 Online boarding 25976 non-null int64
14 Seat comfort 25976 non-null int64
15 Inflight entertainment 25976 non-null int64
16 On-board service 25976 non-null int64
17 Leg room service 25976 non-null int64
18 Baggage handling 25976 non-null int64
19 Checkin service 25976 non-null int64
20 Inflight service 25976 non-null int64
21 Cleanliness 25976 non-null int64
22 Departure Delay in Minutes 25976 non-null int64
23 Arrival Delay in Minutes 25893 non-null float64
24 satisfaction 25976 non-null object
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB
- When selecting subsets of data, square brackets
[]
are used. - Inside these brackets, we can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using
loc
when using the row and column names - Select specific rows and/or columns using
iloc
when using the positions in the table - We can also assign new values to a selection based on
loc
/iloc
.
Plotting Using DataFrame
Some plot methods are available in DataFrame, we can get a overview of the methods:
[
method_name
for method_name in dir(airline.plot)
if not method_name.startswith("_")
]
Out[225]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
Better choices would be matplotlib
and seaborn
.
Summary Statistics
DataFrame.mean()
,DataFrame.median()
,DataFrame.describe()
, …DataFrame.agg()
DataFrame.groupby()
DataGrame.value_counts()
,DataFrame.count()
Reshape Layout
Sort table rows
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
where by
should be name or list of names to sort by; key
should be a callable function, which apply the key function to the values before sorting.
Example:
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
df.sort_values(by='col1', ascending=False)
Out[243]:
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
3 NaN 8 4 D
df.sort_values(by='col4', key=lambda col: col.str.lower())
Out[244]:
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
Long to wide table
pivot()
DataFrame.pivot(index=None, columns=None, values=None)
pivot_table()
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
In the case of pivot()
, the data is only rearranged. When multiple values need to be aggregated pivot_table()
can be used, providing an aggregation function (e.g. mean
) on how to combine these values.
Wide to long table
melt()
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
Combine Multiple Tables
concat()
merge()
- Similar to SQL
Comments