Data Visualization [02]: Pandas and Seaborn Basics
Published:
matplotlib is a fairly low-level tool. pandas itself has built-in methods that simplify creating visualizations from DataFrame and Series objects, and seaborn simplifies the process further. Unfortunately, seaborn doesn’t have built-in support for 3D plots. However, we can still apply the seaborn style to 3D matplotlib plots. Let’s learn the basics of pandas and seaborn through some simple examples.
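As a rough illustration of using the seaborn style with a 3D matplotlib plot, here is a minimal sketch. The surface is a toy function chosen only for demonstration, and how much of the seaborn style carries over to the 3D panes may depend on the matplotlib and seaborn versions installed.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sn

# Setting the seaborn style changes matplotlib's defaults for later figures
sn.set(style='darkgrid')

# Toy surface z = sin(sqrt(x^2 + y^2)); the data is illustrative only
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # plain matplotlib 3D axes
ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
```

In a script, call plt.show() afterwards to display the figure; a notebook renders it automatically.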
Flight Dataset
The dataset used will be the flight dataset, which has 12 years (1949 to 1960) of monthly airline passenger data.
import seaborn as sn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
sn.set(style='darkgrid')
data = pd.read_csv('flight.csv', sep=',')
data
Out[95]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
.. ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432
[144 rows x 3 columns]
# check the detailed info of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 144 non-null int64
1 month 144 non-null object
2 passengers 144 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.5+ KB
pandas.DataFrame.plot()
In pandas, the plot method on Series and DataFrame is just a simple wrapper around plt.plot(), with some additional parameters.
DataFrame.plot(*args, **kwargs)
Here is a list of the parameters, where kind indicates the type of plot:
- x
- y
- kind
  - line: line plot (default)
  - bar: vertical bar plot
  - barh: horizontal bar plot
  - hist: histogram
  - box: boxplot
  - kde: Kernel Density Estimation plot
  - density: same as kde
  - area: area plot
  - pie: pie plot
  - scatter: scatter plot (DataFrame only)
  - hexbin: hexbin plot (DataFrame only)
- ax
- subplots
- sharex
- sharey
- layout
- figsize
- use_index
- title
- grid
- legend
- style
- logx
- logy
- loglog
- xticks
- yticks
- xlim
- ylim
- xlabel
- ylabel
- rot
- fontsize
- colormap
- colorbar
- position
- table
- yerr
- xerr
- stacked
- sort_columns
- secondary_y
- mark_right
- include_bool
- backend
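As a rough illustration of how a few of these parameters combine, the sketch below draws a stacked bar chart. The small DataFrame is typed in by hand from a few values of the flight dataset (three months, 1949 to 1951) so that the snippet runs without flight.csv; the title text and the chosen months are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-copied subset of the flight data: passengers for three months, 1949-1951
df = pd.DataFrame(
    {'February': [118, 126, 150],
     'April': [129, 135, 163],
     'May': [121, 125, 172]},
    index=pd.Index([1949, 1950, 1951], name='year'),
)

ax = df.plot(
    kind='bar',       # vertical bars instead of the default line plot
    stacked=True,     # stack the three months on top of each other
    figsize=(8, 4),   # figure size in inches
    title='Passengers for selected months',
    rot=0,            # keep the year tick labels horizontal
    grid=True,
)
ax.set_ylabel('passengers')
```

DataFrame.plot() returns the matplotlib Axes, so anything the parameters do not cover can still be adjusted on ax afterwards.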
Now, let’s plot a line graph of the columns ‘year’ and ‘passengers’ using DataFrame.plot():
# DataFrame.plot() creates its own figure, so no plt.figure() call is needed
data.plot(x='year', y='passengers')
The graph looks kind of weird: each year has twelve rows (one per month), so the line jumps up and down within every year. We will improve it using seaborn below. For now, let’s draw one line per month.
First, we need to convert the data to a pivot table (recent pandas versions require keyword arguments here):
data_wide = data.pivot(index="year", columns="month", values="passengers")
data_wide
Out[102]:
month April August December February ... May November October September
year ...
1949 129 148 118 118 ... 121 104 119 136
1950 135 170 140 126 ... 125 114 133 158
1951 163 199 166 150 ... 172 146 162 184
1952 181 242 194 180 ... 183 172 191 209
1953 235 272 201 196 ... 229 180 211 237
1954 227 293 229 188 ... 234 203 229 259
1955 269 347 278 233 ... 270 237 274 312
1956 313 405 306 277 ... 318 271 306 355
1957 348 467 336 301 ... 355 305 347 404
1958 348 505 337 318 ... 363 310 359 404
1959 396 559 405 342 ... 420 362 407 463
1960 461 606 432 391 ... 472 390 461 508
[12 rows x 12 columns]
Now we can plot the graph simply by calling data_wide.plot(), which draws one line for each month column.
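The pivot-and-plot steps above can be sketched end to end as follows. The pivoted columns come out in alphabetical order, so this sketch also reorders them into calendar order before plotting, which is an optional extra step that is not part of the original walkthrough. The small DataFrame is typed in by hand from values shown in this post so the snippet runs without flight.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Long-form subset of the flight data, hand-copied from the tables in this post
data = pd.DataFrame({
    'year': [1949, 1949, 1949, 1950, 1950, 1950],
    'month': ['February', 'April', 'May'] * 2,
    'passengers': [118, 129, 121, 126, 135, 125],
})

# Keyword arguments work on both older and newer pandas versions
data_wide = data.pivot(index='year', columns='month', values='passengers')
print(list(data_wide.columns))  # ['April', 'February', 'May'] (alphabetical)

# Reorder the columns into calendar order, keeping only months that are present
calendar_order = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December']
month_order = [m for m in calendar_order if m in data_wide.columns]
data_wide = data_wide[month_order]
print(list(data_wide.columns))  # ['February', 'April', 'May'] (calendar)

# One line per month column; marker is passed through to matplotlib
ax = data_wide.plot(marker='o')
```

With the columns in calendar order, the legend also lists the months in the order a reader expects.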
seaborn.lineplot()
Let’s consider improving the line graph using seaborn.lineplot(). When several observations share the same x value, seaborn.lineplot() by default plots their mean together with a 95% confidence interval, which makes the graph much easier to read:
plt.figure()
sn.lineplot(x = 'year', y = 'passengers', data=data)
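The central line in this plot is the mean of all observations that share a year. As a rough pandas-only sketch of that aggregation, grouping by year before plotting gives a single value per year and avoids the zigzag of the earlier DataFrame.plot() figure. The data below is a hand-copied subset (three months per year), so the means are for those months only, not for the full year.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Long-form subset of the flight data, hand-copied from the tables in this post
data = pd.DataFrame({
    'year': [1949] * 3 + [1950] * 3 + [1951] * 3,
    'month': ['February', 'April', 'May'] * 3,
    'passengers': [118, 129, 121, 126, 135, 125, 150, 163, 172],
})

# One mean value per year, analogous to lineplot's default estimator
yearly_mean = data.groupby('year')['passengers'].mean()
print(yearly_mean)

# A Series plots its index on the x-axis, so this is a single clean line
ax = yearly_mean.plot(marker='o')
ax.set_ylabel('mean passengers')
```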
By default, the confidence interval is drawn as a translucent error band at the 95% confidence level. We can change the error style to ‘bars’ with the err_style parameter and the confidence level with the ci parameter (seaborn 0.12 and later deprecate ci in favor of errorbar, e.g. errorbar=('ci', 95)):
plt.figure()
sn.lineplot(x = 'year', y = 'passengers', data = data,
err_style = 'bars', ci = 95)
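To give a feel for what the 95% interval represents, here is a minimal sketch of a percentile bootstrap for the mean of a single year. It illustrates the idea only and is not seaborn's actual implementation. The ten values are the 1949 figures visible in the tables in this post (June and July are elided there, so they are left out); the seed and the number of resamples are arbitrary choices.

```python
import numpy as np

# 1949 passengers for the ten months shown in this post (June and July omitted)
values = np.array([112, 118, 132, 129, 121, 148, 136, 119, 104, 118])

rng = np.random.default_rng(0)  # fixed seed so the result is reproducible
n_boot = 1000                   # number of bootstrap resamples (arbitrary)

# Resample the observations with replacement and take the mean of each resample
samples = rng.choice(values, size=(n_boot, values.size), replace=True)
boot_means = samples.mean(axis=1)

# The middle 95% of the resampled means forms the confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {values.mean():.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

A narrower band in the plot therefore means the monthly values within that year are closer together.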
To group the data, we can use the hue, size, or style parameter, which is very convenient:
plt.figure()
sn.lineplot(x = 'year', y = 'passengers', hue = 'month', data = data)
To distinguish the groups further, we can combine different line styles with per-group markers by setting style and markers:
plt.figure()
sn.lineplot(x = 'year', y = 'passengers', hue = 'month', data = data,
style = 'month', markers=True)
Now, the line graph looks much better.
See the documentation of DataFrame.plot() and seaborn.lineplot() for reference.
Other Plots: Overview
pandas
- In addition to the kind parameter in DataFrame.plot(), there are the DataFrame.hist() and DataFrame.boxplot() methods, which use a separate interface.
- We can also use the methods DataFrame.plot.<kind>, including:
  - DataFrame.plot.area
  - DataFrame.plot.barh
  - DataFrame.plot.density
  - DataFrame.plot.hist
  - DataFrame.plot.line
  - DataFrame.plot.scatter
  - DataFrame.plot.bar
  - DataFrame.plot.box
  - DataFrame.plot.hexbin
  - DataFrame.plot.kde
  - DataFrame.plot.pie
- There are also several plotting functions in pandas.plotting that take a Series or DataFrame as an argument, including:
  - andrews_curves
  - autocorrelation_plot
  - bootstrap_plot
  - boxplot
  - lag_plot
  - parallel_coordinates
  - radviz
  - scatter_matrix
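As a brief sketch of the two interfaces listed above, the snippet below calls one DataFrame.plot.<kind> method and one pandas.plotting function. The DataFrame is a hand-copied subset of the pivot table from this post (three months, 1949 to 1954), and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Wide-form subset of the flight data: index = year, one column per month
data_wide = pd.DataFrame(
    {'February': [118, 126, 150, 180, 196, 188],
     'April': [129, 135, 163, 181, 235, 227],
     'May': [121, 125, 172, 183, 229, 234]},
    index=pd.Index(range(1949, 1955), name='year'),
)

# Accessor form: equivalent to data_wide.plot(kind='box')
ax_box = data_wide.plot.box()

# pandas.plotting function: pairwise scatter plots, histograms on the diagonal
axes = pd.plotting.scatter_matrix(data_wide, figsize=(6, 6))
print(axes.shape)  # one row and one column of subplots per DataFrame column
```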
seaborn
- Relational plots
  - relplot: Figure-level interface for drawing relational plots onto a FacetGrid.
  - scatterplot
  - lineplot
- Distribution plots
  - displot: Figure-level interface for drawing distribution plots onto a FacetGrid.
  - histplot
  - kdeplot: Plot univariate or bivariate distributions using kernel density estimation.
  - ecdfplot: Plot empirical cumulative distribution functions.
  - rugplot: Plot marginal distributions by drawing ticks along the x and y axes.
- Categorical plots
  - catplot: Figure-level interface for drawing categorical plots onto a FacetGrid.
  - stripplot: Draw a scatterplot where one variable is categorical.
  - swarmplot: Draw a categorical scatterplot with non-overlapping points.
  - boxplot: Draw a box plot to show distributions with respect to categories.
  - violinplot: Draw a combination of boxplot and kernel density estimate.
  - boxenplot: Draw an enhanced box plot for larger datasets.
  - pointplot: Show point estimates and confidence intervals using scatter plot glyphs.
  - barplot: Show point estimates and confidence intervals as rectangular bars.
  - countplot: Show the counts of observations in each categorical bin using bars.
- Regression plots
  - lmplot: Plot data and regression model fits across a FacetGrid.
  - regplot
  - residplot
- Matrix plots
  - heatmap
  - clustermap
- Multi-plot grids
  - FacetGrid: Multi-plot grid for plotting conditional relationships.
  - pairplot: Plot pairwise relationships in a dataset.
  - PairGrid: Subplot grid for plotting pairwise relationships in a dataset.
  - jointplot: Draw a plot of two variables with bivariate and univariate graphs.
  - JointGrid: Grid for drawing a bivariate plot with marginal univariate plots.
An overview of seaborn can be found on this page. We will delve deeper into these plots in future posts.