Statistics [17]: Python [01] - Data Representation

3 minute read

Published:

Data representation and manipulation in Python.


Data As Table

In Python, the data can be seen as a 2D table, or matrix, with columns giving the different attributes of the data, and rows the observations. The following example is taken from data.scv, where there are 7 attributes and 5 observations.

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
...

Importing Data

To import data into the workspace, we can use pandas.read_csv (or other formats, which can be referenced to pandas) from pandas module, the imported data would be a pandas.DataFrame.

import pandas
data1 = pandas.read_csv('data.csv', sep=';', na_values=".")

The results look like below.

   Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0           1  Female   133  132  124   118.0    64.5     816932
1           2    Male   140  150  124     NaN    72.5    1001121
2           3    Male   139  123  150   143.0    73.3    1038437
3           4    Male   133  129  128   172.0    68.8     965353
4           5  Female   137  132  134   147.0    65.0     951545
...

Alternately, we can also create data from arrays using pandas.DataFrame, which can be seen as a dictionary of 1D series.

t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)
data2 = pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})  

The results look like below.

           t       sin       cos
0  -6.000000  0.279415  0.960170
1  -5.368421  0.792419  0.609977
2  -4.736842  0.999701  0.024451
3  -4.105263  0.821291 -0.570509
4  -3.473684  0.326021 -0.945363
...

Manipulating Data

pandas provides a lot of methods for manipulating the data, [here] is a list of the methods supported. We can call DataFrame.describe() to have a glimpse of the data imported.

Here is the result.

       Unnamed: 0        FSIQ         VIQ  ...      Weight     Height     MRI_Count
count   40.000000   40.000000   40.000000  ...   38.000000  39.000000  4.000000e+01
mean    20.500000  113.450000  112.350000  ...  151.052632  68.525641  9.087550e+05
std     11.690452   24.082071   23.616107  ...   23.478509   3.994649  7.228205e+04
min      1.000000   77.000000   71.000000  ...  106.000000  62.000000  7.906190e+05
25%     10.750000   89.750000   90.000000  ...  135.250000  66.000000  8.559185e+05
50%     20.500000  116.500000  113.000000  ...  146.500000  68.000000  9.053990e+05
75%     30.250000  135.500000  129.750000  ...  172.000000  70.500000  9.500780e+05
max     40.000000  144.000000  150.000000  ...  192.000000  77.000000  1.079549e+06

[8 rows x 7 columns]

One of the useful method is the DataFrame.groupby() method, which can be used to split a dataframe on values of categorical variables. Below is an example.

groupby_gender = data1.groupby('Gender')
for gender, value in groupby_gender['VIQ']:
    print((gender, value.mean()))

Here is the result.

('Female', 109.45)
('Male', 115.25)

When we call groupby_gender.mean(), the results would be the following.

        Unnamed: 0   FSIQ     VIQ     PIQ      Weight     Height  MRI_Count
Gender                                                                     
Female       19.65  111.9  109.45  110.45  137.200000  65.765000   862654.6
Male         21.35  115.0  115.25  111.60  166.444444  71.431579   954855.4

Plotting Data

pandas.DataFrame has a couple of methods that can be used to plot simple figures. For example, the boxplot() method can be used to draw a plot figure.

data1.boxplot(column=['FSIQ', 'VIQ', 'PIQ'])

drawing

Apply this method to groupby_gender.

groupby_gender.boxplot(column=['FSIQ', 'VIQ', 'PIQ'])

drawing

More complex figures can be plotted using plotting from pandas, which uses matplotlib behind the scene.

from pandas import plotting
plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']])

drawing

Alternately, we can also use seaborn for data visualization, which is based on matplotlib. However, it provides a high-level interface for drawing attractive and informative statistical graphics.


Table of Contents

Comments