DataFrame Introduction
In this notebook, we learn to load data into a pandas DataFrame and inspect it: the top rows, the shape (i.e., the number of rows and columns), the column names, the index labels, and summary statistics (e.g., mean, standard deviation, median).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
1. Creating DataFrame by loading data
- To load data from a `csv` file into a pandas `DataFrame`, we can use the `read_csv()` function. A pandas `DataFrame` is an object: when we load data, the `DataFrame` holds the data together with the extra functionality built into the `DataFrame` object. Once the data is loaded, we can make one column the index with the `set_index()` method. The default settings of `read_csv()` are shown in the abridged signature below (taken from an older pandas release; the `squeeze` and `prefix` parameters were removed in pandas 2.0):
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None,...)
titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index('Name')
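Since the signature above lists an `index_col` parameter, the load and the `set_index()` call can probably be combined into one step. The sketch below is only an illustration: it reads a hypothetical three-row CSV from an in-memory buffer (a stand-in for `data/titanic.csv`, with just a few of its columns) and compares the two approaches.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for data/titanic.csv (only four columns).
csv_text = io.StringIO(
    'PassengerId,Survived,Name,Age\n'
    '1,0,"Braund, Mr. Owen Harris",22\n'
    '2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38\n'
    '3,1,"Heikkinen, Miss. Laina",26\n'
)

# Two-step version, as in the cell above: load, then set the index.
two_step = pd.read_csv(csv_text).set_index('Name')

# One-step version: let read_csv build the index while parsing.
csv_text.seek(0)  # rewind the in-memory buffer so it can be read again
one_step = pd.read_csv(csv_text, index_col='Name')

# Both frames should hold the same data under the same index.
print(one_step.equals(two_step))
```

Either form works; `index_col` simply saves a line when the index column is known in advance.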
- To see the top 3 rows of the `DataFrame`, use `head(3)`:
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
- To get the shape of the `DataFrame` (number of rows, number of columns), use the `shape` attribute.
- To list the column names, use the `columns` attribute:
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
- To list the index labels, use the `index` attribute:
Index(['Braund, Mr. Owen Harris',
'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
'Heikkinen, Miss. Laina',
'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
'Allen, Mr. William Henry', 'Moran, Mr. James',
'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
'Nasser, Mrs. Nicholas (Adele Achem)',
...
'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
'Graham, Miss. Margaret Edith',
'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
'Dooley, Mr. Patrick'],
dtype='object', name='Name', length=891)
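The inspection steps above (top rows, shape, column names, index labels) can be sketched together as follows. The `sample` frame is a hypothetical miniature built from the first three passengers, not the real file, so the numbers it prints describe only this toy data.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table (first three passengers only).
sample = pd.DataFrame(
    {
        'PassengerId': [1, 2, 3],
        'Survived': [0, 1, 1],
        'Age': [22.0, 38.0, 26.0],
    },
    index=pd.Index(
        ['Braund, Mr. Owen Harris',
         'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
         'Heikkinen, Miss. Laina'],
        name='Name',
    ),
)

top_two = sample.head(2)             # first two rows, returned as a new DataFrame
n_rows, n_cols = sample.shape        # shape is an attribute (a tuple), not a method
column_names = list(sample.columns)  # column labels as a plain Python list
row_labels = list(sample.index)      # index labels as a plain Python list

print(n_rows, n_cols)
print(column_names)
print(row_labels)
```

Note that `shape`, `columns`, and `index` are attributes, so they are written without parentheses, while `head()` is a method.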
- To get preliminary statistics for each numeric column of the `DataFrame`, use `describe()` (transposed here, so that each row summarizes one column of the data):
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | 446.000000 | 257.353842 | 1.00 | 223.5000 | 446.0000 | 668.5 | 891.0000 |
| Survived | 891.0 | 0.383838 | 0.486592 | 0.00 | 0.0000 | 0.0000 | 1.0 | 1.0000 |
| Pclass | 891.0 | 2.308642 | 0.836071 | 1.00 | 2.0000 | 3.0000 | 3.0 | 3.0000 |
| Age | 714.0 | 29.699118 | 14.526497 | 0.42 | 20.1250 | 28.0000 | 38.0 | 80.0000 |
| SibSp | 891.0 | 0.523008 | 1.102743 | 0.00 | 0.0000 | 0.0000 | 1.0 | 8.0000 |
| Parch | 891.0 | 0.381594 | 0.806057 | 0.00 | 0.0000 | 0.0000 | 0.0 | 6.0000 |
| Fare | 891.0 | 32.204208 | 49.693429 | 0.00 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
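In the summary above, `Age` has a count of 714 while the other columns have 891, because `describe()` counts only non-missing values. The sketch below reproduces that behavior on a hypothetical four-row sample; it also shows that text columns such as `Sex` are left out of the default summary when numeric columns are present.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the missing Age mimics the gaps in the real data.
sample = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Sex': ['male', 'female', 'female', 'female'],
})

# describe() summarizes numeric columns; .T puts one data column per row.
stats = sample.describe().T

# count excludes NaN, so Age reports 3 values while Fare reports 4.
print(stats[['count', 'mean', '50%']])
```

The `50%` column is the median, which is why no separate median row appears in the output.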
- To style a sample of the data with a background color gradient:
cm = sns.light_palette("green", as_cmap=True)
s = titanic[0:5].style.background_gradient(cmap=cm)
s
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22 | 1 | 0 | A/5 21171 | 7.25 | nan | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| Allen, Mr. William Henry | 5 | 0 | 3 | male | 35 | 0 | 0 | 373450 | 8.05 | nan | S |
- To drop a column from a `DataFrame`:
titanic = titanic.drop('Ticket', axis=1)
titanic.head(2)
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
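As a small variation on the `drop()` call above, the `columns=` keyword can be used instead of `axis=1`, and it accepts a list when several columns should go at once. The sketch below uses a hypothetical two-row frame with only three of the Titanic columns.

```python
import pandas as pd

# Hypothetical two-row frame with three of the Titanic columns.
sample = pd.DataFrame({
    'Ticket': ['A/5 21171', 'PC 17599'],
    'Fare': [7.25, 71.2833],
    'Cabin': [None, 'C85'],
})

by_axis = sample.drop('Ticket', axis=1)              # form used in this notebook
by_keyword = sample.drop(columns='Ticket')           # equivalent keyword form
several = sample.drop(columns=['Ticket', 'Cabin'])   # drop more than one column

# drop() returns a new DataFrame; the original keeps all its columns
# unless the result is assigned back, as the notebook does with titanic.
print(list(sample.columns))
print(list(by_keyword.columns))
print(list(several.columns))
```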
- To drop every row that contains at least one `NaN` value:
titanic = titanic.dropna(axis=0)
titanic.head(2)
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
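Calling `dropna(axis=0)` removes a row as soon as any one of its columns is missing, which is why passengers with a `NaN` in `Cabin` (such as the first passenger) disappear from the output above. When that is too aggressive, the `subset` and `how` parameters narrow the rule. The sketch below shows the difference on a hypothetical four-row sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Age and Cabin.
sample = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# Default rule: drop a row if ANY column is missing.
any_missing = sample.dropna(axis=0)

# Only look at Age when deciding which rows to drop.
age_only = sample.dropna(subset=['Age'])

# Drop a row only if ALL of its columns are missing.
all_missing = sample.dropna(how='all')

print(len(any_missing), len(age_only), len(all_missing))
```

It is worth checking how many rows remain after `dropna()`, since the default rule can discard a large share of a dataset with one sparsely filled column.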
- To fill the `NaN` values with 0:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df = df.fillna(0)
df
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 0.0 | 0 |
| 1 | 3.0 | 4.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 5 |
| 3 | 0.0 | 3.0 | 0.0 | 4 |
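A single fill value is not always appropriate for every column. As a hedged variation on the cell above, `fillna()` also accepts a dictionary that maps column names to fill values. The sketch rebuilds the same toy frame under the name `raw` (an illustrative name, chosen so the notebook's `df` is not overwritten).

```python
import numpy as np
import pandas as pd

# Same toy data as above, before any filling.
raw = pd.DataFrame([[np.nan, 2, np.nan, 0],
                    [3, 4, np.nan, 1],
                    [np.nan, np.nan, np.nan, 5],
                    [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))

# Different fill value per column; columns not named (here C) keep their NaN.
per_column = raw.fillna({'A': 0, 'B': -1})

# Fill B with its own mean, computed from the non-missing values (2, 4, 3).
with_mean = raw.fillna({'B': raw['B'].mean()})

print(per_column)
print(with_mean)
```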
- To transpose the `DataFrame` (swap its rows and columns), use `df.T`:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | 0.0 | 3.0 | 0.0 | 0.0 |
| B | 2.0 | 4.0 | 0.0 | 3.0 |
| C | 0.0 | 0.0 | 0.0 | 0.0 |
| D | 0.0 | 1.0 | 5.0 | 4.0 |
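In the transposed output above, row `D` shows floats (0.0, 1.0, 5.0, 4.0) although column `D` held integers before. A likely explanation is that each new column mixes values from the float columns with values from the integer column, so pandas stores them all as floats. The sketch below checks this on a frame rebuilt by hand to match the filled `df`; the name `filled` is illustrative only.

```python
import pandas as pd

# Hand-built copy of the filled frame: A, B, C are floats, D is integer.
filled = pd.DataFrame({
    'A': [0.0, 3.0, 0.0, 0.0],
    'B': [2.0, 4.0, 0.0, 3.0],
    'C': [0.0, 0.0, 0.0, 0.0],
    'D': [0, 1, 5, 4],
})

flipped = filled.T  # rows become columns and columns become rows

print(filled.dtypes)   # D is an integer column here
print(flipped.dtypes)  # every transposed column mixes floats and ints
print(flipped)
```

Transposing twice brings back the original layout but not necessarily the original dtypes, so the transpose is best kept for display rather than reassigned to the working frame.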
References:
- Pydata document for Styling DataFrame visualization
- Pandas API References