DataFrame Introduction

In this notebook, we will learn to load data into a pandas DataFrame and inspect it: the top rows, the shape (i.e., the number of rows and columns), the column names, the index labels, and summary statistics (e.g., mean, standard deviation, median). We will also style a sample of the data, drop columns, handle missing values, and transpose a DataFrame.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

1. Creating DataFrame by loading data

  • To load data from a CSV file into a pandas DataFrame, use the read_csv() function. A DataFrame is an object that holds the data together with methods for inspecting and manipulating it. Once the data is loaded, we can make one column the index with the set_index() method. The first parameters of read_csv() and their default values are listed below. The signature is abbreviated and comes from an older pandas release: the squeeze and prefix parameters were removed in pandas 2.0.

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None,...)

titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index('Name')
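As an alternative to calling set_index() after loading, read_csv() accepts an index_col argument that sets the index while the file is read. The sketch below uses a small in-memory CSV (a hypothetical three-row sample, not the real data/titanic.csv) so that it runs without the data file:

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/titanic.csv.
# Names contain commas, so they are quoted in the CSV text.
csv_text = """PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38
3,1,"Heikkinen, Miss. Laina",26
"""

# index_col makes the Name column the index during loading,
# so no separate set_index() call is needed.
sample = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(sample.index.name)
print(sample.shape)
```

The Name column no longer counts as a data column once it is the index, which is why the shape reports one column fewer than the CSV header lists.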
  • To see the top 3 rows of the DataFrame:
titanic.head(3)
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
  • To get the shape of the DataFrame as (number of rows, number of columns):
titanic.shape
(891, 11)
  • To find the list of column names:
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
  • To find the list of index labels:
titanic.index
Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       ...
       'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
       'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
       'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
       'Graham, Miss. Margaret Edith',
       'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
       'Dooley, Mr. Patrick'],
      dtype='object', name='Name', length=891)
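One practical use of a named index is label-based lookup. The sketch below is a minimal illustration on a hypothetical two-row frame (the variable names people, row, age and restored are made up for this example); it uses .loc to select by index label and reset_index() to turn the index back into an ordinary column:

```python
import pandas as pd

# Hypothetical two-row sample with Name as the index.
people = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
        "Age": [22.0, 26.0],
        "Pclass": [3, 3],
    }
).set_index("Name")

# Select a whole row by its index label.
row = people.loc["Heikkinen, Miss. Laina"]

# Select a single value by index label and column name.
age = people.loc["Heikkinen, Miss. Laina", "Age"]

# Move the index back into a regular column.
restored = people.reset_index()

print(row)
print(age)
print(restored.columns.tolist())
```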
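  • To find preliminary statistics for each numeric column of the DataFrame. By default describe() skips non-numeric columns such as Sex and Cabin, and .T transposes the result so that each column becomes a row: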
titanic.describe().T
             count        mean         std   min       25%       50%    75%       max
PassengerId  891.0  446.000000  257.353842  1.00  223.5000  446.0000  668.5  891.0000
Survived     891.0    0.383838    0.486592  0.00    0.0000    0.0000    1.0    1.0000
Pclass       891.0    2.308642    0.836071  1.00    2.0000    3.0000    3.0    3.0000
Age          714.0   29.699118   14.526497  0.42   20.1250   28.0000   38.0   80.0000
SibSp        891.0    0.523008    1.102743  0.00    0.0000    0.0000    1.0    8.0000
Parch        891.0    0.381594    0.806057  0.00    0.0000    0.0000    0.0    6.0000
Fare         891.0   32.204208   49.693429  0.00    7.9104   14.4542   31.0  512.3292
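The count column in a describe() result counts only non-missing values, and text columns need to be requested explicitly. The following sketch shows both points on a hypothetical four-row sample (not the real Titanic data), using the include/exclude arguments of describe():

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and one missing Age value.
sample = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.05],
    }
)

# Default behaviour: numeric columns only; NaN is ignored in count.
numeric_summary = sample.describe().T

# Excluding numeric dtypes summarizes the remaining (text) columns
# with count, unique, top and freq instead of mean and std.
text_summary = sample.describe(exclude=[np.number]).T

print(numeric_summary)
print(text_summary)
```

This is why Age shows a count of 714.0 in the Titanic summary while the other columns show 891.0: the difference is the number of missing ages.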
  • To style a sample of the data with a background color gradient (here, the first 5 rows shaded with a light green palette):
cm = sns.light_palette("green", as_cmap=True)
s = titanic[0:5].style.background_gradient(cmap=cm)
s
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22 1 0 A/5 21171 7.25 nan S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26 0 0 STON/O2. 3101282 7.925 nan S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35 1 0 113803 53.1 C123 S
Allen, Mr. William Henry 5 0 3 male 35 0 0 373450 8.05 nan S
  • To drop a column from a DataFrame:
titanic = titanic.drop('Ticket', axis=1)
titanic.head(2)
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 71.2833 C85 C
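drop() returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to titanic above. The sketch below, on a hypothetical two-row sample, shows the equivalent columns= keyword and how to drop several columns in one call:

```python
import pandas as pd

# Hypothetical two-row sample.
sample = pd.DataFrame(
    {
        "Ticket": ["A/5 21171", "PC 17599"],
        "Fare": [7.25, 71.2833],
        "Cabin": [None, "C85"],
    }
)

# columns=[...] is equivalent to passing the labels with axis=1.
trimmed = sample.drop(columns=["Ticket", "Cabin"])

print(trimmed.columns.tolist())  # columns left after dropping
print(sample.columns.tolist())   # the original is unchanged
```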
  • To drop every row that contains at least one NaN value (here, rows such as Braund, Mr. Owen Harris, whose Cabin is NaN, are removed):
titanic = titanic.dropna(axis=0)
titanic.head(2)
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 71.2833 C85 C
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 53.1000 C123 S
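Dropping every incomplete row can discard a lot of data when a single column is mostly empty. A minimal sketch, assuming a hypothetical four-row sample, of counting missing values per column first and then restricting dropna() to selected columns with subset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Cabin is missing more often than Age.
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Cabin": [np.nan, "C85", np.nan, "C123"],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

# Count missing values per column before deciding what to drop.
missing_per_column = sample.isna().sum()

# Drop rows with a NaN in any column.
all_complete = sample.dropna()

# Drop rows only when Age is missing; a NaN in Cabin is tolerated.
age_known = sample.dropna(subset=["Age"])

print(missing_per_column)
print(len(all_complete), len(age_known))
```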
  • To fill NaN values with 0:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df = df.fillna(0)
df
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
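A single fill value is not always appropriate for every column. fillna() also accepts a dictionary that maps column names to fill values. The sketch below reuses the same example frame and, as an illustrative choice rather than a recommendation, fills column A with 0 and column B with its mean; columns that are not listed keep their NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# Per-column fill values: 0 for A, the mean of the known values for B.
# mean() skips NaN by default, so B's mean is (2 + 4 + 3) / 3.
per_column = df.fillna({"A": 0, "B": df["B"].mean()})

print(per_column)
```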
  • To transpose the DataFrame, i.e., swap its rows and columns:
df.T
     0    1    2    3
A  0.0  3.0  0.0  0.0
B  2.0  4.0  0.0  3.0
C  0.0  0.0  0.0  0.0
D  0.0  1.0  5.0  4.0
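Note that row D is shown as floats in the transposed output even though column D held integers before. After transposing, each new column mixes values from the original float and integer columns, so pandas stores them with a common dtype. A minimal sketch of this effect on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame with one float column and one integer column.
small = pd.DataFrame({"A": [0.0, 3.0], "D": [0, 1]})

# Before transposing, each column keeps its own dtype.
print(small.dtypes)

# After transposing, every column mixes floats and ints,
# so all of them are upcast to float64.
transposed = small.T
print(transposed.dtypes)

# Transposing back does not restore the integer dtype of D.
round_trip = transposed.T
print(round_trip.dtypes)
```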
