DataFrame Introduction

In this notebook, we will learn to load data and inspect it: the top rows, the shape (i.e., the number of rows and columns), the column names, the index labels, and summary statistics (e.g., mean, standard deviation, median). We will also drop columns, handle missing values, and transpose a DataFrame.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

1. Creating DataFrame by loading data

To load data from a CSV file into a pandas DataFrame, we can use the read_csv() function. A DataFrame is an object that holds the loaded data together with methods for inspecting and manipulating it. Once the data is loaded, we can make one column the index with the set_index() method. The default arguments of read_csv() are shown below (abridged; the exact parameter list depends on the pandas version, and squeeze and prefix, for example, were removed in pandas 2.0):

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None,...)

titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index('Name')
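
As a minimal sketch of the same idea, read_csv() can also set the index while loading through its index_col parameter. The small in-memory CSV below is a made-up sample (not the Titanic file), used only so that the cell runs without any data on disk.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
csv_text = """Name,Age,Fare
Alice,30,10.5
Bob,25,7.25
Carol,41,52.0
"""

# Two steps: load the data, then promote the 'Name' column to the index.
two_step = pd.read_csv(io.StringIO(csv_text)).set_index("Name")

# One step: let read_csv() set the index while loading.
one_step = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# Both approaches should produce the same DataFrame.
print(one_step.equals(two_step))
```

Either form works; index_col simply saves the extra assignment when we already know which column should label the rows.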

To see the top 3 rows of the DataFrame:

titanic.head(3)

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

To get the shape of the DataFrame as (number of rows, number of columns):

titanic.shape

(891, 11)
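
Because shape is a plain tuple, it can be unpacked into separate row and column counts. The sketch below uses a small made-up DataFrame to show this; note that the index is not counted as a column, which is why the Titanic data above reports 11 columns after 'Name' became the index.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns, with 'name' moved to the index.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1, 2, 3],
    "y": [4.0, 5.0, 6.0],
}).set_index("name")

# shape is (rows, columns); the index does not count as a column.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() on a DataFrame returns the number of rows.
print(len(df) == n_rows)
```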

To find the list of column names:

titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To list the index labels:

titanic.index

Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       ...
       'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
       'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
       'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
       'Graham, Miss. Margaret Edith',
       'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
       'Dooley, Mr. Patrick'],
      dtype='object', name='Name', length=891)

To compute preliminary statistics for each numeric column of the DataFrame:

titanic.describe().T

	count	mean	std	min	25%	50%	75%	max
PassengerId	891.0	446.000000	257.353842	1.00	223.5000	446.0000	668.5	891.0000
Survived	891.0	0.383838	0.486592	0.00	0.0000	0.0000	1.0	1.0000
Pclass	891.0	2.308642	0.836071	1.00	2.0000	3.0000	3.0	3.0000
Age	714.0	29.699118	14.526497	0.42	20.1250	28.0000	38.0	80.0000
SibSp	891.0	0.523008	1.102743	0.00	0.0000	0.0000	1.0	8.0000
Parch	891.0	0.381594	0.806057	0.00	0.0000	0.0000	0.0	6.0000
Fare	891.0	32.204208	49.693429	0.00	7.9104	14.4542	31.0	512.3292
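
Two details of the output above are worth a short sketch. First, describe() summarizes only numeric columns by default, which is why 'Sex', 'Cabin' and 'Embarked' are absent. Second, count excludes missing values, which is why 'Age' shows 714 rather than 891. The toy data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and one numeric column with a NaN.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Default behaviour: only numeric columns are summarized.
numeric_only = df.describe()

# include="all" adds non-numeric columns (count, unique, top, freq).
everything = df.describe(include="all")

print(numeric_only.columns.tolist())
print(everything.columns.tolist())

# 'count' ignores NaN, so it is smaller than the number of rows.
print(numeric_only.loc["count", "Age"], len(df))
```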

To style a sample of the data with a background color gradient:

cm = sns.light_palette("green", as_cmap=True)
s = titanic[0:5].style.background_gradient(cmap=cm)
s

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22	1	0	A/5 21171	7.25	nan	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26	0	0	STON/O2. 3101282	7.925	nan	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35	1	0	113803	53.1	C123	S
Allen, Mr. William Henry	5	0	3	male	35	0	0	373450	8.05	nan	S

To drop a column from a DataFrame:

titanic = titanic.drop('Ticket', axis=1)
titanic.head(2)

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	7.2500	NaN	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	71.2833	C85	C
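
As a small sketch on made-up data, drop() also accepts a columns keyword, which some readers find clearer than axis=1. In both forms the method returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to titanic above.

```python
import pandas as pd

# Hypothetical toy DataFrame with three columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Equivalent ways to drop column 'b'.
by_axis = df.drop("b", axis=1)
by_keyword = df.drop(columns=["b"])

print(by_axis.equals(by_keyword))

# The original DataFrame still has all three columns.
print(df.columns.tolist())
```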

To drop every row that contains at least one NaN value:

titanic = titanic.dropna(axis=0)
titanic.head(2)

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	71.2833	C85	C
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	53.1000	C123	S
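
Dropping rows on any NaN can discard a lot of data when one column (such as 'Cabin') is mostly empty. A possible alternative, sketched here on made-up data, is the subset parameter of dropna(), which restricts the check to the columns we care about.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'Cabin' is missing more often than 'Age'.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Default: a row is dropped if ANY column contains NaN.
any_nan = df.dropna()

# subset: a row is dropped only if 'Age' is NaN.
age_only = df.dropna(subset=["Age"])

print(len(df), len(any_nan), len(age_only))
```

Which variant is appropriate depends on the analysis; the sketch only shows how much the choice can change the number of remaining rows.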

To fill NaN values with 0:

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df = df.fillna(0)
df

	A	B	C	D
0	0.0	2.0	0.0	0
1	3.0	4.0	0.0	1
2	0.0	0.0	0.0	5
3	0.0	3.0	0.0	4
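
A single fill value is not always appropriate for every column. As a hedged sketch on made-up data, fillna() also accepts a dictionary that maps column names to fill values, so that, for example, one column is filled with 0 and another with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns.
df = pd.DataFrame({
    "A": [np.nan, 3.0, np.nan],
    "B": [2.0, np.nan, 4.0],
})

# Per-column fill values: 0 for 'A', the column mean for 'B'.
# mean() skips NaN by default, so the mean of 'B' is computed from 2.0 and 4.0.
filled = df.fillna({"A": 0, "B": df["B"].mean()})

print(filled)

# fillna() returns a new DataFrame; the original still contains NaN.
print(int(df.isna().sum().sum()))
```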

To transpose the DataFrame (swap its rows and columns):

df.T

	0	1	2	3
A	0.0	3.0	0.0	0.0
B	2.0	4.0	0.0	3.0
C	0.0	0.0	0.0	0.0
D	0.0	1.0	5.0	4.0
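
Note that row D is shown as floats (0.0, 1.0, ...) after transposing, although column D held integers before. A minimal sketch of why, on made-up data: each column of the transposed DataFrame mixes values from a float column and an integer column, so pandas stores them with a common float type.

```python
import pandas as pd

# Hypothetical toy data: one float column and one integer column.
df = pd.DataFrame({"A": [0.0, 3.0, 0.0], "D": [0, 1, 5]})

t = df.T

# Before: one float column and one integer column.
print(df.dtypes.tolist())

# After: every column mixes values from 'A' and 'D', so all become float.
print(t.dtypes.tolist())

# Shapes are swapped: (3, 2) becomes (2, 3).
print(df.shape, t.shape)
```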
