Mastering Pandas

An interactive infographic for the 80/20 of Data Manipulation in Python.

🐼 Pandas Master Reference Table

Your quick lookup guide. Click any function or method to go directly to the official Pandas documentation.

📚 Key Documentation Links

Pandas API Reference → The main index page for all public methods and functions.

General Functions → Top-level functions such as `merge`, `concat`, `pivot`, and `date_range`.

User Guide: Essential Basic Functionality → Describes core methods for selection, statistics, and more.

Important Functions (Supporting Roles)

Frequently used to support the critical functions.

sort_values(), rename(), drop(), astype(), sum(), mean(), count(), std(), apply(), concat(), to_datetime(), set_index(), reset_index()

Utility Functions (Specialized Tools)

Valuable for edge cases and specific applications.

pivot_table(), melt(), stack(), unstack(), cut(), qcut(), duplicated(), drop_duplicates(), resample(), to_excel(), read_sql(), pipe(), explode()

Recommended Learning Path

Follow this path for maximum return on your learning investment.

1. Data Structures & I/O: Series, DataFrame, read_csv
2. Inspection & Selection: head, info, describe, loc, iloc
3. Grouping & Merging: groupby, merge, value_counts
4. Cleaning & Advanced: dropna, fillna, apply, pivot_table

Full A-Z Quick Reference

A searchable list of all covered pandas functions.


*(Note: This is a subset; pandas has hundreds of functions.)*

The Critical 12 Functions (80% of Use Cases)

Master these to handle the vast majority of your data analysis tasks.

1. Data Structures & I/O

pd.DataFrame()

The primary data structure: a 2D table with labeled axes (rows and columns).

df = pd.DataFrame({'A': [1,2], 'B': [3,4]})

Visualization: the Python dict {'A': [1, 2], 'B': [3, 4]} becomes a two-row table with columns A and B.
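A minimal, runnable sketch of the idea above (the column names and values are made up):

```python
import pandas as pd

# A dict of lists: keys become column names, lists become column data.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

print(df.shape)           # (2, 2)
print(list(df.columns))   # ['A', 'B']
```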

pd.read_csv()

The most common way to load data from a file into a DataFrame.

df = pd.read_csv('data.csv')

Visualization: the file data.csv is parsed into a DataFrame with columns col1 and col2.

df.to_csv()

Save a DataFrame to a CSV file, completing the I/O cycle.

df.to_csv('output.csv')

Visualization: the DataFrame is written out to output.csv.

2. Inspection & Selection

df.head() / info()

Quickly inspect the first few rows, data types, and null counts of your data.

df.head(2)

Visualization (head): only the first 2 of N rows are returned.

df.loc[]

Select data by labels (row index names, column names, or boolean conditions).

df.loc[df['A'] > 1]

Visualization: the filter A > 1 keeps only the rows where column A exceeds 1.

df.iloc[]

Select data by integer position (e.g., the first row, last three columns).

df.iloc[0:2, 1]

Visualization: rows 0 and 1 of the column at position 1 are selected.
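A small sketch contrasting the two selectors; the index labels and data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# .loc selects by label; a boolean mask also works.
by_label = df.loc['y', 'B']        # the value at row 'y', column 'B'
filtered = df.loc[df['A'] > 1]     # rows 'y' and 'z'

# .iloc selects by integer position; the end of a slice is excluded.
by_position = df.iloc[0, 1]        # row 0, column 1
first_two = df.iloc[0:2]           # rows 'x' and 'y'
```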

3. Manipulation & Analysis

df.groupby()

The core of analysis: split data into groups, apply a function, and combine results.

df.groupby('category').sum()

Visualization: rows are split into groups A and B, each group is summed (Σ), and the results are combined.
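The split-apply-combine pattern can be sketched on made-up sales data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A'],
                   'sales': [100, 200, 150]})

# Split by category, sum each group, combine into one Series.
totals = df.groupby('category')['sales'].sum()
```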

pd.merge()

Combine datasets by linking rows using one or more keys (like a SQL join).

pd.merge(df1, df2, on='key')

Visualization: DF1 and DF2 are joined row by row on the shared key.
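A minimal sketch of a key-based join (the frames and key values are hypothetical):

```python
import pandas as pd

customers = pd.DataFrame({'key': [1, 2], 'name': ['Ann', 'Bob']})
orders = pd.DataFrame({'key': [1, 1, 2], 'amount': [10, 20, 30]})

# Inner join on 'key': each order row picks up the matching customer name.
merged = pd.merge(customers, orders, on='key')
```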

df.value_counts()

Quickly count the number of occurrences of each unique value in a column.

df['category'].value_counts()

Visualization: the column [A, B, A] yields counts A: 2, B: 1.
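The same counting, as a runnable sketch:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A'])

# Counts of each unique value, most frequent first.
counts = s.value_counts()
```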

4. Data Cleaning

df.dropna()

Remove missing values (NaN) from a DataFrame, essential for clean analysis.

df.dropna()

Visualization: the row containing NaN is dropped; the remaining rows are kept.

df.fillna()

Fill in missing values with a specified value or method (e.g., the mean).

df.fillna(0)

Visualization: [1, NaN, 3] becomes [1, 0, 3].
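Both cleaning operations side by side, on a toy column with one missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

dropped = df.dropna()   # the NaN row is removed entirely
filled = df.fillna(0)   # NaN is replaced with 0
```

Note that `dropna` shrinks the frame while `fillna` preserves its shape.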

df.astype()

Convert a column to a different data type (e.g., object to integer or datetime).

df['col'].astype('int')

Visualization: the string "123" (object dtype) becomes the number 123 (integer dtype).

Why Are These Critical?

Core Workflow: They represent the entire data analysis pipeline: loading data (`read_csv`), inspecting it (`head`, `info`), selecting subsets (`loc`, `iloc`), analyzing it (`groupby`), combining it (`merge`), and cleaning it (`dropna`, `fillna`).

Universal Need: Virtually every data science project using Python and pandas will rely heavily on this specific set of functions. Mastering them provides the foundation for everything else.

The Important Functions (Supporting Roles)

Frequently used to support critical functions and handle common data transformations.

1. Aggregation & Summarization

sum() / mean()

Basic aggregations that calculate the total or average of a column's values.

df['sales'].sum()

Visualization (sum): 100 + 200 + 150 = 450.

describe()

Generate a quick statistical summary (count, mean, std, etc.) of numerical columns.

df.describe()

Visualization: the DataFrame is summarized as count: 3.0, mean: 150.0, std: 50.0, and so on.

drop_duplicates()

A common cleaning operation to remove entire duplicate rows from a DataFrame.

df.drop_duplicates()

Visualization: [Row A, Row B, Row B] becomes [Row A, Row B].

2. Transformation & Reshaping

apply()

Apply a custom function along an axis (rows or columns). Powerful but can be slow.

df['col'].apply(lambda x: x*2)

Visualization: [1, 2] passed through f(x) = x*2 becomes [2, 4].

map()

Substitute each value in a Series with another value (from a dict or function).

df['col'].map({'A': 'Apple'})

Visualization: [A, B] mapped through {'A': 'Apple'} becomes [Apple, NaN]; unmapped values become NaN.
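A runnable sketch contrasting the two, with made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
doubled = s.apply(lambda x: x * 2)   # custom function, element by element

labels = pd.Series(['A', 'B'])
mapped = labels.map({'A': 'Apple'})  # 'B' has no mapping, so it becomes NaN
```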

pd.concat()

Stack pandas objects (like DataFrames) vertically or horizontally.

pd.concat([df1, df2])

Visualization: DF1 and DF2 are stacked into a single DataFrame.

set_index() / reset_index()

Promote a column to be the DataFrame's index, or demote the index to a column.

df.set_index('ID')

Visualization (set_index): the ID column moves out of the data and becomes the row index.

pivot_table()

Reshape data by creating a spreadsheet-style pivot table based on columns.

df.pivot_table(index='A', columns='B')

Visualization: long-format data is reshaped into wide-format data.
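A small long-to-wide sketch (columns A, B, and v are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': ['p', 'q', 'p'],
                   'v': [1, 2, 3]})

# Rows indexed by A, one column per value of B, mean of v in each cell.
wide = df.pivot_table(index='A', columns='B', values='v', aggfunc='mean')
```

Cells with no matching rows (here, A='y' with B='q') come out as NaN unless fill_value is given.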

3. Filtering & Selection

isin()

Filter a DataFrame where a column's value is present in a given list of items.

df[df['col'].isin(['A', 'C'])]

Visualization: [A, B, C] filtered by membership in ['A', 'C'] becomes [A, C].

query()

An alternative way to filter rows using a boolean expression as a string.

df.query('sales > 100 and region == "West"')

Visualization: the DataFrame filtered by the string expression "sales > 100" yields a filtered DataFrame.
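Both filtering styles, sketched on hypothetical sales data:

```python
import pandas as pd

df = pd.DataFrame({'region': ['West', 'East', 'West'],
                   'sales': [150, 90, 50]})

by_membership = df[df['region'].isin(['West'])]          # two West rows
by_query = df.query('sales > 100 and region == "West"')  # only the first row
```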

Why Are These Important?

Bridge to Advanced Analysis: These functions are the workhorses that prepare your data for critical operations. You use isin() or query() to select data *before* feeding it to groupby().

Flexibility and Power: Functions like apply() and pivot_table() provide immense flexibility to perform complex, custom transformations that are not possible with basic aggregations alone.

The Utility Functions (Specialized Tools)

Valuable for edge cases, convenience, or specific applications.

1. Statistical Analysis & Visualization

corr() / var()

Calculate the correlation matrix between columns or the variance of a series.

df.corr()

Visualization (corr): a symmetric matrix with 1.0 on the diagonal and pairwise correlations (e.g., 0.9) off the diagonal.

quantile()

Return value at the given quantile (e.g., 0.5 for the median).

df['col'].quantile(0.5)

Visualization (0.5 quantile): the median of [10, 20, 100] is 20.0.

nunique()

Count the number of distinct/unique elements in a Series.

df['col'].nunique()

Visualization: [A, B, A] contains 2 unique values.

plot()

A wrapper for Matplotlib to quickly generate plots from a DataFrame or Series.

df['sales'].plot(kind='line')


2. Selection, Reshaping & Conversion

astype()

Cast a pandas object (like a column) to a specified data type.

df['col'].astype(float)

Visualization: the string "123" (object dtype) becomes 123.0 (float dtype).

transpose() or .T

Reflect the DataFrame over its main diagonal by swapping rows and columns.

df.T

Visualization: a 2×3 DataFrame becomes a 3×2 DataFrame.

sample()

Return a random sample of items from an axis of an object.

df.sample(n=2)

Visualization: the full DataFrame yields a random sample of rows.

filter()

Subset rows or columns of a DataFrame according to labels in a specified index.

df.filter(like='_id')

Visualization (filter columns): user_id, sale, item_id is reduced to user_id, item_id.

to_dict() / to_numpy()

Convert the DataFrame to a Python dictionary or a NumPy array.

df.to_dict('records')

Visualization: the DataFrame becomes [{'A': 1}, {'A': 2}].

Why Are These "Utility"?

Niche Use Cases: Many of these functions solve specific problems that don't appear in every project. For example, .plot() is for quick visualization, and .filter() is for label-based selection, which is less common than conditional filtering.

Convenience & Interoperability: Some are helpers that offer convenient syntax (.T) or are used for converting data to other formats for use with other libraries (to_numpy(), to_dict()).

🐼 Pandas Series & DataFrame Reference

The two fundamental data structures that power everything in Pandas.

1. Pandas Series

A Series is a one-dimensional labeled array, like a single column in a spreadsheet with an index.

✅ How to Create a Series

From a List

The most basic way, with an auto-generated integer index.

s = pd.Series([10, 20, 30])

Visualization: the list [10, 20, 30] receives the auto-generated index 0, 1, 2.

With a Custom Index

Label your data points with meaningful keys.

s = pd.Series([1,2], index=['a','b'])

Visualization: data [1, 2] with index ['a', 'b'] pairs a with 1 and b with 2.

From a Dictionary

Keys become the index, values become the data.

s = pd.Series({'a':100, 'b':200})

Visualization: {'a': 100, 'b': 200} becomes a Series with index a, b and values 100, 200.
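The three creation patterns above, as a runnable sketch:

```python
import pandas as pd

from_list = pd.Series([10, 20, 30])              # auto integer index 0, 1, 2
labeled = pd.Series([1, 2], index=['a', 'b'])    # custom labels
from_dict = pd.Series({'a': 100, 'b': 200})      # keys become the index
```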

✅ How to Access Elements

By Position

Use an integer position via `.iloc` (plain `s[0]` still works for a default integer index, but positional access with `[]` is deprecated in recent pandas).

s.iloc[0]  # Access first element

Visualization: position 0 of the Series [10, 20, 30] returns 10.

By Label

Use the custom index labels for direct access.

s['b'] # Access element with label 'b'

Visualization: label 'b' in the Series {a: 1, b: 2, c: 3} returns 2.

Slicing

Select a range of elements. Works with positions and labels.

s[:2] # Select first two elements

Visualization: the first two elements (10, 20) of [10, 20, 30] are selected.

✅ Useful Series Functions & Attributes

.head() / .tail()

Preview the first or last 5 elements.

.index / .values

Get the index or the data as a NumPy array.

.dtype / .size

Check data type and number of elements.

.describe()

Get quick summary statistics (count, mean, std, etc.).

.unique()

Get an array of the unique values in the Series.

.value_counts()

Count the occurrences of each unique value.

.mean() / .sum()

Calculate the mean or sum of the values.

.min() / .max()

Find the minimum or maximum value.

2. Pandas DataFrame

A DataFrame is a 2D labeled table with columns of potentially different types, like a full spreadsheet.

✅ How to Create a DataFrame

From a Dictionary of Lists

Dictionary keys become column headers, lists become the column data.

data = {'name': ["A", "B"], 'age': [25, 30]}
df = pd.DataFrame(data)

Visualization: {'name': ["A", "B"], 'age': [25, 30]} becomes a table with rows (A, 25) and (B, 30).

From a List of Dictionaries

Each dictionary in the list becomes a row in the DataFrame.

data = [{'name':'A','age':25}, {'name':'B'}]
df = pd.DataFrame(data)

Visualization: the second dict is missing 'age', so the rows come out as (A, 25.0) and (B, NaN).

✅ How to Access Data

Access a Column
df['age']
Returns the 'age' column as a Series: [25, 30].

Access a Row (.iloc)
df.iloc[0]
Returns the first row: name A, age 25.

Access a Cell (.at)
df.at[0, 'age']
Returns the single value 25.

✅ Useful DataFrame Functions & Attributes

.head() / .tail()

Preview the first or last 5 rows.

.shape / .columns

Get `(rows, cols)` tuple or list of column names.

.info() / .dtypes

Get a full summary or check column data types.

.describe()

Summary statistics for numerical columns.

.T (Transpose)

Swap rows and columns.

.sort_values()

Sort the DataFrame by one or more columns.

.drop()

Remove rows or columns by label.

.isnull().sum()

Count missing values in each column.

.fillna()

Replace missing (NaN) values with a specified value.

Column Ops

Run functions like `.mean()` or `.max()` on a single column.

.groupby()

Group rows based on a column to perform aggregate functions.

📂 Importing & Exporting Data in Pandas

Your visual guide to moving data between Pandas and various file formats.

| Format    | Import Function      | Export Function    |
| --------- | -------------------- | ------------------ |
| CSV       | pd.read_csv()        | df.to_csv()        |
| Excel     | pd.read_excel()      | df.to_excel()      |
| JSON      | pd.read_json()       | df.to_json()       |
| SQL       | pd.read_sql()        | df.to_sql()        |
| HTML      | pd.read_html()       | df.to_html()       |
| Parquet   | pd.read_parquet()    | df.to_parquet()    |
| Pickle    | pd.read_pickle()     | df.to_pickle()     |
| Feather   | pd.read_feather()    | df.to_feather()    |
| Clipboard | pd.read_clipboard()  | df.to_clipboard()  |
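A quick round-trip sketch of the import/export pairing; an in-memory StringIO buffer stands in for a real file so the example is self-contained:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'price': [10, 20], 'quantity': [3, 4]})

# Export to CSV text, then import it back.
csv_text = df.to_csv(index=False)
back = pd.read_csv(StringIO(csv_text))
```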

📌 Common File Formats

1. CSV Files

Comma-Separated Values are the most common flat file format.

# Read from CSV
df = pd.read_csv("data.csv")
# Write to CSV
df.to_csv("output.csv", index=False)

2. Excel Files

Read from and write to Microsoft Excel spreadsheets.

# Read from Excel
df = pd.read_excel("data.xlsx")
# Write to Excel
df.to_excel("output.xlsx", index=False)

3. JSON Files

JavaScript Object Notation is common for web APIs.

# Read from JSON
df = pd.read_json("data.json")
# Write to JSON
df.to_json("output.json", orient="records")

4. SQL Databases

Execute a query and load results directly into a DataFrame.

# Read from SQL
df = pd.read_sql("SELECT * FROM users", conn)
# Write to SQL
df.to_sql("new_table", conn)

5. Parquet

An efficient, columnar storage format for big data.

# Read from Parquet
df = pd.read_parquet("data.parquet")
# Write to Parquet
df.to_parquet("output.parquet")

6. HTML Tables

Read tables directly from an HTML page (a local file or a URL).

# Read tables from URL
dfs = pd.read_html("page.html")
# Write to HTML
df.to_html("output.html")


📌 Specialized & Binary Formats

Pickle

Saves any Python object, including DataFrames with complex types.

df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")
HDF5

Hierarchical format for storing multiple datasets in one file.

df.to_hdf("data.h5", key="my_data")
df = pd.read_hdf("data.h5", "my_data")
Feather

A fast, lightweight binary format for sharing data.

df.to_feather("data.feather")
df = pd.read_feather("data.feather")

📊 Example Quick Use: Read, Process, Save

A common workflow is to load data from one format, perform some calculations, and save the result in another.

# Read CSV
df = pd.read_csv("sales.csv")
# Process Data
df["total"] = df["price"] * df["quantity"]
# Save to Excel
df.to_excel("sales_report.xlsx", index=False)

Workflow Visualization: sales.csv → process in a DataFrame → sales_report.xlsx

🐼 Pandas Data Selection, Filtering & Indexing

Visually mastering how to access, query, and isolate your data.

Sample DataFrame for Examples

All examples in this section will use the following DataFrame, df.

name age city
0 Alice 25 NY
1 Bob 30 LA
2 Charlie 35 SF
3 David 40 TX

📌 1. Selecting Columns

Single Column (returns a Series)
df['name']

Visualization: the name column (Alice, Bob, ...) is pulled out on its own.

Multiple Columns (returns a DataFrame)
df[['name', 'age']]

Visualization: the name and age columns are kept; city is dropped.

📌 2. Selecting Rows

By Position (.iloc)
df.iloc[0]
Returns the first row: Alice, 25, NY.

Slicing by Position (.iloc)
df.iloc[1:3]
Returns rows 1 and 2, Bob and Charlie (the end position is excluded).

By Condition
df[df['age'] > 30]
Returns the rows where age exceeds 30: Charlie and David.

📌 3. Selecting Specific Cells

By Position (.iloc)
df.iloc[0, 1]
Returns the value at row 0, column 1: 25.

By Label (.loc, .at)
df.loc[0, 'age']
Returns the value at row label 0, column 'age': 25.

📌 4. Boolean Filtering

Multiple Conditions (&, |)
df[(df['age'] > 25) & (df['city'] == "LA")]
Returns only Bob (age 30, city LA).

Multiple Values (.isin())
df[df['city'].isin(["NY", "LA"])]
Returns Alice (NY) and Bob (LA).

Not Null (.notnull())
df[df['city'].notnull()]
Filters out rows with missing values (NaN).

📌 5. Indexing Concepts

Set & Reset Index
df.set_index('name')
The name column becomes the row index; reset_index() moves it back to a regular column.

MultiIndex (Hierarchical)

Create an index with multiple levels for grouping.

              score  subject
Alice  Test1     85  Math
       Test2     90  Science
Bob    Test1     78  Math
       Test2     88  Science

📌 6. Useful Indexing Functions

.index / .columns

Get the index or column labels.

.sample(n)

Get `n` random rows from the DataFrame.

.nlargest(n, 'col')

Get the top `n` rows based on a column's values.

.nsmallest(n, 'col')

Get the bottom `n` rows based on a column's values.

🐼 Pandas Slicing Concepts

Visually extracting subsets of your data with precision and ease.

Sample DataFrame for Examples

All examples in this section use the following DataFrame, df.

name age city score
0 Alice 25 NY 88
1 Bob 30 LA 92
2 Charlie 35 SF 95
3 David 40 TX 70
4 Eva 28 NV 85
5 Frank 33 NY 90

📌 1. Row & Column Slicing

Row Slice by Position (.iloc)
df.iloc[1:4]
Returns rows 1-3: Bob, Charlie, David (the end position, 4, is excluded).

Row Slice by Label (.loc)
df.loc[0:2]
Returns rows 0-2: Alice, Bob, Charlie.

Note: .loc is inclusive, so index 2 is included.

Column Slice by Label (.loc)
df.loc[:, 'age':'score']
Returns the age, city, and score columns for every row (label slices are inclusive at both ends).

Row & Column Combined
df.iloc[1:4, 0:2]
Returns rows 1-3 of the first two columns (name, age): Bob, Charlie, David.

Step Slicing
df.iloc[::2]
Returns every second row: Alice, Charlie, Eva.

Negative Indexing
df.iloc[-2:]
Returns the last two rows: Eva and Frank.

📌 2. Conditional Slicing & Filtering

Simple Condition
df[df['age'] > 30]
Returns Charlie (35), David (40), and Frank (33).

List Membership (.isin())
df[df['city'].isin(['NY', 'LA'])]
Returns Alice (NY), Bob (LA), and Frank (NY).

Query Method (.query())
df.query('age > 30 & score > 80')
Returns Charlie (35, 95) and Frank (33, 90); David is excluded by his score of 70.

📌 3. MultiIndex Slicing

MultiIndex DataFrame Structure

An index with multiple levels for grouping.

              score
Alice  Test1     88
       Test2     92
Bob    Test1     90
       Test2     85
Slicing by Index Level
df_multi.loc['Alice']

Result: all of Alice's rows, indexed by the second level:

        score
Test1      88
Test2      92
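A self-contained sketch of this MultiIndex selection (df_multi is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Alice', 'Test1'), ('Alice', 'Test2'),
     ('Bob', 'Test1'), ('Bob', 'Test2')],
    names=['name', 'test'])
df_multi = pd.DataFrame({'score': [88, 92, 90, 85]}, index=idx)

# Selecting by the first index level returns all of Alice's rows,
# indexed by the remaining level.
alice = df_multi.loc['Alice']
```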

⚡ Summary of Slicing Techniques

| Method                | Example               | Notes                                           |
| --------------------- | --------------------- | ----------------------------------------------- |
| Row slice by position | df.iloc[1:4]          | End index is excluded.                          |
| Row slice by label    | df.loc[0:3]           | End label is included.                          |
| Step slicing          | df.iloc[::2]          | Selects every Nth element.                      |
| Conditional slice     | df[df['age']>30]      | Uses a boolean mask.                            |
| Query-based slice     | df.query('age>30')    | A readable string-based alternative.            |
| MultiIndex slice      | df_multi.loc['Alice'] | Selects all entries from the first index level. |

🐼 Pandas Grouping & Aggregation

Visually summarizing, transforming, and filtering your data by category.

Sample DataFrame for Examples

All examples in this section use the following DataFrame, df.

department employee salary age gender
0 HR Alice 50000 25 F
1 IT Bob 60000 30 M
2 HR Charlie 55000 28 M
3 IT David 65000 35 M
4 Finance Eva 70000 40 F
5 HR Frank 52000 32 M
6 Finance Grace 72000 38 F
7 IT Hannah 68000 29 F

📌 1. Basic Grouping & Aggregation

Simple Aggregation
df.groupby('department')['salary'].sum()

department
Finance    142000
HR         157000
IT         193000

Multiple Aggregations
df.groupby('department')['salary'].agg(['sum', 'mean', 'max'])

department     sum     mean    max
Finance     142000  71000.0  72000
HR          157000  52333.3  55000
IT          193000  64333.3  68000

Multi-level Grouping
df.groupby(['department', 'gender'])['salary'].mean()

department  gender
Finance     F         71000.0
HR          F         50000.0
            M         53500.0
IT          F         68000.0
            M         62500.0
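The aggregation above can be sketched on a trimmed-down version of the sample data (only a few rows, to keep it short):

```python
import pandas as pd

df = pd.DataFrame({'department': ['HR', 'IT', 'HR'],
                   'salary': [50000, 60000, 55000]})

# Several summary statistics per group in one call.
stats = df.groupby('department')['salary'].agg(['sum', 'mean', 'max'])
```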

📌 2. Group Operations (Iterate, Select, Apply)

Iterating Over Groups
grouped = df.groupby('department')
for dept, group in grouped: ...

Example: first iteration (dept='Finance'):

employee  salary
Eva        70000
Grace      72000

Selecting a Specific Group
grouped.get_group('HR')

employee  salary
Alice      50000
Charlie    55000
Frank      52000

Applying Custom Functions
def salary_range(x): ...
grouped['salary'].apply(salary_range)

department
Finance    2000
HR         5000
IT         8000

📌 3. Transform & Filter

Transform & Add Aggregated Values
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')

department  employee  salary  dept_avg_salary
HR          Alice      50000          52333.3
IT          Bob        60000          64333.3
HR          Charlie    55000          52333.3
IT          David      65000          64333.3

Filter Groups Based on Condition
grouped.filter(lambda x: x['salary'].mean() > 60000)

department  employee  salary
IT          Bob        60000
IT          David      65000
Finance     Eva        70000
Finance     Grace      72000
IT          Hannah     68000
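A compact runnable sketch of transform vs. filter on a four-row subset of the sample data:

```python
import pandas as pd

df = pd.DataFrame({'department': ['HR', 'IT', 'HR', 'IT'],
                   'salary': [50000, 60000, 55000, 65000]})

# transform broadcasts the group mean back onto every original row.
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')

# filter drops whole groups whose mean salary fails the condition.
high_paid = df.groupby('department').filter(lambda g: g['salary'].mean() > 60000)
```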

📌 4. Reshaping & Frequency Tables

Pivot Table
pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean')

gender            F        M
department
Finance     71000.0      NaN
HR          50000.0  53500.0
IT          68000.0  62500.0

Crosstab (Frequency Table)
pd.crosstab(df['department'], df['gender'])

gender      F  M
department
Finance     2  0
HR          1  2
IT          1  2

Cumulative Operations
df.groupby('department')['salary'].cumsum()

department  salary  cumulative
HR           50000       50000
HR           55000      105000
HR           52000      157000
IT           60000       60000

Note: the cumulative sum restarts for each group and runs in the original row order.

⚡ Summary of Grouping Methods

| Method                  | Description                                               |
| ----------------------- | --------------------------------------------------------- |
| .groupby('col')         | Groups DataFrame by a column.                             |
| .agg(['sum', 'mean'])   | Computes multiple summary statistics.                     |
| .apply(func)            | Applies a custom function to each group.                  |
| .transform('mean')      | Broadcasts a group calculation back to the original rows. |
| .filter(lambda x: ...)  | Removes entire groups based on a condition.               |
| pd.pivot_table(...)     | Creates an Excel-style pivot table.                       |
| pd.crosstab(...)        | Computes a frequency table of factors.                    |

🐼 Pivot Tables & Cross Tabulation

Visually summarizing and analyzing relationships between variables.

Sample DataFrame for Examples

All examples in this section use the following DataFrame, df.

  department employee  salary  age gender
0         HR    Alice   50000   25      F
1         IT      Bob   60000   30      M
2         HR  Charlie   55000   28      M
3         IT    David   65000   35      M
4    Finance      Eva   70000   40      F
5         HR    Frank   52000   32      M
6    Finance    Grace   72000   38      F
7         IT   Hannah   68000   29      F

📌 1. Pivot Tables

Basic Pivot (Single Aggregation)
pd.pivot_table(df, values='salary', index='department', aggfunc='mean')

department    salary
Finance      71000.0
HR           52333.3
IT           64333.3

Pivot with Columns
pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean', fill_value=0)

gender          F      M
department
Finance     71000      0
HR          50000  53500
IT          68000  62500

Pivot with Multiple Aggregations
pd.pivot_table(df, values='salary', index='department', aggfunc=['sum', 'mean', 'max'])

               sum     mean    max
department
Finance     142000  71000.0  72000
HR          157000  52333.3  55000
IT          193000  64333.3  68000

Advanced Pivot (Multiple Values & Columns)
pd.pivot_table(df, values=['salary', 'age'], index='department', columns='gender', aggfunc='mean')

             age           salary
gender         F     M          F        M
department
Finance     39.0   NaN    71000.0      NaN
HR          25.0  30.0    50000.0  53500.0
IT          29.0  32.5    68000.0  62500.0

📌 2. Cross Tabulation (crosstab)

Basic Crosstab (Frequency Count)
pd.crosstab(df['department'], df['gender'])
gender F M
department
Finance 2 0
HR 1 2
IT 1 2
Normalized Crosstab (Percentage)
pd.crosstab(df['department'], df['gender'], normalize='index')
gender F M
Finance 1.00 0.00
HR 0.33 0.67
IT 0.33 0.67
Crosstab with Margins (Totals)
pd.crosstab(df['department'], df['gender'], margins=True)
gender F M All
Finance 2 0 2
HR 1 2 3
IT 1 2 3
All 4 4 8
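A self-contained crosstab sketch on a small made-up frame, covering both the plain count and the margins=True variant:

```python
import pandas as pd

df = pd.DataFrame({'department': ['HR', 'IT', 'HR', 'Finance'],
                   'gender': ['F', 'M', 'M', 'F']})

counts = pd.crosstab(df['department'], df['gender'])
with_totals = pd.crosstab(df['department'], df['gender'], margins=True)
```

Cells with no observations (here, Finance/M) come out as 0, and margins=True adds an 'All' row and column of totals.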

📌 3. When to Use Pivot vs. Crosstab

| Feature        | Pivot Table                | Crosstab                         |
| -------------- | -------------------------- | -------------------------------- |
| Purpose        | Summarize numerical data   | Count/summarize categorical data |
| Aggregation    | mean, sum, max, min, etc.  | Mainly count (can also do %)     |
| Input          | A DataFrame                | Series or array-like objects     |
| Margins/Totals | Supported via margins=True | Supported via margins=True       |