🐼 Pandas Master Reference Table
A quick lookup guide to the most important pandas functions and methods, with pointers to the official documentation.
Category | Core / Must-Know Functions & Methods | Other Useful Functions & Methods |
---|---|---|
Creation & Data Structures | pd.Series(), pd.DataFrame() | astype(), to_dict(), to_numpy() |
High-Level Manipulation | merge(), concat(), pivot_table() | apply(), map(), sort_values(), drop_duplicates() |
Missing Data Handling | dropna(), fillna() | isnull(), notnull() |
Input / Output (IO) | read_csv(), to_csv() | read_excel(), read_json(), read_sql(), read_parquet(), read_pickle() |
Selection & Indexing | loc[], iloc[] | at[], isin(), query(), filter(), sample() |
Statistics & Aggregation | sum(), mean(), describe(), value_counts() | corr(), var(), quantile(), nunique() |
GroupBy & Window | groupby(), agg() | transform(), filter() (custom aggregations, applying multiple functions at once) |
Time Series / Date-Time | date_range() | |
String / Text Methods | | |
Utility Methods | head(), info() | shape, dtypes, T, nlargest(), nsmallest() |
📚 Key Documentation Links
Pandas API Reference → The main index page for all public methods and functions.
General Functions → Top-level functions like `merge`, `concat`, `pivot`, `date_range` etc.
User Guide: Essential Basic Functionality → Describes core methods like selection, statistics etc.
Important Functions (Supporting Roles)
Frequently used to support the critical functions.
Utility Functions (Specialized Tools)
Valuable for edge cases and specific applications.
Recommended Learning Path
Follow this path for maximum return on your learning investment.
Data Structures & I/O
Series, DataFrame, read_csv
Inspection & Selection
head, info, describe, loc, iloc
Grouping & Merging
groupby, merge, value_counts
Cleaning & Advanced
dropna, fillna, apply, pivot_table
The Critical 12 Functions (80% of Use Cases)
Master these to handle the vast majority of your data analysis tasks.
1. Data Structures & I/O
pd.DataFrame()
The primary data structure: a 2D table with labeled axes (rows and columns).
df = pd.DataFrame({'A': [1,2], 'B': [3,4]})
pd.read_csv()
The most common way to load data from a file into a DataFrame.
df = pd.read_csv('data.csv')
df.to_csv()
Save a DataFrame to a CSV file, completing the I/O cycle.
df.to_csv('output.csv')
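Taken together, these three calls form a complete I/O round trip. A minimal sketch (the column names and file name are illustrative):

```python
import pandas as pd

# Build a small DataFrame from a dict of columns
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Write it to CSV and read it back; index=False avoids writing the row index
df.to_csv('output.csv', index=False)
df2 = pd.read_csv('output.csv')

print(df2.equals(df))  # True: the round trip preserves the data
```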
2. Inspection & Selection
df.head() / info()
Quickly inspect the first few rows, data types, and null counts of your data.
df.head(2)
df.loc[]
Select data by labels (row index names, column names, or boolean conditions).
df.loc[df['A'] > 1]
df.iloc[]
Select data by integer position (e.g., the first row, last three columns).
df.iloc[0:2, 1]
Selects rows 0-1 of the column at position 1.
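The inspection and selection tools above can be combined in one short sketch (the data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.head(2)                    # first two rows
df.info()                     # dtypes and non-null counts (printed)

subset = df.loc[df['A'] > 1]  # boolean selection with loc
cell = df.iloc[0, 1]          # positional selection: row 0, column 1
```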
3. Manipulation & Analysis
df.groupby()
The core of analysis: split data into groups, apply a function, and combine results.
df.groupby('category').sum()
pd.merge()
Combine datasets by linking rows using one or more keys (like a SQL join).
pd.merge(df1, df2, on='key')
df.value_counts()
Quickly count the number of occurrences of each unique value in a column.
df['category'].value_counts()
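The three analysis tools above can be sketched together on invented sample data:

```python
import pandas as pd

sales = pd.DataFrame({'category': ['a', 'b', 'a'], 'amount': [10, 20, 30]})
labels = pd.DataFrame({'category': ['a', 'b'], 'label': ['Alpha', 'Beta']})

totals = sales.groupby('category')['amount'].sum()   # a -> 40, b -> 20
merged = pd.merge(sales, labels, on='category')      # SQL-style join on 'category'
counts = sales['category'].value_counts()            # a appears twice, b once
```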
4. Data Cleaning
df.dropna()
Remove missing values (NaN) from a DataFrame, essential for clean analysis.
df.dropna()
df.fillna()
Fill in missing values with a specified value or method (e.g., the mean).
df.fillna(0)
df.astype()
Convert a column to a different data type (e.g., object to integer or datetime).
df['col'].astype('int')
Converts "123" (object) to 123 (integer).
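A small cleaning sketch chaining the three methods (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({'col': ['1', '2', None]})

dropped = df.dropna()                 # removes the row holding the missing value
filled = df.fillna('0')               # or substitute a placeholder instead
as_int = filled['col'].astype('int')  # object strings -> integers: 1, 2, 0
```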
Why Are These Critical?
Core Workflow: They represent the entire data analysis pipeline: loading data (`read_csv`), inspecting it (`head`, `info`), selecting subsets (`loc`, `iloc`), analyzing it (`groupby`), combining it (`merge`), and cleaning it (`dropna`, `fillna`).
Universal Need: Virtually every data science project using Python and pandas will rely heavily on this specific set of functions. Mastering them provides the foundation for everything else.
The Important Functions (Supporting Roles)
Frequently used to support critical functions and handle common data transformations.
1. Aggregation & Summarization
sum() / mean()
Basic aggregations that calculate the total or average of a column's values.
df['sales'].sum()
describe()
Generate a quick statistical summary (count, mean, std, etc.) of numerical columns.
df.describe()
drop_duplicates()
A common cleaning operation to remove entire duplicate rows from a DataFrame.
df.drop_duplicates()
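A compact sketch of these summarization helpers on invented data:

```python
import pandas as pd

df = pd.DataFrame({'sales': [100, 200, 200]})

total = df['sales'].sum()        # 500
average = df['sales'].mean()     # roughly 166.67
summary = df.describe()          # count, mean, std, min, quartiles, max
deduped = df.drop_duplicates()   # the repeated 200 row is removed
```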
2. Transformation & Reshaping
apply()
Apply a custom function along an axis (rows or columns). Powerful but can be slow.
df['col'].apply(lambda x: x*2)
map()
Substitute each value in a Series with another value (from a dict or function).
df['col'].map({'A': 'Apple'})
pd.concat()
Stack pandas objects (like DataFrames) vertically or horizontally.
pd.concat([df1, df2])
set_index() / reset_index()
Promote a column to be the DataFrame's index, or demote the index to a column.
df.set_index('ID')
pivot_table()
Reshape data by creating a spreadsheet-style pivot table based on columns.
df.pivot_table(index='A', columns='B')
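The transformation and reshaping tools above in one runnable sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['p', 'q', 'p'], 'val': [1, 2, 3]})

doubled = df['val'].apply(lambda v: v * 2)     # elementwise custom function
mapped = df['A'].map({'x': 'ex', 'y': 'why'})  # dict-based substitution
stacked = pd.concat([df, df])                  # vertical stacking (6 rows)
indexed = df.set_index('A')                    # 'A' becomes the index
table = df.pivot_table(index='A', columns='B', values='val')  # mean by default
```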
3. Filtering & Selection
isin()
Filter a DataFrame where a column's value is present in a given list of items.
df[df['col'].isin(['A', 'C'])]
query()
An alternative way to filter rows using a boolean expression as a string.
df.query('sales > 100 and region == "West"')
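Both filtering styles side by side, on invented data matching the snippets above:

```python
import pandas as pd

df = pd.DataFrame({'region': ['West', 'East', 'West'], 'sales': [150, 90, 80]})

by_mask = df[df['region'].isin(['West'])]                # membership test
by_query = df.query('sales > 100 and region == "West"')  # string expression
```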
Why Are These Important?
Bridge to Advanced Analysis: These functions are the workhorses that prepare your data for the critical operations. You use isin() or query() to select data *before* feeding it to groupby().
Flexibility and Power: Functions like apply() and pivot_table() provide immense flexibility to perform complex, custom transformations that are not possible with basic aggregations alone.
The Utility Functions (Specialized Tools)
Valuable for edge cases, convenience, or specific applications.
1. Statistical Analysis & Visualization
corr() / var()
Calculate the correlation matrix between columns or the variance of a series.
df.corr()
quantile()
Return value at the given quantile (e.g., 0.5 for the median).
df['col'].quantile(0.5)
nunique()
Count the number of distinct/unique elements in a Series.
df['col'].nunique()
plot()
A wrapper for Matplotlib to quickly generate plots from a DataFrame or Series.
df['sales'].plot(kind='line')
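A sketch of the statistical helpers (plot() is omitted here since it needs a plotting backend; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})

cm = df.corr()                  # 'a' and 'b' are perfectly correlated (1.0)
median = df['a'].quantile(0.5)  # 2.5
distinct = df['a'].nunique()    # 4
```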
2. Selection, Reshaping & Conversion
astype()
Cast a pandas object (like a column) to a specified data type.
df['col'].astype(float)
Converts "123" (object) to 123.0 (float).
transpose() or .T
Reflect the DataFrame over its main diagonal by swapping rows and columns.
df.T
sample()
Return a random sample of items from an axis of an object.
df.sample(n=2)
filter()
Subset rows or columns of a DataFrame according to labels in a specified index.
df.filter(like='_id')
to_dict() / to_numpy()
Convert the DataFrame to a Python dictionary or a NumPy array.
df.to_dict('records')
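A sketch of these conversion and selection utilities (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3], 'score': [0.5, 0.7, 0.9]})

two = df.sample(n=2, random_state=0)  # reproducible random rows
ids = df.filter(like='_id')           # columns whose label contains '_id'
records = df.to_dict('records')       # list of one dict per row
arr = df.to_numpy()                   # underlying values as a NumPy array
```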
Why Are These "Utility"?
Niche Use Cases: Many of these functions solve specific problems that don't appear in every project. For example, .plot() is for quick visualization, and .filter() is for label-based selection, which is less common than conditional filtering.
Convenience & Interoperability: Some are helpers that offer convenient syntax (.T) or are used for converting data to other formats for use with other libraries (to_numpy(), to_dict()).
🐼 Pandas Series & DataFrame Reference
The two fundamental data structures that power everything in Pandas.
1. Pandas Series
A Series is a one-dimensional labeled array, like a single column in a spreadsheet with an index.
✅ How to Create a Series
From a List
The most basic way, with an auto-generated integer index.
s = pd.Series([10, 20, 30])
Result:
index | value |
---|---|
0 | 10 |
1 | 20 |
2 | 30 |
With a Custom Index
Label your data points with meaningful keys.
s = pd.Series([1,2], index=['a','b'])
Result:
index | value |
---|---|
a | 1 |
b | 2 |
From a Dictionary
Keys become the index, values become the data.
s = pd.Series({'a':100, 'b':200})
Result:
index | value |
---|---|
a | 100 |
b | 200 |
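The three construction patterns above are runnable as-is:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30])              # auto integer index 0, 1, 2
s2 = pd.Series([1, 2], index=['a', 'b'])  # custom labels
s3 = pd.Series({'a': 100, 'b': 200})      # dict keys become the index
```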
✅ How to Access Elements
By Position
Use integer indices just like a list.
s.iloc[0] # Access first element by position
Returns 10 for s = pd.Series([10, 20, 30]).
By Label
Use the custom index labels for direct access.
s['b'] # Access element with label 'b'
Returns 2 for s = pd.Series([1, 2], index=['a', 'b']).
Slicing
Select a range of elements. Works with positions and labels.
s[:2] # Select first two elements
Returns the first two elements (10 and 20 for s = pd.Series([10, 20, 30])).
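The access patterns in one sketch; note that recent pandas versions prefer .iloc for purely positional access:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

first = s.iloc[0]   # by position
mid = s['b']        # by label
head2 = s[:2]       # positional slice: labels 'a' and 'b'
```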
✅ Useful Series Functions & Attributes
.head() / .tail()
Preview the first or last 5 elements.
.index / .values
Get the index or the data as a NumPy array.
.dtype / .size
Check data type and number of elements.
.describe()
Get quick summary statistics (count, mean, std, etc.).
.unique()
Get an array of the unique values in the Series.
.value_counts()
Count the occurrences of each unique value.
.mean() / .sum()
Calculate the mean or sum of the values.
.min() / .max()
Find the minimum or maximum value.
2. Pandas DataFrame
A DataFrame is a 2D labeled table with columns of potentially different types, like a full spreadsheet.
✅ How to Create a DataFrame
From a Dictionary of Lists
Dictionary keys become column headers, lists become the column data.
data = {'name': ["A", "B"], 'age': [25, 30]}
df = pd.DataFrame(data)
Result:
 | name | age |
---|---|---|
0 | A | 25 |
1 | B | 30 |
From a List of Dictionaries
Each dictionary in the list becomes a row in the DataFrame.
data = [{'name':'A','age':25}, {'name':'B'}]
df = pd.DataFrame(data)
Result:
 | name | age |
---|---|---|
0 | A | 25.0 |
1 | B | NaN |
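Both construction patterns, runnable as-is:

```python
import pandas as pd

# Dict of lists: keys become column headers
df1 = pd.DataFrame({'name': ['A', 'B'], 'age': [25, 30]})

# List of dicts: each dict becomes a row; missing keys become NaN
df2 = pd.DataFrame([{'name': 'A', 'age': 25}, {'name': 'B'}])
```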
✅ How to Access Data
Access a Column
df['age']
Result (the 'age' column as a Series):
 | age |
---|---|
0 | 25 |
1 | 30 |
Access a Row (.iloc)
df.iloc[0]
Result (row 0 as a Series):
name | age |
---|---|
A | 25 |
Access a Cell (.at)
df.at[0, 'age']
Returns the scalar 25 (row 0, column 'age').
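The three access levels in one sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B'], 'age': [25, 30]})

col = df['age']         # column -> Series
row = df.iloc[0]        # row 0 -> Series with index ['name', 'age']
cell = df.at[0, 'age']  # single scalar, fast label-based lookup
```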
✅ Useful DataFrame Functions & Attributes
.head() / .tail()
Preview the first or last 5 rows.
.shape / .columns
Get `(rows, cols)` tuple or list of column names.
.info() / .dtypes
Get a full summary or check column data types.
.describe()
Summary statistics for numerical columns.
.T (Transpose)
Swap rows and columns.
.sort_values()
Sort the DataFrame by one or more columns.
.drop()
Remove rows or columns by label.
.isnull().sum()
Count missing values in each column.
.fillna()
Replace missing (NaN) values with a specified value.
Column Ops
Run functions like `.mean()` or `.max()` on a single column.
.groupby()
Group rows based on a column to perform aggregate functions.
📂 Importing & Exporting Data in Pandas
Your visual guide to moving data between Pandas and various file formats.
Format | Import Function | Export Function |
---|---|---|
CSV | pd.read_csv() | df.to_csv() |
Excel | pd.read_excel() | df.to_excel() |
JSON | pd.read_json() | df.to_json() |
SQL | pd.read_sql() | df.to_sql() |
HTML | pd.read_html() | df.to_html() |
Parquet | pd.read_parquet() | df.to_parquet() |
Pickle | pd.read_pickle() | df.to_pickle() |
Feather | pd.read_feather() | df.to_feather() |
Clipboard | pd.read_clipboard() | df.to_clipboard() |
📌 Common File Formats
1. CSV Files
Comma-Separated Values are the most common flat file format.
# Read from CSV
df = pd.read_csv("data.csv")
# Write to CSV
df.to_csv("output.csv", index=False)
2. Excel Files
Read from and write to Microsoft Excel spreadsheets.
# Read from Excel
df = pd.read_excel("data.xlsx")
# Write to Excel
df.to_excel("output.xlsx", index=False)
3. JSON Files
JavaScript Object Notation is common for web APIs.
# Read from JSON
df = pd.read_json("data.json")
# Write to JSON
df.to_json("output.json", orient="records")
4. SQL Databases
Execute a query and load results directly into a DataFrame.
# Read from SQL
df = pd.read_sql("SELECT * FROM users", conn)
# Write to SQL
df.to_sql("new_table", conn)
5. Parquet
An efficient, columnar storage format for big data.
# Read from Parquet
df = pd.read_parquet("data.parquet")
# Write to Parquet
df.to_parquet("output.parquet")
6. HTML Tables
Read tables directly from a webpage URL.
# Read tables from URL
dfs = pd.read_html("page.html")
# Write to HTML
df.to_html("output.html")
📌 Specialized & Binary Formats
Pickle
Saves any Python object, including DataFrames with complex types.
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")
HDF5
Hierarchical format for storing multiple datasets in one file.
df.to_hdf("data.h5", key="my_data")
df = pd.read_hdf("data.h5", "my_data")
Feather
A fast, lightweight binary format for sharing data.
df.to_feather("data.feather")
df = pd.read_feather("data.feather")
📊 Example Quick Use: Read, Process, Save
A common workflow is to load data from one format, perform some calculations, and save the result in another.
# Read CSV
df = pd.read_csv("sales.csv")
# Process Data
df["total"] = df["price"] * df["quantity"]
# Save to Excel
df.to_excel("sales_report.xlsx", index=False)
Workflow: sales.csv → DataFrame → sales_report.xlsx
🐼 Pandas Data Selection, Filtering & Indexing
Visually mastering how to access, query, and isolate your data.
Sample DataFrame for Examples
All examples in this section will use the following DataFrame, df:
 | name | age | city |
---|---|---|---|
0 | Alice | 25 | NY |
1 | Bob | 30 | LA |
2 | Charlie | 35 | SF |
3 | David | 40 | TX |
📌 1. Selecting Columns
Single Column (returns a Series)
df['name']
Result:
 | name |
---|---|
0 | Alice |
1 | Bob |
2 | Charlie |
3 | David |
Multiple Columns (returns a DataFrame)
df[['name', 'age']]
Result:
 | name | age |
---|---|---|
0 | Alice | 25 |
1 | Bob | 30 |
2 | Charlie | 35 |
3 | David | 40 |
📌 2. Selecting Rows
By Position (.iloc)
df.iloc[0]
Result (row 0 as a Series):
name | age | city |
---|---|---|
Alice | 25 | NY |
Slicing by Position (.iloc)
df.iloc[1:3]
Result (rows 1-2; the end index is excluded):
 | name | age | city |
---|---|---|---|
1 | Bob | 30 | LA |
2 | Charlie | 35 | SF |
By Condition
df[df['age'] > 30]
Result:
 | name | age | city |
---|---|---|---|
2 | Charlie | 35 | SF |
3 | David | 40 | TX |
📌 3. Selecting Specific Cells
By Position (.iloc)
df.iloc[0, 1]
Returns the scalar 25 (row 0, column 1, 'age').
By Label (.loc, .at)
df.loc[0, 'age']
Returns the scalar 25 (row label 0, column 'age').
📌 4. Boolean Filtering
Multiple Conditions (&, |)
df[(df['age'] > 25) & (df['city'] == "LA")]
Result:
 | name | age | city |
---|---|---|---|
1 | Bob | 30 | LA |
Multiple Values (.isin())
df[df['city'].isin(["NY", "LA"])]
Result:
 | name | age | city |
---|---|---|---|
0 | Alice | 25 | NY |
1 | Bob | 30 | LA |
Not Null (.notnull())
df[df['city'].notnull()]
Filters out rows with missing values (NaN).
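The filters in this section can be reproduced end to end by rebuilding the sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'age': [25, 30, 35, 40],
                   'city': ['NY', 'LA', 'SF', 'TX']})

over_30 = df[df['age'] > 30]                             # Charlie, David
la_adults = df[(df['age'] > 25) & (df['city'] == 'LA')]  # Bob
coastal = df[df['city'].isin(['NY', 'LA'])]              # Alice, Bob
non_null = df[df['city'].notnull()]                      # all four rows here
```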
📌 5. Indexing Concepts
Set & Reset Index
df.set_index('name')
Result ('name' becomes the index):
name | age | city |
---|---|---|
Alice | 25 | NY |
Bob | 30 | LA |
Charlie | 35 | SF |
David | 40 | TX |
MultiIndex (Hierarchical)
Create an index with multiple levels for grouping.
 | | score | subject |
---|---|---|---|
Alice | Test1 | 85 | Math |
 | Test2 | 90 | Science |
Bob | Test1 | 78 | Math |
 | Test2 | 88 | Science |
📌 6. Useful Indexing Functions
.index / .columns
Get the index or column labels.
.sample(n)
Get `n` random rows from the DataFrame.
.nlargest(n, 'col')
Get the top `n` rows based on a column's values.
.nsmallest(n, 'col')
Get the bottom `n` rows based on a column's values.
🐼 Pandas Slicing Concepts
Visually extracting subsets of your data with precision and ease.
Sample DataFrame for Examples
All examples in this section use the following DataFrame, df:
 | name | age | city | score |
---|---|---|---|---|
0 | Alice | 25 | NY | 88 |
1 | Bob | 30 | LA | 92 |
2 | Charlie | 35 | SF | 95 |
3 | David | 40 | TX | 70 |
4 | Eva | 28 | NV | 85 |
5 | Frank | 33 | NY | 90 |
📌 1. Row & Column Slicing
Row Slice by Position (.iloc)
df.iloc[1:4]
Result (rows 1-3; the end index is excluded):
 | name | age | city | score |
---|---|---|---|---|
1 | Bob | 30 | LA | 92 |
2 | Charlie | 35 | SF | 95 |
3 | David | 40 | TX | 70 |
Row Slice by Label (.loc)
df.loc[0:2]
Result:
 | name | age | city | score |
---|---|---|---|---|
0 | Alice | 25 | NY | 88 |
1 | Bob | 30 | LA | 92 |
2 | Charlie | 35 | SF | 95 |
Note: .loc is inclusive, so index 2 is included.
Column Slice by Label (.loc)
df.loc[:, 'age':'score']
Result (all rows, columns 'age' through 'score'):
 | age | city | score |
---|---|---|---|
0 | 25 | NY | 88 |
1 | 30 | LA | 92 |
... | ... | ... | ... |
Row & Column Combined
df.iloc[1:4, 0:2]
Result (rows 1-3, columns 0-1):
 | name | age |
---|---|---|
1 | Bob | 30 |
2 | Charlie | 35 |
3 | David | 40 |
Step Slicing
df.iloc[::2]
Result (every second row):
 | name | age | city | score |
---|---|---|---|---|
0 | Alice | 25 | NY | 88 |
2 | Charlie | 35 | SF | 95 |
4 | Eva | 28 | NV | 85 |
Negative Indexing
df.iloc[-2:]
Result (last two rows):
 | name | age | city | score |
---|---|---|---|---|
4 | Eva | 28 | NV | 85 |
5 | Frank | 33 | NY | 90 |
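All of the slices above, reproduced on the section's sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
                   'age': [25, 30, 35, 40, 28, 33],
                   'city': ['NY', 'LA', 'SF', 'TX', 'NV', 'NY'],
                   'score': [88, 92, 95, 70, 85, 90]})

mid = df.iloc[1:4]               # rows 1-3: Bob, Charlie, David (end excluded)
labeled = df.loc[0:2]            # rows 0-2: Alice, Bob, Charlie (end included)
cols = df.loc[:, 'age':'score']  # columns 'age' through 'score'
every_other = df.iloc[::2]       # rows 0, 2, 4
last_two = df.iloc[-2:]          # Eva, Frank
```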
📌 2. Conditional Slicing & Filtering
Simple Condition
df[df['age'] > 30]
name | age | city |
---|---|---|
Charlie | 35 | SF |
David | 40 | TX |
Frank | 33 | NY |
List Membership (.isin())
df[df['city'].isin(['NY', 'LA'])]
name | age | city |
---|---|---|
Alice | 25 | NY |
Bob | 30 | LA |
Frank | 33 | NY |
Query Method (.query())
df.query('age > 30 & score > 80')
name | age | score |
---|---|---|
Charlie | 35 | 95 |
Frank | 33 | 90 |
📌 3. MultiIndex Slicing
MultiIndex DataFrame Structure
An index with multiple levels for grouping.
 | | score |
---|---|---|
Alice | Test1 | 88 |
 | Test2 | 92 |
Bob | Test1 | 90 |
 | Test2 | 85 |
Slicing by Index Level
df_multi.loc['Alice']
Result:
score | |
---|---|
Test1 | 88 |
Test2 | 92 |
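The document doesn't show how df_multi was built; one possible construction matching the table above (the level names are assumptions):

```python
import pandas as pd

# Assumed construction of the df_multi shown above
idx = pd.MultiIndex.from_tuples(
    [('Alice', 'Test1'), ('Alice', 'Test2'),
     ('Bob', 'Test1'), ('Bob', 'Test2')],
    names=['student', 'exam'])
df_multi = pd.DataFrame({'score': [88, 92, 90, 85]}, index=idx)

alice = df_multi.loc['Alice']  # both of Alice's rows; first level dropped
```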
⚡ Summary of Slicing Techniques
Method | Example | Notes |
---|---|---|
Row slice by position | df.iloc[1:4] | End index is excluded. |
Row slice by label | df.loc[0:3] | End label is included. |
Step slicing | df.iloc[::2] | Selects every Nth element. |
Conditional slice | df[df['age']>30] | Uses a boolean mask. |
Query-based slice | df.query('age>30') | A readable string-based alternative. |
MultiIndex slice | df_multi.loc['Alice'] | Selects all entries from the first index level. |
🐼 Pandas Grouping & Aggregation
Visually summarizing, transforming, and filtering your data by category.
Sample DataFrame for Examples
All examples in this section use the following DataFrame, df:
 | department | employee | salary | age | gender |
---|---|---|---|---|---|
0 | HR | Alice | 50000 | 25 | F |
1 | IT | Bob | 60000 | 30 | M |
2 | HR | Charlie | 55000 | 28 | M |
3 | IT | David | 65000 | 35 | M |
4 | Finance | Eva | 70000 | 40 | F |
5 | HR | Frank | 52000 | 32 | M |
6 | Finance | Grace | 72000 | 38 | F |
7 | IT | Hannah | 68000 | 29 | F |
📌 1. Basic Grouping & Aggregation
Simple Aggregation
df.groupby('department')['salary'].sum()
department | salary |
---|---|
Finance | 142000 |
HR | 157000 |
IT | 193000 |
Multiple Aggregations
df.groupby('department')['salary'].agg(['sum', 'mean', 'max'])
dept | sum | mean | max |
---|---|---|---|
Finance | 142k | 71k | 72k |
HR | 157k | 52.3k | 55k |
IT | 193k | 64.3k | 68k |
Multi-level Grouping
df.groupby(['department', 'gender'])['salary'].mean()
department | gender | salary |
---|---|---|
Finance | F | 71000 |
HR | F | 50000 |
 | M | 53500 |
IT | F | 68000 |
 | M | 62500 |
📌 2. Group Operations (Iterate, Select, Apply)
Iterating Over Groups
grouped = df.groupby('department')
for dept, group in grouped: ...
Example: First Iteration (dept='Finance')
employee | salary |
---|---|
Eva | 70000 |
Grace | 72000 |
Selecting a Specific Group
grouped.get_group('HR')
employee | salary |
---|---|
Alice | 50000 |
Charlie | 55000 |
Frank | 52000 |
Applying Custom Functions
def salary_range(x): return x.max() - x.min()
grouped['salary'].apply(salary_range)
department | salary_range |
---|---|
Finance | 2000 |
HR | 5000 |
IT | 8000 |
📌 3. Transform & Filter
Transform & Add Aggregated Values
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
dept | employee | salary | dept_avg_salary |
---|---|---|---|
HR | Alice | 50k | 52333 |
IT | Bob | 60k | 64333 |
HR | Charlie | 55k | 52333 |
IT | David | 65k | 64333 |
Filter Groups Based on Condition
grouped.filter(lambda x: x['salary'].mean() > 60000)
dept | employee | salary |
---|---|---|
IT | Bob | 60000 |
IT | David | 65000 |
Finance | Eva | 70000 |
Finance | Grace | 72000 |
IT | Hannah | 68000 |
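transform and filter can be reproduced on the section's sample data (only the columns they need):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'HR', 'Finance', 'IT'],
    'salary': [50000, 60000, 55000, 65000, 70000, 52000, 72000, 68000]})

# transform broadcasts each group's mean back onto its original rows
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')

# filter drops whole groups: only departments averaging above 60000 remain
high_paid = df.groupby('department').filter(lambda g: g['salary'].mean() > 60000)
```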
📌 4. Reshaping & Frequency Tables
Pivot Table
pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean', fill_value=0)
gender | F | M |
---|---|---|
department | ||
Finance | 71k | 0 |
HR | 50k | 53.5k |
IT | 68k | 62.5k |
Crosstab (Frequency Table)
pd.crosstab(df['department'], df['gender'])
gender | F | M |
---|---|---|
department | ||
Finance | 2 | 0 |
HR | 1 | 2 |
IT | 1 | 2 |
Cumulative Operations
df.groupby('department')['salary'].cumsum()
dept | salary | cum_salary |
---|---|---|
HR | 50k | 50k |
HR | 52k | 102k |
HR | 55k | 157k |
IT | 60k | 60k |
Note: Cumulation restarts for each group.
⚡ Summary of Grouping Methods
Method | Description |
---|---|
.groupby('col') | Groups DataFrame by a column. |
.agg(['sum', 'mean']) | Computes multiple summary statistics. |
.apply(func) | Applies a custom function to each group. |
.transform('mean') | Broadcasts a group calculation back to the original rows. |
.filter(lambda x: ...) | Removes entire groups based on a condition. |
pd.pivot_table(...) | Creates an Excel-style pivot table. |
pd.crosstab(...) | Computes a frequency table of factors. |
🐼 Pivot Tables & Cross Tabulation
Visually summarizing and analyzing relationships between variables.
Sample DataFrame for Examples
All examples in this section use the same 8-row DataFrame, df, as the previous section:
 | department | employee | salary | age | gender |
---|---|---|---|---|---|
0 | HR | Alice | 50000 | 25 | F |
1 | IT | Bob | 60000 | 30 | M |
2 | HR | Charlie | 55000 | 28 | M |
3 | IT | David | 65000 | 35 | M |
4 | Finance | Eva | 70000 | 40 | F |
5 | HR | Frank | 52000 | 32 | M |
6 | Finance | Grace | 72000 | 38 | F |
7 | IT | Hannah | 68000 | 29 | F |
📌 1. Pivot Tables
Basic Pivot (Single Aggregation)
pd.pivot_table(df, values='salary', index='department', aggfunc='mean')
department | salary |
---|---|
Finance | 71000.0 |
HR | 52333.3 |
IT | 64333.3 |
Pivot with Columns
pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean', fill_value=0)
gender | F | M |
---|---|---|
department | ||
Finance | 71000 | 0 |
HR | 50000 | 53500 |
IT | 68000 | 62500 |
Pivot with Multiple Aggregations
pd.pivot_table(df, values='salary', index='department', aggfunc=['sum', 'mean', 'max'])
department | sum | mean | max |
---|---|---|---|
Finance | 142k | 71k | 72k |
HR | 157k | 52.3k | 55k |
IT | 193k | 64.3k | 68k |
Advanced Pivot (Multiple Values & Columns)
pd.pivot_table(df, values=['salary', 'age'], index='department', columns='gender', aggfunc='mean', fill_value=0)
 | salary | | age | |
---|---|---|---|---|
department | F | M | F | M |
Finance | 71k | 0 | 39 | 0 |
HR | 50k | 53.5k | 25 | 30 |
IT | 68k | 62.5k | 29 | 32.5 |
📌 2. Cross Tabulation (crosstab)
Basic Crosstab (Frequency Count)
pd.crosstab(df['department'], df['gender'])
gender | F | M |
---|---|---|
department | ||
Finance | 2 | 0 |
HR | 1 | 2 |
IT | 1 | 2 |
Normalized Crosstab (Percentage)
pd.crosstab(df['department'], df['gender'], normalize='index')
gender | F | M |
---|---|---|
Finance | 1.00 | 0.00 |
HR | 0.33 | 0.67 |
IT | 0.33 | 0.67 |
Crosstab with Margins (Totals)
pd.crosstab(df['department'], df['gender'], margins=True)
gender | F | M | All |
---|---|---|---|
Finance | 2 | 0 | 2 |
HR | 1 | 2 | 3 |
IT | 1 | 2 | 3 |
All | 4 | 4 | 8 |
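Both summaries can be reproduced on the sample data (only the columns they need):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'HR', 'Finance', 'IT'],
    'salary': [50000, 60000, 55000, 65000, 70000, 52000, 72000, 68000],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'F']})

# Mean salary per department, split by gender; fill_value avoids NaN cells
pivot = pd.pivot_table(df, values='salary', index='department',
                       columns='gender', aggfunc='mean', fill_value=0)

# Frequency counts with row/column totals
freq = pd.crosstab(df['department'], df['gender'], margins=True)
```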
📌 3. When to Use Pivot vs. Crosstab
Feature | Pivot Table | Crosstab |
---|---|---|
Purpose | Summarize numerical data | Count/summarize categorical data |
Aggregation | mean, sum, max, min, etc. | Mainly count (can also do %) |
Input | A DataFrame | Series or array-like objects |
Margins/Totals | Supported via margins=True | Supported via margins=True |