Data Manipulation with pandas
In this section, we will explore various data manipulation and analysis techniques using the Pandas library. We'll begin by learning how to set, update, or replace data within a DataFrame. Then, we'll cover how to handle missing data effectively. Next, we will delve into grouping data, transforming it, and joining different datasets to prepare for downstream analysis. Finally, we'll look at generating summary statistics to gain a deeper understanding of the underlying data.
We will use the following DataFrames in this section.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": np.random.uniform(low=0, high=1, size=12),
     "B": pd.date_range(start="20230102", periods=12),
     "C": pd.Series(range(1, 13), index=["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8", "R9", "R10", "R11", "R12"]),
     "D": np.random.randint(1, high=100, size=12),
     "E": pd.Categorical(["red", "green", "blue", "white", "pink", "brown", "black", "purple", "orange", "grey", "violet", "yellow"]),
     "F": "foo"},
    index=["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8", "R9", "R10", "R11", "R12"]
)

cms = pd.read_csv('data/patient_satisfaction/cms_hospital_patient_satisfaction.csv')
Statistics on data
The pandas library provides a wide range of statistical functions and methods to compute summary statistics for your data. The examples below cover some of the key statistical measures you can compute using this library.
Consider the following dataset, which contains 15 rows. Each row contains 8 features (columns).
cms
ID | Facility Name | County | Hospital Type | Star Rating | No of Surveys | Response Rate | Overall Rating | |
---|---|---|---|---|---|---|---|---|
0 | 50424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
1 | 140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
2 | 100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
3 | 40062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
4 | 440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
5 | 450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
6 | 151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
7 | 61327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
8 | 490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
9 | 110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
10 | 50704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
11 | 100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
12 | 440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
13 | 501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
14 | 50116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
You can compute the mean of a Series (a single column of data) or a DataFrame (a table of data) using the .mean()
method.
Calculate the mean value for a certain column:
cms['No of Surveys'].mean()
1063.1333333333334
Calculate the mean value for each column:
cms.mean(numeric_only = True)
ID                211738.200000
Star Rating            3.000000
No of Surveys       1063.133333
Response Rate         25.000000
Overall Rating         2.733333
dtype: float64
Calculate the mean value for each row:
cms.mean(axis=1, numeric_only = True)
0      10716.8
1      28075.2
2      20291.4
3       8521.8
4      88373.8
5      90284.0
6      30291.8
7      12323.0
8      98143.0
9      22390.8
10     10193.0
11     20144.0
12     88144.8
13    100352.6
14     10250.2
dtype: float64
The median is the middle value of a dataset when the values are arranged in ascending order. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data. In Pandas, you can compute the median using the
.median()
method.

cms.median(numeric_only = True)
ID                110215.0
Star Rating            3.0
No of Surveys        680.0
Response Rate         24.0
Overall Rating         3.0
dtype: float64
The standard deviation measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability. In Pandas, you can compute the standard deviation using the
.std()
method.

cms.std(numeric_only = True)
ID                188317.557348
Star Rating            0.925820
No of Surveys        909.054600
Response Rate          9.296697
Overall Rating         0.798809
dtype: float64
Variance quantifies how much individual data points deviate from the mean. It is the square of the standard deviation. In Pandas, you can compute the variance using the
.var()
method.

cms.var(numeric_only = True)
ID                3.546350e+10
Star Rating       8.571429e-01
No of Surveys     8.263803e+05
Response Rate     8.642857e+01
Overall Rating    6.380952e-01
dtype: float64
You can also compute the sum of values using
.sum()
and count the total number of non-missing values using .count().

cms.sum(numeric_only = True)
ID                3176073
Star Rating            45
No of Surveys       15947
Response Rate         375
Overall Rating         41
dtype: int64
cms.count()
ID                15
Facility Name     15
County            15
Hospital Type     15
Star Rating       15
No of Surveys     15
Response Rate     15
Overall Rating    15
dtype: int64
"Overall Rating"].sum() cms[
41
Here is a quick reference summary table of common useful functions.
Function | Description |
---|---|
count | Number of non-NA observations |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
sem | Standard error of the mean |
skew | Sample skewness (3rd moment) |
kurt | Sample kurtosis (4th moment) |
quantile | Sample quantile (value at %) |
cumsum | Cumulative sum |
cumprod | Cumulative product |
cummax | Cumulative maximum |
cummin | Cumulative minimum |
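Most of these functions follow the same calling pattern as .mean() above. A few illustrative calls on the cms DataFrame (outputs omitted; column names are those shown in the table earlier) might look like this:

cms["No of Surveys"].quantile(0.75)           # 75th percentile of the number of surveys
cms["No of Surveys"].cumsum()                 # running total down the column
cms[["Star Rating", "Overall Rating"]].sem()  # standard error of the mean for two columns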
Another useful function to count the frequency of values is shown below.
Consider the DataFrame,
d = {
    "a": pd.Series(np.random.randint(0, 5, size=10)),
    "b": pd.Series(np.random.randint(-3, 3, size=10))
}
df2 = pd.DataFrame(d)
df2
a | b | |
---|---|---|
0 | 3 | -2 |
1 | 1 | 2 |
2 | 0 | 0 |
3 | 1 | -1 |
4 | 0 | 2 |
5 | 4 | -3 |
6 | 2 | -3 |
7 | 3 | -3 |
8 | 3 | 1 |
9 | 0 | 0 |
Frequency of values in all rows:
df2.value_counts()
a   b
0    0    2
     2    1
1   -1    1
     2    1
2   -3    1
3   -3    1
    -2    1
     1    1
4   -3    1
Name: count, dtype: int64
Frequency of values in a single column:
df2['a'].value_counts()
a
3    3
0    3
1    2
4    1
2    1
Name: count, dtype: int64
Additionally, two powerful functions, agg
and transform
, allow you to perform calculations and transformations on DataFrames. Each applies a user-defined function that either reduces (agg) or broadcasts (transform) its result. These functions serve different purposes and have distinct use cases, and we will explore them in detail later.
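As a quick preview of the difference, using the df2 DataFrame from above: a reducing function passed to agg returns one value per column, while transform returns a result with the same shape as the input.

df2.agg("mean")                         # one mean per column -> reduced result
df2.transform(lambda x: x - x.mean())   # subtract each column's mean -> same shape as df2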
Missing data
Handling missing values is an essential part of data pre-processing and analysis in Pandas. Missing values can arise due to various reasons, such as data collection errors, incomplete data or sensor failures.
Let’s create a DataFrame with missing values.
Consider the following DataFrame:
df = pd.DataFrame(np.random.randn(6, 4), index=['a', 'b', 'c', 'd', 'e', 'f'], columns=list("ABCD"))

dates = pd.date_range("20130101", periods=6)
df["E"] = dates

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])
df["F"] = s1
df
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 | 1 |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 | 2 |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3 |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4 |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5 |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6 |
Re-indexing allows you to change/add/delete the index on a specified axis. It returns a copy of the data; here, the newly added row and columns introduce several missing values.
df1 = df.reindex(index=['a', 'b', 'c', 'h', 'd', 'e', 'f', 'g'], columns=list(df.columns) + ["G", "H"])
df1.loc['c':'f', "G"] = 1
df1.loc['h', 'C'] = 0.634336
df1
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 | 1.0 | NaN | NaN |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 | 2.0 | NaN | NaN |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3.0 | 1.0 | NaN |
h | NaN | NaN | 0.634336 | NaN | NaT | NaN | 1.0 | NaN |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4.0 | 1.0 | NaN |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5.0 | 1.0 | NaN |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6.0 | 1.0 | NaN |
g | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN |
Detecting missing values
Pandas provides several methods for detecting and dealing with missing values.
isna()
and isnull()
methods return a DataFrame of the same shape as the input, with Boolean values indicating whether each element is missing (True
) or not (False
).
pd.isna(df1)
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | False | False | False | False | False | False | True | True |
b | False | False | False | False | False | False | True | True |
c | False | False | False | False | False | False | False | True |
h | True | True | False | True | True | True | False | True |
d | False | False | False | False | False | False | False | True |
e | False | False | False | False | False | False | False | True |
f | False | False | False | False | False | False | False | True |
g | True | True | True | True | True | True | True | True |
pd.isnull(df1)
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | False | False | False | False | False | False | True | True |
b | False | False | False | False | False | False | True | True |
c | False | False | False | False | False | False | False | True |
h | True | True | False | True | True | True | False | True |
d | False | False | False | False | False | False | False | True |
e | False | False | False | False | False | False | False | True |
f | False | False | False | False | False | False | False | True |
g | True | True | True | True | True | True | True | True |
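A common pattern that builds on isna() is to count the missing values per column by chaining .sum(), since each True counts as 1:

df1.isna().sum()   # number of missing values in each column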
notna()
and notnull()
methods are the opposite of isna()
and isnull()
. They return True
for non-missing values and False
for missing values.
pd.notna(df1)
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | True | True | True | True | True | True | False | False |
b | True | True | True | True | True | True | False | False |
c | True | True | True | True | True | True | True | False |
h | False | False | True | False | False | False | True | False |
d | True | True | True | True | True | True | True | False |
e | True | True | True | True | True | True | True | False |
f | True | True | True | True | True | True | True | False |
g | False | False | False | False | False | False | False | False |
pd.notnull(df1)
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | True | True | True | True | True | True | False | False |
b | True | True | True | True | True | True | False | False |
c | True | True | True | True | True | True | True | False |
h | False | False | True | False | False | False | True | False |
d | True | True | True | True | True | True | True | False |
e | True | True | True | True | True | True | True | False |
f | True | True | True | True | True | True | True | False |
g | False | False | False | False | False | False | False | False |
Check if a column does not contain missing values:
"G"].notna() df1[
a False
b False
c True
h True
d True
e True
f True
g False
Name: G, dtype: bool
Subsetting rows to return rows that do not contain missing values in column G
:
"G"].notna()] df1[df1[
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3.0 | 1.0 | NaN |
h | NaN | NaN | 0.634336 | NaN | NaT | NaN | 1.0 | NaN |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4.0 | 1.0 | NaN |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5.0 | 1.0 | NaN |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6.0 | 1.0 | NaN |
The info()
method provides a summary of the DataFrame, including the count of non-null values in each column.
df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to g
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 6 non-null float64
1 B 6 non-null float64
2 C 7 non-null float64
3 D 6 non-null float64
4 E 6 non-null datetime64[ns]
5 F 6 non-null float64
6 G 5 non-null float64
7 H 0 non-null float64
dtypes: datetime64[ns](1), float64(7)
memory usage: 876.0+ bytes
Handling missing values
Once you’ve detected missing values, you can choose from several strategies to handle them.
Consider the following DataFrame,
df1
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 | 1.0 | NaN | NaN |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 | 2.0 | NaN | NaN |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3.0 | 1.0 | NaN |
h | NaN | NaN | 0.634336 | NaN | NaT | NaN | 1.0 | NaN |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4.0 | 1.0 | NaN |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5.0 | 1.0 | NaN |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6.0 | 1.0 | NaN |
g | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN |
Dropping missing values
Consider the following DataFrame for this example,
# dataframe excluding column H
df_without_colH = df1.loc[:, "A":"G"]
df_without_colH
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 | 1.0 | NaN |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 | 2.0 | NaN |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3.0 | 1.0 |
h | NaN | NaN | 0.634336 | NaN | NaT | NaN | 1.0 |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4.0 | 1.0 |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5.0 | 1.0 |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6.0 | 1.0 |
g | NaN | NaN | NaN | NaN | NaT | NaN | NaN |
Dropping missing values by rows
Drop the rows where at least one element is missing:
df_without_colH.dropna()
   A          B          C          D           E    F    G
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0

Drop rows that have all missing data:
="all") df_without_colH.dropna(how
   A          B          C          D           E    F    G
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0
h        NaN        NaN   0.634336        NaN         NaT  NaN  1.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0

Drop rows that have any missing data:
="any") df_without_colH.dropna(how
   A          B          C          D           E    F    G
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0

Keep only the rows with at least 2 non-NA values:
df_without_colH.dropna(thresh=2)
   A          B          C          D           E    F    G
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0
h        NaN        NaN   0.634336        NaN         NaT  NaN  1.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0
Dropping missing values by columns
Consider the following DataFrame for the remaining examples,
df1.loc['g', 'C'] = 0.877525
df1
   A          B          C          D           E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  NaN  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  NaN  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0  NaN
h        NaN        NaN   0.634336        NaN         NaT  NaN  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0  NaN
g        NaN        NaN   0.877525        NaN         NaT  NaN  NaN  NaN

Drop the columns where at least one element is missing:
df1.dropna(axis='columns')  # same as axis=1
          C
a -0.642234
b -0.931715
c  0.382713
h  0.634336
d -0.047471
e -1.189557
f  1.405489
g  0.877525

Drop columns that have all missing data:
="all", axis=1) df1.dropna(how
   A          B          C          D           E    F    G
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0
h        NaN        NaN   0.634336        NaN         NaT  NaN  1.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0
g        NaN        NaN   0.877525        NaN         NaT  NaN  NaN

Drop columns that have any missing data (shown as NaN):
="any", axis=1) df1.dropna(how
          C
a -0.642234
b -0.931715
c  0.382713
h  0.634336
d -0.047471
e -1.189557
f  1.405489
g  0.877525

Keep only the columns with at least 6 non-NA values:
df1.dropna(thresh=6, axis=1)
   A          B          C          D           E    F
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0
h        NaN        NaN   0.634336        NaN         NaT  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0
g        NaN        NaN   0.877525        NaN         NaT  NaN

Define in which columns to look for missing values:
=["B", "G"]) df1.dropna(subset
   A          B          C          D           E    F    G    H
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0  NaN
Filling missing values
You can fill missing values with a specific value or a calculated value using the fillna()
method.
Consider the following DataFrame,
df1
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 | 1.0 | NaN | NaN |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 | 2.0 | NaN | NaN |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 | 3.0 | 1.0 | NaN |
h | NaN | NaN | 0.634336 | NaN | NaT | NaN | 1.0 | NaN |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 | 4.0 | 1.0 | NaN |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 | 5.0 | 1.0 | NaN |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 | 6.0 | 1.0 | NaN |
g | NaN | NaN | 0.877525 | NaN | NaT | NaN | NaN | NaN |
Replace missing values with a scalar value
Fill missing values with 0:
df1.fillna(0)
   A          B          C          D                    E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01 00:00:00  1.0  0.0  0.0
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02 00:00:00  2.0  0.0  0.0
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03 00:00:00  3.0  1.0  0.0
h   0.000000   0.000000   0.634336   0.000000                    0  0.0  1.0  0.0
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04 00:00:00  4.0  1.0  0.0
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05 00:00:00  5.0  1.0  0.0
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06 00:00:00  6.0  1.0  0.0
g   0.000000   0.000000   0.877525   0.000000                    0  0.0  0.0  0.0

Fill missing values with a scalar, but limit the number of column fills to 2:
df1.fillna(12, axis='columns', limit=2)
   A          B          C          D                    E    F    G    H
a   0.42335    0.143592  -0.642234   0.345545  2013-01-01 00:00:00  1.0   12   12
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02 00:00:00  2.0   12   12
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03 00:00:00  3.0  1.0   12
h         12         12   0.634336        NaN                  NaT  NaN  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04 00:00:00  4.0  1.0   12
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05 00:00:00  5.0  1.0   12
f  -0.808026   1.027726   1.405489   0.72493   2013-01-06 00:00:00  6.0  1.0   12
g         12         12   0.877525        NaN                  NaT  NaN  NaN  NaN
Fill gaps forward or backward
Propagate missing values forward:

df1.ffill()  # ffill ≡ pad ≡ fillna(method='ffill')
   A          B          C          D           E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  NaN  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  NaN  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0  NaN
h   0.563209  -1.123406   0.634336   1.210963  2013-01-03  3.0  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0  NaN
g  -0.808026   1.027726   0.877525   0.724930  2013-01-06  6.0  1.0  NaN

Propagate missing values backward along rows:
df1.bfill()  # bfill ≡ backfill ≡ fillna(method='bfill')
   A          B          C          D           E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  1.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0  NaN
h  -1.435517   0.916215   0.634336   0.646783  2013-01-04  4.0  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0  NaN
g        NaN        NaN   0.877525        NaN         NaT  NaN  NaN  NaN
Fill with a Pandas object
Fill the missing values of a column with the mean of the column:
df1.fillna(df1.mean())
   A          B          C          D                    E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01 00:00:00  1.0  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02 00:00:00  2.0  1.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03 00:00:00  3.0  1.0  NaN
h   0.045313   0.521347   0.634336   0.559000  2013-01-03 12:00:00  3.5  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04 00:00:00  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05 00:00:00  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06 00:00:00  6.0  1.0  NaN
g   0.045313   0.521347   0.877525   0.559000  2013-01-03 12:00:00  3.5  1.0  NaN

Fill only a subset of the columns using the corresponding median:
"F":"G"]) df1.fillna(df1.median()[
   A          B          C          D           E    F    G    H
a   0.423350   0.143592  -0.642234   0.345545  2013-01-01  1.0  1.0  NaN
b   0.165784   1.102192  -0.931715  -1.135617  2013-01-02  2.0  1.0  NaN
c   0.563209  -1.123406   0.382713   1.210963  2013-01-03  3.0  1.0  NaN
h        NaN        NaN   0.634336        NaN         NaT  3.5  1.0  NaN
d  -1.435517   0.916215  -0.047471   0.646783  2013-01-04  4.0  1.0  NaN
e   1.363078   1.061762  -1.189557   1.561393  2013-01-05  5.0  1.0  NaN
f  -0.808026   1.027726   1.405489   0.724930  2013-01-06  6.0  1.0  NaN
g        NaN        NaN   0.877525        NaN         NaT  3.5  1.0  NaN
Interpolation
The interpolate method is used to fill in missing values with estimated values based on the surrounding data points. It’s particularly useful for time series data or datasets where values are expected to follow a pattern.
The interpolate method employs various interpolation techniques to estimate missing values, including linear interpolation, polynomial interpolation, and more. The specific interpolation method used depends on the kind of data and the desired behavior.
- Linear Interpolation: Linear interpolation is the default method used by interpolate. It estimates missing values by drawing a straight line between two adjacent known data points and filling in the missing value with a value along that line. This method is suitable for data that appears to change linearly over time or between data points.
- Polynomial Interpolation: Polynomial interpolation uses higher-degree polynomial functions to estimate missing values. This method can capture more complex relationships between data points but may be susceptible to overfitting if not used carefully.
- Time-Based Interpolation: When working with time series data, you can use the method parameter to specify time-based interpolation methods such as ‘time’, ‘index’, or ‘values’. These methods consider the time or index values to estimate missing values.
Here’s an example using linear interpolation:
df1.interpolate()
A | B | C | D | E | F | G | H | |
---|---|---|---|---|---|---|---|---|
a | 0.423350 | 0.143592 | -0.642234 | 0.345545 | 2013-01-01 00:00:00 | 1.0 | NaN | NaN |
b | 0.165784 | 1.102192 | -0.931715 | -1.135617 | 2013-01-02 00:00:00 | 2.0 | NaN | NaN |
c | 0.563209 | -1.123406 | 0.382713 | 1.210963 | 2013-01-03 00:00:00 | 3.0 | 1.0 | NaN |
h | -0.436154 | -0.103596 | 0.634336 | 0.928873 | 2013-01-03 12:00:00 | 3.5 | 1.0 | NaN |
d | -1.435517 | 0.916215 | -0.047471 | 0.646783 | 2013-01-04 00:00:00 | 4.0 | 1.0 | NaN |
e | 1.363078 | 1.061762 | -1.189557 | 1.561393 | 2013-01-05 00:00:00 | 5.0 | 1.0 | NaN |
f | -0.808026 | 1.027726 | 1.405489 | 0.724930 | 2013-01-06 00:00:00 | 6.0 | 1.0 | NaN |
g | -0.808026 | 1.027726 | 0.877525 | 0.724930 | 2013-01-06 00:00:00 | 6.0 | 1.0 | NaN |
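The non-default methods are selected with the method parameter. The snippet below is a minimal sketch on a small, made-up Series (not df1), since time-based interpolation expects a datetime index and polynomial interpolation needs an order (and SciPy installed):

s = pd.Series([1.0, np.nan, 9.0, np.nan, 25.0],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))
s.interpolate()                               # linear, the default
s.interpolate(method="time")                  # weights the estimates by the datetime index
s.interpolate(method="polynomial", order=2)   # quadratic fit; requires SciPy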
Grouping data
The groupby
function is a powerful feature within pandas that allows you to group and aggregate data in a DataFrame based on one or more columns. This can be especially useful when you want to perform summary statistics, calculations, or transformations on subsets of your data based on certain criteria. It involves one or more of the following steps:
- Splitting: The first step in a groupby operation is to split the DataFrame into groups based on the values in one or more columns. You specify the column(s) by which you want to group your data. This column is often referred to as the “key” or “grouping column”. Each unique value in the grouping column(s) forms a group, and the rows in the DataFrame are distributed among these groups.
- Applying a Function: After splitting the data into groups, you can apply various aggregation or transformation functions to each group. These functions are typically applied to one or more columns in each group. Common aggregation functions include sum, mean, count, min, max, and more. You can also apply custom functions or perform complex operations on the grouped data.
- Aggregation: compute a summary statistic (or statistics) for each group. Examples: compute group sums or means, compute group size.
- Transformation: perform some group-specific computations and return a like-indexed object. Examples: standardize data (zscore) within a group, filling NAs within groups with a value derived from each group.
- Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Examples: discard data that belong to groups with only a few members, filter out data based on the group sum or mean.
- Combining Results: Once the specified function(s) are applied to each group, the results are combined into a new DataFrame or data structure. This final result will often have a hierarchical structure, with the grouping columns as index levels.
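Before turning to a real dataset, here is a minimal sketch of this split-apply-combine workflow on a small, hypothetical DataFrame: the rows are split by shop, the mean is applied to each group, and the results are combined into a new Series indexed by the group keys.

sales = pd.DataFrame({"shop": ["A", "A", "B", "B"],
                      "amount": [10, 20, 30, 50]})
sales.groupby("shop")["amount"].mean()
# shop
# A    15.0
# B    40.0
# Name: amount, dtype: float64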
Consider the following metabric dataset for the examples listed below.
# Load the Metabric dataset from the URL into a DataFrame
= pd.read_csv("https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv")
metabric metabric
Patient_ID | Cohort | Age_at_diagnosis | Survival_time | Survival_status | Vital_status | Chemotherapy | Radiotherapy | Tumour_size | Tumour_stage | ... | Integrative_cluster | Mutation_count | ESR1 | ERBB2 | PGR | TP53 | PIK3CA | GATA3 | FOXA1 | MLPH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MB-0000 | 1 | 75.65 | 140.500000 | LIVING | Living | NO | YES | 22.0 | 2.0 | ... | 4ER+ | NaN | 8.929817 | 9.333972 | 5.680501 | 6.338739 | 5.704157 | 6.932146 | 7.953794 | 9.729728 |
1 | MB-0002 | 1 | 43.19 | 84.633333 | LIVING | Living | NO | YES | 10.0 | 1.0 | ... | 4ER+ | 2.0 | 10.047059 | 9.729606 | 7.505424 | 6.192507 | 5.757727 | 11.251197 | 11.843989 | 12.536570 |
2 | MB-0005 | 1 | 48.87 | 163.700000 | DECEASED | Died of Disease | YES | NO | 15.0 | 2.0 | ... | 3 | 2.0 | 10.041281 | 9.725825 | 7.376123 | 6.404516 | 6.751566 | 9.289758 | 11.698169 | 10.306115 |
3 | MB-0006 | 1 | 47.68 | 164.933333 | LIVING | Living | YES | YES | 25.0 | 2.0 | ... | 9 | 1.0 | 10.404685 | 10.334979 | 6.815637 | 6.869241 | 7.219187 | 8.667723 | 11.863379 | 10.472181 |
4 | MB-0008 | 1 | 76.97 | 41.366667 | DECEASED | Died of Disease | YES | YES | 40.0 | 2.0 | ... | 9 | 2.0 | 11.276581 | 9.956267 | 7.331223 | 6.337951 | 5.817818 | 9.719781 | 11.625006 | 12.161961 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1899 | MB-7295 | 4 | 43.10 | 196.866667 | LIVING | Living | NO | YES | 25.0 | NaN | ... | 3 | 4.0 | 9.591235 | 9.935179 | 7.984515 | 6.753291 | 6.279207 | 9.207323 | 11.281194 | 11.337601 |
1900 | MB-7296 | 4 | 42.88 | 44.733333 | DECEASED | Died of Disease | NO | YES | 20.0 | NaN | ... | 5 | 6.0 | 9.733986 | 13.753037 | 5.616082 | 6.271912 | 5.999093 | 9.530390 | 11.532033 | 11.626140 |
1901 | MB-7297 | 4 | 62.90 | 175.966667 | DECEASED | Died of Disease | NO | YES | 25.0 | NaN | ... | 1 | 4.0 | 11.053198 | 10.228570 | 7.478069 | 6.212256 | 6.192399 | 9.540589 | 11.482761 | 11.180360 |
1902 | MB-7298 | 4 | 61.16 | 86.233333 | DECEASED | Died of Other Causes | NO | NO | 25.0 | NaN | ... | 1 | 15.0 | 11.055114 | 9.892589 | 8.282737 | 6.466712 | 6.287254 | 10.365901 | 11.371176 | 12.827069 |
1903 | MB-7299 | 4 | 60.02 | 201.900000 | DECEASED | Died of Other Causes | NO | YES | 20.0 | NaN | ... | 10 | 3.0 | 10.696475 | 10.227787 | 5.533486 | 6.180511 | 6.208784 | 9.749368 | 10.867527 | 9.847856 |
1904 rows × 32 columns
Aggregation
This is primarily used for aggregating data (within groups when you use groupby
). It allows you to apply one or more aggregation functions to each group and obtain a summarized result for each group.
Calculate the total number of patients, categorized by cancer type.
"Cancer_type")["Patient_ID"].count() metabric.groupby(
                                           Patient_ID
Cancer_type
Breast                                             17
Breast Invasive Ductal Carcinoma                 1500
Breast Invasive Lobular Carcinoma                 142
Breast Invasive Mixed Mucinous Carcinoma           22
Breast Mixed Ductal and Lobular Carcinoma         207
Metaplastic Breast Cancer                           1

count() computes the number of non-NA values in the groups whereas size() computes the number of values in each group. To demonstrate the difference between these two methods, group the metabric dataset by 3-gene_classifier and find the number of values in each group based on the Tumour_size column.

metabric.groupby("3-gene_classifier")["Tumour_size"].count()
                       Tumour_size
3-gene_classifier
ER+/HER2- High Prolif          596
ER+/HER2- Low Prolif           615
ER-/HER2-                      287
HER2+                          186

metabric.groupby("3-gene_classifier")["Tumour_size"].size()
                       Tumour_size
3-gene_classifier
ER+/HER2- High Prolif          603
ER+/HER2- Low Prolif           619
ER-/HER2-                      290
HER2+                          188

Determine the median age at diagnosis for each type of vital status.
"Vital_status")["Age_at_diagnosis"].median() metabric.groupby(
                      Age_at_diagnosis
Vital_status
Died of Disease                 61.135
Died of Other Causes            70.505
Living                          56.740

Calculate the total, mean, and standard deviation of Tumour_size for each combination of cancer type and ER status.
grouped = metabric.groupby(["Cancer_type", "ER_status"])
grouped["Tumour_size"].agg(['sum','mean','std'])
                                                              sum       mean        std
Cancer_type                               ER_status
Breast                                    Negative         173.00  43.250000  17.951323
                                          Positive         289.00  22.230769  13.417363
Breast Invasive Ductal Carcinoma          Negative       11241.77  28.388308  19.209870
                                          Positive       27468.87  25.154643  12.961027
Breast Invasive Lobular Carcinoma         Negative         588.80  29.440000  19.155277
                                          Positive        3579.00  30.075630  19.792221
Breast Invasive Mixed Mucinous Carcinoma  Negative          40.00  40.000000        NaN
                                          Positive         518.00  25.900000  15.403007
Breast Mixed Ductal and Lobular Carcinoma Negative         326.00  25.076923  13.174860
                                          Positive        4832.00  25.298429  12.900426
Metaplastic Breast Cancer                 Negative          16.00  16.000000        NaN
Transformation
This is used for shape-preserving transformations. It applies a given function and returns a result with the same index as the original DataFrame or Series. It is commonly used when you want to broadcast a computed value back to the original DataFrame, maintaining the original shape of the data.
Compute the difference between adjacent values of expression of ESR1 within each group of cancer types.
"Cancer_type")["ESR1"].diff() metabric.groupby(
          ESR1
0          NaN
1     1.117242
2    -0.005778
3          NaN
4     0.871896
...        ...
1899 -1.833044
1900 -1.556991
1901  1.319212
1902  0.001916
1903 -0.358639
1904 rows × 1 columns
Compute the cumulative sum of mutation count within each integrative cluster group.
"Chemotherapy")["Mutation_count"].transform("cumsum") metabric.groupby(
      Mutation_count
0                NaN
1                2.0
2                2.0
3                3.0
4                5.0
...              ...
1899          8583.0
1900          8589.0
1901          8593.0
1902          8608.0
1903          8611.0
1904 rows × 1 columns
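Another group-specific computation mentioned earlier is standardising values within each group (a z-score). A sketch, using the ESR1 expression values grouped by cancer type:

metabric.groupby("Cancer_type")["ESR1"].transform(lambda x: (x - x.mean()) / x.std())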
- Convert the survival time which is given in months to years.
"Survival_time"].transform(lambda x: x / 12) metabric[
0       11.708333
1        7.052778
2       13.641667
3       13.744444
4        3.447222
          ...
1899    16.405556
1900     3.727778
1901    14.663889
1902     7.186111
1903    16.825000
Name: Survival_time, Length: 1904, dtype: float64
A lambda function in Python is a small, anonymous, and inline function. It is also known as a lambda expression or lambda notation. Lambda functions are a way to create small, one-time-use functions without needing to define them using the def keyword. Lambda functions are typically used for short, simple operations where defining a full function using def would be overly verbose.
lambda arguments: expression
Here’s a simple example to illustrate the use of lambda functions:
# Regular function to calculate the square of a number
def square(x):
    return x ** 2

# Equivalent lambda function to calculate the square of a number
square_lambda = lambda x: x ** 2
Filtration
This is the process of selecting rows from a DataFrame based on grouping. It may either filter out entire groups, part of groups, or both.
Select the row at position 3 (the 4th row, since positions are zero-based) of each group, categorized based on cancer type.
"Cancer_type").nth(3) metabric.groupby(
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH 5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164 15 MB-0050 1 44.64 75.333333 LIVING Living YES YES 33.0 2.0 ... 8 NaN 9.492446 10.399286 6.665170 6.843690 5.617014 9.396238 11.188134 12.243680 32 MB-0102 1 51.38 140.766667 DECEASED Died of Disease YES YES 40.0 2.0 ... 3 4.0 9.913163 9.980732 7.651020 6.431794 5.970082 9.402294 11.052930 11.474559 230 MB-0365 1 76.84 87.233333 DECEASED Died of Disease NO YES 15.0 1.0 ... 4ER+ 6.0 8.365138 12.345780 5.370492 6.046610 6.092679 8.708236 10.604942 10.728391 429 MB-0620 1 62.97 112.800000 LIVING Living YES YES 14.0 2.0 ... 4ER+ NaN 8.449960 9.666006 5.468108 6.080813 6.233319 8.377827 8.580251 9.925870 5 rows × 32 columns
Print the top rows of each group categorized based on cancer type and three gene classifier.
"Cancer_type", "3-gene_classifier"]).head() metabric.groupby([
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH 0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... 4ER+ NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728 1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 4ER+ 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570 4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 9 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961 5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164 8 MB-0028 1 86.41 36.566667 DECEASED Died of Other Causes NO YES 16.0 2.0 ... 9 4.0 12.521038 10.678267 5.325554 6.220372 6.250678 10.260059 12.148375 10.936002 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1524 MB-6047 5 72.94 37.866667 DECEASED Died of Disease NO NO 15.0 NaN ... 9 14.0 9.457621 10.750387 6.588294 6.442729 6.114882 9.633760 10.743632 10.220908 1607 MB-6212 5 61.81 224.233333 LIVING Living NO YES 15.0 1.0 ... 7 4.0 11.720822 10.060722 6.324464 6.606875 5.476758 10.147003 11.494327 12.624759 1651 MB-6314 5 68.57 134.833333 LIVING Living NO NO 14.0 NaN ... 1 5.0 10.128997 13.637355 5.249247 6.975344 5.938845 10.767529 12.502447 11.903197 1696 MB-7030 4 74.21 31.333333 DECEASED Died of Other Causes NO YES 50.0 NaN ... 4ER+ 5.0 6.192748 10.030603 4.969329 7.290813 5.523130 6.343169 6.251026 6.423873 1856 MB-7235 4 62.45 170.800000 LIVING Living NO NO 20.0 NaN ... 8 12.0 11.854216 11.241157 5.898537 6.665210 5.863743 10.676374 11.860498 11.966185 84 rows × 32 columns
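The two examples above keep part of every group. To discard entire groups based on a group-wise condition, as described at the start of this section, you could use filter(); for instance, a sketch that keeps only the cancer types with more than 100 patients:

metabric.groupby("Cancer_type").filter(lambda g: len(g) > 100)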
apply
function
Some operations on the grouped data might not fit into the aggregation, transformation, or filtration categories. For these, you can use the apply
function.
- Select all patients with age between 50 and 70 for each group categorized based on ER status.
'ER_status')["Age_at_diagnosis"].apply(lambda x: x[(x >= 50) & (x <= 70)]) metabric.groupby(
ER_status
Negative 20 52.14
24 50.42
30 68.68
52 52.11
54 63.53
...
Positive 1897 56.90
1898 59.20
1901 62.90
1902 61.16
1903 60.02
Name: Age_at_diagnosis, Length: 983, dtype: float64
# Filter the DataFrame directly based on 'Age_at_diagnosis' and then group by 'ER_status'
filtered_df = metabric[(metabric['Age_at_diagnosis'] >= 50) & (metabric['Age_at_diagnosis'] <= 70)].groupby('ER_status')
# Display each group in the filtered DataFrame
for cancer_type, group in filtered_df:
    print(f"Cancer Type: {cancer_type}")
    print(group)
    print()  # For better readability
Cancer Type: Negative
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status \
20 MB-0062 1 52.14 153.966667 LIVING
24 MB-0079 1 50.42 28.500000 DECEASED
30 MB-0100 1 68.68 8.066667 DECEASED
52 MB-0127 1 52.11 132.066667 LIVING
54 MB-0129 1 63.53 38.566667 DECEASED
... ... ... ... ... ...
1865 MB-7251 4 54.66 21.566667 DECEASED
1869 MB-7256 4 50.92 34.700000 DECEASED
1879 MB-7270 4 50.84 175.166667 LIVING
1887 MB-7281 4 51.22 49.533333 DECEASED
1895 MB-7291 4 53.87 6.833333 DECEASED
Vital_status Chemotherapy Radiotherapy Tumour_size \
20 Living YES YES 17.00
24 Died of Disease YES YES 40.00
30 Died of Disease YES YES 39.00
52 Living YES YES 13.00
54 Died of Other Causes YES YES 24.00
... ... ... ... ...
1865 Died of Disease NO NO 15.47
1869 Died of Disease YES YES 40.00
1879 Living YES NO 16.00
1887 Died of Disease YES YES 16.00
1895 Died of Disease NO NO 50.00
Tumour_stage ... Integrative_cluster Mutation_count ESR1 \
20 1.0 ... 10 3.0 5.965770
24 2.0 ... 10 4.0 6.312633
30 2.0 ... 10 1.0 6.204958
52 2.0 ... 3 5.0 7.383420
54 2.0 ... 5 NaN 6.217353
... ... ... ... ... ...
1865 NaN ... 5 1.0 6.552171
1869 NaN ... 5 4.0 6.425775
1879 NaN ... 10 8.0 5.996827
1887 NaN ... 5 3.0 6.074817
1895 NaN ... 1 4.0 6.096158
ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
20 7.789817 5.164429 5.444836 7.332574 5.987214 6.461705 6.266163
24 9.068778 5.305683 6.673993 7.068624 5.667747 10.690917 9.056929
30 8.881671 5.172111 7.072938 6.189728 5.687771 6.311955 7.963707
52 11.011174 5.368214 5.877772 5.997610 8.749941 10.912581 11.732541
54 13.495315 5.582046 5.678390 6.556857 7.141058 10.474060 10.331051
... ... ... ... ... ... ... ...
1865 14.224677 5.078956 6.921696 5.385871 9.574542 11.608117 11.280612
1869 14.464282 5.356381 6.107157 6.480691 8.570477 11.123834 12.386091
1879 9.515256 5.261880 6.981713 5.724471 7.574704 10.685767 9.210209
1887 12.545156 5.218277 6.423540 5.856297 7.084129 11.029630 10.737725
1895 12.600238 5.260572 6.351516 5.829438 8.116684 10.298983 10.747426
[221 rows x 32 columns]
Cancer Type: Positive
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status \
6 MB-0014 1 56.45 164.333333 LIVING
14 MB-0048 1 51.46 103.833333 LIVING
17 MB-0054 1 66.91 160.300000 LIVING
18 MB-0056 1 62.62 62.866667 LIVING
21 MB-0064 1 69.13 108.933333 LIVING
... ... ... ... ... ...
1897 MB-7293 4 56.90 199.233333 LIVING
1898 MB-7294 4 59.20 82.733333 DECEASED
1901 MB-7297 4 62.90 175.966667 DECEASED
1902 MB-7298 4 61.16 86.233333 DECEASED
1903 MB-7299 4 60.02 201.900000 DECEASED
Vital_status Chemotherapy Radiotherapy Tumour_size \
6 Living YES YES 10.0
14 Living YES YES 25.0
17 Living NO YES 36.0
18 Living NO NO 29.0
21 Living NO NO 18.0
... ... ... ... ...
1897 Living NO NO 45.0
1898 Died of Disease NO NO 15.0
1901 Died of Disease NO YES 25.0
1902 Died of Other Causes NO NO 25.0
1903 Died of Other Causes NO YES 20.0
Tumour_stage ... Integrative_cluster Mutation_count ESR1 \
6 2.0 ... 3 4.0 10.793832
14 2.0 ... 4ER+ NaN 8.101955
17 2.0 ... 10 3.0 10.194473
18 1.0 ... 1 4.0 11.560030
21 1.0 ... 3 NaN 11.288632
... ... ... ... ... ...
1897 NaN ... 3 5.0 10.879891
1898 NaN ... 1 2.0 11.290976
1901 NaN ... 1 4.0 11.053198
1902 NaN ... 1 15.0 11.055114
1903 NaN ... 10 3.0 10.696475
ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
6 9.276507 7.720952 5.992706 7.481835 8.365527 11.482627 10.755199
14 12.847992 6.114007 6.397985 7.779824 7.294771 10.963978 9.063062
17 9.451855 6.769441 5.350892 7.564274 7.465016 10.822652 8.665371
18 8.674760 6.336389 5.845364 7.234934 8.838840 10.905011 9.829488
21 8.367936 8.161056 6.433638 6.412103 8.528343 11.847914 10.443699
... ... ... ... ... ... ... ...
1897 10.219154 6.431113 5.791022 6.000993 9.551231 11.302246 11.301274
1898 10.846545 7.312247 5.660943 6.190000 9.424235 11.075688 11.567166
1901 10.228570 7.478069 6.212256 6.192399 9.540589 11.482761 11.180360
1902 9.892589 8.282737 6.466712 6.287254 10.365901 11.371176 12.827069
1903 10.227787 5.533486 6.180511 6.208784 9.749368 10.867527 9.847856
[762 rows x 32 columns]
assign
function
This function is used to create new columns or modify existing columns in a DataFrame in a concise and flexible way. It returns a new DataFrame with the added or updated columns while keeping the original DataFrame unchanged, unless explicitly reassigned.
Create a new column named Tumour_size_cm
in the metabric DataFrame to represent tumor size in centimeters.
metabric.assign(Tumour_size_cm = metabric.Tumour_size / 10)
Patient_ID | Cohort | Age_at_diagnosis | Survival_time | Survival_status | Vital_status | Chemotherapy | Radiotherapy | Tumour_size | Tumour_stage | ... | Mutation_count | ESR1 | ERBB2 | PGR | TP53 | PIK3CA | GATA3 | FOXA1 | MLPH | Tumour_size_cm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MB-0000 | 1 | 75.65 | 140.500000 | LIVING | Living | NO | YES | 22.0 | 2.0 | ... | NaN | 8.929817 | 9.333972 | 5.680501 | 6.338739 | 5.704157 | 6.932146 | 7.953794 | 9.729728 | 2.2 |
1 | MB-0002 | 1 | 43.19 | 84.633333 | LIVING | Living | NO | YES | 10.0 | 1.0 | ... | 2.0 | 10.047059 | 9.729606 | 7.505424 | 6.192507 | 5.757727 | 11.251197 | 11.843989 | 12.536570 | 1.0 |
2 | MB-0005 | 1 | 48.87 | 163.700000 | DECEASED | Died of Disease | YES | NO | 15.0 | 2.0 | ... | 2.0 | 10.041281 | 9.725825 | 7.376123 | 6.404516 | 6.751566 | 9.289758 | 11.698169 | 10.306115 | 1.5 |
3 | MB-0006 | 1 | 47.68 | 164.933333 | LIVING | Living | YES | YES | 25.0 | 2.0 | ... | 1.0 | 10.404685 | 10.334979 | 6.815637 | 6.869241 | 7.219187 | 8.667723 | 11.863379 | 10.472181 | 2.5 |
4 | MB-0008 | 1 | 76.97 | 41.366667 | DECEASED | Died of Disease | YES | YES | 40.0 | 2.0 | ... | 2.0 | 11.276581 | 9.956267 | 7.331223 | 6.337951 | 5.817818 | 9.719781 | 11.625006 | 12.161961 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1899 | MB-7295 | 4 | 43.10 | 196.866667 | LIVING | Living | NO | YES | 25.0 | NaN | ... | 4.0 | 9.591235 | 9.935179 | 7.984515 | 6.753291 | 6.279207 | 9.207323 | 11.281194 | 11.337601 | 2.5 |
1900 | MB-7296 | 4 | 42.88 | 44.733333 | DECEASED | Died of Disease | NO | YES | 20.0 | NaN | ... | 6.0 | 9.733986 | 13.753037 | 5.616082 | 6.271912 | 5.999093 | 9.530390 | 11.532033 | 11.626140 | 2.0 |
1901 | MB-7297 | 4 | 62.90 | 175.966667 | DECEASED | Died of Disease | NO | YES | 25.0 | NaN | ... | 4.0 | 11.053198 | 10.228570 | 7.478069 | 6.212256 | 6.192399 | 9.540589 | 11.482761 | 11.180360 | 2.5 |
1902 | MB-7298 | 4 | 61.16 | 86.233333 | DECEASED | Died of Other Causes | NO | NO | 25.0 | NaN | ... | 15.0 | 11.055114 | 9.892589 | 8.282737 | 6.466712 | 6.287254 | 10.365901 | 11.371176 | 12.827069 | 2.5 |
1903 | MB-7299 | 4 | 60.02 | 201.900000 | DECEASED | Died of Other Causes | NO | YES | 20.0 | NaN | ... | 3.0 | 10.696475 | 10.227787 | 5.533486 | 6.180511 | 6.208784 | 9.749368 | 10.867527 | 9.847856 | 2.0 |
1904 rows × 33 columns
Tidy Data
Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data are designed to promote consistency and ease of use in data analysis.
Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.
For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.
Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?
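To make the problem concrete, here is a tiny, made-up table in the same spirit as AirPassengers (the values are purely illustrative, and melt() is introduced just below): once the data is in long format, the monthly mean is a simple groupby, with no question of where to put it.

wide = pd.DataFrame({"Year": [1949, 1950], "Jan": [112, 115], "Feb": [118, 126]})
long = wide.melt(id_vars="Year", var_name="Month", value_name="Passengers")
long.groupby("Month")["Passengers"].mean()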
Data needs to be reshaped to conform to the tidy data structure. It involves using two primary verbs (or pairs of opposites):
- Gather columns into rows (melt()).
- Spread rows into columns (pivot()).
Converting data from wide to long format
First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).
= pd.read_csv("data/GSE60450_normalized_data.csv")
counts counts.head()
Output
X | gene_symbol | GSM1480291 | GSM1480292 | GSM1480293 | GSM1480294 | GSM1480295 | GSM1480296 | GSM1480297 | GSM1480298 | GSM1480299 | GSM1480300 | GSM1480301 | GSM1480302 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSMUSG00000031805 | Jak3 | 82.22869 | 81.18604 | 36.13808 | 36.62751 | 12.35039 | 11.85235 | 10.59006 | 14.88337 | 7.57182 | 7.05763 | 3.16982 | 4.25451 |
1 | ENSMUSG00000033909 | Usp36 | 88.22360 | 94.59098 | 76.11531 | 63.73863 | 27.10872 | 24.87050 | 95.67017 | 100.73912 | 78.07470 | 59.35009 | 36.36491 | 40.37057 |
2 | ENSMUSG00000065524 | Mir135a-2 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
3 | ENSMUSG00000030374 | Strn4 | 94.26324 | 85.05204 | 59.54855 | 54.30684 | 24.42892 | 24.90936 | 125.00263 | 102.99289 | 110.64124 | 97.98130 | 77.96883 | 105.70093 |
4 | ENSMUSG00000098547 | Mir6921 | 0.08948 | 0.08404 | 0.00000 | 0.08459 | 0.03884 | 0.07772 | 0.00000 | 0.08505 | 0.07726 | 0.08255 | 0.04403 | 0.04727 |
To transform this table from a wide format to a long format, we use the melt()
function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from melt()
is often necessary for visualization and for left join, which will be introduced later.
This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.
The melt()
function takes four arguments:
- id_vars = : Column(s) to use as identifier variables.
- value_vars = : Columns to be unpivoted, i.e. converted into label/value pairs in long form. If not specified, all columns that are not set as id_vars = are used.
- var_name = : a name for the new column containing the labels from the specified columns.
- value_name = : a name for the new column containing the values corresponding to the specified columns.
counts.melt(id_vars=['X', 'gene_symbol'], var_name='Sample', value_name='Count')
Output
X | gene_symbol | Sample | Count | |
---|---|---|---|---|
0 | ENSMUSG00000031805 | Jak3 | GSM1480291 | 82.22869 |
1 | ENSMUSG00000033909 | Usp36 | GSM1480291 | 88.22360 |
2 | ENSMUSG00000065524 | Mir135a-2 | GSM1480291 | 0.00000 |
3 | ENSMUSG00000030374 | Strn4 | GSM1480291 | 94.26324 |
4 | ENSMUSG00000098547 | Mir6921 | GSM1480291 | 0.08948 |
... | ... | ... | ... | ... |
235 | ENSMUSG00000066189 | Cacng3 | GSM1480302 | 0.00000 |
236 | ENSMUSG00000005611 | Mrvi1 | GSM1480302 | 185.16572 |
237 | ENSMUSG00000064299 | 4921528I07Rik | GSM1480302 | 0.00000 |
238 | ENSMUSG00000028174 | Rpe65 | GSM1480302 | 0.00000 |
239 | ENSMUSG00000024902 | Mrpl11 | GSM1480302 | 10.06901 |
240 rows × 4 columns
counts.melt(id_vars='X', value_vars=counts.columns[counts.columns.str.startswith('GSM')], var_name='Sample', value_name='Count')
Output
X | Sample | Count | |
---|---|---|---|
0 | ENSMUSG00000031805 | GSM1480291 | 82.22869 |
1 | ENSMUSG00000033909 | GSM1480291 | 88.22360 |
2 | ENSMUSG00000065524 | GSM1480291 | 0.00000 |
3 | ENSMUSG00000030374 | GSM1480291 | 94.26324 |
4 | ENSMUSG00000098547 | GSM1480291 | 0.08948 |
... | ... | ... | ... |
235 | ENSMUSG00000066189 | GSM1480302 | 0.00000 |
236 | ENSMUSG00000005611 | GSM1480302 | 185.16572 |
237 | ENSMUSG00000064299 | GSM1480302 | 0.00000 |
238 | ENSMUSG00000028174 | GSM1480302 | 0.00000 |
239 | ENSMUSG00000024902 | GSM1480302 | 10.06901 |
240 rows × 3 columns
The counts.columns[counts.columns.str.startswith('GSM')]
command returns the columns whose names start with 'GSM'. melt() will then transform those columns into two new columns, denoted as "Sample" and "Count". The parameter var_name = 'Sample' indicates that the new column containing the labels of the specified columns (defined by value_vars =) should be named "Sample", while value_name = 'Count' specifies that the new column containing the values should be named "Count".
Converting data from long to wide format
First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).
= pd.read_csv("data/GSE60450_annotation.csv")
annot annot.head()
Output
ENSEMBL | Type | Annotation | |
---|---|---|---|
0 | ENSMUSG00000031805 | SYMBOL | Jak3 |
1 | ENSMUSG00000031805 | GENENAME | Janus kinase 3 |
2 | ENSMUSG00000033909 | SYMBOL | Usp36 |
3 | ENSMUSG00000033909 | GENENAME | ubiquitin specific peptidase 36 |
4 | ENSMUSG00000065524 | SYMBOL | Mir135a-2 |
To transform this table so that it conforms to the tidy principles, we use the pivot()
function.
This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.
The pivot()
function takes three arguments:
- columns = : column containing the labels that will be transformed into the new column names.
- index = : column to use as the new frame's index. If not given, the existing index is used.
- values = : column(s) containing the values that will fill the new columns.
In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:
annot.pivot(index='ENSEMBL', columns='Type', values='Annotation')
Output
Type | GENENAME | SYMBOL |
---|---|---|
ENSEMBL | ||
ENSMUSG00000005611 | inositol 1,4,5-triphosphate receptor associated 1 | Irag1 |
ENSMUSG00000024902 | mitochondrial ribosomal protein L11 | Mrpl11 |
ENSMUSG00000026283 | inhibitor of growth family, member 5 | Ing5 |
ENSMUSG00000028174 | retinal pigment epithelium 65 | Rpe65 |
ENSMUSG00000030374 | striatin, calmodulin binding protein 4 | Strn4 |
ENSMUSG00000031805 | Janus kinase 3 | Jak3 |
ENSMUSG00000033475 | translocase of outer mitochondrial membrane 6 | Tomm6 |
ENSMUSG00000033909 | ubiquitin specific peptidase 36 | Usp36 |
ENSMUSG00000035629 | RUN domain and cysteine-rich domain containing... | Rubcn |
ENSMUSG00000037331 | La ribonucleoprotein 1, translational regulator | Larp1 |
ENSMUSG00000038246 | family with sequence similarity 50, member B | Fam50b |
ENSMUSG00000039715 | dynein 2 intermediate chain 2 | Dync2i2 |
ENSMUSG00000053080 | zinc finger translocation associated | Zfta |
ENSMUSG00000055491 | peroxisome proliferative activated receptor, g... | Pprc1 |
ENSMUSG00000064299 | RIKEN cDNA 4921528I07 gene | 4921528I07Rik |
ENSMUSG00000065524 | microRNA 135a-2 | Mir135a-2 |
ENSMUSG00000066189 | calcium channel, voltage-dependent, gamma subu... | Cacng3 |
ENSMUSG00000074489 | bone gamma-carboxyglutamate protein 3 | Bglap3 |
ENSMUSG00000094053 | secretoglobin, family 2B, member 7 | Scgb2b7 |
ENSMUSG00000098547 | microRNA 6921 | Mir6921 |
The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot()
results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.
It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the columns=
column, the output will indeed become wider compared to the input.
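One way to check the change in shape described above is to compare the dimensions before and after pivoting (the expected shapes follow from the row counts mentioned in the text):

annot.shape                                                               # (40, 3)
annot.pivot(index='ENSEMBL', columns='Type', values='Annotation').shape   # (20, 2)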
Separate Columns
First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).
= pd.read_csv("data/GSE60450_metadata.csv")
metadata metadata.head()
Output
gene_id | characteristics | |
---|---|---|
0 | GSM1480291 | mammary gland;luminal cells;virgin |
1 | GSM1480292 | mammary gland;luminal cells;virgin |
2 | GSM1480293 | mammary gland;luminal cells;18.5 day pregnancy |
3 | GSM1480294 | mammary gland;luminal cells;18.5 day pregnancy |
4 | GSM1480295 | mammary gland;luminal cells;2 day lactation |
To transform this table so that it conforms to the tidy principles, we use the str.split()
function. This operation will separate the characteristics column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.
To separate the characteristics column in the metadata data frame into three separate columns based on the delimiter ;
(semicolon), we can use the str.split()
function:
# split the characteristics column into 3 strings based on the delimiter ; and update the metadata dataframe
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)
# remove the characteristics column
metadata = metadata.drop(columns='characteristics')
metadata
Output
gene_id | tissue_type | immunophenotype | development_stage | |
---|---|---|---|---|
0 | GSM1480291 | mammary gland | luminal cells | virgin |
1 | GSM1480292 | mammary gland | luminal cells | virgin |
2 | GSM1480293 | mammary gland | luminal cells | 18.5 day pregnancy |
3 | GSM1480294 | mammary gland | luminal cells | 18.5 day pregnancy |
4 | GSM1480295 | mammary gland | luminal cells | 2 day lactation |
5 | GSM1480296 | mammary gland | luminal cells | 2 day lactation |
6 | GSM1480297 | mammary gland | basal cells | virgin |
7 | GSM1480298 | mammary gland | basal cells | virgin |
8 | GSM1480299 | mammary gland | basal cells | 18.5 day pregnancy |
9 | GSM1480300 | mammary gland | basal cells | 18.5 day pregnancy |
10 | GSM1480301 | mammary gland | basal cells | 2 day lactation |
11 | GSM1480302 | mammary gland | basal cells | 2 day lactation |
Uniting Columns
This is the complement of separate. Let's reverse what we did in the previous section and combine multiple columns into a single column, as illustrated in the image below.
To combine the tissue_type, immunophenotype, and development_stage columns into a characteristics column in the metadata DataFrame:
metadata['characteristics'] = metadata.tissue_type.str.cat(metadata[['immunophenotype', 'development_stage']], sep = ';')
# remove tissue type, immunophenotype, and development stage columns
metadata = metadata.drop(columns=['tissue_type', 'immunophenotype', 'development_stage'])
metadata
Output
gene_id | tissue_type | immunophenotype | development_stage | characteristics | |
---|---|---|---|---|---|
0 | GSM1480291 | mammary gland | luminal cells | virgin | mammary gland;luminal cells;virgin |
1 | GSM1480292 | mammary gland | luminal cells | virgin | mammary gland;luminal cells;virgin |
2 | GSM1480293 | mammary gland | luminal cells | 18.5 day pregnancy | mammary gland;luminal cells;18.5 day pregnancy |
3 | GSM1480294 | mammary gland | luminal cells | 18.5 day pregnancy | mammary gland;luminal cells;18.5 day pregnancy |
4 | GSM1480295 | mammary gland | luminal cells | 2 day lactation | mammary gland;luminal cells;2 day lactation |
5 | GSM1480296 | mammary gland | luminal cells | 2 day lactation | mammary gland;luminal cells;2 day lactation |
6 | GSM1480297 | mammary gland | basal cells | virgin | mammary gland;basal cells;virgin |
7 | GSM1480298 | mammary gland | basal cells | virgin | mammary gland;basal cells;virgin |
8 | GSM1480299 | mammary gland | basal cells | 18.5 day pregnancy | mammary gland;basal cells;18.5 day pregnancy |
9 | GSM1480300 | mammary gland | basal cells | 18.5 day pregnancy | mammary gland;basal cells;18.5 day pregnancy |
10 | GSM1480301 | mammary gland | basal cells | 2 day lactation | mammary gland;basal cells;2 day lactation |
11 | GSM1480302 | mammary gland | basal cells | 2 day lactation | mammary gland;basal cells;2 day lactation |
A quick and easy method to unite columns using the +
operator:
metadata.characteristics = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage
# remove the tissue_type, immunophenotype, and development_stage columns
metadata = metadata.drop(columns=['tissue_type', 'immunophenotype', 'development_stage'])
metadata
Since we already updated the metadata DataFrame to combine the columns in the previous example, this code will not work: the tissue_type, immunophenotype, and development_stage columns are no longer available in the DataFrame because we combined them into the characteristics column and then dropped them. If you want to try the + approach anyway, you can first re-create the three columns, as shown in the sketch below.
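One option is to split the characteristics column again before applying the + based concatenation. A minimal sketch, assuming the metadata DataFrame from above:
# Re-create the three columns from the characteristics column
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)
# Now the + based concatenation works as before
metadata.characteristics = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage
metadata = metadata.drop(columns=['tissue_type', 'immunophenotype', 'development_stage'])
metadata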
Joining Data
Pandas provides two primary methods for combining DataFrames: concat and merge. These methods allow you to combine DataFrames along rows and columns, and they support various types of joins, including inner, left, right, and outer joins. Here’s an explanation of both methods and the types of joins they support:
concat
The concat
method is used for concatenating (stacking) DataFrames along a particular axis, which can be either rows or columns. It’s particularly useful when you want to combine DataFrames with the same structure along a common axis. It does not require a common key to merge DataFrames, as it is primarily for stacking them.
Consider the following three DataFrames,
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

df3 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})
- Concatenating along rows (vertically)
df1
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
df2
A | B | |
---|---|---|
0 | A3 | B3 |
1 | A4 | B4 |
2 | A5 | B5 |
pd.concat([df1, df2], axis=0)
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
0 | A3 | B3 |
1 | A4 | B4 |
2 | A5 | B5 |
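Notice that the row labels 0, 1, and 2 are repeated in the result above, because concat keeps each DataFrame’s original index. If you would rather have a fresh 0 to 5 index, you can pass ignore_index=True; a small sketch using the same df1 and df2:
pd.concat([df1, df2], axis=0, ignore_index=True)  # rows are relabelled 0..5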
- Concatenating along columns (horizontally)
df1
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
df3
C | D | |
---|---|---|
0 | C0 | D0 |
1 | C1 | D1 |
2 | C2 | D2 |
pd.concat([df1, df3], axis=1)
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
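It is also worth knowing what happens when you stack DataFrames with different columns along rows: by default the missing cells are filled with NaN, and passing join="inner" keeps only the columns shared by all inputs. A small sketch using df1 and df3 from above:
# df1 has columns A and B; df3 has columns C and D
pd.concat([df1, df3], axis=0)                # result has A, B, C, D with NaN where a column is absent
pd.concat([df1, df3], axis=0, join="inner")  # keeps only shared columns (none here, so the result has no columns)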
merge
The merge
method is used for merging DataFrames based on common columns or indexes, similar to SQL joins. It’s especially useful when you have different DataFrames with related data and want to combine them based on a shared key.
Consider the following two DataFrames,
left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

right = pd.DataFrame({
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})
Inner join
An inner join returns only the rows with matching values in the specified columns (the common key). It combines data from two or more tables or DataFrames based on the intersection of keys, excluding rows that do not have corresponding matches in both tables.
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
="inner", on=["key1", "key2"]) pd.merge(left, right, how
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
Outer join
An outer join returns all rows from both tables, including rows with matching keys and rows with non-matching keys. When there’s no match for a particular row in one of the tables, the missing values are filled with NULL (or NaN in pandas), indicating no corresponding data.
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
="outer", on=["key1", "key2"]) pd.merge(left, right, how
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K0 | NaN | NaN | C3 | D3 |
5 | K2 | K1 | A3 | B3 | NaN | NaN |
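When inspecting an outer join it can be handy to know which table each row came from. The merge function accepts an indicator parameter that adds a _merge column whose values are left_only, right_only, or both; a brief sketch using the same left and right DataFrames:
pd.merge(left, right, how="outer", on=["key1", "key2"], indicator=True)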
Left join
A left join returns all rows from the left table (the first table specified) and matching rows from the right table (the second table specified). Non-matching rows in the right table have NULL (or NaN) values in the result.
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
="left", on=["key1", "key2"]) pd.merge(left, right, how
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
Right join
A right join is similar to a left join but returns all rows from the right table and matching rows from the left table. Non-matching rows in the left table have NULL (or NaN) values in the result.
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
="right", on=["key1", "key2"]) pd.merge(left, right, how
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Cross join
A cross join is used to create all possible combinations of rows from multiple tables or DataFrames. It can be useful in specific situations, such as when you want to generate all possible pairs or combinations of data.
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
="cross") pd.merge(left, right, how
key1_x | key2_x | A | B | key1_y | key2_y | C | D | |
---|---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | K0 | K0 | A0 | B0 | K1 | K0 | C1 | D1 |
2 | K0 | K0 | A0 | B0 | K1 | K0 | C2 | D2 |
3 | K0 | K0 | A0 | B0 | K2 | K0 | C3 | D3 |
4 | K0 | K1 | A1 | B1 | K0 | K0 | C0 | D0 |
5 | K0 | K1 | A1 | B1 | K1 | K0 | C1 | D1 |
6 | K0 | K1 | A1 | B1 | K1 | K0 | C2 | D2 |
7 | K0 | K1 | A1 | B1 | K2 | K0 | C3 | D3 |
8 | K1 | K0 | A2 | B2 | K0 | K0 | C0 | D0 |
9 | K1 | K0 | A2 | B2 | K1 | K0 | C1 | D1 |
10 | K1 | K0 | A2 | B2 | K1 | K0 | C2 | D2 |
11 | K1 | K0 | A2 | B2 | K2 | K0 | C3 | D3 |
12 | K2 | K1 | A3 | B3 | K0 | K0 | C0 | D0 |
13 | K2 | K1 | A3 | B3 | K1 | K0 | C1 | D1 |
14 | K2 | K1 | A3 | B3 | K1 | K0 | C2 | D2 |
15 | K2 | K1 | A3 | B3 | K2 | K0 | C3 | D3 |
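As the output above shows, columns that exist in both DataFrames (key1 and key2) are disambiguated with the default _x and _y suffixes. If you prefer more descriptive names, merge accepts a suffixes parameter; for example:
pd.merge(left, right, how="cross", suffixes=("_left", "_right"))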
String Manipulation in Pandas
String manipulation is an essential part of data pre-processing and analysis in Pandas, especially when dealing with textual data. The str
accessor in Pandas provides a wide range of string functions that allow you to clean, transform, and extract valuable information from text data within a DataFrame.
Here, we explore some of the most commonly used string manipulation functions available in Pandas for DataFrames.
str.replace()
Replaces occurrences of a pattern (string or regex) with another string.
cms.columns.str.replace(' ', '_')  # Replace spaces with "_"
Index(['ID', 'Facility_Name', 'County', 'Hospital_Type', 'Star_Rating', 'No_of_Surveys', 'Response_Rate', 'Overall_Rating'], dtype='object')
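str.replace() can also work with regular expressions when you pass regex=True. For example, the following sketch replaces any run of whitespace in the facility names with a single underscore:
cms['Facility Name'].str.replace(r'\s+', '_', regex=True)  # e.g. 'SCRIPPS GREEN HOSPITAL' -> 'SCRIPPS_GREEN_HOSPITAL'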
str.contains()
Checks if each string contains a specified substring (can use regular expressions).
= cms["Facility Name"].str.contains('GENERAL') mask cms[mask]
ID | Facility Name | County | Hospital Type | Star Rating | No of Surveys | Response Rate | Overall Rating | |
---|---|---|---|---|---|---|---|---|
6 | 151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
8 | 490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
13 | 501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
col_mask = cms.columns.str.contains('Rating')
cms.columns[col_mask]
Index(['Star Rating', 'Overall Rating'], dtype='object')
Checks if each string contains a specified regular expression pattern.
mask_col = cms.columns.str.contains(r'[A-Z][a-z]*\s[A-Z][a-z]{3}$')
cms.columns[mask_col]
Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')
Here, the regular expression [A-Z][a-z]*\s[A-Z][a-z]{3}$ can be broken down into smaller chunks for better understanding:
- [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
- [a-z]* matches zero or more lowercase characters from ‘a’ to ‘z’.
- \s matches a whitespace character (here, a space).
- [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
- [a-z]{3} matches exactly three lowercase characters from ‘a’ to ‘z’.
- $ matches the end of the string.
Putting this together, the expression matches column names that end in two words separated by a space, where each word starts with an uppercase letter and the second word is exactly four characters long. Thus, it matches the column names Facility Name, Hospital Type, and Response Rate.
If you’re unfamiliar with regular expressions, you can skip this section for now. However, interested readers can find many online resources to learn about regular expressions. One of my favorite online tools for building and testing regular expressions is https://regexr.com. You can use this tool to test the correctness of a regular expression.
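str.contains() also accepts a few useful keyword arguments: case=False makes the match case-insensitive, and na=False treats missing values as non-matches instead of propagating NaN. A short sketch:
cms['Facility Name'].str.contains('general', case=False, na=False)  # matches 'GENERAL' regardless of case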
str.startswith()
andstr.endswith()
Checks if each string starts or ends with a specified substring.
mask_county = cms.County.str.startswith('S')
cms.County[mask_county]
0 SAN DIEGO 3 SEBASTIAN 4 SHELBY 12 SUMNER Name: County, dtype: object
mask_col = cms.columns.str.endswith('e')
cms.columns[mask_col]
Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')
str.len()
Returns the length of each string in the Series.
cms['County'].str.len()
0 9 1 4 2 4 3 9 4 6 5 6 6 6 7 9 8 14 9 7 10 11 11 10 12 6 13 6 14 11 Name: County, dtype: int64
str.split()
Splits each string in the Series into a list of substrings based on a specified delimiter.
cms['Hospital Type'].str.split().str[0]  # Split the Hospital Type column and take the first part
0 Acute 1 Acute 2 Acute 3 Acute 4 Acute 5 Acute 6 Critical 7 Critical 8 Acute 9 Acute 10 Acute 11 Acute 12 Acute 13 Critical 14 Acute Name: Hospital Type, dtype: object
str.join()
Concatenates a list of strings into a single string with a specified separator.
cms[['Facility Name', 'County']].apply(','.join, axis=1)  # Join the facility name and county with a comma
0 SCRIPPS GREEN HOSPITAL,SAN DIEGO 1 ST BERNARD HOSPITAL,COOK 2 SOUTH LAKE HOSPITAL,LAKE 3 MERCY HOSPITAL FORT SMITH,SEBASTIAN 4 BAPTIST MEMORIAL HOSPITAL,SHELBY 5 ST JOSEPH REGIONAL HEALTH CENTER,BRAZOS 6 GREENE COUNTY GENERAL HOSPITAL,GREENE 7 SOUTHWEST MEMORIAL HOSPITAL,MONTEZUMA 8 SENTARA GENERAL HOSPITAL,VIRGINIA BEACH 9 PIEDMONT FAYETTE HOSPITAL,FAYETTE 10 MISSION COMMUNITY HOSPITAL,LOS ANGELES 11 DOCTORS HOSPITAL,MIAMI-DADE 12 SUMNER REGIONAL MEDICAL CENTER,SUMNER 13 WHIDBEY GENERAL HOSPITAL,ISLAND 14 NORTHRIDGE MEDICAL CENTER,LOS ANGELES dtype: object
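Note that str.join() itself operates on a Series whose elements are lists; the apply approach above joins values across columns row by row instead. For example, we can split each hospital type into words and re-join them with underscores:
cms['Hospital Type'].str.split().str.join('_')  # e.g. 'Acute Care Hospital' -> 'Acute_Care_Hospital'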
str.find()
Returns the lowest index of the substring; returns -1 if not found.
cms['Facility Name'].str.find('MEMORIAL')
0 -1 1 -1 2 -1 3 -1 4 8 5 -1 6 -1 7 10 8 -1 9 -1 10 -1 11 -1 12 -1 13 -1 14 -1 Name: Facility Name, dtype: int64
str.zfill()
Pads each string in the Series with zeros on the left to achieve a specified width.
cms['ID'].astype(str).str.zfill(6)  # Pad each ID with leading zeros to 6 digits
0 050424 1 140103 2 100051 3 040062 4 440048 5 450011 6 151317 7 061327 8 490057 9 110215 10 050704 11 100296 12 440003 13 501339 14 050116 Name: ID, dtype: object
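A more general alternative is str.pad(), which lets you choose the side to pad and the fill character; the following sketch reproduces the zfill result:
cms['ID'].astype(str).str.pad(width=6, side='left', fillchar='0')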
str.cat()
Concatenates strings in the Series/Index with an optional separator.
cms['Facility Name'].str.cat(cms['County'], sep='-')
0 SCRIPPS GREEN HOSPITAL-SAN DIEGO 1 ST BERNARD HOSPITAL-COOK 2 SOUTH LAKE HOSPITAL-LAKE 3 MERCY HOSPITAL FORT SMITH-SEBASTIAN 4 BAPTIST MEMORIAL HOSPITAL-SHELBY 5 ST JOSEPH REGIONAL HEALTH CENTER-BRAZOS 6 GREENE COUNTY GENERAL HOSPITAL-GREENE 7 SOUTHWEST MEMORIAL HOSPITAL-MONTEZUMA 8 SENTARA GENERAL HOSPITAL-VIRGINIA BEACH 9 PIEDMONT FAYETTE HOSPITAL-FAYETTE 10 MISSION COMMUNITY HOSPITAL-LOS ANGELES 11 DOCTORS HOSPITAL-MIAMI-DADE 12 SUMNER REGIONAL MEDICAL CENTER-SUMNER 13 WHIDBEY GENERAL HOSPITAL-ISLAND 14 NORTHRIDGE MEDICAL CENTER-LOS ANGELES Name: Facility Name, dtype: object
str.lower()
andstr.upper()
Converts all characters in a string to lowercase or uppercase, respectively.
"Facility Name"].str.lower() # Convert to lowercase cms[
0 scripps green hospital 1 st bernard hospital 2 south lake hospital 3 mercy hospital fort smith 4 baptist memorial hospital 5 st joseph regional health center 6 greene county general hospital 7 southwest memorial hospital 8 sentara general hospital 9 piedmont fayette hospital 10 mission community hospital 11 doctors hospital 12 sumner regional medical center 13 whidbey general hospital 14 northridge medical center Name: Facility Name, dtype: object
"Hospital Type"].str.upper() # Convert to uppercase cms[
0 ACUTE CARE HOSPITAL 1 ACUTE CARE HOSPITAL 2 ACUTE CARE HOSPITAL 3 ACUTE CARE HOSPITAL 4 ACUTE CARE HOSPITAL 5 ACUTE CARE HOSPITAL 6 CRITICAL ACCESS HOSPITAL 7 CRITICAL ACCESS HOSPITAL 8 ACUTE CARE HOSPITAL 9 ACUTE CARE HOSPITAL 10 ACUTE CARE HOSPITAL 11 ACUTE CARE HOSPITAL 12 ACUTE CARE HOSPITAL 13 CRITICAL ACCESS HOSPITAL 14 ACUTE CARE HOSPITAL Name: Hospital Type, dtype: object
str.capitalize()
andstr.title()
str.capitalize() converts the first character of each string to uppercase and the rest to lowercase.
str.title() converts the first character of each word to uppercase and the rest to lowercase.
cms['Facility Name'].str.capitalize()
0 Scripps green hospital 1 St bernard hospital 2 South lake hospital 3 Mercy hospital fort smith 4 Baptist memorial hospital 5 St joseph regional health center 6 Greene county general hospital 7 Southwest memorial hospital 8 Sentara general hospital 9 Piedmont fayette hospital 10 Mission community hospital 11 Doctors hospital 12 Sumner regional medical center 13 Whidbey general hospital 14 Northridge medical center Name: Facility Name, dtype: object
cms['Facility Name'].str.title()
0 Scripps Green Hospital 1 St Bernard Hospital 2 South Lake Hospital 3 Mercy Hospital Fort Smith 4 Baptist Memorial Hospital 5 St Joseph Regional Health Center 6 Greene County General Hospital 7 Southwest Memorial Hospital 8 Sentara General Hospital 9 Piedmont Fayette Hospital 10 Mission Community Hospital 11 Doctors Hospital 12 Sumner Regional Medical Center 13 Whidbey General Hospital 14 Northridge Medical Center Name: Facility Name, dtype: object
This concludes the data manipulation section. By the end of this section, you should now be able to confidently read and write datasets into and from DataFrames, pre-process and clean data, handle missing values, group, join, and transform data, and perform basic statistical analysis to gain insights into the underlying data.
In the next section, we will dive into data visualization, where you’ll learn how to create insightful and compelling charts and plots using the plotnine
package to better communicate your findings.