Data Manipulation with pandas

In this section, we will explore various data manipulation and analysis techniques using the Pandas library. We’ll begin by learning how to set, update, or replace data within a DataFrame. Then, we’ll cover how to handle missing data effectively. Next, we will delve into grouping data, transforming it, and joining different datasets to prepare for downstream analysis. Finally, we’ll look at generating summary statistics to gain a deeper understanding of the underlying data.

We will use the following DataFrames in this section.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.uniform(low=0, high=1, size=12),
        "B": pd.date_range(start="20230102", periods=12),
        "C": pd.Series(range(1, 13), index=["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8" ,"R9", "R10", "R11", "R12"]),
        "D": np.random.randint(1, high=100, size=12),
        "E": pd.Categorical(["red", "green", "blue", "white", "pink", "brown", "black", "purple", "orange", "grey", "violet", "yellow"]),
        "F": "foo",
    }, index = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8" ,"R9", "R10", "R11", "R12"]
)

cms = pd.read_csv('data/patient_satisfaction/cms_hospital_patient_satisfaction.csv')

Statistics on data

The pandas library provides a wide range of statistical functions and methods to compute summary statistics for your data. The following are some of the key statistical measures you can compute using this library.

Consider the following dataset, which contains 15 rows and 8 features (columns).

cms
ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
0 50424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
1 140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
2 100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
3 40062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
4 440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
5 450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3
6 151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
7 61327 SOUTHWEST MEMORIAL HOSPITAL MONTEZUMA Critical Access Hospital 4 247 34 3
8 490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
9 110215 PIEDMONT FAYETTE HOSPITAL FAYETTE Acute Care Hospital 2 1714 21 2
10 50704 MISSION COMMUNITY HOSPITAL LOS ANGELES Acute Care Hospital 3 241 14 3
11 100296 DOCTORS HOSPITAL MIAMI-DADE Acute Care Hospital 4 393 24 3
12 440003 SUMNER REGIONAL MEDICAL CENTER SUMNER Acute Care Hospital 4 680 35 2
13 501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
14 50116 NORTHRIDGE MEDICAL CENTER LOS ANGELES Acute Care Hospital 3 1110 20 2

You can compute the mean of a Series (a single column of data) or a DataFrame (a table of data) using the .mean() method.

  • Calculate the mean value for a certain column:

    cms['No of Surveys'].mean()
    1063.1333333333334
  • Calculate the mean value for each column:

    cms.mean(numeric_only = True)
    ID                211738.200000
    Star Rating            3.000000
    No of Surveys       1063.133333
    Response Rate         25.000000
    Overall Rating         2.733333
    dtype: float64
  • Calculate the mean value for each row:

    cms.mean(axis=1, numeric_only = True)
    0      10716.8
    1      28075.2
    2      20291.4
    3       8521.8
    4      88373.8
    5      90284.0
    6      30291.8
    7      12323.0
    8      98143.0
    9      22390.8
    10     10193.0
    11     20144.0
    12     88144.8
    13    100352.6
    14     10250.2
    dtype: float64
  • The median is the middle value of a dataset when the values are arranged in ascending order. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data. In Pandas, you can compute the median using the .median() method.

    cms.median(numeric_only = True)
    ID                110215.0
    Star Rating            3.0
    No of Surveys        680.0
    Response Rate         24.0
    Overall Rating         3.0
    dtype: float64
  • The standard deviation measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability. In Pandas, you can compute the standard deviation using the .std() method.

    cms.std(numeric_only = True)
    ID                188317.557348
    Star Rating            0.925820
    No of Surveys        909.054600
    Response Rate          9.296697
    Overall Rating         0.798809
    dtype: float64
  • Variance quantifies how much individual data points deviate from the mean. It is the square of the standard deviation. In Pandas, you can compute the variance using the .var() method.

    cms.var(numeric_only = True)
    ID                3.546350e+10
    Star Rating       8.571429e-01
    No of Surveys     8.263803e+05
    Response Rate     8.642857e+01
    Overall Rating    6.380952e-01
    dtype: float64
  • You can also compute the sum of values using .sum() and count the total number of non-missing values using .count().

    cms.sum(numeric_only = True)
    ID                3176073
    Star Rating            45
    No of Surveys       15947
    Response Rate         375
    Overall Rating         41
    dtype: int64
    cms.count()
    ID                15
    Facility Name     15
    County            15
    Hospital Type     15
    Star Rating       15
    No of Surveys     15
    Response Rate     15
    Overall Rating    15
    dtype: int64
    cms["Overall Rating"].sum()
    41

Here is a quick reference summary table of common useful functions.

Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum
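As a quick illustration, here are a few of these functions applied to the cms DataFrame loaded earlier (a minimal sketch; the exact results depend on the data file):

cms['No of Surveys'].quantile(0.75)   # 75th percentile of the number of surveys
cms['No of Surveys'].cumsum()         # running total down the column
cms['Response Rate'].sem()            # standard error of the mean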

Another useful method for counting the frequency of values, value_counts(), is shown below.

Consider the DataFrame,

d = {
  "a": pd.Series(np.random.randint(0, 5, size=10)),
  "b": pd.Series(np.random.randint(-3, 3, size=10))
}
df2 = pd.DataFrame(d)
df2
a b
0 3 -2
1 1 2
2 0 0
3 1 -1
4 0 2
5 4 -3
6 2 -3
7 3 -3
8 3 1
9 0 0
  • Frequency of values in all rows:

    df2.value_counts()
    a  b 
    0   0    2
        2    1
    1  -1    1
        2    1
    2  -3    1
    3  -3    1
       -2    1
        1    1
    4  -3    1
    Name: count, dtype: int64
  • Frequency of values in a single column:

    df2['a'].value_counts()
    a
    3    3
    0    3
    1    2
    4    1
    2    1
    Name: count, dtype: int64

Additionally, two powerful functions, agg and transform, allow you to perform calculations and transformations on DataFrames. These two functions apply a user-defined function that reduces or broadcasts its results, respectively. However, they serve different purposes and have distinct use cases, which we will explore in detail later.
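As a quick preview, here is a minimal sketch using the df2 DataFrame defined above (the exact numbers depend on the random values generated):

df2.agg(['min', 'max'])                  # agg reduces: one row per aggregation
df2.transform(lambda x: x - x.mean())    # transform broadcasts: same shape as df2, values centred per column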

Missing data

Handling missing values is an essential part of data pre-processing and analysis in Pandas. Missing values can arise due to various reasons, such as data collection errors, incomplete data or sensor failures.

Let’s create a DataFrame and then introduce some missing values into it.

Consider the following DataFrame:

df = pd.DataFrame(np.random.randn(6, 4), index=['a', 'b', 'c', 'd', 'e', 'f'], columns=list("ABCD"))

dates = pd.date_range("20130101", periods=6)
df["E"] = dates

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])
df["F"] = s1
df
A B C D E F
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6

Reindexing allows you to change, add, or delete the index on a specified axis. This returns a copy of the data in which the newly added rows and columns are filled with missing values (a few of which we then set explicitly).

df1 = df.reindex(index=['a', 'b', 'c', 'h', 'd', 'e', 'f', 'g'], columns=list(df.columns) + ["G", "H"])
df1.loc['c' : 'f', "G"] = 1
df1.loc['h','C'] = 0.634336
df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN NaN NaN NaT NaN NaN NaN

Detecting missing values

Pandas provides several methods for detecting and dealing with missing values.

isna() and isnull() methods return a DataFrame of the same shape as the input, with Boolean values indicating whether each element is missing (True) or not (False).

pd.isna(df1)
A B C D E F G H
a False False False False False False True True
b False False False False False False True True
c False False False False False False False True
h True True False True True True False True
d False False False False False False False True
e False False False False False False False True
f False False False False False False False True
g True True True True True True True True
pd.isnull(df1)
A B C D E F G H
a False False False False False False True True
b False False False False False False True True
c False False False False False False False True
h True True False True True True False True
d False False False False False False False True
e False False False False False False False True
f False False False False False False False True
g True True True True True True True True

The notna() and notnull() methods are the opposite of isna() and isnull(). They return True for non-missing values and False for missing values.

pd.notna(df1)
A B C D E F G H
a True True True True True True False False
b True True True True True True False False
c True True True True True True True False
h False False True False False False True False
d True True True True True True True False
e True True True True True True True False
f True True True True True True True False
g False False False False False False False False
pd.notnull(df1)
A B C D E F G H
a True True True True True True False False
b True True True True True True False False
c True True True True True True True False
h False False True False False False True False
d True True True True True True True False
e True True True True True True True False
f True True True True True True True False
g False False False False False False False False

Check which values in a column are not missing:

df1["G"].notna()
a    False
b    False
c     True
h     True
d     True
e     True
f     True
g    False
Name: G, dtype: bool

Subset the rows to return only those that do not contain missing values in column G:

df1[df1["G"].notna()]
A B C D E F G H
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN

The info() method provides a summary of the DataFrame, including the count of non-null values in each column.

df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to g
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       6 non-null      float64       
 1   B       6 non-null      float64       
 2   C       7 non-null      float64       
 3   D       6 non-null      float64       
 4   E       6 non-null      datetime64[ns]
 5   F       6 non-null      float64       
 6   G       5 non-null      float64       
 7   H       0 non-null      float64       
dtypes: datetime64[ns](1), float64(7)
memory usage: 876.0+ bytes
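Another handy pattern for detecting missing data combines isna() with sum() to count the missing values in each column, complementing the non-null counts reported by info(). A small sketch using df1 from above:

df1.isna().sum()   # number of missing values per column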

Handling missing values

Once you’ve detected missing values, you can choose from several strategies to handle them.

Consider the following DataFrame,

df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN NaN NaN NaT NaN NaN NaN

Dropping missing values

Consider the following DataFrame for this example,

# dataframe excluding column H
df_without_colH = df1.loc[:, "A":"G"]   
df_without_colH
A B C D E F G
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
h NaN NaN 0.634336 NaN NaT NaN 1.0
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
g NaN NaN NaN NaN NaT NaN NaN
Dropping missing values by rows
  • Drop the rows where at least one element is missing:

    df_without_colH.dropna()
    A B C D E F G
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Drop rows that have all missing data:

    df_without_colH.dropna(how="all")
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Drop rows that have any missing data:

    df_without_colH.dropna(how="any")
    A B C D E F G
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Keep only the rows with at least 2 non-NA values:

    df_without_colH.dropna(thresh=2)
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
Dropping missing values by columns
  • Consider the following DataFrame for the remaining examples,

    df1.loc['g','C'] = 0.877525
    df1
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT NaN NaN NaN
  • Drop the columns where at least one element is missing:

    df1.dropna(axis='columns') # same as axis=1
    C
    a -0.642234
    b -0.931715
    c 0.382713
    h 0.634336
    d -0.047471
    e -1.189557
    f 1.405489
    g 0.877525
  • Drop columns that have all missing data:

    df1.dropna(how="all", axis=1)
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
    g NaN NaN 0.877525 NaN NaT NaN NaN
  • Drop columns that have any missing data (shown as NaN):

    df1.dropna(how="any", axis=1)
    C
    a -0.642234
    b -0.931715
    c 0.382713
    h 0.634336
    d -0.047471
    e -1.189557
    f 1.405489
    g 0.877525
  • Keep only the columns with at least 6 non-NA values:

    df1.dropna(thresh=6, axis=1)
    A B C D E F
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0
    h NaN NaN 0.634336 NaN NaT NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0
    g NaN NaN 0.877525 NaN NaT NaN
  • Define in which columns to look for missing values:

    df1.dropna(subset=["B", "G"])
    A B C D E F G H
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN

Filling missing values

You can fill missing values with a specific value or a calculated value using the fillna() method.

Consider the following DataFrame,

df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN 0.877525 NaN NaT NaN NaN NaN
Replace missing values with a scalar value
  • Fill missing values with 0:

    df1.fillna(0)
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 0.0 0.0
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 0.0 0.0
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 0.0
    h 0.000000 0.000000 0.634336 0.000000 0 0.0 1.0 0.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 0.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 0.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 0.0
    g 0.000000 0.000000 0.877525 0.000000 0 0.0 0.0 0.0
  • Fill missing values with a scalar along the columns, but limit the number of values filled in each row to 2:

    df1.fillna(12, axis='columns', limit=2)
    A B C D E F G H
    a 0.42335 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 12 12
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 12 12
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 12
    h 12 12 0.634336 NaN NaT NaN 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 12
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 12
    f -0.808026 1.027726 1.405489 0.72493 2013-01-06 00:00:00 6.0 1.0 12
    g 12 12 0.877525 NaN NaT NaN NaN NaN
Fill gaps forward or backward
  • Propagate the last valid observation forward to fill missing values:

    df1.ffill() # ffill ≡ pad ≡ fillna(method='ffill')
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h 0.563209 -1.123406 0.634336 1.210963 2013-01-03 3.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g -0.808026 1.027726 0.877525 0.724930 2013-01-06 6.0 1.0 NaN
  • Propagate the next valid observation backward to fill missing values:

    df1.bfill() # bfill ≡ backfill ≡ fillna(method='bfill')
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h -1.435517 0.916215 0.634336 0.646783 2013-01-04 4.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT NaN NaN NaN
Fill with a Pandas object
  • Fill the missing values of a column with the mean of the column:

    df1.fillna(df1.mean())
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 NaN
    h 0.045313 0.521347 0.634336 0.559000 2013-01-03 12:00:00 3.5 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
    g 0.045313 0.521347 0.877525 0.559000 2013-01-03 12:00:00 3.5 1.0 NaN
  • Fill only a subset of the columns (here F and G) using the corresponding column medians:

    df1.fillna(df1.median()["F":"G"])
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h NaN NaN 0.634336 NaN NaT 3.5 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT 3.5 1.0 NaN
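fillna() also accepts a dict mapping column names to fill values, so different columns can be filled differently in a single call. A minimal sketch, assuming df1 as above:

# fill column G with 0 and column F with the median of F; other columns are left untouched
df1.fillna({"G": 0, "F": df1["F"].median()})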
Interpolation

The interpolate method is used to fill in missing values with estimated values based on the surrounding data points. It’s particularly useful for time series data or datasets where values are expected to follow a pattern.

The interpolate method employs various interpolation techniques to estimate missing values, including linear interpolation, polynomial interpolation, and more. The specific interpolation method used depends on the kind of data and the desired behavior.

  • Linear Interpolation: Linear interpolation is the default method used by interpolate. It estimates missing values by drawing a straight line between two adjacent known data points and filling in the missing value with a value along that line. This method is suitable for data that appears to change linearly over time or between data points.
  • Polynomial Interpolation: Polynomial interpolation uses higher-degree polynomial functions to estimate missing values. This method can capture more complex relationships between data points but may be susceptible to overfitting if not used carefully.
  • Time-Based Interpolation: When working with time series data, you can use the method parameter to specify time-based interpolation methods such as ‘time’, ‘index’, or ‘values’. These methods consider the time or index values to estimate missing values.

Here’s an example using linear interpolation:

df1.interpolate()
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 NaN
h -0.436154 -0.103596 0.634336 0.928873 2013-01-03 12:00:00 3.5 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
g -0.808026 1.027726 0.877525 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
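Other techniques can be selected with the method= parameter. As a minimal sketch on a small, made-up Series (polynomial interpolation requires SciPy and an order= argument):

s = pd.Series([1.0, np.nan, np.nan, 10.0, 15.0])
s.interpolate()                               # linear interpolation (the default)
s.interpolate(method='polynomial', order=2)   # fit a quadratic through the known points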

Grouping data

The groupby function is a powerful feature within pandas that allows you to group and aggregate data in a DataFrame based on one or more columns. This can be especially useful when you want to perform summary statistics, calculations, or transformations on subsets of your data based on certain criteria. It involves one or more of the following steps:

  • Splitting: The first step in a groupby operation is to split the DataFrame into groups based on the values in one or more columns. You specify the column(s) by which you want to group your data. This column is often referred to as the “key” or “grouping column”. Each unique value in the grouping column(s) forms a group, and the rows in the DataFrame are distributed among these groups.
  • Applying a Function: After splitting the data into groups, you can apply various aggregation or transformation functions to each group. These functions are typically applied to one or more columns in each group. Common aggregation functions include sum, mean, count, min, max, and more. You can also apply custom functions or perform complex operations on the grouped data.
    • Aggregation: compute a summary statistic (or statistics) for each group. Examples: compute group sums or means, compute group size.
    • Transformation: perform some group-specific computations and return a like-indexed object. Examples: standardize data (zscore) within a group, filling NAs within groups with a value derived from each group.
    • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Examples: discard data that belong to groups with only a few members, filter out data based on the group sum or mean.
  • Combining Results: Once the specified function(s) are applied to each group, the results are combined into a new DataFrame or data structure. This final result will often have a hierarchical structure, with the grouping columns as index levels.
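To make these steps concrete before moving to a real dataset, here is a tiny sketch with made-up values:

# split by 'group', apply mean() to each group's 'value' column, combine into a Series
toy = pd.DataFrame({"group": ["x", "x", "y", "y", "y"], "value": [1, 2, 3, 4, 5]})
toy.groupby("group")["value"].mean()
# group
# x    1.5
# y    4.0
# Name: value, dtype: float64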

Consider the following metabric dataset for the examples listed below.

# Load the Metabric dataset from the URL into a DataFrame
metabric = pd.read_csv("https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv")
metabric
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... 4ER+ NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 4ER+ 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
2 MB-0005 1 48.87 163.700000 DECEASED Died of Disease YES NO 15.0 2.0 ... 3 2.0 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115
3 MB-0006 1 47.68 164.933333 LIVING Living YES YES 25.0 2.0 ... 9 1.0 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181
4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 9 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1899 MB-7295 4 43.10 196.866667 LIVING Living NO YES 25.0 NaN ... 3 4.0 9.591235 9.935179 7.984515 6.753291 6.279207 9.207323 11.281194 11.337601
1900 MB-7296 4 42.88 44.733333 DECEASED Died of Disease NO YES 20.0 NaN ... 5 6.0 9.733986 13.753037 5.616082 6.271912 5.999093 9.530390 11.532033 11.626140
1901 MB-7297 4 62.90 175.966667 DECEASED Died of Disease NO YES 25.0 NaN ... 1 4.0 11.053198 10.228570 7.478069 6.212256 6.192399 9.540589 11.482761 11.180360
1902 MB-7298 4 61.16 86.233333 DECEASED Died of Other Causes NO NO 25.0 NaN ... 1 15.0 11.055114 9.892589 8.282737 6.466712 6.287254 10.365901 11.371176 12.827069
1903 MB-7299 4 60.02 201.900000 DECEASED Died of Other Causes NO YES 20.0 NaN ... 10 3.0 10.696475 10.227787 5.533486 6.180511 6.208784 9.749368 10.867527 9.847856

1904 rows × 32 columns

Aggregation

This is primarily used for aggregating data (within groups when you use groupby). It allows you to apply one or more aggregation functions to each group and obtain a summarized result for each group.

  • Calculate the total number of patients, categorized by cancer type.

    metabric.groupby("Cancer_type")["Patient_ID"].count()
    Patient_ID
    Cancer_type
    Breast 17
    Breast Invasive Ductal Carcinoma 1500
    Breast Invasive Lobular Carcinoma 142
    Breast Invasive Mixed Mucinous Carcinoma 22
    Breast Mixed Ductal and Lobular Carcinoma 207
    Metaplastic Breast Cancer 1

    count() computes the number of non-NA values in the groups whereas size() computes the number of values in each group. To demonstrate the difference between these two methods, group the metabric dataset by 3-gene_classifier and find the number of values in each group based on the Tumour_size column.

    metabric.groupby("3-gene_classifier")["Tumour_size"].count()
    Tumour_size
    3-gene_classifier
    ER+/HER2- High Prolif 596
    ER+/HER2- Low Prolif 615
    ER-/HER2- 287
    HER2+ 186
    metabric.groupby("3-gene_classifier")["Tumour_size"].size()
    Tumour_size
    3-gene_classifier
    ER+/HER2- High Prolif 603
    ER+/HER2- Low Prolif 619
    ER-/HER2- 290
    HER2+ 188
  • Determine the median age at diagnosis for each type of vital status.

    metabric.groupby("Vital_status")["Age_at_diagnosis"].median()
    Age_at_diagnosis
    Vital_status
    Died of Disease 61.135
    Died of Other Causes 70.505
    Living 56.740
  • Calculate the total, mean, and standard deviation of Tumour_size for each combination of cancer type and ER status.

    grouped = metabric.groupby(["Cancer_type", "ER_status"])
    grouped["Tumour_size"].agg(['sum','mean','std'])
    sum mean std
    Cancer_type ER_status
    Breast Negative 173.00 43.250000 17.951323
    Positive 289.00 22.230769 13.417363
    Breast Invasive Ductal Carcinoma Negative 11241.77 28.388308 19.209870
    Positive 27468.87 25.154643 12.961027
    Breast Invasive Lobular Carcinoma Negative 588.80 29.440000 19.155277
    Positive 3579.00 30.075630 19.792221
    Breast Invasive Mixed Mucinous Carcinoma Negative 40.00 40.000000 NaN
    Positive 518.00 25.900000 15.403007
    Breast Mixed Ductal and Lobular Carcinoma Negative 326.00 25.076923 13.174860
    Positive 4832.00 25.298429 12.900426
    Metaplastic Breast Cancer Negative 16.00 16.000000 NaN
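agg() also accepts a dict mapping column names to functions, so different columns can be summarised with different statistics within the same groups. A small sketch reusing the grouped object from above:

# mean tumour size and total mutation count per (Cancer_type, ER_status) group
grouped.agg({"Tumour_size": "mean", "Mutation_count": "sum"})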

Transformation

This is used for transformations that preserve the shape of the data. It applies a given function to each column (or, with groupby, to each group) and returns a result with the same index as the original DataFrame or Series. This function is commonly used when you want to broadcast a computed value back to the original DataFrame, maintaining the original shape of the data.

  • Compute the difference between adjacent values of expression of ESR1 within each group of cancer types.

    metabric.groupby("Cancer_type")["ESR1"].diff()
    ESR1
    0 NaN
    1 1.117242
    2 -0.005778
    3 NaN
    4 0.871896
    ... ...
    1899 -1.833044
    1900 -1.556991
    1901 1.319212
    1902 0.001916
    1903 -0.358639

    1904 rows × 1 columns

  • Compute the cumulative sum of mutation count within each chemotherapy group.

    metabric.groupby("Chemotherapy")["Mutation_count"].transform("cumsum")
    Mutation_count
    0 NaN
    1 2.0
    2 2.0
    3 3.0
    4 5.0
    ... ...
    1899 8583.0
    1900 8589.0
    1901 8593.0
    1902 8608.0
    1903 8611.0

    1904 rows × 1 columns

    • Convert the survival time, which is given in months, to years.
    metabric["Survival_time"].transform(lambda x: x / 12)
    0       11.708333
    1        7.052778
    2       13.641667
    3       13.744444
    4        3.447222
              ...    
    1899    16.405556
    1900     3.727778
    1901    14.663889
    1902     7.186111
    1903    16.825000
    Name: Survival_time, Length: 1904, dtype: float64

    A lambda function in Python is a small, anonymous, and inline function. It is also known as a lambda expression or lambda notation. Lambda functions are a way to create small, one-time-use functions without needing to define them using the def keyword. Lambda functions are typically used for short, simple operations where defining a full function using def would be overly verbose.

    lambda arguments: expression

    Here’s a simple example to illustrate the use of lambda functions:

    # Regular function to calculate the square of a number
    def square(x):
        return x ** 2
    
    # Equivalent lambda function to calculate the square of a number
    square_lambda = lambda x: x ** 2

Filtration

This is the process of selecting rows from a DataFrame based on grouping. It may filter out entire groups, parts of groups, or both.
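One way to discard entire groups is the groupby filter() method, which keeps only the groups for which a function returns True. A minimal sketch using the metabric DataFrame (the threshold of 100 patients is arbitrary):

# keep only rows belonging to cancer types with more than 100 patients
metabric.groupby("Cancer_type").filter(lambda g: len(g) > 100)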

  • Select the row at position 3 (the 4th row, since positions are zero-based) of each group categorized by cancer type.

    metabric.groupby("Cancer_type").nth(3)
    Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
    5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164
    15 MB-0050 1 44.64 75.333333 LIVING Living YES YES 33.0 2.0 ... 8 NaN 9.492446 10.399286 6.665170 6.843690 5.617014 9.396238 11.188134 12.243680
    32 MB-0102 1 51.38 140.766667 DECEASED Died of Disease YES YES 40.0 2.0 ... 3 4.0 9.913163 9.980732 7.651020 6.431794 5.970082 9.402294 11.052930 11.474559
    230 MB-0365 1 76.84 87.233333 DECEASED Died of Disease NO YES 15.0 1.0 ... 4ER+ 6.0 8.365138 12.345780 5.370492 6.046610 6.092679 8.708236 10.604942 10.728391
    429 MB-0620 1 62.97 112.800000 LIVING Living YES YES 14.0 2.0 ... 4ER+ NaN 8.449960 9.666006 5.468108 6.080813 6.233319 8.377827 8.580251 9.925870

    5 rows × 32 columns

  • Print the first rows (5 per group by default) of each group categorized by cancer type and 3-gene classifier.

    metabric.groupby(["Cancer_type", "3-gene_classifier"]).head()
    Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
    0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... 4ER+ NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
    1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 4ER+ 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
    4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 9 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
    5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164
    8 MB-0028 1 86.41 36.566667 DECEASED Died of Other Causes NO YES 16.0 2.0 ... 9 4.0 12.521038 10.678267 5.325554 6.220372 6.250678 10.260059 12.148375 10.936002
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    1524 MB-6047 5 72.94 37.866667 DECEASED Died of Disease NO NO 15.0 NaN ... 9 14.0 9.457621 10.750387 6.588294 6.442729 6.114882 9.633760 10.743632 10.220908
    1607 MB-6212 5 61.81 224.233333 LIVING Living NO YES 15.0 1.0 ... 7 4.0 11.720822 10.060722 6.324464 6.606875 5.476758 10.147003 11.494327 12.624759
    1651 MB-6314 5 68.57 134.833333 LIVING Living NO NO 14.0 NaN ... 1 5.0 10.128997 13.637355 5.249247 6.975344 5.938845 10.767529 12.502447 11.903197
    1696 MB-7030 4 74.21 31.333333 DECEASED Died of Other Causes NO YES 50.0 NaN ... 4ER+ 5.0 6.192748 10.030603 4.969329 7.290813 5.523130 6.343169 6.251026 6.423873
    1856 MB-7235 4 62.45 170.800000 LIVING Living NO NO 20.0 NaN ... 8 12.0 11.854216 11.241157 5.898537 6.665210 5.863743 10.676374 11.860498 11.966185

    84 rows × 32 columns

apply function

Some operations on the grouped data might not fit into the aggregation, transformation, or filtration categories. For these, you can use the apply function.

  • Select all patients with age between 50 and 70 within each group categorized by ER status.
metabric.groupby('ER_status')["Age_at_diagnosis"].apply(lambda x: x[(x >= 50) & (x <= 70)])
ER_status      
Negative   20      52.14
           24      50.42
           30      68.68
           52      52.11
           54      63.53
                   ...  
Positive   1897    56.90
           1898    59.20
           1901    62.90
           1902    61.16
           1903    60.02
Name: Age_at_diagnosis, Length: 983, dtype: float64
# Filter the DataFrame directly based on 'Age_at_diagnosis' and then group by 'ER_status'
filtered_df = metabric[(metabric['Age_at_diagnosis'] >= 50) & (metabric['Age_at_diagnosis'] <= 70)].groupby('ER_status')

# Display each group in the filtered DataFrame
for er_status, group in filtered_df:
    print(f"ER Status: {er_status}")
    print(group)
    print()  # For better readability
ER Status: Negative
     Patient_ID  Cohort  Age_at_diagnosis  Survival_time Survival_status  \
20      MB-0062       1             52.14     153.966667          LIVING   
24      MB-0079       1             50.42      28.500000        DECEASED   
30      MB-0100       1             68.68       8.066667        DECEASED   
52      MB-0127       1             52.11     132.066667          LIVING   
54      MB-0129       1             63.53      38.566667        DECEASED   
...         ...     ...               ...            ...             ...   
1865    MB-7251       4             54.66      21.566667        DECEASED   
1869    MB-7256       4             50.92      34.700000        DECEASED   
1879    MB-7270       4             50.84     175.166667          LIVING   
1887    MB-7281       4             51.22      49.533333        DECEASED   
1895    MB-7291       4             53.87       6.833333        DECEASED   

              Vital_status Chemotherapy Radiotherapy  Tumour_size  \
20                  Living          YES          YES        17.00   
24         Died of Disease          YES          YES        40.00   
30         Died of Disease          YES          YES        39.00   
52                  Living          YES          YES        13.00   
54    Died of Other Causes          YES          YES        24.00   
...                    ...          ...          ...          ...   
1865       Died of Disease           NO           NO        15.47   
1869       Died of Disease          YES          YES        40.00   
1879                Living          YES           NO        16.00   
1887       Died of Disease          YES          YES        16.00   
1895       Died of Disease           NO           NO        50.00   

      Tumour_stage  ...  Integrative_cluster  Mutation_count      ESR1  \
20             1.0  ...                   10             3.0  5.965770   
24             2.0  ...                   10             4.0  6.312633   
30             2.0  ...                   10             1.0  6.204958   
52             2.0  ...                    3             5.0  7.383420   
54             2.0  ...                    5             NaN  6.217353   
...            ...  ...                  ...             ...       ...   
1865           NaN  ...                    5             1.0  6.552171   
1869           NaN  ...                    5             4.0  6.425775   
1879           NaN  ...                   10             8.0  5.996827   
1887           NaN  ...                    5             3.0  6.074817   
1895           NaN  ...                    1             4.0  6.096158   

          ERBB2       PGR      TP53    PIK3CA     GATA3      FOXA1       MLPH  
20     7.789817  5.164429  5.444836  7.332574  5.987214   6.461705   6.266163  
24     9.068778  5.305683  6.673993  7.068624  5.667747  10.690917   9.056929  
30     8.881671  5.172111  7.072938  6.189728  5.687771   6.311955   7.963707  
52    11.011174  5.368214  5.877772  5.997610  8.749941  10.912581  11.732541  
54    13.495315  5.582046  5.678390  6.556857  7.141058  10.474060  10.331051  
...         ...       ...       ...       ...       ...        ...        ...  
1865  14.224677  5.078956  6.921696  5.385871  9.574542  11.608117  11.280612  
1869  14.464282  5.356381  6.107157  6.480691  8.570477  11.123834  12.386091  
1879   9.515256  5.261880  6.981713  5.724471  7.574704  10.685767   9.210209  
1887  12.545156  5.218277  6.423540  5.856297  7.084129  11.029630  10.737725  
1895  12.600238  5.260572  6.351516  5.829438  8.116684  10.298983  10.747426  

[221 rows x 32 columns]

ER Status: Positive
     Patient_ID  Cohort  Age_at_diagnosis  Survival_time Survival_status  \
6       MB-0014       1             56.45     164.333333          LIVING   
14      MB-0048       1             51.46     103.833333          LIVING   
17      MB-0054       1             66.91     160.300000          LIVING   
18      MB-0056       1             62.62      62.866667          LIVING   
21      MB-0064       1             69.13     108.933333          LIVING   
...         ...     ...               ...            ...             ...   
1897    MB-7293       4             56.90     199.233333          LIVING   
1898    MB-7294       4             59.20      82.733333        DECEASED   
1901    MB-7297       4             62.90     175.966667        DECEASED   
1902    MB-7298       4             61.16      86.233333        DECEASED   
1903    MB-7299       4             60.02     201.900000        DECEASED   

              Vital_status Chemotherapy Radiotherapy  Tumour_size  \
6                   Living          YES          YES         10.0   
14                  Living          YES          YES         25.0   
17                  Living           NO          YES         36.0   
18                  Living           NO           NO         29.0   
21                  Living           NO           NO         18.0   
...                    ...          ...          ...          ...   
1897                Living           NO           NO         45.0   
1898       Died of Disease           NO           NO         15.0   
1901       Died of Disease           NO          YES         25.0   
1902  Died of Other Causes           NO           NO         25.0   
1903  Died of Other Causes           NO          YES         20.0   

      Tumour_stage  ...  Integrative_cluster  Mutation_count       ESR1  \
6              2.0  ...                    3             4.0  10.793832   
14             2.0  ...                 4ER+             NaN   8.101955   
17             2.0  ...                   10             3.0  10.194473   
18             1.0  ...                    1             4.0  11.560030   
21             1.0  ...                    3             NaN  11.288632   
...            ...  ...                  ...             ...        ...   
1897           NaN  ...                    3             5.0  10.879891   
1898           NaN  ...                    1             2.0  11.290976   
1901           NaN  ...                    1             4.0  11.053198   
1902           NaN  ...                    1            15.0  11.055114   
1903           NaN  ...                   10             3.0  10.696475   

          ERBB2       PGR      TP53    PIK3CA      GATA3      FOXA1       MLPH  
6      9.276507  7.720952  5.992706  7.481835   8.365527  11.482627  10.755199  
14    12.847992  6.114007  6.397985  7.779824   7.294771  10.963978   9.063062  
17     9.451855  6.769441  5.350892  7.564274   7.465016  10.822652   8.665371  
18     8.674760  6.336389  5.845364  7.234934   8.838840  10.905011   9.829488  
21     8.367936  8.161056  6.433638  6.412103   8.528343  11.847914  10.443699  
...         ...       ...       ...       ...        ...        ...        ...  
1897  10.219154  6.431113  5.791022  6.000993   9.551231  11.302246  11.301274  
1898  10.846545  7.312247  5.660943  6.190000   9.424235  11.075688  11.567166  
1901  10.228570  7.478069  6.212256  6.192399   9.540589  11.482761  11.180360  
1902   9.892589  8.282737  6.466712  6.287254  10.365901  11.371176  12.827069  
1903  10.227787  5.533486  6.180511  6.208784   9.749368  10.867527   9.847856  

[762 rows x 32 columns]

assign function

This function is used to create new columns or modify existing columns in a DataFrame in a concise and flexible way. It returns a new DataFrame with the added or updated columns while keeping the original DataFrame unchanged, unless explicitly reassigned.

Create a new column named Tumour_size_cm in the metabric DataFrame to represent tumor size in centimeters.

metabric.assign(Tumour_size_cm = metabric.Tumour_size / 10)
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH Tumour_size_cm
0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728 2.2
1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570 1.0
2 MB-0005 1 48.87 163.700000 DECEASED Died of Disease YES NO 15.0 2.0 ... 2.0 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115 1.5
3 MB-0006 1 47.68 164.933333 LIVING Living YES YES 25.0 2.0 ... 1.0 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181 2.5
4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1899 MB-7295 4 43.10 196.866667 LIVING Living NO YES 25.0 NaN ... 4.0 9.591235 9.935179 7.984515 6.753291 6.279207 9.207323 11.281194 11.337601 2.5
1900 MB-7296 4 42.88 44.733333 DECEASED Died of Disease NO YES 20.0 NaN ... 6.0 9.733986 13.753037 5.616082 6.271912 5.999093 9.530390 11.532033 11.626140 2.0
1901 MB-7297 4 62.90 175.966667 DECEASED Died of Disease NO YES 25.0 NaN ... 4.0 11.053198 10.228570 7.478069 6.212256 6.192399 9.540589 11.482761 11.180360 2.5
1902 MB-7298 4 61.16 86.233333 DECEASED Died of Other Causes NO NO 25.0 NaN ... 15.0 11.055114 9.892589 8.282737 6.466712 6.287254 10.365901 11.371176 12.827069 2.5
1903 MB-7299 4 60.02 201.900000 DECEASED Died of Other Causes NO YES 20.0 NaN ... 3.0 10.696475 10.227787 5.533486 6.180511 6.208784 9.749368 10.867527 9.847856 2.0

1904 rows × 33 columns
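assign() can also take callables, which are evaluated on the DataFrame being assigned to, and can create several columns in one call. A small sketch extending the example above (the Survival_years column is illustrative):

metabric.assign(
    Tumour_size_cm=lambda d: d.Tumour_size / 10,      # tumour size in centimetres
    Survival_years=lambda d: d.Survival_time / 12,    # survival time in years
)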

Tidy Data

Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows, and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data are designed to promote consistency and ease of use in data analysis.

Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.

For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.

Wide format is intuitive for data entry, but less so for data analysis. Consider calculating the mean for each month: where would you place it? Would it be another row?
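To make the distinction concrete, here is a toy wide table in the spirit of AirPassengers (the numbers are made up), together with a preview of the long format introduced below, where the monthly mean becomes a simple group-wise summary:

# one row per year, one column per month (wide format)
wide = pd.DataFrame({"Year": [1949, 1950], "Jan": [112, 115], "Feb": [118, 126]})

# long format: one row per (Year, Month) observation
long = wide.melt(id_vars="Year", var_name="Month", value_name="Passengers")
long.groupby("Month")["Passengers"].mean()   # the monthly mean fits naturally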

Data needs to be reshaped to conform to the tidy data structure. It involves using two primary verbs (or pairs of opposites):

  • Gather columns into rows (melt()).
  • Spread rows into columns (pivot()).

Converting data from wide to long format

First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).

counts = pd.read_csv("data/GSE60450_normalized_data.csv")
counts.head()
Output
X gene_symbol GSM1480291 GSM1480292 GSM1480293 GSM1480294 GSM1480295 GSM1480296 GSM1480297 GSM1480298 GSM1480299 GSM1480300 GSM1480301 GSM1480302
0 ENSMUSG00000031805 Jak3 82.22869 81.18604 36.13808 36.62751 12.35039 11.85235 10.59006 14.88337 7.57182 7.05763 3.16982 4.25451
1 ENSMUSG00000033909 Usp36 88.22360 94.59098 76.11531 63.73863 27.10872 24.87050 95.67017 100.73912 78.07470 59.35009 36.36491 40.37057
2 ENSMUSG00000065524 Mir135a-2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
3 ENSMUSG00000030374 Strn4 94.26324 85.05204 59.54855 54.30684 24.42892 24.90936 125.00263 102.99289 110.64124 97.98130 77.96883 105.70093
4 ENSMUSG00000098547 Mir6921 0.08948 0.08404 0.00000 0.08459 0.03884 0.07772 0.00000 0.08505 0.07726 0.08255 0.04403 0.04727

To transform this table from a wide format to a long format, we use the melt() function. It’s important to note that this function does not create tidy data on its own, as it duplicates the identifier values across rows. However, the ‘long format’ output from melt() is often needed for visualization and for left joins, which will be introduced later.

This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.

The melt() function takes four arguments:

  1. id_vars = : Column(s) to use as identifier variables.
  2. value_vars = : Columns to unpivot (convert into label/value pairs in long form). If not specified, all columns that are not set as id_vars = are used.
  3. var_name = : a name for the new column containing the labels (the original column names).
  4. value_name = : a name for the new column containing the values from the melted columns.
counts.melt(id_vars=['X', 'gene_symbol'], var_name='Sample', value_name='Count')
Output
X gene_symbol Sample Count
0 ENSMUSG00000031805 Jak3 GSM1480291 82.22869
1 ENSMUSG00000033909 Usp36 GSM1480291 88.22360
2 ENSMUSG00000065524 Mir135a-2 GSM1480291 0.00000
3 ENSMUSG00000030374 Strn4 GSM1480291 94.26324
4 ENSMUSG00000098547 Mir6921 GSM1480291 0.08948
... ... ... ... ...
235 ENSMUSG00000066189 Cacng3 GSM1480302 0.00000
236 ENSMUSG00000005611 Mrvi1 GSM1480302 185.16572
237 ENSMUSG00000064299 4921528I07Rik GSM1480302 0.00000
238 ENSMUSG00000028174 Rpe65 GSM1480302 0.00000
239 ENSMUSG00000024902 Mrpl11 GSM1480302 10.06901

240 rows × 4 columns

counts.melt(id_vars='X', value_vars=counts.columns[counts.columns.str.startswith('GSM')], var_name='Sample', value_name='Count')
Output
X Sample Count
0 ENSMUSG00000031805 GSM1480291 82.22869
1 ENSMUSG00000033909 GSM1480291 88.22360
2 ENSMUSG00000065524 GSM1480291 0.00000
3 ENSMUSG00000030374 GSM1480291 94.26324
4 ENSMUSG00000098547 GSM1480291 0.08948
... ... ... ...
235 ENSMUSG00000066189 GSM1480302 0.00000
236 ENSMUSG00000005611 GSM1480302 185.16572
237 ENSMUSG00000064299 GSM1480302 0.00000
238 ENSMUSG00000028174 GSM1480302 0.00000
239 ENSMUSG00000024902 GSM1480302 10.06901

240 rows × 3 columns

The counts.columns[counts.columns.str.startswith('GSM')] command returns a vector of columns whose names start with ‘GSM’. melt() will then transform those columns into two new columns, denoted as “Sample” and “Count”. The parameter var_name = 'Sample' indicates that the new column containing the original column names (defined by value_vars =) should be named “Sample”, while value_name = 'Count' specifies that the new column containing the values should be named “Count”.

Converting data from long to wide format

First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).

annot = pd.read_csv("data/GSE60450_annotation.csv")
annot.head()
Output
ENSEMBL Type Annotation
0 ENSMUSG00000031805 SYMBOL Jak3
1 ENSMUSG00000031805 GENENAME Janus kinase 3
2 ENSMUSG00000033909 SYMBOL Usp36
3 ENSMUSG00000033909 GENENAME ubiquitin specific peptidase 36
4 ENSMUSG00000065524 SYMBOL Mir135a-2

To transform this table so that it conforms to the tidy principles, we use the pivot() function.

This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.

The pivot() function takes three arguments:

  1. columns = : column containing the labels that will be transformed into the new column names.
  2. index = : column to use as the new frame’s index. If not given, the existing index is used.
  3. values = : column(s) containing the values that will fill the new columns.

In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:

annot.pivot(index='ENSEMBL', columns='Type', values='Annotation')
Output
Type GENENAME SYMBOL
ENSEMBL
ENSMUSG00000005611 inositol 1,4,5-triphosphate receptor associated 1 Irag1
ENSMUSG00000024902 mitochondrial ribosomal protein L11 Mrpl11
ENSMUSG00000026283 inhibitor of growth family, member 5 Ing5
ENSMUSG00000028174 retinal pigment epithelium 65 Rpe65
ENSMUSG00000030374 striatin, calmodulin binding protein 4 Strn4
ENSMUSG00000031805 Janus kinase 3 Jak3
ENSMUSG00000033475 translocase of outer mitochondrial membrane 6 Tomm6
ENSMUSG00000033909 ubiquitin specific peptidase 36 Usp36
ENSMUSG00000035629 RUN domain and cysteine-rich domain containing... Rubcn
ENSMUSG00000037331 La ribonucleoprotein 1, translational regulator Larp1
ENSMUSG00000038246 family with sequence similarity 50, member B Fam50b
ENSMUSG00000039715 dynein 2 intermediate chain 2 Dync2i2
ENSMUSG00000053080 zinc finger translocation associated Zfta
ENSMUSG00000055491 peroxisome proliferative activated receptor, g... Pprc1
ENSMUSG00000064299 RIKEN cDNA 4921528I07 gene 4921528I07Rik
ENSMUSG00000065524 microRNA 135a-2 Mir135a-2
ENSMUSG00000066189 calcium channel, voltage-dependent, gamma subu... Cacng3
ENSMUSG00000074489 bone gamma-carboxyglutamate protein 3 Bglap3
ENSMUSG00000094053 secretoglobin, family 2B, member 7 Scgb2b7
ENSMUSG00000098547 microRNA 6921 Mir6921

The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.

It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the columns= column, the output will indeed become wider compared to the input.

Separate Columns

First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).

metadata = pd.read_csv("data/GSE60450_metadata.csv")
metadata.head()
Output
gene_id characteristics
0 GSM1480291 mammary gland;luminal cells;virgin
1 GSM1480292 mammary gland;luminal cells;virgin
2 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
3 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480295 mammary gland;luminal cells;2 day lactation

To transform this table so that it conforms to the tidy principles, we use the str.split() function. This operation will separate the characteristics column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.

To separate the characteristics column in the metadata data frame into three separate columns based on the delimiter ; (semicolon), we can use the str.split() function:

# split the characteristics column into 3 columns based on the delimiter ';' and update the metadata DataFrame
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)
# remove characteristics column
metadata = metadata.drop(columns='characteristics')
metadata
Output
gene_id tissue_type immunophenotype development_stage
0 GSM1480291 mammary gland luminal cells virgin
1 GSM1480292 mammary gland luminal cells virgin
2 GSM1480293 mammary gland luminal cells 18.5 day pregnancy
3 GSM1480294 mammary gland luminal cells 18.5 day pregnancy
4 GSM1480295 mammary gland luminal cells 2 day lactation
5 GSM1480296 mammary gland luminal cells 2 day lactation
6 GSM1480297 mammary gland basal cells virgin
7 GSM1480298 mammary gland basal cells virgin
8 GSM1480299 mammary gland basal cells 18.5 day pregnancy
9 GSM1480300 mammary gland basal cells 18.5 day pregnancy
10 GSM1480301 mammary gland basal cells 2 day lactation
11 GSM1480302 mammary gland basal cells 2 day lactation
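Note that the expand=True argument is what makes str.split() return a DataFrame of separate columns; without it, each element of the result is a list of substrings. A small illustration on a throwaway Series:

s = pd.Series(["mammary gland;luminal cells;virgin"])
s.str.split(';')               # a Series whose single element is a list of 3 strings
s.str.split(';', expand=True)  # a DataFrame with 3 separate columns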

Uniting Columns

This is the complement of separating columns. Let’s reverse what we did in the previous section and combine multiple columns into a single column, as illustrated in the image below.

To combine the tissue_type, immunophenotype, and development_stage columns into a characteristics column in the metadata DataFrame:

metadata['characteristics'] = metadata.tissue_type.str.cat(metadata[['immunophenotype','development_stage']], sep = ';')
# remove tissue type, immunophenotype, and development stage columns
metadata = metadata.drop(columns=['tissue_type','immunophenotype','development_stage'])
metadata
Output
gene_id characteristics
0 GSM1480291 mammary gland;luminal cells;virgin
1 GSM1480292 mammary gland;luminal cells;virgin
2 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
3 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480295 mammary gland;luminal cells;2 day lactation
5 GSM1480296 mammary gland;luminal cells;2 day lactation
6 GSM1480297 mammary gland;basal cells;virgin
7 GSM1480298 mammary gland;basal cells;virgin
8 GSM1480299 mammary gland;basal cells;18.5 day pregnancy
9 GSM1480300 mammary gland;basal cells;18.5 day pregnancy
10 GSM1480301 mammary gland;basal cells;2 day lactation
11 GSM1480302 mammary gland;basal cells;2 day lactation

A quick and easy alternative is to unite columns using the + operator:

metadata.characteristics = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage
# remove tissue type, immunophenotype, and development stage columns
metadata = metadata.drop(columns=['tissue_type','immunophenotype','development_stage'])
metadata

Since we already combined these columns and dropped them from the metadata DataFrame in the previous example, this code will not run: the tissue_type, immunophenotype, and development_stage columns no longer exist in the DataFrame.
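If you want to try the + approach yourself, you would first need to re-create the three columns, for example by splitting characteristics again. A sketch, continuing from the current state of metadata:

# re-create the three columns by splitting 'characteristics' again
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)

# combine them back into a single column with the + operator
metadata['characteristics'] = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage

# drop the intermediate columns again
metadata = metadata.drop(columns=['tissue_type', 'immunophenotype', 'development_stage'])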

Joining Data

Pandas provides two primary methods for combining DataFrames: concat and merge. These methods allow you to combine DataFrames along rows and columns, and they support various types of joins, including inner, left, right, and outer joins. Here’s an explanation of both methods and the types of joins they support:

concat

The concat method is used for concatenating (stacking) DataFrames along a particular axis, which can be either rows or columns. It’s particularly useful when you want to combine DataFrames with the same structure along a common axis. It does not require a common key to merge DataFrames, as it is primarily for stacking them.

Consider the following three DataFrames,

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
                    
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

df3 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})
  • Concatenating along rows (vertically)
df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
df2
A B
0 A3 B3
1 A4 B4
2 A5 B5
pd.concat([df1, df2], axis=0)
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
  • Concatenating along columns (horizontally)
df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
df3
C D
0 C0 D0
1 C1 D1
2 C2 D2
pd.concat([df1, df3], axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
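
Notice that when concatenating along rows (the first example above), each DataFrame keeps its original index, so the row labels 0, 1, 2 appear twice in the result. If you prefer a fresh, continuous index, pass ignore_index=True:

# concatenate vertically and rebuild the index from 0
pd.concat([df1, df2], axis=0, ignore_index=True)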

merge

The merge method is used for merging DataFrames based on common columns or indexes, similar to SQL joins. It’s especially useful when you have different DataFrames with related data and want to combine them based on a shared key.

Consider the following two DataFrames,

left = pd.DataFrame({
         "key1": ["K0", "K0", "K1", "K2"],
         "key2": ["K0", "K1", "K0", "K1"],
         "A": ["A0", "A1", "A2", "A3"],
         "B": ["B0", "B1", "B2", "B3"],
     })
right = pd.DataFrame({
         "key1": ["K0", "K1", "K1", "K2"],
         "key2": ["K0", "K0", "K0", "K0"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
     })

Inner join

An inner join returns only the rows with matching values in the specified columns (the common key). It combines data from two or more tables or DataFrames based on the intersection of keys, excluding rows that do not have corresponding matches in both tables.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="inner", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

Outer join

An outer join returns all rows from both tables, including rows with matching keys and rows with non-matching keys. When there’s no match for a particular row in one of the tables, the missing values are filled with NULL (or NaN in pandas), indicating no corresponding data.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="outer", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K0 NaN NaN C3 D3
5 K2 K1 A3 B3 NaN NaN
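
When performing an outer join, it is often useful to know which table each row came from. Passing indicator=True adds a _merge column whose values are 'left_only', 'right_only', or 'both':

# same outer join, with a column indicating the source of each row
pd.merge(left, right, how="outer", on=["key1", "key2"], indicator=True)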

Left join

A left join returns all rows from the left table (the first table specified) and matching rows from the right table (the second table specified). Non-matching rows in the right table have NULL (or NaN) values in the result.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="left", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN

Right join

A right join is similar to a left join but returns all rows from the right table and matching rows from the left table. Non-matching rows in the left table have NULL (or NaN) values in the result.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="right", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3

Cross join

A cross join is used to create all possible combinations of rows from multiple tables or DataFrames. It can be useful in specific situations, such as when you want to generate all possible pairs or combinations of data.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="cross")
key1_x key2_x A B key1_y key2_y C D
0 K0 K0 A0 B0 K0 K0 C0 D0
1 K0 K0 A0 B0 K1 K0 C1 D1
2 K0 K0 A0 B0 K1 K0 C2 D2
3 K0 K0 A0 B0 K2 K0 C3 D3
4 K0 K1 A1 B1 K0 K0 C0 D0
5 K0 K1 A1 B1 K1 K0 C1 D1
6 K0 K1 A1 B1 K1 K0 C2 D2
7 K0 K1 A1 B1 K2 K0 C3 D3
8 K1 K0 A2 B2 K0 K0 C0 D0
9 K1 K0 A2 B2 K1 K0 C1 D1
10 K1 K0 A2 B2 K1 K0 C2 D2
11 K1 K0 A2 B2 K2 K0 C3 D3
12 K2 K1 A3 B3 K0 K0 C0 D0
13 K2 K1 A3 B3 K1 K0 C1 D1
14 K2 K1 A3 B3 K1 K0 C2 D2
15 K2 K1 A3 B3 K2 K0 C3 D3
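
Note that in the cross join output the overlapping column names (key1, key2) are disambiguated with the default suffixes _x and _y; you can control these with the suffixes= argument. Also, merge is available as a DataFrame method, so the joins above can equivalently be written as, for example:

# method form of the inner join shown earlier
left.merge(right, how="inner", on=["key1", "key2"])

# cross join with custom suffixes for the overlapping column names
left.merge(right, how="cross", suffixes=("_left", "_right"))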

String Manipulation in Pandas

String manipulation is an essential part of data pre-processing and analysis in Pandas, especially when dealing with textual data. The str accessor in Pandas provides a wide range of string functions that allow you to clean, transform, and extract valuable information from text data within a DataFrame.

Here, we explore some of the most commonly used string manipulation functions available in Pandas for DataFrames.

  • str.replace() Replaces occurrences of a pattern (string or regex) with another string.

    cms.columns.str.replace(' ', '_')  # Replace spaces with "_"
    Index(['ID', 'Facility_Name', 'County', 'Hospital_Type', 'Star_Rating',
           'No_of_Surveys', 'Response_Rate', 'Overall_Rating'],
          dtype='object')
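
    In recent pandas versions, str.replace() treats the pattern as a literal string by default; pass regex=True to replace based on a regular expression. A small sketch (output not shown):

    # replace a trailing 'HOSPITAL' using a regular expression
    cms['Facility Name'].str.replace(r'HOSPITAL$', 'HOSP', regex=True)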
  • str.contains()

    Checks if each string contains a specified substring (can use regular expressions).

    mask = cms["Facility Name"].str.contains('GENERAL')
    cms[mask]
    ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
    6 151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
    8 490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
    13 501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
    col_mask = cms.columns.str.contains('Rating')
    cms.columns[col_mask]
    Index(['Star Rating', 'Overall Rating'], dtype='object')

    Checks if each string contains a specified regular expression pattern.

    mask_col = cms.columns.str.contains(r'[A-Z][a-z]*\s[A-Z][a-z]{3}$')
    cms.columns[mask_col]
    Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')

    Here, the regular expression [A-Z][a-z]*\s[A-Z][a-z]{3}$ can be broken down into smaller chunks for better understanding:

    • [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
    • [a-z]* matches zero or more lowercase characters from ‘a’ to ‘z’.
    • \s matches a whitespace character (here, the space between the two words).
    • [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
    • [a-z]{3} matches exactly three lowercase characters from ‘a’ to ‘z’.
    • $ matches the end of the string.

    Putting this together, the expression matches column names that end with two capitalized words separated by a space, where the final word is exactly four characters long (one uppercase letter followed by three lowercase letters). Thus, it matches the column names: Facility Name, Hospital Type, and Response Rate.

    If you’re unfamiliar with regular expressions, you can skip this section for now. However, interested readers can find many online resources to learn about regular expressions. One of my favorite online tools for building and testing regular expressions is https://regexr.com. You can use this tool to test the correctness of a regular expression.

  • str.startswith() and str.endswith() Checks if each string starts or ends with a specified substring.

    mask_county = cms.County.str.startswith('S')
    cms.County[mask_county]
    0     SAN DIEGO
    3     SEBASTIAN
    4        SHELBY
    12       SUMNER
    Name: County, dtype: object
    mask_col = cms.columns.str.endswith('e')
    cms.columns[mask_col]
    Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')
  • str.len() Returns the length of each string in the Series.

    cms['County'].str.len()
    0      9
    1      4
    2      4
    3      9
    4      6
    5      6
    6      6
    7      9
    8     14
    9      7
    10    11
    11    10
    12     6
    13     6
    14    11
    Name: County, dtype: int64
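
    Because the result is a numeric Series, it can be used directly as a boolean filter, for example to keep only rows whose county name is longer than 10 characters (a quick sketch, output not shown):

    # select rows with long county names
    cms[cms['County'].str.len() > 10]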
  • str.split() Splits each string in the Series into a list of substrings based on a specified delimiter.

    cms['Hospital Type'].str.split().str[0]  # Split the Hospital Type column and take the first part
    0        Acute
    1        Acute
    2        Acute
    3        Acute
    4        Acute
    5        Acute
    6     Critical
    7     Critical
    8        Acute
    9        Acute
    10       Acute
    11       Acute
    12       Acute
    13    Critical
    14       Acute
    Name: Hospital Type, dtype: object
  • str.join() Concatenates a list of strings into a single string with a specified separator.

    cms[['Facility Name', 'County']].apply(','.join, axis=1)  # Join facility name and county with a comma
    0            SCRIPPS GREEN HOSPITAL,SAN DIEGO
    1                    ST BERNARD HOSPITAL,COOK
    2                    SOUTH LAKE HOSPITAL,LAKE
    3         MERCY HOSPITAL FORT SMITH,SEBASTIAN
    4            BAPTIST MEMORIAL HOSPITAL,SHELBY
    5     ST JOSEPH REGIONAL HEALTH CENTER,BRAZOS
    6       GREENE COUNTY GENERAL HOSPITAL,GREENE
    7       SOUTHWEST MEMORIAL HOSPITAL,MONTEZUMA
    8     SENTARA GENERAL HOSPITAL,VIRGINIA BEACH
    9           PIEDMONT FAYETTE HOSPITAL,FAYETTE
    10     MISSION COMMUNITY HOSPITAL,LOS ANGELES
    11                DOCTORS HOSPITAL,MIAMI-DADE
    12      SUMNER REGIONAL MEDICAL CENTER,SUMNER
    13            WHIDBEY GENERAL HOSPITAL,ISLAND
    14      NORTHRIDGE MEDICAL CENTER,LOS ANGELES
    dtype: object
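
    Note that the example above uses Python's built-in ','.join applied row-wise via apply(); pandas' own Series.str.join() instead joins list-like elements within each cell. A small sketch combining it with str.split() (output not shown):

    # split each facility name into words, then join the words back with underscores
    cms['Facility Name'].str.split().str.join('_')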
  • str.find() Returns the lowest index of the substring; returns -1 if not found.

    cms['Facility Name'].str.find('MEMORIAL')
    0     -1
    1     -1
    2     -1
    3     -1
    4      8
    5     -1
    6     -1
    7     10
    8     -1
    9     -1
    10    -1
    11    -1
    12    -1
    13    -1
    14    -1
    Name: Facility Name, dtype: int64
  • str.zfill() Pads each string in the Series with zeros on the left to achieve a specified width.

    cms['ID'].astype(str).str.zfill(6)  # Pad each facility ID with leading zeros to 6 digits
    0     050424
    1     140103
    2     100051
    3     040062
    4     440048
    5     450011
    6     151317
    7     061327
    8     490057
    9     110215
    10    050704
    11    100296
    12    440003
    13    501339
    14    050116
    Name: ID, dtype: object
  • str.cat() Concatenates strings in the Series/Index with an optional separator.

    cms['Facility Name'].str.cat(cms['County'], sep='-')
    0            SCRIPPS GREEN HOSPITAL-SAN DIEGO
    1                    ST BERNARD HOSPITAL-COOK
    2                    SOUTH LAKE HOSPITAL-LAKE
    3         MERCY HOSPITAL FORT SMITH-SEBASTIAN
    4            BAPTIST MEMORIAL HOSPITAL-SHELBY
    5     ST JOSEPH REGIONAL HEALTH CENTER-BRAZOS
    6       GREENE COUNTY GENERAL HOSPITAL-GREENE
    7       SOUTHWEST MEMORIAL HOSPITAL-MONTEZUMA
    8     SENTARA GENERAL HOSPITAL-VIRGINIA BEACH
    9           PIEDMONT FAYETTE HOSPITAL-FAYETTE
    10     MISSION COMMUNITY HOSPITAL-LOS ANGELES
    11                DOCTORS HOSPITAL-MIAMI-DADE
    12      SUMNER REGIONAL MEDICAL CENTER-SUMNER
    13            WHIDBEY GENERAL HOSPITAL-ISLAND
    14      NORTHRIDGE MEDICAL CENTER-LOS ANGELES
    Name: Facility Name, dtype: object
  • str.lower() and str.upper()

    Converts all characters in a string to lowercase or uppercase, respectively.

    cms["Facility Name"].str.lower() # Convert to lowercase
    0               scripps green hospital
    1                  st bernard hospital
    2                  south lake hospital
    3            mercy hospital fort smith
    4            baptist memorial hospital
    5     st joseph regional health center
    6       greene county general hospital
    7          southwest memorial hospital
    8             sentara general hospital
    9            piedmont fayette hospital
    10          mission community hospital
    11                    doctors hospital
    12      sumner regional medical center
    13            whidbey general hospital
    14           northridge medical center
    Name: Facility Name, dtype: object
    cms["Hospital Type"].str.upper() # Convert to uppercase
    0          ACUTE CARE HOSPITAL
    1          ACUTE CARE HOSPITAL
    2          ACUTE CARE HOSPITAL
    3          ACUTE CARE HOSPITAL
    4          ACUTE CARE HOSPITAL
    5          ACUTE CARE HOSPITAL
    6     CRITICAL ACCESS HOSPITAL
    7     CRITICAL ACCESS HOSPITAL
    8          ACUTE CARE HOSPITAL
    9          ACUTE CARE HOSPITAL
    10         ACUTE CARE HOSPITAL
    11         ACUTE CARE HOSPITAL
    12         ACUTE CARE HOSPITAL
    13    CRITICAL ACCESS HOSPITAL
    14         ACUTE CARE HOSPITAL
    Name: Hospital Type, dtype: object
  • str.capitalize() and str.title()

    • str.capitalize() converts the first character of each string to uppercase and the rest to lowercase.
    • str.title() converts the first character of each word to uppercase and the rest to lowercase.
    cms['Facility Name'].str.capitalize()
    0               Scripps green hospital
    1                  St bernard hospital
    2                  South lake hospital
    3            Mercy hospital fort smith
    4            Baptist memorial hospital
    5     St joseph regional health center
    6       Greene county general hospital
    7          Southwest memorial hospital
    8             Sentara general hospital
    9            Piedmont fayette hospital
    10          Mission community hospital
    11                    Doctors hospital
    12      Sumner regional medical center
    13            Whidbey general hospital
    14           Northridge medical center
    Name: Facility Name, dtype: object
    cms['Facility Name'].str.title()
    0               Scripps Green Hospital
    1                  St Bernard Hospital
    2                  South Lake Hospital
    3            Mercy Hospital Fort Smith
    4            Baptist Memorial Hospital
    5     St Joseph Regional Health Center
    6       Greene County General Hospital
    7          Southwest Memorial Hospital
    8             Sentara General Hospital
    9            Piedmont Fayette Hospital
    10          Mission Community Hospital
    11                    Doctors Hospital
    12      Sumner Regional Medical Center
    13            Whidbey General Hospital
    14           Northridge Medical Center
    Name: Facility Name, dtype: object
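
These str methods can also be chained, which is convenient when cleaning text in several steps. For example, to produce title-cased, underscore-separated facility names in a single expression (a small sketch, output not shown):

# title-case the names, then replace spaces with underscores
cms['Facility Name'].str.title().str.replace(' ', '_')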

This concludes the data manipulation section. By the end of this section, you should now be able to confidently read and write datasets into and from DataFrames, pre-process and clean data, handle missing values, group, join, and transform data, and perform basic statistical analysis to gain insights into the underlying data.

In the next section, we will dive into data visualization, where you’ll learn how to create insightful and compelling charts and plots using the plotnine package to better communicate your findings.

