Data Manipulation with pandas

In this section, we will explore various data manipulation and analysis techniques using the Pandas library. We’ll begin by learning how to set, update, or replace data within a DataFrame. Then, we’ll cover how to handle missing data effectively. Next, we will delve into grouping data, transforming it, and joining different datasets to prepare for downstream analysis. Finally, we’ll look at generating summary statistics to gain a deeper understanding of the underlying data.

We will use the following DataFrames in this section.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.uniform(low=0, high=1, size=12),
        "B": pd.date_range(start="20230102", periods=12),
        "C": pd.Series(range(1, 13), index=["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8" ,"R9", "R10", "R11", "R12"]),
        "D": np.random.randint(1, high=100, size=12),
        "E": pd.Categorical(["red", "green", "blue", "white", "pink", "brown", "black", "purple", "orange", "grey", "violet", "yellow"]),
        "F": "foo",
    }, index = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8" ,"R9", "R10", "R11", "R12"]
)

cms = pd.read_csv('data/patient_satisfaction/cms_hospital_patient_satisfaction.csv')

Statistics on data

The pandas library provides a wide range of statistical functions and methods to compute summary statistics for your data. The following are some of the key statistical measures you can compute using this library.

Consider the following dataset, which contains 15 rows and 8 features (columns).

cms
ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
0 50424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
1 140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
2 100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
3 40062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
4 440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
5 450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3
6 151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
7 61327 SOUTHWEST MEMORIAL HOSPITAL MONTEZUMA Critical Access Hospital 4 247 34 3
8 490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
9 110215 PIEDMONT FAYETTE HOSPITAL FAYETTE Acute Care Hospital 2 1714 21 2
10 50704 MISSION COMMUNITY HOSPITAL LOS ANGELES Acute Care Hospital 3 241 14 3
11 100296 DOCTORS HOSPITAL MIAMI-DADE Acute Care Hospital 4 393 24 3
12 440003 SUMNER REGIONAL MEDICAL CENTER SUMNER Acute Care Hospital 4 680 35 2
13 501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
14 50116 NORTHRIDGE MEDICAL CENTER LOS ANGELES Acute Care Hospital 3 1110 20 2

You can compute the mean of a Series (a single column of data) or a DataFrame (a table of data) using the .mean() method.

  • Calculate the mean value for a certain column:

    cms['No of Surveys'].mean()
    1063.1333333333334
  • Calculate the mean value for each column:

    cms.mean(numeric_only = True)
    ID                211738.200000
    Star Rating            3.000000
    No of Surveys       1063.133333
    Response Rate         25.000000
    Overall Rating         2.733333
    dtype: float64
  • Calculate the mean value for each row:

    cms.mean(axis=1, numeric_only = True)
    0      10716.8
    1      28075.2
    2      20291.4
    3       8521.8
    4      88373.8
    5      90284.0
    6      30291.8
    7      12323.0
    8      98143.0
    9      22390.8
    10     10193.0
    11     20144.0
    12     88144.8
    13    100352.6
    14     10250.2
    dtype: float64
  • The median is the middle value of a dataset when the values are arranged in ascending order. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data. In Pandas, you can compute the median using the .median() method.

    cms.median(numeric_only = True)
    ID                110215.0
    Star Rating            3.0
    No of Surveys        680.0
    Response Rate         24.0
    Overall Rating         3.0
    dtype: float64
  • The standard deviation measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability. In Pandas, you can compute the standard deviation using the .std() method.

    cms.std(numeric_only = True)
    ID                188317.557348
    Star Rating            0.925820
    No of Surveys        909.054600
    Response Rate          9.296697
    Overall Rating         0.798809
    dtype: float64
  • Variance quantifies how much individual data points deviate from the mean. It is the square of the standard deviation. In Pandas, you can compute the variance using the .var() method.

    cms.var(numeric_only = True)
    ID                3.546350e+10
    Star Rating       8.571429e-01
    No of Surveys     8.263803e+05
    Response Rate     8.642857e+01
    Overall Rating    6.380952e-01
    dtype: float64
  • You can also compute the sum of values using .sum() and count the total number of non-missing values using .count().

    cms.sum(numeric_only = True)
    ID                3176073
    Star Rating            45
    No of Surveys       15947
    Response Rate         375
    Overall Rating         41
    dtype: int64
    cms.count()
    ID                15
    Facility Name     15
    County            15
    Hospital Type     15
    Star Rating       15
    No of Surveys     15
    Response Rate     15
    Overall Rating    15
    dtype: int64
    cms["Overall Rating"].sum()
    41

Here is a quick reference summary table of common useful functions.

Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum
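As a quick illustration, here are a few of these functions applied to the cms DataFrame loaded earlier (a minimal sketch; the exact results depend on the data file):

cms['No of Surveys'].quantile(0.75)   # 75th percentile of the number of surveys
cms['No of Surveys'].cumsum()         # running total down the column
cms['Response Rate'].sem()            # standard error of the mean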

Another useful method for counting the frequency of values, value_counts(), is shown below.

Consider the DataFrame,

d = {
  "a": pd.Series(np.random.randint(0, 5, size=10)),
  "b": pd.Series(np.random.randint(-3, 3, size=10))
}
df2 = pd.DataFrame(d)
df2
a b
0 3 -2
1 1 2
2 0 0
3 1 -1
4 0 2
5 4 -3
6 2 -3
7 3 -3
8 3 1
9 0 0
  • Frequency of values in all rows:

    df2.value_counts()
    a  b 
    0   0    2
        2    1
    1  -1    1
        2    1
    2  -3    1
    3  -3    1
       -2    1
        1    1
    4  -3    1
    Name: count, dtype: int64
  • Frequency of values in a single column:

    df2['a'].value_counts()
    a
    3    3
    0    3
    1    2
    4    1
    2    1
    Name: count, dtype: int64

Additionally, two powerful functions, agg and transform, allow you to perform calculations and transformations on DataFrames. These two functions apply a user-defined function that reduces or broadcasts its results, respectively. However, they serve different purposes and have distinct use cases, which we will explore in detail later.
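As a quick preview, here is a minimal sketch using the df2 DataFrame defined above (the exact numbers depend on the random values generated):

df2.agg(['min', 'max'])                  # agg reduces: one row per aggregation
df2.transform(lambda x: x - x.mean())    # transform broadcasts: same shape as df2, values centred per column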

Missing data

Handling missing values is an essential part of data pre-processing and analysis in Pandas. Missing values can arise due to various reasons, such as data collection errors, incomplete data or sensor failures.

Let’s create a DataFrame and then introduce some missing values into it.

Consider the following DataFrame:

df = pd.DataFrame(np.random.randn(6, 4), index=['a', 'b', 'c', 'd', 'e', 'f'], columns=list("ABCD"))

dates = pd.date_range("20130101", periods=6)
df["E"] = dates

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])
df["F"] = s1
df
A B C D E F
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6

Reindexing allows you to change, add, or delete the index on a specified axis. This returns a copy of the data in which the newly added rows and columns are filled with missing values (a few of which we then set explicitly).

df1 = df.reindex(index=['a', 'b', 'c', 'h', 'd', 'e', 'f', 'g'], columns=list(df.columns) + ["G", "H"])
df1.loc['c' : 'f', "G"] = 1
df1.loc['h','C'] = 0.634336
df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN NaN NaN NaT NaN NaN NaN

Detecting missing values

Pandas provides several methods for detecting and dealing with missing values.

isna() and isnull() methods return a DataFrame of the same shape as the input, with Boolean values indicating whether each element is missing (True) or not (False).

pd.isna(df1)
A B C D E F G H
a False False False False False False True True
b False False False False False False True True
c False False False False False False False True
h True True False True True True False True
d False False False False False False False True
e False False False False False False False True
f False False False False False False False True
g True True True True True True True True
pd.isnull(df1)
A B C D E F G H
a False False False False False False True True
b False False False False False False True True
c False False False False False False False True
h True True False True True True False True
d False False False False False False False True
e False False False False False False False True
f False False False False False False False True
g True True True True True True True True

The notna() and notnull() methods are the opposite of isna() and isnull(). They return True for non-missing values and False for missing values.

pd.notna(df1)
A B C D E F G H
a True True True True True True False False
b True True True True True True False False
c True True True True True True True False
h False False True False False False True False
d True True True True True True True False
e True True True True True True True False
f True True True True True True True False
g False False False False False False False False
pd.notnull(df1)
A B C D E F G H
a True True True True True True False False
b True True True True True True False False
c True True True True True True True False
h False False True False False False True False
d True True True True True True True False
e True True True True True True True False
f True True True True True True True False
g False False False False False False False False

Check which values in a column are not missing:

df1["G"].notna()
a    False
b    False
c     True
h     True
d     True
e     True
f     True
g    False
Name: G, dtype: bool

Subset the rows to return only those that do not contain missing values in column G:

df1[df1["G"].notna()]
A B C D E F G H
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN

The info() method provides a summary of the DataFrame, including the count of non-null values in each column.

df1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to g
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       6 non-null      float64       
 1   B       6 non-null      float64       
 2   C       7 non-null      float64       
 3   D       6 non-null      float64       
 4   E       6 non-null      datetime64[ns]
 5   F       6 non-null      float64       
 6   G       5 non-null      float64       
 7   H       0 non-null      float64       
dtypes: datetime64[ns](1), float64(7)
memory usage: 876.0+ bytes
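Another handy pattern for detecting missing data combines isna() with sum() to count the missing values in each column, complementing the non-null counts reported by info(). A small sketch using df1 from above:

df1.isna().sum()   # number of missing values per column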

Handling missing values

Once you’ve detected missing values, you can choose from several strategies to handle them.

Consider the following DataFrame,

df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN NaN NaN NaT NaN NaN NaN

Dropping missing values

Consider the following DataFrame for this example,

# dataframe excluding column H
df_without_colH = df1.loc[:, "A":"G"]   
df_without_colH
A B C D E F G
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
h NaN NaN 0.634336 NaN NaT NaN 1.0
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
g NaN NaN NaN NaN NaT NaN NaN
Dropping missing values by rows
  • Drop the rows where at least one element is missing:

    df_without_colH.dropna()
    A B C D E F G
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Drop rows that have all missing data:

    df_without_colH.dropna(how="all")
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Drop rows that have any missing data:

    df_without_colH.dropna(how="any")
    A B C D E F G
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
  • Keep only the rows with at least 2 non-NA values:

    df_without_colH.dropna(thresh=2)
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
Dropping missing values by columns
  • Consider the following DataFrame for the remaining examples,

    df1.loc['g','C'] = 0.877525
    df1
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT NaN NaN NaN
  • Drop the columns where at least one element is missing:

    df1.dropna(axis='columns') # same as axis=1
    C
    a -0.642234
    b -0.931715
    c 0.382713
    h 0.634336
    d -0.047471
    e -1.189557
    f 1.405489
    g 0.877525
  • Drop columns that have all missing data:

    df1.dropna(how="all", axis=1)
    A B C D E F G
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0
    h NaN NaN 0.634336 NaN NaT NaN 1.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0
    g NaN NaN 0.877525 NaN NaT NaN NaN
  • Drop columns that have any missing data (shown as NaN):

    df1.dropna(how="any", axis=1)
    C
    a -0.642234
    b -0.931715
    c 0.382713
    h 0.634336
    d -0.047471
    e -1.189557
    f 1.405489
    g 0.877525
  • Keep only the columns with at least 6 non-NA values:

    df1.dropna(thresh=6, axis=1)
    A B C D E F
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0
    h NaN NaN 0.634336 NaN NaT NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0
    g NaN NaN 0.877525 NaN NaT NaN
  • Define in which columns to look for missing values:

    df1.dropna(subset=["B", "G"])
    A B C D E F G H
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN

Filling missing values

You can fill missing values with a specific value or a calculated value using the fillna() method.

Consider the following DataFrame,

df1
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
h NaN NaN 0.634336 NaN NaT NaN 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
g NaN NaN 0.877525 NaN NaT NaN NaN NaN
Replace missing values with a scalar value
  • Fill missing values with 0:

    df1.fillna(0)
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 0.0 0.0
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 0.0 0.0
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 0.0
    h 0.000000 0.000000 0.634336 0.000000 0 0.0 1.0 0.0
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 0.0
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 0.0
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 0.0
    g 0.000000 0.000000 0.877525 0.000000 0 0.0 0.0 0.0
  • Fill missing values with a scalar along the columns, but limit the number of values filled in each row to 2:

    df1.fillna(12, axis='columns', limit=2)
    A B C D E F G H
    a 0.42335 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 12 12
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 12 12
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 12
    h 12 12 0.634336 NaN NaT NaN 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 12
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 12
    f -0.808026 1.027726 1.405489 0.72493 2013-01-06 00:00:00 6.0 1.0 12
    g 12 12 0.877525 NaN NaT NaN NaN NaN
Fill gaps forward or backward
  • Propagate the last valid observation forward to fill missing values:

    df1.ffill() # ffill ≡ pad ≡ fillna(method='ffill')
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 NaN NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 NaN NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h 0.563209 -1.123406 0.634336 1.210963 2013-01-03 3.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g -0.808026 1.027726 0.877525 0.724930 2013-01-06 6.0 1.0 NaN
  • Propagate the next valid observation backward to fill missing values:

    df1.bfill() # bfill ≡ backfill ≡ fillna(method='bfill')
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h -1.435517 0.916215 0.634336 0.646783 2013-01-04 4.0 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT NaN NaN NaN
Fill with a Pandas object
  • Fill the missing values of a column with the mean of the column:

    df1.fillna(df1.mean())
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 NaN
    h 0.045313 0.521347 0.634336 0.559000 2013-01-03 12:00:00 3.5 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
    g 0.045313 0.521347 0.877525 0.559000 2013-01-03 12:00:00 3.5 1.0 NaN
  • Fill only a subset of the columns (here F and G) using the corresponding column medians:

    df1.fillna(df1.median()["F":"G"])
    A B C D E F G H
    a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 1.0 1.0 NaN
    b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 2.0 1.0 NaN
    c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 3.0 1.0 NaN
    h NaN NaN 0.634336 NaN NaT 3.5 1.0 NaN
    d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 4.0 1.0 NaN
    e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 5.0 1.0 NaN
    f -0.808026 1.027726 1.405489 0.724930 2013-01-06 6.0 1.0 NaN
    g NaN NaN 0.877525 NaN NaT 3.5 1.0 NaN
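fillna() also accepts a dict mapping column names to fill values, so different columns can be filled differently in a single call. A minimal sketch, assuming df1 as above:

# fill column G with 0 and column F with the median of F; other columns are left untouched
df1.fillna({"G": 0, "F": df1["F"].median()})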
Interpolation

The interpolate method is used to fill in missing values with estimated values based on the surrounding data points. It’s particularly useful for time series data or datasets where values are expected to follow a pattern.

The interpolate method employs various interpolation techniques to estimate missing values, including linear interpolation, polynomial interpolation, and more. The specific interpolation method used depends on the kind of data and the desired behavior.

  • Linear Interpolation: Linear interpolation is the default method used by interpolate. It estimates missing values by drawing a straight line between two adjacent known data points and filling in the missing value with a value along that line. This method is suitable for data that appears to change linearly over time or between data points.
  • Polynomial Interpolation: Polynomial interpolation uses higher-degree polynomial functions to estimate missing values. This method can capture more complex relationships between data points but may be susceptible to overfitting if not used carefully.
  • Time-Based Interpolation: When working with time series data, you can use the method parameter to specify time-based interpolation methods such as ‘time’, ‘index’, or ‘values’. These methods consider the time or index values to estimate missing values.

Here’s an example using linear interpolation:

df1.interpolate()
A B C D E F G H
a 0.423350 0.143592 -0.642234 0.345545 2013-01-01 00:00:00 1.0 NaN NaN
b 0.165784 1.102192 -0.931715 -1.135617 2013-01-02 00:00:00 2.0 NaN NaN
c 0.563209 -1.123406 0.382713 1.210963 2013-01-03 00:00:00 3.0 1.0 NaN
h -0.436154 -0.103596 0.634336 0.928873 2013-01-03 12:00:00 3.5 1.0 NaN
d -1.435517 0.916215 -0.047471 0.646783 2013-01-04 00:00:00 4.0 1.0 NaN
e 1.363078 1.061762 -1.189557 1.561393 2013-01-05 00:00:00 5.0 1.0 NaN
f -0.808026 1.027726 1.405489 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
g -0.808026 1.027726 0.877525 0.724930 2013-01-06 00:00:00 6.0 1.0 NaN
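Other techniques can be selected with the method= parameter. As a minimal sketch on a small, made-up Series (polynomial interpolation requires SciPy and an order= argument):

s = pd.Series([1.0, np.nan, np.nan, 10.0, 15.0])
s.interpolate()                               # linear interpolation (the default)
s.interpolate(method='polynomial', order=2)   # fit a quadratic through the known points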

Grouping data

The groupby function is a powerful feature within pandas that allows you to group and aggregate data in a DataFrame based on one or more columns. This can be especially useful when you want to perform summary statistics, calculations, or transformations on subsets of your data based on certain criteria. It involves one or more of the following steps:

  • Splitting: The first step in a groupby operation is to split the DataFrame into groups based on the values in one or more columns. You specify the column(s) by which you want to group your data. This column is often referred to as the “key” or “grouping column”. Each unique value in the grouping column(s) forms a group, and the rows in the DataFrame are distributed among these groups.
  • Applying a Function: After splitting the data into groups, you can apply various aggregation or transformation functions to each group. These functions are typically applied to one or more columns in each group. Common aggregation functions include sum, mean, count, min, max, and more. You can also apply custom functions or perform complex operations on the grouped data.
    • Aggregation: compute a summary statistic (or statistics) for each group. Examples: compute group sums or means, compute group size.
    • Transformation: perform some group-specific computations and return a like-indexed object. Examples: standardize data (zscore) within a group, filling NAs within groups with a value derived from each group.
    • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Examples: discard data that belong to groups with only a few members, filter out data based on the group sum or mean.
  • Combining Results: Once the specified function(s) are applied to each group, the results are combined into a new DataFrame or data structure. This final result will often have a hierarchical structure, with the grouping columns as index levels.
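To make these steps concrete before moving to a real dataset, here is a tiny sketch with made-up values:

# split by 'group', apply mean() to each group's 'value' column, combine into a Series
toy = pd.DataFrame({"group": ["x", "x", "y", "y", "y"], "value": [1, 2, 3, 4, 5]})
toy.groupby("group")["value"].mean()
# group
# x    1.5
# y    4.0
# Name: value, dtype: float64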

Consider the following metabric dataset for the examples listed below.

# Load the Metabric dataset from the URL into a DataFrame
metabric = pd.read_csv("https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv")
metabric
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... 4ER+ NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 4ER+ 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
2 MB-0005 1 48.87 163.700000 DECEASED Died of Disease YES NO 15.0 2.0 ... 3 2.0 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115
3 MB-0006 1 47.68 164.933333 LIVING Living YES YES 25.0 2.0 ... 9 1.0 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181
4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 9 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1899 MB-7295 4 43.10 196.866667 LIVING Living NO YES 25.0 NaN ... 3 4.0 9.591235 9.935179 7.984515 6.753291 6.279207 9.207323 11.281194 11.337601
1900 MB-7296 4 42.88 44.733333 DECEASED Died of Disease NO YES 20.0 NaN ... 5 6.0 9.733986 13.753037 5.616082 6.271912 5.999093 9.530390 11.532033 11.626140
1901 MB-7297 4 62.90 175.966667 DECEASED Died of Disease NO YES 25.0 NaN ... 1 4.0 11.053198 10.228570 7.478069 6.212256 6.192399 9.540589 11.482761 11.180360
1902 MB-7298 4 61.16 86.233333 DECEASED Died of Other Causes NO NO 25.0 NaN ... 1 15.0 11.055114 9.892589 8.282737 6.466712 6.287254 10.365901 11.371176 12.827069
1903 MB-7299 4 60.02 201.900000 DECEASED Died of Other Causes NO YES 20.0 NaN ... 10 3.0 10.696475 10.227787 5.533486 6.180511 6.208784 9.749368 10.867527 9.847856

1904 rows × 32 columns

Aggregation

This is primarily used for aggregating data (within groups when you use groupby). It allows you to apply one or more aggregation functions to each group and obtain a summarized result for each group.

  • Calculate the total number of patients, categorized by cancer type.

    metabric.groupby("Cancer_type")["Patient_ID"].count()
    Patient_ID
    Cancer_type
    Breast 17
    Breast Invasive Ductal Carcinoma 1500
    Breast Invasive Lobular Carcinoma 142
    Breast Invasive Mixed Mucinous Carcinoma 22
    Breast Mixed Ductal and Lobular Carcinoma 207
    Metaplastic Breast Cancer 1

    count() computes the number of non-NA values in the groups whereas size() computes the number of values in each group. To demonstrate the difference between these two methods, group the metabric dataset by 3-gene_classifier and find the number of values in each group based on the Tumour_size column.

    metabric.groupby("3-gene_classifier")["Tumour_size"].count()
    Tumour_size
    3-gene_classifier
    ER+/HER2- High Prolif 596
    ER+/HER2- Low Prolif 615
    ER-/HER2- 287
    HER2+ 186
    metabric.groupby("3-gene_classifier")["Tumour_size"].size()
    Tumour_size
    3-gene_classifier
    ER+/HER2- High Prolif 603
    ER+/HER2- Low Prolif 619
    ER-/HER2- 290
    HER2+ 188
  • Determine the median age at diagnosis for each type of vital status.

    metabric.groupby("Vital_status")["Age_at_diagnosis"].median()
    Age_at_diagnosis
    Vital_status
    Died of Disease 61.135
    Died of Other Causes 70.505
    Living 56.740
  • Calculate the total, mean, and standard deviation of Tumour_size for each combination of cancer type and ER status.

    grouped = metabric.groupby(["Cancer_type", "ER_status"])
    grouped["Tumour_size"].agg(['sum','mean','std'])
    sum mean std
    Cancer_type ER_status
    Breast Negative 173.00 43.250000 17.951323
    Positive 289.00 22.230769 13.417363
    Breast Invasive Ductal Carcinoma Negative 11241.77 28.388308 19.209870
    Positive 27468.87 25.154643 12.961027
    Breast Invasive Lobular Carcinoma Negative 588.80 29.440000 19.155277
    Positive 3579.00 30.075630 19.792221
    Breast Invasive Mixed Mucinous Carcinoma Negative 40.00 40.000000 NaN
    Positive 518.00 25.900000 15.403007
    Breast Mixed Ductal and Lobular Carcinoma Negative 326.00 25.076923 13.174860
    Positive 4832.00 25.298429 12.900426
    Metaplastic Breast Cancer Negative 16.00 16.000000 NaN
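agg() also accepts a dict mapping column names to functions, so different columns can be summarised with different statistics within the same groups. A small sketch reusing the grouped object from above:

# mean tumour size and total mutation count per (Cancer_type, ER_status) group
grouped.agg({"Tumour_size": "mean", "Mutation_count": "sum"})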

Transformation

This is used for transformations that preserve the shape of the data. It applies a given function to each column (or, with groupby, to each group) and returns a result with the same index as the original DataFrame or Series. This function is commonly used when you want to broadcast a computed value back to the original DataFrame, maintaining the original shape of the data.

  • Compute the difference between adjacent values of expression of ESR1 within each group of cancer types.

    metabric.groupby("Cancer_type")["ESR1"].diff()
    ESR1
    0 NaN
    1 1.117242
    2 -0.005778
    3 NaN
    4 0.871896
    ... ...
    1899 -1.833044
    1900 -1.556991
    1901 1.319212
    1902 0.001916
    1903 -0.358639

    1904 rows × 1 columns

  • Compute the cumulative sum of mutation count within each chemotherapy group.

    metabric.groupby("Chemotherapy")["Mutation_count"].transform("cumsum")
    Mutation_count
    0 NaN
    1 2.0
    2 2.0
    3 3.0
    4 5.0
    ... ...
    1899 8583.0
    1900 8589.0
    1901 8593.0
    1902 8608.0
    1903 8611.0

    1904 rows × 1 columns

    • Convert the survival time, which is given in months, to years.
    metabric["Survival_time"].transform(lambda x: x / 12)
    0       11.708333
    1        7.052778
    2       13.641667
    3       13.744444
    4        3.447222
              ...    
    1899    16.405556
    1900     3.727778
    1901    14.663889
    1902     7.186111
    1903    16.825000
    Name: Survival_time, Length: 1904, dtype: float64

    A lambda function in Python is a small, anonymous, and inline function. It is also known as a lambda expression or lambda notation. Lambda functions are a way to create small, one-time-use functions without needing to define them using the def keyword. Lambda functions are typically used for short, simple operations where defining a full function using def would be overly verbose.

    lambda arguments: expression

    Here’s a simple example to illustrate the use of lambda functions:

    # Regular function to calculate the square of a number
    def square(x):
        return x ** 2
    
    # Equivalent lambda function to calculate the square of a number
    square_lambda = lambda x: x ** 2

Filtration

This is the process of selecting rows from a DataFrame based on grouping. It may filter out entire groups, parts of groups, or both.
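One way to discard entire groups is the groupby filter() method, which keeps only the groups for which a function returns True. A minimal sketch using the metabric DataFrame (the threshold of 100 patients is arbitrary):

# keep only rows belonging to cancer types with more than 100 patients
metabric.groupby("Cancer_type").filter(lambda g: len(g) > 100)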

  • Select the row at position 3 (the 4th row, since positions are zero-based) of each group categorized by cancer type.

    metabric.groupby("Cancer_type").nth(3)
    Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
    5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164
    15 MB-0050 1 44.64 75.333333 LIVING Living YES YES 33.0 2.0 ... 8 NaN 9.492446 10.399286 6.665170 6.843690 5.617014 9.396238 11.188134 12.243680
    32 MB-0102 1 51.38 140.766667 DECEASED Died of Disease YES YES 40.0 2.0 ... 3 4.0 9.913163 9.980732 7.651020 6.431794 5.970082 9.402294 11.052930 11.474559
    230 MB-0365 1 76.84 87.233333 DECEASED Died of Disease NO YES 15.0 1.0 ... 4ER+ 6.0 8.365138 12.345780 5.370492 6.046610 6.092679 8.708236 10.604942 10.728391
    429 MB-0620 1 62.97 112.800000 LIVING Living YES YES 14.0 2.0 ... 4ER+ NaN 8.449960 9.666006 5.468108 6.080813 6.233319 8.377827 8.580251 9.925870

    5 rows × 32 columns

  • Print the first rows (5 per group by default) of each group categorized by cancer type and 3-gene classifier.

    metabric.groupby(["Cancer_type", "3-gene_classifier"]).head()
    Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
    0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... 4ER+ NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
    1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 4ER+ 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
    4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 9 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
    5 MB-0010 1 78.77 7.800000 DECEASED Died of Disease NO YES 31.0 4.0 ... 7 4.0 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164
    8 MB-0028 1 86.41 36.566667 DECEASED Died of Other Causes NO YES 16.0 2.0 ... 9 4.0 12.521038 10.678267 5.325554 6.220372 6.250678 10.260059 12.148375 10.936002
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    1524 MB-6047 5 72.94 37.866667 DECEASED Died of Disease NO NO 15.0 NaN ... 9 14.0 9.457621 10.750387 6.588294 6.442729 6.114882 9.633760 10.743632 10.220908
    1607 MB-6212 5 61.81 224.233333 LIVING Living NO YES 15.0 1.0 ... 7 4.0 11.720822 10.060722 6.324464 6.606875 5.476758 10.147003 11.494327 12.624759
    1651 MB-6314 5 68.57 134.833333 LIVING Living NO NO 14.0 NaN ... 1 5.0 10.128997 13.637355 5.249247 6.975344 5.938845 10.767529 12.502447 11.903197
    1696 MB-7030 4 74.21 31.333333 DECEASED Died of Other Causes NO YES 50.0 NaN ... 4ER+ 5.0 6.192748 10.030603 4.969329 7.290813 5.523130 6.343169 6.251026 6.423873
    1856 MB-7235 4 62.45 170.800000 LIVING Living NO NO 20.0 NaN ... 8 12.0 11.854216 11.241157 5.898537 6.665210 5.863743 10.676374 11.860498 11.966185

    84 rows × 32 columns

apply function

Some operations on the grouped data might not fit into the aggregation, transformation, or filtration categories. For these, you can use the apply function.

  • Select all patients with age between 50 and 70 within each group categorized by ER status.
metabric.groupby('ER_status')["Age_at_diagnosis"].apply(lambda x: x[(x >= 50) & (x <= 70)])
ER_status      
Negative   20      52.14
           24      50.42
           30      68.68
           52      52.11
           54      63.53
                   ...  
Positive   1897    56.90
           1898    59.20
           1901    62.90
           1902    61.16
           1903    60.02
Name: Age_at_diagnosis, Length: 983, dtype: float64
# Filter the DataFrame directly based on 'Age_at_diagnosis' and then group by 'ER_status'
filtered_df = metabric[(metabric['Age_at_diagnosis'] >= 50) & (metabric['Age_at_diagnosis'] <= 70)].groupby('ER_status')

# Display each group in the filtered DataFrame
for er_status, group in filtered_df:
    print(f"ER Status: {er_status}")
    print(group)
    print()  # For better readability
ER Status: Negative
     Patient_ID  Cohort  Age_at_diagnosis  Survival_time Survival_status  \
20      MB-0062       1             52.14     153.966667          LIVING   
24      MB-0079       1             50.42      28.500000        DECEASED   
30      MB-0100       1             68.68       8.066667        DECEASED   
52      MB-0127       1             52.11     132.066667          LIVING   
54      MB-0129       1             63.53      38.566667        DECEASED   
...         ...     ...               ...            ...             ...   
1865    MB-7251       4             54.66      21.566667        DECEASED   
1869    MB-7256       4             50.92      34.700000        DECEASED   
1879    MB-7270       4             50.84     175.166667          LIVING   
1887    MB-7281       4             51.22      49.533333        DECEASED   
1895    MB-7291       4             53.87       6.833333        DECEASED   

              Vital_status Chemotherapy Radiotherapy  Tumour_size  \
20                  Living          YES          YES        17.00   
24         Died of Disease          YES          YES        40.00   
30         Died of Disease          YES          YES        39.00   
52                  Living          YES          YES        13.00   
54    Died of Other Causes          YES          YES        24.00   
...                    ...          ...          ...          ...   
1865       Died of Disease           NO           NO        15.47   
1869       Died of Disease          YES          YES        40.00   
1879                Living          YES           NO        16.00   
1887       Died of Disease          YES          YES        16.00   
1895       Died of Disease           NO           NO        50.00   

      Tumour_stage  ...  Integrative_cluster  Mutation_count      ESR1  \
20             1.0  ...                   10             3.0  5.965770   
24             2.0  ...                   10             4.0  6.312633   
30             2.0  ...                   10             1.0  6.204958   
52             2.0  ...                    3             5.0  7.383420   
54             2.0  ...                    5             NaN  6.217353   
...            ...  ...                  ...             ...       ...   
1865           NaN  ...                    5             1.0  6.552171   
1869           NaN  ...                    5             4.0  6.425775   
1879           NaN  ...                   10             8.0  5.996827   
1887           NaN  ...                    5             3.0  6.074817   
1895           NaN  ...                    1             4.0  6.096158   

          ERBB2       PGR      TP53    PIK3CA     GATA3      FOXA1       MLPH  
20     7.789817  5.164429  5.444836  7.332574  5.987214   6.461705   6.266163  
24     9.068778  5.305683  6.673993  7.068624  5.667747  10.690917   9.056929  
30     8.881671  5.172111  7.072938  6.189728  5.687771   6.311955   7.963707  
52    11.011174  5.368214  5.877772  5.997610  8.749941  10.912581  11.732541  
54    13.495315  5.582046  5.678390  6.556857  7.141058  10.474060  10.331051  
...         ...       ...       ...       ...       ...        ...        ...  
1865  14.224677  5.078956  6.921696  5.385871  9.574542  11.608117  11.280612  
1869  14.464282  5.356381  6.107157  6.480691  8.570477  11.123834  12.386091  
1879   9.515256  5.261880  6.981713  5.724471  7.574704  10.685767   9.210209  
1887  12.545156  5.218277  6.423540  5.856297  7.084129  11.029630  10.737725  
1895  12.600238  5.260572  6.351516  5.829438  8.116684  10.298983  10.747426  

[221 rows x 32 columns]

ER Status: Positive
     Patient_ID  Cohort  Age_at_diagnosis  Survival_time Survival_status  \
6       MB-0014       1             56.45     164.333333          LIVING   
14      MB-0048       1             51.46     103.833333          LIVING   
17      MB-0054       1             66.91     160.300000          LIVING   
18      MB-0056       1             62.62      62.866667          LIVING   
21      MB-0064       1             69.13     108.933333          LIVING   
...         ...     ...               ...            ...             ...   
1897    MB-7293       4             56.90     199.233333          LIVING   
1898    MB-7294       4             59.20      82.733333        DECEASED   
1901    MB-7297       4             62.90     175.966667        DECEASED   
1902    MB-7298       4             61.16      86.233333        DECEASED   
1903    MB-7299       4             60.02     201.900000        DECEASED   

              Vital_status Chemotherapy Radiotherapy  Tumour_size  \
6                   Living          YES          YES         10.0   
14                  Living          YES          YES         25.0   
17                  Living           NO          YES         36.0   
18                  Living           NO           NO         29.0   
21                  Living           NO           NO         18.0   
...                    ...          ...          ...          ...   
1897                Living           NO           NO         45.0   
1898       Died of Disease           NO           NO         15.0   
1901       Died of Disease           NO          YES         25.0   
1902  Died of Other Causes           NO           NO         25.0   
1903  Died of Other Causes           NO          YES         20.0   

      Tumour_stage  ...  Integrative_cluster  Mutation_count       ESR1  \
6              2.0  ...                    3             4.0  10.793832   
14             2.0  ...                 4ER+             NaN   8.101955   
17             2.0  ...                   10             3.0  10.194473   
18             1.0  ...                    1             4.0  11.560030   
21             1.0  ...                    3             NaN  11.288632   
...            ...  ...                  ...             ...        ...   
1897           NaN  ...                    3             5.0  10.879891   
1898           NaN  ...                    1             2.0  11.290976   
1901           NaN  ...                    1             4.0  11.053198   
1902           NaN  ...                    1            15.0  11.055114   
1903           NaN  ...                   10             3.0  10.696475   

          ERBB2       PGR      TP53    PIK3CA      GATA3      FOXA1       MLPH  
6      9.276507  7.720952  5.992706  7.481835   8.365527  11.482627  10.755199  
14    12.847992  6.114007  6.397985  7.779824   7.294771  10.963978   9.063062  
17     9.451855  6.769441  5.350892  7.564274   7.465016  10.822652   8.665371  
18     8.674760  6.336389  5.845364  7.234934   8.838840  10.905011   9.829488  
21     8.367936  8.161056  6.433638  6.412103   8.528343  11.847914  10.443699  
...         ...       ...       ...       ...        ...        ...        ...  
1897  10.219154  6.431113  5.791022  6.000993   9.551231  11.302246  11.301274  
1898  10.846545  7.312247  5.660943  6.190000   9.424235  11.075688  11.567166  
1901  10.228570  7.478069  6.212256  6.192399   9.540589  11.482761  11.180360  
1902   9.892589  8.282737  6.466712  6.287254  10.365901  11.371176  12.827069  
1903  10.227787  5.533486  6.180511  6.208784   9.749368  10.867527   9.847856  

[762 rows x 32 columns]

assign function

This function is used to create new columns or modify existing columns in a DataFrame in a concise and flexible way. It returns a new DataFrame with the added or updated columns while keeping the original DataFrame unchanged, unless explicitly reassigned.

Create a new column named Tumour_size_cm in the metabric DataFrame to represent tumor size in centimeters.

metabric.assign(Tumour_size_cm = metabric.Tumour_size / 10)
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage ... Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH Tumour_size_cm
0 MB-0000 1 75.65 140.500000 LIVING Living NO YES 22.0 2.0 ... NaN 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728 2.2
1 MB-0002 1 43.19 84.633333 LIVING Living NO YES 10.0 1.0 ... 2.0 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570 1.0
2 MB-0005 1 48.87 163.700000 DECEASED Died of Disease YES NO 15.0 2.0 ... 2.0 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115 1.5
3 MB-0006 1 47.68 164.933333 LIVING Living YES YES 25.0 2.0 ... 1.0 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181 2.5
4 MB-0008 1 76.97 41.366667 DECEASED Died of Disease YES YES 40.0 2.0 ... 2.0 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1899 MB-7295 4 43.10 196.866667 LIVING Living NO YES 25.0 NaN ... 4.0 9.591235 9.935179 7.984515 6.753291 6.279207 9.207323 11.281194 11.337601 2.5
1900 MB-7296 4 42.88 44.733333 DECEASED Died of Disease NO YES 20.0 NaN ... 6.0 9.733986 13.753037 5.616082 6.271912 5.999093 9.530390 11.532033 11.626140 2.0
1901 MB-7297 4 62.90 175.966667 DECEASED Died of Disease NO YES 25.0 NaN ... 4.0 11.053198 10.228570 7.478069 6.212256 6.192399 9.540589 11.482761 11.180360 2.5
1902 MB-7298 4 61.16 86.233333 DECEASED Died of Other Causes NO NO 25.0 NaN ... 15.0 11.055114 9.892589 8.282737 6.466712 6.287254 10.365901 11.371176 12.827069 2.5
1903 MB-7299 4 60.02 201.900000 DECEASED Died of Other Causes NO YES 20.0 NaN ... 3.0 10.696475 10.227787 5.533486 6.180511 6.208784 9.749368 10.867527 9.847856 2.0

1904 rows × 33 columns
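assign() can also take callables, which are evaluated on the DataFrame being assigned to, and can create several columns in one call. A small sketch extending the example above (the Survival_years column is illustrative):

metabric.assign(
    Tumour_size_cm=lambda d: d.Tumour_size / 10,      # tumour size in centimetres
    Survival_years=lambda d: d.Survival_time / 12,    # survival time in years
)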

Tidy Data

Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows, and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data are designed to promote consistency and ease of use in data analysis.

Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.

For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.

Wide format is intuitive for data entry, but less so for data analysis. Consider calculating the mean for each month: where would you place it? Would it be another row?
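To make the distinction concrete, here is a toy wide table in the spirit of AirPassengers (the numbers are made up), together with a preview of the long format introduced below, where the monthly mean becomes a simple group-wise summary:

# one row per year, one column per month (wide format)
wide = pd.DataFrame({"Year": [1949, 1950], "Jan": [112, 115], "Feb": [118, 126]})

# long format: one row per (Year, Month) observation
long = wide.melt(id_vars="Year", var_name="Month", value_name="Passengers")
long.groupby("Month")["Passengers"].mean()   # the monthly mean fits naturally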

Data needs to be reshaped to conform to the tidy data structure. It involves using two primary verbs (or pairs of opposites):

  • Gather columns into rows (melt()).
  • Spread rows into columns (pivot()).

Converting data from wide to long format

First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).

counts = pd.read_csv("data/GSE60450_normalized_data.csv")
counts.head()
Output
X gene_symbol GSM1480291 GSM1480292 GSM1480293 GSM1480294 GSM1480295 GSM1480296 GSM1480297 GSM1480298 GSM1480299 GSM1480300 GSM1480301 GSM1480302
0 ENSMUSG00000031805 Jak3 82.22869 81.18604 36.13808 36.62751 12.35039 11.85235 10.59006 14.88337 7.57182 7.05763 3.16982 4.25451
1 ENSMUSG00000033909 Usp36 88.22360 94.59098 76.11531 63.73863 27.10872 24.87050 95.67017 100.73912 78.07470 59.35009 36.36491 40.37057
2 ENSMUSG00000065524 Mir135a-2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
3 ENSMUSG00000030374 Strn4 94.26324 85.05204 59.54855 54.30684 24.42892 24.90936 125.00263 102.99289 110.64124 97.98130 77.96883 105.70093
4 ENSMUSG00000098547 Mir6921 0.08948 0.08404 0.00000 0.08459 0.03884 0.07772 0.00000 0.08505 0.07726 0.08255 0.04403 0.04727

To transform this table from a wide format to a long format, we use the melt() function. It’s important to note that this function does not create tidy data on its own, as it duplicates the identifier values across rows. However, the ‘long format’ output from melt() is often needed for visualization and for left joins, which will be introduced later.

This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.

The melt() function takes four arguments:

  1. id_vars = : Column(s) to use as identifier variables.
  2. value_vars = : Columns to unpivot (convert into label/value pairs in long form). If not specified, all columns that are not set as id_vars = are used.
  3. var_name = : a name for the new column containing the labels (the original column names).
  4. value_name = : a name for the new column containing the values from the melted columns.
counts.melt(id_vars=['X', 'gene_symbol'], var_name='Sample', value_name='Count')
Output
X gene_symbol Sample Count
0 ENSMUSG00000031805 Jak3 GSM1480291 82.22869
1 ENSMUSG00000033909 Usp36 GSM1480291 88.22360
2 ENSMUSG00000065524 Mir135a-2 GSM1480291 0.00000
3 ENSMUSG00000030374 Strn4 GSM1480291 94.26324
4 ENSMUSG00000098547 Mir6921 GSM1480291 0.08948
... ... ... ... ...
235 ENSMUSG00000066189 Cacng3 GSM1480302 0.00000
236 ENSMUSG00000005611 Mrvi1 GSM1480302 185.16572
237 ENSMUSG00000064299 4921528I07Rik GSM1480302 0.00000
238 ENSMUSG00000028174 Rpe65 GSM1480302 0.00000
239 ENSMUSG00000024902 Mrpl11 GSM1480302 10.06901

240 rows × 4 columns

counts.melt(id_vars='X', value_vars=counts.columns[counts.columns.str.startswith('GSM')], var_name='Sample', value_name='Count')
Output
X Sample Count
0 ENSMUSG00000031805 GSM1480291 82.22869
1 ENSMUSG00000033909 GSM1480291 88.22360
2 ENSMUSG00000065524 GSM1480291 0.00000
3 ENSMUSG00000030374 GSM1480291 94.26324
4 ENSMUSG00000098547 GSM1480291 0.08948
... ... ... ...
235 ENSMUSG00000066189 GSM1480302 0.00000
236 ENSMUSG00000005611 GSM1480302 185.16572
237 ENSMUSG00000064299 GSM1480302 0.00000
238 ENSMUSG00000028174 GSM1480302 0.00000
239 ENSMUSG00000024902 GSM1480302 10.06901

240 rows × 3 columns

The counts.columns[counts.columns.str.startswith('GSM')] command returns a vector of columns whose names start with ‘GSM’. melt() will then transform those columns into two new columns, denoted as “Sample” and “Count”. The parameter var_name = 'Sample' indicates that the new column containing the original column names (defined by value_vars =) should be named “Sample”, while value_name = 'Count' specifies that the new column containing the values should be named “Count”.

Converting data from long to wide format

First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).

annot = pd.read_csv("data/GSE60450_annotation.csv")
annot.head()
Output
ENSEMBL Type Annotation
0 ENSMUSG00000031805 SYMBOL Jak3
1 ENSMUSG00000031805 GENENAME Janus kinase 3
2 ENSMUSG00000033909 SYMBOL Usp36
3 ENSMUSG00000033909 GENENAME ubiquitin specific peptidase 36
4 ENSMUSG00000065524 SYMBOL Mir135a-2

To transform this table so that it conforms to the tidy principles, we use the pivot() function.

This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.

The pivot() function takes three arguments:

  1. columns = : column containing the labels that will be transformed into the new column names.
  2. index = : column to use as the new frame’s index. If not given, the existing index is used.
  3. values = : column(s) containing the values that will fill the new columns.

In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:

annot.pivot(index='ENSEMBL', columns='Type', values='Annotation')
Output
Type GENENAME SYMBOL
ENSEMBL
ENSMUSG00000005611 inositol 1,4,5-triphosphate receptor associated 1 Irag1
ENSMUSG00000024902 mitochondrial ribosomal protein L11 Mrpl11
ENSMUSG00000026283 inhibitor of growth family, member 5 Ing5
ENSMUSG00000028174 retinal pigment epithelium 65 Rpe65
ENSMUSG00000030374 striatin, calmodulin binding protein 4 Strn4
ENSMUSG00000031805 Janus kinase 3 Jak3
ENSMUSG00000033475 translocase of outer mitochondrial membrane 6 Tomm6
ENSMUSG00000033909 ubiquitin specific peptidase 36 Usp36
ENSMUSG00000035629 RUN domain and cysteine-rich domain containing... Rubcn
ENSMUSG00000037331 La ribonucleoprotein 1, translational regulator Larp1
ENSMUSG00000038246 family with sequence similarity 50, member B Fam50b
ENSMUSG00000039715 dynein 2 intermediate chain 2 Dync2i2
ENSMUSG00000053080 zinc finger translocation associated Zfta
ENSMUSG00000055491 peroxisome proliferative activated receptor, g... Pprc1
ENSMUSG00000064299 RIKEN cDNA 4921528I07 gene 4921528I07Rik
ENSMUSG00000065524 microRNA 135a-2 Mir135a-2
ENSMUSG00000066189 calcium channel, voltage-dependent, gamma subu... Cacng3
ENSMUSG00000074489 bone gamma-carboxyglutamate protein 3 Bglap3
ENSMUSG00000094053 secretoglobin, family 2B, member 7 Scgb2b7
ENSMUSG00000098547 microRNA 6921 Mir6921

The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.

It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the columns= column, the output will indeed become wider compared to the input.

Separate Columns

First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).

metadata = pd.read_csv("data/GSE60450_metadata.csv")
metadata.head()
Output
gene_id characteristics
0 GSM1480291 mammary gland;luminal cells;virgin
1 GSM1480292 mammary gland;luminal cells;virgin
2 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
3 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480295 mammary gland;luminal cells;2 day lactation

To transform this table so that it conforms to the tidy principles, we use the str.split() function. This operation will separate the characteristics column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.

To separate the characteristics column in the metadata data frame into three separate columns based on the delimiter ; (semicolon), we can use the str.split() function:

# split the characteristics column into 3 columns based on the delimiter ';' and update the metadata DataFrame
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)
# remove characteristics column
metadata = metadata.drop(columns='characteristics')
metadata
Output
gene_id tissue_type immunophenotype development_stage
0 GSM1480291 mammary gland luminal cells virgin
1 GSM1480292 mammary gland luminal cells virgin
2 GSM1480293 mammary gland luminal cells 18.5 day pregnancy
3 GSM1480294 mammary gland luminal cells 18.5 day pregnancy
4 GSM1480295 mammary gland luminal cells 2 day lactation
5 GSM1480296 mammary gland luminal cells 2 day lactation
6 GSM1480297 mammary gland basal cells virgin
7 GSM1480298 mammary gland basal cells virgin
8 GSM1480299 mammary gland basal cells 18.5 day pregnancy
9 GSM1480300 mammary gland basal cells 18.5 day pregnancy
10 GSM1480301 mammary gland basal cells 2 day lactation
11 GSM1480302 mammary gland basal cells 2 day lactation
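Note that the expand=True argument is what makes str.split() return a DataFrame of separate columns; without it, each element of the result is a list of substrings. A small illustration on a throwaway Series:

s = pd.Series(["mammary gland;luminal cells;virgin"])
s.str.split(';')               # a Series whose single element is a list of 3 strings
s.str.split(';', expand=True)  # a DataFrame with 3 separate columns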

Uniting Columns

This is the complement of separating columns. Let’s reverse what we did in the previous section and combine multiple columns into a single column, as illustrated in the image below.

To combine the tissue_type, immunophenotype, and development_stage columns into a characteristics column in the metadata DataFrame:

metadata['characteristics'] = metadata.tissue_type.str.cat(metadata[['immunophenotype','development_stage']], sep = ';')
# remove tissue type, immunophenotype, and development stage columns
metadata = metadata.drop(columns=['tissue_type','immunophenotype','development_stage'])
metadata
Output
gene_id characteristics
0 GSM1480291 mammary gland;luminal cells;virgin
1 GSM1480292 mammary gland;luminal cells;virgin
2 GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
3 GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
4 GSM1480295 mammary gland;luminal cells;2 day lactation
5 GSM1480296 mammary gland;luminal cells;2 day lactation
6 GSM1480297 mammary gland;basal cells;virgin
7 GSM1480298 mammary gland;basal cells;virgin
8 GSM1480299 mammary gland;basal cells;18.5 day pregnancy
9 GSM1480300 mammary gland;basal cells;18.5 day pregnancy
10 GSM1480301 mammary gland;basal cells;2 day lactation
11 GSM1480302 mammary gland;basal cells;2 day lactation

A quick and easy alternative is to unite columns using the + operator:

metadata.characteristics = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage
# remove tissue type, immunophenotype, and development stage columns
metadata = metadata.drop(columns=['tissue_type','immunophenotype','development_stage'])
metadata

Since we already combined these columns and dropped them from the metadata DataFrame in the previous example, this code will not run: the tissue_type, immunophenotype, and development_stage columns no longer exist in the DataFrame.
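If you want to try the + approach yourself, you would first need to re-create the three columns, for example by splitting characteristics again. A sketch, continuing from the current state of metadata:

# re-create the three columns by splitting 'characteristics' again
metadata[["tissue_type", "immunophenotype", "development_stage"]] = metadata.characteristics.str.split(';', expand=True)

# combine them back into a single column with the + operator
metadata['characteristics'] = metadata.tissue_type + ';' + metadata.immunophenotype + ';' + metadata.development_stage

# drop the intermediate columns again
metadata = metadata.drop(columns=['tissue_type', 'immunophenotype', 'development_stage'])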

Joining Data

Pandas provides two primary methods for combining DataFrames: concat and merge. These methods allow you to combine DataFrames along rows and columns, and they support various types of joins, including inner, left, right, and outer joins. Here’s an explanation of both methods and the types of joins they support:

concat

The concat method is used for concatenating (stacking) DataFrames along a particular axis, which can be either rows or columns. It’s particularly useful when you want to combine DataFrames with the same structure along a common axis. It does not require a common key to merge DataFrames, as it is primarily for stacking them.

Consider the following three DataFrames,

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
                    
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

df3 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})
  • Concatenating along rows (vertically)
df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
df2
A B
0 A3 B3
1 A4 B4
2 A5 B5
pd.concat([df1, df2], axis=0)
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
  • Concatenating along columns (horizontally)
df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
df3
C D
0 C0 D0
1 C1 D1
2 C2 D2
pd.concat([df1, df3], axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
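
Notice that when concatenating along rows (the first example above), each DataFrame keeps its original index, so the row labels 0, 1, 2 appear twice in the result. If you prefer a fresh, continuous index, pass ignore_index=True:

# concatenate vertically and rebuild the index from 0
pd.concat([df1, df2], axis=0, ignore_index=True)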

merge

The merge method is used for merging DataFrames based on common columns or indexes, similar to SQL joins. It’s especially useful when you have different DataFrames with related data and want to combine them based on a shared key.

Consider the following two DataFrames,

left = pd.DataFrame({
         "key1": ["K0", "K0", "K1", "K2"],
         "key2": ["K0", "K1", "K0", "K1"],
         "A": ["A0", "A1", "A2", "A3"],
         "B": ["B0", "B1", "B2", "B3"],
     })
right = pd.DataFrame({
         "key1": ["K0", "K1", "K1", "K2"],
         "key2": ["K0", "K0", "K0", "K0"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
     })

Inner join

An inner join returns only the rows with matching values in the specified columns (the common key). It combines data from two or more tables or DataFrames based on the intersection of keys, excluding rows that do not have corresponding matches in both tables.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="inner", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

Outer join

An outer join returns all rows from both tables, including rows with matching keys and rows with non-matching keys. When there’s no match for a particular row in one of the tables, the missing values are filled with NULL (or NaN in pandas), indicating no corresponding data.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="outer", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K0 NaN NaN C3 D3
5 K2 K1 A3 B3 NaN NaN
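
When performing an outer join, it is often useful to know which table each row came from. Passing indicator=True adds a _merge column whose values are 'left_only', 'right_only', or 'both':

# same outer join, with a column indicating the source of each row
pd.merge(left, right, how="outer", on=["key1", "key2"], indicator=True)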

Left join

A left join returns all rows from the left table (the first table specified) and matching rows from the right table (the second table specified). Non-matching rows in the right table have NULL (or NaN) values in the result.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="left", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN

Right join

A right join is similar to a left join but returns all rows from the right table and matching rows from the left table. Non-matching rows in the left table have NULL (or NaN) values in the result.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="right", on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3

Cross join

A cross join is used to create all possible combinations of rows from multiple tables or DataFrames. It can be useful in specific situations, such as when you want to generate all possible pairs or combinations of data.

left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, how="cross")
key1_x key2_x A B key1_y key2_y C D
0 K0 K0 A0 B0 K0 K0 C0 D0
1 K0 K0 A0 B0 K1 K0 C1 D1
2 K0 K0 A0 B0 K1 K0 C2 D2
3 K0 K0 A0 B0 K2 K0 C3 D3
4 K0 K1 A1 B1 K0 K0 C0 D0
5 K0 K1 A1 B1 K1 K0 C1 D1
6 K0 K1 A1 B1 K1 K0 C2 D2
7 K0 K1 A1 B1 K2 K0 C3 D3
8 K1 K0 A2 B2 K0 K0 C0 D0
9 K1 K0 A2 B2 K1 K0 C1 D1
10 K1 K0 A2 B2 K1 K0 C2 D2
11 K1 K0 A2 B2 K2 K0 C3 D3
12 K2 K1 A3 B3 K0 K0 C0 D0
13 K2 K1 A3 B3 K1 K0 C1 D1
14 K2 K1 A3 B3 K1 K0 C2 D2
15 K2 K1 A3 B3 K2 K0 C3 D3
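
Note that in the cross join output the overlapping column names (key1, key2) are disambiguated with the default suffixes _x and _y; you can control these with the suffixes= argument. Also, merge is available as a DataFrame method, so the joins above can equivalently be written as, for example:

# method form of the inner join shown earlier
left.merge(right, how="inner", on=["key1", "key2"])

# cross join with custom suffixes for the overlapping column names
left.merge(right, how="cross", suffixes=("_left", "_right"))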

String Manipulation in Pandas

String manipulation is an essential part of data pre-processing and analysis in Pandas, especially when dealing with textual data. The str accessor in Pandas provides a wide range of string functions that allow you to clean, transform, and extract valuable information from text data within a DataFrame.

Here, we explore some of the most commonly used string manipulation functions available in Pandas for DataFrames.

  • str.replace() Replaces occurrences of a pattern (string or regex) with another string.

    cms.columns.str.replace(' ', '_')  # Replace spaces with "_"
    Index(['ID', 'Facility_Name', 'County', 'Hospital_Type', 'Star_Rating',
           'No_of_Surveys', 'Response_Rate', 'Overall_Rating'],
          dtype='object')
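
    In recent pandas versions, str.replace() treats the pattern as a literal string by default; pass regex=True to replace based on a regular expression. A small sketch (output not shown):

    # replace a trailing 'HOSPITAL' using a regular expression
    cms['Facility Name'].str.replace(r'HOSPITAL$', 'HOSP', regex=True)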
  • str.contains()

    Checks if each string contains a specified substring (can use regular expressions).

    mask = cms["Facility Name"].str.contains('GENERAL')
    cms[mask]
    ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
    6 151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
    8 490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
    13 501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
    col_mask = cms.columns.str.contains('Rating')
    cms.columns[col_mask]
    Index(['Star Rating', 'Overall Rating'], dtype='object')

    Checks if each string contains a specified regular expression pattern.

    mask_col = cms.columns.str.contains(r'[A-Z][a-z]*\s[A-Z][a-z]{3}$')
    cms.columns[mask_col]
    Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')

    Here, the regular expression [A-Z][a-z]*\s[A-Z][a-z]{3}$ can be broken down into smaller chunks for better understanding:

    • [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
    • [a-z]* matches zero or more lowercase characters from ‘a’ to ‘z’.
    • \s matches a whitespace character (here, the space between the two words).
    • [A-Z] matches an uppercase character from ‘A’ to ‘Z’.
    • [a-z]{3} matches exactly three lowercase characters from ‘a’ to ‘z’.
    • $ matches the end of the string.

    Putting this together, the expression matches column names that end with two capitalized words separated by a space, where the final word is exactly four characters long (one uppercase letter followed by three lowercase letters). Thus, it matches the column names: Facility Name, Hospital Type, and Response Rate.

    If you’re unfamiliar with regular expressions, you can skip this section for now. However, interested readers can find many online resources to learn about regular expressions. One of my favorite online tools for building and testing regular expressions is https://regexr.com. You can use this tool to test the correctness of a regular expression.

  • str.startswith() and str.endswith() Checks if each string starts or ends with a specified substring.

    mask_county = cms.County.str.startswith('S')
    cms.County[mask_county]
    0     SAN DIEGO
    3     SEBASTIAN
    4        SHELBY
    12       SUMNER
    Name: County, dtype: object
    mask_col = cms.columns.str.endswith('e')
    cms.columns[mask_col]
    Index(['Facility Name', 'Hospital Type', 'Response Rate'], dtype='object')
  • str.len() Returns the length of each string in the Series.

    cms['County'].str.len()
    0      9
    1      4
    2      4
    3      9
    4      6
    5      6
    6      6
    7      9
    8     14
    9      7
    10    11
    11    10
    12     6
    13     6
    14    11
    Name: County, dtype: int64
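
    Because the result is a numeric Series, it can be used directly as a boolean filter, for example to keep only rows whose county name is longer than 10 characters (a quick sketch, output not shown):

    # select rows with long county names
    cms[cms['County'].str.len() > 10]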
  • str.split() Splits each string in the Series into a list of substrings based on a specified delimiter.

    cms['Hospital Type'].str.split().str[0]  # Split the Hospital Type column and take the first part
    0        Acute
    1        Acute
    2        Acute
    3        Acute
    4        Acute
    5        Acute
    6     Critical
    7     Critical
    8        Acute
    9        Acute
    10       Acute
    11       Acute
    12       Acute
    13    Critical
    14       Acute
    Name: Hospital Type, dtype: object
  • str.join() Concatenates a list of strings into a single string with a specified separator.

    cms[['Facility Name', 'County']].apply(','.join, axis=1)  # Join facility name and county with a comma
    0            SCRIPPS GREEN HOSPITAL,SAN DIEGO
    1                    ST BERNARD HOSPITAL,COOK
    2                    SOUTH LAKE HOSPITAL,LAKE
    3         MERCY HOSPITAL FORT SMITH,SEBASTIAN
    4            BAPTIST MEMORIAL HOSPITAL,SHELBY
    5     ST JOSEPH REGIONAL HEALTH CENTER,BRAZOS
    6       GREENE COUNTY GENERAL HOSPITAL,GREENE
    7       SOUTHWEST MEMORIAL HOSPITAL,MONTEZUMA
    8     SENTARA GENERAL HOSPITAL,VIRGINIA BEACH
    9           PIEDMONT FAYETTE HOSPITAL,FAYETTE
    10     MISSION COMMUNITY HOSPITAL,LOS ANGELES
    11                DOCTORS HOSPITAL,MIAMI-DADE
    12      SUMNER REGIONAL MEDICAL CENTER,SUMNER
    13            WHIDBEY GENERAL HOSPITAL,ISLAND
    14      NORTHRIDGE MEDICAL CENTER,LOS ANGELES
    dtype: object
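
    Note that the example above uses Python's built-in ','.join applied row-wise via apply(); pandas' own Series.str.join() instead joins list-like elements within each cell. A small sketch combining it with str.split() (output not shown):

    # split each facility name into words, then join the words back with underscores
    cms['Facility Name'].str.split().str.join('_')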
  • str.find() Returns the lowest index of the substring; returns -1 if not found.

    cms['Facility Name'].str.find('MEMORIAL')
    0     -1
    1     -1
    2     -1
    3     -1
    4      8
    5     -1
    6     -1
    7     10
    8     -1
    9     -1
    10    -1
    11    -1
    12    -1
    13    -1
    14    -1
    Name: Facility Name, dtype: int64
  • str.zfill() Pads each string in the Series with zeros on the left to achieve a specified width.

    cms['ID'].astype(str).str.zfill(6)  # Pad each facility ID with leading zeros to 6 digits
    0     050424
    1     140103
    2     100051
    3     040062
    4     440048
    5     450011
    6     151317
    7     061327
    8     490057
    9     110215
    10    050704
    11    100296
    12    440003
    13    501339
    14    050116
    Name: ID, dtype: object
  • str.cat() Concatenates strings in the Series/Index with an optional separator.

    cms['Facility Name'].str.cat(cms['County'], sep='-')
    0            SCRIPPS GREEN HOSPITAL-SAN DIEGO
    1                    ST BERNARD HOSPITAL-COOK
    2                    SOUTH LAKE HOSPITAL-LAKE
    3         MERCY HOSPITAL FORT SMITH-SEBASTIAN
    4            BAPTIST MEMORIAL HOSPITAL-SHELBY
    5     ST JOSEPH REGIONAL HEALTH CENTER-BRAZOS
    6       GREENE COUNTY GENERAL HOSPITAL-GREENE
    7       SOUTHWEST MEMORIAL HOSPITAL-MONTEZUMA
    8     SENTARA GENERAL HOSPITAL-VIRGINIA BEACH
    9           PIEDMONT FAYETTE HOSPITAL-FAYETTE
    10     MISSION COMMUNITY HOSPITAL-LOS ANGELES
    11                DOCTORS HOSPITAL-MIAMI-DADE
    12      SUMNER REGIONAL MEDICAL CENTER-SUMNER
    13            WHIDBEY GENERAL HOSPITAL-ISLAND
    14      NORTHRIDGE MEDICAL CENTER-LOS ANGELES
    Name: Facility Name, dtype: object
  • str.lower() and str.upper()

    Converts all characters in a string to lowercase or uppercase, respectively.

    cms["Facility Name"].str.lower() # Convert to lowercase
    0               scripps green hospital
    1                  st bernard hospital
    2                  south lake hospital
    3            mercy hospital fort smith
    4            baptist memorial hospital
    5     st joseph regional health center
    6       greene county general hospital
    7          southwest memorial hospital
    8             sentara general hospital
    9            piedmont fayette hospital
    10          mission community hospital
    11                    doctors hospital
    12      sumner regional medical center
    13            whidbey general hospital
    14           northridge medical center
    Name: Facility Name, dtype: object
    cms["Hospital Type"].str.upper() # Convert to uppercase
    0          ACUTE CARE HOSPITAL
    1          ACUTE CARE HOSPITAL
    2          ACUTE CARE HOSPITAL
    3          ACUTE CARE HOSPITAL
    4          ACUTE CARE HOSPITAL
    5          ACUTE CARE HOSPITAL
    6     CRITICAL ACCESS HOSPITAL
    7     CRITICAL ACCESS HOSPITAL
    8          ACUTE CARE HOSPITAL
    9          ACUTE CARE HOSPITAL
    10         ACUTE CARE HOSPITAL
    11         ACUTE CARE HOSPITAL
    12         ACUTE CARE HOSPITAL
    13    CRITICAL ACCESS HOSPITAL
    14         ACUTE CARE HOSPITAL
    Name: Hospital Type, dtype: object
  • str.capitalize() and str.title()

    • str.capitalize() converts the first character of each string to uppercase and the rest to lowercase.
    • str.title() converts the first character of each word to uppercase and the rest to lowercase.
    cms['Facility Name'].str.capitalize()
    0               Scripps green hospital
    1                  St bernard hospital
    2                  South lake hospital
    3            Mercy hospital fort smith
    4            Baptist memorial hospital
    5     St joseph regional health center
    6       Greene county general hospital
    7          Southwest memorial hospital
    8             Sentara general hospital
    9            Piedmont fayette hospital
    10          Mission community hospital
    11                    Doctors hospital
    12      Sumner regional medical center
    13            Whidbey general hospital
    14           Northridge medical center
    Name: Facility Name, dtype: object
    cms['Facility Name'].str.title()
    0               Scripps Green Hospital
    1                  St Bernard Hospital
    2                  South Lake Hospital
    3            Mercy Hospital Fort Smith
    4            Baptist Memorial Hospital
    5     St Joseph Regional Health Center
    6       Greene County General Hospital
    7          Southwest Memorial Hospital
    8             Sentara General Hospital
    9            Piedmont Fayette Hospital
    10          Mission Community Hospital
    11                    Doctors Hospital
    12      Sumner Regional Medical Center
    13            Whidbey General Hospital
    14           Northridge Medical Center
    Name: Facility Name, dtype: object
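
These str methods can also be chained, which is convenient when cleaning text in several steps. For example, to produce title-cased, underscore-separated facility names in a single expression (a small sketch, output not shown):

# title-case the names, then replace spaces with underscores
cms['Facility Name'].str.title().str.replace(' ', '_')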

This concludes the data manipulation section. By the end of this section, you should now be able to confidently read and write datasets into and from DataFrames, pre-process and clean data, handle missing values, group, join, and transform data, and perform basic statistical analysis to gain insights into the underlying data.

In the next section, we will dive into data visualization, where you’ll learn how to create insightful and compelling charts and plots using the plotnine package to better communicate your findings.

