Analysing Patient Data

Before we can analyze patient data, we need to load it into Python in a way that makes it easy to work with. Let’s load the Metabric patient data and set the patient ID as the index, so rows are labelled meaningfully. If you already have metabric_patients in memory from a previous episode, you can reuse it; otherwise, run the following to (re)load it:

```python
import pandas as pd

metabric_patients = pd.read_csv(
    'https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv',
    index_col='Patient_ID'
)
```

Words tell a story; statistics help us check whether that story fits the data. Once the Metabric data are loaded and labelled, the next step is to turn questions into short, reproducible operations: filter the rows you care about, pick the variables that answer the question, and apply the right summary or comparison.
Statistics on Data
Before jumping into complex models, we start by describing the data. Descriptive statistics answer simple questions like “what is typical?” and “how much do values vary?” These summaries help you build intuition and spot problems early (e.g., unexpected scales or missing values).
The pandas library provides a wide range of statistical functions and methods for computing summary statistics. Below are some of the key statistical measures you can compute using this library.
Mean
Mean (average) of a single column (one variable):
```python
metabric_patients.loc[:, 'Survival_time'].mean()
```

```
np.float64(125.12132352924738)
```

This answers “on average, what is the value of this variable?” Here we use the Survival_time column.
Mean of all numeric columns (column-wise):
```python
metabric_patients.mean(numeric_only=True)
```

```
Cohort                             2.643908
Age_at_diagnosis                  61.087054
Survival_time                    125.121324
Tumour_size                       26.238726
Tumour_stage                       1.750535
Neoplasm_histologic_grade          2.415939
Lymph_nodes_examined_positive      2.002101
Lymph_node_status                  1.639706
Nottingham_prognostic_index        4.033019
Mutation_count                     5.697687
ESR1                               9.607824
ERBB2                             10.765364
PGR                                6.237203
TP53                               6.197967
PIK3CA                             5.970097
GATA3                              9.502910
FOXA1                             10.800526
MLPH                              11.362384
dtype: float64
```

This computes the mean for each numeric column independently, giving you a quick overview of typical values across the dataset.
Why numeric_only=True?

Many DataFrames mix numeric columns (like Tumour_size) with non-numeric columns (like Patient_ID or ER_status labels). Adding numeric_only=True tells pandas to ignore non-numeric columns when computing a statistic across the whole DataFrame. This avoids errors or confusing results. For a single numeric column (e.g., metabric_patients.loc[:, 'Survival_time']), you don’t need this argument; it’s most helpful for DataFrame-wide operations.
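To see the difference on a toy example (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy DataFrame mixing a numeric and a text column (made-up values)
df = pd.DataFrame({
    'Tumour_size': [22.0, 10.0, 15.0],
    'ER_status': ['Positive', 'Negative', 'Positive'],
})

# numeric_only=True skips the text column; without it, recent pandas
# versions raise a TypeError when asked to average strings
print(df.mean(numeric_only=True))
```

The result is a Series containing only the Tumour_size mean; the ER_status column is silently skipped.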
Mean across each row:
```python
metabric_patients.mean(axis=1, numeric_only=True)
```

```
Patient_ID
MB-0000    19.046874
MB-0002    12.483745
MB-0005    17.399631
MB-0006    18.072797
MB-0008    14.313514
             ...    
MB-7295    20.963895
MB-7296    11.924471
MB-7297    23.663816
MB-7298    17.058346
MB-7299    21.721870
Length: 1904, dtype: float64
```

This treats each row as a small collection of values and averages across columns. It only makes sense when the columns you include are on a comparable scale (for example, a set of related measurements). The axis=1 argument tells pandas to compute across columns for each row.
axis=

- axis=0 (default) means “by column”: compute one result per column across rows.
- axis=1 means “by row”: compute one result per row across columns.

Most reducers (sum, mean, min, max, etc.) accept axis= and follow the same pattern.
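A minimal sketch of the two directions, using made-up numbers:

```python
import pandas as pd

# Two rows and two columns of made-up values
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['r1', 'r2'])

print(df.sum())        # axis=0 (default): one total per column (a=3, b=30)
print(df.sum(axis=1))  # axis=1: one total per row (r1=11, r2=22)
```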
Median
The median (robust “typical” value) is the middle value when data are sorted. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data.
```python
metabric_patients.median(numeric_only=True)
```

```
Cohort                             3.000000
Age_at_diagnosis                  61.770000
Survival_time                    115.616667
Tumour_size                       23.000000
Tumour_stage                       2.000000
Neoplasm_histologic_grade          3.000000
Lymph_nodes_examined_positive      0.000000
Lymph_node_status                  1.000000
Nottingham_prognostic_index        4.042000
Mutation_count                     5.000000
ESR1                              10.252166
ERBB2                             10.530301
PGR                                5.877591
TP53                               6.185873
PIK3CA                             5.938094
GATA3                              9.911805
FOXA1                             11.367947
MLPH                              11.873967
dtype: float64
```

Visualising mean vs median:
The mean pulls toward extreme values more than the median. In skewed distributions, the two lines will sit apart.
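You can see this pull on a small made-up sample containing one outlier:

```python
import pandas as pd

# Four typical values plus one extreme outlier (made-up data)
s = pd.Series([1, 2, 3, 4, 100])

print(s.mean())    # 22.0: the outlier drags the mean far upward
print(s.median())  # 3.0: the middle value is unaffected by the outlier
```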
Standard Deviation
The standard deviation (spread around the mean) measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability.
```python
metabric_patients.std(numeric_only=True)
```

```
Cohort                            1.228615
Age_at_diagnosis                 12.978711
Survival_time                    76.334148
Tumour_size                      15.160976
Tumour_stage                      0.628999
Neoplasm_histologic_grade         0.650612
Lymph_nodes_examined_positive     4.079993
Lymph_node_status                 0.743809
Nottingham_prognostic_index       1.144492
Mutation_count                    4.058778
ESR1                              2.133827
ERBB2                             1.357359
PGR                               1.020871
TP53                              0.401864
PIK3CA                            0.352549
GATA3                             1.502636
FOXA1                             1.754282
MLPH                              1.687555
dtype: float64
```

Variance
Variance (spread in squared units) is the square of the standard deviation. It quantifies how much individual data points deviate from the mean. It’s useful mathematically, but note its units are squared (e.g., cm² if the original variable is in cm).
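The relationship var = std² is easy to check directly on toy data:

```python
import pandas as pd

# Toy data: the mean is 4, so deviations are -2, 0, +2
s = pd.Series([2.0, 4.0, 6.0])

print(s.std())  # 2.0
print(s.var())  # 4.0, i.e. the standard deviation squared
```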
```python
metabric_patients.var(numeric_only=True)
```

```
Cohort                              1.509495
Age_at_diagnosis                  168.446952
Survival_time                    5826.902195
Tumour_size                       229.855200
Tumour_stage                        0.395640
Neoplasm_histologic_grade           0.423296
Lymph_nodes_examined_positive      16.646343
Lymph_node_status                   0.553252
Nottingham_prognostic_index         1.309861
Mutation_count                     16.473681
ESR1                                4.553217
ERBB2                               1.842424
PGR                                 1.042178
TP53                                0.161494
PIK3CA                              0.124291
GATA3                               2.257914
FOXA1                               3.077506
MLPH                                2.847843
dtype: float64
```

Sum
Use .sum() to add up values.
Total for each numeric column (column-wise):
```python
metabric_patients.sum(numeric_only=True)
```

```
Cohort                             5034.000000
Age_at_diagnosis                 116309.750000
Survival_time                    238231.000000
Tumour_size                       49433.760000
Tumour_stage                       2456.000000
Neoplasm_histologic_grade          4426.000000
Lymph_nodes_examined_positive      3812.000000
Lymph_node_status                  3122.000000
Nottingham_prognostic_index        7678.867520
Mutation_count                    10592.000000
ESR1                              18293.296572
ERBB2                             20497.253069
PGR                               11875.634913
TP53                              11800.928408
PIK3CA                            11367.064843
GATA3                             18093.541447
FOXA1                             20564.200657
MLPH                              21633.979019
dtype: float64
```

Total for a single column (e.g., Mutation_count across all patients):
```python
metabric_patients.loc[:, 'Mutation_count'].sum()
```

```
np.float64(10592.0)
```

If you ever sum across rows (axis=1), be sure the columns are on comparable scales.
Row-wise sum of selected numeric columns (use with care):
```python
metabric_patients.loc[:, 'ESR1':'MLPH'].sum(axis=1, numeric_only=True)
```

```
Patient_ID
MB-0000    60.602853
MB-0002    74.864079
MB-0005    71.593354
MB-0006    72.647012
MB-0008    74.226588
             ...    
MB-7295    72.369544
MB-7296    74.062673
MB-7297    73.368202
MB-7298    76.548554
MB-7299    69.311795
Length: 1904, dtype: float64
```

What the output shows
- DataFrame sum (no axis= given): a Series (one total per numeric column).
- Single-column sum: a single number.
- Row-wise sum (axis=1): a Series indexed by row labels (one total per patient).
Count
Use .count() to count non-missing values. This is helpful to see how complete each column is.
Count of non-missing values for each column:
```python
metabric_patients.count()
```

```
Cohort                           1904
Age_at_diagnosis                 1904
Survival_time                    1904
Survival_status                  1904
Vital_status                     1903
Chemotherapy                     1904
Radiotherapy                     1904
Tumour_size                      1884
Tumour_stage                     1403
Neoplasm_histologic_grade        1832
Lymph_nodes_examined_positive    1904
Lymph_node_status                1904
Cancer_type                      1889
ER_status                        1904
PR_status                        1904
HER2_status                      1904
HER2_status_measured_by_SNP6     1904
PAM50                            1904
3-gene_classifier                1700
Nottingham_prognostic_index      1904
Cellularity                      1850
Integrative_cluster              1904
Mutation_count                   1859
ESR1                             1904
ERBB2                            1904
PGR                              1904
TP53                             1904
PIK3CA                           1904
GATA3                            1904
FOXA1                            1904
MLPH                             1904
dtype: int64
```

This returns the number of entries, excluding missing values, for each column in the DataFrame.
Count of non-missing values for a single column:
```python
metabric_patients.loc[:, 'Survival_status'].count()
```

```
np.int64(1904)
```

This returns a single number: the count of non-missing entries in the Survival_status column.
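A common companion to .count() is .isna().sum(), which reports the number of missing values per column, i.e. the complement of the count. A toy sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Toy DataFrame with one missing value in Tumour_size
df = pd.DataFrame({'Tumour_size': [22.0, np.nan, 15.0],
                   'Cohort': [1, 1, 2]})

print(df.count())       # non-missing per column: Tumour_size 2, Cohort 3
print(df.isna().sum())  # missing per column:     Tumour_size 1, Cohort 0
```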
Here is a quick reference summary table of common useful functions.
| Function | Description | 
|---|---|
| count | Number of non-NA observations | 
| sum | Sum of values | 
| mean | Mean of values | 
| median | Arithmetic median of values | 
| min | Minimum | 
| max | Maximum | 
| mode | Mode | 
| abs | Absolute Value | 
| prod | Product of values | 
| std | Bessel-corrected sample standard deviation | 
| var | Unbiased variance | 
| sem | Standard error of the mean | 
| skew | Sample skewness (3rd moment) | 
| kurt | Sample kurtosis (4th moment) | 
| quantile | Sample quantile (value at %) | 
| cumsum | Cumulative sum | 
| cumprod | Cumulative product | 
| cummax | Cumulative maximum | 
| cummin | Cumulative minimum | 
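For instance, quantile from the table above accepts either a single probability or a list of them (toy data, using pandas’ default linear interpolation):

```python
import pandas as pd

# Ten made-up values
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(s.quantile(0.5))           # 5.5, identical to s.median()
print(s.quantile([0.25, 0.75]))  # quartiles: 3.25 and 7.75
```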
In summary, descriptive statistics provide a powerful first look at your data, helping you understand its central tendencies, variability, and completeness. By using functions like mean, median, std, and value_counts, you can quickly identify patterns, spot anomalies, and prepare your dataset for deeper analysis. Mastering these basic tools in pandas will make your data exploration more efficient and insightful.
Recall from the previous episode that you can select specific rows using boolean conditions (masks). See the section on boolean masks. We’ll apply the above statistics to filtered subsets. All the statistical operators that work on entire DataFrames work the same way on slices.
Below are a few examples showing how to compute min and max over columns and rows, and how to find which patient or column produced that value.
Let’s say we want to analyze patients from cohort 1, whose IDs range from MB-0000 to MB-0906. For example, to find the maximum tumour size among these patients, we can use the following code:
```python
cohort_mask = metabric_patients.loc[:, "Cohort"] == 1
print('Maximum for Tumour_size column for cohort 1 patients:')
print(metabric_patients.loc[cohort_mask, "Tumour_size"].max())
```

```
Maximum for Tumour_size column for cohort 1 patients:
180.0
```

For the same set of patients, you can check the minimum mutation count and identify which patient has it. For example:

```python
print('\nMinimum for Mutation_count column for cohort 1 patients:')
min_mutation = metabric_patients.loc[cohort_mask, "Mutation_count"].min()
print(min_mutation)
print('Patients with minimum mutations:')
min_mutation_mask = metabric_patients.loc[:, "Mutation_count"] == min_mutation
metabric_patients.loc[min_mutation_mask & cohort_mask, :]
```

```
Minimum for Mutation_count column for cohort 1 patients:
1.0
Patients with minimum mutations:
            Cohort  Age_at_diagnosis  ...      FOXA1       MLPH
Patient_ID                            ...                      
MB-0006          1             47.68  ...  11.863379  10.472181
MB-0022          1             89.08  ...  10.679403   9.945023
MB-0036          1             85.49  ...  11.991655  12.466928
MB-0100          1             68.68  ...   6.311955   7.963707
MB-0124          1             43.46  ...  11.121980  12.157160
...            ...               ...  ...        ...        ...
MB-0663          1             57.76  ...  11.210480  12.008812
MB-0666          1             61.31  ...  11.271216  11.532359
MB-0880          1             73.64  ...  12.200512  12.800465
MB-0882          1             73.39  ...  11.514192  12.707877
MB-0895          1             79.61  ...  11.615925  10.906647

[78 rows x 31 columns]
```

Useful Helper Functions
Before plotting or building more complex summaries, a few small helpers make your life easier. These tools don’t change the meaning of your data—they help you clean rows, control order, and ask focused questions. After each command, check what changed: the values, the order, the labels, or the row count.
Missing Values
It’s important to make sure your data doesn’t contain missing values (represented as NaN, which stands for “Not a Number”). Missing values can cause errors or misleading results in your analysis. For example, calculating the mean of a column with missing values will return NaN, and plotting functions might skip or misrepresent those data points.
To handle this, you can use the dropna() method. This method removes any rows (or columns) that contain missing values. By default, dropna() removes rows where any column has a missing value.
Drop rows where either ‘3-gene_classifier’ or ‘GATA3’ is missing:
```python
subset = metabric_patients.loc[:, ['3-gene_classifier', 'GATA3']].dropna()
print('Rows before:', len(metabric_patients))
print('Rows after dropna:', len(subset))
```

```
Rows before: 1904
Rows after dropna: 1700
```

You’ll see the original row count and a smaller count after the dropna() call. The subset DataFrame keeps only rows where both columns are present. By default, rows are dropped if any of the selected columns are missing. To drop columns instead of rows, use axis=1. To control which columns are checked on the full DataFrame, pass subset=[...] to dropna().
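The row/column and subset= options can be demonstrated on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Made-up frame: columns 'a' and 'b' each have one missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, 6.0],
                   'c': [7.0, 8.0, 9.0]})

print(len(df.dropna()))                    # 1: only the last row is complete
print(len(df.dropna(subset=['a'])))        # 2: only rows missing 'a' are dropped
print(df.dropna(axis=1).columns.tolist())  # ['c']: drop columns with any NaN
```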
Unique Values
When you’re exploring categorical columns, a first sanity check is to list which labels actually appear. This helps you spot typos (“Lum a” vs “LumA”), unexpected categories, or missing values before you count or plot. Pandas provides .unique() for this: it returns the distinct values observed in a column.
unique() returns an array of the distinct values. If the column contains missing values (NaN), they will be included in the result. Use dropna() first if you want to exclude missing values from the listing.
```python
metabric_patients.loc[:, '3-gene_classifier'].unique()
```

```
array(['ER-/HER2-', 'ER+/HER2- High Prolif', nan, 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
```

```python
# Distinct labels, excluding missing entries
metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()
```

```
array(['ER-/HER2-', 'ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
```

The order of the output is based on first appearance in the column; it is not alphabetical. If you want, you can sort the categories as follows:

```python
labels = metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()
sorted_labels = sorted(labels)
sorted_labels
```

```
['ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif', 'ER-/HER2-', 'HER2+']
```

Listing the set of categories is often followed by asking “how many of each?”. To get a frequency table, see the Frequency of Values section below.
Frequency of Values
To inspect how often each distinct value appears in a column, use value_counts(). This is particularly helpful for categorical variables (for example, labels) to check balance, spot rare categories, and reveal unexpected or missing labels. By default it returns counts sorted by frequency.
```python
metabric_patients.value_counts()
```

```
Cohort  Age_at_diagnosis  Survival_time  Survival_status  ...  GATA3      FOXA1      MLPH     
1       26.36             32.633333      DECEASED         ...  9.411429   10.718919  10.956553    1
3       55.53             62.900000      DECEASED         ...  7.535037   6.905833   7.705025     1
        55.72             79.300000      LIVING           ...  6.892657   11.372491  12.298825    1
        55.65             186.633333     DECEASED         ...  10.242622  11.648412  12.421318    1
        55.62             174.833333     LIVING           ...  9.793889   10.949489  12.199768    1
                                                                                                ..
2       43.51             28.566667      DECEASED         ...  7.297214   5.636049   7.157132     1
        43.56             225.500000     LIVING           ...  9.526202   11.764086  11.211695    1
        43.58             237.133333     LIVING           ...  11.428410  12.125305  13.132729    1
        43.67             188.733333     LIVING           ...  9.847294   11.458224  12.150357    1
5       86.61             63.566667      DECEASED         ...  10.422461  11.099896  12.124662    1
Name: count, Length: 1121, dtype: int64
```

When you call value_counts() on an entire DataFrame, each index entry represents a unique combination of values across all columns, and the count shows how many rows share that exact combination. Since each row in our dataset is unique, all counts are 1; this is a quick way to check for duplicate rows.
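pandas also offers .duplicated() for this check directly; it flags rows that repeat an earlier row. A toy sketch with made-up values:

```python
import pandas as pd

# Made-up frame where the last two rows are identical
df = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})

print(df.duplicated().sum())           # 1: one row repeats an earlier one
print((df.value_counts() == 1).all())  # False: one combination occurs twice
```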
For a specific column (e.g., 3-gene_classifier):
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts()
```

```
3-gene_classifier
ER+/HER2- Low Prolif     619
ER+/HER2- High Prolif    603
ER-/HER2-                290
HER2+                    188
Name: count, dtype: int64
```

When used on a single column, value_counts() returns a Series where the index lists each unique value (such as category names) and the corresponding values show how many times each occurs. In the example above, you’ll see four categories and their respective counts.
Set normalize=True to display proportions (percentages) rather than raw counts.
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True)
```

```
3-gene_classifier
ER+/HER2- Low Prolif     0.364118
ER+/HER2- High Prolif    0.354706
ER-/HER2-                0.170588
HER2+                    0.110588
Name: proportion, dtype: float64
```

To display percentages, multiply the result by 100:
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True) * 100
```

```
3-gene_classifier
ER+/HER2- Low Prolif     36.411765
ER+/HER2- High Prolif    35.470588
ER-/HER2-                17.058824
HER2+                    11.058824
Name: proportion, dtype: float64
```

By default, missing values (or NaN) are excluded from the result. This means that only rows with actual, non-missing values are counted in the frequency table. If you want to include missing values as their own category in the output, you can pass the argument dropna=False to value_counts(). This will show how many times NaN appears in the column, which is useful for quickly assessing the amount of missing data in a categorical variable.
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(dropna=False)
```

```
3-gene_classifier
ER+/HER2- Low Prolif     619
ER+/HER2- High Prolif    603
ER-/HER2-                290
NaN                      204
HER2+                    188
Name: count, dtype: int64
```

Use dropna=False to see how many entries are missing. If your analysis should exclude missing rows entirely, clean first with .dropna() on the relevant columns (see the Missing Values section above) and then run value_counts().
Counts returned by value_counts() are automatically sorted from most to least frequent. If you want to sort the counts by the actual labels (in alphabetical order), you can use the sort_index() method. This is useful when you need to view your data categories in a predictable, ordered way rather than by their frequency.
```python
metabric_patients.loc[:, 'Integrative_cluster'].value_counts().sort_index()
```

```
Integrative_cluster
1       132
10      219
2        72
3       282
4ER+    244
4ER-     74
5       184
6        84
7       182
8       289
9       142
Name: count, dtype: int64
```

To display categories in a custom order (such as a clinically meaningful sequence), use the reindex() method. This lets you specify the exact order of labels you want to see in your output, rather than relying on alphabetical or frequency-based sorting. For example, if you have a desired order of cluster labels, you can filter to only those present in your data and then reindex the counts:
```python
counts = metabric_patients.loc[:, 'Integrative_cluster'].value_counts()

# custom order (keep only labels that exist)
desired = ["1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"]
present = [lab for lab in desired if lab in counts.index]
counts_custom = counts.reindex(present)
counts_custom
```

```
Integrative_cluster
1       132
2        72
3       282
4ER-     74
4ER+    244
5       184
6        84
7       182
8       289
9       142
10      219
Name: count, dtype: int64
```

This approach first uses value_counts() to tally each unique value in the Integrative_cluster column, producing a pandas Series with cluster labels as the index and their counts as values. The desired list defines the preferred order for these labels.
The code then filters desired to include only labels that are actually present in the data. Using reindex(present) rearranges the counts Series to match this custom order, ensuring the output is organized as intended.
If a label from desired is missing in the data, it is simply omitted from the result—no NaNs are introduced. This method is especially helpful for displaying or plotting results in a meaningful sequence.
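The NaN behaviour that the filtering avoids is easy to demonstrate on a toy Series:

```python
import pandas as pd

# Toy counts with only two labels present
counts = pd.Series({'A': 3, 'B': 1})

# Reindexing with an absent label ('C') introduces a NaN entry...
print(counts.reindex(['B', 'A', 'C']))

# ...so filter the desired order down to labels actually present first
present = [lab for lab in ['B', 'A', 'C'] if lab in counts.index]
print(counts.reindex(present))  # B then A, no NaN
```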
- Compute common summaries (mean, median, std, var, sum, count, value_counts).
- Switch between column-wise and row-wise operations with axis= and use numeric_only for mixed dtypes.
- Filter rows with boolean masks and summarize subsets.
- Handle missing values with dropna() when appropriate.
- Explore categories with unique().