Analysing Patient Data

Before we can analyze patient data, we need to load it into Python in a way that makes it easy to work with. Let’s load the Metabric patient data and set the patient ID as the index, so rows are labelled meaningfully. If you already have metabric_patients in memory from a previous episode, you can reuse it; otherwise, run the following to (re)load it:

```python
import pandas as pd

metabric_patients = pd.read_csv(
    'https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv',
    index_col='Patient_ID'
)
```

Words tell a story; statistics help us check whether that story fits the data. Once the Metabric data are loaded and labelled, the next step is to turn questions into short, reproducible operations: filter the rows you care about, pick the variables that answer the question, and apply the right summary or comparison.
Statistics on Data
Before jumping into complex models, we start by describing the data. Descriptive statistics answer simple questions like “what is typical?” and “how much do values vary?” These summaries help you build intuition and spot problems early (e.g., unexpected scales or missing values).
The pandas library provides a wide range of statistical functions and methods for computing summary statistics. Below are some of the key statistical measures you can compute using this library.
Mean
Mean (average) of a single column (one variable):
```python
metabric_patients.loc[:, 'Survival_time'].mean()
```

```
np.float64(125.12132352924738)
```

This answers “on average, what is the value of this variable?” Here we use the Survival_time column.
Mean of all numeric columns (column-wise):
```python
metabric_patients.mean(numeric_only=True)
```

```
Cohort                             2.643908
Age_at_diagnosis                  61.087054
Survival_time                    125.121324
Tumour_size                       26.238726
Tumour_stage                       1.750535
Neoplasm_histologic_grade          2.415939
Lymph_nodes_examined_positive      2.002101
Lymph_node_status                  1.639706
Nottingham_prognostic_index        4.033019
Mutation_count                     5.697687
ESR1                               9.607824
ERBB2                             10.765364
PGR                                6.237203
TP53                               6.197967
PIK3CA                             5.970097
GATA3                              9.502910
FOXA1                             10.800526
MLPH                              11.362384
dtype: float64
```

This computes the mean for each numeric column independently, giving you a quick overview of typical values across the dataset.
Why numeric_only=True?

Many DataFrames mix numeric columns (like Tumour_size) with non-numeric columns (like Patient_ID or ER_status labels). Adding numeric_only=True tells pandas to ignore non-numeric columns when computing a statistic across the whole DataFrame. This avoids errors or confusing results. For a single numeric column (e.g., metabric_patients.loc[:, 'Survival_time']), you don’t need this argument; it’s most helpful for DataFrame-wide operations.
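To see the difference on a toy example (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy DataFrame mixing a numeric and a text column (made-up values)
df = pd.DataFrame({
    'Tumour_size': [22.0, 10.0, 15.0],
    'ER_status': ['Positive', 'Negative', 'Positive'],
})

# numeric_only=True skips the text column; without it, recent pandas
# versions raise a TypeError when asked to average strings
print(df.mean(numeric_only=True))
```

The result is a Series containing only the Tumour_size mean; the ER_status column is silently skipped.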
Mean across each row:
```python
metabric_patients.mean(axis=1, numeric_only=True)
```

```
Patient_ID
MB-0000    19.046874
MB-0002    12.483745
MB-0005    17.399631
MB-0006    18.072797
MB-0008    14.313514
             ...    
MB-7295    20.963895
MB-7296    11.924471
MB-7297    23.663816
MB-7298    17.058346
MB-7299    21.721870
Length: 1904, dtype: float64
```

This treats each row as a small collection of values and averages across columns. It only makes sense when the columns you include are on a comparable scale (for example, a set of related measurements). The axis=1 argument tells pandas to compute across columns for each row.
axis=

- axis=0 (default) means “by column”: compute one result per column across rows.
- axis=1 means “by row”: compute one result per row across columns.

Most reducers (sum, mean, min, max, etc.) accept axis= and follow the same pattern.
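A minimal sketch of the two directions, using made-up numbers:

```python
import pandas as pd

# Two rows and two columns of made-up values
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['r1', 'r2'])

print(df.sum())        # axis=0 (default): one total per column (a=3, b=30)
print(df.sum(axis=1))  # axis=1: one total per row (r1=11, r2=22)
```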
Median
The median (robust “typical” value) is the middle value when data are sorted. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data.
```python
metabric_patients.median(numeric_only=True)
```

```
Cohort                             3.000000
Age_at_diagnosis                  61.770000
Survival_time                    115.616667
Tumour_size                       23.000000
Tumour_stage                       2.000000
Neoplasm_histologic_grade          3.000000
Lymph_nodes_examined_positive      0.000000
Lymph_node_status                  1.000000
Nottingham_prognostic_index        4.042000
Mutation_count                     5.000000
ESR1                              10.252166
ERBB2                             10.530301
PGR                                5.877591
TP53                               6.185873
PIK3CA                             5.938094
GATA3                              9.911805
FOXA1                             11.367947
MLPH                              11.873967
dtype: float64
```

Visualising mean vs median:
The mean pulls toward extreme values more than the median. In skewed distributions, the two lines will sit apart.
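You can see this pull on a small made-up sample containing one outlier:

```python
import pandas as pd

# Four typical values plus one extreme outlier (made-up data)
s = pd.Series([1, 2, 3, 4, 100])

print(s.mean())    # 22.0: the outlier drags the mean far upward
print(s.median())  # 3.0: the middle value is unaffected by the outlier
```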
Standard Deviation
The standard deviation (spread around the mean) measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability.
```python
metabric_patients.std(numeric_only=True)
```

```
Cohort                            1.228615
Age_at_diagnosis                 12.978711
Survival_time                    76.334148
Tumour_size                      15.160976
Tumour_stage                      0.628999
Neoplasm_histologic_grade         0.650612
Lymph_nodes_examined_positive     4.079993
Lymph_node_status                 0.743809
Nottingham_prognostic_index       1.144492
Mutation_count                    4.058778
ESR1                              2.133827
ERBB2                             1.357359
PGR                               1.020871
TP53                              0.401864
PIK3CA                            0.352549
GATA3                             1.502636
FOXA1                             1.754282
MLPH                              1.687555
dtype: float64
```

Variance
Variance (spread in squared units) is the square of the standard deviation. It quantifies how much individual data points deviate from the mean. It’s useful mathematically, but note its units are squared (e.g., cm² if the original variable is in cm).
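The relationship var = std² is easy to check directly on toy data:

```python
import pandas as pd

# Toy data: the mean is 4, so deviations are -2, 0, +2
s = pd.Series([2.0, 4.0, 6.0])

print(s.std())  # 2.0
print(s.var())  # 4.0, i.e. the standard deviation squared
```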
```python
metabric_patients.var(numeric_only=True)
```

```
Cohort                              1.509495
Age_at_diagnosis                  168.446952
Survival_time                    5826.902195
Tumour_size                       229.855200
Tumour_stage                        0.395640
Neoplasm_histologic_grade           0.423296
Lymph_nodes_examined_positive      16.646343
Lymph_node_status                   0.553252
Nottingham_prognostic_index         1.309861
Mutation_count                     16.473681
ESR1                                4.553217
ERBB2                               1.842424
PGR                                 1.042178
TP53                                0.161494
PIK3CA                              0.124291
GATA3                               2.257914
FOXA1                               3.077506
MLPH                                2.847843
dtype: float64
```

Sum
Use .sum() to add up values.
Total for each numeric column (column-wise):
```python
metabric_patients.sum(numeric_only=True)
```

```
Cohort                             5034.000000
Age_at_diagnosis                 116309.750000
Survival_time                    238231.000000
Tumour_size                       49433.760000
Tumour_stage                       2456.000000
Neoplasm_histologic_grade          4426.000000
Lymph_nodes_examined_positive      3812.000000
Lymph_node_status                  3122.000000
Nottingham_prognostic_index        7678.867520
Mutation_count                    10592.000000
ESR1                              18293.296572
ERBB2                             20497.253069
PGR                               11875.634913
TP53                              11800.928408
PIK3CA                            11367.064843
GATA3                             18093.541447
FOXA1                             20564.200657
MLPH                              21633.979019
dtype: float64
```

Total for a single column (e.g., Mutation_count across all patients):
```python
metabric_patients.loc[:, 'Mutation_count'].sum()
```

```
np.float64(10592.0)
```

If you ever sum across rows (axis=1), be sure the columns are on comparable scales.
Row-wise sum of selected numeric columns (use with care):
```python
metabric_patients.loc[:, 'ESR1':'MLPH'].sum(axis=1, numeric_only=True)
```

```
Patient_ID
MB-0000    60.602853
MB-0002    74.864079
MB-0005    71.593354
MB-0006    72.647012
MB-0008    74.226588
             ...    
MB-7295    72.369544
MB-7296    74.062673
MB-7297    73.368202
MB-7298    76.548554
MB-7299    69.311795
Length: 1904, dtype: float64
```

What the output shows
- DataFrame sum (no axis= given): a Series (one total per numeric column).
- Single-column sum: a single number.
- Row-wise sum (axis=1): a Series indexed by row labels (one total per patient).
Count
Use .count() to count non-missing values. This is helpful to see how complete each column is.
Count of non-missing values for each column:
```python
metabric_patients.count()
```

```
Cohort                           1904
Age_at_diagnosis                 1904
Survival_time                    1904
Survival_status                  1904
Vital_status                     1903
Chemotherapy                     1904
Radiotherapy                     1904
Tumour_size                      1884
Tumour_stage                     1403
Neoplasm_histologic_grade        1832
Lymph_nodes_examined_positive    1904
Lymph_node_status                1904
Cancer_type                      1889
ER_status                        1904
PR_status                        1904
HER2_status                      1904
HER2_status_measured_by_SNP6     1904
PAM50                            1904
3-gene_classifier                1700
Nottingham_prognostic_index      1904
Cellularity                      1850
Integrative_cluster              1904
Mutation_count                   1859
ESR1                             1904
ERBB2                            1904
PGR                              1904
TP53                             1904
PIK3CA                           1904
GATA3                            1904
FOXA1                            1904
MLPH                             1904
dtype: int64
```

This returns the number of entries, excluding missing values, for each column in the DataFrame.
Count of non-missing values for a single column:
```python
metabric_patients.loc[:, 'Survival_status'].count()
```

```
np.int64(1904)
```

This returns a single number: the count of non-missing entries in the Survival_status column.
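A common companion to .count() is .isna().sum(), which reports the number of missing values per column, i.e. the complement of the count. A toy sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Toy DataFrame with one missing value in Tumour_size
df = pd.DataFrame({'Tumour_size': [22.0, np.nan, 15.0],
                   'Cohort': [1, 1, 2]})

print(df.count())       # non-missing per column: Tumour_size 2, Cohort 3
print(df.isna().sum())  # missing per column:     Tumour_size 1, Cohort 0
```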
Here is a quick reference summary table of common useful functions.
| Function | Description | 
|---|---|
| count | Number of non-NA observations | 
| sum | Sum of values | 
| mean | Mean of values | 
| median | Arithmetic median of values | 
| min | Minimum | 
| max | Maximum | 
| mode | Mode | 
| abs | Absolute Value | 
| prod | Product of values | 
| std | Bessel-corrected sample standard deviation | 
| var | Unbiased variance | 
| sem | Standard error of the mean | 
| skew | Sample skewness (3rd moment) | 
| kurt | Sample kurtosis (4th moment) | 
| quantile | Sample quantile (value at %) | 
| cumsum | Cumulative sum | 
| cumprod | Cumulative product | 
| cummax | Cumulative maximum | 
| cummin | Cumulative minimum | 
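For instance, quantile from the table above accepts either a single probability or a list of them (toy data, using pandas’ default linear interpolation):

```python
import pandas as pd

# Ten made-up values
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(s.quantile(0.5))           # 5.5, identical to s.median()
print(s.quantile([0.25, 0.75]))  # quartiles: 3.25 and 7.75
```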
In summary, descriptive statistics provide a powerful first look at your data, helping you understand its central tendencies, variability, and completeness. By using functions like mean, median, std, and value_counts, you can quickly identify patterns, spot anomalies, and prepare your dataset for deeper analysis. Mastering these basic tools in pandas will make your data exploration more efficient and insightful.
Recall from the previous episode that you can select specific rows using boolean conditions (masks). See the section on boolean masks. We’ll apply the above statistics to filtered subsets. All the statistical operators that work on entire DataFrames work the same way on slices.
Below are a few examples showing how to compute min and max over columns and rows, and how to find which patient or column produced that value.
Let’s say we want to analyze patients from cohort 1, whose IDs range from MB-0000 to MB-0906. For example, to find the maximum tumour size among these patients, we can use the following code:
```python
cohort_mask = metabric_patients.loc[:, "Cohort"] == 1
print('Maximum for Tumour_size column for cohort 1 patients:')
print(metabric_patients.loc[cohort_mask, "Tumour_size"].max())
```

```
Maximum for Tumour_size column for cohort 1 patients:
180.0
```

For the same set of patients, you can check the minimum mutation count and identify which patient has it. For example:

```python
print('\nMinimum for Mutation_count column for cohort 1 patients:')
min_mutation = metabric_patients.loc[cohort_mask, "Mutation_count"].min()
print(min_mutation)
print('Patients with minimum mutations:')
min_mutation_mask = metabric_patients.loc[:, "Mutation_count"] == min_mutation
metabric_patients.loc[min_mutation_mask & cohort_mask, :]
```

```
Minimum for Mutation_count column for cohort 1 patients:
1.0
Patients with minimum mutations:
            Cohort  Age_at_diagnosis  ...      FOXA1       MLPH
Patient_ID                            ...                      
MB-0006          1             47.68  ...  11.863379  10.472181
MB-0022          1             89.08  ...  10.679403   9.945023
MB-0036          1             85.49  ...  11.991655  12.466928
MB-0100          1             68.68  ...   6.311955   7.963707
MB-0124          1             43.46  ...  11.121980  12.157160
...            ...               ...  ...        ...        ...
MB-0663          1             57.76  ...  11.210480  12.008812
MB-0666          1             61.31  ...  11.271216  11.532359
MB-0880          1             73.64  ...  12.200512  12.800465
MB-0882          1             73.39  ...  11.514192  12.707877
MB-0895          1             79.61  ...  11.615925  10.906647

[78 rows x 31 columns]
```

Useful Helper Functions
Before plotting or building more complex summaries, a few small helpers make your life easier. These tools don’t change the meaning of your data—they help you clean rows, control order, and ask focused questions. After each command, check what changed: the values, the order, the labels, or the row count.
Missing Values
It’s important to make sure your data doesn’t contain missing values (represented as NaN, which stands for “Not a Number”). Missing values can cause errors or misleading results in your analysis. For example, calculating the mean of a column with missing values will return NaN, and plotting functions might skip or misrepresent those data points.
To handle this, you can use the dropna() method. This method removes any rows (or columns) that contain missing values. By default, dropna() removes rows where any column has a missing value.
Drop rows where either ‘3-gene_classifier’ or ‘GATA3’ is missing:
```python
subset = metabric_patients.loc[:, ['3-gene_classifier', 'GATA3']].dropna()
print('Rows before:', len(metabric_patients))
print('Rows after dropna:', len(subset))
```

```
Rows before: 1904
Rows after dropna: 1700
```

You’ll see the original row count and a smaller count after the dropna() call. The subset DataFrame keeps only rows where both columns are present. By default, rows are dropped if any of the selected columns are missing. To drop columns instead of rows, use axis=1. To control which columns are checked on the full DataFrame, pass subset=[...] to dropna().
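The row/column and subset= options can be demonstrated on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Made-up frame: columns 'a' and 'b' each have one missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, 6.0],
                   'c': [7.0, 8.0, 9.0]})

print(len(df.dropna()))                    # 1: only the last row is complete
print(len(df.dropna(subset=['a'])))        # 2: only rows missing 'a' are dropped
print(df.dropna(axis=1).columns.tolist())  # ['c']: drop columns with any NaN
```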
Unique Values
When you’re exploring categorical columns, a first sanity check is to list which labels actually appear. This helps you spot typos (“Lum a” vs “LumA”), unexpected categories, or missing values before you count or plot. Pandas provides .unique() for this: it returns the distinct values observed in a column.
unique() returns an array of the distinct values. If the column contains missing values (NaN), they will be included in the result. Use dropna() first if you want to exclude missing values from the listing.
```python
metabric_patients.loc[:, '3-gene_classifier'].unique()
```

```
array(['ER-/HER2-', 'ER+/HER2- High Prolif', nan, 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
```

```python
# Distinct labels, excluding missing entries
metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()
```

```
array(['ER-/HER2-', 'ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
```

The order of the output is based on first appearance in the column; it is not alphabetical. If you want, you can sort the categories as follows:

```python
labels = metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()
sorted_labels = sorted(labels)
sorted_labels
```

```
['ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif', 'ER-/HER2-', 'HER2+']
```

Listing the set of categories is often followed by asking “how many of each?”. To get a frequency table, see the Frequency of Values section below.
Frequency of Values
To inspect how often each distinct value appears in a column, use value_counts(). This is particularly helpful for categorical variables (for example, labels) to check balance, spot rare categories, and reveal unexpected or missing labels. By default it returns counts sorted by frequency.
```python
metabric_patients.value_counts()
```

```
Cohort  Age_at_diagnosis  Survival_time  Survival_status  ...  GATA3      FOXA1      MLPH     
1       26.36             32.633333      DECEASED         ...  9.411429   10.718919  10.956553    1
3       55.53             62.900000      DECEASED         ...  7.535037   6.905833   7.705025     1
        55.72             79.300000      LIVING           ...  6.892657   11.372491  12.298825    1
        55.65             186.633333     DECEASED         ...  10.242622  11.648412  12.421318    1
        55.62             174.833333     LIVING           ...  9.793889   10.949489  12.199768    1
                                                                                                ..
2       43.51             28.566667      DECEASED         ...  7.297214   5.636049   7.157132     1
        43.56             225.500000     LIVING           ...  9.526202   11.764086  11.211695    1
        43.58             237.133333     LIVING           ...  11.428410  12.125305  13.132729    1
        43.67             188.733333     LIVING           ...  9.847294   11.458224  12.150357    1
5       86.61             63.566667      DECEASED         ...  10.422461  11.099896  12.124662    1
Name: count, Length: 1121, dtype: int64
```

When you call value_counts() on an entire DataFrame, each index entry represents a unique combination of values across all columns, and the count shows how many rows share that exact combination. Since each row in our dataset is unique, all counts are 1; this is a quick way to check for duplicate rows.
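pandas also offers .duplicated() for this check directly; it flags rows that repeat an earlier row. A toy sketch with made-up values:

```python
import pandas as pd

# Made-up frame where the last two rows are identical
df = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})

print(df.duplicated().sum())           # 1: one row repeats an earlier one
print((df.value_counts() == 1).all())  # False: one combination occurs twice
```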
For a specific column (e.g., 3-gene_classifier):
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts()
```

```
3-gene_classifier
ER+/HER2- Low Prolif     619
ER+/HER2- High Prolif    603
ER-/HER2-                290
HER2+                    188
Name: count, dtype: int64
```

When used on a single column, value_counts() returns a Series where the index lists each unique value (such as category names) and the corresponding values show how many times each occurs. In the example above, you’ll see four categories and their respective counts.
Set normalize=True to display proportions (percentages) rather than raw counts.
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True)
```

```
3-gene_classifier
ER+/HER2- Low Prolif     0.364118
ER+/HER2- High Prolif    0.354706
ER-/HER2-                0.170588
HER2+                    0.110588
Name: proportion, dtype: float64
```

To display percentages, multiply the result by 100:
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True) * 100
```

```
3-gene_classifier
ER+/HER2- Low Prolif     36.411765
ER+/HER2- High Prolif    35.470588
ER-/HER2-                17.058824
HER2+                    11.058824
Name: proportion, dtype: float64
```

By default, missing values (or NaN) are excluded from the result. This means that only rows with actual, non-missing values are counted in the frequency table. If you want to include missing values as their own category in the output, you can pass the argument dropna=False to value_counts(). This will show how many times NaN appears in the column, which is useful for quickly assessing the amount of missing data in a categorical variable.
```python
metabric_patients.loc[:, '3-gene_classifier'].value_counts(dropna=False)
```

```
3-gene_classifier
ER+/HER2- Low Prolif     619
ER+/HER2- High Prolif    603
ER-/HER2-                290
NaN                      204
HER2+                    188
Name: count, dtype: int64
```

Use dropna=False to see how many entries are missing. If your analysis should exclude missing rows entirely, clean first with .dropna() on the relevant columns (see the Missing Values section above) and then run value_counts().
Counts returned by value_counts() are automatically sorted from most to least frequent. If you want to sort the counts by the actual labels (in alphabetical order), you can use the sort_index() method. This is useful when you need to view your data categories in a predictable, ordered way rather than by their frequency.
```python
metabric_patients.loc[:, 'Integrative_cluster'].value_counts().sort_index()
```

```
Integrative_cluster
1       132
10      219
2        72
3       282
4ER+    244
4ER-     74
5       184
6        84
7       182
8       289
9       142
Name: count, dtype: int64
```

To display categories in a custom order (such as a clinically meaningful sequence), use the reindex() method. This lets you specify the exact order of labels you want to see in your output, rather than relying on alphabetical or frequency-based sorting. For example, if you have a desired order of cluster labels, you can filter to only those present in your data and then reindex the counts:
```python
counts = metabric_patients.loc[:, 'Integrative_cluster'].value_counts()

# custom order (keep only labels that exist)
desired = ["1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"]
present = [lab for lab in desired if lab in counts.index]
counts_custom = counts.reindex(present)
counts_custom
```

```
Integrative_cluster
1       132
2        72
3       282
4ER-     74
4ER+    244
5       184
6        84
7       182
8       289
9       142
10      219
Name: count, dtype: int64
```

This approach first uses value_counts() to tally each unique value in the Integrative_cluster column, producing a pandas Series with cluster labels as the index and their counts as values. The desired list defines the preferred order for these labels.
The code then filters desired to include only labels that are actually present in the data. Using reindex(present) rearranges the counts Series to match this custom order, ensuring the output is organized as intended.
If a label from desired is missing in the data, it is simply omitted from the result—no NaNs are introduced. This method is especially helpful for displaying or plotting results in a meaningful sequence.
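The NaN behaviour that the filtering avoids is easy to demonstrate on a toy Series:

```python
import pandas as pd

# Toy counts with only two labels present
counts = pd.Series({'A': 3, 'B': 1})

# Reindexing with an absent label ('C') introduces a NaN entry...
print(counts.reindex(['B', 'A', 'C']))

# ...so filter the desired order down to labels actually present first
present = [lab for lab in ['B', 'A', 'C'] if lab in counts.index]
print(counts.reindex(present))  # B then A, no NaN
```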
- Compute common summaries (mean, median, std, var, sum, count, value_counts).
- Switch between column-wise and row-wise operations with axis= and use numeric_only for mixed dtypes.
- Filter rows with boolean masks and summarize subsets.
- Handle missing values with dropna() when appropriate.
- Explore categories with unique().