Primary Outcome
In Tang et al. (2024), we developed the simplified MyProstateScore2.0 (sMPS2), a 7-gene urine test which achieves similar state-of-the-art diagnostic accuracy for predicting high-grade prostate cancer as the original 18-gene MyProstateScore2.0 (MPS2) (Tosoian et al. 2024). This simplified biomarker test provides a more cost-effective alternative to the original MPS2 test and greatly increases its accessibility for routine clinical care.
In this PCS documentation (Yu and Kumbier 2020), we expand upon the sMPS2 model development pipeline, transparently documenting and justifying human judgment calls (including data preprocessing and modeling decisions) when possible. We also provide additional visualizations and stability analyses to further support the robustness and generalizability of the sMPS2 test.
In this section, we provide a brief exploration of the Development Cohort data, which was used to develop the simplified MyProstateScore2.0 (sMPS2) test. This Development Cohort consists of 761 samples and was used to build the original MPS2 models (Tosoian et al. 2024).
Below, we visualize the (marginal) distribution of each gene and clinical variable, grouped by prostate cancer (PCa) grade. Some interesting observations:
We also plot a correlation heatmap (Figure 2.3), showing the pairwise correlations between genes' expression levels (as measured via their Ct values). This heatmap shows that there are indeed strong positive correlations among groups of genes, which can complicate interpretation and affect the stability of feature importances.
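To make this exploratory step concrete, the snippet below sketches how the per-gene distributions by PCa grade (Figure 2.2) and the clustered correlation heatmap (Figure 2.3) could be generated. The file path, DataFrame layout, and column names (`sample_id`, `grade`) are hypothetical placeholders rather than the actual analysis code.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical input: one row per sample, with gene Ct columns plus a 'grade' label.
dev = pd.read_csv("development_cohort_ct.csv")  # placeholder path
gene_cols = [c for c in dev.columns if c not in ("sample_id", "grade")]

# Figure 2.2-style panels: marginal Ct distribution of each gene by PCa grade.
for gene in gene_cols[:4]:  # a few genes shown for illustration
    sns.kdeplot(data=dev, x=gene, hue="grade", common_norm=False)
    plt.title(f"{gene} Ct values by PCa grade")
    plt.show()

# Figure 2.3-style heatmap: pairwise gene-gene correlations, hierarchically clustered.
corr = dev[gene_cols].corr()
sns.clustermap(corr, method="average", cmap="vlag", center=0)
plt.show()
```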
Figure 2.1: Number of high-grade and low-grade prostate cancer (PCa) patients in Development Cohort.
Figure 2.2: Distribution of Ct values in Development Cohort for each gene by prostate cancer (PCa) grade.
Figure 2.3: Correlation heatmap of gene expression (Ct values) in Development Cohort data. Genes have been clustered using hierarchical clustering.
Figure 2.4: Distribution of clinical features in Development Cohort by prostate cancer (PCa) grade.
As discussed in Tang et al. (2024), gene expression in each urine sample was measured via the cycle threshold (Ct) using qPCR profiling across 54 genes. These 54 genes were previously nominated as potential biomarkers for prostate cancer (PCa) detection in the MPS2 study (Tosoian et al. 2024) and are thus of interest here. In what follows, we recap the data preprocessing procedure used in this study (also described in Tang et al. (2024)) and provide additional justification for our judgment calls wherever possible.
As a starting point, we preprocessed the expression data as in the original MPS2 study (Tosoian et al. 2024):
We set the upper Ct value limit to 35. Specifically, Ct values greater than this limit were considered undetected and set to 35. Ct values from OpenArray that were “Undetermined” or “Inconclusive/No Amp” were also considered to be undetected and set to the upper Ct value limit of 35.
We computed the standard deviation (SD) of the Ct values across the 3 technical replicates. If SD \(\geq\) 1, the replicate farthest from the mean was removed; otherwise, all 3 replicates were kept. This step helps filter out poor-quality replicates.
We computed the average Ct value across the remaining technical replicates.
All samples with an average Ct value of the reference gene KLK3 above the 95th percentile were removed.
We normalized the average Ct value of each target gene by KLK3 using the formula \(-\left(\text{average Ct of gene } X - \text{average Ct of KLK3}\right)\).
Finally, z-score scaling was applied to the normalized average Ct values before downstream model development and feature selection (a code sketch of these preprocessing steps is given below).
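To make the base preprocessing pipeline concrete, the sketch below implements the steps above on a hypothetical long-format table of raw qPCR readings (one row per sample, gene, and technical replicate, with placeholder column names `sample_id`, `gene`, `ct`). It is an illustration under these assumptions rather than the code used in the study.

```python
import pandas as pd

CT_LIMIT = 35.0

# Hypothetical long-format input: one row per (sample, gene, technical replicate).
raw = pd.read_csv("raw_qpcr.csv")  # placeholder; columns: sample_id, gene, ct

# Step 1: undetected readings ("Undetermined", "Inconclusive/No Amp", Ct > 35) -> 35.
raw["ct"] = pd.to_numeric(raw["ct"], errors="coerce")  # non-numeric flags become NaN
raw["ct"] = raw["ct"].fillna(CT_LIMIT).clip(upper=CT_LIMIT)

# Step 2: if the SD across the 3 technical replicates is >= 1, drop the replicate
# farthest from the mean; otherwise keep all 3.
def filter_replicates(grp: pd.DataFrame) -> pd.DataFrame:
    if grp["ct"].std(ddof=1) >= 1:
        farthest = (grp["ct"] - grp["ct"].mean()).abs().idxmax()
        grp = grp.drop(index=farthest)
    return grp

filtered = raw.groupby(["sample_id", "gene"], group_keys=False).apply(filter_replicates)

# Step 3: average Ct across the remaining replicates (samples x genes matrix).
avg = filtered.groupby(["sample_id", "gene"])["ct"].mean().unstack()

# Step 4: drop samples whose average KLK3 Ct exceeds the 95th percentile.
avg = avg[avg["KLK3"] <= avg["KLK3"].quantile(0.95)]

# Step 5: normalize each target gene by KLK3: -(Ct_gene - Ct_KLK3).
norm = -avg.drop(columns="KLK3").sub(avg["KLK3"], axis=0)

# Step 6: z-score each normalized gene before model development.
z = (norm - norm.mean()) / norm.std(ddof=0)
```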
We refer to this data preprocessing pipeline as the base preprocessing pipeline. However, there are several alternative, but equally-reasonable ways to deal with undetectable Ct values and poor quality samples in the data preprocessing. While we cannot explore all possible preprocessing choices, we do explore a few alternatives in this work in order to improve the robustness of our model and conclusions. Namely, we considered the following alternative preprocessing pipelines:
In addition to these preprocessed gene expression data, we also have access to various clinical data for each sample. We chose to focus on the following clinical variables for model development, as they are both known to be associated with high-grade prostate cancer and generally available in clinical practice: age, race, family history of prostate cancer, abnormal DRE, prior negative biopsy, and prostate specific antigen (PSA) (Thompson et al. 2006).
For each preprocessed dataset, we trained many different statistical/machine learning models to predict high-grade prostate cancer (PCa). Specifically, we considered the following models:
Logistic-based Models:
Tree-based Models:
We focused on these logistic- and tree-based models given the importance of interpretability and our goal of identifying important genes for reliable biomarker development. While the logistic-based models are generally thought to serve as baseline models, the tree-based models can provide greater flexibility to capture more complex relationships between genes and the outcome of interest without sacrificing interpretability. We note that other flexible but interpretable machine learning models could also be considered and may be of interest in future work. However, in this work, we chose to focus first on logistic- and tree-based models; the latter are often well suited to biological tasks such as this one, in part because the thresholding behavior of decision trees resembles the on-off, switch-like behavior commonly thought to govern genetic processes (Nelson, Lehninger, and Cox 2008).
We detailed the hyperparameters and Python implementation used for each model in Table 1 of Tang et al. (2024). Hyperparameters were tuned using 5-fold cross-validation.
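As an illustration of this kind of model fitting and 5-fold cross-validated tuning, one possible scikit-learn setup is sketched below. The hyperparameter grids, the subset of models shown, and the variable names (`X_train`, `y_train`) are placeholders; the actual grids and implementations are those listed in Table 1 of Tang et al. (2024).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate models and illustrative (not actual) hyperparameter grids.
models = {
    "logistic": (LogisticRegression(max_iter=5000), {}),
    "logistic_elastic_net": (
        LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
        {"C": [0.01, 0.1, 1, 10], "l1_ratio": [0.25, 0.5, 0.75]},
    ),
    "rf": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [250, 500], "max_features": ["sqrt", 0.33]},
    ),
}

# X_train / y_train are assumed to hold the preprocessed gene + clinical features
# and the high-grade PCa labels for one Development split.
fitted = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_
```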
As the first step in the sMPS2 model development pipeline, we performed a prediction check to filter out models that have poor prediction performance and thus may not accurately reflect reality (Yu and Kumbier 2020). Here, guided by the PCS framework, we use prediction performance as a reality check and a minimum requirement for interpretability. Moreover, we assess the prediction performance not only for different models but also for different data preprocessing pipelines. Examining multiple prediction metrics (i.e., area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and classification accuracy), we found that:
We defer additional methodological details on how this prediction check was conducted to Tang et al. (2024).
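A minimal sketch of how the three screening metrics could be computed for one fitted model on one validation fold is given below; the function name and the 0.5 probability threshold for classification accuracy are illustrative assumptions, and the full looping over preprocessing pipelines, models, CV folds, and repeated Development-Test splits is described in Tang et al. (2024).

```python
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

def prediction_check_metrics(model, X_val, y_val, threshold=0.5):
    """AUROC, AUPRC, and accuracy for one fitted model on one validation fold."""
    prob = model.predict_proba(X_val)[:, 1]
    return {
        "auroc": roc_auc_score(y_val, prob),
        "auprc": average_precision_score(y_val, prob),
        "accuracy": accuracy_score(y_val, (prob >= threshold).astype(int)),
    }

# Pipeline/model combinations whose metrics fall well below the rest would be
# screened out before the downstream gene-ranking analyses.
```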
Figure 5.1: (Left) For each choice of data preprocessing and prediction model, the validation AUROC, averaged across 4 CV folds and 10 repeated Development-Test splits, is shown. The error bars represent the inner 95% quantile range of the distribution of AUROCs. (Middle, Right) We compare the variation in AUROC across data preprocessing pipelines and methods. In the middle subplot, we show the range of mean AUROCs across the four data preprocessing pipelines for each method. In the right subplot, we show the difference between the mean AUROC from each method and the best performing method (i.e., logistic regression with the elastic net penalty) across all data preprocessing pipelines. The difference in AUROCs across data preprocessing pipelines is substantially smaller than that across prediction methods, suggesting that the development pipeline and downstream findings are robust to data preprocessing choices.
Figure 5.2: (Left) For each choice of data preprocessing and prediction model, the validation AUPRC, averaged across 4 CV folds and 10 repeated Development-Test splits, is shown. The error bars represent the inner 95% quantile range of the distribution of AUPRCs. (Middle, Right) We compare the variation in AUPRC across data preprocessing pipelines and methods. In the middle subplot, we show the range of mean AUPRCs across the four data preprocessing pipelines for each method. In the right subplot, we show the difference between the mean AUPRC from each method and the best performing method (i.e., logistic regression with the elastic net penalty) across all data preprocessing pipelines. The difference in AUPRCs across data preprocessing pipelines is substantially smaller than that across prediction methods, suggesting that the development pipeline and downstream findings are robust to data preprocessing choices.
Figure 5.3: (Left) For each choice of data preprocessing and prediction model, the validation accuracy, averaged across 4 CV folds and 10 repeated Development-Test splits, is shown. The error bars represent the inner 95% quantile range of the distribution of accuracies. (Middle, Right) We compare the variation in accuracy across data preprocessing pipelines and methods. In the middle subplot, we show the range of mean accuracies across the four data preprocessing pipelines for each method. In the right subplot, we show the difference between the mean accuracy from each method and the best performing method (i.e., logistic regression with the elastic net penalty) across all data preprocessing pipelines. The difference in accuracies across data preprocessing pipelines is substantially smaller than that across prediction methods, suggesting that the development pipeline and downstream findings are robust to data preprocessing choices.
After filtering out the poor-performing prediction models, we next sought to identify the most important genes, i.e., those that were stably important across all four data preprocessing pipelines, six prediction-checked models, and ten Development-Test splits (\(4 \times 6 \times 10 = 240\) combinations). Details on how we computed the gene importances for each model are discussed in Tang et al. (2024).
We instead use this opportunity to conduct a stability analysis of the PCS-ensembled gene rankings. As discussed in Tang et al. (2024), the PCS-ensembled gene rankings are an ensemble of gene rankings across four different data preprocessing pipelines and six prediction-checked models (RF, RF+, logistic elastic net, logistic LASSO, logistic ridge, and ordinary logistic regression). We chose to use these six prediction models since each passed the prediction check. However, it is natural to wonder whether the PCS-ensembled gene rankings would change if a different subset of the prediction-checked models were used. In particular, since the original set of prediction models consisted of four logistic-based models and two tree-based models, we investigated how the PCS-ensembled gene rankings would change if we instead used a "balanced" set of prediction models, composed of two logistic-based models and two tree-based models.
Below, in the Gene Ranking Summary tab, we examine the PCS-ensembled gene rankings, ensembled across all four data preprocessing pipelines and each candidate set of prediction models (i.e., the original six-model ensemble and the three balanced four-model subsets; see Figures 6.1-6.4).
Here, we chose to always include logistic elastic net in the PCS ensemble as it demonstrated the highest predictive power in the prediction check step.
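To illustrate one way this ensembling and the balanced-subset stability analysis could be computed, the sketch below assumes a hypothetical long-format table of per-gene ranks with columns `pipeline`, `model`, `split`, `gene`, and `rank`; the simple rank averaging and the model identifiers are assumptions for illustration, not the exact procedure of Tang et al. (2024).

```python
import pandas as pd

# Hypothetical long-format table of gene ranks, one row per
# (preprocessing pipeline, model, Development-Test split, gene).
ranks = pd.read_csv("gene_ranks.csv")  # placeholder path and column names

def ensemble_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Ensembled summary: mean rank, SD of ranks, and proportion of times in the top k."""
    grouped = df.groupby("gene")["rank"]
    out = grouped.agg(mean_rank="mean", sd_rank="std")
    for k in (5, 10, 17):
        out[f"prop_top_{k}"] = grouped.apply(lambda r, k=k: (r <= k).mean())
    return out.sort_values("mean_rank")

# Full ensemble across all four pipelines and six prediction-checked models.
full_summary = ensemble_summary(ranks)

# "Balanced" subsets: logistic elastic net plus one other logistic model, with RF and RF+.
for other in ("logistic", "logistic_lasso", "logistic_ridge"):
    subset = ranks[ranks["model"].isin(["logistic_elastic_net", other, "rf", "rf_plus"])]
    print(other, ensemble_summary(subset).head(10).index.tolist())
```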
Takeaways from this stability analysis of the PCS-ensembled gene rankings:
To supplement this stability analysis, we also provide a more granular view of the gene rankings per data preprocessing pipeline and model in the Gene Ranking Heatmap tab. These heatmaps showcase both genes that are stably important across all data preprocessing pipelines and methods and genes that are stably important across only a subset of data preprocessing pipelines and/or methods. In particular, these heatmaps confirm that the logistic regression model drives much of the instability that we observed previously in the PCA3 gene rankings.
Figure 6.1: Summary of the gene importance rankings, as measured by their mean gene ranking across four data preprocessing pipelines and six prediction-checked models (Logistic, Logistic Elastic Net, Logistic Lasso, Logistic Ridge, RF, RF+), the variability of their gene rankings as measured by the standard deviation (SD) of this distribution, and the proportion of times that each gene appeared in the top 5, 10, and 17 genes. The six genes highlighted in dark teal were used in the s7MPS2 model. The APOC1 gene, highlighted in light teal, was used in the s8MPS2 model.
Figure 6.2: Summary of the gene importance rankings, as measured by their mean gene ranking across four data preprocessing pipelines and four prediction-checked models (Logistic Elastic Net, Logistic Ridge, RF, RF+), the variability of their gene rankings as measured by the standard deviation (SD) of this distribution, and the proportion of times that each gene appeared in the top 5, 10, and 17 genes. The six genes highlighted in dark teal were used in the s7MPS2 model. The APOC1 gene, highlighted in light teal, was used in the s8MPS2 model.
Figure 6.3: Summary of the gene importance rankings, as measured by their mean gene ranking across four data preprocessing pipelines and four prediction-checked models (Logistic Elastic Net, Logistic Lasso, RF, RF+), the variability of their gene rankings as measured by the standard deviation (SD) of this distribution, and the proportion of times that each gene appeared in the top 5, 10, and 17 genes. The six genes highlighted in dark teal were used in the s7MPS2 model. The APOC1 gene, highlighted in light teal, was used in the s8MPS2 model.
Figure 6.4: Summary of the gene importance rankings, as measured by their mean gene ranking across four data preprocessing pipelines and four prediction-checked models (Logistic, Logistic Elastic Net, RF, RF+), the variability of their gene rankings as measured by the standard deviation (SD) of this distribution, and the proportion of times that each gene appeared in the top 5, 10, and 17 genes. The six genes highlighted in dark teal were used in the s7MPS2 model. The APOC1 gene, highlighted in light teal, was used in the s8MPS2 model.
Figure 6.5: Heatmap of the mean gene ranking (averaged across 10 Development-Test splits) per data preprocessing pipeline and model choice.
Figure 6.6: Heatmap of the gene ranking per data preprocessing pipeline, model, and Development-Test split. Each row corresponds to a different Development-Test split for the data preprocessing pipeline and model choice labeled on the right.
We next assessed the impact of the choice of gene panel size (i.e., the number of top-ranked genes used in the sMPS2 model) and the gene ranking strategy (i.e., model-specific versus model-ensembled versus PCS-ensembled) on the prediction performance, evaluated on the test set (from the Development-Test split). Test prediction metrics are averaged across 10 different Development-Test splits. In this section, we summarize these test prediction results (measured via AUROC, AUPRC, and classification accuracy) across the different gene panel sizes, gene ranking strategies, data preprocessing pipelines, and model choices. In general, the results suggest that 6 or 7 genes are sufficient to achieve competitive prediction performance, and that the PCS-ensembled gene ranking strategy is the most robust across different data preprocessing pipelines and model choices.
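As a rough sketch of this panel-size sweep, the function below refits a model on the top-k genes from a given ranking (plus the clinical covariates) and scores it on the held-out Test split; the choice of a logistic elastic net refit, the fixed penalty settings, and the variable names are illustrative assumptions rather than the exact procedure used in the study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def panel_size_sweep(ranking, X_dev, y_dev, X_test, y_test, clinical_cols, max_k=15):
    """Test AUROC as a function of the number of top-ranked genes in the panel."""
    aurocs = {}
    for k in range(1, max_k + 1):
        cols = list(ranking[:k]) + list(clinical_cols)  # top-k genes + clinical variables
        model = LogisticRegression(penalty="elasticnet", solver="saga",
                                   l1_ratio=0.5, C=1.0, max_iter=5000)
        model.fit(X_dev[cols], y_dev)
        aurocs[k] = roc_auc_score(y_test, model.predict_proba(X_test[cols])[:, 1])
    return aurocs
```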
In Tang et al. (2024), we additionally conducted and detailed an external validation study, which confirmed the strong prediction accuracy of the sMPS2 models. However, given the blinded nature of the external validation study, the data is not accessible by the co-first authors for use in this PCS documentation. We refer the reader to the original publication for details.
Figure 7.1: Mean AUROC, evaluated on the test set, when training various models (rows) using various choices of gene panel sizes (x-axis), data preprocessing pipelines (columns), and gene rankings (color). The PCS-ensembled gene rankings (in black) generally yield the highest test AUROCs compared to other procedures for obtaining the gene rankings. Moreover, using 6 or 7 predictor genes (vertical dotted and dashed lines, respectively) yields very competitive test prediction performance, often comparable to the highest achieved AUROC.
Figure 7.2: Mean AUPRC, evaluated on the test set, when training various models (rows) using various choices of gene panel sizes (x-axis), data preprocessing pipelines (columns), and gene rankings (color). The PCS-ensembled gene rankings (in black) generally yield the highest test AUPRCs compared to other procedures for obtaining the gene rankings. Moreover, using 6 or 7 predictor genes (vertical dotted and dashed lines, respectively) yields very competitive test prediction performance, often comparable to the highest achieved AUPRC.
Figure 7.3: Mean accuracy, evaluated on the test set, when training various models (rows) using various choices of gene panel sizes (x-axis), data preprocessing pipelines (columns), and gene rankings (color). The PCS-ensembled gene rankings (in black) generally yield the highest test accuracies compared to other procedures for obtaining the gene rankings. Moreover, using 6 or 7 predictor genes (vertical dotted and dashed lines, respectively) yields very competitive test prediction performance, often comparable to the highest achieved accuracy.
In this PCS documentation (Yu and Kumbier 2020), we have shed additional light on the various decisions that were made throughout the development of the sMPS2 model and have justified many of these choices to the best of our ability. While we acknowledge that other equally-reasonable choices could have been made, we hope that this documentation will be a useful resource for researchers and clinicians who are interested in building upon this work.