1 Overview

In Q. Wang et al. (2023), we developed an end-to-end pipeline to demonstrate the role of epistasis in the genetic control of cardiac hypertrophy. This pipeline consisted of four major phases: (1) the derivation of estimates of left ventricular mass via deep learning, (2) the computational prioritization of epistatic drivers based on data from the UK Biobank (Bycroft et al. 2018), (3) functional interpretation of the hypothesized epistatic genetic loci, and (4) experimental confirmation of epistasis in cardiac transcriptome and cardiac cellular morphology.

Here, we expand upon the computational prioritization of epistatic drivers phase. In this phase, we leveraged large-scale genetic and cardiac MRI data from the UK Biobank and developed the low-signal signed iterative random forest (lo-siRF) to prioritize genetic loci and interactions between loci for downstream investigations. As lo-siRF is heavily rooted in the Predictability, Computability, and Stability (PCS) framework for veridical (trustworthy) data science (Yu and Kumbier 2020), we conducted extensive stability analyses to help minimize the impact of human judgment calls and modeling decisions on our scientific conclusions. In this supplementary PCS documentation, we will:

  1. Transparently document and justify the many human judgment calls and modeling decisions that were inevitably made throughout our statistical analysis.
  2. Provide additional exploration and insight into the main results presented in Q. Wang et al. (2023) alongside the stability analyses.

In what follows, we organize this documentation by step in the lo-siRF pipeline (see Q. Wang et al. (2023)), beginning with the choice of phenotypic data, followed by the dimension reduction step, binarization step, prediction step, and concluding with the prioritization step. We then investigate the possibility of pleiotropy between LV mass and blood pressure, conducting a stability analysis where we examine the impact of excluding hypertensive individuals from the lo-siRF analysis. Lastly, we compare the lo-siRF prioritized loci and interactions between loci with those identified using existing methods and conclude with final remarks.

Origins and intended usage of lo-siRF: As discussed in Q. Wang et al. (2023), lo-siRF was originally developed as a first-stage recommendation (or hypothesis generation) tool to prioritize genetic loci and interactions between loci for downstream analyses and experimental validation. Importantly, the output of lo-siRF is a prioritization or ranking order of loci and interactions between loci, not a measure of statistical significance. We emphasize that recommendations or prioritizations from lo-siRF should not be interpreted as causal findings without further investigation. Given the numerous challenges inherent to epistasis discovery (e.g., high heterogeneity, small effect sizes, and unmeasured confounding factors, discussed further in Q. Wang et al. (2023)), assessing causality solely from data is incredibly complex and difficult to do reliably. Lo-siRF was thus designed and intended to be used as a single step within a larger scientific discovery pipeline. In Q. Wang et al. (2023), we relied on extensive follow-up investigations to validate the lo-siRF-prioritized genetic loci and interactions between loci, and we encourage practitioners to do the same.

2 Choice of Phenotypic Data

In the first step of lo-siRF, we began with a careful choice of phenotypic data. This first, but often overlooked, choice is critical to the reliability and interpretation of results. If the data quality is poor, conclusions may be unreliable. If the data is far-removed from the domain problem of interest, conclusions may not be relevant. Given the importance of this choice of phenotypic data, we conducted extensive data explorations for different phenotypes in addition obtaining clinician input regarding its clinical relevance.

Initial choice of phenotype: Our overarching goal was to study the role of epistasis in cardiac hypertrophy. One important disease that is closely related to cardiac hypertrophy is hypertrophic cardiomyopathy (HCM). HCM is a common heart disease, impacting around 1 in 500 people, where the heart muscle becomes enlarged or thickened. Given the high prevalence and possible life-threatening consequences of this disease, we first set out to explore this HCM phenotype, defined as a 0-1 indicator variable of whether or not the individual was diagnosed with HCM (i.e., any ICD10 code of I42.1 or I42.2 in the UK Biobank). However, when fitting various machine learning prediction models, including L1 or L2-regularized logistic regression, random forest (RF), iterative RF (iRF), and support vector machines, using single-nucleotide variant (SNV) data from the UK Biobank (details in Q. Wang et al. (2023)) to predict HCM diagnosis, the balanced classification accuracy on the held-out validation data did not consistently exceed 50% when trained on different bootstrapped training samples. In other words, the fitted prediction models were no better than random guessing. This made us wary of any interpretations extracted from these prediction models since it is unclear whether these models are capturing any relevant phenotypic signal or simply noise. Given this incredibly weak prediction signal, we deemed that these models did not pass the prediction check (the “P” in PCS) (Yu and Kumbier 2020). For this reason, we chose not to proceed further with the HCM phenotype and did not conduct any analysis with respect to interpreting the aforementioned models for HCM.

Pivoting to a cardiac MRI-derived phenotype: We instead pivoted to an alternative phenotype for cardiac hypertrophy based upon deep-learning-enabled left ventricular (LV) structural analysis using cardiac MRIs. This choice was heavily motivated by our struggles with HCM. Specifically, when looking back at the HCM diagnosis data, we noticed that the number of HCM cases in the UK Biobank data (only 236 HCM cases out of 337,535 unrelated White British individuals) was much lower than the 1 in 500 ratio that is expected. Although the UK Biobank population is known to be healthier than the general population (Fry et al. 2017), the observed discrepancy is large, leading us to speculate about potential under-diagnosis issues with the HCM phenotype. Consequently, rather than relying on a clinician’s diagnosis which may suffer from issues such as under-diagnosis, we chose to build upon recent advances in deep-learning-enabled phenotyping (Bai et al. 2018). Doing so enables us to extract a measure of left ventricular hypertrophy, specifically, the left ventricular mass (LVM), from anyone with a cardiac MRI.

Arriving at LVMi: However, LVM is known to be heavily influenced by body weight, height, and gender (Engel, Schwartz, and Homma 2016; Mizukoshi et al. 2016). Thus, if we were to proceed solely with LVM in our analysis, we risk identifying genetic loci and interactions between loci that are primarily associated with body weight, height, and gender, rather than left ventricular hypertrophy. Indeed, if we visualize the relationship between LVM, height, and weight within each gender group (Figure 2.1), we see strong positive correlations.

This motivates the need to adjust for height and weight in order to help decouple the effects of height/weight and LVM. As done in Khurshid et al. (2023), one common way to address this is by analyzing LVM indexed by body surface area (LVMi), defined as:

\[\begin{align*} \text{LVMi} = \text{LVM } / \text{ Body surface area}, \end{align*}\] where body surface area is calculated based on height and body weight via the Du Bois formula (Du Bois and Du Bois 1916).

After normalizing by body surface area (and hence, height and weight), the associations between LVMi and height/weight are much weaker within each gender group, exhibiting correlations closer to 0 (Figure 2.1). Thus, by focusing our statistical analysis on LVMi, we lessen the possibility that the identified genetic loci and interactions between loci are only selected because of their association with body height and/or weight. Still, there are differences in LVMi between males and females, with males tending to have higher LVMi measurements. As seen later, we will account for this in the modeling stage by using gender-specific binarization thresholds to binarize the LVMi responses into the high and low LVMi groups.

Study Cohort: Before proceeding, we note that another important human judgment call within our pipeline is the choice of study cohort. Specifically, we restricted our analysis to the unrelated White British individual population with both SNV and cardiac MRI data in the UK Biobank, resulting in a cohort of 29,661 individuals. We chose to focus this analysis on the unrelated White British population, anticipating the very low-signal problem under study (e.g., due to the generally small effect sizes of common genetic variants and the high phenotypic diversity of cardiac hypertrophy). By studying a more homogeneous cohort such as the unrelated White British population, we hope to first establish a proof-of-concept foothold on the role of epistasis in cardiac hypertrophy, and we view the study and generalization to the broader population as a very crucial next step.

Pair plot, illustrating the relationships between LVM, LVMi, height, and weight within each gender group (males in green and females in orange). Density distributions are shown along the diagonal.

Figure 2.1: Pair plot, illustrating the relationships between LVM, LVMi, height, and weight within each gender group (males in green and females in orange). Density distributions are shown along the diagonal.

3 Dimension Reduction Step

Overview

Having carefully chosen the phenotypic data (i.e., the cardiac MRI-derived LVMi) under study, the next step in lo-siRF is to do dimension reduction to reduce the number of SNVs under consideration using a genome-wide association study (GWAS). This dimension reduction step is necessary to mitigate the computational and statistical challenges of searching across millions of possible SNVs (which in the case of searching for epistasis results in an even larger combinatorial explosion of possibilities).

Why GWAS? We chose to do dimension reduction by running a GWAS as it (1) has been shown to be an incredibly useful tool for better understanding the genetic basis of numerous diseases in this field (Visscher et al. 2017; Pirruccello et al. 2020; Meyer et al. 2020; Harper et al. 2021; Khurshid et al. 2023), (2) is one of the few methods that can handle an input of millions of SNVs in a computationally-tractable manner, and (3) is a common, widely-accepted preprocessing step in other related statistical genomics tasks such as fine-mapping (Schaid, Chen, and Larson 2018).

Limitations: We acknowledge, however, that using a GWAS to reduce the number of SNVs under consideration has limitations. Because a GWAS prioritizes SNVs that have strong marginal associations with the phenotype under study, we may be removing SNVs that are associated with LVMi through an epistatic interaction but not through a marginal association. Alleviating this limitation is certainly interesting and an important direction to pursue in future work. Regardless, in this current work, we use lo-siRF primarily as a way to generate recommendations for experimental validation of epistasis. We thus are most interested in generating reliable recommendations for a limited number of experiments, rather than attempting to find all possible interactions that drive the disease. Given our particular use-case for lo-siRF, we accept the limitations of using a GWAS for dimension reduction and again view this work as a reliable first step (and certainly not the last) for better understanding the role of epistasis in cardiac hypertrophy.

How was the dimension reduction performed? We performed a GWAS on the training data for the rank-based inverse normal-transformed LVMi using two algorithms, PLINK (Purcell et al. 2007) and BOLT-LMM (Loh et al. 2015). For each of the two GWAS runs, we ranked the SNVs by significance (i.e., the GWAS p-value). We then took the union of the top 1000 SNVs (without clumping) from each of the two GWAS runs and proceeded with this resulting set of 1405 SNVs for the remainder of the lo-siRF pipeline. We refer to Q. Wang et al. (2023) for the details on the PLINK and BOLT-LMM implementations and use this document instead to justify several of our human judgment calls in this dimension reduction step.

  • Why PLINK and BOLT-LMM? Since BOLT-LMM and PLINK rely on different statistical models and it is unclear a priori which model is better suited for our problem, we employed both software packages to mitigate the dependence of downstream conclusions on this arbitrary choice of model.
  • Why the rank-based inverse normal-transformed LVMi? Because the distribution of LVMi measurements is right skewed (see Figure 2.1), and the p-value computation in PLINK and BOLT-LMM rely on normality assumptions, the rank-based inverse normal-transform was taken as in Khurshid et al. (2023).
  • Why was the union taken? As mentioned previously, relying on a GWAS to do dimension reduction can be limiting since it is primarily assessing marginal associations between the SNVs and the phenotype. There are also other restrictive modeling assumptions (e.g., linearity) that may add to these limitations. Due to this concern that we may be imposing too many unnecessary restrictions in order for SNVs to pass the GWAS filter, we chose to take the union of the top SNVs from BOLT-LMM and from PLINK, rather than taking the intersection which would have made it more difficult or restrictive for SNVs to pass the GWAS filter.
  • How was the threshold chosen? We chose the threshold of 1000 SNVs per GWAS method (without clumping) as it (1) struck a balance between the amount of information loss and the computational cost of our downstream modeling and (2) yielded the highest validation prediction accuracy compared to choosing other possible thresholds (500 and 2000) with and without clumping in the prediction modeling stage. We note that this choice of the top 1000 SNVs includes all SNVs that passed the genome-wide significance level (p-value = 5E-8) as well as additional SNVs that did not pass the genome-wide significance level. In particular, taking the top 1000 SNVs corresponds to using the p-value threshold of 1.7E-5 and 6.7E-6 for PLINK and BOLT-LMM, respectively.

Remark: We note that alternative dimension reduction strategies may be used in replacement of our proposed procedure. For example, one may consider using a different GWAS software (e.g., REGENIE (Mbatchou et al. 2021)), a different phenotype transformation, or perhaps a more sophisticated approach tailored for interactions and epistasis detection (e.g., MAPIT (Crawford et al. 2017)).

Organization: Under the GWAS Results Summary tab, we provide an investigation into the BOLT-LMM and PLINK GWAS results for LVM and LVMi. Under the GWAS-filtered SNVs, we explore the selected 1405 SNVs that passed the GWAS filter for the LVMi phenotype.

GWAS Results Summary

Below, we provide an overview of the genome-wide association study (GWAS) results for both the LVM and LVMi phenotypes. While LVMi is the primary phenotype of interest, we provide the LVM GWAS results for completeness. We further investigate the similarities in findings between the LVM and LVMi GWAS analyses via a stability analysis.

  • Under the BOLT-LMM and PLINK tabs:
    • We provide a brief look into the GWAS results, including the Manhattan plots for each phenotype (LVMi and LVM).
    • We also provide a table of the genomic risk loci, lead SNVs, and independent significant SNVs from each GWAS, determined using FUMA (Watanabe et al. 2017) with the default settings (i.e., maximum p-value of lead SNVs = 5E-8, maximum p-value cutoff = 0.05, \(r^2\) threshold to define independent significant SNVs = 0.6, 2nd \(r^2\) threshold to define lead SNVs = 0.1, minimum minor allele frequency = 0, maximum distance between LD blocks to merge into a locus = 250kb, using the UKB release2b 10k White British reference panel population).
    • Furthermore, we used FUMA to map SNVs to genes based on position, again with the default settings (maximum distance = 10kb), and provide this table of mapped genes below.
  • Lastly, under the Stability Analysis tab, we investigated the similarities and differences between the BOLT-LMM and PLINK results as well as between the LVM and LVMi results.

BOLT-LMM

LVMi

GWAS Manhattan plot for the LVMi phenotype using BOLT-LMM. Red dashed line indicates the genome-wide significance level (p = 5E-8).

Figure 3.1: GWAS Manhattan plot for the LVMi phenotype using BOLT-LMM. Red dashed line indicates the genome-wide significance level (p = 5E-8).

Genomic Risk Loci

Figure 3.2: List of genomic risk loci from FUMA for the LVMi BOLT-LMM GWAS. uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the top lead SNV for the given genomic locus. chr = Chromosome of top lead SNV. pos = Position of top lead SNV on hg19. p = P-value of top lead SNV. start = Start position of the locus. end = End position of the locus. nGWASSNVs = Number of the GWAS-tagged candidate SNVs within the genomic locus. nIndSigSNVs = Number of the independent significant SNVs in the genomic locus. IndSigSNVs = rsID of independent significant SNVs in the genomic locus. nLeadSNVs = The number of lead SNVs in the genomic locus. LeadSNVs = rsID of lead SNVs in the genomic locus.

Lead SNVs

Figure 3.3: List of lead SNVs from FUMA for the LVMi BOLT-LMM GWAS. GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nIndSigSNVs = Number of independent significant SNVs which are in LD of the lead SNV at \(r^2\) = 0.1. IndSigSNVs = rsID of independent significant SNVs which are in LD of the lead SNV at \(r^2\) = 0.1.

Independent Significant SNVs

Figure 3.4: List of independent significant SNVs from FUMA for the LVMi BOLT-LMM GWAS. GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nGWASSNVs = Number of GWAS-tagged SNVs which are in LD of the independent significant SNV given \(r^2\) = 0.6.

Mapped Genes

Figure 3.5: List of mapped genes (via positional mapping) from FUMA for the LVMi BOLT-LMM GWAS. ensg = ENSG ID. symbol = Gene Symbol. chr = Chromosome. start = Starting position of the gene. end = Ending position of the gene. strand = Strand of the gene. type = Gene biotype from Ensembl. entrezID = entrez ID (if available). HUGO = HUGO (HGNC) gene symbol. pLI = pLI score (the probability of being loss-of-function intolerant) from ExAC database; the higher the score is, the more intolerant to loss-of-function mutations the gene is. ncRVIS = Non-coding residual variation intolerance score; the higher the score is, the more intolerant to non-coding variation the gene is. posMapSNVs = Number of SNVs mapped to gene based on positional mapping. posMapMaxCADD = The maximum CADD score of mapped SNVs by positional mapping. minGwasP = The minimum P-value of mapped SNVs. IndSigSNVs = rsID of the independent significant SNVs that are in LD with the mapped SNVs. GenomicLocus = Index of genomic loci where mapped SNVs are from.

LVM

GWAS Manhattan plot for the LVM phenotype using BOLT-LMM. Red dashed line indicates the genome-wide significance level (p = 5E-8).

Figure 3.6: GWAS Manhattan plot for the LVM phenotype using BOLT-LMM. Red dashed line indicates the genome-wide significance level (p = 5E-8).

Genomic Risk Loci

Figure 3.7: List of genomic risk loci from FUMA for the LVM BOLT-LMM GWAS. uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the top lead SNV for the given genomic locus. chr = Chromosome of top lead SNV. pos = Position of top lead SNV on hg19. p = P-value of top lead SNV. start = Start position of the locus. end = End position of the locus. nGWASSNVs = Number of the GWAS-tagged candidate SNVs within the genomic locus. nIndSigSNVs = Number of the independent significant SNVs in the genomic locus. IndSigSNVs = rsID of independent significant SNVs in the genomic locus. nLeadSNVs = The number of lead SNVs in the genomic locus. LeadSNVs = rsID of lead SNVs in the genomic locus.

Lead SNVs

Figure 3.8: List of lead SNVs from FUMA for the LVM BOLT-LMM GWAS. GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nIndSigSNVs = Number of independent significant SNVs which are in LD of the lead SNV at \(r^2\) = 0.1. IndSigSNVs = rsID of independent significant SNVs which are in LD of the lead SNV at \(r^2\) = 0.1.

Independent Significant SNVs

Figure 3.9: List of independent significant SNVs from FUMA for the LVM BOLT-LMM GWAS. GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nGWASSNVs = Number of GWAS-tagged SNVs which are in LD of the independent significant SNV given \(r^2\) = 0.6.

Mapped Genes

Figure 3.10: List of mapped genes (via positional mapping) from FUMA for the LVM BOLT-LMM GWAS. ensg = ENSG ID. symbol = Gene Symbol. chr = Chromosome. start = Starting position of the gene. end = Ending position of the gene. strand = Strand of the gene. type = Gene biotype from Ensembl. entrezID = entrez ID (if available). HUGO = HUGO (HGNC) gene symbol. pLI = pLI score (the probability of being loss-of-function intolerant) from ExAC database; the higher the score is, the more intolerant to loss-of-function mutations the gene is. ncRVIS = Non-coding residual variation intolerance score; the higher the score is, the more intolerant to non-coding variation the gene is. posMapSNVs = Number of SNVs mapped to gene based on positional mapping. posMapMaxCADD = The maximum CADD score of mapped SNVs by positional mapping. minGwasP = The minimum P-value of mapped SNVs. IndSigSNVs = rsID of the independent significant SNVs that are in LD with the mapped SNVs. GenomicLocus = Index of genomic loci where mapped SNVs are from.

Stability Analysis

To further investigate the choice of GWAS method and phenotype, we conducted a deeper exploration into the similarities and differences between the BOLT-LMM and PLINK results as well as between the LVM and LVMi results in the stability analysis below. The primary goal of this stability analysis is to answer the following two questions:

  1. Are there large differences between the GWAS p-values from BOLT-LMM and PLINK?
  • Figure 3.21: For each SNV, we plot the p-value (log-transformed) obtained via BOLT-LMM (x-axis) and PLINK (y-axis) for each of the LVM (left) and LVMi (right) phenotypes. We see that the most significant GWAS hits tend to be stable, no matter the choice of algorithm, but as the strength of the association decreases, there is greater variation in the p-value between BOLT-LMM and PLINK. This is to say that there is more stability around the SNVs with the strongest associations with LVMi and less stability around the SNVs with moderate or weak associations with LVMi. Another reason for taking the union of the top 1000 SNVs from each GWAS algorithm in the lo-siRF pipeline was to allow these moderate to weak association SNVs the chance of being identified in a multivariate model (i.e., via iterative random forest).
  1. What are the FUMA GWAS findings that are stable across GWAS methods (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi)? As advocated by the stability principle (“S” in PCS) (Yu and Kumbier 2020; Yu 2013), we view stability across these arbitrary choices as a minimum requirement for making a scientific discovery.
  • In Figure 3.22, we summarize the SNVs, identified by FUMA, to be the top lead SNVs (per genomic risk locus), lead SNVs, and/or independent significant SNVs for each GWAS method (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi). Some takeaways here:
    • rs3045696, an intronic TTN variant, is the only stable top lead SNV, identified across all four GWAS runs.
    • Besides rs3045696, rs1873164, an intronic CCDC141 variant, is the only stable lead SNV, identified by FUMA across all four GWAS runs.
    • rs7591091 (an intronic CCDC141 variant with the highest frequency of occurrence in lo-siRF, seen later), along with rs967507, rs12622500, and rs10178003, were stably identified as independent significant SNVs by FUMA across all four GWAS runs.
    • Two other SNVs (rs62178977, rs73973171) were also identified as independent significant SNVs by FUMA using both BOLT-LMM and PLINK for the LVMi phenotype.
  • In Figure 3.23, we summarize the genes that were positionally-mapped and prioritized by FUMA across the different GWAS methods (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi).
    • The genes that were stably mapped by FUMA across all four GWAS runs are PRKRA, DFNB59, FKBP7, PLEKHA3, TTN, and CCDC141. We note that all of these genes are located near each other on chromosome 2 within ~600kb. The start position of PRKRA is 179296141 while the end position of CCDC141 is 179914813, with all other aforementioned genes in between.
    • No other genes were identified when using both BOLT-LMM and PLINK for the LVMi phenotype.
Comparison of the BOLT-LMM and PLINK GWAS p-values for each phenotype (LVM left, LVMi right). Each point represents a SNV with its BOLT-LMM p-value on the x-axis and its PLINK p-value on the y-axis.

Figure 3.21: Comparison of the BOLT-LMM and PLINK GWAS p-values for each phenotype (LVM left, LVMi right). Each point represents a SNV with its BOLT-LMM p-value on the x-axis and its PLINK p-value on the y-axis.

Summary of annotated SNVs from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).

Figure 3.22: Summary of annotated SNVs from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).

Summary of mapped genes from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).

Figure 3.23: Summary of mapped genes from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).

GWAS-filtered SNVs

In this section, we briefly explore the selected 1405 SNVs that passed the GWAS filter for the LVMi phenotype. We make two observations that will impact how we interpret the lo-siRF fit downstream:

  • There are strong correlations within blocks of neighboring SNVs due to linkage disequilibrium (LD) (Figure 3.24). This is a large contributing factor for why we ultimately extracted feature importances from the lo-siRF fit at the level of genetic loci, rather than for each individual SNV.
  • There are large differences in the number of SNVs that passed the GWAS filter from each genetic loci (Figure 3.25). We hence must account for this in the lo-siRF locus-level importance score.

We will expand upon this locus-level importance score in a later section (see Prioritization Step).

Additional notes:

  • Given the interest in CCDC141 and TTN in our work and their close proximity to each other on chromosome 2, we point out a couple pairwise (Pearson) correlations of particular relevance:
    • \(| Cor(\text{rs7591091}, \text{ rs66733621}) | = 0.35\)
    • \(| Cor(\text{rs7591091}, \text{ rs3045696}) | = 0.34\)
    • rs7591091 is the top SNV from the CCDC141 locus in lo-siRF and is one of the top CCDC141 GWAS hits (in fact, is identified to be a stable independent significant SNV (see GWAS Results Summary tab)). rs3045696, an intronic TTN SNV, was stably identified to be a top lead SNV in the GWASes, and rs66733621 is the top SNV from the TTN locus in lo-siRF. This is to say that the correlation between the CCDC141 and TTN loci is moderate and may possibly suggest independent roles of CCDC141 and TTN in their detected interaction. A further discussion of this is provided in Q. Wang et al. (2023). [Note: we use the term top SNV from a genetic locus in lo-siRF to mean the SNV in the specified genetic locus with the highest frequency of occurrence across decision paths in the siRF fit.]

LVMi - Correlation Between GWAS-filtered SNVs

Correlation heatmap for the GWAS-filtered SNVs. We plot the magnitude of the Pearson correlation between all possible pairs of SNVs that passed the GWAS filter. Darker red indicates higher correlation between the two SNVs. SNVs along the axes are ordered according to genomic location. Loci with more than two constituent SNVs are labeled.

Figure 3.24: Correlation heatmap for the GWAS-filtered SNVs. We plot the magnitude of the Pearson correlation between all possible pairs of SNVs that passed the GWAS filter. Darker red indicates higher correlation between the two SNVs. SNVs along the axes are ordered according to genomic location. Loci with more than two constituent SNVs are labeled.

LVMi - Table of GWAS-filtered SNVs

Figure 3.25: Number of SNVs per genetic locus that passed the GWAS filter. Note: a semicolon is used to denote the intergenic region between two genes.

4 Binarization Step

Overview

In the next step of lo-siRF, we chose to binarize the (continuous) LVMi phenotype measurements into two categories: a low LVMi group and a high LVMi group before fitting the prediction models.

Why did we choose to binarize? One of our main motivating factors to pursue binarization revolved around the prediction check criteria in the PCS framework (Yu and Kumbier 2020). This predictability principle advocates for the use of prediction accuracy as a reality check. This is to say that at a minimum, models should fit the data well, as measured by prediction accuracy, before believing that the model is capturing something relevant to reality or the real-world phenomena under study. In particular, before we thought to binarize the phenotype, we first attempted to train prediction models using the GWAS-filtered SNVs to predict the original (continuous) LVMi phenotype measurements. While this seemed like the most natural next step, these prediction results illuminated an incredibly weak prediction signal, with validation \(R^2\) values hovering very close to 0 (see Figure 4.1 in the Prediction Results without Binarization tab). Given that an \(R^2\) value of 0 can be achieved by simply predicting the mean LVMi response, this raised the question of whether these prediction models are capturing any biologically-relevant signal in the data. Consequently, the interpretation of these models may not be an accurate representation of reality. Put differently, these models do not satisfy the prediction check criteria under the PCS framework.

Given these difficulties in accurately predicting LVMi, we drew inspiration from how scientists often reduce complex problems into simplified versions. Here, the complex version of our problem aims to understand how SNVs are predictive of one’s precise (continuous-valued) LVMi measurement. This is challenging, as demonstrated by the poor prediction accuracy discussed previously. On the other hand, arguably the most simplified version of this problem is to understand how SNVs are predictive of whether one has an extremely high or extremely low LVMi measurement. Intuitively, if there are in fact genetic differences between these two LV chamber states, then differentiating between the two groups — those with extremely high LVMi and those with extremely low LVMi — should be easier than the original regression problem (i.e., predicting the raw continuous-valued LVMi response). Indeed, we will see this to be the case, observing that the resulting classification models consistently performed better than random guessing unlike the original regression models. More specifically, we observed balanced classification accuracies \(>55\%\), area under the receiver operator characteristic (AUROC) \(>0.58\), and area under the precision-recall curve (AUPRC) \(>0.55\) for the constructed binary classification task (see the Prediction Step tab). This prediction performance instills greater trustworthiness that the interpretation of these models after binarization are more relevant to the biological phenomena than the case without binarization.

How is the binarization done? We transformed the original regression problem of predicting the raw continuous-valued LVMi response into a binary classification problem as follows. For particular choice of threshold \(x\), we partitioned the raw continuous-valued LVMi phenotype into a low, middle, and high LVMi group corresponding to the \([0, x]\), \((x, 1-x)\), and \([1-x, 1]\) quantile ranges, respectively. We then omitted the individuals in the middle quantile range and trained the prediction models to differentiate between the low and high LVMi individuals. However, due to the gender-specific biological variation of LVMi shown earlier in the exploratory data analysis (see Choice of Phenotypic Data), we performed this binarization procedure for males and females separately. That is, we computed the quantiles for males and females separately. Finally, given that this threshold is artificially introduced, we run the remainder of the lo-siRF pipeline using three different reasonable binarization thresholds (15%, 20%, 25%) that balance the improvement in prediction signal and amount of data lost. In particular, we note that if we had chosen a binarization threshold of 50% so that no data is thrown out, then the balanced classification accuracy of the best-performing prediction model (signed iterative random forest) falls to \(51%\), which is very similar to random guessing. In the end, we will consolidate the results that were stable across all three binarization thresholds (15%, 20%, 25%), described later.

Limitations of binarization: We recognize that this binarization procedure has limitations and that these limitations should be carefully considered in context of the problem under study. For instance, the utility of this binarization is highly dependent on whether understanding the differences between the low and high quantiles is relevant to the study aims. In our case, we believe that the connection between cardiac hypertrophy and those with high LVMi helps to justify the use of binarizing based on the low and high quantiles. If one does not expect a monotonic relationship between the genetic features and the phenotype, then binarization may not be appropriate. In addition, binarization also requires the choice of binarization threshold. This choice is almost certainly a subjective choice and thus necessitates a stability analysis, that is, analyzing the stability of findings across different choices of binarization thresholds. In our case, we only prioritize loci/interactions that were stably important across three different choices of binarization threshold (15%, 20%, and 25%). Finally, binarization also reduces the sample size and systematically omits individuals who are not in the tails of the phenotype distribution. As a result, there may be additional interactions that are missed in lo-siRF. While we do not view this as a major concern if the goal is to recommend a small set of reliable experimental targets, this limitation may certainly impact studies with different end goals.

We expand on the prediction modeling with the binarized LVMi phenotype in the next section (see Prediction Step).

Prediction Results without Binarization

Here, we present a summary of the prediction modeling results, where we use the GWAS-filtered SNVs to predict the raw LVMi phenotype measurements (i.e., without any binarization). Specifically, we measure the validation set prediction accuracy from several popular machine learning regression algorithms (kernel ridge regression, Lasso regression, ridge regression, random forest (RF), and signed iterative random forest (siRF)) across four different prediction performance metrics - the Pearson correlation, mean absolute error (MAE), R-squared, and root-mean-squared error (RMSE) between the observed LVMi response and the predicted LVMi response. We also visualize how the predicted responses differ across the different prediction models in the provided pair plots. Though RF appears to yield the best prediction performance relative to the other methods under consideration, the RF prediction performance is still very poor as both the correlation and R-squared metrics are very close to 0.

Prediction Accuracies

Figure 4.1: Summary of prediction performance on the validation set for the LVMi phenotype (without binarization) across various prediction methods and evaluation metrics.

Prediction Pair Plot

Pair plot of the predicted LVMi responses from various prediction models. Each point represents an individual in the validation set, and the coordinates represent the predicted LVMi response from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted LVMi responses are shown along the diagonal.

Figure 4.2: Pair plot of the predicted LVMi responses from various prediction models. Each point represents an individual in the validation set, and the coordinates represent the predicted LVMi response from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted LVMi responses are shown along the diagonal.

5 Prediction Step

After choosing to binarize the LVMi phenotype, we next fit various prediction models using the GWAS-filtered SNVs to predict the binarized LVMi phenotype and assess their validation prediction accuracy.

Which models were chosen and why? While there are many methods, each with their unique advantages and disadvantages, for detecting epistasis from data (e.g., see Niel et al. 2015 and references therein), we chose to focus on siRF (Kumbier et al. 2023) as our primary interaction search engine for various reasons.

  1. Unlike many methods that make linear assumptions or require pre-specification of interaction order (i.e., the number of features involved in an interaction), siRF can identify higher-order, nonlinear Boolean interactions in a computationally-tractable manner without the need to specify the interaction order of interest.
  2. Another advantage of siRF for detecting interactions is the resemblance between the localized thresholding behavior of its decision trees and the threshold (or on-off switch-like) behavior frequently observed in biomolecular interactions (Nelson, Lehninger, and Cox 2008). In other words, the form of the siRF model appears to be well-suited for the biological phenomena under study.
  3. Furthermore, the prediction accuracy of the siRF model fit can be easily assessed. siRF also exhibited higher prediction accuracy than other commonly used prediction methods across almost all evaluation metrics examined and binarization thresholds (Figure 5.1). As discussed previously, prediction accuracy can serve as an indicator of whether or not the model is an accurate representation of reality. This prediction check is a necessary pre-requisite for interpreting the prediction model and is especially important in low-signal regimes such as this study.
  4. siRF and its predecessor iRF have demonstrated great success in detecting reliable interactions in previous real-world studies, such as in epistasis-controlled hair color (Behr et al. 2020), obesity (Allen et al. 2022), schizophrenia (De Rosa et al. 2022), and predictive gene expression networks (Cliff et al. 2019; Walker et al. 2022).
  5. Finally, due to the weak phenotypic signal and the strong correlations between neighboring SNVs due to linkage disequilibrium (Figure 3.24), we believed it was most appropriate to recommend interactions between genetic loci, rather than interactions betweeen individual SNVs. While it is not always clear how to perform this locus-level interaction search when the data is given at the SNV-level, we can build upon the siRF structure to aggregate feature importances across multiple SNVs using our new locus-level importance score, described in Q. Wang et al. (2023).

Though we are most interested in interpreting the siRF prediction model to identify candidate epistatic interactions, we also fit several other popular prediction models, including RF, L1-regularized logistic regression (Lasso), L2-regularized logistic regression (ridge), support vector machines (SVM), multilayer perceptron, AutoGluon TabularPredictor, and a polygenic risk score. These prediction models serve as baselines to ensure that siRF indeed fits the data well, relative to these other popular prediction models, and that siRF passes the prediction check in accordance with the PCS framework (Yu and Kumbier 2020).

Summary of prediction results: For all of the classification algorithms under investigation, we see that the validation prediction accuracy metrics, measured via classification accuracy, area under the receiver operator characteristic curve (AUROC), and area under the precision-recall curve (AUPRC), are all comfortably above the random-guessing baseline of 0.5 across three different binarization thresholds. This suggests that the prediction models are able to ascertain some phenotypic signal from the SNVs under study. That is, the SNVs used in the prediction models have some predictive power in differentiating those with high LVMi versus those with low LVMi. Moreover, siRF consistently yielded the highest predictive power across the various binarization thresholds and prediction metrics (with one exception being the classification accuracy for the 15% binarization threshold, where ridge is slightly better than siRF). From this, we conclude that siRF passed the prediction check and will proceed to interpret the siRF fit to recommend and prioritize interactions between genetic loci for downstream analyses and experiments.

Beyond the validation prediction accuracies, we also provide the confusion matrices, ROC and precision-recall curves, and pair plots comparing the predicted responses across methods for additional post-hoc exploration.

Prediction Accuracy Summary

Figure 5.1: Summary of prediction performance on the validation set for the binarized LVMi phenotype across various prediction methods, binarization thresholds, and evaluation metrics. Abbreviations: siRF, signed iterative random forest; RF, random forest; Lasso, L1-regularized logistic regression; Ridge, L2-regularized logistic regression; SVM, support vector machine; MLP, multilayer perceptron; Autogluon (medium/good), Autogluon TabularPredictor fitted with the medium/good quality preset; PGS, polygenic risk score.

Binarization Threshold = 0.15

Confusion Tables

Figure 5.2: Confusion matrix from various prediction models for predicting the binarized LVMi phenotype (binarization threshold = 0.15).

ROC/PR Curves

Figure 5.3: ROC curve for predicting the binarized LVMi (threshold = 0.15) across various prediction methods. TPR = true positive rate; FPR = false positive rate.

Figure 5.4: Precision-recall curve for predicting the binarized LVMi (threshold = 0.15) across various prediction methods.

Prediction Pair Plot

Pair plots of the predicted binarized LVMi responses (threshold = 0.15) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

Figure 5.5: Pair plots of the predicted binarized LVMi responses (threshold = 0.15) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

Binarization Threshold = 0.2

Confusion Tables

Figure 5.6: Confusion matrix from various prediction models for predicting the binarized LVMi phenotype (binarization threshold = 0.2).

ROC/PR Curves

Figure 5.7: ROC curve for predicting the binarized LVMi (threshold = 0.2) across various prediction methods. TPR = true positive rate; FPR = false positive rate.

Figure 5.8: Precision-recall curve for predicting the binarized LVMi (threshold = 0.2) across various prediction methods.

Prediction Pair Plot

Pair plots of the predicted binarized LVMi responses (threshold = 0.2) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

Figure 5.9: Pair plots of the predicted binarized LVMi responses (threshold = 0.2) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

Binarization Threshold = 0.25

Confusion Tables

Figure 5.10: Confusion matrix from various prediction models for predicting the binarized LVMi phenotype (binarization threshold = 0.25).

ROC/PR Curves

Figure 5.11: ROC curve for predicting the binarized LVMi (threshold = 0.25) across various prediction methods. TPR = true positive rate; FPR = false positive rate.

Figure 5.12: Precision-recall curve for predicting the binarized LVMi (threshold = 0.25) across various prediction methods.

Prediction Pair Plot

Pair plots of the predicted binarized LVMi responses (threshold = 0.25) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

Figure 5.13: Pair plots of the predicted binarized LVMi responses (threshold = 0.25) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high LVMi from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.

6 Prioritization Step

Overview

Having passed the prediction check, we finally proceed to interpret the siRF fit in the last step of lo-siRF. In this step, we developed a novel stability-driven feature importance score to prioritize both genetic loci and interactions between genetic loci from siRF. We refer to Q. Wang et al. (2023) for the detailed description of this new importance score and here provide brief insights into the motivation and intuition behind this method.

Main idea: At a high-level, this new importance score (1) aggregates weak, unstable SNV-level importances from siRF into stronger, more stable locus-level importances and (2) can be leveraged to evaluate both the marginal importance of a genetic locus in siRF as well as the importance of higher-order interactions between genetic loci in siRF. The need for this locus-level feature importance score stems from two main reasons: the high correlation between SNVs and the weak phenotypic signal. Though the prediction models (including siRF) are all trained using SNV data, it is often very challenging to interpret the importance of an individual SNV from these models due to the high correlation between SNVs (Toloşi and Lengauer 2011; Hooker, Mentch, and Zhou 2021), which we observed in our previous exploratory data analysis (see Choice of Phenotypic Data tab). To further complicate matters, the weak phenotypic signal increases the instability of existing feature importance methods in that we see the feature rankings change drastically when using different bootstrap training samples. Given these challenges with SNV-level feature importances, it is then no surprise that siRF, out-of-the-box, does not find any stable interactions between SNVs (i.e., the siRF stability score is around 0 for all interactions between SNVs). To address these challenges, our new feature importance score leverages the correlation structure between SNVs, groups SNVs into genetic loci, and aggregates weak, unstable SNV-level importances from siRF into stronger, more stable locus-level importances.

Motivation behind new feature importance score: To better understand the motivation and construction of our new locus-level feature importance score, it can be helpful to first review the original importance score used in siRF (Kumbier et al. 2023). Briefly, in the original version of siRF, the importance of a signed interaction is based upon how stable or frequently it occurs in decisions paths across the random forest – that is, based upon the number of decision paths for which every signed feature in the signed interaction was used in at least one decision split (see Kumbier et al. (2023) for details). Following a similar logic, one natural path forward to calculate a locus-level importance from a forest that was fit on SNV data is as follows:

  1. Take the fitted forest, where each decision split used some SNV to make the split.
  2. Map each SNV used in a decision split to its corresponding genetic loci (e.g., using annotation software like ANNOVAR (K. Wang, Li, and Hakonarson 2010)).
  3. Compute the importance of a signed locus-level interaction based upon the frequency of occurrence of the interaction as before, but using the mapped genetic loci in lieu of the SNVs. That is, compute the number of decision paths for which every signed locus in the signed locus-level interaction was used in at least one decision split.

However, this frequency of occurrence measure is heavily tied to the number of SNVs that belong to each genetic locus. A genetic locus that consists of more SNVs will naturally have a higher frequency of occurrence than a genetic locus with only a few SNVs even though both genetic loci may be equally important (or unimportant). In other words, the original siRF importance score, measuring the frequency of occurrence, is not comparable across genetic loci. This bias is problematic given that we have previously seen some genetic loci may consist of only 1 SNV while other genetic loci have >50 SNVs (see GWAS-filtered SNVs tab in the Dimension Reduction Step).

Intuition behind new feature importance score: To alleviate this bias, we leverage the idea of a local feature importance score, where we compute a per-individual score that measures the importance of a locus/interaction for making the prediction of each individual. This is in contrast to a global feature importance score, which is a measure of importance for the entire group of individuals under study. While a global feature importance score gives rise to a single score for the entire population under study, we can compute a local feature importance score for each individual in the population. By computing a local feature importance score for each individual (i.e., this is the local stability importance (LSI) score from Q. Wang et al. (2023)), we can compare the importance of the locus/interaction across individuals and avoid the across-loci comparison that was problematic in the aforementioned discussion. This comparison is done via a permutation test in Q. Wang et al. (2023). Moreover, since we are comparing the local feature importances between individuals within or per loci/interaction in this permutation test, the local feature importances are all measured on the same scale and comparable, thereby mitigating the bias towards genetic loci with more SNVs.

How to aggregate SNVs: We mapped each SNV to its corresponding genetic locus using ANNOVAR (K. Wang, Li, and Hakonarson 2010) according to the hg19 refSeq Gene annotations. Given our ultimate goal of recommending/prioritizing genetic loci and interactions between loci for downstream analyses and experimental validation, we believe this choice provides an appropriate starting point to obtain actionable recommendations. However, we note that there are many alternative ways of aggregating SNVs. For example, it is possible to aggregate SNVs according to LD/haplotype blocks or to include additional gene-based or functional annotations in the aggregation scheme.

Organization: Under the SNV-level Importances tab, we provide the SNV-level feature importances from the fitted prediction algorithms (described in the Prediction Step section) for completeness. These SNV-level feature importances are very unstable across methods and across bootstrap perturbations for the reasons discussed above. We then summarize the results from the lo-siRF pipeline including the rankings of genetic loci and interaction between genetic loci using our new locus-level importance score under the Summary of lo-siRF Results tab. A final technical note on the advantages of incorporating signed information from siRF in this feature importance computation is provided under the Advantages of Signed Information tab.

SNV-level Feature Importances

Having observed that the aforementioned classification algorithms do have some predictive power to predict those with high versus low LVMi, a natural next step is to investigate which SNVs were used in these prediction models. Below, we highlight the top 100 SNVs, ranked by importance in each of the prediction models. For completeness, we provide the SNV importances for the unbinarized LVMi (regression) problem as well as the binarized LVMi (classification) problem across the three different binarization thresholds.

How is importance defined?

  • For ridge and Lasso, the importance of an SNV is defined as the magnitude of its coefficient in the prediction model.
  • For RF, the importance of an SNV is determined by the mean decrease in impurity (MDI) score.
  • For siRF, the importance of an SNV is determined by the MDI from the forest in the final iteration.
  • Due to the computational complexity, we did not compute the SNV importances from the SVM fit as there is no built-in model-based feature importance method (in R), and any permutation-based approach would be computationally expensive.

A word of caution: When interpreting these SNV importances, it is crucial to keep in mind that there are very large, strongly correlated blocks of SNVs within this data, which often renders the feature rankings unmeaningful and unstable (Toloşi and Lengauer 2011; Hooker, Mentch, and Zhou 2021). In particular, it is well-known that the MDI variable importance metric used in RF/iRF/siRF is biased against correlated features (Hooker, Mentch, and Zhou 2021). In other words, SNVs within large LD blocks tend to have artificially low MDI values. Given these idiosyncrasies, it is unsurprising that SNVs from TTN and CCDC141 – two genetic loci with many SNVs under LD – do not have high importances according to MDI. We provide these SNV importance tables only for completeness and to partially motivate the need to aggregate SNV-level importances into locus-level importances in lo-siRF.

Unbinarized (Regression)

Lasso

Figure 6.1: Lasso - Top 100 SNVs, ranked by importance, for predicting LVMi.

Ridge

Figure 6.2: Ridge - Top 100 SNVs, ranked by importance, for predicting LVMi.

RF

Figure 6.3: RF - Top 100 SNVs, ranked by importance, for predicting LVMi.

siRF

Figure 6.4: siRF - Top 100 SNVs, ranked by importance, for predicting LVMi.

Binarization Threshold = 0.15

Lasso

Figure 6.5: Lasso - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.15).

Ridge

Figure 6.6: Ridge - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.15).

RF

Figure 6.7: RF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.15).

siRF

Figure 6.8: siRF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.15).

Binarization Threshold = 0.2

Lasso

Figure 6.9: Lasso - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.2).

Ridge

Figure 6.10: Ridge - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.2).

RF

Figure 6.11: RF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.2).

siRF

Figure 6.12: siRF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.2).

Binarization Threshold = 0.25

Lasso

Figure 6.13: Lasso - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.25).

Ridge

Figure 6.14: Ridge - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.25).

RF

Figure 6.15: RF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.25).

siRF

Figure 6.16: siRF - Top 100 SNVs, ranked by importance, for predicting the binarized LVMi (threshold = 0.25).

Summary of lo-siRF Results

Finally, we summarize the prioritization results from the lo-siRF pipeline using our new importance score to rank genetic loci and interactions between genetic loci. In this new importance score, we (1) computed a local (i.e., on a per-individual basis) stability importance score that aggregates weak SNV-level importances into stronger locus-level importances using the siRF fit and (2) conducted a permutation test to assess whether the local stability importance scores for a given locus/interaction are different between individuals with high and low LVMi (conditioned on the rest of the fitted forest). The larger the difference in local stability importance scores between those with high versus low LVMi, the greater the importance of the locus/interaction for predicting LVMi. For details on the new importance score, we refer to Q. Wang et al. (2023).

Below, we recap the siRF prediction accuracy for predicting the binarized LVMi across the three binarization thresholds (15%, 20%, 25%) (Figure 6.17). Then, in Figure 6.18, we summarize the prioritization rankings of genetic loci and interaction between loci according to the new importance score. We provide the nominal permutation p-values along with the permutation test statistic, that is, the difference in local feature stability scores between the high and low LVMi groups, with the standard deviation of this permutation distribution in parentheses.

Note that the permutation test was not performed for all loci/interactions (details in Q. Wang et al. (2023)). Given our primary aim of generating reliable hypotheses for follow-up experimental validation, we chose to only test loci/interactions that were predictive and stable in the siRF fit, following the PCS framework for veridical data science (Yu and Kumbier 2020). Specifically, for each binarization threshold, we only tested:

  • the top 25 loci with the highest average local stability importance scores, or in other words, those that occurred most stably/frequently (and thus likely predictive) in the siRF fit, and
  • the interactions between loci that were identified by siRF and achieved a stability score > 0.5, stability score for mean increase in precision > 0, and stability score for feature selection dependence > 0. We note that in addition to the explicit stability checks here, the requirement on the mean increase in precision serves as a predictability check, and the requirement on the feature selection dependence serves to differentiate marginal additive effects from non-additive interaction effects (see Kumbier et al. (2023) for details).

By implementing this additional predictability and stability check, we are requiring that the prioritized loci/interactions meet a higher standard, with the hope that this ultimately generates more reliable (albeit more conservative) experimental recommendations.

In addition to the provided summary of results,

  • Under the siRF Interaction Table tab, we provide the full siRF locus-level interaction table output for each binarization threshold (Figures 6.19, 6.21, and 6.23). This table includes several metrics for evaluating the stability and strength of the interaction. For details on these metrics, we refer to Kumbier et al. (2023).
    • For comparison, we also provide the analogous siRF output for interactions at the SNV level (Figures 6.20, 6.22, and 6.24). Here, we see that if we do not perform the SNV-to-locus aggregation, the stability scores for the SNV-level interactions are very close to 0. This means that the siRF-identified SNV-level interactions rarely appear when siRF is refitted using different bootstrapped training samples. As discussed previously, this instability is partially due to the correlation among SNVs and the weak phenotypic signal. This high instability also motivates the need for an alternative approach via our new importance score.
  • Under the Local Stability Importance Plots - Interactions and Local Stability Importance Plots - Loci, we provide visualizations of the density distribution of local stability importance scores between the low and high LVMi groups as well as the frequency distribution of SNVs used in the siRF fit (i.e., which SNVs occurred most frequently in the decision paths across the siRF fit).
    • In the density distribution of local stability importance scores, observable differences between the low and high LVMi groups are evidence that the genetic locus or interaction between loci is important for differentiating between individuals with high versus low LVMi. The larger the difference, the greater the importance of the locus or interaction between loci.
    • In the SNV frequency distributions, we show the frequency of occurrence for each SNV or SNV pair in the siRF fit. This is the number of decision paths for which the SNV or SNV pair appeared in the siRF fit. SNVs or SNV pairs with the highest number of occurrences in the RF decision paths (y-axis) contributed the most to the overall locus/interaction’s importance in the siRF fit.

Summary

Figure 6.17: Summary of the validation prediction performance from siRF across various LVMi binarization thresholds.

Figure 6.18: Summary of the lo-siRF permutation test results across various LVMi binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low LVMi groups (SD in parentheses) and the resulting nominal p-values. Blank cells indicate that the locus/interaction was not selected for testing for the given binarization threshold. Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue.

siRF Interaction Table

0.15

Figure 6.19: Summary of the siRF locus-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.15). For all metrics excluding the nominal lo-siRF p-value, higher values indicate a stronger interaction. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

Figure 6.20: Summary of the siRF SNV-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.15). Stability scores for all interactions are close to 0, suggesting unreliable SNV-level interactions. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

0.2

Figure 6.21: Summary of the siRF locus-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.2). For all metrics excluding the nominal lo-siRF p-value, higher values indicate a stronger interaction. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

Figure 6.22: Summary of the siRF SNV-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.2). Stability scores for all interactions are close to 0, suggesting unreliable SNV-level interactions. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

0.25

Figure 6.23: Summary of the siRF locus-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.25). For all metrics excluding the nominal lo-siRF p-value, higher values indicate a stronger interaction. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

Figure 6.24: Summary of the siRF SNV-level interaction table output for the binarized LVMi phenotype (binarization threshold = 0.25). Stability scores for all interactions are close to 0, suggesting unreliable SNV-level interactions. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).

Local Stability Importance Plots - Interactions

0.15

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.15) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.25: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.15) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.26: Frequency of occurrence of SNV-SNV combinations in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.15). SNV-SNV combinations with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level interaction importance in the siRF fit.

0.2

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.2) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.27: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.2) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.28: Frequency of occurrence of SNV-SNV combinations in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.2). SNV-SNV combinations with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level interaction importance in the siRF fit.

0.25

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.25) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.29: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.25) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.30: Frequency of occurrence of SNV-SNV combinations in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.25). SNV-SNV combinations with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level interaction importance in the siRF fit.

Local Stability Importance Plots - Loci

0.15

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.15) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.31: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.15) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.32: Frequency of occurrence of SNVs in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.15). SNVs with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level importance in the siRF fit. SNVs are ordered on the x-axis by genomic location.

0.2

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.2) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.33: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.2) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.34: Frequency of occurrence of SNVs in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.2). SNVs with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level importance in the siRF fit. SNVs are ordered on the x-axis by genomic location.

0.25

Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.25) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.35: Distribution of local stability importance scores between the high and low LVMi groups (binarization threshold = 0.25) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low LVMi groups, is provided in the top right corner of each subplot.

Figure 6.36: Frequency of occurrence of SNVs in the siRF fit for the binarized LVMi phenotype (binarization threshold = 0.25). SNVs with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level importance in the siRF fit. SNVs are ordered on the x-axis by genomic location.

Advantages of Signed Information

Though the signed information is not pertinent to our goal of prioritizing/recommending candidate markers for experiments, the signed information from siRF provides more granular information that can improve our interpretation of the fit. To see this, suppose that we did not keep track of the sign of the feature when splitting (i.e., that we do not keep track of whether we split on low or high values of the feature). Suppose also that we would like to measure the importance of feature \(X_1\), which was used as the first (root) split in tree \(T\). In the first step in our new importance score computation, we compute the local stability importance (LSI) score of \(X_1\) for each individual \(i\). If we do not keep track of the sign of the \(X_1\) split, then \(X_1\) appears on every individual’s decision path so that \(LSI_T(X_1, i) = 1\) for every individual \(i\). In the next step of our new importance score computation, we perform a permutation test to assess whether there is a difference in the LSI scores for individuals with high LVMi and those with low LVMi. Since \(LSI_T(X_1, i) = 1\) for every \(i\), this permutation test will return a nominal p-value of 1. This suggests that \(X_1\) is an unimportant feature in this model. However, this is counterintuitive given that \(X_1\) was split on first in the tree and should be considered one of the most important features in this tree. By keeping track of the signed information of the splits as done in siRF, we can utilize this more granular information to mitigate the issue discussed above and improve our interpretation of the siRF fit.

Moreover, to facilitate the interpretation of this signed information, it can be beneficial to aggregate SNV features that are positively correlated (as opposed to negatively correlated) into a single locus (or group). Since the encoding of SNVs as a value taking on 0, 1, or 2 is rather arbitrary (e.g., a value of 0 can be re-encoded to take on the value of 2 by changing the reference allele), we can choose the encoding of the SNV to try to avoid strong negative correlations within a locus. More specifically, for each genetic locus which consists of SNVs \(X_1, \ldots, X_g\), we choose to “flip” the encoding of the SNV \(X_j\) (\(j = 2, \ldots, g\)) if the Pearson correlation between the observed values of \(X_j\) and \(X_1\) is below -0.15. Here, we use the term “flip” to mean the application of the function that maps 0 to 2, 1 to 1, and 2 to 0. This correlation threshold of -0.15 was chosen by visually inspecting the resulting pairwise correlations between \(X_i\) and \(X_j\) for all \(i, j = 1, \ldots, g\) and checking that these pairwise correlations are not strongly negative. However, while we focus on the lo-siRF prioritization results using this correlation threshold of -0.15, the results appear to be robust to other reasonable correlation threshold choices (e.g., -0.05, -0.25). Further analysis and improvements to this procedure are very interesting for future work.

7 Non-hypertensive Analysis

Overview

The lo-siRF analysis thus far has not accounted for the possibility of pleiotropy, particular between LV mass and blood pressure given the known connections between LV hypertrophy and hypertension (Yildiz et al. 2020). To evaluate this possibility of pleiotropic effects of the identified variants on both LV mass and blood pressure, we conducted two analyses.

  1. We assessed the marginal association between each of the lo-siRF-prioritized variants and hypertension (see Association between SNVs and Hypertension tab).
  2. We repeated the lo-siRF analysis using only the subset of UK Biobank individuals who were not reported to have hypertension (see Lo-siRF Results Summary tab).

Here, we define hypertensive diseases as anyone with self-reported hypertension, high blood pressure as diagnosed by a doctor, or any ICD10 billing code diagnosis in I10-I16. Out of the 29,661 UK Biobank participants in the original lo-siRF analysis, 7,371 individuals had hypertension, leaving 22,290 individuals for the non-hypertensive analysis.

Association between SNVs and Hypertension

To assess their marginal association with hypertension, we fit a separate logistic regression model for each of the 1405 GWAS-filtered SNVs, regressing hypertension (i.e., a binary indicator of whether or not one has hypertension) on the genotype of the SNV, while adjusting for the first five principal components of ancestry, sex, age, height, and body weight.

Overall, we found that none of the 1405 SNVs met the genome-wide (p < 5E-8) nor the suggestive (p < 1E-5) significance level (Figure 7.1 and Figure 7.2). Given our interest in the lo-siRF-prioritized interactions, which involve the TTN, CCDC141, IGF1R, and LOC157273;TNKS loci, we highlight the SNVs in each locus with the smallest p-values:

  • TTN: rs35112591 (p = 0.0328)
  • CCDC141: rs6725028 (p = 0.0403)
  • IGF1R: rs62024491 (p = 0.376)
  • LOC157273;TNKS: rs73185221 (0.0491)

The top lo-siRF-prioritized variants in each loci (see Figure 2f in Q. Wang et al. (2023)) exhibited even weaker marginal associations with hypertension;

  • TTN: rs66733621 (p = 0.776)
  • CCDC141: rs7591091 (p = 0.446)
  • IGF1R: rs62024491 (p = 0.376)
  • LOC157273;TNKS: rs6999852 (0.304)

We note however that the MIR588;RSPO3 locus with lead SNV rs2022479 gave the smallest p-value of 5E-5. This, in combination with the findings in the non-hypertension lo-siRF analysis next, may suggest a possible contribution of the MIR588;RSPO3 locus on mediating LV hypertrophy through regulating blood pressure.

Manhattan plot for the association between the 1405 GWAS-filtered SNVs used in the original lo-siRF analysis and hypertension.

Figure 7.1: Manhattan plot for the association between the 1405 GWAS-filtered SNVs used in the original lo-siRF analysis and hypertension.

Figure 7.2: List of top 100 SNVs (out of the 1405 GWAS-filtered SNVs used in the original lo-siRF analysis) with the highest association with hypertension. rsID = rsID of the SNV. Chr = Chromosome. Pos = Position on hg19. Lo-siRF Locus = assigned locus of the variant in the original lo-siRF analysis. p = P-value measuring the association between the SNV and hypertension.

Non-hypertensive Lo-siRF Results Summary

Using the same set of 1405 GWAS-filtered SNVs as in the original lo-siRF analysis, we repeated steps 2-4 of the lo-siRF analysis using only the non-hypertension UK Biobank cohort.

  • In Figure 7.3, we confirm that the siRF fit on the non-hypertensive cohort performs better than random guessing and satisfies the prediction check according to the PCS framework (Yu and Kumbier 2020). We thus choose to proceed with interpreting the siRF fit.
  • Overall, we found that the top 3 interactions between loci from our original lo-siRF analysis — namely, CCDC141-IGF1R, CCDC141-TTN, and CCDC141-LOC157273;TNKS — were again prioritized by lo-siRF when excluding the hypertensive individuals. That is, each of the aforementioned interactions yielded a nominal lo-siRF p-value less than our pre-specified threshold of 0.1 (in fact, all were < 0.01) for each of the three binarization thresholds (see Figure 7.4 below). This is to say that the top lo-siRF-prioritized interactions are stable with respect to the inclusion or exclusion of hypertensive individuals and provides evidence that these interactions are not solely mediating LV hypertrophy through the regulation of blood pressure.
  • All of the previously prioritized loci, with the exception of the MIR588;RSPO3 locus, were also prioritized in this non-hypertensive analysis. While the nominal lo-siRF p-value for the MIR588;RSPO3 locus was 0.019 and 0.007 for the 20% and 25% binarization thresholds, the nominal lo-siRF p-value was 0.187 (> 0.1 threshold) for the 15% binarization threshold and thus deemed not stable and was not prioritized in the non-hypertensive lo-siRF analysis. Though the MIR588;RSPO3 locus almost met the prioritization criteria here, this result, combined with the suspect marginal association results, suggests a possibility of pleiotropic locus contributing to variations of both blood pressure and LV mass.
  • We provide the full list of loci and interactions between loci from the non-hypertensive lo-siRF analysis in Figure 7.5. This list includes many loci and interactions that were not stable in the original lo-siRF analysis.

Figure 7.3: Summary of the validation prediction performance from siRF on the non-hypertensive cohort across various LVMi binarization thresholds.

Figure 7.4: Summary of the non-hypertensive lo-siRF permutation test results across various LVMi binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low LVMi groups (SD in parentheses) and the resulting nominal p-values. Results are shown for the loci/interactions that were previously prioritized in the original lo-siRF analysis (including both hypertensive and non-hypertensive individuals). Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue. This table is a subset of the table below.

Figure 7.5: Summary of the non-hypertensive lo-siRF permutation test results across various LVMi binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low LVMi groups (SD in parentheses) and the resulting nominal p-values. Blank cells indicate that the locus/interaction was not selected for testing for the given binarization threshold. Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue.

8 Comparison to Existing Methods

Epistasis Detection Methods

In this section, we compare the lo-siRF-prioritized interactions between loci to those interactions prioritized by other existing epistasis detection methods, namely, an exhaustive pairwise interaction search via linear (or logistic) regression, MAPIT (Crawford et al. 2017), and a locus-level version of MAPIT where we combined MAPIT and gene set enrichment analysis (GSEA) (Subramanian et al. 2005; Mootha et al. 2003).

Regression-based pairwise interaction scan

Below, we present the results from conducting an exhaustive pairwise interaction scan using linear (or logistic) regression over all GWAS-filtered SNVs with a minor allele frequency > 0.05. In this exhaustive interaction search, there are four different transformations of the LVMi response/phenotype that are of interest — namely, the rank-based inverse normal-transformed (RINT) LVMi (fitted using linear regression) and the binarized LVMi at three different thresholds (0.15, 0.2, and 0.25) (fitted using logistic regression). Details regarding the methodology can be found in Q. Wang et al. (2023).

  • In Figure 8.1, we examined the stability of the top ranked SNV-SNV interactions (i.e., those with the smallest p-values for the interaction term) across the different choices of LVMi response (i.e., RINT or one of the binarized versions). While there is some positive correlation across the different binarization runs, we found that there is very little correlation between the prioritized interactions when using the RINT-transformed LVMi versus the binarized LVMi as the response.
    • We note that the exhaustive interaction search was also conducted using all 1405 GWAS-filtered SNVs without the minor allele frequency filter. However, this led to even greater instability. We hence omit these results and focus on the results after restricting to SNVs with a minor allele frequency > 0.05.
  • In Figure 8.2, we provide a table of the SNV-SNV interactions and their corresponding interaction term p-values for each choice of LVMi response. We show only those SNV-SNV interactions with a p-value < 0.01 for at least one of the LVMi responses.
    • We note that none of the SNV-SNV interactions met the \(\alpha = 0.05\) significance threshold after performing a Bonferroni multiple testing adjustment.
    • Still, one could in theory use the p-values for ranking SNV-SNV interactions. For the unbinarized RINT-transformed LVMi response, the top interaction was between rs10227393 (chromosome 7; TPK1;CNTNAP2 locus) and rs8076248 (chromosome 17; CFAP52 locus). For the binarized LVMi response using both the 15% and 25% thresholds, the top interaction was between rs3176323 (chromosome 6; CDKN1A locus) and rs12795076 (chromosome 11; KIRREL3 locus). To our knowledge, these loci are not known to be closely related to hypertrophy. Only for the binarized LVMi response with the 20% threshold do we obtain a top interaction involving one of the lo-siRF-prioritized loci. Here, the top interaction is between rs12618871 (chromosome 2; CCDC141 locus) and rs9267356 (chromosome 6; MICB locus).

These differences in the top interactions across the different LVMi responses further highlight the instability of the regression-based pairwise interaction scan. This instability poses issues in interpreting results, as it can be unclear which results to trust. We suspect that at least part of the instability is a result of the low signal-to-noise ratio in this problem and highlights the need for more robust and stability-driven methods, such as lo-siRF, for detecting epistasis in low signal-to-noise regimes.

**Pair plot of the LVMi interaction effect p-values from an exhaustive regression-based SNVxSNV interaction scan using different transformations of the LVMi phenotype.** The transformations of interest include the rank-based inverse normal-transformed LVMi and the binarized LVMi using three different binarization thresholds (15%, 20%, 25%). The x- and y-axes show the interaction effect p-values after a -log10 transformation. Each point represents an SNV-SNV pair. We observe that the interaction effect p-values are generally unstable and depend on the choice of transformation of the phenotype (or response).

Figure 8.1: Pair plot of the LVMi interaction effect p-values from an exhaustive regression-based SNVxSNV interaction scan using different transformations of the LVMi phenotype. The transformations of interest include the rank-based inverse normal-transformed LVMi and the binarized LVMi using three different binarization thresholds (15%, 20%, 25%). The x- and y-axes show the interaction effect p-values after a -log10 transformation. Each point represents an SNV-SNV pair. We observe that the interaction effect p-values are generally unstable and depend on the choice of transformation of the phenotype (or response).

Figure 8.2: List of SNV-SNV interactions from an exhaustive regression-based SNVxSNV interaction scan using different transformations of the LVMi phenotype. The transformations of interest include the rank-based inverse normal-transformed LVMi and the binarized LVMi using three different binarization thresholds (15%, 20%, 25%). The table sorts the SNV-SNV interactions based on the mean p-value across the different transformations of the LVMi phenotype. To sort the SNV-SNV interactions based upon their p-value from a particular transformation, please use the toggles next to the column names. Only interactions that yielded a p-value < 0.01 for at least one transformation of the LVMi phenotype are shown for brevity.

MAPIT

Given the instabilities observed in the exhaustive regression-based interaction search, we applied MAPIT (Crawford et al. 2017) to our dataset as an additional comparison method. At a high-level, MAPIT aims to identify variants with non-zero marginal epistatic effects, defined as the total pairwise interaction effect between the given variant and all other variants in the data.

Similar to before, we used the GWAS-filtered SNVs with minor allele frequency > 0.05 as input to MAPIT. For the response, we used the rank-based inverse normal-transformed LVMi only since the MAPIT extension to binary traits has not yet been implemented. Details regarding the methodology can be found in Q. Wang et al. (2023).

  • Note: we also tried applying MAPIT to the full set of 1405 GWAS-filtered SNVs (without the additional minor allele frequency filter). However, the resulting top SNVs were predominantly those with the smallest minor allele frequencies, and we omit these results.

The top 5 SNVs from MAPIT with the smallest p-values were located in the MAT2B;LINC02143, MICB, BRMS1L;LINC00609, and CDKN1A loci (Figure 8.3). Interestingly, the highest-ranked SNVs from the CCDC141, TTN, and IGF1R loci were ranked 109 (p = 0.028), 75 (p = 0.003), and 87 (p = 0.011), respectively (compared to p < 8E-5 for the top 5 SNVs), according to MAPIT.

As is, the results from MAPIT and lo-siRF are not directly comparable since they are operating at different biological scales (i.e., MAPIT aims to identify epistatic variants while lo-siRF aims to identify epistatic loci). Variant-level approaches are known to be under-powered in traits where the effect of each variant is very small and/or noisy while group-level (or locus-level) approaches aggregate these signals across a collection of variants in hope of detecting more stable and biologically-relevant loci. In addition to this difference, there are a number of other potential reasons why MAPIT and lo-siRF may yield different, but non-contradictory results. For example, MAPIT and lo-siRF assume different underlying model assumptions and may capture different forms of interactions. In short, MAPIT encodes interactions as pairwise multiplicative effects (more typical in the statistical epistasis literature) whereas lo-siRF takes a more machine learning approach, encoding pairwise and higher-order interactions through decision paths (or boolean thresholding operations) in a decision tree. Other potential differences may stem from the different handling of variants with minor allele frequencies below 0.05 or the slightly different phenotypes used (i.e., lo-siRF used the binarized LVMi phenotype while MAPIT unfortunately requires a continuous phenotype and thus used the rank-based inverse normal-transformed LVMi).

Figure 8.3: Top epistatic variants from MAPIT, ranked by p-value. We show only the epistatic variants with p-value < 0.05. rsID = rsID of the variant. Chr = Chromosome. Pos = Position on hg19. Lo-siRF Locus = assigned locus of the variant in the lo-siRF analysis. P-value = p-value from MAPIT.

MAPIT+GSEA

Since the results of MAPIT and lo-siRF are on different biological scales (i.e., SNVs versus loci), we also conducted a locus-level analysis using MAPIT (Crawford et al. 2017) in conjunction with gene set enrichment analysis (GSEA) (Subramanian et al. 2005; Mootha et al. 2003). In this pipeline, we took the union of the top 10,000 GWAS hits from PLINK and BOLT-LMM, ran MAPIT on these SNVs to obtain a ranked list of SNVs according to their MAPIT p-values, and then ran GSEA on this ranked list to identify loci that are enriched at the top of this ranked list. The results of this MAPIT+GSEA analysis are provided below. We highlight three notable findings from this analysis: When ranking the loci by their normalized enrichment score,

  • The top-ranked locus (by MAPIT+GSEA) was the intergenic region between IGFBP3 and LOC730338 (denoted as IGFBP3;LOC730338). IGFBP3 encodes an insulin-like growth factor (IGF) binding protein, which is known to interact closely with IGF1R (Baxter 2023).
  • The three interacting loci, prioritized by lo-siRF, were all ranked within the top 30 by the MAPIT+GSEA analysis (LOC157273;TNKS: rank 2; IGF1R: rank 7; TTN: rank 23; CCDC141: rank 28 out of 863 total loci). Their corresponding nominal p-value, FDR q-value, and FWER p-value were 0 (or < 1e-3 given that we used 1000 permutations). We note however that the magnitudes of these p-values are not directly comparable to those from lo-siRF given the different modeling assumptions.
  • Aside from IGFBP3;LOC730338, LOC157273;TNKS, IGF1R, and ADAMTS16, other loci ranked in the top 10 by MAPIT+GSEA do not appear to be closely related to cardiac hypertrophy or cardiovascular diseases in the current literature. Moreover, except for IGFBP3;LOC730338, LOC157273;TNKS, IGF1R, ADAMTS16, and FBXL17, the remaining 5 loci in the top 10 were not among the top 1000 GWAS-filtered SNVs used in the original lo-siRF analysis. In other words, half of the top 10 loci identified by MAPIT+GSEA did not show strong marginal associations with LVMi. There is no expectation that variants involved in epistasis also have large marginal effects, so the lack of marginal associations alone is not particularly concerning. However, combined with the observation that the primary biological functions of these loci are not directly related to the heart, we are uncertain whether these epistatic findings would be validated in wet lab experiments.

Overall, in comparing lo-siRF with MAPIT+GSEA, we are encouraged to see that both methods identified loci like LOC157273;TNKS, IGF1R, TTN, and CCDC141. On the other hand, MAPIT+GSEA also identified loci deprioritized by lo-siRF, which turn out to lack biological relevance to cardiac diseases. These discrepancies may arise from the fundamental differences between the computational frameworks of lo-siRF and MAPIT+GSEA, which make direct comparisons challenging. First, MAPIT (or MAPIT+GSEA) outputs single SNVs or loci with marginal epistatic effects while lo-siRF directly identifies specific interactions between genetic loci. Secondly, lo-siRF and MAPIT are based on different modeling assumptions, so their definitions of epistasis are not directly comparable. Lastly, since performing GSEA after MAPIT has not been extensively studied, it is unclear whether the observed strengths and weaknesses of MAPIT+GSEA arise from MAPIT, GSEA, or their combination. It also remains uncertain whether using alternative SNV-to-locus grouping methods (in place of GSEA) would improve upon the epistatic loci rankings.

Figure 8.4: Top epistatic loci from MAPIT+GSEA, ranked by normalized enrichment score. We show only the epistatic loci with nominal p-value < 0.05.

Marginal Gene-based Methods

In this section, we compare the lo-siRF-prioritized loci to those loci prioritized by other existing marginal set-based methods, such as MAGMA (Leeuw et al. 2015) (Figures 8.5-8.6) and SKAT-O (Lee et al. 2012) (Figure 8.7). Details regarding the methodology can be found in Q. Wang et al. (2023).

Notably, we found that while lo-siRF prioritized IGF1R as one of its top recommended loci, SKAT-O and MAGMA did not rank IGF1R very highly. According to SKAT-O, IGF1R yielded a p-value of 1.0E-5 and was the 45th ranked locus, and using MAGMA, IGF1R yielded a p-value of 0.002 and was the 146th ranked locus. Given our extensive validation efforts demonstrating the importance of IGF1R for left ventricular hypertrophy, we view this as an interesting example where lo-siRF is able to identify important loci that are not necessarily prioritized by other methods, even for marginal effects.

Manhattan plot of the gene-based test as computed by MAGMA. The genome-wide significance level is shown by the red dotted line.

Figure 8.5: Manhattan plot of the gene-based test as computed by MAGMA. The genome-wide significance level is shown by the red dotted line.

Figure 8.6: Top genes from MAGMA, ranked by p-value. We show only the top 200 genes. Locus = Gene symbol. Chr = Chromosome. Start = Start position on hg19. Stop = Stop position on hg19. Z = Z-statistic from MAGMA. P-value = p-value from MAGMA.

Figure 8.7: Top genes from SKAT-O, ranked by p-value. Locus = Name of locus from lo-siRF analysis. Chr = Chromosome. P-value = p-value from SKAT-O.

9 Lo-siRF Permutation Test Simulations

In this section, we conduct several simulations to investigate the performance and demonstrate the validity of the permutation p-values from lo-siRF.

P-value Calibration Simulations

Calibration of p-values under null model: In the first set of simulations, we aimed to assess whether the permutation p-values are calibrated under the null model where no epistasis is present. To this end, we simulated binary responses \(y\) from a null, purely-noise model, where \(y \sim Bernoulli(1/2)\). Under this null simulation setup, we generated covariates \(X\) from three different data-generating processes:

  1. Independent normal: \(X_{ij} \stackrel{iid}{\sim} N(0, 1)\) for \(i = 1, \ldots, n\) and \(j = 1, \ldots, p\).
  2. Independent SNPs: \(X_{ij}\) iid from \(\{0, 1, 2\}\) with uniform probabilities for \(i = 1, \ldots, n\) and \(j = 1, \ldots, p\).
  3. Real-world SNPs: The \(n \times p\) matrix \(X\) is taken to be the GWAS-filtered SNV matrix used in lo-siRF. In this case, we we randomly subsampled the features and the columns to meet the desired dimensions of \(n\) and \(p\).

Under each of these three settings, we ran the simulated data through a simplified lo-siRF pipeline, where we:

  1. Split the data into two partition (2/3 training and 1/3 validation).
  2. Fit siRF on the training data to obtain a list of candidate interactions.
  3. For each of the interactions from step 2, evaluate the local stability importance scores and compute the associated permutation p-value using the validation data.

In simulations (A) and (B), we searched for interactions at the level of individual features (or SNVs), rather than a group of features (or genes). This was done to simplify the simulation setup and interpretation of results, given that the features were simulated independently. For simulation setup (C), we leveraged the same SNV-to-gene mapping used previously in lo-siRF and searched for interactions at the gene level.

We also note that the simplified lo-siRF pipeline omits the interaction filtering step based upon their siRF stability scores. This was done because the interaction filtering step excludes almost all non-relevant candidate interactions under the simulation setup. In particular, under the null simulation model, the interaction filtering (using the recommended thresholds from the original siRF paper (Kumbier et al. 2023)) removed all but 2 interactions across 200 simulated data replicates. Given our goal of assessing whether the permutation p-values are calibrated, we will focus our evaluation on the permutation p-values from all candidate interactions in this setting regardless of whether the interactions passed the stability filtering.

Results: In Figure 9.1 below, we plot the proportion of rejected null hypotheses across different significance levels \(\alpha\) under the aforementioned null simulation setup. We found that the observed type I error of our permutation test closely aligns with the significance level \(\alpha\), thereby confirming that the p-values are calibrated. Moreover, these results hold across different sample sizes, number of features, and all three data-generating processes for X.

Proportion of rejected null hypotheses across varying significance levels under the null simulation response model are represented by the solid black line. Dotted line represents a perfectly calibrated hypothesis test (i.e., the y = x line). Each row corresponds to a different data-generating process for X, and each column corresponds to a different choice of number of samples. Results are aggregated across 200 simulation replicates.

Figure 9.1: Proportion of rejected null hypotheses across varying significance levels under the null simulation response model are represented by the solid black line. Dotted line represents a perfectly calibrated hypothesis test (i.e., the y = x line). Each row corresponds to a different data-generating process for X, and each column corresponds to a different choice of number of samples. Results are aggregated across 200 simulation replicates.

Marginal and Interaction Effect Simulations

Marginal and Interaction Effect Simulations: Moving beyond this basic calibration investigation and the null, purely-noise model, we were more interested in studying the broader performance of lo-siRF and the permutation test. To this end, we conducted two additional simulations using boolean-type rules. We note that the boolean-type rules were chosen as they have been extensively studied in previous RF-based work and are thought to closely resemble the switch-like and stereospecific nature of interactions among biomolecules (Nelson, Lehninger, and Cox 2008; Kumbier et al. 2023; Behr et al. 2020). Specifically, we first simulated the \(X\) data from independent SNVs, where \(X_{ij}\) is simulated iid from \(\{0, 1, 2\}\) with uniform probabilities. Next, we simulated binary responses \(y \mid X \sim Bernoulli(p(X))\) via:

  1. Marginal effect model: \(p(X) = 0.4 \cdot \mathbf{1}\{X_1 > 0\} + 0.4 \cdot \mathbf{1}\{X_2 > 0\}\)
    • In this model, there are no true interactions.
  2. Interaction effect model: \(p(X) = 0.4 \cdot \mathbf{1}\{X_1 > 0\} \mathbf{1}\{X_2 > 0\} + 0.4 \cdot \mathbf{1}\{X_3 > 0\} \mathbf{1}\{X_4 > 0\}\)
    • In this model, the true interactions are \(X_1:X_2\) and \(X_3:X_4\).

We set the number of features to 100 and varied the number of samples across n = 1000, 1500, 2000. For each simulation model and choice of sample size, we ran 200 simulation replicates.

Results: The simulation results are summarized in Table 9.2 below. In what follows, we define “partial interaction” as a candidate interaction that involves at least one true signal feature from an interaction that appeared in the simulation model — specifically, at least one of \(X_1\) or \(X_2\) in the marginal effect simulation (D), or at least one of \(X_1\), \(X_2\), \(X_3\), or \(X_4\) in the interaction effect simulation (E). We define “true interaction” as a candidate interaction that involves both \(X_1\) and \(X_2\) or both \(X_3\) and \(X_4\) in the interaction effect simulation.

The main takeaways from these simulations are:

  1. Under both the marginal effect and interaction effect scenarios, the candidate interactions from siRF almost always include at least one true signal feature (and equivalently, is a partial interaction).

    • Specifically, under both the marginal effect (D) and interaction effect (E) scenarios, over 99% of the candidate interactions identified by siRF were found to be “partial interactions” (the 5th column in Table 9.2). Moreover, under the interaction effect (E) simulation, over 38% of the candidate interactions identified by siRF were “true interactions” (the 4th column in Table 9.2).
  2. The interaction filtering step within siRF, where we filter out candidate interactions that do not meet the siRF stability thresholds, is both necessary and effective at distinguishing between interactions and marginal effects.

    • First, the siRF interaction filtering is necessary to differentiate between interactions and marginal effects since the permutation test itself is not designed to distinguish between “true interactions” and “partial interactions.”

      • Because the permutation test within lo-siRF formally assesses whether the local stability importance of an interaction differs between the high and low LVMi groups, any candidate interaction involving at least one true signal feature (i.e., any partial interaction) is expected to show a difference between the two groups and result in a significant p-value. Our results support this expectation. In both the interaction and marginal simulation models, an overwhelming majority (>99%) of the permutation tests led to a significant p-value (p < 0.05) (the last column in Table 9.2). This is expected since each of these tested interactions was a partial interaction and involved at least one true signal feature (the 8th column in Table 9.2).
      • We thus relied on the siRF interaction filtering step to distinguish between marginal and true interaction effects. This siRF interaction filtering step is detailed in the Methods section Lo-siRF step 4.3: Permutation test for difference in local stability importance scores in Q. Wang et al. (2023). Specifically, the feature selection dependence threshold (discussed in greater detail in the original siRF paper (Kumbier et al. 2023)) was originally designed to differentiate between additive (e.g., marginal) and non-additive (e.g., interaction) effects. This metric showed strong performance in empirical simulations from the original siRF paper.
    • As in the original siRF paper, the siRF interaction filtering demonstrated similarly strong effectiveness for distinguishing between true interactions and true marginal effects under our simulation setup.

      • In particular, under the interaction simulation model, the siRF interaction filtering step reduced the number of interactions from over 3000 candidates (the 3rd column in Table 9.2), of which \(\sim\) 40% were true interactions (4th column), to ~1300 candidates (6th column). Of these remaining \(\sim\) 1300 candidates, 87%, 99%, and 100% were true interactions in the n = 1000, 1500, and 2000 simulations, respectively (the 7th column).
      • Furthermore, under the marginal simulation model, the siRF interaction filtering step reduced the number of interactions from over 4400 candidates (the 3rd column in Table 9.2) to fewer than 100 candidates (4th column). That is, less than 2% of the candidate interactions remain after the siRF interaction filtering step in the marginal effect simulation setting.
    • Thus, even though the permutation test rejects most tests involving partial interactions, in lo-siRF, the permutation test occurs after the siRF interaction filtering, which we have shown to effectively eliminate the vast majority of non-signal or partial interactions.

Figure 9.2: Summary of marginal and interaction effect simulations. # Candidate Interactions from siRF: The total number of interactions identified by siRF before applying the stability filtering step. Prop. (Num.) of True Interactions in Candidate Set: The proportion (or number) of true interactions out of the total number of candidate interactions from siRF. Prop. (Num.) of Partial Interactions in Candidate Set: The proportion (or number) of partial interactions out of the total number of candidate interactions from siRF. Prop. (Num.) of Stable Interactions in Candidate Set: The proportion (or number) of interactions that passed the stability filter step out of the total number of candidate interactions from siRF. Prop. (Num.) of True Interactions in Stable Set: The proportion (or number) of true interactions out of the total number of interactions that passed the stability filter. Prop. (Num.) of Partial Interactions in Stable Set: The proportion (or number) of partial interactions out of the total number of interactions that passed the stability filter. # Interactions Rejected (p < 0.05): The number of stable (or tested) interactions with a permutation test p-value below 0.05.

10 Final Remarks

We reiterate that lo-siRF was originally developed as a first-stage recommendation (or hypothesis generation) tool within a broader pipeline in order to prioritize genetic loci and interactions between loci for downstream analyses and experimental validation. Importantly, the output of lo-siRF is a prioritization/ranking order, not a proper statistical test of significance. We rely on follow-up investigations and in this study rigorous gene-silencing experiments in order to validate the prioritized genetic loci and interaction between loci. With that, many of the modeling decisions and human judgment calls were made with this original use case in mind. For other use cases or problems, it is highly likely that different choices may be better suited. We also acknowledge that though many stability checks were conducted to help ensure that our conclusions are stable across different reasonable data and/or modeling perturbations, these stability checks are inevitably constrained by computational limitations, and additional stability analyses can be conducted. We hope that by documenting our modeling choices, stability analyses, and providing insights into our motivations/reasons, this may help to both clarify the limitations of our current work and to facilitate the trustworthy (veridical) use of lo-siRF (and variants thereof) in other scientific problems.

11 Bibliography

Allen, Ben, Morgan Lane, Elizabeth Anderson Steeves, and Hollie Raynor. 2022. “Using Explainable Artificial Intelligence to Discover Interactions in an Ecological Model for Obesity.” International Journal of Environmental Research and Public Health 19 (15): 9447.
Bai, Wenjia, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, et al. 2018. “Automated Cardiovascular Magnetic Resonance Image Analysis with Fully Convolutional Networks.” Journal of Cardiovascular Magnetic Resonance 20 (1): 1–12.
Baxter, Robert C. 2023. “Signaling Pathways of the Insulin-Like Growth Factor Binding Proteins.” Endocrine Reviews 44 (5): 753–78.
Behr, Merle, Karl Kumbier, Aldo Cordova-Palomera, Matthew Aguirre, Euan Ashley, Atul J Butte, Rima Arnaout, Ben Brown, James Priest, and Bin Yu. 2020. “Learning Epistatic Polygenic Phenotypes with Boolean Interactions.” bioRxiv, 2020–11.
Bycroft, Clare, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T Elliott, Kevin Sharp, Allan Motyer, et al. 2018. “The UK Biobank Resource with Deep Phenotyping and Genomic Data.” Nature 562 (7726): 203–9.
Cliff, Ashley, Jonathon Romero, David Kainer, Angelica Walker, Anna Furches, and Daniel Jacobson. 2019. “A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks.” Genes 10 (12): 996.
Crawford, Lorin, Ping Zeng, Sayan Mukherjee, and Xiang Zhou. 2017. “Detecting Epistasis with the Marginal Epistasis Test in Genetic Mapping Studies of Quantitative Traits.” PLoS Genetics 13 (7): e1006869.
De Rosa, Arianna, Andrea Fontana, Tommaso Nuzzo, Martina Garofalo, Anna Di Maio, Daniela Punzo, Massimiliano Copetti, et al. 2022. “Machine Learning Algorithm Unveils Glutamatergic Alterations in the Post-Mortem Schizophrenia Brain.” Schizophrenia 8 (1): 8.
Du Bois, Delafield, and Eugene F Du Bois. 1916. “Clinical Calorimetry: Tenth Paper a Formula to Estimate the Approximate Surface Area If Height and Weight Be Known.” Archives of Internal Medicine 17 (6_2): 863–71.
Engel, David J, Allan Schwartz, and Shunichi Homma. 2016. “Athletic Cardiac Remodeling in US Professional Basketball Players.” JAMA Cardiology 1 (1): 80–87.
Fry, Anna, Thomas J Littlejohns, Cathie Sudlow, Nicola Doherty, Ligia Adamska, Tim Sprosen, Rory Collins, and Naomi E Allen. 2017. “Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants with Those of the General Population.” American Journal of Epidemiology 186 (9): 1026–34.
Harper, Andrew R, Anuj Goel, Christopher Grace, Kate L Thomson, Steffen E Petersen, Xiao Xu, Adam Waring, et al. 2021. “Common Genetic Variants and Modifiable Risk Factors Underpin Hypertrophic Cardiomyopathy Susceptibility and Expressivity.” Nature Genetics 53 (2): 135–42.
Hooker, Giles, Lucas Mentch, and Siyu Zhou. 2021. “Unrestricted Permutation Forces Extrapolation: Variable Importance Requires at Least One More Model, or There Is No Free Variable Importance.” Statistics and Computing 31: 1–16.
Khurshid, Shaan, Julieta Lazarte, James P Pirruccello, Lu-Chen Weng, Seung Hoan Choi, Amelia W Hall, Xin Wang, et al. 2023. “Clinical and Genetic Associations of Deep Learning-Derived Cardiac Magnetic Resonance-Based Left Ventricular Mass.” Nature Communications 14 (1): 1558.
Kumbier, Karl, Sumanta Basu, Erwin Frise, James B Brown, Susan Celniker, and Bin Yu. 2023. “Signed Iterative Random Forests to Identify Enhancer-Associated Transcription Factor Binding.” arXiv Preprint arXiv:1810.07287.
Lee, Seunggeun, Mary J Emond, Michael J Bamshad, Kathleen C Barnes, Mark J Rieder, Deborah A Nickerson, David C Christiani, Mark M Wurfel, and Xihong Lin. 2012. “Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies.” The American Journal of Human Genetics 91 (2): 224–37.
Leeuw, Christiaan A de, Joris M Mooij, Tom Heskes, and Danielle Posthuma. 2015. “MAGMA: Generalized Gene-Set Analysis of GWAS Data.” PLoS Computational Biology 11 (4): e1004219.
Loh, Po-Ru, George Tucker, Brendan K Bulik-Sullivan, Bjarni J Vilhjalmsson, Hilary K Finucane, Rany M Salem, Daniel I Chasman, et al. 2015. “Efficient Bayesian Mixed-Model Analysis Increases Association Power in Large Cohorts.” Nature Genetics 47 (3): 284–90.
Mbatchou, Joelle, Leland Barnard, Joshua Backman, Anthony Marcketta, Jack A Kosmicki, Andrey Ziyatdinov, Christian Benner, et al. 2021. “Computationally Efficient Whole-Genome Regression for Quantitative and Binary Traits.” Nature Genetics 53 (7): 1097–1103.
Meyer, Hannah V, Timothy JW Dawes, Marta Serrani, Wenjia Bai, Paweł Tokarczuk, Jiashen Cai, Antonio de Marvao, et al. 2020. “Genetic and Functional Insights into the Fractal Structure of the Heart.” Nature 584 (7822): 589–94.
Mizukoshi, Kei, Masaaki Takeuchi, Yasufumi Nagata, Karima Addetia, Roberto M Lang, Yoshihiro J Akashi, and Yutaka Otsuji. 2016. “Normal Values of Left Ventricular Mass Index Assessed by Transthoracic Three-Dimensional Echocardiography.” Journal of the American Society of Echocardiography 29 (1): 51–61.
Mootha, Vamsi K, Cecilia M Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, et al. 2003. “PGC-1\(\alpha\)-Responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes.” Nature Genetics 34 (3): 267–73.
Nelson, David L, Albert L Lehninger, and Michael M Cox. 2008. Lehninger Principles of Biochemistry. Macmillan.
Niel, Clément, Christine Sinoquet, Christian Dina, and Ghislain Rocheleau. 2015. “A Survey about Methods Dedicated to Epistasis Detection.” Frontiers in Genetics 6: 285.
Pirruccello, James P, Alexander Bick, Minxian Wang, Mark Chaffin, Samuel Friedman, Jie Yao, Xiuqing Guo, et al. 2020. “Analysis of Cardiac Magnetic Resonance Imaging in 36,000 Individuals Yields Genetic Insights into Dilated Cardiomyopathy.” Nature Communications 11 (1): 2254.
Purcell, Shaun, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller, et al. 2007. “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses.” The American Journal of Human Genetics 81 (3): 559–75.
Schaid, Daniel J, Wenan Chen, and Nicholas B Larson. 2018. “From Genome-Wide Associations to Candidate Causal Variants by Statistical Fine-Mapping.” Nature Reviews Genetics 19 (8): 491–504.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50.
Toloşi, Laura, and Thomas Lengauer. 2011. “Classification with Correlated Features: Unreliability of Feature Ranking and Solutions.” Bioinformatics 27 (14): 1986–94.
Visscher, Peter M, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 2017. “10 Years of GWAS Discovery: Biology, Function, and Translation.” The American Journal of Human Genetics 101 (1): 5–22.
Walker, Angelica M, Ashley Cliff, Jonathon Romero, Manesh B Shah, Piet Jones, Joao Gabriel Felipe Machado Gazolla, Daniel A Jacobson, and David Kainer. 2022. “Evaluating the Performance of Random Forest and Iterative Random Forest Based Methods When Applied to Gene Expression Data.” Computational and Structural Biotechnology Journal 20: 3372–86.
Wang, Kai, Mingyao Li, and Hakon Hakonarson. 2010. “ANNOVAR: Functional Annotation of Genetic Variants from High-Throughput Sequencing Data.” Nucleic Acids Research 38 (16): e164–64.
Wang, Qianru, Tiffany M Tang, Nathan Youlton, Chad S Weldy, Ana M Kenney, Omer Ronen, J Weston Hughes, et al. 2023. “Epistasis Regulates Genetic Control of Cardiac Hypertrophy.”
Watanabe, Kyoko, Erdogan Taskesen, Arjen Van Bochoven, and Danielle Posthuma. 2017. “Functional Mapping and Annotation of Genetic Associations with FUMA.” Nature Communications 8 (1): 1826.
Yildiz, Mehmet, Ahmet Afşin Oktay, Merrill H Stewart, Richard V Milani, Hector O Ventura, and Carl J Lavie. 2020. “Left Ventricular Hypertrophy and Hypertension.” Progress in Cardiovascular Diseases 63 (1): 10–21.
Yu, Bin. 2013. “Stability.”
Yu, Bin, and Karl Kumbier. 2020. “Veridical Data Science.” Proceedings of the National Academy of Sciences 117 (8): 3920–29. https://doi.org/10.1073/pnas.1901326117.
---
title: "PCS documentation for low-signal signed iterative random forests (lo-siRF)"
# author: "Tiffany M. Tang"
date: "September 27, 2024"
bibliography: bibliography.bib
header-includes:
    - \usepackage{float}
    - \usepackage{amsmath}
    - \usepackage{gensymb}
output:
  vthemes::vmodern:
    number_sections: true
params:
  eval: false
---


```{r setup, include=FALSE}
options(width = 10000)
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE,
  cache = FALSE,
  # eval = FALSE,
  fig.align = "center",
  fig.pos = "H",
  # fig.show = "hold",
  fig.height = 12,
  fig.width = 10
)

library(knitr)
library(fontawesome)
library(tidyverse)
library(data.table)
library(plotly)
library(kableExtra)
library(patchwork)
library(vthemes)
library(vdocs)

source(file.path("..", "functions", "plot-functions.R"), chdir = T)
source(file.path("..", "functions", "load-functions.R"), chdir = T)
source(file.path("..", "functions", "eval-functions.R"), chdir = T)

options(
  knitr.kable.NA = "NA",
  dplyr.summarise.inform = FALSE
)

# scrollable text output
local({
  hook_output <- knitr::knit_hooks$get("output")
  knitr::knit_hooks$set(output = function(x, options) {
    if (!is.null(options$max.height)) {
      options$attr.output <- c(
        options$attr.output,
        sprintf('style="max-height: %s;"', options$max.height)
      )
    }
    hook_output(x, options)
  })
})

display_image <- function(fname, chunk_idx, caption = "''") {
  subchunkify(
    g = knitr::include_graphics(fname), i = chunk_idx,
    caption = caption,
    add_class = c("panel panel-default"),
    other_args = "out.width='100%'"
  )
  chunk_idx <<- chunk_idx + 1
}

subchunkify <- function(...) {
  vthemes::subchunkify(...)
  cat("\n\n")
}

chunk_idx <- 1
data_dir <- file.path("..", "data")
res_dir <- file.path("..", "results")
tab_res_dir <- file.path(res_dir, "tables")
fig_res_dir <- file.path(res_dir, "figures")
phenotypes <- c("LVMi" = "iLVM")
phenotypes_all <- c("LVMi" = "iLVM", "LVM" = "LVM")
```

```{css, echo=FALSE}
.btn-toolbar {
  display: none;
}

.figure p.caption {
  font-size: 90%;
  padding: 8px;
}

.figure {
  display: inherit;
}

.th-group-header {
  padding: 5px !important;
  border: none !important;
  text-align: center;
}

.th-group-header:after {
  content: ""; /* This is necessary for the pseudo element to work. */ 
  display: block; /* This will put the pseudo element on its own line. */
  margin: 0 auto; /* This will center the border. */
  width: 95%; /* Change this to whatever width you want. */
  padding-top: 14px; /* This creates some space between the element and the border. */
  border-bottom: 1px solid black; /* This creates the border. Replace black with whatever color you want. */
}

.th-group-subheader {
  border-top: none !important;
}

.citation {
  color: #009688;
}

.csl-entry {
  margin-bottom: 0.5em;
}
```


# Overview {.tabset .tabset-fade .tabset-vmodern}

<div class="panel panel-default padded-panel">
In [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1), we developed an end-to-end pipeline to demonstrate the role of epistasis in the genetic control of cardiac hypertrophy. 
This pipeline consisted of four major phases: 
(1) the derivation of estimates of left ventricular mass via deep learning,
(2) the computational prioritization of epistatic drivers based on data from the UK Biobank [@bycroft2018uk], 
(3) functional interpretation of the hypothesized epistatic genetic loci, and 
(4) experimental confirmation of epistasis in cardiac transcriptome and cardiac cellular morphology.

Here, we expand upon the computational prioritization of epistatic drivers phase. 
In this phase, we leveraged large-scale genetic and cardiac MRI data from the UK Biobank and developed the low-signal signed iterative random forest (lo-siRF) to prioritize genetic loci and interactions between loci for downstream investigations. 
As lo-siRF is heavily rooted in the Predictability, Computability, and Stability (PCS) framework for veridical (trustworthy) data science [@yu2020veridical], we conducted extensive stability analyses to help minimize the impact of human judgment calls and modeling decisions on our scientific conclusions. 
In this supplementary PCS documentation, we will:

1. Transparently document and justify the many human judgment calls and modeling decisions that were inevitably made throughout our statistical analysis.
2. Provide additional exploration and insight into the main results presented in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1) alongside the stability analyses.

In what follows, we organize this documentation by step in the lo-siRF pipeline (see [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)), beginning with the choice of phenotypic data, followed by the dimension reduction step, binarization step, prediction step, and concluding with the prioritization step. 
We then investigate the possibility of pleiotropy between LV mass and blood pressure, conducting a stability analysis where we examine the impact of excluding hypertensive individuals from the lo-siRF analysis. 
Lastly, we compare the lo-siRF prioritized loci and interactions between loci with those identified using existing methods and conclude with final remarks.

</div>

<div class="panel panel-default padded-panel">
&#9888;
**Origins and intended usage of lo-siRF:** 
As discussed in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1), lo-siRF was originally developed as a first-stage recommendation (or hypothesis generation) tool to prioritize genetic loci and interactions between loci for downstream analyses and experimental validation. 
Importantly, the output of lo-siRF is a prioritization or ranking order of loci and interactions between loci, not a measure of statistical significance. 
We emphasize that recommendations or prioritizations from lo-siRF should not be interpreted as causal findings without further investigation. 
Given the numerous challenges inherent to epistasis discovery (e.g., high heterogeneity, small effect sizes, and unmeasured confounding factors, discussed further in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)), assessing causality solely from data is incredibly complex and difficult to do *reliably*. 
Lo-siRF was thus designed and intended to be used as a single step within a larger scientific discovery pipeline. 
In [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1), we relied on extensive follow-up investigations to validate the lo-siRF-prioritized genetic loci and interactions between loci, and we encourage practitioners to do the same.
</div>

# Choice of Phenotypic Data {.tabset .tabset-fade .tabset-vmodern}

<div class="panel panel-default padded-panel">
In the first step of lo-siRF, we began with a careful choice of phenotypic data. 
This first, but often overlooked, choice is critical to the reliability and interpretation of results.
If the data quality is poor, conclusions may be unreliable. 
If the data is far-removed from the domain problem of interest, conclusions may not be relevant. 
Given the importance of this choice of phenotypic data, we conducted extensive data explorations for different phenotypes in addition obtaining clinician input regarding its clinical relevance.

**Initial choice of phenotype:** 
Our overarching goal was to study the role of epistasis in cardiac hypertrophy. 
One important disease that is closely related to cardiac hypertrophy is hypertrophic cardiomyopathy (HCM). 
HCM is a common heart disease, impacting around 1 in 500 people, where the heart muscle becomes enlarged or thickened. 
Given the high prevalence and possible life-threatening consequences of this disease, we first set out to explore this HCM phenotype, defined as a 0-1 indicator variable of whether or not the individual was diagnosed with HCM (i.e., any ICD10 code of I42.1 or I42.2 in the UK Biobank). 
However, when fitting various machine learning prediction models, including L1 or L2-regularized logistic regression, random forest (RF), iterative RF (iRF), and support vector machines, using single-nucleotide variant (SNV) data from the UK Biobank (details in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)) to predict HCM diagnosis, the balanced classification accuracy on the held-out validation data did not consistently exceed 50\% when trained on different bootstrapped training samples. 
In other words, the fitted prediction models were no better than random guessing. 
This made us wary of any interpretations extracted from these prediction models since it is unclear whether these models are capturing any relevant phenotypic signal or simply noise. 
Given this incredibly weak prediction signal, we deemed that these models did not pass the prediction check (the "P" in PCS) [@yu2020veridical]. 
For this reason, we chose not to proceed further with the HCM phenotype and did not conduct any analysis with respect to interpreting the aforementioned models for HCM.

**Pivoting to a cardiac MRI-derived phenotype:** 
We instead pivoted to an alternative phenotype for cardiac hypertrophy based upon deep-learning-enabled left ventricular (LV) structural analysis using cardiac MRIs. 
This choice was heavily motivated by our struggles with HCM. 
Specifically, when looking back at the HCM diagnosis data, we noticed that the number of HCM cases in the UK Biobank data (only 236 HCM cases out of 337,535 unrelated White British individuals) was much lower than the 1 in 500 ratio that is expected. 
Although the UK Biobank population is known to be healthier than the general population [@fry2017comparison], the observed discrepancy is large, leading us to speculate about potential under-diagnosis issues with the HCM phenotype. 
Consequently, rather than relying on a clinician's diagnosis which may suffer from issues such as under-diagnosis, we chose to build upon recent advances in deep-learning-enabled phenotyping [@bai2018automated]. 
Doing so enables us to extract a measure of left ventricular hypertrophy, specifically, the left ventricular mass (LVM), from anyone with a cardiac MRI. 

**Arriving at LVMi:**
However, LVM is known to be heavily influenced by body weight, height, and gender [@engel2016athletic; @mizukoshi2016normal]. 
Thus, if we were to proceed solely with LVM in our analysis, we risk identifying genetic loci and interactions between loci that are primarily associated with body weight, height, and gender, rather than left ventricular hypertrophy. 
Indeed, if we visualize the relationship between LVM, height, and weight within each gender group (Figure \@ref(fig:subchunk-1)), we see strong positive correlations. 

This motivates the need to adjust for height and weight in order to help decouple the effects of height/weight and LVM. 
As done in @khurshid2023clinical, one common way to address this is by analyzing LVM indexed by body surface area (LVMi), defined as:

\begin{align*}
\text{LVMi} = \text{LVM } / \text{ Body surface area},
\end{align*}
where body surface area is calculated based on height and body weight via the Du Bois formula [@du1916clinical].

After normalizing by body surface area (and hence, height and weight), the associations between LVMi and height/weight are much weaker within each gender group, exhibiting correlations closer to 0 (Figure \@ref(fig:subchunk-1)). 
Thus, by focusing our statistical analysis on LVMi, we lessen the possibility that the identified genetic loci and interactions between loci are only selected because of their association with body height and/or weight. 
Still, there are differences in LVMi between males and females, with males tending to have higher LVMi measurements. 
As seen later, we will account for this in the modeling stage by using gender-specific binarization thresholds to binarize the LVMi responses into the high and low LVMi groups. 

**Study Cohort:** 
Before proceeding, we note that another important human judgment call within our pipeline is the choice of study cohort. 
Specifically, we restricted our analysis to the unrelated White British individual population with both SNV and cardiac MRI data in the UK Biobank, resulting in a cohort of 29,661 individuals. 
We chose to focus this analysis on the unrelated White British population, anticipating the very low-signal problem under study (e.g., due to the generally small effect sizes of common genetic variants and the high phenotypic diversity of cardiac hypertrophy). 
By studying a more homogeneous cohort such as the unrelated White British population, we hope to first establish a proof-of-concept foothold on the role of epistasis in cardiac hypertrophy, and we view the study and generalization to the broader population as a very crucial next step.
</div>

```{r lvm-eda, results = "asis"}
if (params$eval) {
  pheno_data <- fread("/oak/stanford/projects/ukbb/imaging_ashley/20209/results/table_ventricular_volume_with_indexing.csv")
  merge_df <- fread("/oak/stanford/projects/ukbb/genetic_data_ashley/v2/bridging_file_2228_to_13721.csv")
  pheno_data <- left_join(
    x = pheno_data, y = merge_df,
    by = c("V1" = "app22282")
  ) %>%
    filter(!is.na(app13721)) %>%
    mutate(Gender = ifelse(loadDemographics(app13721)[, "gender"] == 1,
      "Male", "Female"
    ) %>% as.factor()) %>%
    as.data.frame() %>%
    rename(
      "LVMi (g/m^2)" = "iLVM (g/m2)",
      "Weight (kg)" = "weight (kg)",
      "Height (cm)" = "height (cm)"
    ) %>%
    dplyr::filter(!is.na(Gender))

  keep_cols <- c("LVM (g)", "LVMi (g/m^2)", "Weight (kg)", "Height (cm)")
  plt <- pheno_data %>%
    dplyr::select(tidyselect::all_of(keep_cols), Gender) %>%
    vdocs::plot_pairs(columns = keep_cols, color = .$Gender, point_size = 0.1)
  # saveRDS(plt, file.path(fig_res_dir, "phenotype_correlation_heatmap.rds"))
  # ggsave(plt,
  #   filename = file.path(fig_res_dir, "phenotype_correlation_heatmap.png"),
  #   width = 8, height = 6
  # )
} else {
  display_image(
    file.path(fig_res_dir, "phenotype_correlation_heatmap.png"),
    chunk_idx = chunk_idx,
    caption = "'Pair plot, illustrating the relationships between LVM, LVMi, height, and weight within each gender group (males in green and females in orange). Density distributions are shown along the diagonal.'"
  )
}
```


# Dimension Reduction Step {.tabset .tabset-fade .tabset-vmodern}

## Overview {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
Having carefully chosen the phenotypic data (i.e., the cardiac MRI-derived LVMi) under study, the next step in lo-siRF is to do dimension reduction to reduce the number of SNVs under consideration using a genome-wide association study (GWAS). 
This dimension reduction step is necessary to mitigate the computational and statistical challenges of searching across millions of possible SNVs (which in the case of searching for epistasis results in an even larger combinatorial explosion of possibilities).

**Why GWAS?** 
We chose to do dimension reduction by running a GWAS as it 
(1) has been shown to be an incredibly useful tool for better understanding the genetic basis of numerous diseases in this field [@visscher201710; @pirruccello2020analysis; @meyer2020genetic; @harper2021common; @khurshid2023clinical], 
(2) is one of the few methods that can handle an input of millions of SNVs in a computationally-tractable manner, and 
(3) is a common, widely-accepted preprocessing step in other related statistical genomics tasks such as fine-mapping [@schaid2018genome].

**Limitations:** 
We acknowledge, however, that using a GWAS to reduce the number of SNVs under consideration has limitations. 
Because a GWAS prioritizes SNVs that have strong marginal associations with the phenotype under study, we may be removing SNVs that are associated with LVMi through an epistatic interaction but not through a marginal association. 
Alleviating this limitation is certainly interesting and an important direction to pursue in future work. 
Regardless, in this current work, we use lo-siRF primarily as a way to generate recommendations for experimental validation of epistasis. 
We thus are most interested in generating *reliable* recommendations for a limited number of experiments, rather than attempting to find all possible interactions that drive the disease. 
Given our particular use-case for lo-siRF, we accept the limitations of using a GWAS for dimension reduction and again view this work as a reliable *first* step (and certainly not the last) for better understanding the role of epistasis in cardiac hypertrophy.

**How was the dimension reduction performed?** 
We performed a GWAS on the training data for the rank-based inverse normal-transformed LVMi using two algorithms, PLINK [@purcell2007plink] and BOLT-LMM [@loh2015efficient]. 
For each of the two GWAS runs, we ranked the SNVs by significance (i.e., the GWAS p-value). 
We then took the union of the top 1000 SNVs (without clumping) from each of the two GWAS runs and proceeded with this resulting set of 1405 SNVs for the remainder of the lo-siRF pipeline. 
We refer to [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1) for the details on the PLINK and BOLT-LMM implementations and use this document instead to justify several of our human judgment calls in this dimension reduction step.

- **Why PLINK and BOLT-LMM?** Since BOLT-LMM and PLINK rely on different statistical models and it is unclear a priori which model is better suited for our problem, we employed both software packages to mitigate the dependence of downstream conclusions on this arbitrary choice of model.
- **Why the rank-based inverse normal-transformed LVMi?** Because the distribution of LVMi measurements is right skewed (see Figure \@ref(fig:subchunk-1)), and the p-value computation in PLINK and BOLT-LMM rely on normality assumptions, the rank-based inverse normal-transform was taken as in @khurshid2023clinical.
- **Why was the union taken?** As mentioned previously, relying on a GWAS to do dimension reduction can be limiting since it is primarily assessing marginal associations between the SNVs and the phenotype. There are also other restrictive modeling assumptions (e.g., linearity) that may add to these limitations. Due to this concern that we may be imposing too many unnecessary restrictions in order for SNVs to pass the GWAS filter, we chose to take the union of the top SNVs from BOLT-LMM and from PLINK, rather than taking the intersection which would have made it more difficult or restrictive for SNVs to pass the GWAS filter.
- **How was the threshold chosen?** We chose the threshold of 1000 SNVs per GWAS method (without clumping) as it (1) struck a balance between the amount of information loss and the computational cost of our downstream modeling and (2) yielded the highest validation prediction accuracy compared to choosing other possible thresholds (500 and 2000) with and without clumping in the prediction modeling stage. We note that this choice of the top 1000 SNVs includes all SNVs that passed the genome-wide significance level (p-value = 5E-8) as well as additional SNVs that did not pass the genome-wide significance level. In particular, taking the top 1000 SNVs corresponds to using the p-value threshold of 1.7E-5 and 6.7E-6 for PLINK and BOLT-LMM, respectively.

**Remark:** We note that alternative dimension reduction strategies may be used in replacement of our proposed procedure. For example, one may consider using a different GWAS software (e.g., REGENIE [@mbatchou2021computationally]), a different phenotype transformation, or perhaps a more sophisticated approach tailored for interactions and epistasis detection (e.g., MAPIT [@crawford2017detecting]).

**Organization:** Under the `GWAS Results Summary` tab, we provide an investigation into the BOLT-LMM and PLINK GWAS results for LVM and LVMi. Under the `GWAS-filtered SNVs`, we explore the selected 1405 SNVs that passed the GWAS filter for the LVMi phenotype.
</div>

## GWAS Results Summary {.unnumbered .tabset}

<div class="panel panel-default padded-panel">
Below, we provide an overview of the genome-wide association study (GWAS) results for both the LVM and LVMi phenotypes. While LVMi is the primary phenotype of interest, we provide the LVM GWAS results for completeness. We further investigate the similarities in findings between the LVM and LVMi GWAS analyses via a stability analysis. 

- Under the `BOLT-LMM` and `PLINK` tabs:
  - We provide a brief look into the GWAS results, including the Manhattan plots for each phenotype (LVMi and LVM). 
  - We also provide a table of the genomic risk loci, lead SNVs, and independent significant SNVs from each GWAS, determined using FUMA [@watanabe2017functional] with the default settings (i.e., maximum p-value of lead SNVs = 5E-8, maximum p-value cutoff = 0.05, $r^2$ threshold to define independent significant SNVs = 0.6, 2nd $r^2$ threshold to define lead SNVs = 0.1, minimum minor allele frequency = 0, maximum distance between LD blocks to merge into a locus = 250kb, using the UKB release2b 10k White British reference panel population).
  - Furthermore, we used FUMA to map SNVs to genes based on position, again with the default settings (maximum distance = 10kb), and provide this table of mapped genes below. 
- Lastly, under the `Stability Analysis` tab, we investigated the similarities and differences between the BOLT-LMM and PLINK results as well as between the LVM and LVMi results. 
</div>

```{r gwas-manhattan, results = "asis"}
gwas_methods <- c("BOLT-LMM", "Plink")

gwas_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square} \n\n"
phenotype_header <- "\n\n#### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle}\n\n"
fuma_header <- "\n\n##### %s {.unnumbered}\n\n"

PVAL_THR <- 4

# plot gwas summary results
if (params$eval) {
  gwas_ls <- list()
  snps_tab_ls <- list()
  genes_tab_ls <- list()
  for (gwas_method in gwas_methods) {
    gwas_ls[[gwas_method]] <- list()
    for (phenotype in phenotypes_all) {
      # retrieve gwas results
      gwas_dir <- paste0("gwas_", str_replace(tolower(gwas_method), "-", "_"))
      fname <- paste0(phenotype, "_norm_app13721.stats_annot")
      # gwas_ls[[gwas_method]][[phenotype]] <- fread(file.path(res_dir, gwas_dir, fname))

      # retrieve annotated fuma gwas results
      phenotype_name <- names(phenotypes_all)[phenotypes_all == phenotype]
      gwas_method_fstr <- stringr::str_replace(tolower(gwas_method), "-", "_")
      phenotype_fstr <- stringr::str_replace(tolower(phenotype), "ilvm", "lvmi")
      fuma_fpath <- file.path(
        res_dir, "gwas_fuma", sprintf("%s_%s_gwas", gwas_method_fstr, phenotype_fstr)
      )
      loci_tab <- fread(file.path(fuma_fpath, "GenomicRiskLoci.txt")) %>%
        tibble::as_tibble()
      leadsnps_tab <- fread(file.path(fuma_fpath, "leadSNPs.txt")) %>%
        dplyr::select(-No) %>%
        tibble::as_tibble()
      indsigsnps_tab <- fread(file.path(fuma_fpath, "IndSigSNPs.txt")) %>%
        dplyr::select(-No) %>%
        tibble::as_tibble()
      genes_tab <- fread(file.path(fuma_fpath, "genes.txt")) %>%
        tibble::as_tibble() %>%
        dplyr::mutate(
          `GWAS Method` = !!gwas_method,
          `Phenotype` = !!phenotype_name
        )
      snps_tab <- indsigsnps_tab %>%
        dplyr::select(GenomicLocus:pos) %>%
        dplyr::mutate(
          `Top Lead SNP` = purrr::map_lgl(uniqID, ~ .x %in% loci_tab$uniqID),
          `Lead SNP` = purrr::map_lgl(uniqID, ~ .x %in% leadsnps_tab$uniqID),
          `Ind. Sig. SNP` = purrr::map_lgl(uniqID, ~ .x %in% indsigsnps_tab$uniqID),
          `GWAS Method` = !!gwas_method,
          `Phenotype` = !!phenotype_name
        )
      snps_tab_ls <- c(snps_tab_ls, list(snps_tab))
      genes_tab_ls <- c(genes_tab_ls, list(genes_tab))
    }
  }

  # compare bolt-lmm vs plink results
  plt_df <- map2_dfr(gwas_ls[["BOLT-LMM"]], gwas_ls[["Plink"]],
    function(bolt, plink) {
      inner_join(
        x = bolt %>%
          select(SNP, `BOLT-LMM` = P_BOLT_LMM_INF) %>%
          # filter(-log10(`BOLT-LMM`) > PVAL_THR) %>%
          mutate(`BOLT-LMM` = -log10(`BOLT-LMM`)),
        y = plink %>%
          select(SNP, Plink = P_GLM) %>%
          # filter(-log10(Plink) > PVAL_THR) %>%
          mutate(Plink = -log10(Plink)),
        by = "SNP"
      )
    },
    .id = "Phenotype"
  ) %>%
    dplyr::mutate( # keep = (`BOLT-LMM` > PVAL_THR) | (Plink > PVAL_THR),
      Phenotype = ifelse(Phenotype == "iLVM", "LVMi", Phenotype)
    )
  plt <- vdocs::plot_point(
    plt_df, # %>% dplyr::filter(keep),
    x_str = "BOLT-LMM", y_str = "Plink", size = .1
  ) +
    facet_grid(~Phenotype) +
    labs(x = "-log10(BOLT-LMM P-value)", y = "-log10(PLINK P-value)")
  # saveRDS(plt, file.path(fig_res_dir, "gwas_pvalues_compare.rds"))
  # ggsave(plt,
  #   filename = file.path(fig_res_dir, "gwas_pvalues_compare.png"),
  #   width = 6, height = 3.5
  # )

  # summarize fuma gwas SNP results using stability across gwas runs
  snps_tab <- dplyr::bind_rows(snps_tab_ls) %>%
    tidyr::pivot_wider(
      id_cols = c(uniqID:pos),
      names_from = c(`GWAS Method`, `Phenotype`),
      values_from = c(`Top Lead SNP`, `Lead SNP`, `Ind. Sig. SNP`),
      names_sep = " ",
      names_prefix = "- ",
      values_fill = FALSE
    ) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      n_counts = sum(dplyr::c_across(where(is.logical)))
    ) %>%
    dplyr::left_join(
      gwas_ls[["BOLT-LMM"]][["iLVM"]] %>%
        dplyr::select(rsID = SNP, `BOLT-LMM LVMi p-value` = P_BOLT_LMM_INF),
      by = "rsID"
    ) %>%
    dplyr::left_join(
      gwas_ls[["BOLT-LMM"]][["LVM"]] %>%
        dplyr::select(rsID = SNP, `BOLT-LMM LVM p-value` = P_BOLT_LMM_INF),
      by = "rsID"
    ) %>%
    dplyr::left_join(
      gwas_ls[["Plink"]][["iLVM"]] %>%
        dplyr::select(rsID = SNP, `Plink LVMi p-value` = P_GLM),
      by = "rsID"
    ) %>%
    dplyr::left_join(
      gwas_ls[["Plink"]][["LVM"]] %>%
        dplyr::select(rsID = SNP, `Plink LVM p-value` = P_GLM),
      by = "rsID"
    ) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      n_p = sum(!is.na(dplyr::c_across(tidyselect::contains("p-value")))),
      avg_p = mean(dplyr::c_across(tidyselect::contains("p-value")), na.rm = TRUE)
    ) %>%
    dplyr::arrange(-n_counts, -n_p, avg_p)

  snps_tab_long <- snps_tab %>%
    dplyr::select(
      -uniqID, -chr, -pos, -n_counts, -n_p, -avg_p,
      -tidyselect::contains("p-value")
    ) %>%
    tidyr::pivot_longer(
      cols = c(
        tidyselect::starts_with("Top"),
        tidyselect::starts_with("Lead"),
        tidyselect::starts_with("Ind")
      )
    ) %>%
    dplyr::mutate(
      Method = stringr::str_trim(stringr::str_extract(name, "(?<=-)\\s*.*")),
      Group = factor(
        stringr::str_trim(stringr::str_extract(name, "^[^-]+")) %>%
          stringr::str_replace("SNP", "SNV"),
        levels = c("Top Lead SNV", "Lead SNV", "Ind. Sig. SNV")
      ),
      rsID = factor(rsID, levels = rev(snps_tab$rsID)),
      value = factor(
        ifelse(value, "True", "False"),
        levels = c("True", "False")
      )
    )

  plt <- ggplot2::ggplot(snps_tab_long) +
    ggplot2::aes(x = Method, y = rsID, fill = value) +
    ggplot2::facet_grid(~Group) +
    ggplot2::geom_tile(color = "white") +
    ggplot2::scale_fill_manual(values = c("#337ab7", "grey80"), na.value = "grey80") +
    ggplot2::labs(x = "", fill = "") +
    vthemes::theme_vmodern(
      x_text_angle = TRUE, bg_color = "white", grid_color = "white"
    )
  # saveRDS(plt, file.path(fig_res_dir, "gwas_fuma_snps_stability_plot.rds"))
  # ggsave(plt,
  #   filename = file.path(fig_res_dir, "gwas_fuma_snps_stability_plot.png"),
  #   width = 8, height = 6
  # )

  dt <- snps_tab %>%
    dplyr::select(uniqID:pos) %>%
    pretty_DT(rownames = FALSE)
  # saveRDS(dt, file.path(tab_res_dir, "gwas_fuma_snps_stability_table.rds"))

  # summarize fuma gwas mapped gene results using stability across gwas runs
  genes_tab <- dplyr::bind_rows(genes_tab_ls) %>%
    dplyr::mutate(value = TRUE) %>%
    tidyr::pivot_wider(
      id_cols = c(ensg, symbol, chr, start, end),
      names_from = c(`GWAS Method`, `Phenotype`),
      values_from = value,
      names_sep = " ",
      values_fill = FALSE
    ) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      n_counts = sum(dplyr::c_across(where(is.logical)))
    ) %>%
    dplyr::arrange(-n_counts)

  genes_tab_long <- genes_tab %>%
    dplyr::select(-ensg, -chr, -start, -end, -n_counts) %>%
    tidyr::pivot_longer(cols = -symbol) %>%
    dplyr::mutate(
      symbol = factor(symbol, levels = rev(genes_tab$symbol)),
      value = factor(
        ifelse(value, "Mapped", "Not Mapped"),
        levels = c("Mapped", "Not Mapped")
      )
    )

  plt <- ggplot2::ggplot(genes_tab_long) +
    ggplot2::aes(x = name, y = symbol, fill = value) +
    ggplot2::geom_tile(color = "white") +
    ggplot2::scale_fill_manual(values = c("#337ab7", "grey80"), na.value = "grey80") +
    ggplot2::labs(x = "", fill = "", y = "Gene Symbol") +
    vthemes::theme_vmodern(
      x_text_angle = TRUE, bg_color = "white", grid_color = "white"
    )
  # saveRDS(plt, file.path(fig_res_dir, "gwas_fuma_genes_stability_plot.rds"))
  # ggsave(plt,
  #   filename = file.path(fig_res_dir, "gwas_fuma_genes_stability_plot.png"),
  #   width = 6, height = 5
  # )

  dt <- genes_tab %>%
    dplyr::select(ensg:end) %>%
    pretty_DT(rownames = FALSE)
  # saveRDS(dt, file.path(tab_res_dir, "gwas_fuma_genes_stability_table.rds"))
} else {
  for (gwas_method in gwas_methods) {
    cat(sprintf(gwas_header, toupper(gwas_method)))
    for (phenotype in phenotypes_all) {
      phenotype_name <- names(phenotypes_all)[phenotypes_all == phenotype]
      gwas_method_fstr <- stringr::str_replace(tolower(gwas_method), "-", "_")
      phenotype_fstr <- stringr::str_replace(tolower(phenotype), "ilvm", "lvmi")
      fpath <- file.path(
        res_dir, "gwas_fuma", sprintf("%s_%s_gwas", gwas_method_fstr, phenotype_fstr)
      )
      cat(sprintf(phenotype_header, phenotype_name))
      display_image(file.path(res_dir, "gwas_fuma", sprintf("manhattan_%s_%s.png", gwas_method_fstr, phenotype_fstr)),
        chunk_idx = chunk_idx,
        caption = sprintf("'GWAS Manhattan plot for the %s phenotype using %s. Red dashed line indicates the genome-wide significance level (p = 5E-8).'", phenotype_name, gwas_method)
      )

      cat(sprintf(fuma_header, "Genomic Risk Loci"))
      dt <- fread(file.path(fpath, "GenomicRiskLoci.txt")) %>%
        dplyr::select(-nSNPs) %>%
        dplyr::rename(
          nGWASSNVs = nGWASSNPs,
          nIndSigSNVs = nIndSigSNPs,
          IndSigSNVs = IndSigSNPs,
          nLeadSNVs = nLeadSNPs,
          LeadSNVs = LeadSNPs
        ) %>%
        pretty_DT(rownames = FALSE)
      subchunkify(dt,
        i = chunk_idx,
        caption = sprintf("'**List of genomic risk loci from FUMA for the %s %s GWAS.** uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the top lead SNV for the given genomic locus. chr = Chromosome of top lead SNV. pos = Position of top lead SNV on hg19. p = P-value of top lead SNV. start = Start position of the locus. end = End position of the locus. nGWASSNVs = Number of the GWAS-tagged candidate SNVs within the genomic locus. nIndSigSNVs = Number of the independent significant SNVs in the genomic locus. IndSigSNVs = rsID of independent significant SNVs in the genomic locus. nLeadSNVs = The number of lead SNVs in the genomic locus. LeadSNVs = rsID of lead SNVs in the genomic locus.'", phenotype_name, gwas_method)
      )
      chunk_idx <- chunk_idx + 1

      cat(sprintf(fuma_header, "Lead SNVs"))
      dt <- fread(file.path(fpath, "leadSNPs.txt")) %>%
        dplyr::select(-No) %>%
        dplyr::rename(
          nIndSigSNVs = nIndSigSNPs,
          IndSigSNVs = IndSigSNPs
        ) %>%
        pretty_DT(rownames = FALSE)
      subchunkify(dt,
        i = chunk_idx,
        caption = sprintf("'**List of lead SNVs from FUMA for the %s %s GWAS.** GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nIndSigSNVs = Number of independent significant SNVs which are in LD of the lead SNV at $r^2$ = 0.1. IndSigSNVs = rsID of independent significant SNVs which are in LD of the lead SNV at $r^2$ = 0.1.'", phenotype_name, gwas_method)
      )
      chunk_idx <- chunk_idx + 1

      cat(sprintf(fuma_header, "Independent Significant SNVs"))
      dt <- fread(file.path(fpath, "IndSigSNPs.txt")) %>%
        dplyr::select(-No, -nSNPs) %>%
        dplyr::rename(
          nGWASSNVs = nGWASSNPs
        ) %>%
        pretty_DT(rownames = FALSE)
      subchunkify(dt,
        i = chunk_idx,
        caption = sprintf("'**List of independent significant SNVs from FUMA for the %s %s GWAS.** GenomicLocus = Index of assigned genomic locus (note: multiple independent lead SNVs can be assigned to the same genomic locus). uniqID = Unique ID of SNVs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered. rsID = rsID of the SNV. chr = Chromosome. pos = Position on hg19. p = P-value from GWAS. nGWASSNVs = Number of GWAS-tagged SNVs which are in LD of the independent significant SNV given $r^2$ = 0.6.'", phenotype_name, gwas_method)
      )
      chunk_idx <- chunk_idx + 1

      cat(sprintf(fuma_header, "Mapped Genes"))
      dt <- fread(file.path(fpath, "genes.txt")) %>%
        dplyr::select(-tidyselect::any_of(c("ciMap", "ciMapts"))) %>%
        dplyr::rename(
          posMapSNVs = posMapSNPs,
          IndSigSNVs = IndSigSNPs
        ) %>%
        pretty_DT(rownames = FALSE)
      subchunkify(dt,
        i = chunk_idx,
        caption = sprintf("'**List of mapped genes (via positional mapping) from FUMA for the %s %s GWAS.** ensg = ENSG ID. symbol = Gene Symbol. chr = Chromosome. start = Starting position of the gene. end = Ending position of the gene. strand = Strand of the gene. type = Gene biotype from Ensembl. entrezID = entrez ID (if available). HUGO = HUGO (HGNC) gene symbol. pLI = pLI score (the probability of being loss-of-function intolerant) from ExAC database; the higher the score is, the more intolerant to loss-of-function mutations the gene is. ncRVIS = Non-coding residual variation intolerance score; the higher the score is, the more intolerant to non-coding variation the gene is. posMapSNVs = Number of SNVs mapped to gene based on positional mapping. posMapMaxCADD = The maximum CADD score of mapped SNVs by positional mapping. minGwasP = The minimum P-value of mapped SNVs. IndSigSNVs = rsID of the independent significant SNVs that are in LD with the mapped SNVs. GenomicLocus = Index of genomic loci where mapped SNVs are from.'", phenotype_name, gwas_method)
      )
      chunk_idx <- chunk_idx + 1
    }
  }
}
```



### Stability Analysis {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle}

<div class="panel panel-default padded-panel">
To further investigate the choice of GWAS method and phenotype, we conducted a deeper exploration into the similarities and differences between the BOLT-LMM and PLINK results as well as between the LVM and LVMi results in the stability analysis below. The primary goal of this stability analysis is to answer the following two questions:

1. **Are there large differences between the GWAS p-values from BOLT-LMM and PLINK?**
  - Figure \@ref(fig:subchunk-22): For each SNV, we plot the p-value (log-transformed) obtained via BOLT-LMM (x-axis) and PLINK (y-axis) for each of the LVM (left) and LVMi (right) phenotypes. We see that the most significant GWAS hits tend to be stable, no matter the choice of algorithm, but as the strength of the association decreases, there is greater variation in the p-value between BOLT-LMM and PLINK. This is to say that there is more stability around the SNVs with the strongest associations with LVMi and less stability around the SNVs with moderate or weak associations with LVMi. Another reason for taking the union of the top 1000 SNVs from each GWAS algorithm in the lo-siRF pipeline was to allow these moderate to weak association SNVs the chance of being identified in a multivariate model (i.e., via iterative random forest).

2. **What are the FUMA GWAS findings that are stable across GWAS methods (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi)?** As advocated by the stability principle ("S" in PCS) [@yu2020veridical; @yu2013stability], we view stability across these arbitrary choices as a minimum requirement for making a scientific discovery.
  - In Figure \@ref(fig:subchunk-23), we summarize the SNVs, identified by FUMA, to be the top lead SNVs (per genomic risk locus), lead SNVs, and/or independent significant SNVs for each GWAS method (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi). Some takeaways here:
    - rs3045696, an intronic *TTN* variant, is the only stable top lead SNV, identified across all four GWAS runs.
    - Besides rs3045696, rs1873164, an intronic *CCDC141* variant, is the only stable lead SNV, identified by FUMA across all four GWAS runs.
    - rs7591091 (an intronic *CCDC141* variant with the highest frequency of occurrence in lo-siRF, seen later), along with rs967507, rs12622500, and rs10178003, were stably identified as independent significant SNVs by FUMA across all four GWAS runs.
    - Two other SNVs (rs62178977, rs73973171) were also identified as independent significant SNVs by FUMA using both BOLT-LMM and PLINK for the LVMi phenotype.
  - In Figure \@ref(fig:subchunk-24), we summarize the genes that were positionally-mapped and prioritized by FUMA across the different GWAS methods (BOLT-LMM and PLINK) and choice of phenotype (LVM and LVMi).
    - The genes that were stably mapped by FUMA across all four GWAS runs are *PRKRA*, *DFNB59*, *FKBP7*, *PLEKHA3*, *TTN*, and *CCDC141*. We note that all of these genes are located near each other on chromosome 2 within ~600kb. The start position of PRKRA is 179296141 while the end position of CCDC141 is 179914813, with all other aforementioned genes in between.
    - No other genes were identified when using both BOLT-LMM and PLINK for the LVMi phenotype.
</div>

```{r gwas-stability, results = "asis"}
PVAL_THR <- 4

# plot summary results
if (!params$eval) {
  display_image(file.path(fig_res_dir, "gwas_pvalues_compare.png"),
    chunk_idx = chunk_idx,
    caption = "'Comparison of the BOLT-LMM and PLINK GWAS p-values for each phenotype (LVM left, LVMi right). Each point represents a SNV with its BOLT-LMM p-value on the x-axis and its PLINK p-value on the y-axis.'"
  )

  display_image(file.path(fig_res_dir, "gwas_fuma_snps_stability_plot.png"),
    chunk_idx = chunk_idx,
    caption = "'Summary of annotated SNVs from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).'"
  )

  # dt <- readRDS(file.path(tab_res_dir, "gwas_fuma_snps_stability_table.rds"))
  # subchunkify(dt, i = chunk_idx, caption = "'List of annotated SNVs from FUMA.'")
  # chunk_idx <- chunk_idx + 1

  display_image(file.path(fig_res_dir, "gwas_fuma_genes_stability_plot.png"),
    chunk_idx = chunk_idx,
    caption = "'Summary of mapped genes from FUMA across different GWAS methods (BOLT-LMM and PLINK) and across different choices of phenotype (LVMi and LVM).'"
  )

  # dt <- readRDS(file.path(tab_res_dir, "gwas_fuma_genes_stability_table.rds"))
  # subchunkify(dt, i = chunk_idx, caption = "'List of mapped genes from FUMA.'")
  # chunk_idx <- chunk_idx + 1
}
```

## GWAS-filtered SNVs {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
In this section, we briefly explore the selected 1405 SNVs that passed the GWAS filter for the LVMi phenotype. We make two observations that will impact how we interpret the lo-siRF fit downstream:

- There are strong correlations within blocks of neighboring SNVs due to linkage disequilibrium (LD) (Figure \@ref(fig:subchunk-25)). This is a large contributing factor for why we ultimately extracted feature importances from the lo-siRF fit at the level of genetic loci, rather than for each individual SNV.
- There are large differences in the number of SNVs that passed the GWAS filter from each genetic loci (Figure \@ref(fig:subchunk-26)). We hence must account for this in the lo-siRF locus-level importance score. 

We will expand upon this locus-level importance score in a later section (see `Prioritization Step`).

Additional notes:

- Given the interest in *CCDC141* and *TTN* in our work and their close proximity to each other on chromosome 2, we point out a couple pairwise (Pearson) correlations of particular relevance:
  - $| Cor(\text{rs7591091}, \text{ rs66733621}) | = 0.35$
  - $| Cor(\text{rs7591091}, \text{ rs3045696}) | = 0.34$
  - rs7591091 is the top SNV from the CCDC141 locus in lo-siRF and is one of the top CCDC141 GWAS hits (in fact, is identified to be a stable independent significant SNV (see `GWAS Results Summary` tab)). rs3045696, an intronic *TTN* SNV, was stably identified to be a top lead SNV in the GWASes, and rs66733621 is the top SNV from the TTN locus in lo-siRF. This is to say that the correlation between the CCDC141 and TTN loci is moderate and may possibly suggest independent roles of CCDC141 and TTN in their detected interaction. A further discussion of this is provided in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1). [Note: we use the term *top SNV from a genetic locus in lo-siRF* to mean the SNV in the specified genetic locus with the highest frequency of occurrence across decision paths in the siRF fit.]

</div>

```{r eda-top-snps, results = "asis"}
HEATMAP_HEIGHT <- 8

phenotype_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
subsection_header <- "\n\n#### %s {.unnumbered}\n\n"
phenotype <- "iLVM"
phenotype_name <- "LVMi"

if (params$eval) {
  # load in data
  nsnps <- 1000
  snplist <- "app13721"
  bolt_dir <- file.path("..", "results", "gwas_bolt_lmm")
  plink_dir <- file.path("..", "results", "gwas_plink")
  keep_snps <- map(
    c(bolt_dir, plink_dir),
    function(fdir) {
      fpath <- file.path(
        fdir,
        paste0(
          phenotype, "_norm_", snplist,
          ".stats_annot"
        )
      )
      snps <- fread(fpath) %>%
        slice(1:nsnps) %>%
        pull(SNP)
    }
  ) %>%
    purrr::reduce(c) %>%
    unique()
  snp_df <- loadSNPInfo(keep_snps, by = "rsID")
  # write.csv(
  #   snp_df %>%
  #     dplyr::select(-Name),
  #   "../results/gwas_filtered_snps.csv",
  #   quote = FALSE, row.names = FALSE
  # )

  # frequency table of snps in each loci
  tab_out <- snp_df %>%
    group_by(Chr, Gene) %>%
    summarise(
      `Start Pos.` = min(Pos),
      `End Pos.` = max(Pos),
      `# SNPs` = n(),
      `Gene Function` = data.frame(table(`Gene Function`)) %>%
        mutate(str = paste0(Gene.Function, ": ", Freq)) %>%
        pull(str) %>%
        paste(collapse = "; ")
    ) %>%
    arrange(Chr, `Start Pos.`) %>%
    rename("Locus" = "Gene", "Function" = "Gene Function")
  # saveRDS(
  #   tab_out,
  #   file.path(
  #     tab_res_dir,
  #     paste0(phenotype, "_gwas_snps_frequency_by_gene.rds")
  #   )
  # )
  # write.csv(tab_out, "../results/gwas_filtered_snps_by_loci.csv", quote = FALSE, row.names = FALSE)

  # get data to make correlation heatmap
  invisible(capture.output(
    train_data <- loadData(
      pheno_name = phenotype, pheno_binary = F,
      n = 15000, n_test = 0, include_dems = F,
      snplist = snplist, keep_snps = keep_snps
    )
  ))

  # snp correlation heatmap
  gnames <- snp_df %>%
    group_by(Gene) %>%
    mutate(GeneName = paste(Gene, rsID, sep = ", ")) %>%
    pull(GeneName)
  plt_df <- train_data$geno_train
  colnames(plt_df) <- gnames
  plt <- vdocs::plot_cor_heatmap(X = plt_df, absolute_value = TRUE, clust = FALSE) +
    labs(x = "SNV", y = "SNV", fill = "|Correlation|") +
    ggplot2::scale_fill_gradient(low = "white", high = "red")
  # clean up axes labels - only show genes with > 2 snps
  data_long <- plt$data %>%
    mutate(
      Gene_y = forcats::fct_inorder(purrr::map_chr(stringr::str_split(y, ", "), ~ .x[[1]])),
      Gene_x = forcats::fct_inorder(purrr::map_chr(stringr::str_split(x, ", "), ~ .x[[1]]))
    )
  gene_freq <- table(data_long$Gene_x) / min(table(data_long$Gene_x))
  invisible_genes <- names(gene_freq[gene_freq <= 2])
  plt_breaks <- data_long %>%
    dplyr::group_by(Gene_x) %>%
    dplyr::mutate(
      order = 1:dplyr::n() - round(dplyr::n() / 2),
      Gene_x = ifelse(Gene_x %in% invisible_genes, "", as.character(Gene_x))
    ) %>%
    dplyr::filter(order == 0) %>%
    dplyr::ungroup()
  plt_breaks <- dplyr::bind_rows(
    plt_breaks,
    data.frame(Gene_x = "TTN, rs66733621", x = "TTN, rs66733621", order = 1),
    data.frame(Gene_x = "TTN, rs3045696", x = "TTN, rs3045696", order = 1),
    data.frame(Gene_x = "CCDC141, rs7591091", x = "CCDC141, rs7591091", order = 1)
  )
  plt <- plt +
    ggplot2::scale_x_discrete(
      breaks = plt_breaks$x, labels = plt_breaks$Gene_x
    ) +
    ggplot2::scale_y_discrete(
      breaks = plt_breaks$x, labels = plt_breaks$Gene_x
    ) +
    ggplot2::theme(
      axis.text.x = ggplot2::element_text(
        angle = 90, hjust = 1, vjust = 0.5,
        color = ifelse(plt_breaks$order, "cornflowerblue", "black")
      ),
      axis.text.y = ggplot2::element_text(color = ifelse(plt_breaks$order, "cornflowerblue", "black"))
    )
  # saveRDS(
  #   plt,
  #   file.path(
  #     fig_res_dir,
  #     paste0(phenotype, "_gwas_snps_cor_heatmap_by_pos.rds")
  #   )
  # )
  # ggsave(plt,
  #   filename = file.path(fig_res_dir, paste0(phenotype, "_gwas_snps_cor_heatmap_by_pos.png")),
  #   width = 12, height = 10
  # )
} else {
  cat(sprintf(phenotype_header, sprintf("%s - Correlation Between GWAS-filtered SNVs", phenotype_name)))
  display_image(file.path(fig_res_dir, paste0(phenotype, "_gwas_snps_cor_heatmap_by_pos.png")),
    chunk_idx = chunk_idx,
    caption = "'Correlation heatmap for the GWAS-filtered SNVs. We plot the magnitude of the Pearson correlation between all possible pairs of SNVs that passed the GWAS filter. Darker red indicates higher correlation between the two SNVs. SNVs along the axes are ordered according to genomic location. Loci with more than two constituent SNVs are labeled.'"
  )

  cat(sprintf(phenotype_header, sprintf("%s - Table of GWAS-filtered SNVs", phenotype_name)))
  tab <- readRDS(
    file.path(
      tab_res_dir,
      paste0(phenotype, "_gwas_snps_frequency_by_gene.rds")
    )
  ) %>%
    dplyr::select(-`Start Pos.`, -`End Pos.`) %>%
    dplyr::rename("Locus" = "Gene", "# SNVs" = "# SNPs", "Function" = "Gene Function") %>%
    pretty_DT(rownames = FALSE)
  subchunkify(tab,
    i = chunk_idx, other_args = "out.width='100%'",
    caption = "'Number of SNVs per genetic locus that passed the GWAS filter. Note: a semicolon is used to denote the intergenic region between two genes.'"
  )
  chunk_idx <- chunk_idx + 1
}
```


# Binarization Step {.tabset .tabset-fade .tabset-vmodern}

## Overview {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">

In the next step of lo-siRF, we chose to binarize the (continuous) LVMi phenotype measurements into two categories: a low LVMi group and a high LVMi group before fitting the prediction models.

**Why did we choose to binarize?** One of our main motivating factors to pursue binarization revolved around the prediction check criteria in the PCS framework [@yu2020veridical]. This predictability principle advocates for the use of prediction accuracy as a reality check. This is to say that at a minimum, models should fit the data well, as measured by prediction accuracy, before believing that the model is capturing something relevant to reality or the real-world phenomena under study. In particular, before we thought to binarize the phenotype, we first attempted to train prediction models using the GWAS-filtered SNVs to predict the original (continuous) LVMi phenotype measurements. While this seemed like the most natural next step, these prediction results illuminated an incredibly weak prediction signal, with validation $R^2$ values hovering very close to 0 (see Figure \@ref(fig:subchunk-27) in the `Prediction Results without Binarization` tab). Given that an $R^2$ value of 0 can be achieved by simply predicting the mean LVMi response, this raised the question of whether these prediction models are capturing any biologically-relevant signal in the data. Consequently, the interpretation of these models may not be an accurate representation of reality. Put differently, these models do not satisfy the prediction check criteria under the PCS framework.

Given these difficulties in accurately predicting LVMi, we drew inspiration from how scientists often reduce complex problems into simplified versions. Here, the complex version of our problem aims to understand how SNVs are predictive of one's precise (continuous-valued) LVMi measurement. This is challenging, as demonstrated by the poor prediction accuracy discussed previously. On the other hand, arguably the most simplified version of this problem is to understand how SNVs are predictive of whether one has an extremely high or extremely low LVMi measurement. Intuitively, if there are in fact genetic differences between these two LV chamber states, then differentiating between the two groups --- those with extremely high LVMi and those with extremely low LVMi --- should be easier than the original regression problem (i.e., predicting the raw continuous-valued LVMi response). Indeed, we will see this to be the case, observing that the resulting classification models consistently performed better than random guessing unlike the original regression models. More specifically, we observed balanced classification accuracies $>55\%$, area under the receiver operator characteristic (AUROC) $>0.58$, and area under the precision-recall curve (AUPRC) $>0.55$ for the constructed binary classification task (see the `Prediction Step` tab). This prediction performance instills greater trustworthiness that the interpretation of these models after binarization are more relevant to the biological phenomena than the case without binarization.

**How is the binarization done?** We transformed the original regression problem of predicting the raw continuous-valued LVMi response into a binary classification problem as follows. For particular choice of threshold $x$, we partitioned the raw continuous-valued LVMi phenotype into a low, middle, and high LVMi group corresponding to the $[0, x]$, $(x, 1-x)$, and $[1-x, 1]$ quantile ranges, respectively. We then omitted the individuals in the middle quantile range and trained the prediction models to differentiate between the low and high LVMi individuals. However, due to the gender-specific biological variation of LVMi shown earlier in the exploratory data analysis (see `Choice of Phenotypic Data`), we performed this binarization procedure for males and females separately. That is, we computed the quantiles for males and females separately. Finally, given that this threshold is artificially introduced, we run the remainder of the lo-siRF pipeline using three different reasonable binarization thresholds (15\%, 20\%, 25\%) that balance the improvement in prediction signal and amount of data lost. In particular, we note that if we had chosen a binarization threshold of 50\% so that no data is thrown out, then the balanced classification accuracy of the best-performing prediction model (signed iterative random forest) falls to $51%$, which is very similar to random guessing. In the end, we will consolidate the results that were stable across all three binarization thresholds (15\%, 20\%, 25\%), described later.

**Limitations of binarization:** We recognize that this binarization procedure has limitations and that these limitations should be carefully considered in context of the problem under study. For instance, the utility of this binarization is highly dependent on whether understanding the differences between the low and high quantiles is relevant to the study aims. In our case, we believe that the connection between cardiac hypertrophy and those with high LVMi helps to justify the use of binarizing based on the low and high quantiles. If one does not expect a monotonic relationship between the genetic features and the phenotype, then binarization may not be appropriate. In addition, binarization also requires the choice of binarization threshold. This choice is almost certainly a subjective choice and thus necessitates a stability analysis, that is, analyzing the stability of findings across different choices of binarization thresholds. In our case, we only prioritize loci/interactions that were stably important across three different choices of binarization threshold (15%, 20%, and 25%). Finally, binarization also reduces the sample size and systematically omits individuals who are not in the tails of the phenotype distribution. As a result, there may be additional interactions that are missed in lo-siRF. While we do not view this as a major concern if the goal is to recommend a small set of reliable experimental targets, this limitation may certainly impact studies with different end goals.

We expand on the prediction modeling with the binarized LVMi phenotype in the next section (see `Prediction Step`).

</div>


## Prediction Results without Binarization {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">

Here, we present a summary of the prediction modeling results, where we use the GWAS-filtered SNVs to predict the raw LVMi phenotype measurements (i.e., without any binarization). Specifically, we measure the validation set prediction accuracy from several popular machine learning regression algorithms (kernel ridge regression, Lasso regression, ridge regression, random forest (RF), and signed iterative random forest (siRF)) across four different prediction performance metrics - the Pearson correlation, mean absolute error (MAE), R-squared, and root-mean-squared error (RMSE) between the observed LVMi response and the predicted LVMi response. We also visualize how the predicted responses differ across the different prediction models in the provided pair plots. Though RF appears to yield the best prediction performance relative to the other methods under consideration, the RF prediction performance is still very poor as both the correlation and R-squared metrics are very close to 0. 

</div>

```{r pred-results-cts, results = "asis"}
# phenotype_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
section_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
subsection_header <- "\n\n##### %s {.unnumbered} \n\n"

methods <- c("Lasso" = "lasso", "Ridge" = "ridge", "RF" = "rf", "siRF" = "irf", "Kernel Ridge" = "kernel_ridge")

if (params$eval) {
  for (phenotype in phenotypes) {
    fpath <- file.path(res_dir, paste0(phenotype, "_gwas_filtered_nsnps1000"))

    # read in prediction results with demographics
    load(file.path(fpath, "validation_pheno.Rdata")) # > pheno
    preds_data <- list(
      yhat = c(
        readRDS(file.path(fpath, "ypred.rds")),
        fread(file.path(fpath, "ypred.csv"))
      ),
      y = pheno,
      gender = ifelse(loadDemographics(names(pheno))[, "gender"] == 1,
        "Male", "Female"
      ) %>%
        as.factor()
    )

    preds_data$yhat <- preds_data$yhat[methods]
    names(preds_data$yhat) <- names(methods)

    # prediction accuracy table
    metrics <- c("RMSE", "R2", "MAE", "Correlation")
    eval_out <- map_dfr(preds_data$yhat,
      ~ evalPreds(
        y = preds_data$y, yhat = c(.x),
        metric = metrics
      ),
      .id = "Method"
    ) %>%
      spread(key = "Metric", value = "Value") %>%
      dplyr::mutate(
        dplyr::across(
          c(Correlation, R2),
          ~ ifelse(.x == max(.x),
            kableExtra::cell_spec(formatC(.x, digits = 3, format = "g", flag = "#"), bold = TRUE),
            formatC(.x, digits = 3, format = "g", flag = "#")
          )
        ),
        dplyr::across(
          c(MAE, RMSE),
          ~ ifelse(.x == min(.x),
            kableExtra::cell_spec(formatC(.x, digits = 4, format = "g", flag = "#"), bold = TRUE),
            formatC(.x, digits = 4, format = "g", flag = "#")
          )
        )
      )
    tab <- pretty_DT(eval_out,
      digits = 4, sigfig = TRUE, rownames = FALSE,
      na_disp = "--",
      options = list(pageLength = nrow(eval_out), dom = "t")
    )
    # saveRDS(
    #   tab,
    #   file.path(tab_res_dir, paste0(phenotype, "_pred_accuracy.rds"))
    # )

    # prediction pair plots
    plt_df <- map_dfc(preds_data$yhat, ~ c(.x))
    plt <- plot_pairs(
      data = plt_df, columns = 1:ncol(plt_df),
      color = preds_data$gender
    )
    # saveRDS(
    #   plt,
    #   file.path(fig_res_dir, paste0(phenotype, "_pred_pair_plot.rds"))
    # )
  }
} else {
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    # cat(sprintf(phenotype_header, phenotype_name))

    # prediction accuracy table
    cat(sprintf(section_header, "Prediction Accuracies"))
    tab <- readRDS(
      file.path(tab_res_dir, paste0(phenotype, "_pred_accuracy.rds"))
    )
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of prediction performance on the validation set for the %s phenotype (without binarization) across various prediction methods and evaluation metrics.'", phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    # prediction pair plots
    cat(sprintf(section_header, "Prediction Pair Plot"))
    plt <- readRDS(
      file.path(
        fig_res_dir,
        paste0(phenotype, "_pred_pair_plot.rds")
      )
    )
    subchunkify(plt,
      i = chunk_idx, fig_height = plt$ncol * 1.5,
      add_class = c("panel panel-default"),
      other_args = "out.width='100%'",
      caption = sprintf("'Pair plot of the predicted %s responses from various prediction models. Each point represents an individual in the validation set, and the coordinates represent the predicted %s response from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted %s responses are shown along the diagonal.'", phenotype_name, phenotype_name, phenotype_name)
    )
    chunk_idx <- chunk_idx + 1
  }
}
```

# Prediction Step {.tabset .tabset-fade .tabset-vmodern}

<div class="panel panel-default padded-panel">

After choosing to binarize the LVMi phenotype, we next fit various prediction models using the GWAS-filtered SNVs to predict the binarized LVMi phenotype and assess their validation prediction accuracy. 

**Which models were chosen and why?** While there are many methods, each with their unique advantages and disadvantages, for detecting epistasis from data [e.g., see @niel2015survey and references therein], we chose to focus on siRF [@kumbier2018refining] as our primary interaction search engine for various reasons. 

1. Unlike many methods that make linear assumptions or require pre-specification of interaction order (i.e., the number of features involved in an interaction), siRF can identify *higher-order*, *nonlinear* Boolean interactions in a computationally-tractable manner without the need to specify the interaction order of interest. 
2. Another advantage of siRF for detecting interactions is the resemblance between the localized thresholding behavior of its decision trees and the threshold (or on-off switch-like) behavior frequently observed in biomolecular interactions [@nelson2008lehninger]. In other words, the form of the siRF model appears to be well-suited for the biological phenomena under study.
3. Furthermore, the prediction accuracy of the siRF model fit can be easily assessed. siRF also exhibited higher prediction accuracy than other commonly used prediction methods across almost all evaluation metrics examined and binarization thresholds (Figure \@ref(fig:subchunk-29)). As discussed previously, prediction accuracy can serve as an indicator of whether or not the model is an accurate representation of reality. This prediction check is a necessary pre-requisite for interpreting the prediction model and is especially important in low-signal regimes such as this study.
4. siRF and its predecessor iRF have demonstrated great success in detecting reliable interactions in previous real-world studies, such as in epistasis-controlled hair color [@behr2020learning], obesity [@allen2022using], schizophrenia [@de2022machine], and predictive gene expression networks [@cliff2019high; @walker2022evaluating].
5. Finally, due to the weak phenotypic signal and the strong correlations between neighboring SNVs due to linkage disequilibrium (Figure \@ref(fig:subchunk-25)), we believed it was most appropriate to recommend interactions between genetic loci, rather than interactions betweeen individual SNVs. While it is not always clear how to perform this locus-level interaction search when the data is given at the SNV-level, we can build upon the siRF structure to aggregate feature importances across multiple SNVs using our new locus-level importance score, described in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1).

Though we are most interested in interpreting the siRF prediction model to identify candidate epistatic interactions, we also fit several other popular prediction models, including RF, L1-regularized logistic regression (Lasso), L2-regularized logistic regression (ridge), support vector machines (SVM), multilayer perceptron, AutoGluon TabularPredictor, and a polygenic risk score. These prediction models serve as baselines to ensure that siRF indeed fits the data well, relative to these other popular prediction models, and that siRF passes the prediction check in accordance with the PCS framework [@yu2020veridical].

**Summary of prediction results:** For all of the classification algorithms under investigation, we see that the validation prediction accuracy metrics, measured via classification accuracy, area under the receiver operator characteristic curve (AUROC), and area under the precision-recall curve (AUPRC), are all comfortably above the random-guessing baseline of 0.5 across three different binarization thresholds. This suggests that the prediction models are able to ascertain some phenotypic signal from the SNVs under study. That is, the SNVs used in the prediction models have some predictive power in differentiating those with high LVMi versus those with low LVMi. Moreover, siRF consistently yielded the highest predictive power across the various binarization thresholds and prediction metrics (with one exception being the classification accuracy for the 15\% binarization threshold, where ridge is slightly better than siRF). From this, we conclude that siRF passed the prediction check and will proceed to interpret the siRF fit to recommend and prioritize interactions between genetic loci for downstream analyses and experiments.

Beyond the validation prediction accuracies, we also provide the confusion matrices, ROC and precision-recall curves, and pair plots comparing the predicted responses across methods for additional post-hoc exploration.

</div>

```{r pred-results-binary, results = "asis"}
# phenotype_header <- "\n\n## %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square} \n\n"
summary_header <- "\n\n## %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}\n\n"
thr_header <- "\n\n## Binarization Threshold = %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}\n\n"
section_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
subsection_header <- "\n\n#### %s {.unnumbered} \n\n"

methods <- c("siRF" = "irf", "RF" = "rf", "Lasso" = "lasso", "Ridge" = "ridge", "SVM" = "svm", "MLP" = "mlp", "Autogluon (medium)" = "autogluon_medium", "Autogluon (good)" = "autogluon_good", "PRS" = "prs")

AUC_HEIGHT <- 5

if (params$eval) {
  for (phenotype in phenotypes) {
    eval_ls <- list()
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    for (thr in c(0.15, 0.2, 0.25)) {
      fpath <- file.path(res_dir, paste0(
        phenotype, "_binary_thr", thr,
        "_gwas_filtered_nsnps1000"
      ))
      # read in prediction results
      load(file.path(fpath, "validation_pheno.Rdata")) # > pheno
      preds_data <- list(
        yhat = c(
          readRDS(file.path(fpath, "ypred.rds")),
          fread(file.path(fpath, "ypred.csv")),
          prs = readRDS(file.path(res_dir, "prs", "prs_pred.rds"))[as.character(thr)] %>%
            unname()
        ),
        y = pheno
      )

      preds_data$yhat <- preds_data$yhat[methods]
      names(preds_data$yhat) <- names(methods)

      # prediction accuracy table
      metrics <- c("Class", "AUC", "PR")
      eval_out <- map_dfr(preds_data$yhat,
        ~ bind_rows(
          evalPreds(
            y = preds_data$y, yhat = c(.x),
            metric = c("AUC", "PR")
          ),
          evalPreds(
            y = preds_data$y,
            yhat = round(c(.x)),
            metric = c("BalancedClass", "Class")
          )
        ),
        .id = "Method"
      ) %>%
        spread(key = "Metric", value = "Value")
      eval_ls[[as.character(thr)]] <- eval_out

      # confusion matrix
      conf_all <- map(
        preds_data$yhat,
        ~ evalConfusion(y = preds_data$y, yhat = .x)
      ) %>%
        purrr::reduce(., cbind)
      conf_names <- colnames(conf_all) %>%
        stringr::str_replace("0", "Low") %>%
        stringr::str_replace("1", "High")
      colnames(conf_all) <- paste(conf_names, 1:ncol(conf_all))
      rownames(conf_all) <- rownames(conf_all) %>%
        stringr::str_replace("0", "Low") %>%
        stringr::str_replace("1", "High")
      conf_header <- c(1, rep(2, length(preds_data$yhat))) %>%
        setNames(c(" ", names(preds_data$yhat)))
      tab <- pretty_DT(conf_all,
        rownames = TRUE,
        grouped_header = conf_header,
        grouped_subheader = c(" ", conf_names),
        options = list(dom = "t", ordering = FALSE)
      )
      # saveRDS(
      #   tab,
      #   file.path(
      #     tab_res_dir,
      #     paste0(phenotype, "_thr", thr, "_confusion_matrix.rds")
      #   )
      # )

      # auc/pr plots
      roc_ls <- map(
        preds_data$yhat,
        ~ evalAUC(y = preds_data$y, yhat = .x, metric = "roc")
      )
      pr_ls <- map(
        preds_data$yhat,
        ~ evalAUC(y = preds_data$y, yhat = .x, metric = "pr")
      )

      # drop models with constant predictions
      roc_ls <- roc_ls[!sapply(roc_ls, is.null)]
      pr_ls <- pr_ls[!sapply(pr_ls, is.null)]

      # get curves
      roc_df <- map_dfr(roc_ls,
        ~ data.frame(.x$curve) %>%
          mutate(AUC = round(.x$auc, 3)),
        .id = "Method"
      ) %>%
        mutate(Method = as.factor(paste(Method, " (", AUC, ")", sep = "")))
      pr_df <- map_dfr(pr_ls,
        ~ data.frame(.x$curve) %>%
          mutate(AUC = round(.x$auc.integral, 3)),
        .id = "Method"
      ) %>%
        mutate(Method = as.factor(paste(Method, " (", AUC, ")", sep = "")))

      # make plots
      roc_plt <- ggplot(roc_df) +
        aes(x = X1, y = X2, color = Method) +
        geom_line() +
        labs(x = "FPR", y = "TPR") +
        vthemes::theme_vmodern() +
        vthemes::scale_color_vmodern(discrete = TRUE)
      # saveRDS(
      #   roc_plt,
      #   file.path(
      #     fig_res_dir,
      #     paste0(phenotype, "_thr", thr, "_roc.rds")
      #   )
      # )

      pr_plt <- ggplot(pr_df) +
        aes(x = X1, y = X2, color = Method) +
        geom_line() +
        labs(x = "Recall", y = "Precision") +
        vthemes::theme_vmodern() +
        vthemes::scale_color_vmodern(discrete = TRUE)
      # saveRDS(
      #   pr_plt,
      #   file.path(
      #     fig_res_dir,
      #     paste0(phenotype, "_thr", thr, "_pr.rds")
      #   )
      # )

      # prediction pair plots
      plt_df <- map_dfc(preds_data$yhat, ~ c(.x))
      plt <- plot_pairs(data = plt_df, columns = 1:ncol(plt_df), point_size = 0.1)
      # saveRDS(
      #   plt,
      #   file.path(
      #     fig_res_dir,
      #     paste0(
      #       phenotype, "_thr", thr,
      #       "_pred_pair_plot.rds"
      #     )
      #   )
      # )
    }

    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    tab <- dplyr::bind_rows(eval_ls, .id = "Threshold") %>%
      dplyr::rename("Accuracy" = "Class", "AUROC" = "AUC", "AUPRC" = "PR") %>%
      dplyr::mutate(Threshold = paste0("Binarization Threshold = ", Threshold)) %>%
      dplyr::select(Threshold, Method, Accuracy, AUROC, AUPRC) %>%
      dplyr::mutate(
        Method = factor(Method, levels = names(methods))
      ) %>%
      dplyr::arrange(Method) %>%
      dplyr::group_by(Threshold) %>%
      dplyr::mutate(dplyr::across(
        c(Accuracy, AUROC, AUPRC),
        ~ ifelse(.x == max(.x),
          kableExtra::cell_spec(
            formatC(.x,
              digits = 3,
              format = "g", flag = "#"
            ),
            bold = TRUE
          ),
          formatC(.x,
            digits = 3,
            format = "g", flag = "#"
          )
        )
      )) %>%
      tidyr::pivot_longer(cols = c(Accuracy, AUROC, AUPRC), names_to = "Metric", values_to = "Value") %>%
      tidyr::pivot_wider(id_cols = Method, names_from = c(Threshold, Metric), values_from = Value) %>%
      vthemes::pretty_DT(
        digits = 3, sigfig = FALSE, rownames = FALSE, na_disp = "--",
        grouped_header = c(
          " " = 1,
          "Binarization Threshold = 0.15" = 3,
          "Binarization Threshold = 0.20" = 3,
          "Binarization Threshold = 0.25" = 3
        ),
        grouped_subheader = c("Method", rep(c("Accuracy", "AUROC", "AUPRC"), times = 3)),
        options = list(dom = "t")
      )
    # saveRDS(
    #   tab,
    #   file.path(tab_res_dir, paste0(phenotype, "_binary_pred_accuracy.rds"))
    # )
  }
} else {
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    # cat(sprintf(phenotype_header, phenotype_name))

    # summary table of prediction accuracies
    cat(sprintf(summary_header, "Prediction Accuracy Summary"))
    tab <- readRDS(file.path(tab_res_dir, paste0(phenotype, "_binary_pred_accuracy.rds")))
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of prediction performance on the validation set for the binarized %s phenotype across various prediction methods, binarization thresholds, and evaluation metrics. Abbreviations: siRF, signed iterative random forest; RF, random forest; Lasso, L1-regularized logistic regression; Ridge, L2-regularized logistic regression; SVM, support vector machine; MLP, multilayer perceptron; Autogluon (medium/good), Autogluon TabularPredictor fitted with the medium/good quality preset; PGS, polygenic risk score.'", phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    for (thr in c(0.15, 0.2, 0.25)) {
      cat(sprintf(thr_header, thr))

      # confusion matrix
      cat(sprintf(section_header, "Confusion Tables"))
      tab <- readRDS(file.path(
        tab_res_dir,
        paste0(
          phenotype, "_thr", thr,
          "_confusion_matrix.rds"
        )
      ))
      subchunkify(tab,
        i = chunk_idx,
        caption = sprintf("'Confusion matrix from various prediction models for predicting the binarized %s phenotype (binarization threshold = %s).'", phenotype_name, thr)
      )
      chunk_idx <- chunk_idx + 1

      # auc/pr plots
      cat(sprintf(section_header, "ROC/PR Curves"))
      plt <- readRDS(file.path(
        fig_res_dir,
        paste0(phenotype, "_thr", thr, "_roc.rds")
      ))
      subchunkify(ggplotly(plt),
        i = chunk_idx, fig_height = AUC_HEIGHT,
        add_class = c("panel panel-default padded-panel"),
        other_args = "out.width='100%'",
        caption = sprintf("'ROC curve for predicting the binarized %s (threshold = %s) across various prediction methods. TPR = true positive rate; FPR = false positive rate.'", phenotype_name, thr)
      )
      chunk_idx <- chunk_idx + 1

      plt <- readRDS(file.path(
        fig_res_dir,
        paste0(phenotype, "_thr", thr, "_pr.rds")
      ))
      subchunkify(ggplotly(plt),
        i = chunk_idx, fig_height = AUC_HEIGHT,
        add_class = c("panel panel-default padded-panel"),
        other_args = "out.width='100%'",
        caption = sprintf("'Precision-recall curve for predicting the binarized %s (threshold = %s) across various prediction methods.'", phenotype_name, thr)
      )
      chunk_idx <- chunk_idx + 1

      # prediction pair plots
      cat(sprintf(section_header, "Prediction Pair Plot"))
      plt <- readRDS(file.path(
        fig_res_dir,
        paste0(
          phenotype, "_thr", thr,
          "_pred_pair_plot.rds"
        )
      ))
      subchunkify(plt,
        i = chunk_idx, fig_height = plt$ncol * 1.5,
        add_class = c("panel panel-default"),
        other_args = "out.width='100%'",
        caption = sprintf("'Pair plots of the predicted binarized %s responses (threshold = %s) across various prediction methods. Each point represents an individual in the validation set, and the coordinates represent the predicted probability of having high %s from two different prediction methods, which are specified by the x and y-axes labels in each subplot. Density distributions of the predicted probabilities are shown along the diagonal.'", phenotype_name, thr, phenotype_name)
      )
      chunk_idx <- chunk_idx + 1
    }
  }
}
```

# Prioritization Step {.tabset .tabset-fade .tabset-vmodern}

## Overview {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
Having passed the prediction check, we finally proceed to interpret the siRF fit in the last step of lo-siRF. In this step, we developed a novel stability-driven feature importance score to prioritize both genetic loci and interactions between genetic loci from siRF. We refer to [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1) for the detailed description of this new importance score and here provide brief insights into the motivation and intuition behind this method.

**Main idea:** At a high-level, this new importance score (1) aggregates weak, unstable SNV-level importances from siRF into stronger, more stable locus-level importances and (2) can be leveraged to evaluate both the marginal importance of a genetic locus in siRF as well as the importance of higher-order interactions between genetic loci in siRF. The need for this locus-level feature importance score stems from two main reasons: the high correlation between SNVs and the weak phenotypic signal. Though the prediction models (including siRF) are all trained using SNV data, it is often very challenging to interpret the importance of an individual SNV from these models due to the high correlation between SNVs [@tolocsi2011classification; @hooker2021unrestricted], which we observed in our previous exploratory data analysis (see `Choice of Phenotypic Data` tab). To further complicate matters, the weak phenotypic signal increases the instability of existing feature importance methods in that we see the feature rankings change drastically when using different bootstrap training samples. Given these challenges with SNV-level feature importances, it is then no surprise that siRF, out-of-the-box, does not find any stable interactions between SNVs (i.e., the siRF stability score is around 0 for all interactions between SNVs). To address these challenges, our new feature importance score leverages the correlation structure between SNVs, groups SNVs into genetic loci, and aggregates weak, unstable SNV-level importances from siRF into stronger, more stable locus-level importances.

**Motivation behind new feature importance score:** To better understand the motivation and construction of our new locus-level feature importance score, it can be helpful to first review the original importance score used in siRF [@kumbier2018refining]. Briefly, in the original version of siRF, the importance of a signed interaction is based upon *how stable or frequently it occurs* in decisions paths across the random forest -- that is, based upon the number of decision paths for which every signed feature in the signed interaction was used in at least one decision split (see @kumbier2018refining for details). Following a similar logic, one natural path forward to calculate a locus-level importance from a forest that was fit on SNV data is as follows:

1. Take the fitted forest, where each decision split used some SNV to make the split.
2. Map each SNV used in a decision split to its corresponding genetic loci (e.g., using annotation software like ANNOVAR [@wang2010annovar]).
3. Compute the importance of a signed locus-level interaction based upon the frequency of occurrence of the interaction as before, but using the mapped genetic loci in lieu of the SNVs. That is, compute the number of decision paths for which every signed locus in the signed locus-level interaction was used in at least one decision split. 

However, this *frequency of occurrence* measure is heavily tied to the number of SNVs that belong to each genetic locus. A genetic locus that consists of more SNVs will naturally have a higher frequency of occurrence than a genetic locus with only a few SNVs even though both genetic loci may be equally important (or unimportant). In other words, the original siRF importance score, measuring the frequency of occurrence, is not comparable across genetic loci. This bias is problematic given that we have previously seen some genetic loci may consist of only 1 SNV while other genetic loci have >50 SNVs (see `GWAS-filtered SNVs` tab in the Dimension Reduction Step). 

**Intuition behind new feature importance score:** To alleviate this bias, we leverage the idea of a *local* feature importance score, where we compute a *per-individual* score that measures the importance of a locus/interaction for making the prediction of *each* individual. This is in contrast to a *global* feature importance score, which is a measure of importance for the entire group of individuals under study. While a global feature importance score gives rise to a *single* score for the entire population under study, we can compute a local feature importance score for *each* individual in the population. By computing a local feature importance score for each individual (i.e., this is the local stability importance (LSI) score from [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)), we can compare the importance of the locus/interaction across *individuals* and avoid the across-*loci* comparison that was problematic in the aforementioned discussion. This comparison is done via a permutation test in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1). Moreover, since we are comparing the local feature importances between individuals *within* or *per* loci/interaction in this permutation test, the local feature importances are all measured on the same scale and comparable, thereby mitigating the bias towards genetic loci with more SNVs.

**How to aggregate SNVs:** We mapped each SNV to its corresponding genetic locus using ANNOVAR [@wang2010annovar] according to the hg19 refSeq Gene annotations. Given our ultimate goal of recommending/prioritizing genetic loci and interactions between loci for downstream analyses and experimental validation, we believe this choice provides an appropriate starting point to obtain actionable recommendations. However, we note that there are many alternative ways of aggregating SNVs. For example, it is possible to aggregate SNVs according to LD/haplotype blocks or to include additional gene-based or functional annotations in the aggregation scheme.

**Organization:** Under the `SNV-level Importances` tab, we provide the SNV-level feature importances from the fitted prediction algorithms (described in the `Prediction Step` section) for completeness. These SNV-level feature importances are very unstable across methods and across bootstrap perturbations for the reasons discussed above. We then summarize the results from the lo-siRF pipeline including the rankings of genetic loci and interaction between genetic loci using our new locus-level importance score under the `Summary of lo-siRF Results` tab. A final technical note on the advantages of incorporating signed information from siRF in this feature importance computation is provided under the `Advantages of Signed Information` tab.

</div>

## SNV-level Feature Importances {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
Having observed that the aforementioned classification algorithms do have some predictive power to predict those with high versus low LVMi, a natural next step is to investigate which SNVs were used in these prediction models. Below, we highlight the top 100 SNVs, ranked by importance in each of the prediction models. For completeness, we provide the SNV importances for the unbinarized LVMi (regression) problem as well as the binarized LVMi (classification) problem across the three different binarization thresholds.

**How is importance defined?**

- For ridge and Lasso, the importance of an SNV is defined as the magnitude of its coefficient in the prediction model.
- For RF, the importance of an SNV is determined by the mean decrease in impurity (MDI) score.
- For siRF, the importance of an SNV is determined by the MDI from the forest in the final iteration.
- Due to the computational complexity, we did not compute the SNV importances from the SVM fit as there is no built-in model-based feature importance method (in R), and any permutation-based approach would be computationally expensive.

**A word of caution:** When interpreting these SNV importances, it is crucial to keep in mind that there are very large, strongly correlated blocks of SNVs within this data, which often renders the feature rankings unmeaningful and unstable [@tolocsi2011classification; @hooker2021unrestricted]. In particular, it is well-known that the MDI variable importance metric used in RF/iRF/siRF is biased against correlated features [@hooker2021unrestricted]. In other words, SNVs within large LD blocks tend to have artificially low MDI values. Given these idiosyncrasies, it is unsurprising that SNVs from TTN and CCDC141 -- two genetic loci with many SNVs under LD -- do not have high importances according to MDI. We provide these SNV importance tables only for completeness and to partially motivate the need to aggregate SNV-level importances into locus-level importances in lo-siRF.

</div>


```{r vimp, results = "asis"}
# phenotype_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
thr_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle}"
method_header <- "\n\n#### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
subsection_header <- "\n\n##### %s {.unnumbered}\n\n"

methods <- c("Lasso" = "lasso", "Ridge" = "ridge", "RF" = "rf", "siRF" = "irf")

if (params$eval) {
  snps_df <- fread(file.path(res_dir, "annovar", "snp2gene_df.csv"))
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    for (thr in c(NA, 0.15, 0.2, 0.25)) {
      if (is.na(thr)) {
        fpath <- file.path(res_dir, paste0(phenotype, "_gwas_filtered_nsnps1000"))
      } else {
        fpath <- file.path(res_dir, paste0(
          phenotype, "_binary_thr", thr,
          "_gwas_filtered_nsnps1000"
        ))
      }
      for (method in methods) {
        vimp_out <- evalVimp(res_dir = fpath, method = method, snps_df = snps_df) %>%
          dplyr::filter(!stringr::str_detect(Gene, "PC[0-9]*$")) %>%
          dplyr::slice(1:100) %>%
          dplyr::mutate(
            Alleles = purrr::map2(Ref, Alt, ~ sort(c(.x, .y))),
            Allele1 = purrr::map_chr(Alleles, ~ .x[[1]]),
            Allele2 = purrr::map_chr(Alleles, ~ .x[[2]]),
            uniqID = paste(Chr, Pos, Allele1, Allele2, sep = ":")
          ) %>%
          dplyr::select(
            Chr, rsID, Pos, uniqID,
            Locus = Gene, Function = `Gene Function`,
            `Exonic Function`, Importance
          )
        if (!is.null(vimp_out)) {
          tab <- pretty_DT(vimp_out, digits = 3, sigfig = T, na_disp = "--")
          # saveRDS(
          #   tab,
          #   file.path(
          #     tab_res_dir,
          #     paste0(
          #       phenotype, "_thr", thr, "_", method,
          #       "_vimp.rds"
          #     )
          #   )
          # )
        }
      }
    }
  }
} else {
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    # cat(sprintf(phenotype_header, phenotype_name))
    for (thr in c(NA, 0.15, 0.2, 0.25)) {
      if (is.na(thr)) {
        cat(sprintf(thr_header, "Unbinarized (Regression)"))
        fpath <- file.path(res_dir, paste0(phenotype, "_gwas_filtered_nsnps1000"))
        phenotype_str <- sprintf("%s", phenotype_name)
      } else {
        cat(sprintf(thr_header, paste0("Binarization Threshold = ", thr)))
        fpath <- file.path(res_dir, paste0(
          phenotype, "_binary_thr", thr,
          "_gwas_filtered_nsnps1000"
        ))
        phenotype_str <- sprintf("the binarized %s (threshold = %s)", phenotype_name, thr)
      }
      for (i in 1:length(methods)) {
        method <- methods[[i]]
        method_name <- names(methods)[i]
        cat(sprintf(method_header, method_name))
        tmp_file <- file.path(
          tab_res_dir,
          paste0(
            phenotype, "_thr", thr, "_", method,
            "_vimp.rds"
          )
        )
        if (file.exists(tmp_file)) {
          tab <- readRDS(tmp_file)
          subchunkify(tab,
            i = chunk_idx,
            other_args = "out.width='100%'",
            caption = sprintf("'%s - Top 100 SNVs, ranked by importance, for predicting %s.'", method_name, phenotype_str)
          )
          chunk_idx <- chunk_idx + 1
        }
      }
    }
  }
}
```

## Summary of lo-siRF Results {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
Finally, we summarize the prioritization results from the lo-siRF pipeline using our new importance score to rank genetic loci and interactions between genetic loci. In this new importance score, we (1) computed a local (i.e., on a per-individual basis) stability importance score that aggregates weak SNV-level importances into stronger locus-level importances using the siRF fit and (2) conducted a permutation test to assess whether the local stability importance scores for a given locus/interaction are different between individuals with high and low LVMi (conditioned on the rest of the fitted forest). The larger the difference in local stability importance scores between those with high versus low LVMi, the greater the importance of the locus/interaction for predicting LVMi. For details on the new importance score, we refer to [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1).

Below, we recap the siRF prediction accuracy for predicting the binarized LVMi across the three binarization thresholds (15\%, 20\%, 25\%) (Figure \@ref(fig:subchunk-58)). Then, in Figure \@ref(fig:subchunk-59), we summarize the prioritization rankings of genetic loci and interaction between loci according to the new importance score. We provide the nominal permutation p-values along with the permutation test statistic, that is, the difference in local feature stability scores between the high and low LVMi groups, with the standard deviation of this permutation distribution in parentheses. 

Note that the permutation test was not performed for all loci/interactions (details in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)). Given our primary aim of generating reliable hypotheses for follow-up experimental validation, we chose to only test loci/interactions that were *predictive* and *stable* in the siRF fit, following the PCS framework for veridical data science [@yu2020veridical]. Specifically, for each binarization threshold, we only tested:

- the top 25 loci with the highest average local stability importance scores, or in other words, those that occurred most stably/frequently (and thus likely predictive) in the siRF fit, and
- the interactions between loci that were identified by siRF and achieved a stability score > 0.5, stability score for mean increase in precision > 0, and stability score for feature selection dependence > 0. We note that in addition to the explicit stability checks here, the requirement on the mean increase in precision serves as a predictability check, and the requirement on the feature selection dependence serves to differentiate marginal additive effects from non-additive interaction effects (see @kumbier2018refining for details).

By implementing this additional predictability and stability check, we are requiring that the prioritized loci/interactions meet a higher standard, with the hope that this ultimately generates more reliable (albeit more conservative) experimental recommendations.

In addition to the provided summary of results,

- Under the `siRF Interaction Table` tab, we provide the full siRF locus-level interaction table output for each binarization threshold (Figures \@ref(fig:subchunk-60), \@ref(fig:subchunk-62), and \@ref(fig:subchunk-64)). This table includes several metrics for evaluating the stability and strength of the interaction. For details on these metrics, we refer to @kumbier2018refining.
  - For comparison, we also provide the analogous siRF output for interactions at the SNV level (Figures \@ref(fig:subchunk-61), \@ref(fig:subchunk-63), and \@ref(fig:subchunk-65)). Here, we see that if we do not perform the SNV-to-locus aggregation, the stability scores for the SNV-level interactions are very close to 0. This means that the siRF-identified SNV-level interactions rarely appear when siRF is refitted using different bootstrapped training samples. As discussed previously, this instability is partially due to the correlation among SNVs and the weak phenotypic signal. This high instability also motivates the need for an alternative approach via our new importance score.
- Under the `Local Stability Importance Plots - Interactions` and `Local Stability Importance Plots - Loci`, we provide visualizations of the density distribution of local stability importance scores between the low and high LVMi groups as well as the frequency distribution of SNVs used in the siRF fit (i.e., which SNVs occurred most frequently in the decision paths across the siRF fit).
  - In the density distribution of local stability importance scores, observable differences between the low and high LVMi groups are evidence that the genetic locus or interaction between loci is important for differentiating between individuals with high versus low LVMi. The larger the difference, the greater the importance of the locus or interaction between loci.
  - In the SNV frequency distributions, we show the frequency of occurrence for each SNV or SNV pair in the siRF fit. This is the number of decision paths for which the SNV or SNV pair appeared in the siRF fit. SNVs or SNV pairs with the highest number of occurrences in the RF decision paths (y-axis) contributed the most to the overall locus/interaction's importance in the siRF fit.

</div>


```{r irf-results, results = "asis"}
# phenotype_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
section_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
thr_header <- "\n\n#### %s {.unnumbered .tabset .tabset-pills .tabset-fade} \n\n"
# subsection_header <- "\n\n##### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
# subsubsection_header <- "\n\n####### %s {.unnumbered} \n\n"

DIST_HEIGHT <- 8
HEATMAP_HEIGHT <- 8
EPI_HEIGHT <- 8
ITER <- 3
LEFT_MARGIN <- 5

if (params$eval) {
  for (phenotype in phenotypes) {
    print(phenotype)
    eval_ls <- list()
    int_results_ls <- list()
    gene_results_ls <- list()
    for (thr in c(0.15, 0.2, 0.25)) {
      print(thr)
      fpath <- file.path(res_dir, paste0(
        phenotype, "_binary_thr", thr,
        "_gwas_filtered_nsnps1000"
      ))
      # > out, yhat_tr2, yhat_va, irf_gene_out, snps.df, irf_snv_int_out
      load(file.path(fpath, "irf_interaction_fit.Rdata"))
      y_test <- ifelse(out$pheno_test == 0, "Low LVMi", "High LVMi")

      # # prediction results
      # eval_out <- bind_rows(
      #   evalPreds(y = out$pheno_test, yhat = yhat_va, metric = c("AUC", "PR")),
      #   evalPreds(y = out$pheno_test, yhat = round(yhat_va), metric = c("Class"))
      # ) %>%
      #   spread(key = "Metric", value = "Value")
      # eval_ls[[as.character(thr)]] <- eval_out

      # local interaction stability results
      load(file.path(fpath, "local_interaction_stability_results.Rdata"))
      int_out <- list(
        stab_tr2_ls = int_stab_tr2_ls,
        stab_va_ls = int_stab_va_ls,
        test_ints_ls = test_ints_ls,
        perm_out_ls = int_perm_out_ls
      )
      int_results_ls[[as.character(thr)]] <- int_perm_out_ls

      int_all_out <- list(`Using all splits` = int_out)

      pval_int <- map_dfr(int_out$perm_out_ls[[ITER]],
        function(x) {
          data.frame(
            pval = x$pval,
            check.names = FALSE
          )
        },
        .id = "int"
      ) %>%
        dplyr::mutate(
          pval_str = dplyr::case_when(
            pval < 1e-4 ~ "< 10<sup>-4</sup>",
            pval < 1e-3 ~ "< 10<sup>-3</sup>",
            TRUE ~ sprintf("%s", formatC(pval, digits = 3, format = "f"))
          )
        )

      style_names_basic <- function(x, italic = TRUE) {
        x <- x %>%
          stringr::str_replace_all("-_", "<sup>&#8211;</sup>&#8211;") %>%
          stringr::str_replace_all("\\+_", "<sup>+</sup>&#8211;") %>%
          stringr::str_replace_all("-$", "<sup>&#8211;</sup>") %>%
          stringr::str_replace_all("\\+$", "<sup>+</sup>") %>%
          kableExtra::cell_spec(escape = FALSE, italic = italic)
        return(x)
      }

      # iRF results
      tab_out <- irf_gene_out$interaction[[ITER]] %>%
        left_join(y = pval_int, by = "int") %>%
        dplyr::arrange(pval) %>%
        dplyr::select(-pval) %>%
        dplyr::rename(
          "Interaction" = "int",
          "Prevalence" = "prevalence",
          "Precision" = "precision",
          "Class Difference in Prevalence" = "cpe",
          "Stability of Class Difference in Prevalence" = "sta.cpe",
          "Feature Selection Dependence" = "fsd",
          "Stability of Feature Selection Dependence" = "sta.fsd",
          "Increase in Precision" = "mip",
          "Stability of Increase in Precision" = "sta.mip",
          "Stability" = "stability",
          "lo-siRF Mean Permutation p-value" = "pval_str"
        ) %>%
        dplyr::mutate(Interaction = style_names_basic(Interaction))
      tab <- pretty_DT(tab_out,
        digits = 3, sigfig = TRUE, rownames = FALSE,
        na_disp = "--"
      )
      # saveRDS(
      #   tab,
      #   file.path(
      #     tab_res_dir,
      #     paste0(
      #       phenotype, "_thr", thr,
      #       "_irf_interaction_results.rds"
      #     )
      #   )
      # )

      # iRF results - SNV level
      tab_out <- irf_snv_int_out %>%
        dplyr::mutate(int = stringr::str_remove_all(int, "[0-9]*ch")) %>%
        dplyr::arrange(-stability) %>%
        dplyr::rename(
          "Interaction" = "int",
          "Prevalence" = "prevalence",
          "Precision" = "precision",
          "Class Difference in Prevalence" = "cpe",
          "Stability of Class Difference in Prevalence" = "sta.cpe",
          "Feature Selection Dependence" = "fsd",
          "Stability of Feature Selection Dependence" = "sta.fsd",
          "Increase in Precision" = "mip",
          "Stability of Increase in Precision" = "sta.mip",
          "Stability" = "stability"
        ) %>%
        dplyr::mutate(
          Interaction = style_names_basic(Interaction, italic = FALSE)
        )
      tab <- pretty_DT(tab_out, digits = 3, sigfig = TRUE, rownames = FALSE)
      # saveRDS(
      #   tab,
      #   file.path(
      #     tab_res_dir,
      #     paste0(
      #       phenotype, "_thr", thr,
      #       "_irf_snv_interaction_results.rds"
      #     )
      #   )
      # )

      # for (type in names(int_all_out)) {
      #   print(type)
      #   # heatmap
      #   plt <- plot_hclust_heatmap(int_all_out[[type]]$stab_va_ls[[ITER]],
      #     y_groups = y_test,
      #     clust_y_wi_group = FALSE
      #   ) +
      #     labs(
      #       x = "Interaction", y = "Patients",
      #       fill = "Local\nStability\nScore"
      #     ) +
      #     vthemes::theme_vmodern(size_preset = "medium") +
      #     theme(
      #       axis.text.y = element_blank(),
      #       axis.ticks.y = element_blank(),
      #       axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
      #     )
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_int_local_stability_heatmap_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # 
      #   # distribution plots
      #   int_pvals <- map_dfr(int_all_out[[type]]$perm_out_ls[[ITER]],
      #     ~ data.frame(pval = .x$pval),
      #     .id = "int"
      #   ) %>%
      #     arrange(pval) %>%
      #     dplyr::mutate(
      #       pval_str = dplyr::case_when(
      #         pval < 1e-4 ~ "p < 10^-4",
      #         pval < 1e-3 ~ "p < 10^-3",
      #         # pval < 1e-4 ~ "p < 10<sup>-4</sup>",
      #         # pval < 1e-3 ~ "p < 10<sup>-3</sup>",
      #         TRUE ~ sprintf("p~`=`~%s", formatC(pval, digits = 3, format = "f"))
      #       )
      #     ) %>%
      #     mutate(int_pval = sprintf("%s (%s)", int, pval_str))
      #   plt_df <- int_all_out[[type]]$stab_va_ls[[ITER]] %>%
      #     bind_cols(y = y_test) %>%
      #     gather(key = "int", value = "Stability", -y) %>%
      #     left_join(y = int_pvals, by = "int") %>%
      #     # mutate(int = paste0(int, " (p = ", formatC(pval, 3, format = "g"), ")")) %>%
      #     mutate(int = factor(int, levels = int_pvals$int))
      #   plt <- plot_density(data = plt_df, x_str = "Stability", fill_str = "y") +
      #     facet_wrap(~int, scales = "free", ncol = 3) +
      #     ggplot2::labs(x = "Local Stability Importance Score", fill = "") +
      #     ggplot2::geom_text(
      #       ggplot2::aes(x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, label = pval_str),
      #       data = plt_df %>% dplyr::distinct(int, .keep_all = TRUE),
      #       parse = TRUE
      #     ) +
      #     ggplot2::theme(axis.title = ggplot2::element_text(size = 14))
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_int_local_stability_dist_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # 
      #   # individual snp plots
      #   plt <- plotRFSnpIntDist(
      #     irf_gene_out$rf.list[[ITER]], snps.df, int_pvals$int,
      #     prop = TRUE
      #   ) +
      #     ggplot2::labs(x = "SNV-SNV Pair")
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_int_local_stability_snp_dist_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # }

      # local feature stability results
      load(file.path(fpath, "local_feature_stability_results.Rdata"))
      stab_out <- list(
        stab_tr2_ls = stab_tr2_ls,
        stab_va_ls = stab_va_ls,
        test_genes_ls = test_genes_ls,
        perm_out_ls = perm_out_ls
      )
      gene_results_ls[[as.character(thr)]] <- perm_out_ls

      stab_all_out <- list(`Using all splits` = stab_out)

      # for (type in names(int_all_out)) {
      #   print(type)
      #   stab_df <- stab_all_out[[type]]$stab_va_ls[[ITER]] %>%
      #     select(all_of(stab_all_out[[type]]$test_genes_ls[[ITER]]))
      # 
      #   # heatmap
      #   plt <- plot_hclust_heatmap(stab_df,
      #     y_groups = y_test,
      #     clust_y_wi_group = FALSE
      #   ) +
      #     labs(
      #       x = "Feature", y = "Patients",
      #       fill = "Local\nStability\nScore"
      #     ) +
      #     vthemes::theme_vmodern(size_preset = "medium") +
      #     theme(
      #       axis.text.y = element_blank(),
      #       axis.ticks.y = element_blank(),
      #       axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
      #     )
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_local_stability_heatmap_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # 
      #   # distribution plots
      #   ft_pvals <- map_dfr(stab_all_out[[type]]$perm_out_ls[[ITER]],
      #     ~ data.frame(pval = .x$pval),
      #     .id = "feature"
      #   ) %>%
      #     arrange(pval) %>%
      #     dplyr::mutate(
      #       pval_str = dplyr::case_when(
      #         pval < 1e-4 ~ "p < 10^-4",
      #         pval < 1e-3 ~ "p < 10^-3",
      #         # pval < 1e-4 ~ "p < 10<sup>-4</sup>",
      #         # pval < 1e-3 ~ "p < 10<sup>-3</sup>",
      #         TRUE ~ sprintf("p~`=`~%s", formatC(pval, digits = 3, format = "f"))
      #       )
      #     ) %>%
      #     mutate(ft_pval = paste0(feature, " (p = ", formatC(pval, 3, format = "g"), ")"))
      #   plt_df <- stab_df %>%
      #     bind_cols(y = y_test) %>%
      #     gather(key = "feature", value = "Stability", -y) %>%
      #     left_join(y = ft_pvals, by = "feature") %>%
      #     # mutate(feature = paste0(feature, " (p = ", formatC(pval, 3, format = "g"), ")")) %>%
      #     mutate(feature = factor(feature, levels = ft_pvals$feature))
      #   plt <- plot_density(data = plt_df, x_str = "Stability", fill_str = "y") +
      #     facet_wrap(~feature, scales = "free", ncol = 4) +
      #     ggplot2::labs(x = "Local Stability Importance Score", fill = "") +
      #     ggplot2::geom_text(
      #       ggplot2::aes(x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, label = pval_str),
      #       data = plt_df %>% dplyr::distinct(feature, .keep_all = TRUE),
      #       parse = TRUE
      #     ) +
      #     ggplot2::theme(axis.title = ggplot2::element_text(size = 14))
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_local_stability_dist_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # 
      #   # individual snp plots
      #   plt <- plotRFSnpDist(
      #     irf_gene_out$rf.list[[ITER]], snps.df, ft_pvals$feature,
      #     prop = TRUE
      #   ) +
      #     ggplot2::labs(x = "SNV (Ordered by Position)")
      #   # saveRDS(
      #   #   plt,
      #   #   file.path(
      #   #     fig_res_dir,
      #   #     paste0(
      #   #       phenotype, "_thr", thr,
      #   #       "_local_stability_snp_dist_", type, ".rds"
      #   #     )
      #   #   )
      #   # )
      # }
    }

    tab <- dplyr::bind_rows(eval_ls, .id = "Binarization Threshold") %>%
      dplyr::rename("Accuracy" = "Class", "AUROC" = "AUC", "AUPRC" = "PR") %>%
      dplyr::select(`Binarization Threshold`, Accuracy, AUROC, AUPRC) %>%
      pretty_DT(
        digits = 3, sigfig = FALSE, rownames = FALSE,
        options = list(dom = "t", ordering = FALSE)
      )
    # saveRDS(
    #   tab,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_prediction_results.rds")
    #   )
    # )
    
    gene_results_df <- purrr::map_dfr(
      gene_results_ls,
      function(gene_results) {
        purrr::map_dfr(
          gene_results[[ITER]],
          ~ data.frame(
            `SD` = sd(.x$perm_dist),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )

    # get summary results table (without styling for DT)
    int_results_df <- purrr::map_dfr(
      int_results_ls,
      function(int_results) {
        purrr::map_dfr(
          int_results[[ITER]],
          ~ data.frame(
            `SD` = sd(.x$perm_dist),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )
    feature_order <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::group_by(`Locus / Interaction`) %>%
      dplyr::summarise(
        n = dplyr::n(),
        mean = mean(p)
      ) %>%
      dplyr::arrange(-n, mean) %>%
      dplyr::select(`Locus / Interaction`)
    results_df <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      tidyr::pivot_wider(
        id_cols = `Locus / Interaction`,
        names_from = `Binarization Threshold`,
        values_from = p,
        names_vary = "slowest"
      ) %>%
      dplyr::left_join(x = feature_order, y = ., by = "Locus / Interaction")
    # saveRDS(
    #   results_df,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_pvalues_summary_table.rds")
    #   )
    # )
    # write.csv(
    #   results_df,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_pvalues_summary_table.csv")
    #   )
    # )

    gene_results_df <- purrr::map_dfr(
      gene_results_ls,
      function(gene_results) {
        purrr::map_dfr(
          gene_results[[ITER]],
          ~ data.frame(
            `Mean Difference (High - Low)` = formatC(.x$T_obs, digits = 2, format = "g"),
            `SD` = formatC(sd(.x$perm_dist), digits = 2, format = "g"),
            `p-value` = dplyr::case_when(
              .x$pval < 1e-4 ~ "< 10<sup>-4</sup>",
              .x$pval < 1e-3 ~ "< 10<sup>-3</sup>",
              TRUE ~ formatC(.x$pval, digits = 3, format = "f")
            ),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )

    int_results_df <- purrr::map_dfr(
      int_results_ls,
      function(int_results) {
        purrr::map_dfr(
          int_results[[ITER]],
          ~ data.frame(
            `Mean Difference (High - Low)` = formatC(.x$T_obs, digits = 2, format = "g"),
            `SD` = formatC(sd(.x$perm_dist), digits = 2, format = "g"),
            `p-value` = dplyr::case_when(
              .x$pval < 1e-4 ~ "< 10<sup>-4</sup>",
              .x$pval < 1e-3 ~ "< 10<sup>-3</sup>",
              TRUE ~ formatC(.x$pval, digits = 3, format = "f")
            ),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )

    style_names <- function(x) {
      color_vec <- stringr::str_detect(x, "_")
      x <- x %>%
        stringr::str_replace_all("-_", "<sup>&#8211;</sup>&#8211;") %>%
        stringr::str_replace_all("-$", "<sup>&#8211;</sup>") %>%
        stringr::str_replace_all("\\+", "<sup>+</sup>")
      x <- dplyr::case_when(
        color_vec ~ kableExtra::cell_spec(
          x,
          color = "cornflowerblue", escape = FALSE, italic = TRUE
        ),
        TRUE ~ kableExtra::cell_spec(x, escape = FALSE, italic = TRUE)
      )
      return(x)
    }

    feature_order <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::group_by(`Locus / Interaction`) %>%
      dplyr::summarise(
        n = dplyr::n(),
        mean = mean(p)
      ) %>%
      dplyr::arrange(-n, mean) %>%
      dplyr::select(`Locus / Interaction`)

    color_vec <- stringr::str_detect(feature_order$`Locus / Interaction`, "_")

    results_df <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::mutate(
        `Difference in Local Feature Stability (High - Low)` = sprintf("%s (%s)", `Mean Difference (High - Low)`, SD)
      ) %>%
      tidyr::pivot_wider(
        id_cols = `Locus / Interaction`,
        names_from = `Binarization Threshold`,
        values_from = c(`Difference in Local Feature Stability (High - Low)`, `p-value`),
        names_vary = "slowest"
      ) %>%
      dplyr::left_join(x = feature_order, y = ., by = "Locus / Interaction") %>%
      dplyr::mutate(
        `Locus / Interaction` = style_names(`Locus / Interaction`)
      )
    results_tab <- pretty_DT(
      results_df,
      na_disp = "--", rownames = FALSE,
      grouped_header = c(
        " " = 1, "Binarization Threshold = 0.15" = 2,
        "Binarization Threshold = 0.20" = 2,
        "Binarization Threshold = 0.25" = 2
      ),
      grouped_subheader = stringr::str_remove(colnames(results_df), "\\_[.0-9]+")
    )
    # saveRDS(
    #   results_tab,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_pvalues_summary.rds")
    #   )
    # )
  }
} else {
  method_types <- c("Using all splits")
  thrs <- c(0.15, 0.2, 0.25)
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    # cat(sprintf(phenotype_header, phenotype_name))

    cat(sprintf(section_header, "Summary"))
    tab <- readRDS(file.path(
      tab_res_dir,
      paste0(phenotype, "_irf_prediction_results.rds")
    ))
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of the validation prediction performance from siRF across various %s binarization thresholds.'", phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    tab <- readRDS(file.path(
      tab_res_dir,
      paste0(phenotype, "_irf_pvalues_summary.rds")
    ))
    tab$x$container <- stringr::str_replace_all(
      tab$x$container, "p-value", "Nominal p-value"
    )
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of the lo-siRF permutation test results across various %s binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low %s groups (SD in parentheses) and the resulting nominal p-values. Blank cells indicate that the locus/interaction was not selected for testing for the given binarization threshold. Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue.'", phenotype_name, phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    cat(sprintf(section_header, "siRF Interaction Table"))
    for (thr in thrs) {
      cat(sprintf(thr_header, thr))
      tab <- readRDS(file.path(
        tab_res_dir,
        paste0(
          phenotype, "_thr", thr,
          "_irf_interaction_results.rds"
        )
      ))
      tab$x$container <- stringr::str_replace_all(
        tab$x$container, 
        "lo-siRF Mean Permutation p-value",
        "Nominal lo-siRF p-value"
      )
      subchunkify(tab,
        i = chunk_idx,
        caption = sprintf("'Summary of the siRF locus-level interaction table output for the binarized %s phenotype (binarization threshold = %s). For all metrics excluding the nominal lo-siRF p-value, higher values indicate a stronger interaction. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).'", phenotype_name, thr)
      )
      chunk_idx <- chunk_idx + 1
      cat("\n\n")

      tab <- readRDS(file.path(
        tab_res_dir,
        paste0(
          phenotype, "_thr", thr,
          "_irf_snv_interaction_results.rds"
        )
      ))
      subchunkify(tab,
        i = chunk_idx,
        caption = sprintf("'Summary of the siRF SNV-level interaction table output for the binarized %s phenotype (binarization threshold = %s). Stability scores for all interactions are close to 0, suggesting unreliable SNV-level interactions. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).'", phenotype_name, thr)
      )
      chunk_idx <- chunk_idx + 1
      cat("\n\n")
    }

    cat(sprintf(section_header, "Local Stability Importance Plots - Interactions"))
    for (thr in thrs) {
      cat(sprintf(thr_header, thr))
      for (type in method_types) {
        # distribution plots
        plt <- readRDS(
          file.path(
            fig_res_dir,
            paste0(
              phenotype, "_thr", thr,
              "_int_local_stability_dist_", type, ".rds"
            )
          )
        )
        nrows <- ceiling(nlevels(plt$data$int) / 3)
        subchunkify(plt,
          i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
          add_class = c("panel panel-default padded-panel"),
          other_args = "out.width='100%'",
          caption = sprintf("'Distribution of local stability importance scores between the high and low %s groups (binarization threshold = %s) for each signed interaction between loci (specified by subplot) that passed the siRF stability filter. The nominal p-value from the permutation test, assessing distributional differences between the high and low %s groups, is provided in the top right corner of each subplot.'", phenotype_name, thr, phenotype_name)
        )
        chunk_idx <- chunk_idx + 1
        cat("\n\n")

        # individual snp plots
        plt <- readRDS(
          file.path(
            fig_res_dir,
            paste0(
              phenotype, "_thr", thr,
              "_int_local_stability_snp_dist_", type, ".rds"
            )
          )
        )
        nrows <- ceiling(length(unique(plt$data$Genes)) / 3)
        subchunkify(ggplotly(plt),
          i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
          add_class = c("panel panel-default padded-panel"),
          other_args = "out.width='100%'",
          caption = sprintf("'Frequency of occurrence of SNV-SNV combinations in the siRF fit for the binarized %s phenotype (binarization threshold = %s). SNV-SNV combinations with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level interaction importance in the siRF fit.'", phenotype_name, thr)
        )
        chunk_idx <- chunk_idx + 1
        cat("\n\n")

        # # heatmap
        # plt <- readRDS(
        #   file.path(fig_res_dir,
        #             paste0(phenotype, "_thr", thr,
        #                    "_int_local_stability_heatmap_", type, ".rds"))
        # )
        # subchunkify(plt, i = chunk_idx, fig_height = HEATMAP_HEIGHT,
        #             add_class = c("panel panel-default"),
        #             other_args = "out.width='100%'")
        # chunk_idx <- chunk_idx + 1
      }
    }

    cat(sprintf(section_header, "Local Stability Importance Plots - Loci"))
    for (thr in thrs) {
      cat(sprintf(thr_header, thr))
      for (type in method_types) {
        # distribution plots
        plt <- readRDS(
          file.path(
            fig_res_dir,
            paste0(
              phenotype, "_thr", thr,
              "_local_stability_dist_", type, ".rds"
            )
          )
        )
        nrows <- ceiling(nlevels(plt$data$feature) / 4)
        subchunkify(plt,
          i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
          add_class = c("panel panel-default"),
          other_args = "out.width='100%'",
          caption = sprintf("'Distribution of local stability importance scores between the high and low %s groups (binarization threshold = %s) for each signed locus (specified by subplot). Here, we show only the top 25 signed loci, ranked by their average local stability importance scores. The nominal p-value from the permutation test, assessing distributional differences between the high and low %s groups, is provided in the top right corner of each subplot.'", phenotype_name, thr, phenotype_name)
        )
        chunk_idx <- chunk_idx + 1
        cat("\n\n")

        # individual snp plots
        plt <- readRDS(
          file.path(
            fig_res_dir,
            paste0(
              phenotype, "_thr", thr,
              "_local_stability_snp_dist_", type, ".rds"
            )
          )
        )
        nrows <- ceiling(length(unique(plt$data$Gene)) / 3)
        subchunkify(ggplotly(plt, tooltip = c("y", "text")),
          i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
          add_class = c("panel panel-default padded-panel"),
          other_args = "out.width='100%'",
          caption = sprintf("'Frequency of occurrence of SNVs in the siRF fit for the binarized %s phenotype (binarization threshold = %s). SNVs with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level importance in the siRF fit. SNVs are ordered on the x-axis by genomic location.'", phenotype_name, thr)
        )
        chunk_idx <- chunk_idx + 1
        cat("\n\n")

        # # heatmap
        # plt <- readRDS(
        #   file.path(fig_res_dir,
        #             paste0(phenotype, "_thr", thr,
        #                    "_local_stability_heatmap_", type, ".rds"))
        # )
        # subchunkify(plt, i = chunk_idx, fig_height = HEATMAP_HEIGHT,
        #             add_class = c("panel panel-default"),
        #             other_args = "out.width='100%'")
        # chunk_idx <- chunk_idx + 1
      }
    }
  }
}
```


## Advantages of Signed Information {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-square}

<div class="panel panel-default padded-panel">
Though the signed information is not pertinent to our goal of prioritizing/recommending candidate markers for experiments, the signed information from siRF provides more granular information that can improve our interpretation of the fit. To see this, suppose that we did not keep track of the sign of the feature when splitting (i.e., that we do not keep track of whether we split on low or high values of the feature). Suppose also that we would like to measure the importance of feature $X_1$, which was used as the first (root) split in tree $T$. In the first step in our new importance score computation, we compute the local stability importance (LSI) score of $X_1$ for each individual $i$. If we do not keep track of the sign of the $X_1$ split, then $X_1$ appears on every individual's decision path so that $LSI_T(X_1, i) = 1$ for every individual $i$. In the next step of our new importance score computation, we perform a permutation test to assess whether there is a difference in the LSI scores for individuals with high LVMi and those with low LVMi. Since $LSI_T(X_1, i) = 1$ for every $i$, this permutation test will return a nominal p-value of 1. This suggests that $X_1$ is an unimportant feature in this model. However, this is counterintuitive given that $X_1$ was split on first in the tree and should be considered one of the most important features in this tree. By keeping track of the signed information of the splits as done in siRF, we can utilize this more granular information to mitigate the issue discussed above and improve our interpretation of the siRF fit.

Moreover, to facilitate the interpretation of this signed information, it can be beneficial to aggregate SNV features that are *positively* correlated (as opposed to negatively correlated) into a single locus (or group). Since the encoding of SNVs as a value taking on 0, 1, or 2 is rather arbitrary (e.g., a value of 0 can be re-encoded to take on the value of 2 by changing the reference allele), we can choose the encoding of the SNV to try to avoid strong negative correlations within a locus. More specifically, for each genetic locus which consists of SNVs $X_1, \ldots, X_g$, we choose to "flip" the encoding of the SNV $X_j$ ($j = 2, \ldots, g$) if the Pearson correlation between the observed values of $X_j$ and $X_1$ is below -0.15. Here, we use the term "flip" to mean the application of the function that maps 0 to 2, 1 to 1, and 2 to 0. This correlation threshold of -0.15 was chosen by visually inspecting the resulting pairwise correlations between $X_i$ and $X_j$ for all $i, j = 1, \ldots, g$ and checking that these pairwise correlations are not strongly negative. However, while we focus on the lo-siRF prioritization results using this correlation threshold of -0.15, the results appear to be robust to other reasonable correlation threshold choices (e.g., -0.05, -0.25). Further analysis and improvements to this procedure are very interesting for future work.
</div>


# Non-hypertensive Analysis {.tabset .tabset-fade .tabset-vmodern}

## Overview {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
The lo-siRF analysis thus far has not accounted for the possibility of pleiotropy, particular between LV mass and blood pressure given the known connections between LV hypertrophy and hypertension [@yildiz2020left]. To evaluate this possibility of pleiotropic effects of the identified variants on both LV mass and blood pressure, we conducted two analyses.

1. We assessed the marginal association between each of the lo-siRF-prioritized variants and hypertension (see `Association between SNVs and Hypertension` tab).
2. We repeated the lo-siRF analysis using only the subset of UK Biobank individuals who were not reported to have hypertension (see `Lo-siRF Results Summary` tab).

Here, we define hypertensive diseases as anyone with self-reported hypertension, high blood pressure as diagnosed by a doctor, or any ICD10 billing code diagnosis in I10-I16. Out of the 29,661 UK Biobank participants in the original lo-siRF analysis, 7,371 individuals had hypertension, leaving 22,290 individuals for the non-hypertensive analysis. 

</div>

## Association between SNVs and Hypertension {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
To assess their marginal association with hypertension, we fit a separate logistic regression model for each of the 1405 GWAS-filtered SNVs, regressing hypertension (i.e., a binary indicator of whether or not one has hypertension) on the genotype of the SNV, while adjusting for the first five principal components of ancestry, sex, age, height, and body weight.

Overall, we found that none of the 1405 SNVs met the genome-wide (p < 5E-8) nor the suggestive (p < 1E-5) significance level (Figure \@ref(fig:subchunk-78) and Figure \@ref(fig:subchunk-80)). Given our interest in the lo-siRF-prioritized interactions, which involve the *TTN*, *CCDC141*, *IGF1R*, and *LOC157273;TNKS* loci, we highlight the SNVs in each locus with the smallest p-values:

- *TTN*: rs35112591 (p = 0.0328)
- *CCDC141*: rs6725028 (p = 0.0403)
- *IGF1R*: rs62024491 (p = 0.376)
- *LOC157273;TNKS*: rs73185221 (0.0491)

The top lo-siRF-prioritized variants in each loci (see Figure 2f in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1)) exhibited even weaker marginal associations with hypertension;

- *TTN*: rs66733621 (p = 0.776)
- *CCDC141*: rs7591091 (p = 0.446)
- *IGF1R*: rs62024491 (p = 0.376)
- *LOC157273;TNKS*: rs6999852 (0.304)

We note however that the *MIR588;RSPO3* locus with lead SNV rs2022479 gave the smallest p-value of 5E-5. This, in combination with the findings in the non-hypertension lo-siRF analysis next, may suggest a possible contribution of the MIR588;RSPO3 locus on mediating LV hypertrophy through regulating blood pressure.
</div>

```{r htn-gwas, results = "asis"}
if (!params$eval) {
  display_image(
    file.path(res_dir, "no_HTN", "fuma_gwas", "manhattan_FUMA_jobs453594_edited.png"),
    chunk_idx = chunk_idx,
    caption = "'Manhattan plot for the association between the 1405 GWAS-filtered SNVs used in the original lo-siRF analysis and hypertension.'"
  )
  chunk_idx <- chunk_idx + 1

  htn_gwas_results <- data.table::fread(
    file.path(res_dir, "no_HTN", "htn_gwas_results_full.csv")
  )
  tab <- htn_gwas_results %>%
    dplyr::select(rsID, Chr, Pos = `hg19 Pos`, `Lo-siRF Locus` = Gene, p = p.value) %>%
    dplyr::arrange(p) %>%
    dplyr::slice(1:100) %>%
    vthemes::pretty_DT(rownames = FALSE)
  subchunkify(
    tab,
    i = chunk_idx,
    caption = "'**List of top 100 SNVs (out of the 1405 GWAS-filtered SNVs used in the original lo-siRF analysis) with the highest association with hypertension.** rsID = rsID of the SNV. Chr = Chromosome. Pos = Position on hg19. Lo-siRF Locus = assigned locus of the variant in the original lo-siRF analysis. p = P-value measuring the association between the SNV and hypertension.'"
  )
  chunk_idx <- chunk_idx + 1
}
```

## Non-hypertensive Lo-siRF Results Summary {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
Using the same set of 1405 GWAS-filtered SNVs as in the original lo-siRF analysis, we repeated steps 2-4 of the lo-siRF analysis using only the non-hypertension UK Biobank cohort. 

- In Figure \@ref(fig:subchunk-81), we confirm that the siRF fit on the non-hypertensive cohort performs better than random guessing and satisfies the prediction check according to the PCS framework [@yu2020veridical]. We thus choose to proceed with interpreting the siRF fit.
- Overall, we found that the top 3 interactions between loci from our original lo-siRF analysis --- namely, *CCDC141-IGF1R*, *CCDC141-TTN*, and *CCDC141-LOC157273;TNKS* --- were again prioritized by lo-siRF when excluding the hypertensive individuals. That is, each of the aforementioned interactions yielded a nominal lo-siRF p-value less than our pre-specified threshold of 0.1 (in fact, all were < 0.01) for each of the three binarization thresholds (see Figure \@ref(fig:subchunk-82) below). This is to say that the top lo-siRF-prioritized interactions are stable with respect to the inclusion or exclusion of hypertensive individuals and provides evidence that these interactions are not solely mediating LV hypertrophy through the regulation of blood pressure. 
- All of the previously prioritized loci, with the exception of the *MIR588;RSPO3* locus, were also prioritized in this non-hypertensive analysis. While the nominal lo-siRF p-value for the *MIR588;RSPO3* locus was 0.019 and 0.007 for the 20\% and 25\% binarization thresholds, the nominal lo-siRF p-value was 0.187 (> 0.1 threshold) for the 15\% binarization threshold and thus deemed not stable and was not prioritized in the non-hypertensive lo-siRF analysis. Though the *MIR588;RSPO3* locus almost met the prioritization criteria here, this result, combined with the suspect marginal association results, suggests a possibility of pleiotropic locus contributing to variations of both blood pressure and LV mass.
- We provide the full list of loci and interactions between loci from the non-hypertensive lo-siRF analysis in Figure \@ref(fig:subchunk-83). This list includes many loci and interactions that were not stable in the original lo-siRF analysis.
</div>

```{r nohtn-pred-results-binary, results = "asis"}
# phenotype_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
section_header <- "\n\n### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
thr_header <- "\n\n#### %s {.unnumbered .tabset .tabset-pills .tabset-fade} \n\n"
# subsection_header <- "\n\n##### %s {.unnumbered .tabset .tabset-pills .tabset-fade .tabset-circle} \n\n"
# subsubsection_header <- "\n\n####### %s {.unnumbered} \n\n"

keep_ints <- c("CCDC141-_TTN-", "CCDC141-_IGF1R-", "CCDC141-_LOC157273;TNKS-")
keep_features <- c("IGF1R", "MIR588;RSPO3", "TTN", "CCDC141", "LSP1")
keep_features <- c(
  paste0(keep_features, "+"), paste0(keep_features, "-")
)

DIST_HEIGHT <- 12
HEATMAP_HEIGHT <- 8
EPI_HEIGHT <- 8
ITER <- 3
LEFT_MARGIN <- 5

if (params$eval) {
  for (phenotype in phenotypes) {
    print(phenotype)
    eval_ls <- list()
    int_results_ls <- list()
    gene_results_ls <- list()
    for (thr in c(0.15, 0.2, 0.25)) {
      print(thr)
      fpath <- file.path(
        res_dir, "no_htn",
        paste0(phenotype, "_binary_thr", thr, "_gwas_filtered_nsnps1000")
      )
      # > out, yhat_tr2, yhat_va, irf_gene_out, snps.df, irf_snv_int_out
      load(file.path(fpath, "irf_interaction_fit.Rdata"))
      y_test <- ifelse(out$pheno_test == 0, "Low LVMi", "High LVMi")

      # prediction results
      eval_out <- bind_rows(
        evalPreds(y = out$pheno_test, yhat = yhat_va, metric = c("AUC", "PR")),
        evalPreds(y = out$pheno_test, yhat = round(yhat_va), metric = c("Class"))
      ) %>%
        spread(key = "Metric", value = "Value")
      eval_ls[[as.character(thr)]] <- eval_out

      # local interaction stability results
      load(file.path(fpath, "local_interaction_stability_results.Rdata"))
      int_out <- list(
        stab_tr2_ls = int_stab_tr2_ls,
        stab_va_ls = int_stab_va_ls,
        test_ints_ls = test_ints_ls,
        perm_out_ls = int_perm_out_ls
      )
      int_results_ls[[as.character(thr)]] <- int_perm_out_ls

      int_all_out <- list(`Using all splits` = int_out)

      pval_int <- map_dfr(int_out$perm_out_ls[[ITER]],
        function(x) {
          data.frame(
            pval = x$pval,
            check.names = FALSE
          )
        },
        .id = "int"
      ) %>%
        dplyr::mutate(
          pval_str = dplyr::case_when(
            pval < 1e-4 ~ "< 10<sup>-4</sup>",
            pval < 1e-3 ~ "< 10<sup>-3</sup>",
            TRUE ~ sprintf("%s", formatC(pval, digits = 3, format = "f"))
          )
        )

      style_names_basic <- function(x, italic = TRUE) {
        x <- x %>%
          stringr::str_replace_all("-_", "<sup>&#8211;</sup>&#8211;") %>%
          stringr::str_replace_all("\\+_", "<sup>+</sup>&#8211;") %>%
          stringr::str_replace_all("-$", "<sup>&#8211;</sup>") %>%
          stringr::str_replace_all("\\+$", "<sup>+</sup>") %>%
          kableExtra::cell_spec(escape = FALSE, italic = italic)
        return(x)
      }

      # iRF results
      tab_out <- irf_gene_out$interaction[[ITER]] %>%
        left_join(y = pval_int, by = "int") %>%
        dplyr::arrange(pval) %>%
        dplyr::select(-pval) %>%
        dplyr::rename(
          "Interaction" = "int",
          "Prevalence" = "prevalence",
          "Precision" = "precision",
          "Class Difference in Prevalence" = "cpe",
          "Stability of Class Difference in Prevalence" = "sta.cpe",
          "Feature Selection Dependence" = "fsd",
          "Stability of Feature Selection Dependence" = "sta.fsd",
          "Increase in Precision" = "mip",
          "Stability of Increase in Precision" = "sta.mip",
          "Stability" = "stability",
          "lo-siRF Mean Permutation p-value" = "pval_str"
        ) %>%
        dplyr::mutate(Interaction = style_names_basic(Interaction))
      tab <- pretty_DT(tab_out,
        digits = 3, sigfig = TRUE, rownames = FALSE,
        na_disp = "--"
      )
      # saveRDS(
      #   tab,
      #   file.path(
      #     tab_res_dir,
      #     paste0(
      #       phenotype, "_thr", thr,
      #       "_irf_interaction_nohtn_results.rds"
      #     )
      #   )
      # )

      for (type in names(int_all_out)) {
        print(type)
        # distribution plots
        int_pvals <- map_dfr(int_all_out[[type]]$perm_out_ls[[ITER]],
          ~ data.frame(pval = .x$pval),
          .id = "int"
        ) %>%
          arrange(pval) %>%
          dplyr::mutate(
            pval_str = dplyr::case_when(
              pval < 1e-4 ~ "p < 10^-4",
              pval < 1e-3 ~ "p < 10^-3",
              # pval < 1e-4 ~ "p < 10<sup>-4</sup>",
              # pval < 1e-3 ~ "p < 10<sup>-3</sup>",
              TRUE ~ sprintf("p~`=`~%s", formatC(pval, digits = 3, format = "f"))
            )
          ) %>%
          mutate(int_pval = sprintf("%s (%s)", int, pval_str))
        plt_df <- int_all_out[[type]]$stab_va_ls[[ITER]] %>%
          bind_cols(y = y_test) %>%
          gather(key = "int", value = "Stability", -y) %>%
          left_join(y = int_pvals, by = "int") %>%
          # mutate(int = paste0(int, " (p = ", formatC(pval, 3, format = "g"), ")")) %>%
          filter(int %in% !!keep_ints) %>%
          mutate(
            int = factor(int, levels = int_pvals$int[int_pvals$int %in% keep_ints])
          )
        plt <- plot_density(data = plt_df, x_str = "Stability", fill_str = "y") +
          facet_wrap(~int, scales = "free", ncol = 3) +
          ggplot2::labs(x = "Local Stability Importance Score", fill = "") +
          ggplot2::geom_text(
            ggplot2::aes(x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, label = pval_str),
            data = plt_df %>% dplyr::distinct(int, .keep_all = TRUE),
            parse = TRUE
          ) +
          ggplot2::theme(axis.title = ggplot2::element_text(size = 14))
        # saveRDS(
        #   plt,
        #   file.path(
        #     fig_res_dir,
        #     paste0(
        #       phenotype, "_thr", thr,
        #       "_int_local_stability_dist_", type, "_nohtn.rds"
        #     )
        #   )
        # )

        # individual snp plots
        plt <- plotRFSnpIntDist(
          irf_gene_out$rf.list[[ITER]], snps.df,
          int_pvals$int[int_pvals$int %in% keep_ints],
          prop = TRUE
        ) +
          ggplot2::labs(x = "SNV-SNV Pair")
        # saveRDS(
        #   plt,
        #   file.path(
        #     fig_res_dir,
        #     paste0(
        #       phenotype, "_thr", thr,
        #       "_int_local_stability_snp_dist_", type, "_nohtn.rds"
        #     )
        #   )
        # )
      }

      # local feature stability results
      load(file.path(fpath, "local_feature_stability_results.Rdata"))
      stab_out <- list(
        stab_tr2_ls = stab_tr2_ls,
        stab_va_ls = stab_va_ls,
        test_genes_ls = test_genes_ls,
        perm_out_ls = perm_out_ls
      )
      gene_results_ls[[as.character(thr)]] <- perm_out_ls

      stab_all_out <- list(`Using all splits` = stab_out)

      for (type in names(int_all_out)) {
        print(type)
        stab_df <- stab_all_out[[type]]$stab_va_ls[[ITER]] %>%
          select(all_of(stab_all_out[[type]]$test_genes_ls[[ITER]]))

        # distribution plots
        ft_pvals <- map_dfr(stab_all_out[[type]]$perm_out_ls[[ITER]],
          ~ data.frame(pval = .x$pval),
          .id = "feature"
        ) %>%
          arrange(pval) %>%
          dplyr::mutate(
            pval_str = dplyr::case_when(
              pval < 1e-4 ~ "p < 10^-4",
              pval < 1e-3 ~ "p < 10^-3",
              # pval < 1e-4 ~ "p < 10<sup>-4</sup>",
              # pval < 1e-3 ~ "p < 10<sup>-3</sup>",
              TRUE ~ sprintf("p~`=`~%s", formatC(pval, digits = 3, format = "f"))
            )
          ) %>%
          mutate(ft_pval = paste0(feature, " (p = ", formatC(pval, 3, format = "g"), ")"))
        plt_df <- stab_df %>%
          bind_cols(y = y_test) %>%
          gather(key = "feature", value = "Stability", -y) %>%
          left_join(y = ft_pvals, by = "feature") %>%
          # mutate(feature = paste0(feature, " (p = ", formatC(pval, 3, format = "g"), ")")) %>%
          filter(feature %in% !!keep_features) %>%
          mutate(
            feature = factor(
              feature,
              levels = ft_pvals$feature[ft_pvals$feature %in% keep_features]
            )
          )
        plt <- plot_density(data = plt_df, x_str = "Stability", fill_str = "y") +
          facet_wrap(~feature, scales = "free", ncol = 4) +
          ggplot2::labs(x = "Local Stability Importance Score", fill = "") +
          ggplot2::geom_text(
            ggplot2::aes(x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, label = pval_str),
            data = plt_df %>% dplyr::distinct(feature, .keep_all = TRUE),
            parse = TRUE
          ) +
          ggplot2::theme(axis.title = ggplot2::element_text(size = 14))
        # saveRDS(
        #   plt,
        #   file.path(
        #     fig_res_dir,
        #     paste0(
        #       phenotype, "_thr", thr,
        #       "_local_stability_dist_", type, "_nohtn.rds"
        #     )
        #   )
        # )

        # individual snp plots
        plt <- plotRFSnpDist(
          irf_gene_out$rf.list[[ITER]], snps.df,
          ft_pvals$feature[ft_pvals$feature %in% keep_features],
          prop = TRUE
        ) +
          ggplot2::labs(x = "SNV (Ordered by Position)")
        # saveRDS(
        #   plt,
        #   file.path(
        #     fig_res_dir,
        #     paste0(
        #       phenotype, "_thr", thr,
        #       "_local_stability_snp_dist_", type, "_nohtn.rds"
        #     )
        #   )
        # )
      }
    }

    tab <- dplyr::bind_rows(eval_ls, .id = "Binarization Threshold") %>%
      dplyr::rename("Accuracy" = "Class", "AUROC" = "AUC", "AUPRC" = "PR") %>%
      dplyr::select(`Binarization Threshold`, Accuracy, AUROC, AUPRC) %>%
      pretty_DT(
        digits = 3, sigfig = FALSE, rownames = FALSE,
        options = list(dom = "t", ordering = FALSE)
      )
    # saveRDS(
    #   tab,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_prediction_nohtn_results.rds")
    #   )
    # )

    gene_results_df <- purrr::map_dfr(
      gene_results_ls,
      function(gene_results) {
        purrr::map_dfr(
          gene_results[[ITER]],
          ~ data.frame(
            `Mean Difference (High - Low)` = formatC(.x$T_obs, digits = 2, format = "g"),
            `SD` = formatC(sd(.x$perm_dist), digits = 2, format = "g"),
            `p-value` = dplyr::case_when(
              .x$pval < 1e-4 ~ "< 10<sup>-4</sup>",
              .x$pval < 1e-3 ~ "< 10<sup>-3</sup>",
              TRUE ~ formatC(.x$pval, digits = 3, format = "f")
            ),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )

    int_results_df <- purrr::map_dfr(
      int_results_ls,
      function(int_results) {
        purrr::map_dfr(
          int_results[[ITER]],
          ~ data.frame(
            `Mean Difference (High - Low)` = formatC(.x$T_obs, digits = 2, format = "g"),
            `SD` = formatC(sd(.x$perm_dist), digits = 2, format = "g"),
            `p-value` = dplyr::case_when(
              .x$pval < 1e-4 ~ "< 10<sup>-4</sup>",
              .x$pval < 1e-3 ~ "< 10<sup>-3</sup>",
              TRUE ~ formatC(.x$pval, digits = 3, format = "f")
            ),
            p = .x$pval,
            check.names = FALSE
          ),
          .id = "Locus / Interaction"
        )
      },
      .id = "Binarization Threshold"
    )

    style_names <- function(x) {
      color_vec <- stringr::str_detect(x, "_")
      x <- x %>%
        stringr::str_replace_all("-_", "<sup>&#8211;</sup>&#8211;") %>%
        stringr::str_replace_all("-$", "<sup>&#8211;</sup>") %>%
        stringr::str_replace_all("\\+", "<sup>+</sup>")
      x <- dplyr::case_when(
        color_vec ~ kableExtra::cell_spec(
          x,
          color = "cornflowerblue", escape = FALSE, italic = TRUE
        ),
        TRUE ~ kableExtra::cell_spec(x, escape = FALSE, italic = TRUE)
      )
      return(x)
    }

    feature_order <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::group_by(`Locus / Interaction`) %>%
      dplyr::summarise(
        n = dplyr::n(),
        mean = mean(p)
      ) %>%
      dplyr::arrange(-n, mean) %>%
      dplyr::select(`Locus / Interaction`)

    color_vec <- stringr::str_detect(feature_order$`Locus / Interaction`, "_")

    results_df <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::mutate(
        `Difference in Local Feature Stability (High - Low)` = sprintf("%s (%s)", `Mean Difference (High - Low)`, SD)
      ) %>%
      tidyr::pivot_wider(
        id_cols = `Locus / Interaction`,
        names_from = `Binarization Threshold`,
        values_from = c(`Difference in Local Feature Stability (High - Low)`, `p-value`),
        names_vary = "slowest"
      ) %>%
      dplyr::left_join(x = feature_order, y = ., by = "Locus / Interaction") %>%
      dplyr::mutate(
        `Locus / Interaction` = style_names(`Locus / Interaction`)
      )
    results_tab <- pretty_DT(
      results_df,
      na_disp = "--", rownames = FALSE,
      grouped_header = c(
        " " = 1, "Binarization Threshold = 0.15" = 2,
        "Binarization Threshold = 0.20" = 2,
        "Binarization Threshold = 0.25" = 2
      ),
      grouped_subheader = stringr::str_remove(colnames(results_df), "\\_[.0-9]+")
    )
    # saveRDS(
    #   results_tab,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_pvalues_summary_full_nohtn.rds")
    #   )
    # )

    results_df <- dplyr::bind_rows(gene_results_df, int_results_df) %>%
      dplyr::filter(`Locus / Interaction` %in% c(keep_ints, keep_features)) %>%
      dplyr::mutate(
        `Difference in Local Feature Stability (High - Low)` = sprintf("%s (%s)", `Mean Difference (High - Low)`, SD)
      ) %>%
      tidyr::pivot_wider(
        id_cols = `Locus / Interaction`,
        names_from = `Binarization Threshold`,
        values_from = c(`Difference in Local Feature Stability (High - Low)`, `p-value`),
        names_vary = "slowest"
      ) %>%
      dplyr::inner_join(x = feature_order, y = ., by = "Locus / Interaction") %>%
      dplyr::mutate(
        `Locus / Interaction` = style_names(`Locus / Interaction`)
      )
    results_tab <- pretty_DT(
      results_df,
      na_disp = "--", rownames = FALSE,
      grouped_header = c(
        " " = 1, "Binarization Threshold = 0.15" = 2,
        "Binarization Threshold = 0.20" = 2,
        "Binarization Threshold = 0.25" = 2
      ),
      grouped_subheader = stringr::str_remove(colnames(results_df), "\\_[.0-9]+"),
      options = list(dom = "t")
    )
    # saveRDS(
    #   results_tab,
    #   file.path(
    #     tab_res_dir,
    #     paste0(phenotype, "_irf_pvalues_summary_nohtn.rds")
    #   )
    # )
  }
} else {
  method_types <- c("Using all splits")
  thrs <- c(0.15, 0.2, 0.25)
  for (phenotype in phenotypes) {
    phenotype_name <- names(phenotypes)[phenotypes == phenotype]
    # cat(sprintf(phenotype_header, phenotype_name))

    # cat(sprintf(section_header, "Summary"))
    tab <- readRDS(file.path(
      tab_res_dir,
      paste0(phenotype, "_irf_prediction_nohtn_results.rds")
    ))
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of the validation prediction performance from siRF on the non-hypertensive cohort across various %s binarization thresholds.'", phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    tab <- readRDS(file.path(
      tab_res_dir,
      paste0(phenotype, "_irf_pvalues_summary_nohtn.rds")
    ))
    tab$x$container <- stringr::str_replace_all(
      tab$x$container, "p-value", "Nominal p-value"
    )
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of the non-hypertensive lo-siRF permutation test results across various %s binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low %s groups (SD in parentheses) and the resulting nominal p-values. Results are shown for the loci/interactions that were previously prioritized in the original lo-siRF analysis (including both hypertensive and non-hypertensive individuals). Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue. This table is a subset of the table below.'", phenotype_name, phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    tab <- readRDS(file.path(
      tab_res_dir,
      paste0(phenotype, "_irf_pvalues_summary_full_nohtn.rds")
    ))
    tab$x$container <- stringr::str_replace_all(
      tab$x$container, "p-value", "Nominal p-value"
    )
    subchunkify(tab,
      i = chunk_idx,
      caption = sprintf("'Summary of the non-hypertensive lo-siRF permutation test results across various %s binarization thresholds including the test statistic (i.e., difference in local feature stability scores between the high and the low %s groups (SD in parentheses) and the resulting nominal p-values. Blank cells indicate that the locus/interaction was not selected for testing for the given binarization threshold. Loci/interactions are ranked according to the mean nominal p-value across all three binarization thresholds. Interactions between genetic loci are highlighted in blue.'", phenotype_name, phenotype_name)
    )
    chunk_idx <- chunk_idx + 1

    # cat(sprintf(section_header, "siRF Interaction Table"))
    # for (thr in thrs) {
    #   cat(sprintf(thr_header, thr))
    #   tab <- readRDS(file.path(tab_res_dir,
    #                            paste0(phenotype, "_thr", thr,
    #                                   "_irf_interaction_nohtn_results.rds")))
    #   subchunkify(tab, i = chunk_idx,
    #               caption = sprintf("'Summary of the siRF locus-level interaction table output for the binarized %s phenotype (binarization threshold = %s) in the non-hypertensive lo-siRF analysis. For all metrics excluding the lo-siRF mean permutation p-value, higher values indicate a stronger interaction. Details regarding these metrics in the siRF interaction table output can be found in Kumbier et al. (2018).'", phenotype_name, thr))
    #   chunk_idx <- chunk_idx + 1
    #   cat("\n\n")
    # }
    #
    # cat(sprintf(section_header, "Local Stability Importance Plots - Interactions"))
    # for (thr in thrs) {
    #   cat(sprintf(thr_header, thr))
    #   for (type in method_types) {
    #     # distribution plots
    #     plt <- readRDS(
    #       file.path(fig_res_dir,
    #                 paste0(phenotype, "_thr", thr,
    #                        "_int_local_stability_dist_", type, "_nohtn.rds"))
    #     )
    #     nrows <- ceiling(nlevels(plt$data$int) / 3)
    #     subchunkify(plt, i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
    #                 add_class = c("panel panel-default padded-panel"),
    #                 other_args = "out.width='100%'",
    #                 caption = sprintf("'Distribution of local stability importance scores between the high and low %s groups (binarization threshold = %s) in the non-hypertensive lo-siRF analysis. Results are shown for each signed interaction between loci (specified by subplot) that was prioritized in the original lo-siRF analysis. The permutation test p-value, assessing distributional differences between the high and low %s groups, is provided in the top right corner of each subplot.'", phenotype_name, thr, phenotype_name))
    #     chunk_idx <- chunk_idx + 1
    #     cat("\n\n")
    #
    #     # individual snp plots
    #     plt <- readRDS(
    #       file.path(fig_res_dir,
    #                 paste0(phenotype, "_thr", thr,
    #                        "_int_local_stability_snp_dist_", type, "_nohtn.rds"))
    #     )
    #     nrows <- ceiling(length(unique(plt$data$Genes)) / 3)
    #     subchunkify(ggplotly(plt), i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
    #                 add_class = c("panel panel-default padded-panel"),
    #                 other_args = "out.width='100%'",
    #                 caption = sprintf("'Frequency of occurrence of SNV-SNV combinations in the siRF fit for the binarized %s phenotype (binarization threshold = %s) in the non-hypertensive lo-siRF analysis. SNV-SNV combinations with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level interaction importance in the siRF fit.'", phenotype_name, thr))
    #     chunk_idx <- chunk_idx + 1
    #     cat("\n\n")
    #   }
    # }
    #
    # cat(sprintf(section_header, "Local Stability Importance Plots - Loci"))
    # for (thr in thrs) {
    #   cat(sprintf(thr_header, thr))
    #   for (type in method_types) {
    #     # distribution plots
    #     plt <- readRDS(
    #       file.path(fig_res_dir,
    #                 paste0(phenotype, "_thr", thr,
    #                        "_local_stability_dist_", type, "_nohtn.rds"))
    #     )
    #     nrows <- ceiling(nlevels(plt$data$feature) / 4)
    #     subchunkify(plt, i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
    #                 add_class = c("panel panel-default"),
    #                 other_args = "out.width='100%'",
    #                 caption = sprintf("'Distribution of local stability importance scores between the high and low %s groups (binarization threshold = %s) in the non-hypertensive lo-siRF analysis. Results are shown for each signed locus (specified by subplot) that was prioritized in the original lo-siRF analysis. The permutation test p-value, assessing distributional differences between the high and low %s groups, is provided in the top right corner of each subplot.'", phenotype_name, thr, phenotype_name))
    #     chunk_idx <- chunk_idx + 1
    #     cat("\n\n")
    #
    #     # individual snp plots
    #     plt <- readRDS(
    #       file.path(fig_res_dir,
    #                 paste0(phenotype, "_thr", thr,
    #                        "_local_stability_snp_dist_", type, "_nohtn.rds"))
    #     )
    #     nrows <- ceiling(length(unique(plt$data$Gene)) / 3)
    #     subchunkify(ggplotly(plt, tooltip = c("y", "text")), i = chunk_idx, fig_height = DIST_HEIGHT / 4 * nrows,
    #                 add_class = c("panel panel-default padded-panel"),
    #                 other_args = "out.width='100%'",
    #                 caption = sprintf("'Frequency of occurrence of SNVs in the siRF fit for the binarized %s phenotype (binarization threshold = %s) in the non-hypertensive lo-siRF analysis. SNVs with the highest number of occurrences in the siRF decision paths contribute the most to the overall locus-level importance in the siRF fit. SNVs are ordered on the x-axis by genomic location.'", phenotype_name, thr))
    #     chunk_idx <- chunk_idx + 1
    #     cat("\n\n")
    #   }
    # }
  }
}
```

# Comparison to Existing Methods {.tabset .tabset-fade .tabset-vmodern}

## Epistasis Detection Methods {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
In this section, we compare the lo-siRF-prioritized interactions between loci to those interactions prioritized by other existing epistasis detection methods, namely, an exhaustive pairwise interaction search via linear (or logistic) regression, MAPIT [@crawford2017detecting], and a locus-level version of MAPIT where we combined MAPIT and gene set enrichment analysis (GSEA) [@subramanian2005gene; @mootha2003pgc].
</div>

### Regression-based pairwise interaction scan {.unnumbered .tabset .tabset-pills .tabset-circle}

<div class="panel panel-default padded-panel">
Below, we present the results from conducting an exhaustive pairwise interaction scan using linear (or logistic) regression over all GWAS-filtered SNVs with a minor allele frequency > 0.05. 
In this exhaustive interaction search, there are four different transformations of the LVMi response/phenotype that are of interest --- namely, the rank-based inverse normal-transformed (RINT) LVMi (fitted using linear regression) and the binarized LVMi at three different thresholds (0.15, 0.2, and 0.25) (fitted using logistic regression). 
Details regarding the methodology can be found in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1).

- In Figure \@ref(fig:subchunk-84), we examined the stability of the top ranked SNV-SNV interactions (i.e., those with the smallest p-values for the interaction term) across the different choices of LVMi response (i.e., RINT or one of the binarized versions). While there is some positive correlation across the different binarization runs, we found that there is very little correlation between the prioritized interactions when using the RINT-transformed LVMi versus the binarized LVMi as the response.
  - We note that the exhaustive interaction search was also conducted using all 1405 GWAS-filtered SNVs without the minor allele frequency filter. However, this led to even greater instability. We hence omit these results and focus on the results after restricting to SNVs with a minor allele frequency > 0.05.
- In Figure \@ref(fig:subchunk-85), we provide a table of the SNV-SNV interactions and their corresponding interaction term p-values for each choice of LVMi response. We show only those SNV-SNV interactions with a p-value < 0.01 for at least one of the LVMi responses.
  - We note that none of the SNV-SNV interactions met the $\alpha = 0.05$ significance threshold after performing a Bonferroni multiple testing adjustment. 
  - Still, one could in theory use the p-values for ranking SNV-SNV interactions. For the unbinarized RINT-transformed LVMi response, the top interaction was between rs10227393 (chromosome 7; *TPK1;CNTNAP2* locus) and rs8076248 (chromosome 17; *CFAP52* locus). For the binarized LVMi response using both the 15% and 25% thresholds, the top interaction was between rs3176323 (chromosome 6; *CDKN1A* locus) and rs12795076 (chromosome 11; *KIRREL3* locus). To our knowledge, these loci are not known to be closely related to hypertrophy. Only for the binarized LVMi response with the 20% threshold do we obtain a top interaction involving one of the lo-siRF-prioritized loci. Here, the top interaction is between rs12618871 (chromosome 2; *CCDC141* locus) and rs9267356 (chromosome 6; *MICB* locus). 

These differences in the top interactions across the different LVMi responses further highlight the instability of the regression-based pairwise interaction scan. This instability poses issues in interpreting results, as it can be unclear which results to trust. We suspect that at least part of the instability is a result of the low signal-to-noise ratio in this problem and highlights the need for more robust and stability-driven methods, such as lo-siRF, for detecting epistasis in low signal-to-noise regimes.
  
</div>

```{r pairwise-interaction-scan, results = "asis"}
thr_header <- "\n\n#### %s\n\n"

pheno_names <- c(
  "iLVM_norm",
  "iLVM_binary_thr0.15",
  "iLVM_binary_thr0.2",
  "iLVM_binary_thr0.25"
)

if (params$eval) {
  maf_df <- data.table::fread(
    file.path(data_dir, "plink_data", "HTN_training.frq")
  ) %>%
    tibble::as_tibble()
  rm_snps <- maf_df %>%
    dplyr::filter(MAF < 0.05) %>%
    dplyr::left_join(snps_df, by = c("SNP" = "rsID")) %>%
    dplyr::pull(Name)

  results_ls <- list()
  for (pheno_name in pheno_names) {
    results <- readRDS(
      file.path(
        res_dir, "epistasis_comparisons", "manual",
        sprintf("%s_snpxsnp_scan.rds", pheno_name)
      )
    ) %>%
      dplyr::filter(
        !(snp1_name %in% rm_snps), !(snp2_name %in% rm_snps)
      )
    snps_df1 <- loadSNPInfo(unique(results$snp1_name)) %>%
      dplyr::select(Name, rsID1 = rsID)
    snps_df2 <- loadSNPInfo(unique(results$snp2_name)) %>%
      dplyr::select(Name, rsID2 = rsID)
    results <- results %>%
      dplyr::left_join(snps_df1, by = c("snp1_name" = "Name")) %>%
      dplyr::left_join(snps_df2, by = c("snp2_name" = "Name"))

    results_ls[[pheno_name]] <- results %>%
      dplyr::select(
        `SNP1 rsID` = rsID1, `SNP1 Chr`, `SNP1 Locus` = `SNP1 Gene`,
        `SNP2 rsID` = rsID2, `SNP2 Chr`, `SNP2 Locus` = `SNP2 Gene`,
        `SNP1xSNP2 Interaction p-value` = `snp1:snp2`
      )
  }

  agg_results_ls <- purrr::imap(
    results_ls,
    ~ .x %>%
      dplyr::mutate(
        order = purrr::map2(
          `SNP1 Locus`, `SNP2 Locus`,
          ~ rank(c(.x, .y))
        ),
        rsID = purrr::pmap_chr(
          list(snp1 = `SNP1 rsID`, snp2 = `SNP2 rsID`, order = order),
          function(snp1, snp2, order) {
            paste(c(snp1, snp2)[order], collapse = "-")
          }
        ),
        Locus = purrr::pmap_chr(
          list(snp1 = `SNP1 Locus`, snp2 = `SNP2 Locus`, order = order),
          function(snp1, snp2, order) {
            paste(c(snp1, snp2)[order], collapse = "-")
          }
        ),
        Chr = purrr::pmap_chr(
          list(snp1 = `SNP1 Chr`, snp2 = `SNP2 Chr`, order = order),
          function(snp1, snp2, order) {
            paste(c(snp1, snp2)[order], collapse = "-")
          }
        )
      ) %>%
      dplyr::select(rsID, Locus, Chr, `SNP1xSNP2 Interaction p-value`) %>%
      dplyr::arrange(`SNP1xSNP2 Interaction p-value`) %>%
      dplyr::rename(
        {{ .y }} := `SNP1xSNP2 Interaction p-value`
      ) %>%
      dplyr::distinct(rsID, .keep_all = TRUE)
  )

  results_df <- purrr::reduce(
    agg_results_ls, dplyr::full_join,
    by = c("rsID", "Locus", "Chr")
  ) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      mean_p = mean(dplyr::c_across(`iLVM_norm`:`iLVM_binary_thr0.25`))
    ) %>%
    dplyr::ungroup() %>%
    dplyr::arrange(mean_p) %>%
    na.omit() %>%
    dplyr::filter(
      iLVM_norm < 0.01 |
        iLVM_binary_thr0.15 < 0.01 |
        iLVM_binary_thr0.2 < 0.01 |
        iLVM_binary_thr0.25 < 0.01
    )

  # saveRDS(
  #   results_df,
  #   file = file.path(res_dir, "epistasis_comparisons", "manual", "iLVM_snpxsnp_scan_results.rds")
  # )
} else {
  results_df <- readRDS(
    file.path(res_dir, "epistasis_comparisons", "manual", "iLVM_snpxsnp_scan_results.rds")
  ) %>%
    dplyr::rename(
      `LVMi (RINT)` = iLVM_norm,
      `LVMi (Binarization = 0.15)` = iLVM_binary_thr0.15,
      `LVMi (Binarization = 0.2)` = iLVM_binary_thr0.2,
      `LVMi (Binarization = 0.25)` = iLVM_binary_thr0.25,
      `Mean p-value` = mean_p
    )
  plt <- results_df %>%
    dplyr::mutate(
      dplyr::across(tidyselect::starts_with("LVMi"), ~ -log10(.x))
    ) %>%
    vdocs::plot_pairs(
      columns = colnames(results_df)[
        stringr::str_starts(colnames(results_df), "LVMi")
      ]
    )
  subchunkify(
    plt,
    i = chunk_idx, fig_height = 8,
    add_class = c("panel panel-default"),
    other_args = "out.width='100%'",
    caption = "'**Pair plot of the LVMi interaction effect p-values from an exhaustive regression-based SNVxSNV interaction scan using different transformations of the LVMi phenotype.** The transformations of interest include the rank-based inverse normal-transformed LVMi and the binarized LVMi using three different binarization thresholds (15%, 20%, 25%). The x- and y-axes show the interaction effect p-values after a -log10 transformation. Each point represents an SNV-SNV pair. We observe that the interaction effect p-values are generally unstable and depend on the choice of transformation of the phenotype (or response).'"
  )
  chunk_idx <- chunk_idx + 1

  tab <- pretty_DT(results_df)
  subchunkify(tab,
    i = chunk_idx,
    caption = "'**List of SNV-SNV interactions from an exhaustive regression-based SNVxSNV interaction scan using different transformations of the LVMi phenotype.** The transformations of interest include the rank-based inverse normal-transformed LVMi and the binarized LVMi using three different binarization thresholds (15%, 20%, 25%). The table sorts the SNV-SNV interactions based on the mean p-value across the different transformations of the LVMi phenotype. To sort the SNV-SNV interactions based upon their p-value from a particular transformation, please use the toggles next to the column names. Only interactions that yielded a p-value < 0.01 for at least one transformation of the LVMi phenotype are shown for brevity.'"
  )
  chunk_idx <- chunk_idx + 1
}
```

### MAPIT {.unnumbered .tabset .tabset-pills .tabset-circle}

<div class="panel panel-default padded-panel">
Given the instabilities observed in the exhaustive regression-based interaction search, we applied MAPIT [@crawford2017detecting] to our dataset as an additional comparison method. At a high-level, MAPIT aims to identify variants with non-zero marginal epistatic effects, defined as the total pairwise interaction effect between the given variant and all other variants in the data.

Similar to before, we used the GWAS-filtered SNVs with minor allele frequency > 0.05 as input to MAPIT. For the response, we used the rank-based inverse normal-transformed LVMi only since the MAPIT extension to binary traits has not yet been implemented. Details regarding the methodology can be found in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1).

  - Note: we also tried applying MAPIT to the full set of 1405 GWAS-filtered SNVs (without the additional minor allele frequency filter). However, the resulting top SNVs were predominantly those with the smallest minor allele frequencies, and we omit these results.

The top 5 SNVs from MAPIT with the smallest p-values were located in the *MAT2B;LINC02143*, *MICB*, *BRMS1L;LINC00609*, and *CDKN1A* loci (Figure \@ref(fig:subchunk-86)). Interestingly, the highest-ranked SNVs from the *CCDC141*, *TTN*, and *IGF1R* loci were ranked 109 (p = 0.028), 75 (p = 0.003), and 87 (p = 0.011), respectively (compared to p < 8E-5 for the top 5 SNVs), according to MAPIT. 

As is, the results from MAPIT and lo-siRF are not directly comparable since they are operating at different biological scales (i.e., MAPIT aims to identify epistatic variants while lo-siRF aims to identify epistatic loci). Variant-level approaches are known to be under-powered in traits where the effect of each variant is very small and/or noisy while group-level (or locus-level) approaches aggregate these signals across a collection of variants in hope of detecting more stable and biologically-relevant loci. In addition to this difference, there are a number of other potential reasons why MAPIT and lo-siRF may yield different, but non-contradictory results. For example, MAPIT and lo-siRF assume different underlying model assumptions and may capture different forms of interactions. In short, MAPIT encodes interactions as pairwise multiplicative effects (more typical in the statistical epistasis literature) whereas lo-siRF takes a more machine learning approach, encoding pairwise and higher-order interactions through decision paths (or boolean thresholding operations) in a decision tree. Other potential differences may stem from the different handling of variants with minor allele frequencies below 0.05 or the slightly different phenotypes used (i.e., lo-siRF used the binarized LVMi phenotype while MAPIT unfortunately requires a continuous phenotype and thus used the rank-based inverse normal-transformed LVMi). 
</div>

```{r mapit, results = "asis"}
if (!params$eval) {
  mapit_df <- readRDS(
    file.path(res_dir, "epistasis_comparisons", "mapit", "mvmapit_normal_maf0.05_ranknorm.rds")
  )$pvalues
  snps_df <- loadSNPInfo(mapit_df$id)
  mapit_df <- mapit_df %>%
    dplyr::left_join(snps_df, by = c("id" = "Name")) %>%
    dplyr::mutate(Rank = as.integer(rank(p))) %>%
    dplyr::arrange(p) %>%
    dplyr::select(
      Rank, rsID, Chr,
      Pos = `hg19 Pos`, `Lo-siRF Locus` = Gene, p
    ) %>%
    dplyr::filter(p < 0.05)
  # write.csv(mapit_df, file.path(res_dir, "epistasis_comparisons", "mapit", "mapit_results.csv"), row.names = FALSE)
  tab <- vthemes::pretty_DT(mapit_df, rownames = FALSE)

  vthemes::subchunkify(
    tab,
    i = chunk_idx,
    caption = "'**Top epistatic variants from MAPIT, ranked by p-value.** We show only the epistatic variants with p-value < 0.05. rsID = rsID of the variant. Chr = Chromosome. Pos = Position on hg19. Lo-siRF Locus = assigned locus of the variant in the lo-siRF analysis. P-value = p-value from MAPIT.'"
  )
  chunk_idx <- chunk_idx + 1
}
```

### MAPIT+GSEA {.unnumbered .tabset .tabset-pills .tabset-circle}

<div class="panel panel-default padded-panel">
Since the results of MAPIT and lo-siRF are on different biological scales (i.e., SNVs versus loci), we also conducted a locus-level analysis using MAPIT [@crawford2017detecting] in conjunction with gene set enrichment analysis (GSEA) [@subramanian2005gene; @mootha2003pgc]. In this pipeline, we took the union of the top 10,000 GWAS hits from PLINK and BOLT-LMM, ran MAPIT on these SNVs to obtain a ranked list of SNVs according to their MAPIT p-values, and then ran GSEA on this ranked list to identify loci that are enriched at the top of this ranked list. The results of this MAPIT+GSEA analysis are provided below. We highlight three notable findings from this analysis: When ranking the loci by their normalized enrichment score,

- The top-ranked locus (by MAPIT+GSEA) was the intergenic region between *IGFBP3* and *LOC730338* (denoted as *IGFBP3;LOC730338*). *IGFBP3* encodes an insulin-like growth factor (IGF) binding protein, which is known to interact closely with *IGF1R* [@baxter2023signaling].
- The three interacting loci, prioritized by lo-siRF, were all ranked within the top 30 by the MAPIT+GSEA analysis (*LOC157273;TNKS*: rank 2; *IGF1R*: rank 7; *TTN*: rank 23; *CCDC141*: rank 28 out of 863 total loci). Their corresponding nominal p-value, FDR q-value, and FWER p-value were 0 (or < 1e-3 given that we used 1000 permutations). We note however that the magnitudes of these p-values are not directly comparable to those from lo-siRF given the different modeling assumptions.
- Aside from *IGFBP3;LOC730338*, *LOC157273;TNKS*, *IGF1R*, and *ADAMTS16*, other loci ranked in the top 10 by MAPIT+GSEA do not appear to be closely related to cardiac hypertrophy or cardiovascular diseases in the current literature. Moreover, except for *IGFBP3;LOC730338*, *LOC157273;TNKS*, *IGF1R*, *ADAMTS16*, and *FBXL17*, the remaining 5 loci in the top 10 were not among the top 1000 GWAS-filtered SNVs used in the original lo-siRF analysis. In other words, half of the top 10 loci identified by MAPIT+GSEA did not show strong marginal associations with LVMi. There is no expectation that variants involved in epistasis also have large marginal effects, so the lack of marginal associations alone is not particularly concerning. However, combined with the observation that the primary biological functions of these loci are not directly related to the heart, we are uncertain whether these epistatic findings would be validated in wet lab experiments.

Overall, in comparing lo-siRF with MAPIT+GSEA, we are encouraged to see that both methods identified loci like *LOC157273;TNKS*, *IGF1R*, *TTN*, and *CCDC141*. On the other hand, MAPIT+GSEA also identified loci deprioritized by lo-siRF, which turn out to lack biological relevance to cardiac diseases. These discrepancies may arise from the fundamental differences between the computational frameworks of lo-siRF and MAPIT+GSEA, which make direct comparisons challenging. First, MAPIT (or MAPIT+GSEA) outputs single SNVs or loci with marginal epistatic effects while lo-siRF directly identifies specific interactions between genetic loci. Secondly, lo-siRF and MAPIT are based on different modeling assumptions, so their definitions of epistasis are not directly comparable. Lastly, since performing GSEA after MAPIT has not been extensively studied, it is unclear whether the observed strengths and weaknesses of MAPIT+GSEA arise from MAPIT, GSEA, or their combination. It also remains uncertain whether using alternative SNV-to-locus grouping methods (in place of GSEA) would improve upon the epistatic loci rankings. 
</div>

```{r mapit-genes, results = "asis"}
if (!params$eval) {
  tab <- data.table::fread(
    file.path(
      res_dir, 
      "mapit_genes",
      "mapit_gene_gsea_classic.GseaPreranked.1726843556954",
      "gsea_report_for_na_pos_1726843556954.tsv"
    )
  ) |>
  tibble::as_tibble() |>
  dplyr::mutate(
    Rank = 1:dplyr::n()
  ) |>
  dplyr::select(
    Rank,
    Name = NAME,
    `# SNPs` = SIZE,
    `Enrichment Score` = ES,
    `Normalized Enrichment Score` = NES,
    `Nominal p-value` = `NOM p-val`,
    `FDR q-value` = `FDR q-val`,
    `FWER p-value` = `FWER p-val`
  ) |>
  dplyr::filter(
    `Nominal p-value` < 0.05
  ) |>
  vthemes::pretty_DT(
    rownames = FALSE,
    options = list(pageLength = 15, ordering = FALSE)
  )
  vthemes::subchunkify(
    tab,
    i = chunk_idx,
    caption = "'**Top epistatic loci from MAPIT+GSEA, ranked by normalized enrichment score.** We show only the epistatic loci with nominal p-value < 0.05.'"
  )
  chunk_idx <- chunk_idx + 1
}
```


## Marginal Gene-based Methods {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
In this section, we compare the lo-siRF-prioritized loci to those loci prioritized by other existing marginal set-based methods, such as MAGMA [@de2015magma] (Figures \@ref(fig:subchunk-88)-\@ref(fig:subchunk-90)) and SKAT-O [@lee2012optimal] (Figure \@ref(fig:subchunk-91)). Details regarding the methodology can be found in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1). 

Notably, we found that while lo-siRF prioritized *IGF1R* as one of its top recommended loci, SKAT-O and MAGMA did not rank *IGF1R* very highly. According to SKAT-O, *IGF1R* yielded a p-value of 1.0E-5 and was the 45th ranked locus, and using MAGMA, *IGF1R* yielded a p-value of 0.002 and was the 146th ranked locus. Given our extensive validation efforts demonstrating the importance of *IGF1R* for left ventricular hypertrophy, we view this as an interesting example where lo-siRF is able to identify important loci that are not necessarily prioritized by other methods, even for marginal effects.
</div>

```{r set-tests, results = "asis"}
if (!params$eval) {
  ## MAGMA
  magma_df <- data.table::fread(
    file.path(res_dir, "gwas_fuma", "magma", "magma.genes.out")
  ) %>%
    dplyr::select(
      Locus = SYMBOL, Chr = CHR, Start = START, Stop = STOP, Z = ZSTAT, p = P
    ) %>%
    dplyr::arrange(p) %>%
    dplyr::slice(1:200)

  display_image(
    file.path(res_dir, "gwas_fuma", "magma", "geneManhattan_FUMA_jobs457602.png"),
    chunk_idx = chunk_idx,
    caption = "'Manhattan plot of the gene-based test as computed by MAGMA. The genome-wide significance level is shown by the red dotted line.'"
  )
  chunk_idx <- chunk_idx + 1

  tab <- vthemes::pretty_DT(magma_df)
  vthemes::subchunkify(
    tab,
    i = chunk_idx,
    caption = "'**Top genes from MAGMA, ranked by p-value.** We show only the top 200 genes. Locus = Gene symbol. Chr = Chromosome. Start = Start position on hg19. Stop = Stop position on hg19. Z = Z-statistic from MAGMA. P-value = p-value from MAGMA.'"
  )
  chunk_idx <- chunk_idx + 1

  ## SKAT
  load(
    file.path(res_dir, "skat", "skat_ranknorm_results_maf0.05.Rdata")
  )
  tab <- skat_df %>%
    dplyr::filter(mode == "SKATO") %>%
    dplyr::arrange(pval) %>%
    dplyr::select(
      Locus = Gene, Chr, p = pval
    ) %>%
    vthemes::pretty_DT()
  vthemes::subchunkify(
    tab,
    i = chunk_idx,
    caption = "'**Top genes from SKAT-O, ranked by p-value.** Locus = Name of locus from lo-siRF analysis. Chr = Chromosome. P-value = p-value from SKAT-O.'"
  )
  chunk_idx <- chunk_idx + 1
}
```




# Lo-siRF Permutation Test Simulations {.tabset .tabset-fade .tabset-vmodern}

<div class="panel panel-default padded-panel">
In this section, we conduct several simulations to investigate the performance and demonstrate the validity of the permutation p-values from lo-siRF. 
</div>

## P-value Calibration Simulations {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
**Calibration of p-values under null model**: In the first set of simulations, we aimed to assess whether the permutation p-values are calibrated under the null model where no epistasis is present. To this end, we simulated binary responses $y$ from a null, purely-noise model, where $y \sim Bernoulli(1/2)$. Under this null simulation setup, we generated covariates $X$ from three different data-generating processes:

<ol type="A">
<li>Independent normal: $X_{ij} \stackrel{iid}{\sim} N(0, 1)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$.</li>
<li>Independent SNPs: $X_{ij}$ iid from $\{0, 1, 2\}$ with uniform probabilities for $i = 1, \ldots, n$ and $j = 1, \ldots, p$.</li>
<li>Real-world SNPs: The $n \times p$ matrix $X$ is taken to be the GWAS-filtered SNV matrix used in lo-siRF. In this case, we we randomly subsampled the features and the columns to meet the desired dimensions of $n$ and $p$.</li>
</ol>

Under each of these three settings, we ran the simulated data through a simplified lo-siRF pipeline, where we:

1. Split the data into two partition (2/3 training and 1/3 validation).
2. Fit siRF on the training data to obtain a list of candidate interactions.
3. For each of the interactions from step 2, evaluate the local stability importance scores and compute the associated permutation p-value using the validation data.

In simulations (A) and (B), we searched for interactions at the level of individual features (or SNVs), rather than a group of features (or genes). This was done to simplify the simulation setup and interpretation of results, given that the features were simulated independently. For simulation setup (C), we leveraged the same SNV-to-gene mapping used previously in lo-siRF and searched for interactions at the gene level.

We also note that the simplified lo-siRF pipeline omits the interaction filtering step based upon their siRF stability scores. This was done because the interaction filtering step excludes almost all non-relevant candidate interactions under the simulation setup. In particular, under the null simulation model, the interaction filtering (using the recommended thresholds from the original siRF paper [@kumbier2018refining]) removed all but 2 interactions across 200 simulated data replicates. Given our goal of assessing whether the permutation p-values are calibrated, we will focus our evaluation on the permutation p-values from all candidate interactions in this setting regardless of whether the interactions passed the stability filtering. 

**Results**: In Figure \@ref(fig:subchunk-92) below, we plot the proportion of rejected null hypotheses across different significance levels $\alpha$ under the aforementioned null simulation setup. We found that the observed type I error of our permutation test closely aligns with the significance level $\alpha$, thereby confirming that the p-values are calibrated. Moreover, these results hold across different sample sizes, number of features, and all three data-generating processes for X.
</div>

```{r perm-test-simulations, results = "asis"}
dgp_names <- c(
  "Gaussian X",
  "Real SNP X",
  "SNP X",
  "SNP X (order 1)",
  "SNP X (order 2)"
)
null_dgp_names <- c(
  "Gaussian X",
  "SNP X",
  "Real SNP X"
)
fit_results_ls <- list()
for (dgp_name in dgp_names) {
  num_features <- dplyr::case_when(
    dgp_name == "Real SNP X" ~ 500,
    TRUE ~ 100
  )
  fit_results_ls[[dgp_name]] <- readRDS(
    file.path(
      res_dir, "Permutation Validity", dgp_name, "Varying n", "fit_results.rds"
    )
  ) |> 
    dplyr::mutate(p = num_features)
}

# calibration plot
alphas <- seq(0, 1, by = 0.01)
plt_df <- dplyr::bind_rows(fit_results_ls, .id = "dgp_name") |>
  tidyr::unnest(result) |> 
  dplyr::group_by(dgp_name, n, p) |>
  dplyr::summarise(
    alpha = tibble::tibble(
      alpha = alphas,
      reject_prob = purrr::map_dbl(alphas, ~ mean(pval < .x))
    ) |>
      list(),
    .groups = "drop"
  ) |>
  tidyr::unnest(alpha) |>
  dplyr::mutate(
    setting = sprintf("Num. Samples = %s\nNum. Features = %s", n, p)
  )
plt_ls <- list()
for (dgp_name in null_dgp_names) {
  plt_ls[[dgp_name]] <- plt_df |>
    dplyr::filter(
      dgp_name == !!dgp_name
    ) |> 
    dplyr::mutate(
      dgp_name = stringr::str_replace(dgp_name, "SNP", "SNV")
    ) |> 
    ggplot2::ggplot() +
    ggplot2::aes(
      x = alpha, y = reject_prob
    ) +
    ggplot2::geom_line() +
    ggplot2::geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    ggplot2::facet_grid(dgp_name ~ setting) +
    ggplot2::labs(
      x = "Significance Level (alpha)",
      y = "Proportion of Rejected Null Hypotheses",
    ) +
    vthemes::theme_vmodern()
}
plt <- patchwork::wrap_plots(plt_ls, ncol = 1) +
  patchwork::plot_layout(axes = "collect")
subchunkify(
  plt,
  i = chunk_idx, 
  fig_height = 6.5,
  fig_width = 8,
  add_class = c("panel panel-default"),
  other_args = "out.width='100%'",
  caption = "'Proportion of rejected null hypotheses across varying significance levels under the null simulation response model are represented by the solid black line. Dotted line represents a perfectly calibrated hypothesis test (i.e., the y = x line). Each row corresponds to a different data-generating process for X, and each column corresponds to a different choice of number of samples. Results are aggregated across 200 simulation replicates.'"
)
chunk_idx <- chunk_idx + 1
cat("\n\n")
```

## Marginal and Interaction Effect Simulations {.unnumbered .tabset .tabset-pills .tabset-square}

<div class="panel panel-default padded-panel">
**Marginal and Interaction Effect Simulations**: Moving beyond this basic calibration investigation and the null, purely-noise model, we were more interested in studying the broader performance of lo-siRF and the permutation test. To this end, we conducted two additional simulations using boolean-type rules. We note that the boolean-type rules were chosen as they have been extensively studied in previous RF-based work and are thought to closely resemble the switch-like and stereospecific nature of interactions among biomolecules [@nelson2008lehninger; @kumbier2018refining; @behr2020learning]. Specifically, we first simulated the $X$ data from independent SNVs, where $X_{ij}$ is simulated iid from $\{0, 1, 2\}$ with uniform probabilities. Next, we simulated binary responses $y \mid X \sim Bernoulli(p(X))$ via:

<ol type="A" start="4">
<li>Marginal effect model: $p(X) = 0.4 \cdot \mathbf{1}\{X_1 > 0\} + 0.4 \cdot \mathbf{1}\{X_2 > 0\}$</li>
<ul>
<li> In this model, there are no true interactions.</li>
</ul>
<li>Interaction effect model: $p(X) = 0.4 \cdot \mathbf{1}\{X_1 > 0\} \mathbf{1}\{X_2 > 0\} + 0.4 \cdot \mathbf{1}\{X_3 > 0\} \mathbf{1}\{X_4 > 0\}$</li>
<ul>
<li> In this model, the true interactions are $X_1:X_2$ and $X_3:X_4$.</li>
</ul>
</ol>

We set the number of features to 100 and varied the number of samples across n = 1000, 1500, 2000. For each simulation model and choice of sample size, we ran 200 simulation replicates.

**Results**: The simulation results are summarized in Table \@ref(fig:subchunk-93) below. In what follows, we define "partial interaction" as a candidate interaction that involves at least one true signal feature from an interaction that appeared in the simulation model --- specifically, at least one of $X_1$ or $X_2$ in the marginal effect simulation (D), or at least one of $X_1$, $X_2$, $X_3$, or $X_4$ in the interaction effect simulation (E). We define "true interaction" as a candidate interaction that involves both $X_1$ and $X_2$ or both $X_3$ and $X_4$ in the interaction effect simulation. 

The main takeaways from these simulations are:

1. Under both the marginal effect and interaction effect scenarios, the candidate interactions from siRF almost always include at least one true signal feature (and equivalently, is a partial interaction).

    - Specifically, under both the marginal effect (D) and interaction effect (E) scenarios, over 99% of the candidate interactions identified by siRF were found to be "partial interactions" (the 5th column in Table \@ref(fig:subchunk-93)). Moreover, under the interaction effect (E) simulation, over 38% of the candidate interactions identified by siRF were "true interactions" (the 4th column in Table \@ref(fig:subchunk-93)). 

2. The interaction filtering step within siRF, where we filter out candidate interactions that do not meet the siRF stability thresholds, is both necessary and effective at distinguishing between interactions and marginal effects.

    - First, the siRF interaction filtering is necessary to differentiate between interactions and marginal effects since the permutation test itself is not designed to distinguish between "true interactions" and "partial interactions."
    
        - Because the permutation test within lo-siRF formally assesses whether the local stability importance of an interaction differs between the high and low LVMi groups, any candidate interaction involving at least one true signal feature (i.e., any partial interaction) is expected to show a difference between the two groups and result in a significant p-value. Our results support this expectation. In both the interaction and marginal simulation models, an overwhelming majority (>99%) of the permutation tests led to a significant p-value (p < 0.05) (the last column in Table \@ref(fig:subchunk-93)). This is expected since each of these tested interactions was a partial interaction and involved at least one true signal feature (the 8th column in Table \@ref(fig:subchunk-93)).
        - We thus relied on the siRF interaction filtering step to distinguish between marginal and true interaction effects. This siRF interaction filtering step is detailed in the Methods section *Lo-siRF step 4.3: Permutation test for difference in local stability importance scores* in [@wang2023epistasis](https://www.medrxiv.org/content/10.1101/2023.11.06.23297858v1). Specifically, the feature selection dependence threshold (discussed in greater detail in the original siRF paper [@kumbier2018refining]) was originally designed to differentiate between additive (e.g., marginal) and non-additive (e.g., interaction) effects. This metric showed strong performance in empirical simulations from the original siRF paper.
        
    - As in the original siRF paper, the siRF interaction filtering demonstrated similarly strong effectiveness for distinguishing between true interactions and true marginal effects under our simulation setup.
    
        - In particular, under the interaction simulation model, the siRF interaction filtering step reduced the number of interactions from over 3000 candidates (the 3rd column in Table \@ref(fig:subchunk-93)), of which $\sim$ 40% were true interactions (4th column), to ~1300 candidates (6th column). Of these remaining $\sim$ 1300 candidates, 87%, 99%, and 100% were true interactions in the n = 1000, 1500, and 2000 simulations, respectively (the 7th column). 
        - Furthermore, under the marginal simulation model, the siRF interaction filtering step reduced the number of interactions from over 4400 candidates (the 3rd column in Table \@ref(fig:subchunk-93)) to fewer than 100 candidates (4th column). That is, less than 2% of the candidate interactions remain after the siRF interaction filtering step in the marginal effect simulation setting.
        
    - Thus, even though the permutation test rejects most tests involving partial interactions, in lo-siRF, the permutation test occurs after the siRF interaction filtering, which we have shown to effectively eliminate the vast majority of non-signal or partial interactions.
</div>

```{r perm-test-simulations-2, results = "asis"}
# marginal and interaction simulations
left_join_with_na <- function(x, y, by) {
  if (!is.null(x) && !is.null(y)) {
    dplyr::left_join(x, y, by = by)
  } else if (!is.null(x)) {
    x
  } else {
    NULL
  }
}

tab <- dplyr::bind_rows(fit_results_ls, .id = "dgp_name") |>
  dplyr::filter(
    stringr::str_detect(.dgp_name, "order")
  ) |> 
  dplyr::rowwise() |> 
  dplyr::mutate(
    result = list(left_join_with_na(int_df, result, by = "int"))
  ) |> 
  dplyr::ungroup() |> 
  tidyr::unnest(result) |> 
  dplyr::mutate(
    int_vec = stringr::str_remove_all(int, "[[+/-]]") |> 
      stringr::str_split("_"),
    true_feat = purrr::map2_lgl(
      int_vec, .dgp_name,
      function(.x, .y) {
        if (.y == "SNP X (order 2)") {
          ("X1" %in% .x) || ("X2" %in% .x) || ("X3" %in% .x) || ("X4" %in% .x)
        } else if (.y == "SNP X (order 1)") {
          ("X1" %in% .x) || ("X2" %in% .x)
        } else {
          FALSE
        }
      }
    ),
    true_int = purrr::map2_lgl(
      int_vec, .dgp_name,
      function(.x, .y) {
        if (.y == "SNP X (order 2)") {
          (("X1" %in% .x) && ("X2" %in% .x)) ||
            (("X3" %in% .x) && ("X4" %in% .x))
        } else {
          FALSE
        }
      }
    ),
    tested_int = !is.na(pval)
  ) |> 
  dplyr::group_by(
    dgp_name, n, p
  ) |>
  dplyr::summarise(
    `# Candidate Interactions from siRF` = dplyr::n(),
    `Prop. (Num.) of True Interactions in Candidate Set` = 
      sprintf("%.3f (%s)", sum(true_int) / dplyr::n(), sum(true_int)),
    `Prop. (Num.) of Partial Interactions in Candidate Set` = 
      sprintf("%.3f (%s)", sum(true_feat) / dplyr::n(), sum(true_feat)),
    `Prop. (Num.) of Stable Interactions in Candidate Set` = 
      sprintf("%.3f (%s)", sum(tested_int) / dplyr::n(), sum(tested_int)),
    `Prop. (Num.) of True Interactions in Stable Set` = 
      sprintf(
        "%.2f (%s)", 
        sum(true_int & tested_int) / sum(tested_int),
        sum(true_int & tested_int)
      ),
    `Prop. (Num.) of Partial Interactions in Stable Set` = 
      sprintf(
        "%.2f (%s)", 
        sum(true_feat & tested_int) / sum(tested_int),
        sum(true_feat & tested_int)
      ),
    `# Interactions Rejected (p < 0.05)` = 
      sprintf(
        "%s (out of %s tests)", 
        sum(pval <= 0.05, na.rm = TRUE), 
        sum(!is.na(pval))
      ),
    .groups = "drop"
  ) |> 
  dplyr::mutate(
    dgp_name = dplyr::case_when(
      dgp_name == "SNP X (order 1)" ~ "Marginal Model",
      dgp_name == "SNP X (order 2)" ~ "Interaction Model"
    ) |> 
      factor(levels = c("Interaction Model", "Marginal Model"))
  ) |> 
  dplyr::arrange(dgp_name) |> 
  dplyr::rename(
    "Simulation Name" = dgp_name,
    `Num. Samples` = n
  ) |> 
  dplyr::select(-p)
subchunkify(
  vthemes::pretty_DT(
    tab, rownames = FALSE, options = list(dom = "t", ordering = FALSE)
  ), 
  i = chunk_idx,
  caption = "'Summary of marginal and interaction effect simulations. # Candidate Interactions from siRF: The total number of interactions identified by siRF before applying the stability filtering step. Prop. (Num.) of True Interactions in Candidate Set: The proportion (or number) of true interactions out of the total number of candidate interactions from siRF. Prop. (Num.) of Partial Interactions in Candidate Set: The proportion (or number) of partial interactions out of the total number of candidate interactions from siRF. Prop. (Num.) of Stable Interactions in Candidate Set: The proportion (or number) of interactions that passed the stability filter step out of the total number of candidate interactions from siRF. Prop. (Num.) of True Interactions in Stable Set: The proportion (or number) of true interactions out of the total number of interactions that passed the stability filter. Prop. (Num.) of Partial Interactions in Stable Set: The proportion (or number) of partial interactions out of the total number of interactions that passed the stability filter. # Interactions Rejected (p < 0.05): The number of stable (or tested) interactions with a permutation test p-value below 0.05.'"
)
chunk_idx <- chunk_idx + 1
cat("\n\n")
```


# Final Remarks

<div class="panel panel-default padded-panel">
We reiterate that lo-siRF was originally developed as a first-stage recommendation (or hypothesis generation) tool within a broader pipeline in order to prioritize genetic loci and interactions between loci for downstream analyses and experimental validation. Importantly, the output of lo-siRF is a prioritization/ranking order, not a proper statistical test of significance. We rely on follow-up investigations and in this study rigorous gene-silencing experiments in order to validate the prioritized genetic loci and interaction between loci. With that, many of the modeling decisions and human judgment calls were made with this original use case in mind. For other use cases or problems, it is highly likely that different choices may be better suited. We also acknowledge that though many stability checks were conducted to help ensure that our conclusions are stable across different reasonable data and/or modeling perturbations, these stability checks are inevitably constrained by computational limitations, and additional stability analyses can be conducted. We hope that by documenting our modeling choices, stability analyses, and providing insights into our motivations/reasons, this may help to both clarify the limitations of our current work and to facilitate the trustworthy (veridical) use of lo-siRF (and variants thereof) in other scientific problems.
</div>

# Bibliography

<div class="panel panel-default padded-panel">
<div id="refs"></div>
</div>
