Genotype Pattern Mining for Digenic Traits course
The Rockefeller University | 1230 York Avenue, New York, NY 10065 – 8 March 2024
Laboratory of Statistical Genetics
Course on Genotype Pattern Mining for Digenic Traits
This course was originally planned for February 21-25, 2022, but had to be postponed for two reasons: (1) the COVID-19 pandemic and (2) a lack of reliable software. Both problems have since been overcome, and a course date has been announced.
Date: 29 April - 3 May 2024. This course has been canceled.
Location: Rockefeller University, New York
Frequent Pattern Mining
Various examples of the joint action of two variants (SNPs) have been published under the heading of digenic inheritance [1]. There has been much debate about the definition of epistasis and how to detect it [2]. Here, we simply consider two genotypes, one each from two different SNPs, possibly located on different chromosomes. We refer to such a pair of genotypes as a genotype pattern; classically, this has been called a diplotype, in analogy to a haplotype, which refers to alleles. We want to find genotype patterns whose frequencies differ between cases and controls. Various approaches have been taken to tackle this problem for a genome-wide set of SNPs, but shortcuts were often necessary to reduce the search space to a manageable size, and current implementations generally can handle only very small datasets before running out of memory (see our recent reviews [3,4]).
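To make the idea of a genotype pattern concrete, here is a minimal Python sketch that counts each pattern in cases and controls and tests one pattern with a 2x2 chi-square test. It is an illustration only, not the method implemented in GPM. The 0/1/2 genotype coding, the toy data, and the function names `pattern_counts` and `chi2_1df` are assumptions made for this example.

```python
from itertools import product
from math import erfc, sqrt

def pattern_counts(geno_a, geno_b, is_case):
    """Count each (genotype at SNP A, genotype at SNP B) pattern as [cases, controls]."""
    counts = {p: [0, 0] for p in product(range(3), repeat=2)}
    for ga, gb, case in zip(geno_a, geno_b, is_case):
        counts[(ga, gb)][0 if case else 1] += 1
    return counts

def chi2_1df(a, b, c, d):
    """Pearson chi-square (1 df) and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 df, P(chi-square > stat) equals erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))

# Hypothetical toy data: genotypes coded 0, 1, 2; six cases followed by six controls.
geno_a = [2, 2, 2, 2, 0, 1, 0, 1, 0, 0, 1, 0]
geno_b = [1, 1, 1, 1, 0, 1, 2, 0, 1, 0, 0, 2]
is_case = [True] * 6 + [False] * 6

counts = pattern_counts(geno_a, geno_b, is_case)
n_cases = sum(is_case)
n_controls = len(is_case) - n_cases

# Test the pattern (2, 1): carriers versus non-carriers, cases versus controls.
with_case, with_ctrl = counts[(2, 1)]
stat, p = chi2_1df(with_case, with_ctrl, n_cases - with_case, n_controls - with_ctrl)
print(counts[(2, 1)], round(stat, 3), round(p, 4))
```

With only twelve individuals the chi-square approximation is rough; an exact or permutation-based test would be more appropriate for counts this small.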
As more and more desktop workstations become available with large numbers of CPUs (threads), we developed our own approach, GPM, which scans all possible pairs of genotypes across the genome and distributes the workload over CPUs. In this course, you will be introduced to the two components of GPM, the programs Vpairs and Gpairs, and learn how to set program parameters efficiently for your specific needs. A follow-up approach, DNT, just published [11], can find highly interactive, significant variants by counting the number of genotype patterns in which each variant occurs.
In our analyses of various datasets, evidence is accumulating that working under a digenic model produces many more significant results than conventional GWAS, or yields significant results where GWAS finds none. This observation is to be expected because, after all, gene networks are the rule rather than the exception.
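The two ideas in this paragraph, splitting an exhaustive pair scan into per-worker chunks and counting how many patterns each variant takes part in, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the code of Vpairs, Gpairs, or DNT: the function names, the round-robin split, and the rs identifiers are hypothetical, and each SNP is assumed to have three genotypes, so each SNP pair yields nine genotype patterns.

```python
from collections import Counter
from itertools import combinations

def split_pairs(n_snps, n_workers):
    """Deal all unordered SNP pairs round-robin into one chunk per worker."""
    chunks = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_snps), 2)):
        chunks[k % n_workers].append(pair)
    return chunks

def pattern_degree(significant_patterns):
    """Count, for each SNP, the number of listed patterns it takes part in."""
    degree = Counter()
    for (snp_a, _genotype_a), (snp_b, _genotype_b) in significant_patterns:
        degree[snp_a] += 1
        degree[snp_b] += 1
    return degree

# 5 SNPs give 5 * 4 / 2 = 10 pairs, i.e. 90 genotype patterns at 9 per pair.
chunks = split_pairs(n_snps=5, n_workers=2)
n_pairs = sum(len(chunk) for chunk in chunks)

# Hypothetical list of significant patterns: ((SNP, genotype), (SNP, genotype)).
patterns = [
    (("rs1", 2), ("rs2", 1)),
    (("rs1", 0), ("rs3", 2)),
    (("rs1", 1), ("rs4", 1)),
]
degree = pattern_degree(patterns)
print(n_pairs, [len(chunk) for chunk in chunks], degree.most_common(1))
```

Each chunk could then be handed to a separate process or thread. The number of pairs grows quadratically with the number of SNPs, which is why spreading the scan over many CPUs matters for genome-wide data.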
Course Description
In this course, we will largely focus on detecting combined effects of two DNA variants on disease. The course is planned for in-person attendance. Details will be forthcoming. If you are interested in attending, please send me an email so I can put you on our attendance list. Please note that you cannot enter the Rockefeller campus without being vaccinated against COVID-19.
On the basis of example datasets provided by us, you will be introduced to the following tasks:
- Installing the Vpairs and Gpairs programs in Windows and Linux.
- Doing runs with different parameter settings; reading output into Excel and LibreOffice (the latter is available in Linux). This will mostly be done in a Command window (cmd, Windows) / terminal (Linux).
- Performing power calculations based on program output [5].
- Based on output of Gpairs, identifying highly connected SNPs that by themselves may not be significant, and computing p-values based on permutation analysis (DNT program) [11].
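As a generic illustration of the permutation analysis mentioned in the last task, the following Python sketch shuffles case/control labels and compares a simple statistic with its observed value. It does not reproduce the DNT program; the statistic (difference in carrier frequency), the function names, the seed, and the toy data are all assumptions made for this example.

```python
import random

def carrier_diff(carrier, is_case):
    """Absolute difference in carrier frequency between cases and controls."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    case_carriers = sum(c for c, y in zip(carrier, is_case) if y)
    ctrl_carriers = sum(carrier) - case_carriers
    return abs(case_carriers / n_case - ctrl_carriers / n_ctrl)

def permutation_p(carrier, is_case, n_perm=999, seed=1):
    """Empirical p-value from shuffling the case/control labels n_perm times."""
    rng = random.Random(seed)
    observed = carrier_diff(carrier, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if carrier_diff(carrier, labels) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator counts the observed data as one
    # permutation, so the p-value can never be exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: all 20 cases carry the pattern, none of the 20 controls do.
carrier = [1] * 20 + [0] * 20
is_case = [True] * 20 + [False] * 20
p = permutation_p(carrier, is_case)
print(p)
```

The smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations has to be chosen with the required significance threshold in mind.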
Tentative Outline
The course will be taught by Drs. Jurg Ott (Rockefeller University, New York), Taesung Park, and Qingrun Zhang, with a guest lecture by Dr. Suzanne Leal, Columbia University, New York. Taesung Park is Professor of Statistics at Seoul National University in Korea and has published in the area of pattern mining. Qingrun Zhang is an Assistant Professor at the University of Calgary, Canada; she was instrumental in the HapMap project in China and is the author of the AprioriGWAS approach to genotype pattern mining.
Costs for the course are $850 for academics and $1,800 for non-academics. An initial deposit of $100 will be required, refundable until two months before the course starts. Payment details will be provided shortly. At this point, no money is due, but you may want to reserve your spot on the participant list (first come, first served) by sending me an email. You will be notified when a deposit is due.
As in previous courses, there will be lectures followed by exercises. Most exercises can be done in Windows, but programs run more efficiently in Linux. We will provide accounts on our Linux servers. Course participants are expected to bring their own Windows laptops, perhaps with dual boot installed so the laptop can be booted up in Windows or Linux (Kubuntu preferred). We will also provide you with instructions on how to run Linux from within Windows ("Ubuntu on Windows").
Monday
- Welcome
- Statistical principles in hypothesis testing (J. Ott, lecture)
- Principles of frequent pattern mining (FPM) or frequent itemset mining (FIM) (T. Park, lecture)
Tuesday
- Identification of highly penetrant disease variants (S. Leal, lecture)
- Implementations of FPM methods (T. Park, lecture and exercises)
Wednesday
- FPM methods in genetics, permutation testing, plink program for genetic databases (J. Ott, lecture)
- Exhaustive search for genotype patterns, GPM (J. Ott, lecture and exercises)
Thursday
- Statistical evaluation of significance (p-values) and discovery (q-values, false discovery rates) (J. Ott, lecture)
- GPM programs for Linux (J. Ott, exercises)
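For the Thursday topic of significance versus discovery, a short Python sketch of the Benjamini-Hochberg procedure may help fix ideas. It computes BH-adjusted p-values, which are often reported as q-values; the course may well use a different estimator, and the function name and example p-values are assumptions made for this illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical p-values for five genotype patterns.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = bh_qvalues(pvals)
print([round(q, 5) for q in qvals])
```

Declaring as discoveries all patterns with an adjusted value at or below 0.05 controls the expected proportion of false discoveries at 5% under the usual BH assumptions. In this toy example the third and fourth patterns have p-values below 0.05 but adjusted values just above it.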
Friday
Genotype pattern analysis currently works on genotypes but does not allow for concomitant variables like age and environmental effects. Thus, we devote Friday to recent developments in characterizing interactions via transcriptome data (Q. Zhang).
Stabilized COre Gene and Pathway Election (SCOPE) – a tool for characterizing interactions via transcriptome data
Approaches for systematically characterizing interactions via transcriptomic data typically fall into two categories: (i) co-expression network analyses focusing on correlations between genes, and (ii) linear regressions to select multiple genes jointly. Both approaches suffer from stability problems: slight changes in parameterization or dataset can substantially alter the results.
In response to this challenge, we introduced Stabilized COre gene and Pathway Election (SCOPE) [6], a tool that integrates the bootstrapped least absolute shrinkage and selection operator (LASSO) with co-expression analysis. This integration leads to robust outcomes. SCOPE empowers researchers to conduct stable investigations into complex interactions using transcriptome data.
Integrating multi-omics data with GWAS
Our recent methodological advances allow other multi-scale data (omics, imaging, and sub-clinical traits) to be used in the process of model training. Importantly, the other data do not have to be accessed in the main genotype-phenotype cohort, as our unique Data-bridge Framework extracts and integrates their underlying genetics. Our work was originally inspired by transcriptome-wide association studies (TWAS) that use gene expressions as mediators to improve genotype-phenotype association mapping.
TWAS should be interpreted as a feature selection and feature aggregation method in statistical learning, so we developed the following enhancements: (1) mkTWAS [7], which uses marginal effects for feature selection and a kernel-based method for feature aggregation; (2) IMAS (Image mediated Association Study) [8], which uses imaging data to discover genetic variants underlying brain disorders; (3) edLMM [9] and rvTWAS [10], which detect low-effect variants even if only samples of moderate size are available.
References
[1] Schaffer AA (2013) Digenic inheritance in medical genetics. J Med Genet 50, 641-652. doi: 10.1136/jmedgenet-2013-101713
[2] Wang X et al. (2010) Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet 12, 74
[3] Okazaki A, Ott J (2022) Machine learning approaches to explore digenic inheritance. Trends Genet. doi: 10.1016/j.tig.2022.04.009
[4] Ott J, Park T (2022) Overview of frequent pattern mining. Genomics Inform 20, e39. doi: 10.5808/gi.22074
[5] Zhang Q et al. (2023) A multi-threaded approach to genotype pattern mining for detecting digenic disease genes. Front Genet 14, 1222517. doi: 10.3389/fgene.2023.1222517
[6] Kossinna P, Cai W, Lu X, Shemanko CS, Zhang Q. Stabilized COre gene and Pathway Election uncovers pan-cancer shared pathways and a cancer-specific driver. Sci Adv. 2022 Dec 21;8(51):eabo2846
[7] Cao C, Kossinna P, Kwok D, Li Q, He J, Su L, et al. Disentangling genetic feature selection and aggregation in transcriptome-wide association studies. Genetics. 2022 Feb 4;220(2).
[8] IMAS (Under revision for Am J Hum Genet) https://www.biorxiv.org/content/10.1101/2023.06.16.545326v1
[9] edLMM (Under revision for Genetics) https://www.biorxiv.org/content/10.1101/2023.07.13.548939v1
[10] rvTWAS (Under revision for Genetics) https://www.biorxiv.org/content/10.1101/2023.07.16.549227v1
[11] Wang G, Ott J. (2023) Digenic analysis finds highly interactive genetic variants underlying polygenic traits. Medical Research Archives 11. doi: 10.18103/mra.v11i10.4604