Systems and Methods for Analyzing Genetic Data for Assessment of Gene
Regulatory Activity in Disease Prediction, Diagnosis and Treatment
Princeton Docket # 21-3806
DNA sequence underlies how the genome and its variations shape chromatin organization, regulate gene expression, and impact traits and diseases. Epigenomic profiling efforts have enabled large-scale identification of regulatory elements with chromatin states, yet methods are lacking to systematically predict regulatory activities from any sequence, and thus to predict the effects of any variant on these activities.
Researchers in Computer Science and the Lewis-Sigler Institute for Integrative Genomics, Princeton University, the Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center and The Flatiron Institute, Simons Foundation have addressed this challenge with Sei, a new framework for integrating human genetics data with sequence information to discover the regulatory basis of human traits and diseases.
Using a new deep learning sequence model, Sei predicts a compendium of 21,907 chromatin profiles across >1,300 cell lines and tissues, the most comprehensive to date, and integrates these predictions into 40 sequence classes. Sequence classes enable the prediction of quantitative variant effects on regulatory activities, such as loss or gain of cell-type-specific enhancer function. Sequence class predictions are supported by experimental data, including tissue-specific gene expression, expression quantitative trait loci (eQTLs), and evolutionary constraints based on population allele frequencies. For proof of concept, sequence classes were applied to human genetics data. Sequence classes uniquely provide a non-overlapping partitioning of genome-wide association study (GWAS) heritability by tissue-specific regulatory activity categories, which were used to characterize the regulatory architecture of 47 traits and diseases from UK Biobank.
Furthermore, the predicted directional alterations of sequence class activities suggest specific hypotheses for mechanisms of individual regulatory pathogenic mutations. Sei provides interpretations of human variation in the noncoding genome, identifying regulatory states de novo from sequence and the functional impact of mutations to those regulatory functions that may alter gene expression and lead to changes in health and disease. The molecular impact predicted by Sei can be used to make diagnoses and inform patient risk factors and treatment options.
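The idea of a quantitative, directional variant effect described above can be illustrated with a minimal sketch: score the reference sequence and the variant-carrying sequence for each sequence class, then take the difference. This is not the Sei implementation. The function names (apply_snv, variant_effect, toy_score_fn) and the two toy class names are hypothetical, and the scoring function is a nucleotide-composition placeholder standing in for a trained model.

```python
from typing import Callable, Dict

ClassScores = Dict[str, float]


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the base at 0-based index pos changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_effect(
    ref_seq: str,
    pos: int,
    ref: str,
    alt: str,
    score_fn: Callable[[str], ClassScores],
) -> ClassScores:
    """Per-class effect = score(alternate sequence) - score(reference sequence).

    A positive value means the variant is predicted to increase that class's
    activity; a negative value means a predicted decrease.
    """
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    ref_scores = score_fn(ref_seq)
    alt_scores = score_fn(alt_seq)
    return {name: alt_scores[name] - ref_scores[name] for name in ref_scores}


def toy_score_fn(seq: str) -> ClassScores:
    """Placeholder scorer (ASSUMPTION): base composition, NOT a Sei prediction."""
    n = len(seq)
    return {
        "toy_class_gc": (seq.count("G") + seq.count("C")) / n,
        "toy_class_at": (seq.count("A") + seq.count("T")) / n,
    }


# Toy example: a G>A substitution at index 2 of a short made-up sequence.
effects = variant_effect("ACGTACGT", 2, "G", "A", toy_score_fn)
```

In practice the placeholder scorer would be replaced by a model that maps a sequence to sequence class scores; the sign of each difference is what carries the "loss" or "gain" interpretation.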
State of Development
The Sei framework is complete, with the Sei deep learning model fully trained and evaluated.
As part of the validation of the framework, the researchers systematically compared Sei predictions to expression QTLs associated with changes in gene expression in specific tissues. They demonstrated that Sei predictions for second order sequence class activity significantly correlate with the direction of expression change of tissue eQTLs.
Furthermore, Sei predictions are consistent with the expected evolutionary constraints on highly impactful alleles: variants predicted to strongly perturb regulatory sequence classes had significantly lower allele frequencies than variants that weakly perturb these classes.
Finally, the Sei framework predicted the regulatory mechanisms of >100 known human disease mutations; for the majority of these mutations, the molecular mechanism of regulatory disruption was previously unknown. The framework was also applied to analyze genome-wide regulatory signals for all 47 diseases and traits from UK Biobank GWAS and revealed a large concentration of GWAS heritability in trait-relevant, tissue-specific enhancer sequence classes.
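The allele-frequency comparison described above (strongly perturbing variants having lower allele frequencies than weakly perturbing ones) can be summarized with a simple rank-based effect size. The sketch below is one possible way to express that comparison, not the analysis the researchers performed. The function name prob_lower is hypothetical, and the allele frequencies are made-up illustrative numbers, not results.

```python
from typing import Sequence


def prob_lower(strong: Sequence[float], weak: Sequence[float]) -> float:
    """Probability that a randomly chosen value from `strong` is lower than a
    randomly chosen value from `weak`, counting ties as one half.

    This equals the Mann-Whitney U statistic divided by the number of pairs.
    A value of 0.5 means no tendency either way; values near 1.0 mean the
    `strong` group tends to have lower values.
    """
    if not strong or not weak:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for s in strong:
        for w in weak:
            if s < w:
                wins += 1.0
            elif s == w:
                wins += 0.5
    return wins / (len(strong) * len(weak))


# Made-up allele frequencies for illustration only (ASSUMPTION, not data).
strong_af = [0.001, 0.002, 0.01]   # variants with large predicted effects
weak_af = [0.05, 0.005, 0.2]       # variants with small predicted effects

effect_size = prob_lower(strong_af, weak_af)
```

An effect size alone says nothing about significance; a real analysis would pair it with a statistical test and account for confounders such as local mutation rate.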
Patent protection is pending.
Princeton is currently seeking commercial partners for the further development and commercialization of this opportunity.
Olga Troyanskaya is a professor at the Lewis-Sigler Institute for Integrative Genomics and the Department of Computer Science at Princeton University, where she has been on the faculty since 2003. In 2014 she became the deputy director of Genomics at the Center for Computational Biology at the Flatiron Institute, a part of the Simons Foundation in NYC. She holds a Ph.D. in Biomedical Informatics from Stanford University, has been honored as one of the top young technology innovators by the MIT Technology Review, and is a recipient of the Sloan Research Fellowship, the National Science Foundation CAREER award, the Overton award from the International Society for Computational Biology, and the Ira Herskowitz award from the Genetics Society of America.
Jian Zhou is an assistant professor at the Lyda Hill Department of Bioinformatics at the University of Texas Southwestern Medical Center, which he joined as a faculty member in 2019. He received his Ph.D. in Quantitative and Computational Biology from Princeton University. He is a Lupe Murchison Foundation Scholar in Medical Research and a CPRIT Scholar in Cancer Research.
Kathleen Chen is a graduate student in the Department of Computer Science at Princeton University. She worked as a data scientist in the Center for Computational Biology at the Flatiron Institute from 2017 to 2021. She is currently an NSF Graduate Research Fellow and was awarded the Gordon Y.S. Wu Engineering Fellowship from Princeton in 2021.
Princeton University Office of Technology Licensing
(609) 258-7256 • firstname.lastname@example.org