Single Cell Genomics Tools

We propose a new deep generative model framework, named SAILER, for analysing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors such as read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure that the learned representations are independent of the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and more biologically meaningful representations of cells than other methods. Its noise-free cell embeddings bring significant benefits to downstream analyses: clustering and imputation based on SAILER yield improvements of 6.9% and 18.5%, respectively, over existing methods. Moreover, because no matrix factorization is involved, SAILER can easily scale to process millions of cells. We implemented SAILER as a software package, freely available to all for large-scale scATAC-seq data analysis.

Mapping distal regulatory elements, such as enhancers, is the cornerstone for investigating genome evolution, understanding critical biological functions, and ultimately elucidating how genetic variations may influence diseases. Previous enhancer prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have operationalized enhancer discovery as a binary classification problem without accurate enhancer boundary detection, producing low-resolution annotations with redundant regions and reducing the statistical power for downstream analyses (e.g., causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called DECODE. First, we employed direct enhancer activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network classifier for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution (~500 bp), we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (at 10 bp resolution) using gradient-weighted class activation mapping.

scATAC-seq is a powerful approach for characterizing cell-type-specific regulatory landscapes. However, it is difficult to benchmark the performance of various scATAC-seq analysis methods without having a set of gold-standard cell types a priori. To simulate scATAC-seq experiments with known cell type labels, we introduce an efficient and scalable scATAC-seq simulation method that down-samples bulk-tissue ATAC-seq data in an organized fashion. Our simulation protocol creates a homogeneous signal-to-noise ratio in a single scATAC-seq experiment by integrating different levels of background noise for separate bulk-tissue experiments, and it samples independently twice without replacement to account for the diploid genome. Our implementation in C++ allows millions of cells to be simulated in less than an hour on a laptop computer, as it uses an efficient weighted reservoir sampling algorithm and is highly parallelizable with OpenMP.
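As an illustration of the weighted reservoir sampling idea mentioned above, here is a minimal Python sketch of the Efraimidis-Spirakis A-Res algorithm, which draws k items without replacement in a single pass, favoring items with larger weights. It is not the tool's C++ implementation. The function name, the peak labels, and the use of bulk read counts as weights are assumptions made for this example, and the noise integration and diploid (two-draw) steps are not modeled.

```python
import heapq
import random


def weighted_reservoir_sample(items, weights, k, rng=None):
    """Draw up to k items without replacement, favoring larger weights (A-Res).

    Each item with positive weight w gets a key u ** (1 / w), where u is
    uniform on [0, 1). The k items with the largest keys form the sample.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, index, item); the smallest key sits at heap[0]
    for index, (item, weight) in enumerate(zip(items, weights)):
        if weight <= 0:
            continue  # items with zero weight can never be drawn
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            # The new key beats the weakest key in the reservoir: swap it in.
            heapq.heapreplace(heap, (key, index, item))
    return [item for _, _, item in heap]


# Hypothetical input: peak labels weighted by bulk read counts.
peaks = ["peak_1", "peak_2", "peak_3", "peak_4", "peak_5"]
bulk_counts = [10, 0, 5, 1, 20]
cell = weighted_reservoir_sample(peaks, bulk_counts, k=3, rng=random.Random(0))
```

Because only the current reservoir of k keys is held in memory, the input can be streamed, and independent cells can be simulated in parallel with separate random number generators.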

Functional Genomics Tools

Overall, the ENCODEC resource consists of (1) comprehensive networks that allow us to see global alterations in network rewiring and regulatory hierarchy; (2) an annotated catalogue of cell types that allows us to place oncogenic changes relative to normal and stem cells and accurately model tumor background mutation rate (BMR); and (3) compact noncoding annotations and extended gene definitions that can potentially increase statistical power to interpret genome variation (both germline and somatic) and gene expression changes. Practically, the resource consists of a set of annotation files and computer code available online.

RADAR includes a comprehensive RBP regulome built by integrating the full catalog of 318 eCLIP, 76 Bind-n-Seq, and 472 RNA-Seq experiments after RBP knockdown. Based on this regulome, it uses an entropy-based scoring scheme to investigate variant impact in such regions. It first combines RBP binding, cross-species and cross-population conservation, network, and motif features with polymorphism data to quantify variant impact as a universal score. Then, it allows tissue- or disease-specific inputs, such as patient expression, somatic mutation profiles, and gene rank lists, to further highlight relevant variants.

LARVA

LARVA is a computational framework designed to facilitate the study of noncoding variants. It addresses issues that have made it difficult to derive an accurate model of the background mutation rates of noncoding elements in cancer genomes. These issues include limited noncoding functional annotation, great mutation heterogeneity, and potential mutation correlations between neighboring sites. As a result, there is substantial overdispersion in the mutation count of noncoding elements.
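To make the notion of overdispersion concrete, the following sketch computes the variance-to-mean ratio (dispersion index) of mutation counts across elements. Under a Poisson model this ratio is close to 1, so values well above 1 indicate overdispersion. This is only a diagnostic illustration, not LARVA's model. The function name and the example counts are invented for this sketch.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the mean of the counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical mutation counts for ten noncoding elements:
# most elements carry no mutations, while two carry many.
element_counts = [0, 0, 0, 0, 10, 0, 0, 0, 0, 10]
ratio = dispersion_index(element_counts)  # well above 1, i.e. overdispersed
```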

MOAT

MOAT is a computational system for identifying significant mutation burdens in genomic elements with an empirical, nonparametric method. Taking a set of variant calls and a set of annotations, MOAT calculates which annotations have observed variant counts that are substantially elevated with respect to a distribution of expected variant counts determined by permutation of the input data. To produce this expected distribution, MOAT offers two types of permutation algorithm: one that permutes the locations of annotations (MOAT-a), and one that permutes the locations of variants (MOAT-v).
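The variant-permutation idea can be sketched in a few lines of Python. This is a simplified illustration rather than MOAT's algorithm: it assumes a single chromosome, relocates every variant uniformly at random across the whole chromosome, and tests one annotation at a time. The function names, parameters, and example coordinates are all hypothetical.

```python
import random


def count_in_interval(positions, start, end):
    """Count positions that fall in the half-open interval [start, end)."""
    return sum(1 for p in positions if start <= p < end)


def permutation_pvalue(variants, annotation, genome_length, n_perm=1000, rng=None):
    """Empirical p-value for an elevated variant count in one annotation.

    Assumption for this sketch: each permutation places every variant at a
    uniformly random position in [0, genome_length).
    """
    rng = rng or random.Random()
    start, end = annotation
    observed = count_in_interval(variants, start, end)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        permuted = [rng.randrange(genome_length) for _ in variants]
        if count_in_interval(permuted, start, end) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical input: 20 variants clustered inside a 100 bp annotation.
variants = list(range(100, 120))
observed, p_value = permutation_pvalue(
    variants, (100, 200), genome_length=100_000, n_perm=200, rng=random.Random(0)
)
```

With n_perm permutations, the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.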

WemIQ

WemIQ integrates an effective bias removal with a weighted expectation maximization (EM) algorithm to distribute reads among isoforms efficiently. The weight represents the oversampling or undersampling of sequence reads and is estimated through a generalized Poisson model without any presumption on the bias sources and formats. WemIQ significantly improves the quantification of isoform and gene expression as well as the derived exon inclusion rates. It provides robust expression estimates across different laboratories and protocols, which is valuable for the integrative analysis of RNA-seq. For the recent single-cell RNA-seq data, WemIQ also provides the opportunity to distinguish bias heterogeneity from true biological heterogeneity and uncovers smaller cell-to-cell expression variability.
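The general shape of a weighted EM for distributing reads among isoforms can be sketched as follows. This is a minimal illustration under strong assumptions, not WemIQ's implementation: the per-read weights are supplied as input rather than estimated from a generalized Poisson model, isoform lengths are ignored, and `compat` is a hypothetical 0/1 matrix marking which isoforms each read is compatible with.

```python
def weighted_em(compat, weights, n_iter=200, tol=1e-9):
    """Estimate isoform proportions from weighted reads by EM.

    compat[r][i] is 1 if read r is compatible with isoform i, else 0.
    weights[r] down-weights (or up-weights) read r, for example to offset
    oversampling or undersampling.
    """
    n_iso = len(compat[0])
    theta = [1.0 / n_iso] * n_iso  # start from uniform proportions
    for _ in range(n_iter):
        expected = [0.0] * n_iso
        for row, w in zip(compat, weights):
            denom = sum(t * c for t, c in zip(theta, row))
            if denom == 0:
                continue  # read is compatible with no isoform of nonzero share
            # E-step: split the read's weight among its compatible isoforms.
            for i, c in enumerate(row):
                expected[i] += w * theta[i] * c / denom
        total = sum(expected)
        if total == 0:
            raise ValueError("no read could be assigned to any isoform")
        # M-step: renormalize the weighted expected counts.
        new_theta = [x / total for x in expected]
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Hypothetical example: two isoforms, four reads; the last read is ambiguous.
compat = [[1, 0], [1, 0], [0, 1], [1, 1]]
unweighted = weighted_em(compat, [1.0, 1.0, 1.0, 1.0])
reweighted = weighted_em(compat, [0.5, 0.5, 1.0, 1.0])
```

In this toy input, halving the weights of the two reads unique to the first isoform lowers its estimated share, which is the kind of correction the weighting is meant to provide when those reads are oversampled.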