Duong Vo
About
My research project
Statistical methods to the analysis of large scale single cell RNA-seq data
With advances in next-generation sequencing technologies, single-cell RNA sequencing (scRNA-seq) allows researchers to analyze the transcriptomic information of individual cells. To dissect scRNA-seq data, computational methods are applied in several steps, including mapping, quality control, quantification, clustering, and differentially expressed gene analysis. Challenges remain in the computational analysis of scRNA-seq data: high levels of noise, sparsity, and batch effects are among its reported properties. The importance of scRNA-seq analysis has been demonstrated in several studies. For example, differentially expressed gene analysis of scRNA-seq data comparing circulating tumor cells with primary tumor cells of hepatocellular carcinoma patients identified the chemokine CCL5 as the mediator of immune evasion by circulating tumor cells. The main goal of our research is to develop computational tools, based on gene regulatory network inference, that overcome challenges such as noise and batch effects in scRNA-seq data analysis and thereby enable single-cell analyses such as cell-type and cell-state identification.
Supervisors
Publications
In the study of single cell RNA-seq data, a key component of the analysis is to identify sub-populations of cells in the data. A variety of approaches to this have been considered, and although many machine learning based methods have been developed, these rarely give an estimate of uncertainty in the cluster assignment. To allow for this, probabilistic models have been developed, but single cell RNA-seq data exhibit a phenomenon known as dropout, whereby a large proportion of the observed read counts are zero. This poses challenges in developing probabilistic models that appropriately model the data. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple populations of cells, and a zero-inflated negative binomial mixture of counts at the transcript level. By taking a Bayesian approach we are able to model the expression of genes within clusters, and to quantify uncertainty in cluster assignments. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model single cell RNA-seq counts, as well as negative binomial models that do not take zero-inflation into account. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline available from https://github.com/tt104/scmixture.
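To make the zero-inflation idea concrete, here is a minimal sketch of a zero-inflated negative binomial (ZINB) log-probability for a single read count. It is not taken from the scmixture pipeline. The function names and the mean/dispersion parameterisation (`mu`, `r`) with a dropout probability `pi` are assumptions chosen for illustration, and the Dirichlet process mixture over cells is omitted entirely.

```python
import math


def nb_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial.

    Assumed parameterisation: mean mu > 0 and dispersion (size) r > 0,
    so the variance is mu + mu**2 / r.
    """
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def zinb_log_pmf(k, mu, r, pi):
    """Log-probability of count k under a zero-inflated negative binomial.

    With probability pi (0 <= pi < 1) the count is a structural zero
    (dropout); otherwise it is drawn from the negative binomial above.
    """
    nb = nb_log_pmf(k, mu, r)
    if k == 0:
        # A zero can come from dropout or from the count distribution itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for count in (0, 1, 5):
        print(count, zinb_log_pmf(count, mu=2.0, r=1.5, pi=0.3))
```

Setting `pi` to zero recovers the plain negative binomial, which is the sense in which the zero-inflated model extends a model that ignores dropout.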