I joined Computer Science at the University of Surrey in June 2020 as a Senior Lecturer, having previously been a Lecturer at the University of Reading. Prior to that I was a Safra Research Fellow in the Division of Brain Sciences at Imperial College London and a Chancellor's Fellow in the School of Informatics at the University of Edinburgh. I originally studied Computer Science at King's College, Cambridge, before taking an MSc and PhD in Bioinformatics at Imperial College London.
My research applies statistical methods to analysing large biological data sets, especially in learning networks and studying their structure. You can read more about my current research interests under the research section.
Areas of specialism
I am open to applications from funded PhD students with interests in statistical methods in Systems Biology.
My research focusses on statistical methods in Systems Biology and Bioinformatics, including the inference of biological networks, and the statistical analysis of networks. I am also interested in the use of approximate Bayesian methods for complex models and large data sets, and the acceleration of statistical methods with GPUs.
Recent research has involved developing tools for the analysis of single cell transcriptomic data, which provides information about the heterogeneity of gene expression across a population of cells. Another area of interest is comparing data sets between different conditions to learn differential networks that jointly infer network structures across data sets but also allow differences to be identified.
I am currently leading the teaching team of COMM054 Data Science Principles and Practices on the MSc Data Science at the University of Surrey. This module introduces the fundamental concepts in probability and statistics that provide a solid background for students studying data science.
Motivation: Inferring the parameters of models describing biological systems is an important problem in the reverse engineering of the mechanisms underlying these systems. Much work has focused on parameter inference of stochastic and ordinary differential equation models using Approximate Bayesian Computation (ABC). While there is some recent work on inference in spatial models, this remains an open problem. Simultaneously, advances in topological data analysis (TDA), a field of computational mathematics, have enabled spatial patterns in data to be characterized. Results: Here, we focus on recent work using TDA to study different regimes of parameter space for a well-studied model of angiogenesis. We propose a method for combining TDA with ABC to infer parameters in the Anderson–Chaplain model of angiogenesis. We demonstrate that this topological approach outperforms ABC approaches that use simpler statistics based on spatial features of the data. This is a first step toward a general framework of spatial parameter inference for biological systems, for which there may be a variety of filtrations, vectorizations and summary statistics to be considered. Availability and implementation: All code used to produce our results is available as a Snakemake workflow from github.com/tt104/tabc_angio.
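As a rough illustration of the ABC idea referred to in the abstract above, the following is a minimal rejection-sampling sketch. It is not the method from the paper: the simulator is a toy Gaussian model rather than the Anderson–Chaplain model, the summary statistic is a sample mean rather than a topological summary, and the names `simulate`, `summary` and `abc_rejection`, the uniform prior and the tolerance `epsilon` are all assumptions made for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (assumption): n independent draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic (assumption): the sample mean of the data."""
    return statistics.fmean(data)


def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated summary is within epsilon of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # prior (assumption): Uniform(-5, 5)
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= epsilon:
            accepted.append(theta)
    return accepted


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = simulate(2.0, 50, rng)  # synthetic "data" with true theta = 2
posterior_sample = abc_rejection(observed, 5000, 0.1, rng)
```

The accepted values in `posterior_sample` approximate the posterior over `theta`. Replacing `summary` with a different statistic changes which features of the data the inference is sensitive to, which is the role that topological summaries play in the work described above.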
Sentiment analysis is one of the key tasks of natural language understanding. Sentiment Evolution models the dynamics of sentiment orientation over time. It can help people gain a deeper understanding of the opinion and sentiment implied in user generated content. Existing work mainly focuses on sentiment classification, while the analysis of how the sentiment orientation of a topic has been influenced by other topics, or the dynamic interaction of topics from the aspect of sentiment, has been ignored. In this paper, we propose to construct a Gaussian Process Dynamic Bayesian Network to model the dynamics and interactions of the sentiment of topics on social media such as Twitter. We use Dynamic Bayesian Networks to model time series of the sentiment of related topics and learn relationships between them. The network model itself applies Gaussian Process Regression to model the sentiment at a given time point based on related topics at previous time points. We conducted experiments on a real world dataset that was crawled from Twitter with 9.72 million tweets. The experiment demonstrates a case study of analysing the sentiment dynamics of topics related to the event Brexit.
Background: Inference of gene regulatory network structures from RNA-Seq data is challenging due to the nature of the data, as measurements take the form of counts of reads mapped to a given gene. Here we present a model for RNA-Seq time series data that applies a negative binomial distribution for the observations, and uses sparse regression with a horseshoe prior to learn a dynamic Bayesian network of interactions between genes. We use a variational inference scheme to learn approximate posterior distributions for the model parameters. Results: The methodology is benchmarked on synthetic data designed to replicate the distribution of real world RNA-Seq data. We compare our method to other sparse regression approaches and find improved performance in learning directed networks. We demonstrate an application of our method to a publicly available human neuronal stem cell differentiation RNA-Seq time series data set to infer the underlying network structure. Conclusions: Our method is able to improve performance on synthetic data by explicitly modelling the statistical distribution of the data when learning networks from RNA-Seq time series. Applying approximate inference techniques we can learn network structures quickly with only moderate computing resources.
The availability of large quantities of transcriptomic data in the form of RNA-seq count data has necessitated the development of methods to identify genes differentially expressed between experimental conditions. Many existing approaches apply a parametric model of gene expression and so place strong assumptions on the distribution of the data. Here we explore an alternative nonparametric approach that applies an empirical likelihood framework, allowing us to define likelihoods without specifying a parametric model of the data. We demonstrate the performance of our method when applied to gold standard datasets, and to existing experimental data. Our approach outperforms or closely matches the performance of existing methods in the literature, and requires modest computational resources. An R package, EmpDiff, implementing the methods described in the paper is available from: http://homepages.inf.ed.ac.uk/tthorne/software/packages/EmpDiff_0.99.tar.gz.
Differential networks allow us to better understand the changes in cellular processes that are exhibited in conditions of interest, identifying variations in gene regulation or protein interaction between, for example, cases and controls, or in response to external stimuli. Here we present a novel methodology for the inference of differential gene regulatory networks from gene expression microarray data. Specifically we apply a Bayesian model selection approach to compare models of conserved and varying network structure, and use Gaussian graphical models to represent the network structures. We apply a variational inference approach to the learning of Gaussian graphical models of gene regulatory networks, that enables us to perform Bayesian model selection that is significantly more computationally efficient than Markov Chain Monte Carlo approaches. Our method is demonstrated to be more robust than independent analysis of data from multiple conditions when applied to synthetic network data, generating fewer false positive predictions of differential edges. We demonstrate the utility of our approach on real world gene expression microarray data by applying it to existing data from amyotrophic lateral sclerosis cases with and without mutations in C9orf72, and controls, where we are able to identify differential network interactions for further investigation.
Developing mechanistic models has become an integral aspect of systems biology, as has the need to differentiate between alternative models. Parameterizing mathematical models has been widely perceived as a formidable challenge, which has spurred the development of statistical and optimisation routines for parameter inference. But now focus is increasingly shifting to problems that require us to choose from among a set of different models to determine which one offers the best description of a given biological system. We will here provide an overview of recent developments in the area of model selection. We will focus on approaches that are both practical and built on solid statistical principles, and outline the conceptual foundations and the scope for application of such methods in systems biology.
Transcriptomic data quantifying gene expression states for single cells or cell populations at a genomic level is now readily available. Changes in transcriptional state during cell development and function are governed by gene regulatory networks, comprising a collection of genes and regulatory interactions between these genes (or gene products). Network inference algorithms aim to infer functional interactions between genes from experimentally observed expression profiles, and identify the structure of the underlying regulatory networks. Here we describe popular classes of network inference algorithms, highlighting their respective strengths and weaknesses, along with some general challenges faced by these methods. Analyzing inferred network structures can provide insight into the genes, transcriptional changes, and regulatory interactions that play key roles in biological and disease-related processes of interest.
Motivation: One of the challenging questions in modelling biological systems is to characterize the functional forms of the processes that control and orchestrate molecular and cellular phenotypes. Recently proposed methods for the analysis of metabolic pathways, for example, dynamic flux estimation, can only provide estimates of the underlying fluxes at discrete time points but fail to capture the complete temporal behaviour. To describe the dynamic variation of the fluxes, we additionally require the assumption of specific functional forms that can capture the temporal behaviour. However, it also remains unclear how to address the noise which might be present in experimentally measured metabolite concentrations. Results: Here we propose a novel approach to modelling metabolic fluxes: derivative processes that are based on multiple-output Gaussian processes (MGPs), which are a flexible non-parametric Bayesian modelling technique. The main advantages that follow from the MGP approach include the natural non-parametric representation of the fluxes and the ability to impute missing data in between the measurements. Our derivative process approach allows us to model changes in metabolite derivative concentrations and to characterize the temporal behaviour of metabolic fluxes from time course data. Because the derivative of a Gaussian process is itself a Gaussian process, we can readily link metabolite concentrations to metabolic fluxes and vice versa. Here we discuss how this can be implemented in an MGP framework and illustrate its application to simple models, including nitrogen metabolism in Escherichia coli.
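The statement in the abstract above that the derivative of a Gaussian process is itself a Gaussian process can be written out as follows. This is a standard textbook result shown only as an illustration, not the specific formulation used in the paper; the symbols $f$, $m$, $k$, $\sigma$ and $\ell$ are notation introduced here, and the squared exponential kernel in the last lines is an assumed example choice.

```latex
% Assume a differentiable mean m and a twice-differentiable kernel k.
\begin{align*}
f(t) &\sim \mathcal{GP}\bigl(m(t),\, k(t, t')\bigr)
  \quad\Longrightarrow\quad
  f'(t) \sim \mathcal{GP}\!\left(m'(t),\, \frac{\partial^2 k(t, t')}{\partial t\, \partial t'}\right), \\
\operatorname{cov}\bigl(f(t),\, f'(t')\bigr) &= \frac{\partial k(t, t')}{\partial t'} .
\end{align*}
% Example (assumption): squared exponential kernel
\begin{align*}
k(t, t') &= \sigma^2 \exp\!\left(-\frac{(t - t')^2}{2\ell^2}\right), \\
\frac{\partial k(t, t')}{\partial t'} &= \frac{t - t'}{\ell^2}\, k(t, t'), \\
\frac{\partial^2 k(t, t')}{\partial t\, \partial t'} &= \frac{1}{\ell^2}\left(1 - \frac{(t - t')^2}{\ell^2}\right) k(t, t') .
\end{align*}
```

Because the function and its derivative are jointly Gaussian, observations of one can be used to make predictions about the other, which is how concentrations and fluxes can be linked in a model of this kind.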
Altered lipid metabolism is a feature of chronic inflammatory disorders. Increased plasma lipids and lipoproteins have been associated with multiple sclerosis (MS) disease activity. Our objective was to characterise the specific lipids and associated plasma lipoproteins increased in MS and to test for an association with disability. Plasma samples were collected from 27 RRMS patients (median EDSS, 1.5, range 1–7) and 31 healthy controls. Concentrations of lipids within lipoprotein sub-classes were determined from NMR spectra. Plasma cytokines were measured using the MesoScale Discovery V-PLEX kit. Associations were tested using multivariate linear regression. Differences between the patient and volunteer groups were found for lipids within VLDL and HDL lipoprotein sub-fractions (p < 0.05). Multivariate regression demonstrated a high correlation between lipids within VLDL sub-classes and the Expanded Disability Status Scale (EDSS) (p < 0.05). An optimal model for EDSS included free cholesterol carried by VLDL-2, gender and age (R² = 0.38, p < 0.05). Free cholesterol carried by VLDL-2 was highly correlated with plasma cytokines CCL-17 and IL-7 (R² = 0.78, p < 0.0001). These results highlight relationships between disability, inflammatory responses and systemic lipid metabolism in RRMS. Altered lipid metabolism with systemic inflammation may contribute to immune activation.
Reconstructing continuous signals from discrete time-points is a challenging inverse problem encountered in many scientific and engineering applications. For oscillatory signals, classical results due to Nyquist set the limit below which it becomes impossible to reliably reconstruct the oscillation dynamics. Here we revisit this problem for vector-valued outputs and apply Bayesian non-parametric approaches in order to solve the function estimation problem. The main aim of the current paper is to map how we can use correlations among different outputs to reconstruct signals at a sampling rate that lies below the Nyquist rate. We show that it is possible to use multiple-output Gaussian processes to capture dependences between outputs, which facilitates reconstruction of signals in situations where conventional Gaussian processes (i.e. those aimed at describing scalar signals) fail, and we delineate the phase and frequency dependence of the reliability of this type of approach. In addition to simple toy-models we also consider the dynamics of the tumour suppressor gene p53, which exhibits oscillations under physiological conditions, and which can be reconstructed more reliably in our new framework.
Objective: To infer molecular effectors of therapeutic effects and adverse events for dimethyl fumarate (DMF) in patients with relapsing-remitting MS (RRMS) using untargeted plasma metabolomics. Methods: Plasma from 27 patients with RRMS was collected at baseline and 6 weeks after initiating DMF. Patients were separated into discovery (n = 15) and validation cohorts (n = 12). Ten healthy controls were also recruited. Metabolomic profiling using ultra-high-performance liquid chromatography mass spectrometry (UPLC-MS) was performed on the discovery cohort and healthy controls at Metabolon Inc (Durham, NC). UPLC-MS was performed on the validation cohort at the National Phenome Centre (London, UK). Plasma neurofilament concentration (pNfL) was assayed using the Simoa platform (Quanterix, Lexington, MA). Time course and cross-sectional analyses were performed to identify pharmacodynamic changes in the metabolome secondary to DMF and relate these to adverse events. Results: In the discovery cohort, tricarboxylic acid (TCA) cycle intermediates fumarate and succinate, and TCA cycle metabolites succinyl-carnitine and methyl succinyl-carnitine increased 6 weeks following treatment (q < 0.05). Methyl succinyl-carnitine increased in the validation cohort (q < 0.05). These changes were not observed in the control population. Increased succinyl-carnitine and methyl succinyl-carnitine were associated with adverse events from DMF (flushing and abdominal symptoms). pNfL concentration was higher in patients with RRMS than in controls and reduced over 15 months of treatment. Conclusion: TCA cycle intermediates and metabolites are increased in patients with RRMS treated with DMF. The results suggest reversal of flux through the succinate dehydrogenase complex. The contribution of succinyl-carnitine ester agonism at hydroxycarboxylic acid receptor 2 to both therapeutic effects and adverse events requires investigation.