Professor Nophar Geifman

Professor of Health and Biomedical Informatics, and Director of Informatics, the Veterinary Health Innovation Engine (vHIVE)

M Med Sc, PhD, FHEA

n.geifman@surrey.ac.uk

Academic and research departments

School of Health Sciences, Digital Health Expert Group.

About

Biography

My interests lie in data sciences within healthcare and medicine; extending the use of artificial intelligence and big-data analytics to improve patient-centric predictions, treatment and outcomes, while also enhancing the open sharing of biomedical data. I have developed and employed a wide range of AI methods and applications, from text mining to machine learning, with the overarching goal of producing translational research with real-world impact.

My research centres on patient stratification, and biomarker discovery from large, diverse clinical and ‘omics’ datasets; applying informatics techniques, AI and machine learning for discovery in various areas of medicine, particularly where conventional research methods have over-simplified the inherent complexity of disease and care.

News

24 NOV 2025

Blood protein profiles can predict mortality

06 AUG 2025

Personalised chronic kidney disease management on the horizon, as new biomarker research spurs hope

17 DEC 2024

Is fake meat good to eat? Processed plant-based meat alternatives linked to depression risk in vegetarians

09 OCT 2024

Having a sweet tooth is linked to higher risk of depression, type 2 diabetes, and stroke, study finds


21 FEB 2023

Five reasons to study health and biomedical informatics at Surrey

03 FEB 2022

Meet the academic: Professor Nophar Geifman

Publications

Danny Maupin, Tulsi Suchak, Abhijit Sengupta, Mariana Marra, Nophar Geifman, Matt Spick Drastic changes in collaboration networks and publication patterns in research using the CDC WONDER dataset, In: medRxiv

DOI: 10.64898/2026.01.13.26343992

The growth of generative AI and easily available Open Access health datasets has transformed researcher productivity, leading to an explosion in publications that has in part been attributed to paper mills (organisations that provide manuscripts for payment) and other unethical actors. These entities are not, however, homogenous, and have a range of products and target markets. While the demand from China has received much attention, here we provide a case study of CDC WONDER, a dataset that has been exploited by a network of researchers reporting affiliations in Pakistan, the United States and the UK, potentially linked to medical residency driven demand from junior clinicians or trainees. The number of publications using CDC WONDER grew from 88 in 2021 to 1223 in 2025. Over the same time period, the proportion of papers reporting at least one author from Pakistan grew from 0.5% in 2021 to 27.2% in 2025, with unusually extensive collaboration networks. In some cases these works featured over 15 co-authors, often including representation from Western institutions, but in spite of this high level of resourcing only resulted in straightforward analyses of well-described conditions using publicly available data. The majority of these outputs additionally show evidence of being produced from a template, with formulaic titles and identical methods, for example using the same statistical model and platform (Joinpoint regression). Identifying papers produced by fast-churn workflows is essential to protect the integrity of the scientific literature from being flooded with low-quality research. This can be achieved through more proactive desk rejection of misleading and formulaic mass-produced submissions, and through better understanding of which use cases are appropriate for different Open Science resources.

Christos Dadousis, Nicos Angelopoulos, Yaoyao Zhang, Anna Eleonora Karagianni, Huanmin Zhang, John A. Hammond, Venugopal Nair, Yongxiu Yao, Nophar Geifman (2025) Genetic Variation Associated with Marek’s Disease Resistance and Susceptibility in White Leghorn Chickens, In: Poultry science 105(3) 106311 Elsevier Inc

DOI: 10.1016/j.psj.2025.106311

Despite effective control by vaccination, Marek’s disease virus (MDV) remains a significant threat to poultry health and productivity due to continued virus evolution, which drives the need to better understand the host genetic factors underlying resistance and susceptibility. Although great efforts have been made toward understanding genetic resistance to Marek’s disease (MD), the genetic variations underlying varied susceptibility to MD remain poorly understood. In this study, we conducted a comprehensive genome-wide association and pathway enrichment analysis in chicken genomes of an MD-resistant inbred line 6 (LN6), an MD-susceptible inbred line 7 (SUS), and 6 recombinant congenic strains (RCS) derived from lines 6 and 7. The incidence of MD in the RCS varied significantly, ranging from RCS-J: 0%, RCS-D: 4%, line 6: 6%, RCS-F: 10%, RCS-W: 15%, RCS-K: 24%, to RCS-M: 41%. The progenitor line 7 was observed with a 97% incidence in response to infection with a very virulent plus MDV strain (648A). Three models were constructed: GWAS LN6, with the line 6 birds contrasted against the remaining birds; GWAS SUS, with the line 7 birds contrasted against the remaining birds; and GWAS RES-JD6, with the group of RCS-J, RCS-D and line 6 together contrasted against the remaining birds. Our results revealed distinct enrichment patterns: while the WNT/SHH Axonal Guidance Signaling Pathway was enriched in both GWAS SUS and GWAS RES-JD6, the Th2 pathway, Th1/Th2 Activation Pathway, and the interleukin (IL)-33 were predominant in GWAS SUS. On the other hand, the ISG15 antiviral mechanism and HIPPO signalling pathways were enriched in GWAS RES-JD6. In contrast, thyroid cancer signalling, CXCR4, ILK, IL-8, IL-3, JAK/STAT and mTOR signalling pathways were significantly enriched in GWAS LN6. These findings underscore the complex interplay of immune signalling, host-pathogen interactions, and genetic regulation in shaping MD resistance. Key pathways and candidate genes identified in this study provide valuable targets for further functional validation and may inform future genetic selection and new vaccine strategies to enhance MD resistance in poultry.

Daniel Maupin, Matt Spick, Nophar Geifman (2025) Safeguarding Open Science from exploitative practices, In: PLoS medicine 22(12) e1004851 PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pmed.1004851

Open research and data transparency are a bulwark against unethical activities, but can also introduce integrity risks. As with all public goods, freely available data can be exploited, and here we set out the case for the use of safeguarding practices.

Natalia Koziar, Anthony D Whetton, Nophar Geifman (2025) A plasma-based protein signature association with all-cause mortality, In: PloS one 20(11) e0336845 PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pone.0336845

Circulating plasma proteins play key roles in measuring and reflecting states of disease and health. Developments in protein metrology allow for over 2,900 proteins to be quantified in a single sample. In major epidemiological studies, this allows for profound insights into protein expression in liquid biopsies, mortality, and morbidity. Here, we have investigated the relationship between peripheral blood protein profiles and non-accident all-cause mortality within 5- and 10-year timeframes, using data on 38,150 participants from the UK Biobank. Adjusting for lifestyle and health covariates, we identified 392 proteins associated with an increase in risk for death within 5-years, and 377 associated with an increase in risk within 10-years. Proteomic signatures of cause-specific mortality (cardiovascular, cancer, all-other causes) were also identified, with 19 proteins found to overlap across those. Using logistic regression modelling, we constructed a parsimonious predictive protein panel for each respective all-cause mortality timeframe, including markers such as adrenomedullin, SERPINA1 and PLAUR. When compared to models inclusive of standalone traditional risk criteria, such as demographic and lifestyle factors, models utilising the protein panels modestly improve prediction for 5 and 10-year mortality (from AUC 0.49-0.57 to AUC of 0.62-0.68). Our results demonstrate the potential of the plasma proteome in risk stratification for all-cause mortality.
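The AUC comparison reported in this abstract can be illustrated with a minimal sketch. The function below computes the area under the ROC curve using the rank (Mann-Whitney) formulation; the labels and scores are hypothetical toy values, not UK Biobank data, and the names `baseline_scores` and `panel_scores` are assumptions introduced only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died within the timeframe, 0 = survived.
labels = [0, 0, 0, 1, 0, 1, 1, 0]
# Hypothetical predicted risks from a model using traditional risk factors only.
baseline_scores = [0.2, 0.6, 0.4, 0.5, 0.7, 0.3, 0.8, 0.1]
# Hypothetical predicted risks from a model that also uses a protein panel.
panel_scores = [0.1, 0.3, 0.4, 0.6, 0.65, 0.7, 0.9, 0.2]

baseline_auc = auc(labels, baseline_scores)
panel_auc = auc(labels, panel_scores)
print(f"baseline AUC = {baseline_auc:.3f}, panel AUC = {panel_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level discrimination, which is why a move from roughly 0.5 towards 0.6-0.7 is described as a modest improvement.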

Melissa T Benavente, Nophar Geifman, Sarah C Bath, Kourosh R Ahmadi, Andrew W Fogarty, Charles R. Marshall, Sumantra Ray, Laila J. Tata, Chittaranjan Yajnik, Anand Ahankari (2025) Cohort profile: the Maharashtra Anaemia Study 3 (MAS 3)—a maternal-child cohort study up to age 18 years in India, In: BMJ Open 15(10) e104184 BMJ Publishing Group

DOI: 10.1136/bmjopen-2025-104184

Purpose: The Maharashtra Anaemia Study 3 (MAS 3) aims to (1) Investigate the nutritional, environmental, and economic impacts on haemoglobin concentration/anaemia, (2) Identify the underlying micronutrient causes of anaemia and (3) Investigate the association between anaemia and physical and cognitive development of Indian children during their first 18 years of life. This paper introduces the MAS 3 cohort, which consists of data collected from the participants in the prospective Pune Maternal Nutrition Study from the antenatal period to children at 18 years of age (1996–2014) in the Maharashtra state, India. Participants: Recruitment of 2466 married non-pregnant women, and their husbands, took place between June 1994 and April 1996 in six villages, approximately 50 km from Pune city in India. Women were followed up monthly to identify those who became pregnant. A total of 797 pregnant women were followed up for data collection at or near gestational week 18 and 28, with further data collection for women and children occurring within 72 hours of delivery, for both live and stillbirths. Of the 797 women, 710 were included in the MAS 3 cohort, and long-term follow-up of children occurred at 6 years, 12 years and 18 years of age. Findings to date: In the MAS 3 cohort, most mothers (73%) were aged between 18 and 25 years at the time of their final prepregnancy visit (baseline), and half (55%) belonged to families of middle-upper socioeconomic status (SES). At the children’s baseline (birth) visit, children had a mean birth weight of 2630 g (SD: 376), with one third (31%) of low birth weight. At the 6-year, 12-year and 18-year follow-up visits, data were available for 706 (99%), 689 (97%) and 694 (98%) children. Future plans: MAS 3 will be used to address a number of research objectives, including (1) Trends of haemoglobin and anaemia-related micronutrients from age 6 to 18 years, (2) Micronutrient causes of anaemia during childhood, (3) Prevalence and risk factors for maternal anaemia and childhood anaemia, (4) Impact of maternal anaemia on immediate birth outcomes and (5) Intergenerational risk factors associated with anaemia.

Hana Fitria Navratilova, Anthony David Whetton, Nophar Geifman (2025) Integrating Food Preference Profiling, Behavior Change Strategies, and Machine Learning for Cardiovascular Disease Prevention in a Personalized Nutrition Digital Health Intervention: Conceptual Pipeline Development and Proof-of-Principle Study, In: Journal of medical Internet research 27(27) e75106 JMIR PUBLICATIONS, INC

DOI: 10.2196/75106

Background: Personalized dietary advice needs to consider the individual's health risks as well as specific food preferences, offering healthier options aligned with personal tastes. Objective: This study aimed to develop a digital health intervention (DHI) that provides personalized nutrition recommendations based on individual food preference profiles (FPP), using data from the UK Biobank. Methods: Data from 61,229 UK Biobank participants were used to develop a conceptual pipeline for a DHI. The pipeline included three steps: (1) developing a simplified food preference profiling tool, (2) creating a cardiovascular disease (CVD) prediction model using the subsequent profiles, and (3) selecting intervention features. The CVD prediction model was created using 3 different predictor sets (Framingham set, diet set, and FPP set) across 4 machine learning models: logistic regression, linear discriminant analysis, random forest, and support vector machine. Intervention functions were designed using the Behavior Change Wheel, and behavior change techniques were selected for the DHI features. Results: The feature selection process identified 14 food items out of 140 that effectively classify FPPs. The food preference profile prediction set, which did not include blood measurements or detailed nutrient intake, demonstrated comparable accuracy (across the 4 models: 0.721-0.725) to the Framingham set (0.724-0.727) and diet set (0.722-0.725). Linear discriminant analysis was chosen as the best-performing model. Four key features of the DHI were identified: food source and portion information, recipes, a dietary recommendation system, and community exchange platforms. The FPP and CVD risk prediction model serve as inputs for the dietary recommendation system. Two levels of personalized nutrition advice were proposed: level 1—based on food portion intake and FPP; and level 2—based on nutrient intake, FPP, and CVD risk probability. Conclusions: This study presents proof of principle for a conceptual pipeline for a DHI that empowers users to make informed dietary choices and reduce CVD risk by catering to person-specific needs and preferences. By making healthy eating more accessible and sustainable, the DHI has the potential to significantly impact public health outcomes.

Kris Elomaa, Matt Spick, Earn H Gan, Simon H Pearce, Nophar Geifman (2025) Variable hyperthyroidism outcomes related to different treatment regimens: an analysis of UK Biobank data, In: European thyroid journal 14(2) e240393 BIOSCIENTIFICA LTD

DOI: 10.1530/ETJ-24-0393

Background: UK guidance on the assessment and management of thyroid disease was set out in NICE guideline NG145 in 2019 and is expected to result in an increase in radioactive iodine (RAI) being offered as a first-line definitive treatment for hyperthyroidism. Methodology: In this work we analyse longitudinal UK Biobank data to assess all-cause mortality and comorbidity risks associated with the main treatment modalities for 793 participants with hyperthyroidism, specifically antithyroid drugs (ATDs), RAI and thyroidectomy. Results: Participants treated with RAI showed reduced all-cause mortality compared with those treated with ATD alone (time to event ratio: 1.8, 95% CI: 0.9-3.6), albeit the result did not reach statistical significance, as did those treated by thyroidectomy (time ratio: 2.0, 95% CI: 1.1-3.9). For treated patients, odds ratios were generally elevated for osteoporosis, cardiovascular events and atrial fibrillation, but again did not reach statistical significance except for those patients treated by ATDs, with an odds ratio for atrial fibrillation of 2.2 (95% CI: 1.2-4.1) versus controls. Conclusion: Our findings were consistent with those previously reported in the literature and do not reveal any evidence from the UK Biobank to contradict the safety of RAI being offered as a first-line treatment. The data are also suggestive, however, that treatments do not fully eliminate risks of complications related to hyperthyroidism. This reinforces the need for both clear communication where there may be risks of complications such as osteoporosis as well as clinical support for patients even after definitive treatment.
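To make the link between an odds ratio, its 95% confidence interval and statistical significance concrete, here is a minimal sketch of an unadjusted odds ratio with a Wald interval from a 2x2 table. This is a textbook calculation, not necessarily the method used in the paper, and the counts below are hypothetical, not taken from the UK Biobank analysis.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: treated with the outcome      b: treated without the outcome
    c: controls with the outcome     d: controls without the outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts chosen only to illustrate the calculation.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=10, d=90)
significant = not (ci_low <= 1.0 <= ci_high)
print(f"OR = {or_value:.2f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}, "
      f"significant at 5%: {significant}")
```

When the interval includes 1, as in this toy example, an elevated odds ratio does not reach statistical significance at the 5% level; an interval such as 1.2-4.1 excludes 1 and therefore does.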

Charlotte Watson, Nophar Geifman, Andrew G Renehan (2022) Latent class trajectory modelling: impact of changes in model specification, In: American journal of translational research 14(10) pp. 7593-7606 e-Century Publishing Corporation

Latent class trajectory models (LCTMs) are often used to identify subgroups of patients that are clinically meaningful in terms of longitudinal exposure and outcome, e.g. drug response patterns. These models are increasingly applied in medicine and epidemiology. However, in many published studies, it is not clear whether the chosen models, where subgroups of patients are identified, represent real heterogeneity in the population, or whether any associations with clinically meaningful characteristics are accidental. In particular, we note an apparent over-reliance on lowest AIC or BIC values. While these are objective measures of goodness of fit, and can help identify the optimal number of subgroups, they are not sufficient on their own to fully evaluate a given trajectory model. Here we demonstrate how longitudinal latent class models can substantially change by making small modifications in model specification, and the impact of this on the relationship to clinical outcomes. We show that the predicted trajectory patterns and outcome probabilities differ when pre-specified cubic versus linear shapes are tested on the same data. However, both could be interpreted to be the “correct” model. We emphasise that LCTMs, like all unsupervised approaches, are hypothesis-generating, and should not be directly implemented in clinical practice without significant testing and validation.
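The AIC and BIC criteria mentioned in this abstract can be computed directly from a model's log-likelihood. The sketch below uses the standard definitions; the two candidate models, their log-likelihoods, parameter counts and the sample size are hypothetical values chosen only to show that the two criteria can disagree.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fitted models on the same data (values are illustrative only).
n_obs = 500
candidates = {
    "linear, 3 classes": {"loglik": -1520.0, "k": 8},
    "cubic, 3 classes": {"loglik": -1508.0, "k": 14},
}

scores = {
    name: {
        "aic": aic(m["loglik"], m["k"]),
        "bic": bic(m["loglik"], m["k"], n_obs),
    }
    for name, m in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
```

In this toy case AIC favours the cubic specification while BIC, which penalises extra parameters more heavily, favours the linear one, which is one reason neither criterion is sufficient on its own.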

Anthony Onoja, Thomas McDonnell, Isabelle Annessi, Rosamonde E Banks, Marianne Bergin, Paul Cockwell, Rodolphe Dusaulcy, Simon D S Fraser, Tim Johnson, Philip A. Kalra, Barbara Lemaître, Moin Saleem, Phillipp Skroblin, Magnus Soderberg, Maarten W. Taal, Robert Unwin, Nicolas Vuilleumier, David C Wheeler, Nophar Geifman (2025) Biomarkers of Kidney Failure and All-Cause Mortality in CKD, In: Journal of the American Society of Nephrology AMER SOC NEPHROLOGY

DOI: 10.1681/ASN.0000000767

Chronic kidney disease (CKD) carries a variable risk for multiple adverse outcomes, highlighting the need for a personalised approach. This study evaluated several novel biomarkers linked to key disease mechanisms to predict the risk of kidney failure (first event of eGFR

A Dagliati, N Peek, R.D Brinton, N Geifman (2021) Sex and APOE genotype differences related to statin use in the aging population, In: Alzheimer’s & Dementia: Translational Research & Clinical Interventions 7(1) e12156 John Wiley and Sons Inc

DOI: 10.1002/trc2.12156

Background: Significant evidence suggests that the cholesterol-lowering statins can affect cognitive function and reduce the risk for Alzheimer’s disease (AD) and dementia. These potential effects may be constrained by specific combinations of an individual’s sex and apolipoprotein E (APOE) genotype. Methods: Here we examine data from 252,327 UK Biobank participants, aged 55 or over, and compare the effects of statin use in males and females. We assessed difference in statin treatments taking a matched cohort approach, and identified key stratifiers using regression models and conditional inference trees. Using statistical modeling, we further evaluated the effect of statins on survival, cognitive decline over time, and on AD prevalence. Results: We identified that in the selected population, males were older, had a higher level of education, better cognitive scores, higher incidence of cardiovascular and metabolic diseases, and a higher rate of statin use. We observed that males and those participants with an APOE ε4–positive genotype had higher probabilities of being treated with statins; while participants with an AD diagnosis had slightly lower probabilities. We found that use of statins was not significantly associated with overall higher rates of survival. However, when considering the interaction of statin use with sex, the results suggest higher survival rates in males treated with statins. Finally, examination of cognitive function indicates a potential beneficial effect of statins that is selective for APOE ε4–positive genotypes. Discussion: Our evaluation of the aging population in a large cohort from the UK Biobank confirms sex and APOE genotype as fundamental risk stratifiers for AD and cognitive function, furthermore it extends them to the specific area of statin use, clarifying their specific interactions with treatments. © 2021 The Authors. 
Alzheimer’s & Dementia: Translational Research & Clinical Interventions published by Wiley Periodicals LLC on behalf of Alzheimer’s Association.

Stephanie J W Shoop-Worrall, Saskia Lawson-Tovey, Lucy R Wedderburn, Kimme L Hyrich, Nophar Geifman (2024) Towards stratified treatment of JIA: machine learning identifies subtypes in response to methotrexate from four UK cohorts, In: EBioMedicine 100 104946 Elsevier

DOI: 10.1016/j.ebiom.2023.104946

Methotrexate (MTX) is the gold-standard first-line disease-modifying anti-rheumatic drug for juvenile idiopathic arthritis (JIA), despite only being either effective or tolerated in half of children and young people (CYP). To facilitate stratified treatment of early JIA, novel methods in machine learning were used to i) identify clusters with distinct disease patterns following MTX initiation; ii) predict cluster membership; and iii) compare clusters to existing treatment response measures. Discovery and verification cohorts included CYP who first initiated MTX before January 2018 in one of four UK multicentre prospective cohorts of JIA within the CLUSTER consortium. JADAS components (active joint count, physician (PGA) and parental (PGE) global assessments, ESR) were recorded at MTX start and over the following year. Clusters of MTX 'response' were uncovered using multivariate group-based trajectory modelling separately in discovery and verification cohorts. Clusters were compared descriptively to ACR Pedi 30/90 scores, and multivariate logistic regression models predicted cluster-group assignment. The discovery cohorts included 657 CYP and verification cohorts 1241 CYP. Six clusters were identified: Fast improvers (11%), Slow Improvers (16%), Improve-Relapse (7%), Persistent Disease (44%), Persistent PGA (8%) and Persistent PGE (13%), the latter two characterised by improvement in all features except one. Factors associated with clusters included ethnicity, ILAR category, age, PGE, and ESR scores at MTX start, with predictive model area under the curve values of 0.65-0.71. Singular ACR Pedi 30/90 scores at 6 and 12 months could not capture speeds of improvement, relapsing courses or diverging disease patterns. Six distinct patterns following initiation of MTX have been identified using methods in artificial intelligence. 
These clusters demonstrate the limitations in traditional yes/no treatment response assessment (e.g., ACRPedi30) and can form the basis of a stratified medicine programme in early JIA. Funding: Medical Research Council, Versus Arthritis, Great Ormond Street Hospital Children's Charity, Olivia's Vision, and the National Institute for Health Research.

Christos Dadousis, Anthony D. Whetton, Kennedy Mwacalimba, Alexandre Merlo, Andrea Wright, Nophar Geifman (2024) Renal Disease in Cats and Dogs—Lessons Learned from Text-Mined Trends in Humans, In: Animals 14(23) 3349 MDPI

DOI: 10.3390/ani14233349

Chronic kidney disease (CKD) is characterised by progressive kidney damage and encompasses a broad range of renal pathologies and aetiologies. In humans, CKD is an increasing global health problem, in particular in the western world, while in cats and dogs, CKD is one of the leading causes of mortality and morbidity. Here, we aimed to develop an enhanced understanding of the knowledge base related to the pathophysiology of renal disease and CKD in cats and dogs. To achieve this, we leveraged a text-mining approach for reviewing trends in the literature and compared the findings to evidence collected from publications related to CKD in humans. Applying a quantitative text-mining technique, we examined data on clinical signs, diseases, clinical and lab methods, cell types, cytokine, and tissue associations (co-occurrences) captured in PubMed biomedical literature. Further, we examined different types of pain within human CKD-related publications, as publications on this topic are sparser in companion animals, but with the growing importance of animal welfare and quality of life, it is an area of interest. Our findings could serve as substance for future research studies. The systematic automated review of relevant literature, along with comparative analysis, has the potential to summarise scientific evidence and trends in a quick, easy, and cost-effective way. Using this approach, we identified targeted and novel areas of investigation for renal disease in cats and dogs.
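The co-occurrence idea behind this kind of quantitative text mining can be sketched in a few lines. The example below counts how often pairs of terms appear in the same abstract; it is a simplified illustration using plain substring matching, not the tool used in the study, and the abstracts and term list are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for each pair of terms, the number of documents mentioning both.

    Matching is a case-insensitive substring test, which is a deliberate
    simplification (no stemming, synonyms or ontology lookup).
    """
    counts = Counter()
    lowered_terms = [t.lower() for t in terms]
    for doc in documents:
        text = doc.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted(t for t in lowered_terms if t in text)
        counts.update(combinations(present, 2))
    return counts


# Hypothetical abstracts standing in for PubMed records.
abstracts = [
    "Proteinuria and hypertension are common in cats with chronic kidney disease.",
    "Hypertension in dogs with chronic kidney disease.",
    "Anaemia and proteinuria were reported in cats.",
]
terms = ["proteinuria", "hypertension", "anaemia", "cats", "dogs"]

pair_counts = cooccurrence_counts(abstracts, terms)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Ranking pairs by count, per species or per year, gives a rough picture of which clinical signs, methods or cell types are most often discussed together in the literature.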

Anthony Onoja, Abdullah Zahid, Kris Elomaa, Nophar Geifman (2025) An Interpretable Model for Predicting Acute Myocardial Infarction in Distinct Patient Profiles, In: Proceedings of MIE 2025 pp. 452-456

DOI: 10.3233/SHTI250378

Introduction: Acute myocardial infarction (AMI) is highly prevalent (3.8% in developed countries), affecting heterogeneous populations, and can be influenced by varied factors, including demographics, clinical risk factors, and comorbidities. Identifying distinct AMI patient profiles can aid in understanding the disease and developing personalised treatment strategies. Methods: This study analysed data from UK Biobank participants with an AMI diagnosis. Using unsupervised clustering techniques - UMAP, latent profile analysis, and K-means clustering - distinct and robust patient profiles were identified and associated with co-morbidity prevalence. Next, we trained three supervised machine learning classifiers (Logistic Regression, Random Forest, and XGBoost) to predict profile membership from 28 biochemistry markers. SHAP values were used for post-hoc interpretation of the best-performing model. Finding: Four distinct patient profiles were identified: “CMR-GIRespRenal”, “AG-CMS”, “CM-MultiCardio”, and “PostMeno-CMSurgGI”. Each profile showed unique characteristics in socio-demographics, clinical risk factors (e.g., BMI, age, smoking, alcohol intake, waist and hip circumference) and disease prevalence. The Random Forest classifier outperformed all others, achieving an average weighted AUROC score of 78%. SHAP analysis highlighted key biochemical markers, such as Testosterone, Creatinine, Vitamin D, Urate, and lipid profile markers, as significant predictors of AMI profiles. Conclusion: This study underscores the heterogeneity of AMI patients and the importance of integrating patient profiles with biochemical markers for improved stratification in diagnosis and treatment. These identified profiles can guide personalised treatment strategies, tailoring interventions to the specific needs of each group. Understanding these profiles may also lead to novel therapeutic targets.

Hana F. Navratilova, Anthony D. Whetton, Nophar Geifman (2024) Plant‐Based Meat Alternatives Intake and Its Association With Health Status Among Vegetarians of the UK Biobank Volunteer Population, In: Food Frontiers (Early View) e2532 Wiley

DOI: 10.1002/fft2.532

Consumption of Plant-Based Meat Alternatives (PBMAs) within the vegetarian population is increasing. This study assessed the relationship between PBMA intake and health markers using the UK Biobank cohort. Participants were categorized into vegetarian PBMA consumers and vegetarian PBMA non-consumers. After assessing data distributions, non-parametric statistical tests were used to evaluate differences in participants’ characteristics, food intake, and 30 blood biochemistry measures. Metabolomics (168 metabolites) and proteomics (2,923 proteins) data were further examined to identify significant differences between the two participant groups. Relative Risks (RRs) for 45 chronic diseases and mental conditions were calculated using Poisson regression. Sensitivity analysis accounted for sociodemographic factors, and the proportion of energy from Ultra-Processed Food (UPF) intake was determined. No substantial differences in sodium, free sugar, total sugar, or saturated fatty acids intake between vegetarian PBMA consumers and non-consumers were found. However, PBMA consumers exhibited higher blood pressure (130/79 mmHg and 129/78 mmHg for consumer and non-consumer groups, respectively) and elevated C-reactive protein (CRP) levels (1.76±3.12 mg/L and 1.57±3.17 mg/L for consumer and non-consumer groups, respectively). Metabolite and protein abundance analysis showed no notable differences. Pathway enrichment analysis suggested that PBMAs may influence immune reactions through cell signalling pathways. PBMA consumers had a 42% increased risk of depression (p=0.03) and a 40% reduction in irritable bowel syndrome (IBS) risk (p=0.02), compared to non-consumers. In conclusion, while no clear health risks or benefits were associated with PBMA consumption in vegetarians, the higher risk of depression, elevated C-reactive protein, and lower apolipoprotein A levels in PBMA consumers suggest potential inflammatory concerns that warrant further investigation.

Nophar Geifman, Jo Armes, Anthony David Whetton (2023) Identifying developments over a decade in the digital health and telemedicine landscape in the UK using quantitative text mining, In: Frontiers in Digital Health 5 1092008 Frontiers Media

DOI: 10.3389/fdgth.2023.1092008

The use of technologies that provide objective, digital data to clinicians, carers, and service users to improve care and outcomes comes under the unifying term Digital Health. This field, which includes the use of high-tech health devices, telemedicine and health analytics, has in recent years seen significant growth in the United Kingdom and worldwide. It is clearly acknowledged by multiple stakeholders that digital health innovations are necessary for the future of improved and more economic healthcare service delivery. Here we consider digital health-related research and applications by using an informatics tool to objectively survey the field. We have used a quantitative text-mining technique, applied to published works in the field of digital health, to capture and analyse key approaches taken and the disease areas where these have been applied. Key areas of research and application are shown to be cardiovascular, stroke, and hypertension, although the range seen is wide. We consider advances in digital health and telemedicine in light of the COVID-19 pandemic.

Matt Spick, Jan Higgins, Cynthia L. Green, Roland Matsouaka, Daniel B. Shin, Russell P. Hall III, Nophar Geifman (2024) Observations from statistical review editors: a commentary, In: JID Innovations 100302 Elsevier

DOI: 10.1016/j.xjidi.2024.100302

Reproducibility and replicability are crucial components of the scientific method, but they may be compromised when there are inherent issues related to a study and analytic choices such as statistical errors, or misalignments between the study’s objectives and implementation. Indeed, statistical errors and misunderstandings contribute to low reproducibility and replicability, hindering independent verification or changes in the direction of research (McNutt, 2014). Such problems can easily occur in health science, where there are many confounding factors and low prior odds of genuine findings (Ioannidis, 2005). Guidelines for statistical reporting that can minimize these issues are well-established, but are not always followed. In January of 2023, to help address these challenges in a more targeted way, JID Innovations established a statistical review board as part of its overall editorial process, nominating editors with expertise in statistical analysis and data science (Hall, 2023). All submissions to the journal are reviewed by one of the statistical review editors to provide specialist evaluation and feedback on study design, statistical tests, and analyses as well as bioinformatic aspects of the manuscript. In this commentary, common themes identified by statistical review editors in their peer reviews are brought forth along with comments that are made during the ‘routine’ peer review process in order to highlight prevalent issues in statistical methodologies and reporting seen in submissions to JID Innovations. The goal of this commentary is to propose easy steps that authors can take to inform study design at the outset of any data-driven project, reduce the number of potential revisions to statistical methodology and presentation in the original submission and ultimately to improve the reproducibility and replicability of the work published in JID Innovations, with the added benefit of a more efficient submission process.

Matthew Paul Spick, Amy Campbell, Ivona Baricevic-Jones, Johanna von Gerichten, Holly-May Lewis, Cecile France Frampas, Katie Longman, Alexander Stewart, Deborah Dunn-Walters, Debra Jean Skene, Nophar Geifman, Anthony D. Whetton, Melanie J. Bailey (2022)Multi-Omics Reveals Mechanisms of Partial Modulation of COVID-19 Dysregulation by Glucocorticoid Treatment, In: International journal of molecular sciences23(20)12079 MDPI

DOI: 10.3390/ijms232012079

Treatments for COVID-19 infections have improved dramatically since the beginning of the pandemic, and glucocorticoids have been a key tool in improving mortality rates. The UK’s National Institute for Health and Care Excellence guidance is for treatment to be targeted only at those requiring oxygen supplementation, however, and the interactions between glucocorticoids and COVID-19 are not completely understood. In this work, a multi-omic analysis of 98 inpatient-recruited participants was performed by quantitative metabolomics (using targeted liquid chromatography-mass spectrometry) and data-independent acquisition proteomics. Both ‘omics datasets were analysed for statistically significant features and pathways differentiating participants whose treatment regimens did or did not include glucocorticoids. Metabolomic differences in glucocorticoid-treated patients included the modulation of cortisol and bile acid concentrations in serum, but no alleviation of serum dyslipidemia or increased amino acid concentrations (including tyrosine and arginine) in the glucocorticoid-treated cohort relative to the untreated cohort. Proteomic pathway analysis indicated neutrophil and platelet degranulation as influenced by glucocorticoid treatment. These results are in keeping with the key role of platelet-associated pathways and neutrophils in COVID-19 pathogenesis and provide opportunity for further understanding of glucocorticoid action. The findings also, however, highlight that glucocorticoids are not fully effective across the wide range of ‘omics dysregulation caused by COVID-19 infections.

Hana F. Navratilova, Anthony D. Whetton, Nophar Geifman (2024)Artificial intelligence driven definition of food preference endotypes in UK Biobank volunteers is associated with distinctive health outcomes and blood based metabolomic and proteomic profiles, In: Journal of translational medicine22881 BMC

DOI: 10.1186/S12967-024-05663-0

Background: Specific food preferences can determine an individual’s dietary patterns and therefore may be associated with certain health risks and benefits. Methods: Using food preference questionnaire (FPQ) data from a subset comprising over 180,000 UK Biobank participants, we employed a Latent Profile Analysis (LPA) approach to identify the main patterns or profiles among participants. Blood biochemistry across groups/profiles was compared using the non-parametric Kruskal-Wallis test. We applied the Limma algorithm for differential abundance analysis on 168 metabolites and 2923 proteins, and utilized the Database for Annotation, Visualization and Integrated Discovery (DAVID) to identify enriched biological processes and pathways. Relative risks (RR) were calculated for chronic diseases and mental conditions per group, adjusting for sociodemographic factors. Results: Based on their food preferences, three profiles were termed: the putative Health-conscious group (low preference for animal-based or sweet foods, and high preference for vegetables and fruits), the Omnivore group (high preference for all foods), and the putative Sweet-tooth group (high preference for sweet foods and sweetened beverages). The Health-conscious group exhibited lower risk of heart failure (RR = 0.86, 95%CI 0.79 – 0.93) and chronic kidney disease (RR = 0.69, 95%CI 0.65 – 0.74) compared to the two other groups. The Sweet-tooth group had greater risk of depression (RR = 1.27, 95%CI 1.21 – 1.34), diabetes (RR = 1.15, 95%CI 1.01 – 1.31), and stroke (RR = 1.22, 95%CI 1.15 – 1.31) compared to the other two groups. Cancer (overall) relative risk showed little difference across the Health-conscious, Omnivore, and Sweet-tooth groups with RR of 0.98 (95%CI 0.96 – 1.01), 1.00 (95%CI 0.98 – 1.03), and 1.01 (95%CI 0.98 – 1.04), respectively.
The Health-conscious group was associated with lower levels of inflammatory biomarkers (e.g., C-reactive Protein) which are also known to be elevated in those with common metabolic diseases (e.g., cardiovascular disease). Among the other markers modulated in the Health-conscious group, ketone bodies, insulin-like growth factor-binding protein (IGFBP), and Growth Hormone 1 were more abundant, while leptin was less abundant. Further, the IGFBP pathway, which influences IGF1 activity, may be significantly enhanced by dietary choices. Conclusions: These observations align with previous findings from studies focusing on weight loss interventions, which include a reduction in leptin levels. Overall, the Health-conscious group, with a preference for healthier food options, has better health outcomes compared to the Sweet-tooth and Omnivore groups.
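For readers unfamiliar with the relative risk figures quoted above, the following is a minimal sketch of how an unadjusted RR and its 95% confidence interval can be computed from a 2x2 table using the standard log-RR approximation. The function name and the example counts are hypothetical and are not taken from the study, which additionally adjusted for sociodemographic factors.

```python
import math


def relative_risk(events_group, n_group, events_ref, n_ref, z=1.96):
    """Unadjusted relative risk with a Wald-type confidence interval.

    Hypothetical helper for illustration only; it does not reproduce the
    adjusted analysis described in the abstract.
    """
    risk_group = events_group / n_group
    risk_ref = events_ref / n_ref
    rr = risk_group / risk_ref
    # Standard error of log(RR) for a 2x2 table.
    se_log_rr = math.sqrt(
        1 / events_group - 1 / n_group + 1 / events_ref - 1 / n_ref
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Made-up counts: 30 events among 100 in the group of interest,
# 20 events among 100 in the reference group.
rr, lower, upper = relative_risk(30, 100, 20, 100)
print(f"RR = {rr:.2f}, 95%CI {lower:.2f} - {upper:.2f}")
```

An interval that spans 1.0, as in this made-up example, would be read as showing no clear difference in risk between the groups.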

Anthony Onoja, Johanna Von Gerichten, Holly-May Lewis, Melanie Jane Bailey, Debra Jean Skene, Nophar Geifman, Matt Spick (2023)Meta-Analysis of COVID-19 Metabolomics Identifies Variations in Robustness of Biomarkers, In: International journal of molecular sciences24(18)14371 MDPI

DOI: 10.3390/ijms241814371

The global COVID-19 pandemic resulted in widespread harms but also rapid advances in vaccine development, diagnostic testing, and treatment. As the disease moves to endemic status, the need to identify characteristic biomarkers of the disease for diagnostics or therapeutics has lessened, but lessons can still be learned to inform biomarker research in dealing with future pathogens. In this work, we test five sets of research-derived biomarkers against an independent targeted and quantitative Liquid Chromatography-Mass Spectrometry metabolomics dataset to evaluate how robustly these proposed panels would distinguish between COVID-19-positive and negative patients in a hospital setting. We further evaluate a crowdsourced panel comprising the COVID-19 metabolomics biomarkers most commonly mentioned in the literature between 2020 and 2023. The best-performing panel in the independent dataset, measured by F1 score (0.76) and AUROC (0.77), included nine biomarkers: lactic acid, glutamate, aspartate, phenylalanine, β-alanine, ornithine, arachidonic acid, choline, and hypoxanthine. Panels comprising fewer metabolites performed less well, showing weaker statistical significance in the independent cohort than originally reported in their respective discovery studies. Whilst the studies reviewed here were small and may be subject to confounders, it is desirable that biomarker panels be resilient across cohorts if they are to find use in the clinic, highlighting the importance of assessing the robustness and reproducibility of metabolomics analyses in independent populations.
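The two performance measures named in this abstract can be illustrated with a short sketch. The functions below are plain-Python definitions of the F1 score and AUROC for a binary classifier. The labels and scores are invented for illustration and have no connection to the study's data or pipeline.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented example: five patients, true status and classifier outputs.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.3, 0.6, 0.8]
print(f"F1 = {f1_score(y_true, y_pred):.2f}, AUROC = {auroc(y_true, scores):.2f}")
```

F1 depends on a fixed classification threshold, whereas AUROC summarises ranking quality across all thresholds, which is one reason the two are often reported together.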

Stephanie Shoop-Worrall, Kimme Hyrich, Lucy R. Wedderburn, Nophar Geifman (2022)Distinct Clusters of JIA at Methotrexate Initiation Identified Using Topological Data Analysis

Richard Beesley, Freya Luling Feilding, Anna CLUSTER Consortium Champions, Elizabeth C. Rosser, Stephanie J. W. Shoop-Worrall, Alyssia Mcneece, Zoe Wanstall, Kimme Hyrich, Lucy R. Wedderburn (2024)Development and implementation of 'A guide to PPIE - Early Integration into Research Proposals' in a multi-disciplinary consortium, In: Rheumatology (Oxford, England)63(3)pp. e88-e91 Oxford Univ Press

DOI: 10.1093/rheumatology/kead482

Nophar Geifman, Atul J. Butte (2016)DO CANCER CLINICAL TRIAL POPULATIONS TRULY REPRESENT CANCER PATIENTS? A COMPARISON OF OPEN CLINICAL TRIALS TO THE CANCER GENOME ATLAS, In: R B Altman, A K Dunker, L Hunter, M D Ritchie, T Murray, T E Klein (eds.), PACIFIC SYMPOSIUM ON BIOCOMPUTING 201621pp. 309-320 World Scientific

DOI: 10.1142/9789814749411_0029

Open clinical trial data offer many opportunities for the scientific community to independently verify published results, evaluate new hypotheses and conduct meta-analyses. These data provide a springboard for scientific advances in precision medicine but the question arises as to how representative clinical trials data are of cancer patients overall. Here we present the integrative analysis of data from several cancer clinical trials and compare these to patient-level data from The Cancer Genome Atlas (TCGA). Comparison of cancer type-specific survival rates reveals that these are overall lower in trial subjects. This effect, at least to some extent, can be explained by the more advanced stages of cancer of trial subjects. This analysis also reveals that for stage IV cancer, colorectal cancer patients have a better chance of survival than breast cancer patients. On the other hand, for all other stages, breast cancer patients have better survival than colorectal cancer patients. Comparison of survival in different stages of disease between the two datasets reveals that subjects with stage IV cancer from the trials dataset have a lower chance of survival than matching stage IV subjects from TCGA. One likely explanation for this observation is that stage IV trial subjects have lower survival rates since their cancer is less likely to respond to treatment. To conclude, we present here a newly available clinical trials dataset which allowed for the integration of patient-level data from many cancer clinical trials. Our comprehensive analysis reveals that cancer-related clinical trials are not representative of general cancer patient populations, mostly due to their focus on the more advanced stages of the disease. These and other limitations of clinical trials data should, perhaps, be taken into consideration in medical research and in the field of precision medicine.

Nophar Geifman, Raphael Cohen, Eitan Rubin (2013)Redefining meaningful age groups in the context of disease, In: Age35(6)pp. 2357-2366 Springer Netherlands

DOI: 10.1007/s11357-013-9510-6

Age is an important factor when considering phenotypic changes in health and disease. Currently, the use of age information in medicine is somewhat simplistic, with ages commonly being grouped into a small number of crude ranges reflecting the major stages of development and aging, such as childhood or adolescence. Here, we investigate the possibility of redefining age groups using the recently developed Age-Phenome Knowledge-base (APK) that holds over 35,000 literature-derived entries describing relationships between age and phenotype. Clustering of APK data suggests 13 new, partially overlapping, age groups. The diseases that define these groups suggest that the proposed divisions are biologically meaningful. We further show that the number of different age ranges that should be considered depends on the type of disease being evaluated. This finding was further strengthened by similar results obtained from clinical blood measurement data. The grouping of diseases that share a similar pattern of disease-related reports directly mirrors, in some cases, medical knowledge of disease–age relationships. In other cases, our results may be used to generate new and reasonable hypotheses regarding links between diseases.

Nophar Geifman, Sanchita Bhattacharya, Atul J Butte (2016)Immune modulators in disease: integrating knowledge from the biomedical literature and gene expression, In: Journal of the American Medical Informatics Association : JAMIA23(3)617pp. 617-626

DOI: 10.1093/jamia/ocv166

Cytokines play a central role in both health and disease, modulating immune responses and acting as diagnostic markers and therapeutic targets. This work takes a systems-level approach for integration and examination of immune patterns, such as cytokine gene expression with information from biomedical literature, and applies it in the context of disease, with the objective of identifying potentially useful relationships and areas for future research. We present herein the integration and analysis of immune-related knowledge, namely, information derived from biomedical literature and gene expression arrays. Cytokine-disease associations were captured from over 2.4 million PubMed records, in the form of Medical Subject Headings descriptor co-occurrences, as well as from gene expression arrays. Clustering of cytokine-disease co-occurrences from biomedical literature is shown to reflect current medical knowledge as well as potentially novel relationships between diseases. A correlation analysis of cytokine gene expression in a variety of diseases revealed compelling relationships. Finally, a novel analysis comparing cytokine gene expression in different diseases to parallel associations captured from the biomedical literature was used to examine which associations are interesting for further investigation. We demonstrate the usefulness of capturing Medical Subject Headings descriptor co-occurrences from biomedical publications in the generation of valid and potentially useful hypotheses. Furthermore, integrating and comparing descriptor co-occurrences with gene expression data was shown to be useful in detecting new, potentially fruitful, and unaddressed areas of research. Using integrated large-scale data captured from the scientific literature and experimental data, a better understanding of the immune mechanisms underlying disease can be achieved and applied to research.

Johanna von Gerichten, Kyle Saunders, Melanie J. Bailey, Lee A. Gethings, Anthony Onoja, Nophar Geifman, Matt Spick (2024)Challenges in Lipidomics Biomarker Identification: Avoiding the Pitfalls and Improving Reproducibility, In: Metabolites14(8)461 MDPI

DOI: 10.3390/metabo14080461

Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in biomarker identification. In this work, we illustrate the reproducibility gap for two open-access lipidomics platforms, MS DIAL and Lipostar, finding just 14.0% identification agreement when analyzing identical LC-MS spectra using default settings. Whilst the software platforms performed more consistently using fragmentation data, agreement was still only 36.1% for MS2 spectra. This highlights the critical importance of validation across positive and negative LC-MS modes, as well as the manual curation of spectra and lipidomics software outputs, in order to reduce identification errors caused by closely related lipids and co-elution issues. This curation process can be supplemented by data-driven outlier detection in assessing spectral outputs, which is demonstrated here using a novel machine learning approach based on support vector machine regression combined with leave-one-out cross-validation. These steps are essential to reduce the frequency of false positive identifications and close the reproducibility gap, including between software platforms, which, for downstream users such as bioinformaticians and clinicians, can be an underappreciated source of biomarker identification errors.

Philip J Scott, Ronald Cornet, Colin McCowan, Niels Peek, Paolo Fraccaro, Nophar Geifman, Wouter T Gude, William Hulme, Glen P Martin, Richard Williams (2017)Informatics for Health 2017: Advancing both science and practice, In: Journal of innovation in health informatics24(1)1pp. 1-185

DOI: 10.14236/jhi.v24i1.939

The Informatics for Health congress, 24-26 April 2017, in Manchester, UK, brought together the Medical Informatics Europe (MIE) conference and the Farr Institute International Conference. This special issue of the Journal of Innovation in Health Informatics contains 113 presentation abstracts and 149 poster abstracts from the congress. The twin programmes of "Big Data" and "Digital Health" are not always joined up by coherent policy and investment priorities. Substantial global investment in health IT and data science has led to sound progress but highly variable outcomes. Society needs an approach that brings together the science and the practice of health informatics. The goal is multi-level Learning Health Systems that consume and intelligently act upon both patient data and organizational intervention outcomes. Informatics for Health demonstrated the art of the possible, seen in the breadth and depth of our contributions. We call upon policy makers, research funders and programme leaders to learn from this joined-up approach.

Nophar Geifman, Eitan Rubin (2011)Towards an Age-Phenome Knowledge-base, In: BMC bioinformatics12(1)229pp. 229-229 BioMed Central

DOI: 10.1186/1471-2105-12-229

Ilia Zhidkov, Raphael Cohen, Nophar Geifman, Dan Mishmar, Eitan Rubin (2011)CHILD: a new tool for detecting low-abundance insertions and deletions in standard sequence traces, In: Nucleic acids research39(7)pp. e47-e47 Oxford University Press

DOI: 10.1093/nar/gkq1354

Several methods have been proposed for detecting insertion/deletions (indels) from chromatograms generated by Sanger sequencing. However, most such methods are unsuitable when the mutated and normal variants occur at unequal ratios, such as is expected to be the case in cancer, with organellar DNA or with alternatively spliced RNAs. In addition, the current methods do not provide robust estimates of the statistical confidence of their results, and the sensitivity of this approach has not been rigorously evaluated. Here, we present CHILD, a tool specifically designed for indel detection in mixtures where one variant is rare. CHILD makes use of standard sequence alignment statistics to evaluate the significance of the results. The sensitivity of CHILD was tested by sequencing controlled mixtures of deleted and undeleted plasmids at various ratios. Our results indicate that CHILD can identify deleted molecules present as just 5% of the mixture. Notably, the results were plasmid/primer-specific; for some primers and/or plasmids, the deleted molecule was only detected when it comprised 10% or more of the mixture. The false positive rate was estimated to be lower than 0.4%. CHILD was implemented as a user-oriented web site, providing a sensitive and experimentally validated method for the detection of rare indel-carrying molecules in common Sanger sequence reads.

Nophar Geifman, Richard E. Kennedy, Lon S. Schneider, Iain Buchan, Roberta Diaz Brinton (2018)Data-driven identification of endophenotypes of Alzheimer's disease progression: implications for clinical trials and therapeutic interventions, In: Alzheimer's research & therapy10(1)4pp. 4-4 Springer Nature

DOI: 10.1186/s13195-017-0332-0

Background: Given the complex and progressive nature of Alzheimer's disease (AD), a precision medicine approach for diagnosis and treatment requires the identification of patient subgroups with biomedically distinct and actionable phenotype definitions. Methods: Longitudinal patient-level data for 1160 AD patients receiving placebo or no treatment with a follow-up of up to 18 months were extracted from an integrated clinical trials dataset. We used latent class mixed modelling (LCMM) to identify patient subgroups demonstrating distinct patterns of change over time in disease severity, as measured by the Alzheimer's Disease Assessment Scale-cognitive subscale score. The optimal number of subgroups (classes) was selected by the model which had the lowest Bayesian Information Criterion. Other patient-level variables were used to define these subgroups' distinguishing characteristics and to investigate the interactions between patient characteristics and patterns of disease progression. Results: The LCMM resulted in three distinct subgroups of patients, with 10.3% in Class 1, 76.5% in Class 2 and 13.2% in Class 3. While all classes demonstrated some degree of cognitive decline, each demonstrated a different pattern of change in cognitive scores, potentially reflecting different subtypes of AD patients. Class 1 represents rapid decliners with a steep decline in cognition over time, and who tended to be younger and better educated. Class 2 represents slow decliners, while Class 3 represents severely impaired slow decliners: patients with a similar rate of decline to Class 2 but with worse baseline cognitive scores. Class 2 demonstrated a significantly higher proportion of patients with a history of statins use; Class 3 showed lower levels of blood monocytes and serum calcium, and higher blood glucose levels. 
Conclusions: Our results, 'learned' from clinical data, indicate the existence of at least three subgroups of Alzheimer's patients, each demonstrating a different trajectory of disease progression. This hypothesis-generating approach has detected distinct AD subgroups that may prove to be discrete endophenotypes linked to specific aetiologies. These findings could enable stratification within a clinical trial or study context, which may help identify new targets for intervention and guide better care.

Nophar Geifman, Roberta Diaz Brinton, Richard E Kennedy, Lon S Schneider, Atul J Butte (2017)Evidence for benefit of statins to modify cognitive decline and risk in Alzheimer's disease, In: Alzheimer's research & therapy9(1)10

DOI: 10.1186/s13195-017-0237-y

Despite substantial research and development investment in Alzheimer's disease (AD), effective therapeutics remain elusive. Significant emerging evidence has linked cholesterol, β-amyloid and AD, and several studies have shown a reduced risk for AD and dementia in populations treated with statins. However, while some clinical trials evaluating statins in general AD populations have been conducted, these resulted in no significant therapeutic benefit. By focusing on subgroups of the AD population, it may be possible to detect endotypes responsive to statin therapy. Here we investigate the possible protective and therapeutic effect of statins in AD through the analysis of datasets of integrated clinical trials, and prospective observational studies. Re-analysis of AD patient-level data from failed clinical trials suggested by trend that use of simvastatin may slow the progression of cognitive decline, and to a greater extent in ApoE4 homozygotes. Evaluation of continual long-term use of various statins, in participants from multiple studies at baseline, revealed better cognitive performance in statin users. These findings were supported in an additional, observational cohort where the incidence of AD was significantly lower in statin users, and ApoE4/ApoE4-genotyped AD patients treated with statins showed better cognitive function over the course of 10-year follow-up. These results indicate that the use of statins may benefit all AD patients with potentially greater therapeutic efficacy in those homozygous for ApoE4.

N. Geifman, A. J. Butte (2016)Analysis: A patient-level data meta-analysis of standard-of-care treatments from eight prostate cancer clinical trials, In: Scientific data3160027 NATURE PORTFOLIO

DOI: 10.1038/sdata.2016.27

Open clinical trial data offer many opportunities for the scientific community to independently verify published results, evaluate new hypotheses and conduct meta-analyses. These data provide valuable opportunities for scientific advances in medical research. Herein we present the comparative meta-analysis of different standard of care treatments from newly available comparator arm data from several prostate cancer clinical trials. Comparison of survival rates following treatment with mitoxantrone or docetaxel in combination with prednisone as well as prednisone alone, validated the previously demonstrated superiority of treatment with docetaxel. Additionally, comparison of four testosterone suppression treatments in hormone-refractory prostate cancer revealed that subjects who had undergone surgical castration had significantly lower survival rates than those treated with LHRH, anti-androgen or LHRH plus anti-androgen, suggesting that this treatment option is less optimal. This study illustrates how the use of patient-level clinical trial data enables meta-analyses that can provide new insights into clinical outcomes of standard of care treatments and thus, once validated, has the potential to help optimize healthcare delivery.

Nophar Geifman, Alon Monsonego, Eitan Rubin (2010)The Neural/Immune Gene Ontology: clipping the Gene Ontology for neurological and immunological systems, In: BMC bioinformatics11(1)458pp. 458-458 BioMed Central

DOI: 10.1186/1471-2105-11-458

Nophar Geifman, Jennifer Bollyky, Sanchita Bhattacharya, Atul J. Butte (2015)Opening clinical trial data: are the voluntary data-sharing portals enough?, In: BMC medicine13(1)280pp. 280-280 Springer Nature

DOI: 10.1186/s12916-015-0525-y

Data generated by the numerous clinical trials conducted annually worldwide have the potential to be extremely beneficial to the scientific and patient communities. This potential is well recognized and efforts are being made to encourage the release of raw patient-level data from these trials to the public. The issue of sharing clinical trial data has recently gained attention, with many agreeing that this type of data should be made available for research in a timely manner. The availability of clinical trial data is most important for study reproducibility, meta-analyses, and improvement of study design. There is much discussion in the community over key data sharing issues, including the risks this practice holds. However, one aspect that remains to be adequately addressed is that of the accessibility, quality, and usability of the data being shared. Herein, experiences with the two current major platforms used to store and disseminate clinical trial data are described, discussing the issues encountered and suggesting possible solutions.

Toby Wilkinson, Siddharth Sinha, Niels Peek, Nophar Geifman (2019)Clinical trial data reuse - overcoming complexities in trial design and data sharing, In: Trials20(1)513pp. 513-513 Springer Nature

DOI: 10.1186/s13063-019-3627-6

There are many acknowledged benefits for the reuse of clinical trial data; from independent verification of published results to the evaluation of new hypotheses. However, the reuse of shared clinical trial data is not without obstacles. Here we present some of the issues and lessons learned from our own experiences in accessing and analyzing trial data; specifically, where we aim to combine and pool data from multiple different trials. In addition to issues around missing annotation and incomplete datasets, we identify trial-design complexity as a potential hurdle that may complicate downstream analyses. We address potential solutions and emphasize the need for, and benefits of, transparent sharing and analysis of participant-level clinical trial data with appropriate risk mitigation, a matter important to efficient clinical research.

Nophar Geifman, Eitan Rubin (2012)The age-phenome database, In: SpringerPlus1(1)4pp. 4-4 Springer

DOI: 10.1186/2193-1801-1-4

Data linking specific ages or age ranges with disease are abundant in biomedical literature. However, these data are organized such that searching for age-phenotype relationships is difficult. Recently, we described the Age-Phenome Knowledge-base (APK), a computational platform for storage and retrieval of information concerning age-related phenotypic patterns. Here, we report that data derived from over 1.5 million human-related PubMed abstracts have been added to APK. Using a text-mining pipeline, 35,683 entries which describe relationships between age and phenotype (such as disease) have been introduced into the database. Comparing the results to those obtained by a human reader reveals that the overall accuracy of these entries is estimated to exceed 80%. The usefulness of these data for obtaining new insight regarding age-disease relationships is demonstrated using clustering analysis, which is shown to capture obvious, as well as potentially interesting relationships between diseases. In addition, a new tool for browsing and searching the APK database is presented. We thus present a unique resource and a new framework for studying age-disease relationships and other phenotypic processes.

Nophar Geifman, Izhak Haviv, Razelle Kurzrock, Eitan Rubin (2014)Promoting Precision Cancer Medicine through a Community-Driven Knowledgebase, In: Journal of personalized medicine4(4)475pp. 475-488 MDPI

DOI: 10.3390/jpm4040475

Increasing efforts are being dedicated towards improving cancer care via personalized medicine. These efforts depend to a large degree on the availability of a knowledge foundation. Unfortunately, existing knowledge linking cancer drugs and potential efficacy biomarkers is in its infancy; and where links are known, they are frequently unstructured and poorly documented. We have developed a new open-access knowledgebase for precision cancer medicine (the PCM Wiki and Knowledgebase). This knowledgebase was constructed using an innovative, two-pronged approach involving a structured knowledgebase at the back-end, and an intuitive knowledge-sharing interface and user-friendly query engine in front. The knowledgebase was seeded with several patient case reports and information was mined via text-mining and literature review by human curators. Using our novel Wiki-based platform to present and share knowledge stored in the PCM knowledgebase, users are able to suggest corrections, propose additions or point to errors in the knowledgebase. The result is a community-driven evolving knowledgebase holding integrated and consolidated knowledge of markers and indications for personalized cancer medicine. We suggest that the PCM Knowledgebase and Wiki could serve as an important tool for the advancement of clinical trials and care in the field of precision cancer medicine.

Nophar Geifman, Eitan Rubin (2013)The Mouse Age Phenome Knowledgebase and Disease-Specific Inter-Species Age Mapping, In: PloS one8(12)81114pp. e81114-e81114 Public Library Science

DOI: 10.1371/journal.pone.0081114

Background: Similarities between mice and humans lead to generation of many mouse models of human disease. However, differences between the species often result in mice being unreliable as preclinical models for human disease. One difference that might play a role in lowering the predictivity of mouse models for human diseases is age. Despite the important role age plays in medicine, it is too often considered only casually when considering mouse models. Methods: We developed the mouse-Age Phenotype Knowledgebase, which holds knowledge about age-related phenotypic patterns in mice. The knowledgebase was extensively populated with literature-derived data using text mining techniques. We then mapped between ages in humans and mice by comparing the age distribution pattern for 887 diseases in both species. Results: The knowledgebase was populated with over 9800 instances generated by a text-mining pipeline. The quality of the data was manually evaluated, and was found to be of high accuracy (estimated precision >86%). Furthermore, grouping together diseases that share similar age patterns in mice resulted in clusters that mirror actual biomedical knowledge. Using these data, we matched age distribution patterns in mice and in humans, allowing for age differences by shifting either of the patterns. High correlation (r² > 0.5) was found for 223 diseases. The results clearly indicate a difference in the age mapping between different diseases: age 30 years in humans is mapped to 120 days in mice for Leukemia, but to 295 days for Anemia. Based on these results we generated a mice-to-human age map which is publicly available. Conclusions: We present here the development of the mouse-APK, its population with literature-derived data and its use to map ages in mice and humans for 223 diseases. These results present a further step towards bridging the gap between humans and mice in biomedical research.

Nophar Geifman, Hannah Lennon, Niels Peek (2018) Patient Stratification Using Longitudinal Data - Application of Latent Class Mixed Models, In: Studies in health technology and informatics 247, pp. 176-180

DOI: 10.3233/978-1-61499-852-5-176

Analysis of longitudinal data in medical research is becoming increasingly important, in particular for the identification of patient subgroups, as the focus of medical research is shifting toward personalised medicine. Here we present the use of a statistical learning approach for the identification of subgroups of hypertension patients demonstrating different patterns of response to treatment. This method, applied to large-scale patient-level data, has identified three such groups found to be associated with different clinical characteristics. We further consider the utility of this method in medical research by comparison to the application in two additional studies.

Ryan Malcolm Hum, James B. Lilleker, Janine A. Lamb, Alexander G. S. Oldroyd, Guochun Wang, Lucy R. Wedderburn, Louise P. Diederichsen, Jens Schmidt, Maria Giovanna Danieli, Paula Oakley, Zoltan Griger, Thuy Nguyen Thi Phuong, Chanakya Kodishala, Monica Vazquez-Del Mercado, Helena Andersson, Boel De Paepe, Jan L. De Bleecker, Britta Maurer, Liza McCann, Nicolo Pipitone, Neil McHugh, Robert Paul New, William E. Ollier, Niels Steen Krogh, Jiri Vencovsky, Ingrid E. Lundberg, Hector Chinoy, Nophar Geifman (2024) Comparison of clinical features between patients with anti-synthetase syndrome and dermatomyositis: results from the MYONET registry, In: Rheumatology (Oxford, England) 63(8), pp. 2093-2100 Oxford Univ Press

DOI: 10.1093/rheumatology/kead481

Objectives: To compare clinical characteristics, including the frequency of cutaneous, extramuscular manifestations and malignancy, between adults with anti-synthetase syndrome (ASyS) and dermatomyositis (DM). Methods: Using data regarding adults from the MYONET registry, a cohort of DM patients with anti-Mi2/-TIF1 gamma/-NXP2/-SAE/-MDA5 autoantibodies, and a cohort of ASyS patients with anti-tRNA synthetase autoantibodies (anti-Jo1/-PL7/-PL12/-OJ/-EJ/-Zo/-KS) were identified. Patients with DM sine dermatitis or with discordant dual autoantibody specificities were excluded. Sub-cohorts of patients with ASyS with or without skin involvement were defined based on presence of DM-type rashes (heliotrope rash, Gottron's papules/sign, violaceous rash, shawl sign, V-sign, erythroderma, and/or periorbital rash). Results: In total 1054 patients were included (DM, n = 405; ASyS, n = 649). In the ASyS cohort, 31% (n = 203) had DM-type skin involvement (ASyS-DMskin). A higher frequency of extramuscular manifestations, including mechanic's hands, Raynaud's phenomenon, arthritis, interstitial lung disease and cardiac involvement, differentiated ASyS-DMskin from DM (all P < 0.001), whereas higher frequency of any of four DM-type rashes – heliotrope rash (n = 248, 61% vs n = 90, 44%), violaceous rash (n = 166, 41% vs n = 57, 9%), V-sign (n = 124, 31% vs n = 28, 4%), and shawl sign (n = 133, 33% vs n = 18, 3%) – differentiated DM from ASyS-DMskin (all P < 0.005). Cancer-associated myositis (CAM) was more frequent in DM (n = 67, 17%) compared with ASyS (n = 21, 3%) and ASyS-DMskin (n = 7, 3%) cohorts (both P < 0.001). Conclusion: DM-type rashes are frequent in patients with ASyS; however, distinct clinical manifestations differentiate these patients from classical DM. Skin involvement in ASyS does not necessitate increased malignancy surveillance. These findings will inform future ASyS classification criteria and patient management.

Zsolt Zador, Andrew T. King, Nophar Geifman (2018) New drug candidates for treatment of atypical meningiomas: An integrated approach using gene expression signatures for drug repurposing, In: PloS one 13(3), 0194701, pp. e0194701-e0194701 Public Library Science

DOI: 10.1371/journal.pone.0194701

Background Atypical meningiomas are common central nervous system neoplasms with high recurrence rate and poorer prognosis compared to their grade I counterparts. Surgical excision and radiotherapy remain the mainstay therapy but medical treatments are limited. We explore new drug candidates using computational drug repurposing based on the gene expression signature of atypical meningioma tissue with subsequent analysis of drug-generated expression profiles. We further explore possible mechanisms of action for the identified drug candidates using ingenuity pathway analysis (IPA). Methods We extracted gene expression profiles for atypical meningiomas (12 samples) and normal meningeal tissue (4 samples) from the Gene Expression Omnibus, which were then used to generate a gene signature comprising 281 differentially expressed genes. Drug candidates were explored using both the Broad Institute Connectivity Map (cmap) and Library of Integrated Network-Based Cellular Signatures (LINCS). Functional analysis of significant differential gene expression for drug candidates was performed with IPA. Results Using our integrated approach, we identified multiple, already licensed, drug candidates such as emetine, verteporfin, phenoxybenzamine and trazodone. Analysis with IPA revealed that these drugs target signal cascades potentially relevant in pathogenesis of meningiomas; particular examples are the effect on ERK by trazodone, MAP kinases by ... Conclusion Gene expression profiling and use of drug expression profiles have yielded several plausible drug candidates for treating atypical meningioma, some of which have already been suggested by preceding studies. Although our analyses suggested multiple anti-tumour mechanisms for these drugs, further in vivo studies are required for validation.

Chris Flood, Shashivadan P Hirani, Kathleen Mulligan, Jo Taylor, Sally Harris, Lucy R Wedderburn, Stanton P Newman, Nophar Geifman (2024) Economic evaluation of a trial exploring the effects of a web-based support tool for parents of children with juvenile idiopathic arthritis, In: Rheumatology (Oxford, England) 63(SI2), pp. SI136-SI142

DOI: 10.1093/rheumatology/keae188

This study explored the cost-effectiveness of a web-based support tool for parents of children with juvenile idiopathic arthritis (JIA). A multi-centred randomized controlled trial was conducted in paediatric rheumatology centres in England. The WebParC intervention consisted of online information about JIA and its treatment and a toolkit using cognitive-behavioural therapy principles to help parents manage their child's JIA. An economic evaluation was performed alongside the trial involving 220 parents. The primary outcome was the self-report Pediatric Inventory for Parents (PIP) measure of illness-related parenting stress, with two dimensions: difficulty and frequency. These measures along with costs were assessed post intervention at 4 and 12 months. Costs were calculated for healthcare usage using a UK NHS economic perspective. Data were collected and analysed on the impact of caring costs on families. Uncertainty around cost-effectiveness was explored using bootstrapping and cost-effectiveness acceptability curves. The intervention arm showed improved average PIP scores for the dimensions of frequency and difficulty, of 1.5 and 3.6 respectively at 4 months and 0.35 and 0.39 at 12 months. At both 4- and 12-month follow-up, the average total cost per case was higher in the control group when compared with the intervention arm with mean differences of £360 (95% CI £29.6 to £691) at 4 months and £203 (95% CI £16 to £390) at 12 months. The probability of the intervention being cost-effective ranged between 49% and 54%. The WebParC intervention led to reductions in primary and secondary healthcare resource use and costs at 4 and 12 months. The intervention demonstrated particular savings for rheumatology services at both follow-ups. Future economies of scale could be realised by health providers with increased opportunities for cost-effectiveness over time. Trial registration: ISRCTN, ISRCTN13159730.

Helen Le Sueur, Ian N. Bruce, Nophar Geifman (2020) The challenges in data integration - heterogeneity and complexity in clinical trials and patient registries of Systemic Lupus Erythematosus, In: BMC medical research methodology 20(1), 164, pp. 164-164 Springer Nature

DOI: 10.1186/s12874-020-01057-0

Background Individual clinical trials and cohort studies are a useful source of data, often under-utilised once a study has ended. Pooling data from multiple sources could increase sample sizes and allow for further investigation of treatment effects; even if the original trial did not meet its primary goals. Through the MASTERPLANS (MAximizing Sle ThERapeutic PotentiaL by Application of Novel and Stratified approaches) national consortium, focused on Systemic Lupus Erythematosus (SLE), we have gained valuable real-world experiences in aligning, harmonising and combining data from multiple studies and trials, specifically where standards for data capture, representation and documentation, were not used or were unavailable. This was not without challenges arising both from the inherent complexity of the disease and from differences in the way data were captured and represented across different studies. Main body Data were, unavoidably, aligned by hand, matching up equivalent or similar patient variables across the different studies. Heterogeneity-related issues were tackled and data were cleaned, organised and combined, resulting in a single large dataset ready for analysis. Overcoming these hurdles, often seen in large-scale data harmonization and integration endeavours of legacy datasets, was made possible within a realistic timescale and limited resource by focusing on specific research questions driven by the aims of MASTERPLANS. Here we describe our experiences tackling the complexities in the integration of large, diverse datasets, and the lessons learned. Conclusions Harmonising data across studies can be complex, and time and resource consuming. The work carried out here highlights the importance of using standards for data capture, recording, and representation, to facilitate both the integration of large datasets and comparison between studies. 
Where standards are not implemented at the source, harmonisation is still possible by taking a flexible approach, with systematic preparation and a focus on specific research questions.

Adrian Heald, Narges Azadbakht, Bethany Geary, Silke Conen, Helene Fachim, Dave Chi Hoo Lee, Nophar Geifman, Sanam Farman, Oliver Howes, Anthony Whetton, Bill Deakin (2020) Application of SWATH mass spectrometry in the identification of circulating proteins does not predict future weight gain in early psychosis, In: Clinical proteomics 17(1), 38, pp. 38-38

DOI: 10.1186/s12014-020-09299-2

Weight gain is a common consequence of treatment with antipsychotic drugs in early psychosis, leading to further morbidity and poor treatment adherence. Identifying tools that can predict weight change in early psychosis may contribute to better-individualised treatment and adherence. Recently we showed that proteomic profiling with sequential window acquisition of all theoretical fragment ion spectra (SWATH) mass spectrometry (MS) can identify individuals with pre-diabetes more likely to experience weight change in relation to lifestyle change. We investigated whether baseline proteomic profiles predicted weight change over time using data from the BeneMin clinical trial of the anti-inflammatory antibiotic, minocycline, versus placebo. Expression levels for 844 proteins were determined by SWATH proteomics in 83 people (60 men and 23 women). Hierarchical clustering analysis and principal component analysis of baseline proteomics data did not reveal distinct separation between the proteome profiles of participants in different weight change categories. However, individuals with the highest weight loss had higher Positive and Negative Syndrome Scale (PANSS) scores. Our findings imply that mode of treatment i.e. the pharmacological intervention for psychosis may be the determining factor in weight change after diagnosis, rather than predisposing proteomic dynamics.

Kathryn A. McGurk, Arianna Dagliati, Davide Chiasserini, Dave Lee, Darren Plant, Ivona Baricevic-Jones, Janet Kelsall, Rachael Eineman, Rachel Reed, Bethany Geary, Richard Unwin, Anna Nicolaou, Bernard D. Keavney, Anne Barton, Anthony D. Whetton, Nophar Geifman (2020) The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination, In: Bioinformatics 36(7), pp. 2217-2223 Oxford Univ Press

DOI: 10.1093/bioinformatics/btz898

Motivation: Data-independent acquisition mass spectrometry allows for more comprehensive peptide detection and relative quantification than standard data-dependent approaches. While less prone to missing values, these still exist. Current approaches for handling the so-called missingness have challenges. We hypothesized that non-random missingness is a useful biological measure and demonstrate the importance of analysing missingness for proteomic discovery within a longitudinal study of disease activity. Results: The magnitude of missingness did not correlate with mean peptide concentration. The magnitude of missingness for each protein strongly correlated between collection time points (baseline, 3 months, 6 months; R = 0.95-0.97, confidence interval = 0.94-0.97), indicating little time-dependent effect. This allowed for the identification of proteins with outlier levels of missingness that differentiate between the patient groups characterized by different patterns of disease activity. The association of these proteins with disease activity was confirmed by machine learning techniques. Our novel approach complements analyses on complete observations and other missing value strategies in biomarker prediction of disease activity.
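
As a rough illustration of the per-protein missingness comparison summarised above, the sketch below computes the fraction of missing values per protein at two time points and correlates them. It is an assumption-laden toy, not the published method: the data layout (protein name mapped to a list of sample values, with `None` for missing), the helper names and the values are all invented.

```python
from math import sqrt


def missingness_by_protein(table):
    """Map each protein to the fraction of samples in which it is missing (None)."""
    return {protein: sum(v is None for v in values) / len(values)
            for protein, values in table.items()}


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Toy intensity tables (hypothetical proteins P1..P3, four samples each).
baseline = {
    "P1": [1.0, None, 2.0, None],
    "P2": [1.0, 2.0, 3.0, 4.0],
    "P3": [None, None, None, 1.0],
}
month_3 = {
    "P1": [None, 1.0, None, 2.0],
    "P2": [1.0, 2.0, 3.0, None],
    "P3": [None, None, None, None],
}

miss_baseline = missingness_by_protein(baseline)
miss_month_3 = missingness_by_protein(month_3)

# Correlate missingness across time points over the shared proteins.
shared = sorted(set(miss_baseline) & set(miss_month_3))
r_between_timepoints = pearson_r([miss_baseline[p] for p in shared],
                                 [miss_month_3[p] for p in shared])
```

A high correlation between time points, as reported in the abstract, would suggest that missingness is a stable property of each protein rather than sampling noise, which is what makes outlier proteins worth inspecting.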

Sean P Gavan, Ian N Bruce, Katherine Payne, Nophar Geifman (2023) Valuing Health Gain from Composite Response Endpoints for Multisystem Diseases, In: Value in health 26(1), pp. 115-122

DOI: 10.1016/j.jval.2022.07.001

This study aimed to demonstrate how to estimate the value of health gain after patients with a multisystem disease achieve a condition-specific composite response endpoint. Data from patients treated in routine practice with an exemplar multisystem disease (systemic lupus erythematosus) were extracted from a national register (British Isles Lupus Assessment Group Biologics Register). Two bespoke composite response endpoints (Major Clinical Response and Improvement) were developed in advance of this study. Difference-in-differences regression compared health utility values (3-level version of EQ-5D; UK tariff) over 6 months for responders and nonresponders. Bootstrapped regression estimated the incremental quality-adjusted life-years (QALYs), probability of QALY gain after achieving the response criteria, and population monetary benefit of response. Within the sample (n = 171), 18.2% achieved Major Clinical Response and 49.1% achieved Improvement at 6 months. Incremental health utility values were 0.0923 for Major Clinical Response and 0.0454 for Improvement. Expected incremental QALY gain at 6 months was 0.020 for Major Clinical Response and 0.012 for Improvement. Probability of QALY gain after achieving the response criteria was 77.6% for Major Clinical Response and 72.7% for Improvement. Population monetary benefit of response was £1 106 458 for Major Clinical Response and £649 134 for Improvement. Bespoke composite response endpoints are becoming more common to measure treatment response for multisystem diseases in trials and observational studies. Health technology assessment agencies face a growing challenge to establish whether these endpoints correspond with improved health gain. Health utility values can generate this evidence to enhance the usefulness of composite response endpoints for health technology assessment, decision making, and economic evaluation.
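
The difference-in-differences idea mentioned in the abstract above can be sketched in its simplest unadjusted form: the change in mean utility among responders minus the change among nonresponders. The study itself used regression with bootstrapping, so this is a simplified stand-in; the function name and the utility values below are invented for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Unadjusted 2x2 difference-in-differences estimate.

    (change in the 'treated' group) minus (change in the 'control' group).
    Here 'treated' stands for responders and 'control' for nonresponders.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented health utility values (0-1 scale) at baseline and 6 months.
responders_baseline = [0.60, 0.70]
responders_6_months = [0.75, 0.85]
nonresponders_baseline = [0.60, 0.60]
nonresponders_6_months = [0.65, 0.65]

incremental_utility = difference_in_differences(
    responders_baseline, responders_6_months,
    nonresponders_baseline, nonresponders_6_months,
)
```

Subtracting the nonresponders' change removes any shift in utility that both groups share over the follow-up period, leaving the part of the change associated with achieving response.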

Anthony D. Whetton, George W. Preston, Semira Abubeker, Nophar Geifman (2020) Proteomics and Informatics for Understanding Phases and Identifying Biomarkers in COVID-19 Disease, In: Journal of proteome research 19(11), pp. 4219-4232 Amer Chemical Soc

DOI: 10.1021/acs.jproteome.0c00326

The emergence of novel coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 coronavirus, has necessitated the urgent development of new diagnostic and therapeutic strategies. Rapid research and development, on an international scale, has already generated assays for detecting SARS-CoV-2 RNA and host immunoglobulins. However, the complexities of COVID-19 are such that fuller definitions of patient status, trajectory, sequelae, and responses to therapy are now required. There is accumulating evidence, from studies of both COVID-19 and the related disease SARS, that protein biomarkers could help to provide this definition. Proteins associated with blood coagulation (D-dimer), cell damage (lactate dehydrogenase), and the inflammatory response (e.g., C-reactive protein) have already been identified as possible predictors of COVID-19 severity or mortality. Proteomics technologies, with their ability to detect many proteins per analysis, have begun to extend these early findings. To be effective, proteomics strategies must include not only methods for comprehensive data acquisition (e.g., using mass spectrometry) but also informatics approaches via which to derive actionable information from large data sets. Here we review applications of proteomics to COVID-19 and SARS and outline how pipelines involving technologies such as artificial intelligence could be of value for research on these diseases.

Jennifer C. Davies, Emil Carlsson, Angela Midgley, Eve M. D. Smith, Ian N. Bruce, Michael W. Beresford, Christian M. Hedrich, Nophar Geifman (2021) A panel of urinary proteins predicts active lupus nephritis and response to rituximab treatment, In: Rheumatology (Oxford, England) 60(8), pp. 3747-3759 Oxford Univ Press

DOI: 10.1093/rheumatology/keaa851

Objectives. Approximately 30% of patients with SLE develop LN. Presence and/or severity of LN are currently assessed by renal biopsy, but biomarkers in serum or urine samples may provide an avenue for non-invasive routine testing. We aimed to validate a urinary protein panel for its ability to predict active renal involvement in SLE. Methods. A total of 197 SLE patients and 48 healthy controls were recruited, and urine samples collected. Seventy-five of the SLE patients had active LN and 104 had no or inactive renal disease. Concentrations of lipocalin-like prostaglandin D synthase (LPGDS), transferrin, alpha-1-acid glycoprotein (AGP-1), ceruloplasmin, monocyte chemoattractant protein 1 (MCP-1) and soluble vascular cell adhesion molecule-1 (sVCAM-1) were quantified by MILLIPLEX® Assays using the MAGPIX Luminex platform. Binary logistic regression was conducted to examine whether protein levels associate with active renal involvement and/or response to rituximab treatment. Results. Urine levels of transferrin (P

Charlotte Watson, Nophar Geifman (2020) Do traditional BMI categories capture future obesity? A comparison with trajectories of BMI and incidence of cancer, In: AMIA ... Annual Symposium proceedings 2020, pp. 1287-1294

In 2016, 13 specific obesity-related cancers were identified by IARC. Here, using baseline WHO BMI categories, latent profile analysis (LPA) and latent class trajectory modelling (LCTM), we evaluated the usefulness of one-off measures when predicting cancer risk vs life-course changes. Our results in LPA broadly concurred with the three basic WHO BMI categories, with a similar stepwise increase in cancer risk observed. In LCTM, we identified 5 specific trajectories in men and women. Compared to the leanest class, a stepwise increase in risk for obesity-related cancer was observed for all classes. When latent class membership was compared to baseline BMI, we found that the trajectories were composed of a range of BMI (baseline) categories. All methods reveal a link between obesity and the 13 cancers identified by IARC. However, the additional information included by LCTM indicates that lifetime BMI may highlight an additional group of people who are at risk.

Charlotte Watson, Andrew G. Renehan, Nophar Geifman (2021) Associations of specific-age and decade recall body mass index trajectories with obesity-related cancer, In: BMC cancer 21(1), 502, pp. 502-502 Springer Nature

DOI: 10.1186/s12885-021-08226-4

Background Excess body fatness, commonly approximated by a one-off determination of body mass index (BMI), is associated with increased risk of at least 13 cancers. Modelling of longitudinal BMI data may be more informative for incident cancer associations, e.g. using latent class trajectory modelling (LCTM) may offer advantages in capturing changes in patterns with time. Here, we evaluated the variation in cancer risk with LCTMs using specific age recall versus decade recall BMI. Methods We obtained BMI profiles for participants from the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial. We developed gender-specific LCTMs using recall data from specific ages 20 and 50 years (72,513 M; 74,837 W); decade data from 30s to 70s (42,113 M; 47,352 W) and a combination of both (74,106 M; 76,245 W). Using an established methodological framework, we tested 1 to 7 classes for linear, quadratic, cubic and natural spline shapes, and modelled associations for obesity-related cancer (ORC) incidence using LCTM class membership. Results Different models were selected depending on the data type used. In specific age recall trajectories, only the two heaviest classes were associated with increased risk of ORC. For the decade recall data, the shapes appeared skewed by outliers in the heavier classes but an increase in ORC risk was observed. In the combined models, at older ages the BMI values were more extreme. Conclusions Specific age recall models supported the existing literature showing that changes in BMI over time are associated with increased ORC risk. Modelling of decade recall data might yield spurious associations.

Helen Le Sueur, Arianna Dagliati, Iain Buchan, Anthony D. Whetton, Glen P. Martin, Tim Dornan, Nophar Geifman (2020) Pride and prejudice - What can we learn from peer review?, In: Medical teacher 42(9), 1012, pp. 1012-1018 Taylor & Francis

DOI: 10.1080/0142159X.2020.1774527

Objectives: Peer review is a powerful tool that steers the education and practice of medical researchers but may allow biased critique by anonymous reviewers. We explored factors unrelated to research quality that may influence peer review reports, and assessed the possibility that sub-types of reviewers exist. Our findings could potentially improve the peer review process. Methods: We evaluated the harshness, constructiveness and positiveness in 596 reviews from journals with open peer review, plus 46 reviews from colleagues' anonymously reviewed manuscripts. We considered possible influencing factors, such as number of authors and seasonal trends, on the content of the review. Finally, using machine-learning we identified latent types of reviewer with differing characteristics. Results: Reviews provided during a northern-hemisphere winter were significantly harsher, suggesting a seasonal effect on language. Reviews for articles in journals with an open peer review policy were significantly less harsh than those with an anonymous review process. Further, we identified three types of reviewers: nurturing, begrudged, and blasé. Conclusion: Nurturing reviews were in a minority and our findings suggest that more widespread open peer reviewing could improve the educational value of peer review, increase the constructive criticism that encourages researchers, and reduce pride and prejudice in editorial processes.

Stephanie Shoop-Worrall, Kimme Hyrich, Lucy Wedderburn, Wendy Thomson, Nophar Geifman (2019) P02 Multi-trajectories of disease activity in juvenile idiopathic arthritis, In: Rheumatology (Oxford, England) 58(Supplement_4)

DOI: 10.1093/rheumatology/kez415

Background Composite disease scores in juvenile idiopathic arthritis (JIA), such as the clinical Juvenile Arthritis Disease Activity Score (cJADAS), include multiple disease manifestations, presented as a single score. These overall scores aid understanding of disease holistically in each child or young person (CYP), and have been suggested as outcomes for clinical trials and targets in treat-to-target clinical strategies. However, signs and symptoms of disease may not follow similar patterns following a JIA diagnosis. It is not currently known what the patterns of disease activity are in CYP with JIA and how these cluster over time. Methods CYP with JIA were selected if enrolled in the Childhood Arthritis Prospective Study (CAPS), a UK multicentre inception cohort, before January 2015. cJADAS10 components (active joint count 0-10, physician global, patient/parent global) were collected at diagnosis, six months, one year and then annually to three years. Multivariate group-based trajectory models modelled cJADAS10 component scores using censored-normal (physician and parent global) and zero-inflated Poisson (active joint count) distributions. Within linear, quadratic and cubic polynomials, one to ten trajectories were tested. The optimal models were selected using Bayesian Information Criteria, model parsimony and clinical plausibility. Results Of 1,183 CYP selected, the majority were female (65%) and of white ethnicity (90%) with oligoarticular JIA the most common JIA category (45%). The optimal model identified six multivariate patterns of disease. In four of these clusters, signs and symptoms of disease had similar patterns over time: Low-Remission (32%), Low-Low (20%), High-Low (16%) and High-Low-High (10%). However, in two groups, Low-Chronic (14%) and High-Chronic (8%), manifestations of inflammation and wellbeing followed different trajectory severities and shapes over time. 
These groups demonstrated persistent poor wellbeing despite control of inflammatory signs. Conclusion Disease activity in CYP with JIA does not improve in a uniform manner following initial presentation to paediatric rheumatology. Six latent multivariate trajectories have been identified in young people with JIA, two of which persist with chronic poor wellbeing despite lowered inflammation. Conflicts of Interest The authors declare no conflicts of interest.

N. Geifman, N. Azadbakht, J. Zeng, T. Wilkinson, N. Dand, I. Buchan, D. Stocken, P. Di Meglio, R.B. Warren, J.N. Barker, N.J. Reynolds, M.R. Barnes, C.H. Smith, C.E.M. Griffiths, N. Peek (2021) Defining trajectories of response in patients with psoriasis treated with biologic therapies, In: British journal of dermatology (1951) 185(4), 825, pp. 825-835

DOI: 10.1111/bjd.20140

Summary Background The effectiveness and cost‐effectiveness of biologic therapies for psoriasis are significantly compromised by variable treatment responses. Thus, more precise management of psoriasis is needed. Objectives To identify subgroups of patients with psoriasis treated with biologic therapies, based on changes in their disease activity over time, that may better inform patient management. Methods We applied latent class mixed modelling to identify trajectory‐based patient subgroups from longitudinal, routine clinical data on disease severity, as measured by the Psoriasis Area and Severity Index (PASI), from 3546 patients in the British Association of Dermatologists Biologics and Immunomodulators Register, as well as in an independent cohort of 2889 patients pooled across four clinical trials. Results We discovered four discrete classes of global response trajectories, each characterized in terms of time to response, size of effect and relapse. Each class was associated with differing clinical characteristics, e.g. body mass index, baseline PASI and prevalence of different manifestations. The results were verified in a second cohort of clinical trial participants, where similar trajectories following the initiation of biologic therapy were identified. Further, we found differential associations of the genetic marker HLA‐C*06:02 between our registry‐identified trajectories. Conclusions These subgroups, defined by change in disease over time, may be indicative of distinct endotypes driven by different biological mechanisms and may help inform the management of patients with psoriasis. Future work will aim to further delineate these mechanisms by extensively characterizing the subgroups with additional molecular and pharmacological data. What is already known about this topic? 
While many patients with psoriasis respond to treatment with biologics, there are those who show little or no response and those who respond initially but then either lose response or suffer from adverse effects. Better characterization of patients who will, or will not, benefit from biologic therapy will facilitate the understanding of relevant biological mechanisms and explain treatment outcome variation in patient cohorts. What does this study add? Using a data‐driven approach, we identified four subgroups of patients with psoriasis defined by global trajectories of response to biologic therapies. Our results were replicated in a second cohort obtained by pooling data from four clinical trials of biologic therapies for psoriasis. We further identified potential human leucocyte antigen biomarkers that help to distinguish between the trajectory‐based subgroups. Linked Comment: L.S. van der Schoot and J.M.P.A. van den Reek. Br J Dermatol 2021; 185:698–699.

Carlos R Ramírez Medina, Ibrahim Ali, Ivona Baricevic-Jones, Aghogho Odudu, Moin A Saleem, Anthony David Whetton, Philip A Kalra, Nophar Geifman (2023) Proteomic signature associated with chronic kidney disease (CKD) progression identified by data-independent acquisition mass spectrometry, In: Clinical Proteomics 20, 19 (2023) BMC

DOI: 10.1186/s12014-023-09405-0

Background Halting progression of chronic kidney disease (CKD) to established end stage kidney disease is a major goal of global health research. The mechanism of CKD progression involves pro-inflammatory, pro-fibrotic, and vascular pathways, but pathophysiological differentiation is currently lacking. Methods Plasma samples of 414 non-dialysis CKD patients, 170 fast progressors (with ∂ eGFR of −3 ml/min/1.73 m²/year or worse) and 244 stable patients (∂ eGFR of −0.5 to +1 ml/min/1.73 m²/year) with a broad range of kidney disease aetiologies, were obtained and interrogated for proteomic signals with SWATH-MS. We applied a machine learning approach to feature selection of proteins quantifiable in at least 20% of the samples, using the Boruta algorithm. Biological pathways enriched by these proteins were identified using ClueGo pathway analyses. Results The resulting digitised proteomic maps inclusive of 626 proteins were investigated in tandem with available clinical data to identify biomarkers of progression. The machine learning model using Boruta Feature Selection identified 25 biomarkers as being important to progression type classification (Area Under the Curve = 0.81, Accuracy = 0.72). Our functional enrichment analysis revealed associations with the complement cascade pathway, which is relevant to CKD as the kidney is particularly vulnerable to complement overactivation. This provides further evidence to target complement inhibition as a potential approach to modulating the progression of diabetic nephropathy. Proteins involved in the ubiquitin–proteasome pathway, a crucial protein degradation system, were also found to be significantly enriched. Conclusions The in-depth proteomic characterisation of this large-scale CKD cohort is a step toward generating mechanism-based hypotheses that might lend themselves to future drug targeting. 
Candidate biomarkers will be validated in samples from selected patients in other large non-dialysis CKD cohorts using a targeted mass spectrometric analysis.

Kelechi Njoku, Andrew Pierce, Davide Chiasserini, Bethany Geary, Amy E. Campbell, Janet Kelsall, Rachel Reed, Nophar Geifman, Anthony D. Whetton, Emma J. Crosbie (2024)Detection of endometrial cancer in cervico-vaginal fluid and blood plasma: leveraging proteomics and machine learning for biomarker discovery, In: EBioMedicine102105064pp. 105064-105064 Elsevier B.V

DOI: 10.1016/j.ebiom.2024.105064

The anatomical continuity between the uterine cavity and the lower genital tract allows for the exploitation of uterine-derived biomaterial in cervico-vaginal fluid for endometrial cancer detection based on non-invasive sampling methodologies. Plasma is an attractive biofluid for cancer detection due to its simplicity and ease of collection. In this biomarker discovery study, we aimed to identify proteomic signatures that accurately discriminate endometrial cancer from controls in cervico-vaginal fluid and blood plasma. Blood plasma and Delphi Screener-collected cervico-vaginal fluid samples were acquired from symptomatic post-menopausal women with (n = 53) and without (n = 65) endometrial cancer. Digitised proteomic maps were derived for each sample using sequential window acquisition of all theoretical mass spectra (SWATH-MS). Machine learning was employed to identify the most discriminatory proteins. The best diagnostic model was determined based on accuracy and model parsimony. A protein signature derived from cervico-vaginal fluid more accurately discriminated cancer from control samples than one derived from plasma. A 5-biomarker panel of cervico-vaginal fluid derived proteins (HPT, LG3BP, FGA, LY6D and IGHM) predicted endometrial cancer with an AUC of 0.95 (0.91–0.98), sensitivity of 91% (83%–98%), and specificity of 86% (78%–95%). By contrast, a 3-marker panel of plasma proteins (APOD, PSMA7 and HPT) predicted endometrial cancer with an AUC of 0.87 (0.81–0.93), sensitivity of 75% (64%–86%), and specificity of 84% (75%–93%). The parsimonious model AUC values for detection of stage I endometrial cancer in cervico-vaginal fluid and blood plasma were 0.92 (0.87–0.97) and 0.88 (0.82–0.95) respectively. Here, we leveraged the natural shed of endometrial tumours to potentially develop an innovative approach to endometrial cancer detection. 
We show proof of principle that endometrial cancers secrete unique protein signatures that can enable cancer detection via cervico-vaginal fluid assays. Confirmation in a larger independent cohort is warranted. Funding: Cancer Research UK, Blood Cancer UK, National Institute for Health Research.

Jing Yang, Taariq Salie, Carlos R Ramírez Medina, Simon Frain, Nophar Geifman, Anthony Whetton, Mark Engel, Bernard Keavney (2022)47 Data independent acquisition mass spectrometry in severe rheumatic heart disease (rhd) identifies a proteomic signature showing ongoing inflammation and effectively classifying rhd cases, In: Heart (British Cardiac Society)108(Suppl 1)pp. A34-A35 BMJ Publishing Group Ltd and British Cardiovascular Society

DOI: 10.1136/heartjnl-2022-BCS.47

Rheumatic heart disease (RHD) remains a major source of morbidity and mortality in developing countries. A deeper insight into the pathogenetic mechanisms underlying RHD could provide opportunities for drug repurposing, guide recommendations for secondary penicillin prophylaxis, and/or inform development of near-patient diagnostics. We performed quantitative proteomics using Sequential Windowed Acquisition of All Theoretical Fragment Ion Mass Spectrometry (SWATH-MS) to screen protein expression in 215 African patients with severe RHD, and 230 controls. We applied a machine learning (ML) approach to feature selection among the 366 proteins quantifiable in at least 40% of samples, using the Boruta wrapper algorithm. The case-control differences and contribution to area under the Receiver Operating Curve for each of the 56 proteins identified by the Boruta algorithm were calculated by Logistic Regression adjusted for age, sex and BMI. Biological pathways and functions enriched for proteins were identified using ClueGo pathway analyses. Adiponectin, complement component C7 and fibulin-1, a component of heart valve matrix, were significantly higher in cases when compared with controls (Table 1). Ficolin-3, a protein with calcium-independent lectin activity that activates the complement pathway, was lower in cases than controls (Table 1). The top six biomarkers, including adiponectin, complement component C7, quiescin sulfhydryl oxidase 1, insulin like growth factor binding protein acid labile subunit, pregnancy zone protein and phosphatidylinositol-glycan-specific phospholipase D, from the Boruta analyses (Fig. 1a) conferred an AUC of 0.90 indicating excellent discriminatory capacity between RHD cases and controls (Fig. 1b). ClueGo pathway analysis results of these biomarkers support the presence of an ongoing inflammatory response in RHD (Fig. 2), at a time when severe valve disease has developed, and distant from previous episodes of acute rheumatic fever.
This biomarker signature could have potential utility in recognizing different degrees of ongoing inflammation in RHD patients, which may, in turn, be related to prognostic severity. Conflict of Interest: None

Mumina Akhtar, Nisha Nair, Lucy M Carter, Edward M Vital, Emily Sutton, Neil McHugh, Ian N Bruce, John A Reynolds, Nophar Geifman (2023)Deconvolution of whole blood transcriptomics identifies changes in immune cell composition in patients with systemic lupus erythematosus (SLE) treated with mycophenolate mofetil, In: Arthritis research & therapy25(1)111pp. 111-111

DOI: 10.1186/s13075-023-03089-5

Systemic lupus erythematosus (SLE) is a clinically and biologically heterogeneous autoimmune disease. We explored whether the deconvolution of whole blood transcriptomic data could identify differences in predicted immune cell frequency between active SLE patients, and whether these differences are associated with clinical features and/or medication use. Patients with active SLE (BILAG-2004 Index) enrolled in the BILAG-Biologics Registry (BILAG-BR), prior to change in therapy, were studied as part of the MASTERPLANS Stratified Medicine consortium. Whole blood RNA-sequencing (RNA-seq) was conducted at enrolment into the registry. Data were deconvoluted using CIBERSORTx. Predicted immune cell frequencies were compared between active and inactive disease in the nine BILAG-2004 domains and according to immunosuppressant use (current and past). Predicted cell frequency varied between 109 patients. Patients currently, or previously, exposed to mycophenolate mofetil (MMF) had fewer inactivated macrophages (0.435% vs 1.391%, p = 0.001), naïve CD4 T cells (0.961% vs 2.251%, p = 0.002), and regulatory T cells (1.858% vs 3.574%, p = 0.007), as well as a higher proportion of memory activated CD4 T cells (1.826% vs 1.113%, p = 0.015), compared to patients never exposed to MMF. These differences remained statistically significant after adjusting for age, gender, ethnicity, disease duration, renal disease, and corticosteroid use. There were 2607 differentially expressed genes (DEGs) in patients exposed to MMF with over-representation of pathways relating to eosinophil function and erythrocyte development and function. Within CD4+ T cells, there were fewer predicted DEGs related to MMF exposure. No significant differences were observed for the other conventional immunosuppressants nor between patients according to disease activity in any of the nine organ domains. MMF has a significant and persisting effect on the whole blood transcriptomic signature in patients with SLE.
This highlights the need to adequately adjust for background medication use in future studies using whole blood transcriptomics.

Jie Man Low, Kimme L. Hyrich, Coziana Ciurtin, Flora McErlane, Lucy R. Wedderburn, Nophar Geifman, Stephanie J. W. Shoop-Worrall (2024)The impact of psoriasis on wellbeing and clinical outcomes in juvenile psoriatic arthritis, In: Rheumatology (Oxford, England)63(5)pp. 1273-1280 Oxford University Press

DOI: 10.1093/rheumatology/kead370

Objectives Juvenile PsA (JPsA) has varied clinical features that are distinctive from other JIA categories. This study investigates whether such features impact patient-reported and clinical outcomes. Methods Children and young people (CYP) were selected if recruited to the Childhood Arthritis Prospective Study, a UK multicentre JIA inception cohort, between January 2001 and March 2018. At diagnosis, patient/parent-reported outcomes (as age-appropriate) included the parental global assessment (10 cm visual analogue scale), functional ability (Childhood Health Assessment Questionnaire (CHAQ)), pain (10 cm visual analogue scale), health-related quality of life (Child Health Questionnaire PF50 psychosocial score), mood/depressive symptoms (Moods and Feelings Questionnaire) and parent psychosocial health (General Health Questionnaire 30). Three-year outcome trajectories have previously been defined using active joint counts, physician and parent global assessments (PGA and PaGA, respectively). Patient-reported outcomes and outcome trajectories were compared in (i) CYP with JPsA vs other JIA categories and (ii) CYP within JPsA, with and without psoriasis via multivariable linear regression. Results There were no significant differences in patient-reported outcomes at diagnosis between CYP with JPsA and non-JPsA. Within JPsA, those with psoriasis had more depressive symptoms (coefficient = 9.8; 95% CI: 0.5, 19.0) than those without psoriasis at diagnosis. CYP with JPsA had 2.3 times the odds of persistent high PaGA than other ILAR categories, despite improving joint counts and PGA (95% CI: 1.2, 4.6). Conclusion CYP with psoriasis at JPsA diagnosis report worse mood, supporting a greater disease impact in those with both skin and joint involvement. Multidisciplinary care with added focus to support wellbeing in children with JPsA plus psoriasis may help improve these outcomes.

Matea Deliu, Sara Fontanella, Sadia Haider, Matthew Sperrin, Nophar Geifman, Clare Murray, Angela Simpson, Adnan Custovic (2020)Longitudinal trajectories of severe wheeze exacerbations from infancy to school age and their association with early-life risk factors and late asthma outcomes, In: Clinical and experimental allergy50(3)315pp. 315-324 Wiley

DOI: 10.1111/cea.13553

Introduction Exacerbation-prone asthma subtype has been reported in studies using data-driven methodologies. However, patterns of severe exacerbations have not been studied. Objective To investigate longitudinal trajectories of severe wheeze exacerbations from infancy to school age. Methods We applied longitudinal k-means clustering to derive exacerbation trajectories among 887 participants from a population-based birth cohort with severe wheeze exacerbations confirmed in healthcare records. We examined early-life risk factors of the derived trajectories, and their asthma-related outcomes and lung function in adolescence. Results 498/887 children (56%) had physician-confirmed wheeze by age 8 years, of whom 160 had at least one severe exacerbation. A two-cluster model provided the optimal solution for severe exacerbation trajectories among these 160 children: "Infrequent exacerbations (IE)" (n = 150, 93.7%) and "Early-onset frequent exacerbations (FE)" (n = 10, 6.3%). Shorter duration of breastfeeding was the strongest early-life risk factor for FE (weeks, median [IQR]: FE, 0 [0-1.75] vs. IE, 6 [0-20], P < .001). Specific airway resistance (sR(aw)) was significantly higher in FE compared with IE trajectory throughout childhood. We then compared children in the two exacerbation trajectories with those who have never wheezed (NW, n = 389) or have wheezed but had no severe exacerbations (WNE, n = 338). At age 8 years, FEV1/FVC was significantly lower and FeNO significantly higher among FE children compared with all other groups. By adolescence (age 16), subjects in FE trajectory were significantly more likely to have current asthma (67% FE vs. 30% IE vs. 13% WNE, P < .001) and use inhaled corticosteroids (77% FE vs. 15% IE vs. 18% WNE, P < .001). Lung function was significantly diminished in the FE trajectory (FEV1/FVC, mean [95%CI]: 89.9% [89.3-90.5] vs. 88.1% [87.3-88.8] vs. 85.1% [83.4-86.7] vs. 74.7% [61.5-87.8], NW, WNE, IE, FE respectively, P < .001). 
Conclusion We have identified two distinct trajectories of severe exacerbations during childhood with different early-life risk factors and asthma-related outcomes in adolescence.
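The abstract above derives exacerbation trajectories with longitudinal k-means clustering. The following Python sketch illustrates the general idea only: each child is represented by a vector of counts over time, and plain k-means with Euclidean distance groups similar vectors. The toy counts are invented, the farthest-first initialisation is a choice made here for reproducibility, and the number of clusters is fixed at two rather than selected by a model-fit criterion as in the study. The abstract does not name the software that was used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_trajectory(trajectories):
    """Point-wise mean of a non-empty list of trajectories."""
    n = len(trajectories)
    return [sum(col) / n for col in zip(*trajectories)]


def kmeans_trajectories(trajectories, k=2, max_iter=100):
    """Plain k-means over whole trajectories (assumes k <= number of trajectories)."""
    # Deterministic farthest-first initialisation (an illustrative choice).
    centres = [list(trajectories[0])]
    while len(centres) < k:
        farthest = max(
            trajectories,
            key=lambda t: min(squared_distance(t, c) for c in centres),
        )
        centres.append(list(farthest))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(t, centres[j]))
            for t in trajectories
        ]
        if new_assignment == assignment:
            break  # converged: no trajectory changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [t for t, a in zip(trajectories, assignment) if a == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = mean_trajectory(members)
    return assignment, centres


# Invented toy data: severe exacerbations per year over five years.
toy = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [3, 4, 3, 2, 2],
    [4, 3, 3, 3, 1],
]
assignment, centres = kmeans_trajectories(toy, k=2)
print(assignment)
```

With these toy values the first four children form an infrequent cluster and the last two a frequent one, mirroring the shape (not the numbers) of the two-cluster solution described above.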

Georgina Torrandell-Haro, Gregory L. Branigan, Francesca Vitali, Nophar Geifman, Julie M. Zissimopoulos, Roberta Diaz Brinton (2020)Statin therapy and risk of Alzheimer's and age-related neurodegenerative diseases, In: Alzheimer's & dementia : translational research & clinical interventions6(1)12108pp. e12108-n/a Wiley

DOI: 10.1002/trc2.12108

Introduction: Establishing efficacy of and molecular pathways for statins has the potential to impact incidence of Alzheimer's and age-related neurodegenerative diseases (NDD). Methods: This retrospective cohort study surveyed US-based Humana claims, which includes prescription and patient records from private-payer and Medicare insurance. Claims from 288,515 patients, aged 45 years and older, without prior history of NDD or neurological surgery, were surveyed for a diagnosis of NDD starting 1 year following statin exposure. Patients were required to be enrolled with claims data for at least 6 months prior to first statin prescription and at least 3 years thereafter. Computational system biology analysis was conducted to determine unique target engagement for each statin. Results: Of the 288,515 participants included in the study, 144,214 patients (mean [standard deviation (SD)] age, 67.22 [3.8] years) were exposed to statin therapies, and 144,301 patients (65.97 [3.2] years) were not treated with statins. The mean (SD) follow-up time was 5.1 (2.3) years. Exposure to statins was associated with a lower incidence of Alzheimer's disease (1.10% vs 2.37%; relative risk [RR], 0.4643; 95% confidence interval [CI], 0.44-0.49; P < .001), dementia (3.03% vs 5.39%; RR, 0.56; 95% CI, 0.54-0.58; P < .001), multiple sclerosis (0.08% vs 0.15%; RR, 0.52; 95% CI, 0.41-0.66; P < .001), Parkinson's disease (0.48% vs 0.92%; RR, 0.53; 95% CI, 0.48-0.58; P < .001), and amyotrophic lateral sclerosis (0.02% vs 0.05%; RR, 0.46; 95% CI, 0.30-0.69; P < .001). All NDD incidence for all statins, except for fluvastatin (RR, 0.91; 95% CI, 0.65-1.30; P = 0.71), was reduced with variances in individual risk profiles. Pathway analysis indicated unique and common profiles associated with risk reduction efficacy. Discussion: Benefits and risks of statins relative to neurological outcomes should be considered when prescribed for at-risk NDD populations.
Common statin activated pathways indicate overarching systems required for risk reduction whereas unique targets could advance a precision medicine approach to prevent neurodegenerative diseases.
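The relative risks quoted in the abstract above can be roughly reproduced from the reported group sizes and incidence percentages. The Python sketch below does this for Alzheimer's disease using the standard log-scale (Wald) confidence interval. Two assumptions apply: the event counts are back-calculated from rounded percentages, so they are approximations rather than the study's actual counts, and the abstract does not state which interval method the authors used.

```python
import math


def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Relative risk with a Wald confidence interval computed on the log scale."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed
        + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Group sizes as reported in the abstract.
N_STATIN = 144_214
N_NO_STATIN = 144_301

# Approximate event counts, back-calculated from the rounded
# incidences of 1.10% and 2.37% (an assumption, not study data).
ad_statin = round(N_STATIN * 0.0110)
ad_no_statin = round(N_NO_STATIN * 0.0237)

rr, lower, upper = relative_risk(ad_statin, N_STATIN, ad_no_statin, N_NO_STATIN)
print(f"RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Rounded to two decimals, the result agrees with the reported 0.44-0.49 interval; small differences in later decimals are expected because the inputs are rounded.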

Taariq M Salie, Jing Yang, Carlos R Medina, Nophar Geifman, Liesl J Zuhlke, Simon Frain, Anthony Whetton, Bernard Keavney, Mark E. Engel (2021)Abstract 13789: Identification of a Proteomic Signature Showing Ongoing Inflammation in Severe Rheumatic Heart Disease, In: Circulation (New York, N.Y.)144(S_1) Lippincott Williams & Wilkins, WK Health

DOI: 10.1161/circ.144.suppl_1.13789

Byline: Taariq M Salie, Univ of Cape Town, Cape Town, South Africa; Jing Yang, Univ of Manchester, Manchester, United Kingdom; Carlos R Medina, Univ of Manchester, Manchester, United Kingdom; Nophar Geifman, Div of Informatics, Imaging & Data Sciences, Univ of Manchester, Manchester, United Kingdom; Liesl J Zuhlke, Paediatrics, Univ of Cape Town, Institute of Child Health, Red Cross Children's Hosp, Cape Town, South Africa; Simon Frain, Div of Cardiovascular Sciences, Univ of Manchester, Manchester, United Kingdom; Anthony Whetton, Univ of Manchester, Manchester, United Kingdom; Bernard Keavney, Univ of Manchester, Manchester; Mark E Engel, Univ of Cape Town, OBSERVATORY; Introduction: Rheumatic heart disease (RHD) remains a major source of morbidity and mortality in developing countries. A deeper insight into the pathogenetic mechanisms underlying RHD could provide opportunities for drug repurposing, guide recommendations for secondary penicillin prophylaxis, and/or inform development of near-patient diagnostics. Methods: We conducted a proteomic study in 215 African patients with severe RHD and 230 controls, using the SWATH-MS technique. We applied a machine learning (ML) approach to feature selection among the 366 proteins quantifiable in at least 40% of samples, using the Boruta wrapper algorithm. The case-control differences and contribution to AUC of the ROC for each of the 56 proteins identified by the Boruta algorithm were calculated by Logistic Regression adjusted for age, sex and BMI. Biological pathways and functions enriched for proteins were identified using ClueGo pathway analyses. Results: Adiponectin, complement component C7 and fibulin-1, a component of heart valve matrix, were each higher in cases when compared with controls. Ficolin-3, a protein with calcium-independent lectin activity that activates the complement pathway, was lower in cases than controls. 
The top 6 biomarkers from the Boruta analyses conferred an AUC of 0.90 indicating excellent discriminatory capacity between RHD cases and controls. Conclusions: These results support the presence of an ongoing inflammatory response in RHD, at a time when severe valve disease has developed, and distant from previous episodes of acute rheumatic fever. This biomarker signature could have potential utility in recognizing different degrees of ongoing inflammation in RHD patients, which may in turn be related to prognostic severity.

Beatrice Amico, Arianna Dagliati, Darren Plant, Anne Barton, Niels Peek, Nophar Geifman (2019)A Dashboard for Latent Class Trajectory Modeling: Application in Rheumatoid Arthritis, In: L Ohno-Machado, B Seroussi (eds.), MEDINFO 2019: HEALTH AND WELLBEING E-NETWORKS FOR ALL264pp. 911-915 IOS Press

DOI: 10.3233/SHTI190356

A key trend in current medical research is a shift from one-size-fits-all to precision treatment strategies, where the focus is on identifying narrow subgroups of the population that would benefit from a given intervention. Precision medicine will greatly benefit from accessible tools that clinicians can use to identify such subgroups, and to generate novel inferences about the patient population they are treating. We present a novel dashboard app that enables clinician users to explore patient subgroups with varying longitudinal treatment response, using latent class mixed modeling. The dashboard was developed in R Shiny. We present results of our approach applied to an observational study of patients with moderate to severe rheumatoid arthritis (RA) on first-line biologic treatment.

Hana F Navratilova, Susan Lanham-New, Anthony D Whetton, Nophar Geifman (2024)Associations of Diet with Health Outcomes in the UK Biobank: A Systematic Review, In: Nutrients16(4)523

DOI: 10.3390/nu16040523

The UK Biobank is a cohort study that collects data on diet, lifestyle, biomarkers, and health to examine diet-disease associations. Based on the UK Biobank, we reviewed 36 studies on diet and three health conditions: type 2 diabetes (T2DM), cardiovascular disease (CVD), and cancer. Most studies used one-time dietary data instead of repeated 24 h recalls, which may lead to measurement errors and bias in estimating diet-disease associations. We also found that most studies focused on single food groups or macronutrients, while few studies adopted a dietary pattern approach. Several studies consistently showed that eating more red and processed meat led to a higher risk of lung and colorectal cancer. The results suggest that high adherence to "healthy" dietary patterns (consuming various food types, with at least three servings/day of whole grain, fruits, and vegetables, and meat and processed meat less than twice a week) slightly lowers the risk of T2DM, CVD, and colorectal cancer. Future research should use multi-omics data and machine learning models to account for the complexity and interactions of dietary components and their effects on disease risk.

Matt Spick, Olivier Cexus, Hardev Singh Pandha, Agnieszka Michael, Anthony David Whetton, Nophar Geifman, Paul Andrew Townsend (2023)A Novel Blood Proteomic Signature for Prostate Cancer, In: Cancers15(4)1051 MDPI

DOI: 10.3390/cancers15041051

Prostate cancer is the most common malignant tumour in men. Improved testing for diagnosis, risk prediction, and response to treatment would improve care. Here, we identified a proteomic signature of prostate cancer in peripheral blood using data-independent acquisition mass spectrometry combined with machine learning. A highly predictive signature was derived, which was associated with relevant pathways, including the coagulation, complement, and clotting cascades, as well as plasma lipoprotein particle remodeling. We further validated the identified biomarkers against a second cohort, identifying a panel of five key markers (GP5, SERPINA5, ECM1, IGHG1, and THBS1) which retained most of the diagnostic power of the overall dataset, achieving an AUC of 0.91. Taken together, this study provides a proteomic signature complementary to PSA for the diagnosis of patients with localised prostate cancer, with the further potential for assessing risk of future development of prostate cancer. Data are available via ProteomeXchange with identifier PXD025484.

Ammara Muazzam, Davide Chiasserini, Janet Kelsall, Nophar Geifman, Anthony D. Whetton, Paul A. Townsend (2021)A prostate cancer proteomics database for SWATH-MS based protein quantification, In: Cancers13(21)5580 MDPI

DOI: 10.3390/cancers13215580

Simple Summary: Prostate cancer is the third most frequent cancer in men worldwide, with a notable increase in prevalence over the past two decades. The PSA is the only well-established protein biomarker for prostate cancer diagnosis, staging, and surveillance. It frequently leads to inaccurate diagnosis and overtreatment since it is an organ-specific biomarker rather than a tumour-specific biomarker. As a result, one of the primary goals of prostate cancer proteome research is to identify novel biomarkers that can be used with or instead of PSA, particularly in non-invasive blood samples. Thousands of peptides or assays were detected in blood samples from patients with low- to high-grade prostate cancer and healthy individuals, allowing data processing of sequential window acquisition of all theoretical mass spectra (SWATH-MS). By assisting in the detection of prostate cancer biomarkers in blood samples, this useful resource will improve our understanding of the role of proteomics in prostate cancer diagnosis and risk assessment. Prostate cancer is the most frequent form of cancer in men, accounting for more than one-third of all cases. Current screening techniques, such as PSA testing used in conjunction with routine procedures, lead to unnecessary biopsies and the discovery of low-risk tumours, resulting in overdiagnosis. SWATH-MS is a well-established data-independent (DI) method requiring prior knowledge of targeted peptides to obtain valuable information from SWATH maps. In response to the growing need to identify and characterise protein biomarkers for prostate cancer, this study explored a spectrum source for targeted proteome analysis of blood samples. We created a comprehensive prostate cancer serum spectral library by combining data-dependent acquisition (DDA) MS raw files from 504 patients with low, intermediate, or high-grade prostate cancer and healthy controls, as well as 304 prostate cancer-related protein in silico assays. 
The spectral library contains 114,684 transitions, which equates to 18,479 peptides translated into 1227 proteins. The robustness and accuracy of the spectral library were assessed to boost confidence in the identification and quantification of prostate cancer-related proteins across an independent cohort, resulting in the identification of 404 proteins. This unique database can facilitate researchers to investigate prostate cancer protein biomarkers in blood samples. In the real-world use of the spectrum library for biomarker detection, using a signature of 17 proteins, a clear distinction between the validation cohort's pre- and post-treatment groups was observed. Data are available via ProteomeXchange with identifier PXD028651.

Stephanie J. W. Shoop-Worrall, Kimme L. Hyrich, Lucy R. Wedderburn, Wendy Thomson, Nophar Geifman (2021)Patient-reported wellbeing and clinical disease measures over time captured by multivariate trajectories of disease activity in individuals with juvenile idiopathic arthritis in the UK: a multicentre prospective longitudinal study, In: The Lancet. Rheumatology3(2)pp. e111-e121 Elsevier

DOI: 10.1016/S2665-9913(20)30269-1

Background Juvenile idiopathic arthritis (JIA) is a heterogeneous disease, the signs and symptoms of which can be summarised with use of composite disease activity measures, including the clinical Juvenile Arthritis Disease Activity Score (cJADAS). However, clusters of children and young people might experience different global patterns in their signs and symptoms of disease, which might run in parallel or diverge over time. We aimed to identify such clusters in the 3 years after a diagnosis of JIA. The identification of these clusters would allow for a greater understanding of disease progression in JIA, including how physician-reported and patient-reported outcomes relate to each other over the JIA disease course. Methods In this multicentre prospective longitudinal study, we included children and young people recruited before Jan 1, 2015, to the Childhood Arthritis Prospective Study (CAPS), a UK multicentre inception cohort. Participants without a cJADAS score were excluded. To assess groups of children and young people with similar disease patterns in active joint count, physician's global assessment, and patient or parental global evaluation, we used latent profile analysis at initial presentation to paediatric rheumatology and multivariate group-based trajectory models for the following 3 years. Optimal models were selected on the basis of a combination of model fit, clinical plausibility, and model parsimony. Findings Between Jan 1, 2001, and Dec 31, 2014, 1423 children and young people with JIA were recruited to CAPS, 239 of whom were excluded, resulting in a final study population of 1184 children and young people. We identified five clusters at baseline and six trajectory groups using longitudinal follow-up data. 
Disease course was not well predicted from clusters at baseline; however, in both cross-sectional and longitudinal analyses, substantial proportions of children and young people had high patient or parent global scores despite low or improving joint counts and physician global scores. Participants in these groups were older, and a higher proportion of them had enthesitis-related JIA and lower socioeconomic status, compared with those in other groups. Interpretation Almost one in four children and young people with JIA in our study reported persistent, high patient or parent global scores despite having low or improving active joint counts and physician's global scores. Distinct patient subgroups defined by disease manifestation or trajectories of progression could help to better personalise health-care services and treatment plans for individuals with JIA. Copyright (C) 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.

Zsolt Zador, Alexander Landry, Michael D Cusimano, Nophar Geifman (2019)Multimorbidity states associated with higher mortality rates in organ dysfunction and sepsis: a data-driven analysis in critical care, In: Critical care (London, England)23(1)247pp. 247-247

DOI: 10.1186/s13054-019-2486-6

Sepsis remains a complex medical problem and a major challenge in healthcare. Diagnostics and outcome predictions are focused on physiological parameters with less consideration given to patients' medical background. Given the aging population, not only are diseases becoming increasingly prevalent but occur more frequently in combinations ("multimorbidity"). We hypothesized the existence of patient subgroups in critical care with distinct multimorbidity states. We further hypothesize that certain multimorbidity states associate with higher rates of organ failure, sepsis, and mortality co-occurring with these clinical problems. We analyzed 36,390 patients from the open source Medical Information Mart for Intensive Care III (MIMIC III) dataset. Morbidities were defined based on Elixhauser categories, a well-established scheme distinguishing 30 classes of chronic diseases. We used latent class analysis to identify distinct patient subgroups based on demographics, admission type, and morbidity compositions and compared the prevalence of organ dysfunction, sepsis, and inpatient mortality for each subgroup. We identified six clinically distinct multimorbidity subgroups labeled based on their dominant Elixhauser disease classes. The "cardiopulmonary" and "cardiac" subgroups consisted of older patients with a high prevalence of cardiopulmonary conditions and constituted 6.1% and 26.4% of the study cohort respectively. The "young" subgroup included 23.5% of the cohort composed of young and healthy patients. The "hepatic/addiction" subgroup, constituting 9.8% of the cohort, consisted of middle-aged patients (mean age of 52.25, 95% CI 51.85-52.65) with high rates of depression (20.1%), alcohol abuse (47.75%), drug abuse (18.2%), and liver failure (67%). The "complicated diabetics" and "uncomplicated diabetics" subgroups constituted 9.4% and 24.8% of the study cohort respectively.
The complicated diabetics subgroup demonstrated higher rates of end-organ complications (88.3% prevalence of renal failure). Rates of organ dysfunction and sepsis ranged 19.6-69% and 12.5-46.7% respectively in the six subgroups. Mortality co-occurring with organ dysfunction and sepsis ranged 8.4-23.8% and 11.7-27.4% respectively. These adverse outcomes were most prevalent in the hepatic/addiction subgroup. We identify distinct multimorbidity states that associate with relatively higher prevalence of organ dysfunction, sepsis, and co-occurring mortality. The findings promote the incorporation of multimorbidity in healthcare models and the shift away from the current single-disease paradigm in clinical practice, training, and trial design.

Stephen McDonald, Sean Yiu, Li Su, Caroline Gordon, Matt Truman, Laura Lisk, Neil Solomons, for the MASTERPLANS Consortium, Ian N Bruce, Katherine Payne, Mark Lunt, Niels Peek, Nophar Geifman, Sean Gavan, Gillian Armitt, Patrick Doherty, Jennifer Prattley, Narges Azadbakht, Angela Papazian, Helen Le Sueur, Carmen Farrelly, Clare Richardson, Zunnaira Shabbir, Lauren Hewitt, Neil McHugh, John Reynolds, Stephen Young, David Jayne, Vern Farewell, Matthew Pickering, Elizabeth Lightstone, Alyssa Gilmore, Marina Botto, Timothy Vyse, David Lester Morris, D D’Cruz, Edward Vital, Miriam Wittmann, Paul Emery, Michael Beresford, Christian Hedrich, Angela Midgley, Jenna Gritzfeld, Michael Ehrenstein, David Isenberg, Mariea Parvaz, Jane Dunnage, Jane Batchelor, E Holland, Pauline Upsall (2022)Predictors of treatment response in a lupus nephritis population: lessons from the Aspreva Lupus Management Study (ALMS) trial BMJ Publishing Group

DOI: 10.17863/CAM.85179

Objectives: To identify predictors of overall lupus and lupus nephritis (LN) responses in patients with LN. Methods: Data from the Aspreva Lupus Management Study (ALMS) trial cohort was used to identify baseline predictors of response at 6 months. Endpoints were major clinical response (MCR), improvement, complete renal response (CRR) and partial renal response (PRR). Univariate and multivariate logistic regressions with least absolute shrinkage and selection operator (LASSO) and cross-validation in randomly split samples were utilised. Predictors were ranked by the percentage of times selected by LASSO and prediction performance was assessed by the area under the receiver operating characteristics (AUROC) curve. Results: We studied 370 patients in the ALMS induction trial. Improvement at 6 months was associated with older age (OR=1.03 (95% CI: 1.01 to 1.05) per year), normal haemoglobin (1.85 (1.16 to 2.95) vs low haemoglobin), active lupus (British Isles Lupus Assessment Group A or B) in haematological and mucocutaneous domains (0.61 (0.39 to 0.97) and 0.50 (0.31 to 0.81)), baseline damage (SDI>1 vs =0) (0.38 (0.16 to 0.91)) and 24-hour urine protein (0.63 (0.50 to 0.80)). LN duration 2–4 years (0.43 (0.19 to 0.97) vs

Saskia Lawson-Tovey, Lucy R Wedderburn, Nophar Geifman, Michael Barnes, Kimme L Hyrich (2022)OA19 Successes and challenges in harmonising four national juvenile idiopathic arthritis cohorts: an example from CLUSTER consortium, In: Rheumatology (Oxford, England)61(Supplement_1)pp. i10-i11

DOI: 10.1093/rheumatology/keac132.019

Abstract Background/Aims The CLUSTER consortium aims to identify biomarkers and strata that improve personalised treatments for JIA/JIA-uveitis. By bringing together knowledge and data, CLUSTER can conduct novel analyses in this rare, heterogeneous disease. Data harmonisation across existing JIA cohorts facilitates new, larger datasets that would otherwise take years to collect; however, challenges exist as datasets are often collected autonomously. Here we present progress towards a large-scale, unique JIA data resource, bringing together treatment data from four real-world JIA treatment studies. Methods Four studies (CAPS, CHARMS, BCRD and BSPAR-ETN; the latter two being part of the UK JIA Biologics register) contributed data into CLUSTER. We created two clinical datasets of JIA patients starting first-line methotrexate (MTX) or tumour necrosis factor inhibitors (TNFi). Variables were selected based on a previously developed core dataset, accounting for different levels of granularity across studies. The same inclusion and exclusion criteria were agreed for both datasets, designed to allow for combined analysis of these. OpenPseudonymiser software encrypted NHS numbers - these were matched cross-study to identify duplicates and checked against known duplicate lists. Errors in NHS numbers and existing duplicate matches were identified and corrected. Each NHS number was assigned a CLUSTER ID, meaning one child has the same ID across all relevant studies such that children contributing similar data across multiple studies could be identified. Results A total of 7013 records (from 5435 individuals) were identified, of which 2882 (41%, corresponding to 1304 individuals) represented the same child across >1 study. 197 individuals had duplicate records within one study, 961 in two studies, 142 in three, and four children had duplicate records in all four studies. 
After removing 350 MTX and 605 TNFi duplicate entries, the final datasets contain 2899 and 2401 unique MTX and TNFi patients respectively; 1018 are in both datasets having received both treatments. Missingness across core outcome variables ranged from 10% (active joint count MTX timepoint 2) to 60% (physician VAS TNFi timepoint 2) and was not improved through combining datasets with duplicate entries. Specificity in some variables was lost to allow integration by combining data using least common denominators (e.g. ethnicity captured as Caucasian/Non-Caucasian, despite more specific categories available in some studies). Conclusion Combining data across studies has achieved dataset sizes rarely seen in JIA, which is invaluable to progressing research into personalised treatments and disease outcomes. However, losing specificity in some variables and missingness (a known challenge in observational data) and their impact on future analyses requires further consideration. Ongoing work includes identifying patients with both clinical and biological data that can be combined for more in-depth analyses. Both datasets are available for researchers to use via the CLUSTER Consortium Data Management Committee. Disclosure S. Lawson-Tovey: None. L.R. Wedderburn: Consultancies; L.W. reports consulting fees from Pfizer unrelated to this work. Grants/research support; CLUSTER consortium receives support from AbbVie, UCB, Pfizer, Sobi and GSK. N. Geifman: None. M. Barnes: None. K.L. Hyrich: Grants/research support; KLH reports grant income from BMS, UCB, and Pfizer. Other; KLH reports non-personal speaker's fees from Abbvie.
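The cross-study matching step described in this abstract can be illustrated with a minimal sketch. It is not the consortium's pipeline: a salted SHA-256 digest stands in for the OpenPseudonymiser software, and the salt, the `CLUSTER-00001` ID format, the function names and the NHS numbers are all invented for illustration.

```python
import hashlib
from collections import defaultdict

SALT = "example-salt"  # hypothetical; a real project would keep its salt secret


def pseudonymise(nhs_number: str, salt: str = SALT) -> str:
    """Return a salted SHA-256 digest of an NHS number (stand-in for OpenPseudonymiser)."""
    cleaned = nhs_number.replace(" ", "")  # ignore spacing differences between studies
    return hashlib.sha256((salt + cleaned).encode("utf-8")).hexdigest()


def assign_cluster_ids(records):
    """Map each distinct digest to one ID and collect the studies each ID appears in.

    `records` is an iterable of (study_name, nhs_number) pairs.
    """
    ids = {}                    # digest -> hypothetical CLUSTER ID
    studies = defaultdict(set)  # CLUSTER ID -> set of contributing studies
    for study, nhs_number in records:
        digest = pseudonymise(nhs_number)
        cluster_id = ids.setdefault(digest, f"CLUSTER-{len(ids) + 1:05d}")
        studies[cluster_id].add(study)
    return ids, dict(studies)


# Invented example records: the same child appears in more than one study.
records = [
    ("CAPS", "943 476 5919"),
    ("CHARMS", "9434765919"),
    ("BCRD", "4505577104"),
    ("BSPAR-ETN", "4505577104"),
    ("CAPS", "5990128088"),
]
ids, studies = assign_cluster_ids(records)
multi_study = {cid for cid, names in studies.items() if len(names) > 1}
print(len(ids), len(multi_study))  # 3 2
```

Because the digest is computed from the cleaned number, one child receives the same ID in every study, which is what allows duplicate records to be identified across studies without exposing the NHS number itself.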

Alexia Sampri, Nophar Geifman, Helen Le Sueur, Patrick Doherty, Philip Couch, Ian Bruce, Niels Peek (2020)Probabilistic Approaches to Overcome Content Heterogeneity in Data Integration: A Study Case in Systematic Lupus Erythematosus, In: L B PapeHaugaard, C Lovis, I C Madsen, P Weber, P H Nielsen, P Scott (eds.), DIGITAL PERSONALIZED HEALTH AND MEDICINE270pp. 387-391 Ios Press

DOI: 10.3233/SHTI200188

Integrating data from different sources into a homogeneous dataset increases the opportunities to study human health. However, disparate data collections are often heterogeneous, which complicates their integration. In this paper, we focus on the issue of content heterogeneity in data integration. Traditional approaches for resolving content heterogeneity map all source datasets to a common data model that includes only shared data items, and thus omit all items that vary between datasets. Based on an example of three datasets in Systemic Lupus Erythematosus, we describe and experimentally evaluate a probabilistic data integration approach which propagates the uncertainty resulting from content heterogeneity into statistical inference, avoiding the need to map to a common data model.

Saskia Lawson-Tovey, Samantha Louise Smith, Nophar Geifman, Stephanie Shoop-Worrall, Sandra Ng, Michael Barnes, Lucy Wedderburn, Kimme Hyrich, (2023)The successes and challenges of harmonising juvenile idiopathic arthritis (JIA) datasets to create a large-scale JIA data resource, In: Pediatric rheumatology online journal21(1)70pp. 70-70 Springer Nature

DOI: 10.1186/s12969-023-00839-2

Background: CLUSTER is a UK consortium focussed on precision medicine research in JIA/JIA-Uveitis. As part of this programme, a large-scale JIA data resource was created by harmonizing and pooling existing real-world studies. Here we present challenges and progress towards creation of this unique large JIA dataset. Methods: Four real-world studies contributed data; two clinical datasets of JIA patients starting first-line methotrexate (MTX) or tumour necrosis factor inhibitors (TNFi) were created. Variables were selected based on a previously developed core dataset, and encrypted NHS numbers were used to identify children contributing similar data across multiple studies. Results: Of 7013 records (from 5435 individuals), 2882 (1304 individuals) represented the same child across studies. The final datasets contain 2899 (MTX) and 2401 (TNFi) unique patients; 1018 are in both datasets. Missingness ranged from 10 to 60% and was not improved through harmonisation. Conclusions: Combining data across studies has achieved dataset sizes rarely seen in JIA, invaluable to progressing research. Losing variable specificity and missingness, and their impact on future analyses, requires further consideration.

Kevin Y. C. A. Su, John Reynolds, Rachel Reed, Rachael Da Silva, Janet Kelsall, Ivona Baricevic-Jones, David D. Lee, Anthony D. Whetton, Nophar Geifman, Neil N. McHugh, Ian Bruce, MASTERPLANS BILAG BR consortia, (2023)Proteomic analysis identifies subgroups of patients with active systemic lupus erythematosus, In: Clinical proteomics20(1)29 Springer Nature

DOI: 10.1186/s12014-023-09420-1

Objective: Systemic lupus erythematosus (SLE) is a clinically and biologically heterogeneous autoimmune disease. We aimed to investigate the plasma proteome of patients with active SLE to identify novel subgroups, or endotypes, of patients. Method: Plasma was collected from patients with active SLE who were enrolled in the British Isles Lupus Assessment Group Biologics Registry (BILAG-BR). The plasma proteome was analysed using a data-independent acquisition method, Sequential Window Acquisition of All theoretical mass spectra mass spectrometry (SWATH-MS). Unsupervised, data-driven clustering algorithms were used to delineate groups of patients with a shared proteomic profile. Results: In 223 patients, six clusters were identified based on quantification of 581 proteins. Between the clusters, there were significant differences in age (p = 0.012) and ethnicity (p = 0.003). There was increased musculoskeletal disease activity in cluster 1 (C1), 19/27 (70.4%) (p = 0.002) and renal activity in cluster 6 (C6) 15/24 (62.5%) (p = 0.051). Anti-SSa/Ro was the only autoantibody that significantly differed between clusters (p = 0.017). C1 was associated with p21-activated kinases (PAK) and Phospholipase C (PLC) signalling. Within C1 there were two sub-clusters (C1A and C1B) defined by 49 proteins related to cytoskeletal protein binding. C2 and C6 demonstrated opposite Rho family GTPase and Rho GDI signalling. Three proteins (MZB1, SND1 and AGL) identified in C6 increased the classification of active renal disease although this did not reach statistical significance (p = 0.0617). Conclusions: Unsupervised proteomic analysis identifies clusters of patients with active SLE, that are associated with clinical and serological features, which may facilitate biomarker discovery. The observed proteomic heterogeneity further supports the need for a personalised approach to treatment in SLE.

Stephanie J. W. Shoop-Worrall, Katherine Cresswell, Imogen Bolger, Beth Dillon, Kimme L. Hyrich, Nophar Geifman, (2021)Nothing about us without us: involving patient collaborators for machine learning applications in rheumatology, In: Annals of the rheumatic diseases80(12)pp. 1505-1510 Bmj Publishing Group

DOI: 10.1136/annrheumdis-2021-220454

Novel machine learning methods open the door to advances in rheumatology through application to complex, high-dimensional data, otherwise difficult to analyse. Results from such efforts could provide better classification of disease, decision support for therapy selection, and automated interpretation of clinical images. Nevertheless, such data-driven approaches could potentially model noise, or miss true clinical phenomena. One proposed solution to ensure clinically meaningful machine learning models is to involve primary stakeholders in their development and interpretation. Including patient and health care professionals' input and priorities, in combination with statistical fit measures, allows for any resulting models to be well fit, meaningful, and fit for practice in the wider rheumatological community. Here we describe outputs from workshops that involved healthcare professionals, and young people from the Your Rheum Young Person's Advisory Group, in the development of complex machine learning models. These were developed to better describe trajectory of early juvenile idiopathic arthritis disease, as part of the CLUSTER consortium. We further provide key instructions for reproducibility of this process. Involving people living with, and managing, a disease investigated using machine learning techniques, is feasible, impactful and empowering for all those involved.

Arianna Dagliati, Nophar Geifman, Niels Peek, John H. Holmes, Lucia Sacchi, Seyed Erfan Sajjadi, Allan Tucker (2019)Inferring Temporal Phenotypes with Topological Data Analysis and Pseudo Time-Series, In: D Riano, S Wilk, A TenTeije (eds.), ARTIFICIAL INTELLIGENCE IN MEDICINE, AIME 201911526pp. 399-409 Springer Nature

DOI: 10.1007/978-3-030-21642-9_50

Temporal phenotyping enables clinicians to better understand observable characteristics of a disease as it progresses. Modelling disease progression that captures interactions between phenotypes is inherently challenging. Temporal models that capture change in disease over time can identify the key features that characterize disease subtypes that underpin these trajectories. These models will enable clinicians to identify early warning signs of progression in specific sub-types and therefore to make informed decisions tailored to individual patients. In this paper, we explore two approaches to building temporal phenotypes based on the topology of data: topological data analysis and pseudo time-series. Using type 2 diabetes data, we show that the topological data analysis approach is able to identify trajectories representing different temporal phenotypes and that pseudo time-series can infer a state space model characterized by transitions between hidden states that represent distinct temporal phenotypes. Both approaches highlight lipid profiles as key factors in distinguishing the phenotypes.

Angelica Arioli, Arianna Dagliati, Bethany Geary, Niels Peek, Philip A. Kalra, Anthony D. Whetton, Nophar Geifman (2021)OptiMissP: A dashboard to assess missingness in proteomic data-independent acquisition mass spectrometry, In: PloS one16(4)0249771pp. e0249771-e0249771 Public Library Science

DOI: 10.1371/journal.pone.0249771

Background Missing values are a key issue in the statistical analysis of proteomic data. Defining the strategy to address missing values is a complex task in each study, potentially affecting the quality of statistical analyses. Results We have developed OptiMissP, a dashboard to visually and qualitatively evaluate missingness and guide decision making in the handling of missing values in proteomics studies that use data-independent acquisition mass spectrometry. It provides a set of visual tools to retrieve information about missingness through protein densities and topology-based approaches, and facilitates exploration of different imputation methods and missingness thresholds. Conclusions OptiMissP provides support for researchers' and clinicians' qualitative assessment of missingness in proteomic datasets in order to define study-specific strategies for the handling of missing values. OptiMissP considers biases in protein distributions related to the choice of imputation method and helps analysts to balance the information loss caused by low missingness thresholds and the noise introduced by selecting high missingness thresholds. This is complemented by topological data analysis which provides additional insight to the structure of the data and their missingness. We use an example in Chronic Kidney Disease to illustrate the main functionalities of OptiMissP.

Arianna Dagliati, Nophar Geifman, Niels Peek, John H. Holmes, Lucia Sacchi, Riccardo Bellazzi, Seyed Erfan Sajjadi, Allan Tucker (2020)Using topological data analysis and pseudo time series to infer temporal phenotypes from electronic health records, In: Artificial intelligence in medicine108101930pp. 101930-101930 Elsevier

DOI: 10.1016/j.artmed.2020.101930

Temporal phenotyping enables clinicians to better understand observable characteristics of a disease as it progresses. Modelling disease progression that captures interactions between phenotypes is inherently challenging. Temporal models that capture change in disease over time can identify the key features that characterize disease subtypes that underpin these trajectories. These models will enable clinicians to identify early warning signs of progression in specific sub-types and therefore to make informed decisions tailored to individual patients. In this paper, we explore two approaches to building temporal phenotypes based on the topology of data: topological data analysis and pseudo time-series. Using type 2 diabetes data, we show that the topological data analysis approach is able to identify disease trajectories and that pseudo time-series can infer a state space model characterized by transitions between hidden states that represent distinct temporal phenotypes. Both approaches highlight lipid profiles as key factors in distinguishing the phenotypes.

Saskia Lawson-Tovey, Samantha Smith, Nophar Geifman, Stephanie Shoop-Worrall, Sandra Ng, Michael Barnes, Lucy Wedderburn, Kimme Hyrich (2022)OA31 Successes and challenges in harmonising 4 national Juvenile Idiopathic Arthritis cohorts: an example from CLUSTER consortium, In: Rheumatology advances in practice6(Suppl 1) Oxford University Press

DOI: 10.1093/rap/rkac066.031

Carlos Raúl Ramírez Medina, Ibrahim Ali, Ivona Baricevic-Jones, Moin A Saleem, Anthony David Whetton, Philip A. Kalra, Nophar Geifman (2024)Evaluation of a proteomic signature coupled with the kidney failure risk equation in predicting end stage kidney disease in a chronic kidney disease cohort, In: Clinical Proteomics2134 BioMedCentral

DOI: 10.1186/s12014-024-09486-5

Background The early identification of patients at high‑risk for end‑stage renal disease (ESRD) is essential for providing optimal care and implementing targeted prevention strategies. While the Kidney Failure Risk Equation (KFRE) offers a more accurate prediction of ESRD risk compared to static eGFR‑based thresholds, it does not provide insights into the patient‑specific biological mechanisms that drive ESRD. This study focused on evaluating the effectiveness of KFRE in a UK‑based advanced chronic kidney disease (CKD) cohort and investigating whether the integration of a proteomic signature could enhance 5‑year ESRD prediction. Methods Using the Salford Kidney Study biobank, a UK‑based prospective cohort of over 3000 non‑dialysis CKD patients, 433 patients met our inclusion criteria: a minimum of four eGFR measurements over a two‑year period and a linear eGFR trajectory. Plasma samples were obtained and analysed for novel proteomic signals using SWATH‑Mass‑Spectrometry. The 4‑variable UK‑calibrated KFRE was calculated for each patient based on their baseline clinical characteristics. Boruta machine learning algorithm was used for the selection of proteins most contributing to differentiation between patient groups. Logistic regression was employed for estimation of ESRD prediction by (1) proteomic features; (2) KFRE; and (3) proteomic features alongside KFRE. Results SWATH maps with 943 quantified proteins were generated and investigated in tandem with available clinical data to identify potential progression biomarkers. We identified a set of proteins (SPTA1, MYL6 and C6) that, when used alongside the 4‑variable UK‑KFRE, improved the prediction of 5‑year risk of ESRD (AUC = 0.75 vs AUC = 0.70). Functional enrichment analysis revealed Rho GTPases and regulation of the actin cytoskeleton pathways to be statistically significant, inferring their role in kidney function and the pathogenesis of renal disease.
Conclusions Proteins SPTA1, MYL6 and C6, when used alongside the 4‑variable UK‑KFRE achieve an improved performance when predicting a 5‑year risk of ESRD. Specific pathways implicated in the pathogenesis of podocyte dysfunction were also identified, which could serve as potential therapeutic targets. The findings of our study carry

John A Reynolds, Jennifer Prattley, Nophar Geifman, Mark Lunt, Caroline Gordon, Ian N Bruce (2021)Distinct patterns of disease activity over time in patients with active SLE revealed using latent class trajectory models, In: Arthritis research & therapy23(1)203pp. 203-203

DOI: 10.1186/s13075-021-02584-x

Systemic lupus erythematosus (SLE) is a heterogeneous systemic autoimmune condition for which there are limited licensed therapies. Clinical trial design is challenging in SLE due at least in part to imperfect outcome measures. Improved understanding of how disease activity changes over time could inform future trial design. The aim of this study was to determine whether distinct trajectories of disease activity over time occur in patients with active SLE within a clinical trial setting and to identify factors associated with these trajectories. Latent class trajectory models were fitted to a clinical trial dataset of a monoclonal antibody targeting CD22 (Epratuzumab) in patients with active SLE using the numerical BILAG-2004 score (nBILAG). The baseline characteristics of patients in each class and changes in prednisolone over time were identified. Exploratory PK-PD modelling was used to examine cumulative drug exposure in relation to latent class membership. Five trajectories of disease activity were identified, with 3 principal classes: non-responders (NR), slow responders (SR) and rapid-responders (RR). In both the SR and RR groups, significant changes in disease activity were evident within the first 90 days of the trial. The SR and RR patients had significantly higher baseline disease activity, exposure to epratuzumab and activity in specific BILAG domains, whilst NR had lower steroid use at baseline and less change in steroid dose early in the trial. Longitudinal nBILAG scores reveal different trajectories of disease activity and may offer advantages over fixed endpoints. Corticosteroid use however remains an important confounder in lupus trials and can influence early response. Changes in disease activity and steroid dose early in the trial were associated with the overall disease activity trajectory, supporting the feasibility of performing adaptive trial designs in SLE.

Lorenzo Chiudinelli, Arianna Dagliati, Valentina Tibollo, Sara Albasini, Nophar Geifman, Niels Peek, John H. Holmes, Fabio Corsi, Riccardo Bellazzi, Lucia Sacchi (2020)Mining post-surgical care processes in breast cancer patients, In: Artificial intelligence in medicine105101855pp. 101855-101855 Elsevier B.V

DOI: 10.1016/j.artmed.2020.101855

• A data analysis pipeline to extract frequent patterns in breast cancer patients using administrative data from EHR.
• A Topic Modeling step allows synthesizing the ICD9-CM codes of the procedures carried out during hospitalizations.
• Frequent patterns of care are extracted through a careflow mining algorithm.
• The results reveal interesting temporal phenotypes, which are different in terms of clinical outcome.
• The resulting careflows reflect the clinical practice guidelines enacted at the considered Breast Unit.

In this work we describe the application of a careflow mining algorithm to detect the most frequent patterns of care in a cohort of 3000 breast cancer patients. The applied method relies on longitudinal data extracted from electronic health records, recorded from the first surgical procedure after a breast cancer diagnosis. Careflows are mined from events data recorded for administrative purposes, including procedures from ICD9-CM billing codes and chemotherapy treatments. Events data have been pre-processed with Topic Modelling to create composite events based on concurrent procedures. The results of the careflow mining algorithm allow the discovery of electronic temporal phenotypes across the studied population. These phenotypes are further characterized on the basis of clinical traits and tumour histopathology, as well as in terms of relapses, metastasis occurrence and 5-year survival rates. Results are highly significant from a clinical perspective, since phenotypes describe well characterized pathology classes, and the careflows are well matched with existing clinical guidelines. The analysis thus facilitates deriving real-world evidence that can inform clinicians as well as hospital decision makers.

Matt Spick, Ammara Muazzam, Hardev Singh Pandha, Agnieszka Michael, Lee A Gethings, Christopher J. Hughes, Nyasha Munjoma, Robert S. Plumb, Ian D. Wilson, Anthony David Whetton, Paul Andrew Townsend, Nophar Geifman (2023)Multi-omic diagnostics of prostate cancer in the presence of benign prostatic hyperplasia, In: Heliyon9(12)e22604 Elsevier

DOI: 10.1016/j.heliyon.2023.e22604

There is an unmet need for improved diagnostic testing and risk prediction for cases of prostate cancer (PCa) to improve care and reduce overtreatment of indolent disease. Here we have analysed the serum proteome and lipidome of 262 study participants by liquid chromatography-mass spectrometry, including participants diagnosed with PCa, benign prostatic hyperplasia (BPH), or otherwise healthy volunteers, with the aim of improving biomarker specificity. Although a two class machine learning model separated PCa from controls with sensitivity of 0.82 and specificity of 0.95, adding BPH resulted in a statistically significant decline in specificity for prostate cancer to 0.76, with half of BPH cases being misclassified by the model as PCa. A small number of biomarkers differentiating between BPH and prostate cancer were identified, including proteins in MAP Kinase pathways, as well as in lipids containing oleic acid; these may offer a route to greater specificity. These results highlight, however, that whilst there are opportunities for machine learning, these will only be achieved by use of appropriate training sets that include confounding comorbidities, especially when calculating the specificity of a test.

Arianna Dagliati, Darren Plant, Nisha Nair, Meghna Jani, Beatrice Amico, Niels Peek, Ann W Morgan, John Isaacs, Anthony G Wilson, Kimme L Hyrich, Nophar Geifman, Anne Barton (2020)Latent Class Trajectory Modeling of 2-Component Disease Activity Score in 28 Joints Identifies Multiple Rheumatoid Arthritis Phenotypes of Response to Biologic Disease-Modifying Antirheumatic Drugs, In: Arthritis & rheumatology (Hoboken, N.J.)72(10)pp. 1632-1642

DOI: 10.1002/art.41379

To determine whether using a reweighted disease activity score that better reflects joint synovitis, i.e., the 2-component Disease Activity Score in 28 joints (DAS28) (based on swollen joint count and C-reactive protein level), produces more clinically relevant treatment outcome trajectories compared to the standard 4-component DAS28. Latent class mixed modeling of response to biologic treatment was applied to 2,991 rheumatoid arthritis (RA) patients in whom treatment with a biologic disease-modifying antirheumatic drug was being initiated within the Biologics in Rheumatoid Arthritis Genetics and Genomics Study Syndicate cohort, using both 4-component and 2-component DAS28 scores as outcome measures. Patient groups with similar trajectories were compared in terms of pretreatment baseline characteristics (including disability and comorbidities) and follow-up characteristics (including antidrug antibody events, adherence to treatments, and blood drug levels). We compared the trajectories obtained using the 4- and 2-component scores to determine which characteristics were better captured by each. Using the 4-component DAS28, we identified 3 trajectory groups, which is consistent with previous findings. We showed that the 4-component DAS28 captures information relating to depression. Using the 2-component DAS28, 7 trajectory groups were identified; among them, distinct groups of nonresponders had a higher incidence of respiratory comorbidities and a higher proportion of antidrug antibody events. We also identified a group of patients for whom the 2-component DAS28 scores remained relatively low; this group included a high percentage of patients who were nonadherent to treatment. This highlights the utility of both the 4- and 2-component DAS28 for monitoring different components of disease activity. 
Here we show that the 2-component modified DAS28 defines important biologic and clinical phenotypes associated with treatment outcome in RA and characterizes important underlying response mechanisms to biologic drugs.

Nophar Geifman, Anthony D Whetton (2020)A consideration of publication-derived immune-related associations in Coronavirus and related lung damaging diseases, In: Journal of translational medicine18(1)297pp. 297-297

DOI: 10.1186/s12967-020-02472-z

The severe acute respiratory syndrome virus SARS-CoV-2, a close relative of the SARS-CoV virus, is the cause of the recent COVID-19 pandemic affecting, to date, over 14 million individuals across the globe and demonstrating relatively high rates of infection and mortality. A third virus, the H5N1, responsible for avian influenza, has caused infection with some clinical similarities to those in COVID-19 infections. Cytokines, small proteins that modulate immune responses, have been directly implicated in some of the severe responses seen in COVID-19 patients, e.g. cytokine storms. Understanding the immune processes related to COVID-19, and other similar infections, could help identify diagnostic markers and therapeutic targets. Here we examine data of cytokine, immune cell types, and disease associations captured from biomedical literature associated with COVID-19, Coronavirus in general, SARS, and H5N1 influenza, with the objective of identifying potentially useful relationships and areas for future research. Cytokine and cell-type associations captured from Medical Subject Heading (MeSH) terms linked to thousands of PubMed records have identified differing patterns of associations between the four corpuses of publications (COVID-19, Coronavirus, SARS, or H5N1 influenza). Clustering of cytokine-disease co-occurrences in the context of Coronavirus has identified compelling clusters of co-morbidities and symptoms, some of which are already known to be linked to COVID-19. Finally, network analysis identified sub-networks of cytokines and immune cell types associated with different manifestations, co-morbidities and symptoms of Coronavirus, SARS, and H5N1. Systematic review of research in medicine is essential to facilitate evidence-based choices about health interventions. In a fast moving pandemic the approach taken here will identify trends and enable rapid comparison to the literature of related diseases.

Britt W Jensen, Charlotte Watson, Nophar Geifman, Jennifer L Baker, Ellena Badrick, Andrew G Renehan (2021)Weight Changes in Type 2 Diabetes and Cancer Risk: A Latent Class Trajectory Model Study, In: Obesity facts Karger

DOI: 10.1159/000520200

Introduction: Body mass index (BMI) is often elevated at type 2 diabetes (T2D) diagnosis. Using latent class trajectory modelling (LCTM) of BMI, we examined whether weight loss after diagnosis influenced cancer incidence and all-cause mortality. Methods: From 1995 to 2010, we identified 7,708 patients with T2D from the Salford Integrated Record database (UK) and linked to the cancer registry for information on obesity-related cancer (ORC), non-ORC; and all-cause mortality. Repeated BMIs were used to construct sex-specific latent class trajectories. Hazard ratios (HRs) and 95% confidence intervals (CIs) were estimated using Cox regression models. Results: Four sex-specific BMI classes were identified; stable-overweight, stable-obese, obese-slightly-decreasing, and obese-steeply-decreasing; comprising 41%, 45%, 13%, and 1% of women, and 45%, 37%, 17%, and 1% of men, respectively. In women, the stable-obese class had similar ORC risks as the obese-slightly-decreasing class, whereas the stable-overweight class had lower risks. In men, the obese-slightly-decreasing class had higher risks of ORC (HR = 1.86, 95% CI: 1.05–3.32) than the stable-obese class, while the stable-overweight class had similar risks. No associations were observed for non-ORC. Compared to the stable-obese class, women (HR = 1.60, 95% CI: 0.99–2.58) and men (HR = 2.37, 95% CI: 1.66–3.39) in the obese-slightly-decreasing class had elevated mortality. No associations were observed for the stable-overweight classes. Conclusion: Patients who lost weight after T2D diagnosis had higher risks for ORC (in men) and higher all-cause mortality (both genders) than patients with stable obesity.

Giovanna Nicora, Francesca Vitali, Arianna Dagliati, Nophar Geifman, Riccardo Bellazzi (2020)Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools, In: Frontiers in oncology101030pp. 1030-1030 Frontiers Media S.A

DOI: 10.3389/fonc.2020.01030

In recent years, high-throughput sequencing technologies have provided an unprecedented opportunity to depict cancer samples at multiple molecular levels. The integration and analysis of these multi-omics datasets is a crucial and critical step to gain actionable knowledge in a precision medicine framework. This paper explores recent data-driven methodologies that have been developed and applied to respond to major challenges of stratified medicine in oncology, including patients' phenotyping, biomarker discovery, and drug repurposing. We systematically retrieved peer-reviewed journals published from 2014 to 2019, then selected and thoroughly described the tools presenting the most promising innovations regarding the integration of heterogeneous data, the machine learning methodologies that successfully tackled the complexity of multi-omics data, and the frameworks to deliver actionable results for clinical practice. The review is organized according to the applied methods: Deep learning, Network-based methods, Clustering, Features Extraction, and Transformation, Factorization. We provide an overview of the tools available in each methodological group and underline the relationship among the different categories. Our analysis revealed how multi-omics datasets could be exploited to drive precision oncology, but also current limitations in the development of multi-omics data integration.

M. Taariq Salie, Jing Yang, Carlos R. Ramírez Medina, Liesl J. Zühlke, Chishala Chishala, Mpiko Ntsekhe, Bernard Gitura, Stephen Ogendo, Emmy Okello, Peter Lwabi, John Musuku, Agnes Mtaja, Christopher Hugo-Hamman, Ahmed El-Sayed, Albertino Damasceno, Ana Mocumbi, Fidelia Bode-Thomas, Christopher Yilgwan, Ganiyu A. Amusa, Esin Nkereuwem, Gasnat Shaboodien, Rachael Da Silva, Dave Chi Hoo Lee, Simon Frain, Anthony D. Whetton, Nophar Geifman, Bernard Keavney, Mark E. Engel (2022) Data-independent acquisition mass spectrometry in severe rheumatic heart disease (RHD) identifies a proteomic signature showing ongoing inflammation and effectively classifying RHD cases, In: Clinical Proteomics 19, 7. BMC

DOI: 10.1186/s12014-022-09345-1

Background: Rheumatic heart disease (RHD) remains a major source of morbidity and mortality in developing countries. A deeper insight into the pathogenetic mechanisms underlying RHD could provide opportunities for drug repurposing, guide recommendations for secondary penicillin prophylaxis, and/or inform development of near-patient diagnostics. Methods: We performed quantitative proteomics using Sequential Windowed Acquisition of All Theoretical Fragment Ion Mass Spectrometry (SWATH-MS) to screen protein expression in 215 African patients with severe RHD and 230 controls. We applied a machine learning (ML) approach to feature selection among the 366 proteins quantifiable in at least 40% of samples, using the Boruta wrapper algorithm. The case–control differences and contribution to the Area Under the Receiver Operating Characteristic Curve (AUC) for each of the 56 proteins identified by the Boruta algorithm were calculated by logistic regression adjusted for age, sex and BMI. Biological pathways and functions enriched for proteins were identified using ClueGO pathway analyses. Results: Adiponectin, complement component C7 and fibulin-1, a component of heart valve matrix, were significantly higher in cases than in controls. Ficolin-3, a protein with calcium-independent lectin activity that activates the complement pathway, was lower in cases than in controls. The top six biomarkers from the Boruta analyses conferred an AUC of 0.90, indicating excellent discriminatory capacity between RHD cases and controls. Conclusions: These results support the presence of an ongoing inflammatory response in RHD, at a time when severe valve disease has developed, and distant from previous episodes of acute rheumatic fever. This biomarker signature could have potential utility in recognizing different degrees of ongoing inflammation in RHD patients, which may, in turn, be related to prognostic severity.