Tulsi Suchak

PhD Student in Health Sciences

t.suchak@surrey.ac.uk

LinkedIn Profile

Academic and research departments

School of Health Sciences.

About

My research project

Addressing Data Integrity Challenges to the Clinical Translation of Metabolomics and Health Research

My research focuses on the reliability of biomedical evidence, particularly in studies using large, publicly available health datasets, and investigates how these resources can be misused through formulaic research practices, paper mills, and the emerging use of generative artificial intelligence in scientific publishing. I am also interested in improving transparency and reproducibility in metabolomics research through examining data availability and FAIR compliance across the literature.

Supervisors

Matt Spick

Nophar Geifman

Publications

Maud A J de Kinderen, Xiao Ma, Fabián E Vaistij, Jiyang Chang, Tony Larson, Yi Li, Fekadu Dinssa, Chen-Yu Lin, Tulsi Suchak, Samuel J Smit, Daniel C Jeffares, Allen Van Deynze, Willem S Jansen van Rensburg, Abe S Gerrano, Michael W Bairu, Sonja L Venter, Roland Schafleitner, Yves Van de Peer, Katherine Denby (2026)An inter-specific Amaranthus pangenome captures genetic variation potentially underlying key leafy vegetable traits in this underutilised crop, In: The New PhytologistEarly View(Early View) Wiley

DOI: 10.1111/nph.71183

Amaranthus species (particularly Amaranthus cruentus, Amaranthus hypochondriacus, Amaranthus tricolor and Amaranthus caudatus) are traditional underutilised crops with the potential to contribute to sustainable, healthy food systems. We focus on amaranth as a leafy vegetable aiming to develop improved lines for cultivation by smallholder farmers in Sub-Saharan Africa. We demonstrate differences in leaf yield and metabolites relevant to human nutrition across eight amaranth accessions: four A. cruentus and four A. hypochondriacus. These accessions are founders of an inter-specific multi-parent advanced generation inter-cross population. We generated high-quality genome assemblies and annotations for these founder lines and identified sequence and structural variants (SVs) compared with a reference A. cruentus genome. Pangenome analysis (also including A. cruentus, A. hypochondriacus and A. tricolor reference genomes) identified core, dispensable and private gene families. Fifty per cent of gene families were core, highlighting the value, in terms of gene discovery, of sequencing additional accessions and the inclusion of three Amaranthus species. A graphical pangenome was constructed using SVs and demonstrated variation in copy number of genes with a likely role in disease resistance. This inter-specific pangenome will be highly valuable for future research on amaranth and facilitate usage of SVs in trait mapping and causal gene discovery.

Tulsi Suchak, Anietie E Aliu, Charlie Harrison, Reyer Zwiggelaar, Nophar Geifman, Matt Spick (2025)Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database, In: PLoS biology23(5)e3003152 PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pbio.3003152

With the growth of artificial intelligence (AI)-ready datasets such as the National Health and Nutrition Examination Survey (NHANES), new opportunities for data-driven research are being created, but also generating risks of data exploitation by paper mills. In this work, we focus on two areas of potential concern for AI-supported research efforts. First, we describe the production of large numbers of formulaic single-factor analyses, relating single predictors to specific health conditions, where multifactorial approaches would be more appropriate. Employing AI-supported single-factor approaches removes context from research, fails to capture interactions, avoids false discovery correction, and is an approach that can easily be adopted by paper mills. Second, we identify risks of selective data usage, such as analyzing limited date ranges or cohort subsets without clear justification, suggestive of data dredging, and post-hoc hypothesis formation. Using a systematic literature search for single-factor analyses, we identified 341 NHANES-derived research papers published over the past decade, each proposing an association between a predictor and a health condition from the wide range contained within NHANES. We found evidence that research failed to take account of multifactorial relationships, that manuscripts did not account for the risks of false discoveries, and that researchers selectively extracted data from NHANES rather than utilizing the full range of data available. Given the explosion of AI-assisted productivity in published manuscripts (the systematic search strategy used here identified an average of 4 papers per annum from 2014 to 2021, but 190 in 2024-9 October alone), we highlight a set of best practices to address these concerns, aimed at researchers, data controllers, publishers, and peer reviewers, to encourage improved statistical practices and mitigate the risks of paper mills using AI-assisted workflows to introduce low-quality manuscripts to the scientific literature.

Danny Maupin, Tulsi Suchak, Abhijit Sengupta, Mariana Marra, Nophar Geifman, Matt Spick Drastic changes in collaboration networks and publication patterns in research using the CDC WONDER dataset, In: medRxiv

DOI: 10.64898/2026.01.13.26343992

The growth of generative AI and easily available Open Access health datasets has transformed researcher productivity, leading to an explosion in publications that has in part been attributed to paper mills (organisations that provide manuscripts for payment) and other unethical actors. These entities are not, however, homogenous, and have a range of products and target markets. While the demand from China has received much attention, here we provide a case study of CDC WONDER, a dataset that has been exploited by a network of researchers reporting affiliations in Pakistan, the United States and the UK, potentially linked to medical residency driven demand from junior clinicians or trainees. The number of publications using CDC WONDER grew from 88 in 2021 to 1223 in 2025. Over the same time period, the proportion of papers reporting at least one author from Pakistan grew from 0.5% in 2021 to 27.2% in 2025, with unusually extensive collaboration networks. In some cases these works featured over 15 co-authors, often including representation from Western institutions, but in spite of this high level of resourcing only resulted in straightforward analyses of well-described conditions using publicly available data. The majority of these outputs additionally show evidence of being produced from a template, with formulaic titles and identical methods, for example using the same statistical model and platform (Joinpoint regression). Identifying papers produced by fast-churn workflows is essential to protect the integrity of the scientific literature from being flooded with low-quality research. This can be achieved through more proactive desk rejection of misleading and formulaic mass-produced submissions, and through better understanding of which use cases are appropriate for different Open Science resources.

Danny Maupin, Tulsi Suchak, Adrian Barnett, Matt Spick Dramatic increases in redundant publications in the Generative AI era, In: medRxiv Cold Spring Harbor Laboratory Press

DOI: 10.1101/2025.09.09.25335401

Redundant publication, the practice of submitting the same or substantially overlapping manuscripts multiple times, distorts the scientific record and wastes resources. Since 2022, publications using large open-science data resources have increased substantially, raising concerns that Generative AI (GenAI) may be facilitating the production of formulaic, redundant manuscripts. In this work we aim to quantify the extent of redundant publication from a single, large health dataset and to investigate whether GenAI can create submissions that evade standard integrity checks. We conducted a systematic search for the years 2021 to 2025 (year to end-July) to identify redundant publications using the US Centers for Disease Control and Prevention National Health and Nutrition Examination Survey (NHANES) dataset. Redundancy was defined as publications analysing the same exposures associated with the same outcomes in the same national population. To test whether GenAI could facilitate creating these papers, we prompted large language models to write three synthetic manuscripts using redundant publications from our dataset as input, instructing them to maximise syntactic differences and evade plagiarism detectors. These three synthetic manuscripts were then tested using a leading plagiarism detection platform to assess their similarity scores. Our search identified 411 redundant publications across 156 unique exposure-outcome pairings; for example, the association between oxidative balance score and chronic kidney disease using NHANES data was published six times in one year. In many instances, redundant articles appeared within the same journals. The three synthetic manuscripts created by GenAI to evade detection yielded overall similarity scores of 26%, 19%, and 14%, with individual similarity contributions below the typical 5% warning thresholds used by plagiarism detectors. The rapid growth in redundant publications (a 17-fold increase between 2022 and 2024) is suggestive of a systemic failure of editorial checks. These papers distort meta-analyses and scientometric studies, waste scarce peer review resources and pose a significant threat to the integrity of the scientific record. We conclude that current checks for redundant publications and plagiarism are no longer fit for purpose in the GenAI era.