Dr Wing Yan Li
Academic and research departments
Computer Science Research Centre, School of Computer Science and Electronic Engineering
About
Biography
I am a Postdoctoral Research Fellow at the Computer Science Research Centre, University of Surrey, and a member of the Nature Inspired Computing and Engineering (NICE) research group. I collaborate with Dr Suparna De on the MetaCurate-ML project, leading one of its work packages: conceptual comparison for the data harmonisation of longitudinal social science survey questions.
I completed my PhD and BSc (First Class Honours) degrees at the University of Sussex.
Research
Research interests
My research interests lie in natural language processing (NLP), information retrieval (IR), and the application of AI across disciplines. My current work focuses on developing machine comprehension and information extraction methods for longitudinal text, with applications to metadata extraction and uplift for social science questionnaires, and on adaptive privacy-preserving models for online social networks. Previous work includes research in multilingual natural language understanding and AI interpretability, examining how language models' semantic representations evolve during task fine-tuning.
Publications
Longitudinal and comparative research relies heavily on repeated measures and the harmonisation of data. DDI-Lifecycle has strong support for this through the variable cascade; however, scaling such activity has proven difficult in practice. Social science (and other!) researchers approach the development of questions from a range of perspectives: even where the response options are (nearly) identical, the phrasing and orchestration of the questions can vary considerably. This places limits on the utility of standard text comparison techniques (e.g. TF-IDF, bag-of-words). The presentation will outline the strengths and weaknesses of the different approaches taken during the project to address this problem. These include problem decomposition, which breaks the problem into sub-problems to mitigate the insensitivity of unsupervised methods to nuanced question relationships. Additionally, we will cover techniques for fine-tuning generative large language models for concept extraction and injecting the results into a subsequent retrieval model.
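The limitation of surface-level comparison described above can be illustrated with a minimal sketch. The questions and scoring below are illustrative only (not from the project's data or code): a TF-IDF cosine score rewards lexical overlap, so a conceptually different question can outscore a genuine paraphrase.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple smoothed TF-IDF vectors for tokenised documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                     for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two conceptually equivalent questions with almost no lexical overlap:
q1 = "do you own your home".split()
q2 = "is this house occupied by the purchaser".split()
# A lexically similar but conceptually different question:
q3 = "do you own your car".split()

v1, v2, v3 = tfidf_vectors([q1, q2, q3])
# Surface similarity favours the lexical match (q3) over the
# conceptual match (q2) — the limitation noted in the abstract.
print(cosine(v1, v2), cosine(v1, v3))
```

Because q1 and q2 share no terms, their cosine score is zero, while the conceptually unrelated q3 scores highly against q1.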
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multidisciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946–2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance, with other approaches performing comparably. Additionally, re-ranking the probabilistic model's results with neural models introduces only modest improvements of at most 0.07 in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
Vocabularies such as the European Language Social Science Thesaurus (ELSST) and the CLOSER ontology are the foundational taxonomies capturing the core social science concepts that underpin large-scale longitudinal social science surveys. However, standard text embeddings often fail to capture the complex hierarchical and relational structures of sociological concepts, relying instead on surface similarity. In this work, we propose a framework to model these nuances by adapting a large language model (LLM)-based text embedding model with a learnable diagonal Riemannian metric. This metric allows for a flexible geometry in which dimensions can be scaled to reflect semantic importance. Additionally, we introduce a Hierarchical Ranking Loss with dynamic margins as the sole training objective to enforce the multi-level hierarchical constraints (e.g., distinguishing 'self' from narrower, broader, or related concepts, and all from 'unrelated' ones) from ELSST within the Riemannian space, for instance ensuring that a specific concept like 'social stratification' is embedded closer to 'social inequality' (its broader, related concept) and substantially further from an 'unrelated' concept like 'particle physics'. Lastly, we show that our parameter-efficient approach significantly outperforms strong contrastive learning and hyperbolic embedding baselines on hierarchical concept retrieval and classification tasks using the ELSST and CLOSER datasets. Visualizations confirm that the learned embedding space exhibits a clear hierarchical structure. Our work offers a more accurate and geometrically informed method for representing complex sociological constructs.
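The two ingredients named in the abstract — a diagonal metric and a ranking loss with relation-dependent margins — can be sketched in a few lines. This is a toy illustration under assumed margin values and made-up 3-d embeddings, not the paper's implementation or hyperparameters.

```python
import math

def diagonal_distance(x, y, w):
    """Distance under a learnable diagonal metric:
    d(x, y) = sqrt(sum_i w_i * (x_i - y_i)^2),
    where positive weights w_i rescale each embedding dimension
    according to its learned semantic importance."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

# Dynamic margins: tighter relations must sit closer to the anchor.
# These relation levels and margin values are illustrative only.
MARGINS = {"narrower": 0.2, "broader": 0.4, "related": 0.6, "unrelated": 1.0}

def hierarchical_ranking_loss(anchor, positive, negative, neg_relation, w):
    """Hinge-style ranking loss: the negative must end up at least
    MARGINS[neg_relation] further from the anchor than the positive."""
    margin = MARGINS[neg_relation]
    d_pos = diagonal_distance(anchor, positive, w)
    d_neg = diagonal_distance(anchor, negative, w)
    return max(0.0, margin + d_pos - d_neg)

# Hypothetical 3-d embeddings: 'social stratification' should land near
# 'social inequality' (broader concept) and far from 'particle physics'.
w = [1.0, 0.5, 2.0]
strat, ineq, physics = [0.1, 0.2, 0.3], [0.2, 0.1, 0.3], [2.0, 1.5, -1.0]
print(hierarchical_ranking_loss(strat, ineq, physics, "unrelated", w))
```

With these toy values the constraint is already satisfied, so the loss is zero; swapping the positive and negative produces a positive loss that a gradient step on `w` (and the embeddings) would reduce.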
Questions from the CLOSER DDI-Lifecycle repository will be used to help train a model capable of using questions and response domains from the metadata extraction workstream to create conceptually equivalent items from which data variables can be concorded. Approaches such as a fine-tuned large language model (LLM)-based relevance-scoring model and vector retrieval with LLM reordering will be presented. The session will present initial results in question concept tagging that feed into the conceptual comparison task, addressing the challenges of the data's long-tail distribution, model memorisation, and human annotation bias in the dataset. Higher-level machine learning (ML) limitations, such as identifying indeterminate tags and the notion of probability in model outputs, will also be explored.
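The retrieve-then-rerank pattern mentioned above can be sketched end to end. This is a minimal, hypothetical illustration: the corpus, embeddings, and `llm_relevance_score` stub (standing in for a fine-tuned LLM scorer) are invented for the sketch and do not reflect the project's models or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def llm_relevance_score(query, candidate):
    """Stand-in for a fine-tuned LLM relevance scorer (hypothetical).
    In practice this would query the fine-tuned model; here we fake it
    with word-overlap (Jaccard) so the sketch runs end to end."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(q | c)

def retrieve_then_rerank(query, query_vec, corpus, k=3):
    """Stage 1: cheap vector retrieval of the top-k candidates.
    Stage 2: reorder those candidates with the (expensive) LLM scorer."""
    by_vector = sorted(corpus, key=lambda it: cosine(query_vec, it[1]),
                       reverse=True)[:k]
    return sorted(by_vector, key=lambda it: llm_relevance_score(query, it[0]),
                  reverse=True)

# Toy corpus of (question text, embedding) pairs; vectors are made up.
corpus = [
    ("Do you own or rent your home?", [0.9, 0.1, 0.0]),
    ("What is your current job title?", [0.1, 0.9, 0.0]),
    ("Is your accommodation owned or rented?", [0.8, 0.2, 0.1]),
]
top = retrieve_then_rerank("own rent home", [1.0, 0.0, 0.0], corpus, k=2)
print([text for text, _ in top])
```

The design point is that the vector stage keeps the candidate set small, so the costly LLM scorer only runs on `k` items per query rather than the whole repository.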