
Dr Suparna De
Academic and research departments
Surrey Institute for People-Centred Artificial Intelligence (PAI), School of Computer Science and Electronic Engineering.
About
Biography
Suparna De is a Senior Lecturer in Computer Science, a Surrey AI Fellow in the Surrey Institute for People-Centred AI and an Honorary Senior Research Fellow in the Social Research Institute at UCL, UK. She is also a visiting lecturer at the Research Center for Information and Communications Technologies (CITIC), University of Granada, Spain.
She is a member of the Nature Inspired Computing and Engineering (NICE) research group.
She serves on the editorial boards of Nature Scientific Reports (Computational Science division) and the Elsevier High-Confidence Computing journal.
She completed her Ph.D. and MSc (with distinction) degrees in the Department of Electronic Engineering at the University of Surrey.
University roles and responsibilities
- Computer Science Admissions Tutor (UG)
Research
Research interests
Suparna's research applies machine learning and Semantic Web technologies to the broad domain of knowledge and data engineering, including deep learning for text data (derived from social networks and longitudinal social science datasets), semantic modelling and search and IoT data analytics.
Her current work focuses on machine comprehension and information extraction for longitudinal text, applied to metadata extraction and uplift for social science questionnaires, and on adaptive privacy-preserving models for online social networks.
Previous work has included research in Big Data analytics, visualisation and data fusion techniques for understanding city dynamics from multimodal spatio-temporal data.
Research projects
ESRC-funded, (Surrey) Principal Investigator (PI), April 2024 - September 2025.
Grant ref: ES/Z502935/1, £757k
This project is a multi-disciplinary collaboration between social survey specialists, survey data curators/archivists and computer scientists, with partners from the University of Surrey, CLOSER, UCL, ScotCen and the UK Data Archive (UKDA) at the University of Essex.
This project aims to develop novel ML models tailored to the specific challenges of semantically rich survey data collection instruments and research datasets. It targets the alignment of both structural metadata (standards) and semantic metadata (controlled vocabularies and conceptual frameworks), with the outputs of the developed ML models used to create metadata resources. These metadata resources, realised as knowledge graphs of the questionnaire items, will extend recent advances in text-layout ML models. Additionally, the project contributes to enabling automated disclosure risk assessments in large UKDA datasets by developing novel algorithms for question text equivalence, addressing the challenge of semantic shifts in large longitudinal studies. The project meets the evolving needs of researchers from a range of disciplines who utilise longitudinal population survey (LPS) and other survey data.
EPSRC-funded, Co-Investigator (Co-I), April 2022 - March 2025.
This three-year, £3.4 million project will produce Privacy-Enhancing Technologies (PETs) to support the online privacy and safety of people navigating significant life transitions. The project comprises a multi-disciplinary team of experts in Cybersecurity, Psychology, Law, Business, and Criminology from the Universities of Cambridge, Edge Hill, Edinburgh, Queen Mary University of London, Strathclyde and Surrey.
KTP - Predictive Maintenance for Rail Infrastructure
Innovate UK-funded, Co-Principal Investigator (PI) and Academic Lead, August 2023 - Jan 2026. Grant number: 10054741
This 30-month, £212k Knowledge Transfer Partnership (KTP) grant, funded by Innovate UK, will develop novel machine learning algorithms and models for a cloud-based predictive maintenance platform that also enables real-time monitoring and risk prediction from IoT multi-sensors that sense critical parts of the rail infrastructure (e.g. tracks, bridges, structures supporting overhead lines).
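For illustration only, a minimal sketch of the kind of anomaly detection such a platform might run over multi-sensor readings, here with an Isolation Forest; the sensor features, values and thresholds are invented stand-ins, not the project's actual models:

```python
# Illustrative sketch: flagging anomalous track-side multi-sensor
# readings with an Isolation Forest. Features are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated readings: [vibration_rms, strain, temperature_c]
normal = rng.normal(loc=[0.5, 120.0, 15.0], scale=[0.05, 5.0, 3.0], size=(500, 3))
faulty = rng.normal(loc=[1.5, 180.0, 15.0], scale=[0.3, 20.0, 3.0], size=(5, 3))
readings = np.vstack([normal, faulty])

# Fit on known-good data; predict() returns -1 for anomalies, 1 for normal
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
flags = model.predict(readings)

print(f"{(flags == -1).sum()} of {len(readings)} readings flagged for inspection")
```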
Understanding the multiple dimensions of prediction of concepts in social and biomedical science questionnaires
Science and Technology Facilities Council (STFC) DiRAC-funded, PI, Oct. '21 - March '22.
Part of Grant Number: ST/S003916/1
This grant is a collaboration with CLOSER (Cohort and Longitudinal Studies Enhancement Resources), UCL, and RITS (Research IT Services), UCL. The project investigates various dimensions of concept prediction: against a range of different types of unseen data from the UK Data Archive's longitudinal studies (e.g. social science vs biomedical, 1995 vs 2015), and by building up an understanding of prediction rates by category. Hierarchical approaches for topic classification against the European Language Social Science Thesaurus (ELSST) were developed.
ESRC-funded, PI, Feb. 2021 - Feb. 2022.
Total funding amount: £81,500. Part of Grant Number: ES/K000357/1
This grant is a collaboration with CLOSER, UCL. The project investigates automated capture of metadata from the CLOSER longitudinal population studies. Automation of question extraction from paper questionnaires will form part of a pipeline to populate question banks and other metadata repositories, providing a low-cost alternative to the manual processes undertaken by the CLOSER project, the UKDA and other archives to enhance survey metadata alongside the data description.
Science and Technology Facilities Council (STFC) DiRAC-funded, PI, Feb. - Dec. 2021. Part of Grant Number: ST/S003916/1.
This grant is a collaboration with CLOSER, UCL, and RITS, UCL. The project will utilise the questions and linked concepts (based on the European Language Social Science Thesaurus (ELSST)) held in CLOSER Discovery. The aim is to create a model that can classify existing questions (and predict for new questions) against these existing concepts in ELSST.
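A minimal sketch of this classify-by-similarity idea, assuming a generic sentence-embedding model and a handful of illustrative ELSST-style concept labels (not the project's actual label set or method):

```python
# Sketch: assign survey questions to thesaurus concepts by embedding
# similarity. Concept labels below are illustrative ELSST-style terms.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

concepts = ["EMPLOYMENT", "HOUSING", "PHYSICAL HEALTH", "EDUCATIONAL ATTAINMENT"]
questions = [
    "How many hours per week do you usually work in your main job?",
    "Does your accommodation have central heating?",
]

concept_emb = model.encode(concepts, convert_to_tensor=True)
question_emb = model.encode(questions, convert_to_tensor=True)

scores = util.cos_sim(question_emb, concept_emb)  # (n_questions, n_concepts)
for q, row in zip(questions, scores):
    print(q, "->", concepts[int(row.argmax())])
```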
TagItSmart (2015-18) is a Smart Tags driven service platform for enabling ecosystems of connected objects that dynamically change their status in response to a variety of environmental factors and can be seamlessly tracked during their lifecycle. TagItSmart is developing smart tags that combine the power of functional inks with the pervasiveness of digital and electronic markers, e.g. dynamic QR codes and NFC tags, in order to capture new contextual information. Besides this, the ubiquitous presence of smartphones with their cameras and NFC readers facilitates seamless observation measurements and lifecycle tracking of smart-tag big data.
I design and develop semantic models and reasoning algorithms for capturing the characteristics of the Smart Tags and for providing decision-support mechanisms that connect their lifecycle data to semantically-enabled workflows.
The EU H2020 frontierCities2 project (December 2016 - November 2018) is an acceleration and incubation programme that aims to accelerate market uptake of the FIWARE generic enablers in the Internet of Things and Smart Cities domain by supporting SMEs and startups to develop FIWARE-enabled smart mobility and smart city solutions.
I work together with the FIWARE Foundation to deliver technical coaching to the startups and lead the work-package tasked with further developing the FIWARE enablers and support mechanisms.
The iKaaS project (Oct 2015 - Oct 2017), jointly funded by the EC H2020 programme and MIC, Japan, delivers a secure, robust and scalable multi-cloud platform that brings together the paradigms of IoT, Big Data and the Cloud.
I researched aspects of data analysis in smart city platforms built on heterogeneous cloud platforms. As part of this, we developed novel anomaly detection algorithms for city environmental features (such as measured air pollution levels) and real-time detection of city social events by analysing Twitter feeds. The research delivered tools to discover correlation between large-scale city events and anomalies detected in pollution levels.
The IoT.est project (2011-14), funded by the EU FP7 Programme, developed a test-driven service creation environment for Internet of Things enabled business services. I served as the work-package leader for semantic annotation and large-scale discovery of IoT services, as well as the University of Surrey technical coordinator. I was also in charge of the proof-of-concept project demonstrator that integrated modules from project partners.
The iCore project (2011-14), funded by the EU FP7 Programme, defined the concept of Virtual Objects (VO) to abstract the technological heterogeneity derived from the vast amounts of heterogeneous objects and devices forming part of the IoT. I researched the dynamic association derivation between ICT and real-world objects and service workflow composition in iCore.
The IoT-A project (2010-13), funded by the EU FP7 Programme, was the lighthouse EU project that defined a reference architecture and model for the IoT, including its constituent concepts such as entity, resource and IoT service. I was the deputy leader of the WP that researched various mechanisms for resolution frameworks for the IoT.
MVCE Core 4
I worked as a PhD researcher on the Mobile VCE Core 4 Programme on Ubiquitous Services (funded by the UK Technology Strategy Board), work area: ontology-based context management for mobile systems.
Indicators of esteem
Scholarships:
- Overseas Research Student Sponsorship for PhD research: University of Surrey and MobileVCE Core 4 programme
- DFIDSSS Scholarship: jointly funded by the University of Surrey and the British Commonwealth Scholarship Commission for the MSc programme.
Awards:
- Cable and Wireless Award: University of Surrey, for the best overall performance from a student graduating with an MSc in Satellite Communication Engineering or Communications Networks and Software
- IET Certificate in recognition of significant contribution to IET On Campus at the University of Surrey
Supervision
Postgraduate research supervision
Postdoctoral Research Fellows (Line manager):
Dr. Chandresh Pravin, "NLP and Text-layout LLMs", 2024 - present, funded by UKRI ESRC.
Dr. Justina Li, "Longitudinal NLP", 2024 - present, funded by UKRI ESRC.
Principal PhD Supervisor: Zeqiang Wang, "Natural Language Processing for Longitudinal Social and Biomedical Science Datasets", (Oct. 2023 - present).
PhD Co-supervisor: Chao Jiang, "Improving inference of Large Language Models", (Jan. 2024 - present).
Collaborative PhD supervisor: Yuqi Wang, Xi’an Jiaotong-Liverpool University, China (Oct. 2021 - present).
Completed postgraduate research projects I have supervised
Co-supervisor (2015-18): Yuchao Zhou: Data-driven Cyber-Physical-Social System for Knowledge Discovery in Smart Cities.
Co-supervisor (2021-22): Tarek Elsaleh, Semantic Data Management for Dynamic Internet of Things (IoT) Services, PhD by published works.
Collaborative PhD supervisor (2017-21): Qi Chen: Distributed Intelligence for Big Smart City Data Processing, Xi’an Jiaotong-Liverpool University, China.
Teaching
2024-25, Spring semester:
Natural Language Processing: module co-convener.
2023-24, Spring semester:
Parallel Computing: module lead.
Computer Networks: module lead.
Professional Project (BSc final year project) and MSc dissertation: academic supervisor
Publications
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multidisciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
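As an illustration of the retrieve-then-re-rank pattern the abstract evaluates, a sketch using BM25 as the probabilistic retriever and an off-the-shelf cross-encoder as the neural re-ranker; the questions are invented and the models are generic stand-ins, not the paper's exact setup:

```python
# Sketch: BM25 retrieval of candidate equivalent questions, re-scored
# by a neural cross-encoder. Corpus and query are toy examples.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "How many rooms does your accommodation have?",
    "Do you own or rent your home?",
    "What is your current job title?",
]
query = "Is your house owned outright, mortgaged, or rented?"

# Stage 1: probabilistic retrieval (BM25) over whitespace tokens
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:2]

# Stage 2: neural re-ranking of the top candidates
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, corpus[i]) for i in candidates])
best = candidates[max(range(len(candidates)), key=lambda j: ce_scores[j])]
print("Most equivalent question:", corpus[best])
```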
With cycling moving from being a pastime to a mainstream form of mobility and transport, bike-sharing systems (BSSs) are increasingly being deployed in many cities. Analysis of BSS usage data can provide insights into factors that shape the patterns of trips, uncovering latent city dynamics.
Accurately assigning standardized diagnosis and procedure codes from clinical text is crucial for healthcare applications. However, this remains challenging due to the complexity of medical language. This paper proposes a novel model that incorporates extreme multi-label classification tasks to enhance International Classification of Diseases (ICD) coding. The model utilizes deformable convolutional neural networks to fuse representations from hidden layer outputs of pre-trained language models and external medical knowledge embeddings fused using a multimodal approach to provide rich semantic encodings for each code. A probabilistic label tree is constructed based on the hierarchical structure existing in ICD labels to incorporate ontological relationships between ICD codes and enable structured output prediction. Experiments on medical code prediction on the MIMIC-III database demonstrate competitive performance, highlighting the benefits of this technique for robust clinical code assignment.
Verbose and complicated legal terminology in online service terms and conditions (T&C) means that users typically don't read these documents before accepting the terms of such unilateral service contracts. With such services becoming part of mainstream digital life, highlighting Terms of Service (ToS) clauses that impact the collection and use of user data and privacy is an important concern. Advances in text summarization can help to create informative and concise summaries of the terms, but existing approaches geared towards news and microblogging corpora are not directly applicable to the ToS domain, which is hindered by a lack of T&C-relevant resources for training and evaluation. This paper presents a ToS model, developing a hybrid extractive-classifier-abstractive pipeline that highlights the privacy and data collection/use-related sections in a ToS document and paraphrases these into concise and informative sentences. Relying on significantly less training data (4313 training pairs) than previous representative works (287,226 pairs), our model outperforms extractive baselines by at least 50% in ROUGE-1 score and 54% in METEOR score. The paper also contributes to existing community efforts by curating a dataset of online service T&C, through a developed web scraping tool.
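A rough sketch of the hybrid extractive-classifier-abstractive idea, assembled from off-the-shelf pipelines; the models and labels are generic stand-ins, not the paper's trained components:

```python
# Sketch of an extract -> classify -> abstract pipeline for ToS text,
# using generic zero-shot classification and summarisation models.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

clauses = [  # pretend these were extracted from a ToS document
    "We may share your usage data with third-party advertising partners.",
    "These terms are governed by the laws of the State of California.",
]

# 1) keep only clauses classified as privacy/data-use related
relevant = [
    c for c in clauses
    if classifier(c, candidate_labels=["data collection and privacy", "other"])
    ["labels"][0] == "data collection and privacy"
]

# 2) abstractively compress the relevant clauses into a short summary
summary = summarizer(" ".join(relevant), max_length=30, min_length=5, do_sample=False)
print(summary[0]["summary_text"])
```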
This work extends the results of our work on pre-trained language models with recent developments in text-layout models and zero-shot techniques. Since relying solely on textual information makes it difficult to accurately classify and extract metadata, a combination of textual content and visual logic that incorporates vision transformers with optimisation techniques will be explored. This will allow us to extract the specific items within questionnaires, such as question texts, responses and routing, to create a rich source of metadata that provenances the data collection methodology to the resultant data, which can be transformed into DDI-Lifecycle. We will investigate the feasibility of document understanding multimodal models that employ masked language techniques and present the resulting challenges.
Recent studies have outlined the accessibility challenges faced by blind or visually impaired, and less-literate people, in interacting with social networks, in-spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text to speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference time performance when benchmarked against the state-of-the-art TTS models, making it suitable for real-time accessibility applications.
Questions from the CLOSER DDI-Lifecycle repository will be used to assist in training a model that is capable of using questions and response domains from the metadata extraction workstream to create conceptually equivalent items from which data variables can be concorded. Approaches such as a fine-tuned large language model (LLM)-based relevance scoring model and vector retrieval with LLM reordering will be presented. The session will present initial results in question concept tagging that feed into the conceptual comparison task, addressing challenges of long-tail distribution of the data, model memorisation and human annotation bias in the dataset. Higher-level machine learning (ML) limitations of identifying indeterminate tags and the notion of probability in model outputs will be explored.
Social media is recognized as an important source for deriving insights into public opinion dynamics and social impacts due to the vast textual data generated daily and the 'unconstrained' behavior of people interacting on these platforms. However, such analyses prove challenging due to the semantic shift phenomenon, where word meanings evolve over time. This paper proposes an unsupervised dynamic word embedding method to capture longitudinal semantic shifts in social media data without predefined anchor words. The method leverages word co-occurrence statistics and dynamic updating to adapt embeddings over time, addressing the challenges of data sparseness, imbalanced distributions, and synergistic semantic effects. Evaluated on a large COVID-19 Twitter dataset, the method reveals semantic evolution patterns of vaccine- and symptom-related entities across different pandemic stages, and their potential correlations with real-world statistics. Our key contributions include the dynamic embedding technique, empirical analysis of COVID-19 semantic shifts, and discussions on enhancing semantic shift modeling for computational social science research. This study enables capturing longitudinal semantic dynamics on social media to understand public discourse and collective phenomena.
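For readers unfamiliar with measuring semantic shift, a simplified stand-in for the paper's dynamic-updating method: train a word embedding model per time slice, align the spaces with orthogonal Procrustes, and compare a word's vectors across slices; the corpora here are toy data:

```python
# Sketch: per-slice word2vec models aligned with orthogonal Procrustes,
# a common recipe for quantifying semantic shift (toy corpora).
import numpy as np
from gensim.models import Word2Vec

slice_2020 = [["vaccine", "trial", "research"], ["vaccine", "development", "lab"]] * 50
slice_2021 = [["vaccine", "rollout", "appointment"], ["vaccine", "dose", "booking"]] * 50

m1 = Word2Vec(slice_2020, vector_size=50, min_count=1, seed=0)
m2 = Word2Vec(slice_2021, vector_size=50, min_count=1, seed=0)

shared = sorted(set(m1.wv.index_to_key) & set(m2.wv.index_to_key))
A = np.stack([m1.wv[w] for w in shared])
B = np.stack([m2.wv[w] for w in shared])

# Orthogonal Procrustes: rotate the 2021 space onto the 2020 space
U, _, Vt = np.linalg.svd(B.T @ A)
B_aligned = B @ (U @ Vt)

i = shared.index("vaccine")
cos = A[i] @ B_aligned[i] / (np.linalg.norm(A[i]) * np.linalg.norm(B_aligned[i]))
print("semantic shift of 'vaccine' (1 - cosine):", round(float(1 - cos), 3))
```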
The Internet of Things (IoT) and its applications emphasize the need for being context-aware to be able to sense the changing environmental conditions and to make use of the rich contextual information for analysis. The huge volume and high-velocity characteristics of IoT data necessitates that representation of IoT data takes into consideration the contextual information at scale during every step of the data processing life cycle, from production to storage, publication, and search. This chapter categorizes and describes the diverse forms of IoT data that are obtained from heterogeneous sensing sources. It also presents a framework for describing and analyzing the different types of contextual information that need to be associated with the IoT data in order to drive context-aware management and intelligent analytics. In addition, mechanisms for storing big IoT data and its contextual information are described, and common search and discovery methods for making IoT data accessible to applications and analysis components are presented.
In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to zero-shot medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank’s superior performance over existing methods, with promising results with a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.
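A sketch of the first MedFusionRank stage only (compact keyword extraction with a BERT-style model), using KeyBERT as a stand-in; the clinical query is invented and the knowledge-graph enrichment step is omitted:

```python
# Sketch: BERT-style keyword extraction from a medical query, whose
# output phrases would then be linked to knowledge-graph entities.
from keybert import KeyBERT

query = ("Patient presents with persistent dry cough, low-grade fever "
         "and shortness of breath after recent travel.")

kw_model = KeyBERT()  # defaults to a small sentence-transformer backbone
keywords = kw_model.extract_keywords(query, keyphrase_ngram_range=(1, 2), top_n=5)
for phrase, score in keywords:
    print(f"{phrase}: {score:.2f}")  # candidate terms to link to KG concepts
```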
Computational analyses driven by Artificial Intelligence (AI)/Machine Learning (ML) methods to generate patterns and inferences from big datasets in computational social science (CSS) studies can suffer from biases during the data construction, collection and analysis phases, as well as encounter challenges of generalizability and ethics. Given the interdisciplinary nature of CSS, many factors influence the possibility of biases being introduced, such as the need for a comprehensive understanding of the policy and rights landscape, the fast-evolving AI/ML paradigms and dataset-specific pitfalls. This chapter identifies challenges faced by researchers in the CSS field and presents a taxonomy of biases that may arise in AI/ML approaches. The taxonomy mirrors the various stages of common AI/ML pipelines: dataset construction and collection, data analysis and evaluation. Detecting and mitigating bias in AI is an active area of research; this chapter seeks to highlight practices for incorporating responsible research and innovation into CSS.
Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.
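One ingredient the abstract mentions, vocabulary replacement, can be sketched with generic WordNet synonyms; this is a shape-of-the-idea example, since the paper uses domain-specific biomedical vocabulary rather than general-purpose synonyms:

```python
# Sketch: synthetic NLI training examples via synonym replacement.
import random

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
random.seed(0)

def synonym_swap(sentence: str, p: float = 0.3) -> str:
    """Replace each token with a WordNet synonym with probability p."""
    out = []
    for tok in sentence.split():
        syns = {l.name().replace("_", " ") for s in wn.synsets(tok) for l in s.lemmas()}
        syns.discard(tok)
        out.append(random.choice(sorted(syns)) if syns and random.random() < p else tok)
    return " ".join(out)

premise = "The treatment group showed a significant reduction in tumour size"
print(synonym_swap(premise))  # a synthetic paraphrase for augmentation
```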
The emerging paradigm of urban computing aims to infer latent patterns from various aspects of a city's environment and possibly identify their hidden correlations by analyzing urban big data. This article provides a fine-grained analysis of air quality from diverse sensor data streams retrieved from regions in the city of London. The analysis derives spatio-temporal patterns, that is, across different location categories and time spans, and also reveals the interplay between urban phenomena such as human commuting behavior and the built environment, with the observed air quality patterns. The findings have important implications for the health of ordinary citizens and for city authorities who may formulate policies for a better environment.
Classification techniques are at the heart of many real-world applications, e.g. sentiment analysis, recommender systems and automatic text annotation, to process and analyse large-scale textual data in multiple fields. However, the effectiveness of natural language processing models can only be confirmed when a large amount of up-to-date training data is available. An unprecedented amount of data is continuously created, and new topics are introduced, making it less likely or even infeasible to collect labelled samples covering all topics for training models. We attempt to study the extreme case: there is no labelled data for model training, and the model, without being adapted to any specific dataset, will be directly applied to the testing samples. We propose a transformer-based framework to encode sentences in a contextualised way and leverage the existing knowledge resources, i.e. ConceptNet and WordNet, to integrate both descriptive and structural knowledge for better performance. To enhance the robustness of the model, we design an adversarial example generator based on relations from external knowledge bases. The framework is evaluated on both general and specific domain text classification datasets. Results show that the proposed framework can outperform the existing competitive state-of-the-art baselines, delivering new benchmark results.
In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank's superior performance over existing methods, with promising results with a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.
In this paper, we present a method that facilitates the Internet of Things (IoT) in building a product passport and data exchange, enabling the next stage of the circular economy. SmartTags based on printed sensors (i.e., using functional ink) and a modified GS1 barcode standard enable unique identification of objects at a per-item level (including Fast-Moving Consumer Goods, FMCG), as well as collecting, sensing and reading of parameters from the environment and tracking a product's lifecycle. The developed ontology is the first effort to define a semantic model for dynamic sensors, including datamatrix and QR codes. The evaluation of decoding and readability of identifiers (QR codes) showed good performance for detection of the sensor state printed over and outside the QR code data matrix, i.e., recognition with an image vision algorithm was possible. The evaluation of the decoding performance of the QR code data matrix printed with sensors was also efficient, i.e., the QR code's ability to be decoded with the reader after the reversible and irreversible process of ink (dis)appearing was preserved, with a slight drop in performance if the ink density is low.
Smart cities are cyber-physical-social systems, where city data from different sources can be collected, processed and analyzed to extract useful knowledge. As the volume of data from the social world is exploding, social media data analysis has become an emerging area in many different disciplines. During crisis events, users may post informative tweets about affected individuals, utility damage or cautions on social media platforms. If such tweets are efficiently and effectively processed and analyzed, city organizations may gain a better situational awareness of the affected citizens and provide better response actions. Advances in deep neural networks have significantly improved the performance of many social media analysis tasks, e.g., sentiment analysis, fake news detection, crisis data classification, etc. However, deep learning models require a large amount of labeled data for model training, which is impractical to collect, especially at the early stage of a crisis event. To address this limitation, we propose a BERT-based Adversarial Domain Adaptation model (BERT-ADA) for crisis tweet classification. Our experiments with three real-world crisis datasets demonstrate the advantages of the proposed model over several baselines.
Hyperproperties are correctness conditions for labelled transition systems that are more expressive than traditional trace properties, with particular relevance to security. Recently, Attiya and Enea studied a notion of strong observational refinement that preserves all hyperproperties. They analyse the correspondence between forward simulation and strong observational refinement in a setting with finite traces only. We study this correspondence in a setting with both finite and infinite traces. In particular, we show that forward simulation does not preserve hyperliveness properties in this setting. We extend the forward simulation proof obligation with a progress condition, and prove that this progressive forward simulation does imply strong observational refinement.
Unsupervised topic modelling is a useful unbiased mechanism for topic labelling of complex longitudinal questionnaires covering multiple domains such as social science and medicine. Manual tagging of such complex datasets increases the propensity of incorrect or inconsistent labels and is a barrier to scaling the processing of longitudinal questionnaires for provision of question banks for data collection agencies. Towards this effort, we propose a tailored BERTopic framework that takes advantage of its novel sentence embedding for creating interpretable topics, and extend it with an enhanced visualisation for comparing the topic model labels with the tags manually assigned to the question literals. The resulting topic clusters uncover instances of mislabelled question tags, while also showcasing the semantic shifts and evolution of the topics across the time span of the longitudinal questionnaires. The tailored BERTopic framework outperforms existing topic modelling baselines for the quantitative evaluation metrics of topic coherence and diversity, while also being 18 times faster than the next best-performing baseline.
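For context, the underlying BERTopic workflow the paper tailors looks roughly like this; the questions here are generated toy variations standing in for the longitudinal question literals, since BERTopic needs a reasonably sized corpus:

```python
# Sketch of a plain BERTopic run over question-like texts; the paper's
# tailored framework and visualisations are not reproduced here.
from bertopic import BERTopic

themes = {
    "housing": "How many rooms does your {} have?",
    "employment": "How many hours do you work at your {} job?",
    "health": "How often do you visit your {} doctor?",
}
fillers = ["current", "main", "usual", "new", "old", "first", "second",
           "local", "family", "previous", "next", "regular", "own", "rented"]
questions = [t.format(f) for t in themes.values() for f in fillers]

topic_model = BERTopic(min_topic_size=5)
topics, probs = topic_model.fit_transform(questions)
print(topic_model.get_topic_info())  # clusters to compare with manual tags
```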
In pervasive environments, availability and reliability of a service cannot always be guaranteed. In such environments, automatic and dynamic mechanisms are required to compose services or compensate for a service that becomes unavailable during runtime. Most of the existing works on service composition do not provide sufficient support for automatic service provisioning in pervasive environments. We propose a Divide and Conquer algorithm that can be used at service runtime to repeatedly divide a service composition request into several simpler sub-requests. The algorithm repeats until, for each sub-request, we find at least one atomic service that meets the requirements of that sub-request. The identified atomic services can then be used to create a composite service. We discuss the technical details of our approach and show evaluation results based on a set of composite service requests. The results show that our proposed method performs effectively in decomposing a composite service request into a number of sub-requests and in finding and matching service components that can fulfil the service composition request.
Advances made in the Internet of Things (IoT) and other disruptive technological trends, including big data analytics and edge computing methods, have contributed enabling solutions to the numerous challenges affecting modern communities. With Gartner reporting 14.2 billion IoT devices in 2019 [1] and, according to some reports [2], a projected 30.9 billion devices to be deployed by 2025 in areas like environment monitoring [3], smart agriculture [4], smart healthcare [5] or smart cities [6], one could be tempted to think that most related issues are already resolved. However, there remain practical challenges in large-scale and rapid deployment of sensors for diverse applications, such as problems affecting siting optimization methods and participant recruitment and incentive mechanisms. On a higher level, the deluge of data sources that drive the IoT phenomenon grows every day. With the rise of smartphone-enabled citizen sensing data via social networks or personal health devices, as well as with increasing connectedness in transport, logistics, utilities, or manufacturing domains, this range and complexity of available data calls for even more advanced data processing, mining and fusion methods than those already applied. ...
- A multi-modal Generative Adversarial Network for traffic event detection.
- Semi-supervised learning based on a generative adversarial network.
- Detecting traffic events with both sensor and social media data.
- Evaluation based on a large, real-world multi-modal dataset.
Advances in the Internet of Things have enabled the development of many smart city applications and expert systems that help citizens and authorities better understand the dynamics of the cities, and make better planning and utilisation of city resources. Smart cities are composed of complex systems that usually process and analyse big data from the Cyber, Physical, and Social worlds. Traffic event detection is an important and complex task in smart transportation modelling and management. We address this problem using semi-supervised deep learning with data of different modalities, e.g., physical sensor observations and social media data. Unlike most existing studies focusing on data of a single modality, the proposed method makes use of data of multiple modalities that appear to complement and reinforce each other. Meanwhile, as the amount of labelled data in big data applications is usually extremely limited, we extend the multi-modal Generative Adversarial Network model to a semi-supervised architecture to characterise traffic events. We evaluate the model with a large, real-world dataset consisting of traffic sensor observations and social media data collected from the San Francisco Bay Area over a period of four months. The evaluation results clearly demonstrate the advantages of the proposed model in extracting and classifying traffic events.
The information generated from the Internet of Things (IoT) potentially enables a better understanding of the physical world for humans and supports creation of ambient intelligence for a wide range of applications in different domains. A semantics-enabled service layer is a promising approach to facilitate seamless access and management of the information from the large, distributed and heterogeneous sources. This paper presents the efforts of the IoT.est project towards developing a framework for service creation and testing in an IoT environment. The architecture design extends the existing IoT reference architecture and enables a test-driven, semantics-based management of the entire service lifecycle. The validation of the architecture is shown through a dynamic test case generation and execution scenario.
Social media platforms such as Twitter are increasingly used to collect data of all kinds. During natural disasters, users may post text and image data on social media platforms to report information about infrastructure damage, injured people, cautions and warnings. Effective processing and analysis of tweets in real time can help city organisations gain situational awareness of the affected citizens and take timely action. With the advances in deep learning techniques, recent studies have significantly improved the performance of classifying crisis-related tweets. However, deep learning models are vulnerable to adversarial examples, which may be imperceptible to humans but can lead to the model's misclassification. To process multi-modal data as well as improve the robustness of deep learning models, we propose a multi-modal adversarial training method for crisis-related tweet classification in this paper. The evaluation results clearly demonstrate the advantages of the proposed model in improving the robustness of tweet classification.
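A common recipe for adversarial training of text models, FGM-style perturbation in embedding space, sketched on a toy classifier; this stands in for, and is not, the paper's multi-modal method:

```python
# Sketch: adversarial training via a gradient-direction perturbation of
# the input embeddings (FGM-style), on a toy text classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed, clf = nn.Embedding(1000, 32), nn.Linear(32, 2)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 12))   # a batch of tokenised "tweets"
labels = torch.randint(0, 2, (8,))

emb = embed(tokens)
emb.retain_grad()                          # keep the gradient on a non-leaf
clean_loss = loss_fn(clf(emb.mean(dim=1)), labels)
clean_loss.backward(retain_graph=True)

# Perturb embeddings along the loss gradient, then add an adversarial loss
epsilon = 1.0
r_adv = epsilon * emb.grad / (emb.grad.norm() + 1e-12)
adv_loss = loss_fn(clf((emb + r_adv).mean(dim=1)), labels)
total_loss = clean_loss + adv_loss         # backprop this in a real training step
print(float(clean_loss), float(adv_loss))
```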
Internet of Things (IoT) refers to an ecosystem where applications and services are driven by data collected from devices interacting with each other and the physical world. Although IoT has already brought spectacular benefits to human society, progress is actually not as fast as expected. From network structures to control flow graphs, IoT naturally generates an unprecedented volume of graph data continuously, which calls for the development and use of advanced graph-powered methods on this diverse, dynamic, and large-scale graph IoT data.
In recent years, pre-trained language models have garnered significant attention due to their effectiveness, which stems from the rich knowledge acquired during pre-training. To mitigate the inconsistency issues between pre-training tasks and downstream tasks and to facilitate the resolution of language-related issues, prompt-based approaches have been introduced, which are particularly useful in low-resource scenarios. However, existing approaches mostly rely on verbalizers to translate the predicted vocabulary to task-specific labels. The major limitations of this approach are the ignorance of potentially relevant domain-specific words and being biased by the pre-training data. To address these limitations, we propose a framework that incorporates conceptual knowledge for text classification in the extreme zero-shot setting. The framework includes prompt-based keyword extraction, weight assignment to each prompt keyword, and final representation estimation in the knowledge graph embedding space. We evaluated the method on four widely-used datasets for sentiment analysis and topic detection, demonstrating that it consistently outperforms recently-developed prompt-based approaches in the same experimental settings.
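The prompt-based step can be illustrated with a cloze prompt whose [MASK] predictions serve directly as keywords rather than passing through a fixed verbalizer; the template below is illustrative, not the paper's:

```python
# Sketch: verbalizer-free prompting, where the masked-LM's vocabulary
# predictions themselves act as prompt keywords.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

text = "The battery died after two days and the screen cracked."
prompt = f"{text} Overall, the topic of this text is [MASK]."

for pred in unmasker(prompt, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))  # keywords to embed next
```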
In this study, we demonstrate how we can quantify environmental implications of large-scale events and traffic (e.g., human movement) in public spaces, and identify specific regions of a city that are impacted. We develop an innovative data fusion framework that synthesises the state-of-the-art techniques in extracting pollution episodes and detecting events from citizen-contributed, city-specific messages on social media platforms (Twitter). We further design a fusion pipeline for this cross-domain, multimodal data, which assesses the spatio-temporal impact of the extracted events on pollution levels within a city. Results of the analytics have great potential to benefit citizens and in particular, city authorities, who strive to optimise resources for better urban planning and traffic management.
As Internet of Things (IoT) technologies become more widespread in everyday life, privacy issues are becoming more prominent. The aim of this research is to develop a personal assistant that can answer software engineers’ questions about Privacy by Design (PbD) practices during the design phase of IoT system development. Semantic web technologies are used to model the knowledge underlying PbD measurements, their intersections with privacy patterns, IoT system requirements and the privacy patterns that should be applied across IoT systems. This is achieved through the development of the PARROT ontology, developed through a set of representative IoT use cases relevant for software developers. This was supported by gathering Competency Questions (CQs) through a series of workshops, resulting in 81 curated CQs. These CQs were then recorded as SPARQL queries, and the developed ontology was evaluated using the Common Pitfalls model with the help of the Protégé HermiT Reasoner and the Ontology Pitfall Scanner (OOPS!), as well as evaluation by external experts. The ontology was assessed within a user study that identified that the PARROT ontology can answer up to 58% of privacy-related questions from software engineers.
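To make the competency-question-to-SPARQL idea concrete, a small rdflib sketch; the namespace and triples are hypothetical stand-ins, not the actual PARROT vocabulary:

```python
# Illustrative only: recording a competency question as a SPARQL query
# over ontology triples. The vocabulary below is invented.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/parrot#")
g = Graph()
g.add((EX.DataMinimisation, RDF.type, EX.PrivacyPattern))
g.add((EX.DataMinimisation, EX.addressesRequirement, Literal("collect only needed data")))

# CQ: "Which privacy patterns apply, and what requirement do they address?"
q = """
PREFIX ex: <http://example.org/parrot#>
SELECT ?pattern ?req WHERE {
  ?pattern a ex:PrivacyPattern ;
           ex:addressesRequirement ?req .
}
"""
for pattern, req in g.query(q):
    print(pattern, "->", req)
```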
As social networks become increasingly integrated with their users' daily lives, and users are willing to publicly share data about their offline activities on these networks, the resultant data offers a powerful tool to non-intrusively understand city dynamics as it captures human behaviour and interactions. In this paper, we derive lifestyle patterns from Foursquare social network data, using matrix factorization and tensor decomposition as unsupervised methods to extract latent spatio-temporal behavior patterns. The extracted patterns offer a precise definition of activity levels associated with specific lifestyles and showcase that users' behaviors are a combination of several lifestyles, in contrast to traditional circadian topology theory, which classifies individuals to a specific temporal pattern. The obtained patterns can provide deeper insights into city dynamics, the people within them and how society functions.
Text classification techniques have been substantially important to many smart computing applications, e.g. topic extraction and event detection. However, classification is always challenging when only insufficient amount of labelled data for model training is available. To mitigate this issue, zero-shot learning (ZSL) has been introduced for models to recognise new classes that have not been observed during the training stage. We propose an entailment-based zero-shot text classification model, named as S-BERT-CAM, to better capture the relationship between the premise and hypothesis in the BERT embedding space. Two widely used textual datasets are utilised to conduct the experiments. We fine-tune our model using 50% of the labels for each dataset and evaluate it on the label space containing all labels (including both seen and unseen labels). The experimental results demonstrate that our model is more robust to the generalised ZSL and significantly improves the overall performance against baselines.
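The entailment formulation such models build on can be shown with an off-the-shelf MNLI model: each candidate label becomes a hypothesis scored against the text. This is the generic pattern, not the paper's trained S-BERT-CAM model:

```python
# Sketch: entailment-based zero-shot classification, scoring each label
# as a hypothesis against the input text with a generic MNLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "The striker scored twice before being substituted at half time."
for label in ["sports", "politics", "technology"]:
    inputs = tok(premise, f"This text is about {label}.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # bart-large-mnli label order: contradiction, neutral, entailment
    entail_prob = logits.softmax(dim=-1)[0, 2]
    print(label, round(float(entail_prob), 3))
```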
Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge. While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a 'training and test dataset', a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository. The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics' values, which enable model tuning as well as transparent management of data and experiments.
Knowledge resources, e.g. knowledge graphs, which formally represent essential semantics and information for logic inference and reasoning, can compensate for the knowledge-unaware nature of many natural language processing techniques based on deep neural networks. This paper provides a focused review of the emerging but intriguing topic of fusing quality external knowledge resources to improve the performance of natural language processing tasks. Existing methods and techniques are summarised in three main categories: (1) static word embeddings, (2) sentence-level deep learning models, and (3) contextualised language representation models, depending on when, how and where external knowledge is fused into the underlying learning models. We focus on the solutions to mitigate two issues: knowledge inclusion and inconsistency between language and knowledge. Details on the design of each representative method, as well as their strengths and limitations, are discussed. We also point out some potential future directions in view of the latest trends in natural language processing research.
Advances in communication technologies and artificial intelligence (AI) are accelerating the paradigm of Industrial Internet of Things (IIoT). With IIoT enabling continuous integration of sensors and controllers with the network, intelligent analysis of the generated Big Data is a critical requirement. Although IIoT is considered a subset of IoT, it has its own peculiarities in terms of higher levels of safety, security and low-latency communication in an environment of critical real-time operations. Under these circumstances, discriminative deep learning (DL) algorithms are unsuitable due to their need for large amounts of labelled and balanced training data and the uncertainty of inputs, etc. To overcome these issues, researchers have started using Deep Generative Models (DGMs), which combine the flexibility of DL with the inference power of probabilistic modeling. In this work, we review the state of the art of DGMs and their applicability to IIoT, classifying the reviewed works into the IIoT application areas of anomaly detection, trust boundary protection, network traffic prediction and platform monitoring. Following an analysis of existing IIoT DGM implementations, we identify challenges (i.e. weak discriminative capability, insufficient interpretability, lack of generalization ability, generated data vulnerability, privacy concern and data complexity) that need to be investigated in order to accelerate the adoption of DGMs in IIoT and also propose some potential research directions.