Dr Suparna De


Senior Lecturer in Computer Science
PhD, MSc (distinction), BSc (honours), FHEA

About

Areas of specialism

Natural language understanding for longitudinal data; Computational Social Science; Semantic sensor search and modelling; IoT Big Data analysis

University roles and responsibilities

  • Computer Science Admissions Tutor (UG)

    Affiliations and memberships

    IEEE
    Member
    IEEE Standards Association: P3800 WG
    Working Group member of the IEEE P3800 - Standard for a Data-Trading System
    ACM
    Invited Member

      Research

      Indicators of esteem

      • Scholarships:
        • Overseas Research Student Sponsorship for PhD research: University of Surrey and MobileVCE Core 4 programme
        • DFIDSSS Scholarship: jointly funded by the University of Surrey and the British Commonwealth Scholarship Commission for MSc programme.
      • Awards:
        • Cable and Wireless Award: University of Surrey, for the best overall performance from a student graduating with an MSc in Satellite Communication Engineering or Communications Networks and Software
        • IET Certificate in recognition of significant contribution to IET On Campus at the University of Surrey

        Publications

        Wing Yan Li, Zeqiang Wang, Jon Johnson, Suparna De (2025) Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?, In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval; July 13–18, 2025; Padua, Italy ACM

        Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multidisciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mis-matched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.

        Suparna De, Ionut Bostan, Nishanth Ramakrishna Sastry (2025) Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis, In: Proceedings of the 2024 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2024) 15214, pp. 101-116 Springer

        Recent studies have outlined the accessibility challenges faced by blind or visually impaired, and less-literate people, in interacting with social networks, in spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text to speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference time performance when benchmarked against the state-of-the-art TTS models, making it suitable for real-time accessibility applications.

        Suparna De, Wei Wang, Usamah Jassat, Klaus Moessner (2022) Usage Mining of the London Santander Bike-Sharing System, In: Computer (Long Beach, Calif.) 55(12), pp. 98-108

        With cycling moving from being a pastime to a mainstream form of mobility and transport, bike-sharing systems (BSSs) are increasingly being deployed in many cities. Analysis of BSS usage data can provide insights into factors that shape the patterns of trips, uncovering latent city dynamics.

        Zeqiang Wang, Yuqi Wang, Haiyang Zhang, Wei Wang, Jun Qi, Jianjun Chen, Nishanth Ramakrishna Sastry, Jon Johnson, Suparna De (2024) ICDXML: enhancing ICD coding with probabilistic label trees and dynamic semantic representations, In: Scientific Reports 14(1), 18319 Nature Research

        Accurately assigning standardized diagnosis and procedure codes from clinical text is crucial for healthcare applications. However, this remains challenging due to the complexity of medical language. This paper proposes a novel model that incorporates extreme multi-label classification tasks to enhance International Classification of Diseases (ICD) coding. The model utilizes deformable convolutional neural networks to fuse representations from hidden layer outputs of pre-trained language models and external medical knowledge embeddings fused using a multimodal approach to provide rich semantic encodings for each code. A probabilistic label tree is constructed based on the hierarchical structure existing in ICD labels to incorporate ontological relationships between ICD codes and enable structured output prediction. Experiments on medical code prediction on the MIMIC-III database demonstrate competitive performance, highlighting the benefits of this technique for robust clinical code assignment.

        Emilia Lukose, Suparna De, Jon Johnson (2022) Privacy Pitfalls of Online Service Terms and Conditions: a Hybrid Approach for Classification and Summarization, In: Proceedings of the Natural Legal Language Processing Workshop 2022, pp. 65-75 Association for Computational Linguistics (ACL)

        Verbose and complicated legal terminology in online service terms and conditions (T&C) means that users typically don't read these documents before accepting the terms of such unilateral service contracts. With such services becoming part of mainstream digital life, highlighting Terms of Service (ToS) clauses that affect the collection and use of user data and privacy is an important concern. Advances in text summarization can help to create informative and concise summaries of the terms, but existing approaches geared towards news and microblogging corpora are not directly applicable to the ToS domain, which is hindered by a lack of T&C-relevant resources for training and evaluation. This paper presents a ToS model, developing a hybrid extractive-classifier-abstractive pipeline that highlights the privacy and data collection/use-related sections in a ToS document and paraphrases these into concise and informative sentences. Relying on significantly less training data (4313 training pairs) than previous representative works (287,226 pairs), our model outperforms extractive baselines by at least 50% in ROUGE-1 score and 54% in METEOR score. The paper also contributes to existing community efforts by curating a dataset of online service T&C, through a developed web scraping tool.
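
        For readers unfamiliar with the ROUGE-1 metric mentioned above, the following is a minimal sketch of how unigram-overlap precision, recall and F1 can be computed between a generated summary and a reference. It is an illustration only, not the evaluation code used in the paper: the function name `rouge_1`, the lower-casing and the whitespace tokenisation are assumptions made here, and the example sentences are invented.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap scores between a candidate summary and a reference.

    Tokenisation is a plain lower-cased whitespace split (an assumption
    for this sketch; published evaluations usually rely on a standard package).
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example sentences, for illustration only.
scores = rouge_1(
    "we collect your location data",
    "the service may collect your location data",
)
print(scores)
```

        In this sketch, four of the five candidate words also occur in the seven-word reference, so precision is 4/5 and recall is 4/7.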

        Chandresh Pravin, Suparna De, Zeqiang Wang, Deirdre Lungley, Paul Bradshaw, Jon Johnson Metacurate-ML: Metadata Extraction from CAIs

        This work extends our results on pre-trained language models with recent developments in text-layout models and zero-shot techniques. Since relying solely on textual information makes it difficult to accurately classify and extract metadata, a combination of textual content and visual logic that incorporates vision transformers with optimisation techniques will be explored. This will allow us to extract specific items within questionnaires, such as question texts, responses and routing, to create a rich source of metadata that links the provenance of the data collection methodology to the resultant data and can be transformed into DDI-Lifecycle. We will investigate the feasibility of document understanding multimodal models that employ masked language techniques and present the resulting challenges.

        Suparna De, Ionut Bostan, Nishanth Sastry (2024) Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis, In: arXiv.org Cornell University Library, arXiv.org

        Recent studies have outlined the accessibility challenges faced by blind or visually impaired, and less-literate people, in interacting with social networks, in spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text to speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference time performance when benchmarked against the state-of-the-art TTS models, making it suitable for real-time accessibility applications.

        Wing Yan Li, Suparna De, Zeqiang Wang, Deirdre Lungley, Paul Bradshaw, Jon Johnson Metacurate-ML: Conceptual Comparison

        Questions from the CLOSER DDI-Lifecycle repository will be used to assist in training a model that is capable of using questions and response domains from the metadata extraction workstream to create conceptually equivalent items from which data variables can be concorded. Approaches such as fine-tuned large language model (LLM)-based relevance scores model and vector retrieval-LLM reordering will be presented. The session will present initial results in question concept tagging that feed into the conceptual comparison task, addressing challenges of long-tail distribution of the data, model memorisation and human annotation bias in the dataset. Higher-level machine learning (ML) limitations of identifying indeterminate tags and the notion of probability in model outputs will be explored.

        Zeqiang Wang, Jiageng Wu, Yuqi Wang, Wei Wang, Jie Yang, Jon Johnson, Nishanth Ramakrishna Sastry, Suparna De (2024) Revealing COVID-19's Social Dynamics: Diachronic Semantic Analysis of Vaccine and Symptom Discourse on Twitter, In: Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 3384-3394 Association for Computational Linguistics

        Social media is recognized as an important source for deriving insights into public opinion dynamics and social impacts due to the vast textual data generated daily and the 'unconstrained' behavior of people interacting on these platforms. However, such analyses prove challenging due to the semantic shift phenomenon, where word meanings evolve over time. This paper proposes an unsupervised dynamic word embedding method to capture longitudinal semantic shifts in social media data without predefined anchor words. The method leverages word co-occurrence statistics and dynamic updating to adapt embeddings over time, addressing the challenges of data sparseness, imbalanced distributions, and synergistic semantic effects. Evaluated on a large COVID-19 Twitter dataset, the method reveals semantic evolution patterns of vaccine- and symptom-related entities across different pandemic stages, and their potential correlations with real-world statistics. Our key contributions include the dynamic embedding technique, empirical analysis of COVID-19 semantic shifts, and discussions on enhancing semantic shift modeling for computational social science research. This study enables capturing longitudinal semantic dynamics on social media to understand public discourse and collective phenomena.

        Suparna De, Wei Wang, Maria Bermudez-Edo (2024) Data Models and Contextual Information, In: Springer Handbook of Internet of Things Springer

        The Internet of Things (IoT) and its applications emphasize the need for being context-aware to be able to sense the changing environmental conditions and to make use of the rich contextual information for analysis. The huge volume and high-velocity characteristics of IoT data necessitate that representation of IoT data takes into consideration the contextual information at scale during every step of the data processing life cycle, from production to storage, publication, and search. This chapter categorizes and describes the diverse forms of IoT data that are obtained from heterogeneous sensing sources. It also presents a framework for describing and analyzing the different types of contextual information that need to be associated with the IoT data in order to drive context-aware management and intelligent analytics. In addition, mechanisms for storing big IoT data and its contextual information are described, and common search and discovery methods for making IoT data accessible to applications and analysis components are presented.

        Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2024) Zero-Shot Medical Information Retrieval via Knowledge Graph Embedding, In: Internet of Things of Big Data for Healthcare, pp. 29-40 Springer Nature Switzerland

        In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to zero-shot medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank’s superior performance over existing methods, with promising results with a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.

        Suparna De, Shalini Jangra, Vibhor Agarwal, Jon Johnson, Nishanth Sastry (2023) Biases and Ethical Considerations for Machine Learning Pipelines in the Computational Social Sciences, In: Ethics in Artificial Intelligence: Bias, Fairness and Beyond, pp. 99-113 Springer Nature Singapore

        Computational analyses driven by Artificial Intelligence (AI)/Machine Learning (ML) methods to generate patterns and inferences from big datasets in computational social science (CSS) studies can suffer from biases during the data construction, collection and analysis phases as well as encounter challenges of generalizability and ethics. Given the interdisciplinary nature of CSS, many factors such as the need for a comprehensive understanding of different facets such as the policy and rights landscape, the fast-evolving AI/ML paradigms and dataset-specific pitfalls influence the possibility of biases being introduced. This chapter identifies challenges faced by researchers in the CSS field and presents a taxonomy of biases that may arise in AI/ML approaches. The taxonomy mirrors the various stages of common AI/ML pipelines: dataset construction and collection, data analysis and evaluation. By detecting and mitigating bias in AI, an active area of research, this chapter seeks to highlight practices for incorporating responsible research and innovation into CSS practices.

        Flavia C. C. Moura, Luiz C. A. Oliveira, Patricia S. de Oliveira Patricio, Suparna De (2020) Preface, In: Catalysis Today 344, pp. 1-1 Elsevier

        Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2024) DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness, In: arXiv.org Cornell University Library, arXiv.org

        Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.

        Suparna De, Usamah Jassat, Wei Wang, Charith Perera, Klaus Moessner (2021) Inferring Latent Patterns in Air Quality from Urban Big Data, In: IEEE Internet of Things Magazine 4(1), pp. 20-27 IEEE

        The emerging paradigm of urban computing aims to infer latent patterns from various aspects of a city's environment and possibly identify their hidden correlations by analyzing urban big data. This article provides a fine-grained analysis of air quality from diverse sensor data streams retrieved from regions in the city of London. The analysis derives spatio-temporal patterns, that is, across different location categories and time spans, and also reveals the interplay between urban phenomena such as human commuting behavior and the built environment, with the observed air quality patterns. The findings have important implications for the health of ordinary citizens and for city authorities who may formulate policies for a better environment.

        Yuqi Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2024) Zero-shot text classification with knowledge resources under label-fully-unseen setting, In: Neurocomputing (Amsterdam) 610, 128580 Elsevier B.V.

        Classification techniques are at the heart of many real-world applications, e.g. sentiment analysis, recommender systems and automatic text annotation, to process and analyse large-scale textual data in multiple fields. However, the effectiveness of natural language processing models can only be confirmed when a large amount of up-to-date training data is available. An unprecedented amount of data is continuously created, and new topics are introduced, making it less likely or even infeasible to collect labelled samples covering all topics for training models. We attempt to study the extreme case: there is no labelled data for model training, and the model, without being adapted to any specific dataset, will be directly applied to the testing samples. We propose a transformer-based framework to encode sentences in a contextualised way and leverage the existing knowledge resources, i.e. ConceptNet and WordNet, to integrate both descriptive and structural knowledge for better performance. To enhance the robustness of the model, we design an adversarial example generator based on relations from external knowledge bases. The framework is evaluated on both general and specific domain text classification datasets. Results show that the proposed framework can outperform the existing competitive state-of-the-art baselines, delivering new benchmark results.

        Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2023) Zero-Shot Medical Information Retrieval via Knowledge Graph Embedding, In: CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management Association for Computing Machinery (ACM)

        In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank's superior performance over existing methods, with promising results with a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.

        Nenad Gligoric, Srdjan Krco, Liisa Hakola, Kaisa Vehmas, Suparna De, Klaus Moessner, Kristoffer Jansson, Ingmar Polenz, Rob van Kranenburg (2019) SmartTags: IoT Product Passport for Circular Economy Based on Printed Sensors and Unique Item-Level Identifiers, In: Sensors (Basel, Switzerland) 19(3) MDPI

        In this paper, we present a method that facilitates Internet of Things (IoT) for building a product passport and data exchange enabling the next stage of the circular economy. SmartTags based on printed sensors (i.e., using functional ink) and a modified GS1 barcode standard enable unique identification of objects on a per item-level (including Fast-Moving Consumer Goods, FMCG), collecting, sensing, and reading of parameters from the environment as well as tracking a product's lifecycle. The developed ontology is the first effort to define a semantic model for dynamic sensors, including datamatrix and QR codes. The evaluation of decoding and readability of identifiers (QR codes) showed good performance for detection of sensor state printed over and outside the QR code data matrix, i.e., the recognition ability with image vision algorithm was possible. The evaluation of the decoding performance of the QR code data matrix printed with sensors was also efficient, i.e., the QR code ability to be decoded with the reader after reversible and irreversible process of ink (dis)appearing was preserved, with a slight drop in performance if ink density is low.

        Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen (2020) Adversarial Domain Adaptation for Crisis Data Classification on Social Media, In: 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pp. 282-287 IEEE

        Smart cities are cyber-physical-social systems, where city data from different sources could be collected, processed and analyzed to extract useful knowledge. As the volume of data from the social world is exploding, social media data analysis has become an emerging area in many different disciplines. During crisis events, users may post informative tweets about affected individuals, utility damage or cautions on social media platforms. If such tweets are efficiently and effectively processed and analyzed, city organizations may gain a better situational awareness of the affected citizens and provide better response actions. Advances in deep neural networks have significantly improved the performance in many social media analyzing tasks, e.g., sentiment analysis, fake news detection, crisis data classification, etc. However, deep learning models require a large amount of labeled data for model training, which is impractical to collect, especially at the early stage of a crisis event. To address this limitation, we proposed a BERT-based Adversarial Domain Adaptation model (BERT-ADA) for crisis tweet classification. Our experiments with three real-world crisis datasets demonstrate the advantages of the proposed model over several baselines.

        J Derrick, S Doherty, B Dongol, G Schellhorn, H Wehrheim, Suparna De (2021) Brief announcement: On strong observational refinement and forward simulation Schloss Dagstuhl

        Hyperproperties are correctness conditions for labelled transition systems that are more expressive than traditional trace properties, with particular relevance to security. Recently, Attiya and Enea studied a notion of strong observational refinement that preserves all hyperproperties. They analyse the correspondence between forward simulation and strong observational refinement in a setting with finite traces only. We study this correspondence in a setting with both finite and infinite traces. In particular, we show that forward simulation does not preserve hyperliveness properties in this setting. We extend the forward simulation proof obligation with a progress condition, and prove that this progressive forward simulation does imply strong observational refinement.

        Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2024) DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness, In: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024) Association for Computational Linguistics

        Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.

        Vida Sharifian-Attar, Suparna De, Sanaz Jabbari, Jenny Li, Harry Moss, Jon Johnson (2023) Analysing Longitudinal Social Science Questionnaires: Topic modelling with BERT-based Embeddings, In: 2022 IEEE International Conference on Big Data (Big Data), pp. 5558-5567 IEEE

        Unsupervised topic modelling is a useful unbiased mechanism for topic labelling of complex longitudinal questionnaires covering multiple domains such as social science and medicine. Manual tagging of such complex datasets increases the propensity of incorrect or inconsistent labels and is a barrier to scaling the processing of longitudinal questionnaires for provision of question banks for data collection agencies. Towards this effort, we propose a tailored BERTopic framework that takes advantage of its novel sentence embedding for creating interpretable topics, and extend it with an enhanced visualisation for comparing the topic model labels with the tags manually assigned to the question literals. The resulting topic clusters uncover instances of mislabelled question tags, while also enabling showcasing the semantic shifts and evolution of the topics across the time span of the longitudinal questionnaires. The tailored BERTopic framework outperforms existing topic modelling baselines for the quantitative evaluation metrics of topic coherence and diversity, while also being 18 times faster than the next best-performing baseline.
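
        As a rough illustration of the topic diversity metric named above, the sketch below uses one common definition: the fraction of unique words among the top-k words of all topics. Whether the paper uses exactly this definition is an assumption; the function name `topic_diversity`, the `top_k` parameter and the toy topics are hypothetical.

```python
from typing import List, Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    A value of 1.0 means no word is shared between topics; values near 0
    mean the topics largely repeat the same words.
    """
    top_words: List[str] = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics (invented), each listed by descending word importance.
toy_topics = [
    ["job", "work", "pay"],
    ["house", "rent", "pay"],
]
print(topic_diversity(toy_topics, top_k=3))
```

        Here the two toy topics share the word "pay", so five of the six listed words are unique.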

        Gilbert Cassar, Payam Barnaghi, Wei Wang, Suparna De, Klaus Moessner (2013) Composition of Services in Pervasive Environments: A Divide and Conquer Approach, In: 2013 IEEE Symposium on Computers and Communications (ISCC), pp. 000226-000232 IEEE

        In pervasive environments, availability and reliability of a service cannot always be guaranteed. In such environments, automatic and dynamic mechanisms are required to compose services or compensate for a service that becomes unavailable during the runtime. Most existing work on service composition does not provide sufficient support for automatic service provisioning in pervasive environments. We propose a Divide and Conquer algorithm that can be used at the service runtime to repeatedly divide a service composition request into several simpler sub-requests. The algorithm repeats until for each sub-request we find at least one atomic service that meets the requirements of that sub-request. The identified atomic services can then be used to create a composite service. We discuss the technical details of our approach and show evaluation results based on a set of composite service requests. The results show that our proposed method performs effectively in decomposing a composite service request into a number of sub-requests and in finding and matching service components that can fulfill the service composition request.
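
        The divide-and-conquer idea described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: a request is modelled as a set of required capabilities, the registry of atomic services and the names `find_atomic` and `compose` are hypothetical, and the request is split into two arbitrary halves, whereas the paper may divide requests quite differently.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical registry: atomic service name -> capabilities it provides.
Registry = Dict[str, FrozenSet[str]]


def find_atomic(request: FrozenSet[str], registry: Registry) -> Optional[str]:
    """Return one atomic service that covers every capability in the request."""
    for name in sorted(registry):  # sorted for deterministic results
        if request <= registry[name]:
            return name
    return None


def compose(request: FrozenSet[str], registry: Registry) -> Optional[List[str]]:
    """Divide the request until each sub-request is met by an atomic service.

    Returns the atomic services to combine, or None when some capability
    is not provided by any registered service.
    """
    match = find_atomic(request, registry)
    if match is not None:
        return [match]
    if len(request) <= 1:
        return None  # a single capability that nothing provides
    # Divide: split the capabilities into two halves (an arbitrary choice here).
    items = sorted(request)
    mid = len(items) // 2
    left = compose(frozenset(items[:mid]), registry)
    right = compose(frozenset(items[mid:]), registry)
    if left is None or right is None:
        return None
    return left + right


# Invented example registry and request.
registry: Registry = {
    "locate": frozenset({"gps"}),
    "notify": frozenset({"sms", "email"}),
    "pay": frozenset({"billing"}),
}
print(compose(frozenset({"gps", "sms", "billing"}), registry))
```

        In the example no single service covers the whole request, so it is split until each part ("billing", "gps", "sms") is matched by one atomic service.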

        Suparna De, Klaus Moessner (2023) Sensor Networks: Physical and Social Sensing in the IoT, In: Sensors 23(3), 1451 MDPI

        Advances made in the Internet of Things (IoT) and other disruptive technological trends, including big data analytics and edge computing methods, have contributed enabling solutions to the numerous challenges affecting modern communities. With Gartner reporting 14.2 billion IoT devices in 2019 [1] and, according to some reports [2], a projected 30.9 billion devices to be deployed by 2025 in areas like environment monitoring [3], smart agriculture [4], smart healthcare [5] or smart cities [6], one could be tempted to think that most related issues are already resolved. However, there remain practical challenges in large-scale and rapid deployment of sensors for diverse applications, such as problems affecting siting optimization methods and participant recruitment and incentive mechanisms. On a higher level, the deluge of data sources that drive the IoT phenomenon grows every day. With the rise of smartphone-enabled citizen sensing data via social networks or personal health devices, as well as with increasing connectedness in transport, logistics, utilities, or manufacturing domains, this range and complexity of available data calls for even more advanced data processing, mining and fusion methods than those already applied. ...

        Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen (2021)Multi-modal generative adversarial networks for traffic event detection in smart cities, In: Expert systems with applications177 Elsevier Ltd

        • A multi-modal Generative Adversarial Network for traffic event detection.
        • Semi-supervised learning based on generative adversarial network.
        • Detecting traffic events with both sensor and social media data.
        • Evaluation based on a large, real-world multi-modal dataset.

        Advances in the Internet of Things have enabled the development of many smart city applications and expert systems that help citizens and authorities better understand the dynamics of the cities, and make better planning and utilisation of city resources. Smart cities are composed of complex systems that usually process and analyse big data from the Cyber, Physical, and Social worlds. Traffic event detection is an important and complex task in smart transportation modelling and management. We address this problem using semi-supervised deep learning with data of different modalities, e.g., physical sensor observations and social media data. Unlike most existing studies focusing on data of single modality, the proposed method makes use of data of multiple modalities that appear to complement and reinforce each other. Meanwhile, as the amount of labelled data in big data applications is usually extremely limited, we extend the multi-modal Generative Adversarial Network model to a semi-supervised architecture to characterise traffic events. We evaluate the model with a large, real-world dataset consisting of traffic sensor observations and social media data collected from the San Francisco Bay Area over a period of four months. The evaluation results clearly demonstrate the advantages of the proposed model in extracting and classifying traffic events.

        Charith Perera, Mahmoud Barhamgi, Suparna De, Tim Baarslag, Massimo Vecchio, Kim-Kwang Raymond Choo (2018)Designing the Sensing as a Service Ecosystem for the Internet of Things, In: IEEE internet of things journal1pp. 18-23 IEEE

        S De, F Carrez, E Reetz, R Toenjes, W Wang (2013)Test-Enabled Architecture for IoT Service Creation and Provisioning Springer Berlin Heidelberg

        The information generated from the Internet of Things (IoT) potentially enables a better understanding of the physical world for humans and supports creation of ambient intelligence for a wide range of applications in different domains. A semantics-enabled service layer is a promising approach to facilitate seamless access and management of the information from the large, distributed and heterogeneous sources. This paper presents the efforts of the IoT.est project towards developing a framework for service creation and testing in an IoT environment. The architecture design extends the existing IoT reference architecture and enables a test-driven, semantics-based management of the entire service lifecycle. The validation of the architecture is shown through a dynamic test case generation and execution scenario.

        Qi Chen, Wei Wang, Kaizhu Huang, Suparna De, Frans Coenen (2020)Multi-modal Adversarial Training for Crisis-related Data Classification on Social Media, In: 2020 IEEE International Conference on Smart Computing (SMARTCOMP)pp. 232-237 IEEE

        Social media platforms such as Twitter are increasingly used to collect data of all kinds. During natural disasters, users may post text and image data on social media platforms to report information about infrastructure damage, injured people, cautions and warnings. Effectively processing and analysing tweets in real time can help city organisations gain situational awareness of the affected citizens and take timely action. With the advances in deep learning techniques, recent studies have significantly improved the performance in classifying crisis-related tweets. However, deep learning models are vulnerable to adversarial examples, which may be imperceptible to humans but can lead to model misclassification. To process multi-modal data as well as improve the robustness of deep learning models, we propose a multi-modal adversarial training method for crisis-related tweet classification in this paper. The evaluation results clearly demonstrate the advantages of the proposed model in improving the robustness of tweet classification.

        Zhipeng Cai, Suparna De, Michal Kedziora, Chaokun Wang (2022)Guest Editorial Special Issue on Graph-Powered Machine Learning for Internet of Things, In: IEEE internet of things journal9(12)pp. 9102-9105 IEEE

        Internet of Things (IoT) refers to an ecosystem where applications and services are driven by data collected from devices interacting with each other and the physical world. Although IoT has already brought spectacular benefits to human society, the progress is actually not as fast as expected. From network structures to control flow graphs, IoT naturally generates an unprecedented volume of graph data continuously, which stimulates the cross-fertilization and use of advanced graph-powered methods on the diverse, dynamic, and large-scale graph IoT data.

        Yuqi Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2023)Prompt-based Zero-shot Text Classification with Conceptual Knowledge, In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)4pp. 30-38 Association for Computational Linguistics

        In recent years, pre-trained language models have garnered significant attention due to their effectiveness, which stems from the rich knowledge acquired during pre-training. To mitigate the inconsistency issues between pre-training tasks and downstream tasks and to facilitate the resolution of language-related issues, prompt-based approaches have been introduced, which are particularly useful in low-resource scenarios. However, existing approaches mostly rely on verbalizers to translate the predicted vocabulary to task-specific labels. The major limitations of this approach are the ignorance of potentially relevant domain-specific words and being biased by the pre-training data. To address these limitations, we propose a framework that incorporates conceptual knowledge for text classification in the extreme zero-shot setting. The framework includes prompt-based keyword extraction, weight assignment to each prompt keyword, and final representation estimation in the knowledge graph embedding space. We evaluated the method on four widely-used datasets for sentiment analysis and topic detection, demonstrating that it consistently outperforms recently-developed prompt-based approaches in the same experimental settings.

        Suparna De, Wei Wang, Yuchao Zhou, Charith Perera, Klaus Moessner, Mansour Naser Alraja (2021)Analysing environmental impact of large-scale events in public spaces with cross-domain multimodal data fusion, In: Computing103(9)pp. 1959-1981 Springer Nature

        In this study, we demonstrate how we can quantify environmental implications of large-scale events and traffic (e.g., human movement) in public spaces, and identify specific regions of a city that are impacted. We develop an innovative data fusion framework that synthesises the state-of-the-art techniques in extracting pollution episodes and detecting events from citizen-contributed, city-specific messages on social media platforms (Twitter). We further design a fusion pipeline for this cross-domain, multimodal data, which assesses the spatio-temporal impact of the extracted events on pollution levels within a city. Results of the analytics have great potential to benefit citizens and in particular, city authorities, who strive to optimise resources for better urban planning and traffic management.

        Lamya Alkhariji, Suparna De, Omer Rana, Charith Perera (2023)Semantics-based privacy by design for Internet of Things applications, In: Future generation computer systems [e-journal]138pp. 280-295 Elsevier

        As Internet of Things (IoT) technologies become more widespread in everyday life, privacy issues are becoming more prominent. The aim of this research is to develop a personal assistant that can answer software engineers’ questions about Privacy by Design (PbD) practices during the design phase of IoT system development. Semantic web technologies are used to model the knowledge underlying PbD measurements, their intersections with privacy patterns, IoT system requirements and the privacy patterns that should be applied across IoT systems. This is achieved through the development of the PARROT ontology, developed through a set of representative IoT use cases relevant for software developers. This was supported by gathering Competency Questions (CQs) through a series of workshops, resulting in 81 curated CQs. These CQs were then recorded as SPARQL queries, and the developed ontology was evaluated using the Common Pitfalls model with the help of the Protégé HermiT Reasoner and the Ontology Pitfall Scanner (OOPS!), as well as evaluation by external experts. The ontology was assessed within a user study that identified that the PARROT ontology can answer up to 58% of privacy-related questions from software engineers.

        Suparna De, Usamah Jassat, Alex Grace, Wei Wang, Klaus Moessner (2022)Mining Composite Spatio-Temporal Lifestyle Patterns from Geotagged Social Data

        As social networks become increasingly integrated with their users' daily lives, and users are willing to publicly share data about their offline activities on these networks, the resultant data offers a powerful tool to non-intrusively understand city dynamics as it captures human behaviour and interactions. In this paper, we derive lifestyle patterns from the Foursquare social network data, using matrix factorization and tensor decomposition as unsupervised methods to extract latent spatio-temporal behavior patterns. The extracted patterns offer a precise definition of activity levels associated with specific lifestyles and showcase that users' behaviors are a combination of several lifestyles, in contrast to traditional circadian topology theory which classifies individuals to a specific temporal pattern. The obtained patterns can provide deeper insights into city dynamics, the people within them and how society functions.
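
        As a rough illustration of how matrix factorization can express each user as a mixture of latent patterns, the sketch below applies non-negative matrix factorization with multiplicative updates to a toy user-by-activity count matrix. The abstract does not say which factorization variant the paper uses, so the choice of NMF, the toy data, and all function names here are assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factorize V (users x activity slots) into W (users x patterns) and
    H (patterns x activity slots) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return w, h


# Toy check-in counts: rows are users, columns are hypothetical
# (venue category, time slot) pairs.
checkins = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
user_loadings, patterns = nmf(checkins, rank=2)
print(frobenius_error(checkins, user_loadings, patterns))
```

        Each row of `user_loadings` gives one user's non-negative weights over the extracted patterns, which is one way to represent behaviour as a combination of several patterns rather than a single assigned class.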

        Yuqi Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De (2022)Generalised Zero-shot Learning for Entailment-based Text Classification with External Knowledge, In: 2022 IEEE International Conference on Smart Computing (SMARTCOMP 2022)pp. 19-25 Institute of Electrical and Electronics Engineers (IEEE)

        Text classification techniques have been substantially important to many smart computing applications, e.g. topic extraction and event detection. However, classification is always challenging when only an insufficient amount of labelled data is available for model training. To mitigate this issue, zero-shot learning (ZSL) has been introduced for models to recognise new classes that have not been observed during the training stage. We propose an entailment-based zero-shot text classification model, named S-BERT-CAM, to better capture the relationship between the premise and hypothesis in the BERT embedding space. Two widely used textual datasets are utilised to conduct the experiments. We fine-tune our model using 50% of the labels for each dataset and evaluate it on the label space containing all labels (including both seen and unseen labels). The experimental results demonstrate that our model is more robust in the generalised ZSL setting and significantly improves the overall performance against baselines.

        Suparna De, Harry Moss, Jon Johnson, Jenny Li, Haeron Pereira, Sanaz Jabbari (2022)Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires, In: IASSIST quarterly46(1) University of Alberta

        Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow and, through support for versioning and provenance objects such as BasedOn, encouraged the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge. While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a 'training and test dataset', a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository. The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics' values, which enable model tuning as well as transparent management of data and experiments.

        Yuqi Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De, A Hussain (2023)Fusing external knowledge resources for natural language understanding techniques: A survey, In: Information Fusion92pp. 190-204 Elsevier

        Knowledge resources, e.g. knowledge graphs, which formally represent essential semantics and information for logic inference and reasoning, can compensate for the unawareness nature of many natural language processing techniques based on deep neural networks. This paper provides a focused review of the emerging but intriguing topic that fuses quality external knowledge resources in improving the performance of natural language processing tasks. Existing methods and techniques are summarised in three main categories: (1) static word embeddings, (2) sentence-level deep learning models, and (3) contextualised language representation models, depending on when, how and where external knowledge is fused into the underlying learning models. We focus on the solutions to mitigate two issues: knowledge inclusion and inconsistency between language and knowledge. Details on the design of each representative method, as well as their strengths and limitations, are discussed. We also point out some potential future directions in view of the latest trends in natural language processing research.

        Suparna De, Maria Bermudez-Edo, Honghui Xu, Zhipeng Cai (2022)Deep Generative Models in the Industrial Internet of Things: A Survey, In: IEEE Transactions on Industrial Informatics Institute of Electrical and Electronics Engineers (IEEE)

        Advances in communication technologies and artificial intelligence (AI) are accelerating the paradigm of Industrial Internet of Things (IIoT). With IIoT enabling continuous integration of sensors and controllers with the network, intelligent analysis of the generated Big Data is a critical requirement. Although IIoT is considered a subset of IoT, it has its own peculiarities in terms of higher levels of safety, security and low-latency communication in an environment of critical real-time operations. Under these circumstances, discriminative deep learning (DL) algorithms are unsuitable due to their need for large amounts of labelled and balanced training data and the uncertainty of inputs, etc. To overcome these issues, researchers have started using Deep Generative Models (DGMs), which combine the flexibility of DL with the inference power of probabilistic modeling. In this work, we review the state of the art of DGMs and their applicability to IIoT, classifying the reviewed works into the IIoT application areas of anomaly detection, trust boundary protection, network traffic prediction and platform monitoring. Following an analysis of existing IIoT DGM implementations, we identify challenges (i.e. weak discriminative capability, insufficient interpretability, lack of generalization ability, generated data vulnerability, privacy concern and data complexity) that need to be investigated in order to accelerate the adoption of DGMs in IIoT and also propose some potential research directions.