Dr Xiaowei Gu
Academic and research departments
Computer Science Research Centre, School of Computer Science and Electronic Engineering
About
Biography
Dr. Xiaowei Gu received his PhD in Computer Science from Lancaster University (UK) in 2018. Before joining Surrey, Xiaowei was a Lecturer in Computing at the University of Kent (UK), a Lecturer in Computer Science at Aberystwyth University (UK) and a Senior Research Associate at Lancaster University.
Research
Research interests
Xiaowei’s research focuses on developing novel machine learning models that 1) have a transparent system structure and a human-interpretable reasoning process, and 2) offer state-of-the-art performance while requiring less involvement of human expertise. Xiaowei is also interested in developing explainable semi-supervised machine learning models to tackle streaming data problems.
Research projects
Xiaowei Gu, New Investigator Award funded by the Engineering and Physical Sciences Research Council (EPSRC), funding amount £267,600, 2024-2026
An Explainable Generic Design of Self-Evolving Intelligent Security Systems for Cyber attack Detection, Xiaowei Gu and Gareth Howells, research funded by Frazer-Nash Consultancy Ltd. on behalf of the Defence Science and Technology Laboratory (Dstl), funding amount £122,556, 2023-2024
Supervision
Postgraduate research supervision
PhD applicants/visitors are very welcome! If you are interested, please contact Xiaowei via email (xiaowei.gu@surrey.ac.uk) with your CV and a short paragraph describing your research idea.
Publications
Nowadays, cyber-attacks have become a common and persistent issue affecting various human activities in modern societies. Due to the continuously evolving landscape of cyber-attacks and the growing concerns around "black box" models, there has been a strong demand for novel explainable and interpretable intrusion detection systems with online learning abilities. In this paper, a novel soft prototype-based autonomous fuzzy inference system (SPAFIS) is proposed for network intrusion detection. SPAFIS learns from network traffic data streams online on a chunk-by-chunk basis and autonomously identifies a set of meaningful, human-interpretable soft prototypes to build an IF-THEN fuzzy rule base for classification. Thanks to the utilization of soft prototypes, SPAFIS can precisely capture the underlying data structure and local patterns, and perform internal reasoning and decision-making in a human-interpretable manner based on the ensemble properties and mutual distances of data. To maintain a healthy and compact knowledge base, a pruning scheme is further introduced to SPAFIS, allowing it to periodically examine the learned solution and remove redundant soft prototypes from its knowledge base. Numerical examples on public network intrusion detection datasets demonstrate the efficacy of the proposed SPAFIS in both offline and online application scenarios, outperforming state-of-the-art alternatives.

Thanks to the rapid development of electronic manufacturing and information technology, the Internet has become an essential part of everyday life for billions of individuals in modern societies. The Internet has greatly transformed the way people communicate, network and access information. However, the ongoing digitalization of the world has also led to a significant rise in cyber-attacks. According to the Cyber Security Breaches Survey published by the UK government in April 2023 [1], 59% of medium businesses, 69% of large businesses and 56% of high-income charities encountered cybersecurity breaches and/or cyber-attacks in the preceding 12 months. Nowadays, escalating cyber-attacks pose a major and persistent threat to individuals, businesses and organizations on the Internet, and the need for effective techniques to protect information security is highly pronounced. Intrusion detection systems (IDSs) are among the most effective security techniques for preventing cyber-attacks [2]. The function of an IDS is to monitor the network and identify malicious activities. Traditional IDSs are primarily signature-based: they use pattern-matching methods to compare current activities against signatures of previous intrusions stored in a database [3]. Signature-based IDSs are highly effective at detecting known attacks, but they cannot detect novel attacks for which no matching signature exists in the database. As the technological evolution of cybercrime has made cyber-attacks more sophisticated and difficult to detect, traditional signature-based IDSs have become insufficient in real-world scenarios [4]. Machine learning techniques are capable of learning normal and malicious patterns from empirically observed network activities to construct accurate predictive models with less human involvement [4]. Conventional machine learning methods, such as decision trees (DT) [5], random forests (RF) [6], support vector machines (SVM) [7] and k-nearest neighbours (KNN) [8], have been extensively used for identifying cyber-attacks.

IDSs based on conventional machine learning have achieved many successes, but they generally struggle with large-scale, complex intrusion detection problems [9]. Due to the evolving landscape of cyber-attacks, characterized by increasing sophistication and complexity, there has been a rapidly growing demand for IDSs that leverage more advanced machine learning techniques.
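To make the flavour of this approach concrete, here is a minimal sketch of prototype-based IF-THEN classification in the spirit of the abstract above. It is not the published SPAFIS algorithm: the soft prototypes are stood in for by per-class k-means centres, and each rule's firing strength is a simple Gaussian of the distance to a prototype.

```python
import numpy as np

class PrototypeFuzzyClassifier:
    """Toy rule base of the form: IF x is close to a prototype of class c
    THEN class is c. A stand-in for SPAFIS, not the published algorithm."""

    def __init__(self, prototypes_per_class=3):
        self.k = prototypes_per_class
        self.prototypes = {}  # class label -> (k, d) array of prototypes

    def fit(self, X, y):
        rng = np.random.default_rng(0)
        for label in np.unique(y):
            Xc = X[y == label].astype(float)
            k = min(self.k, len(Xc))
            # Crude k-means stands in for SPAFIS's autonomous
            # identification of soft prototypes.
            centres = Xc[rng.choice(len(Xc), k, replace=False)]
            for _ in range(20):
                assign = np.linalg.norm(
                    Xc[:, None] - centres[None], axis=2).argmin(axis=1)
                for j in range(k):
                    if np.any(assign == j):
                        centres[j] = Xc[assign == j].mean(axis=0)
            self.prototypes[label] = centres
        return self

    def predict(self, X):
        labels = list(self.prototypes)
        # Per class, the firing strength is a Gaussian of the distance to
        # the closest prototype; the strongest rule decides the label.
        scores = np.stack([
            np.exp(-np.linalg.norm(
                X[:, None] - self.prototypes[c][None], axis=2) ** 2).max(axis=1)
            for c in labels], axis=1)
        return np.array(labels)[scores.argmax(axis=1)]
```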
High-dimensional data classification is widely considered a challenging task in machine learning due to the so-called "curse of dimensionality". In this paper, a novel multilayer jointly evolving and compressing fuzzy neural network (MECFNN) is proposed to learn highly compact multi-level latent representations from high-dimensional data. As a meta-level stacking ensemble system, each layer of MECFNN is based on a single jointly evolving and compressing neural fuzzy inference system (ECNFIS) that self-organises a set of human-interpretable fuzzy rules from input data in a sample-wise manner to perform approximate reasoning. ECNFISs associate a unique compressive projection matrix with each individual fuzzy rule to compress the consequent part into a tighter form, removing redundant information whilst boosting the diversity within the stacking ensemble. The compressive projection matrices of the cascading ECNFISs are self-updating to minimise the prediction errors via error backpropagation together with the consequent parameters, empowering MECFNN to learn more meaningful, discriminative representations from data at multiple levels of abstraction. An adaptive activation control scheme is further introduced in MECFNN to dynamically exclude less activated fuzzy rules, effectively reducing the computational complexity and fostering generalisation. Numerical examples on popular high-dimensional classification problems demonstrate the efficacy of MECFNN.
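The per-rule compressive projection idea can be illustrated with a toy, single-layer example. Everything below (Gaussian memberships with fixed random centres, a single plain gradient-descent step) is an assumption made for illustration and is not the ECNFIS design itself.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_c, n_rules = 20, 5, 4          # input dim, compressed dim, rule count

centres = rng.normal(size=(n_rules, d))     # fixed rule antecedent centres
P = rng.normal(size=(n_rules, d, d_c))      # one projection matrix per rule
w = rng.normal(size=(n_rules, d_c))         # consequent weights per rule

def forward(x):
    # Normalised Gaussian firing strength of each rule.
    phi = np.exp(-np.linalg.norm(centres - x, axis=1) ** 2)
    phi = phi / phi.sum()
    # Each rule sees a compressed view of the input (the key idea).
    z = np.einsum('rdk,d->rk', P, x)
    y = float(phi @ np.einsum('rk,rk->r', z, w))
    return y, phi, z

# One gradient step on squared error, jointly updating the consequent
# weights and the projection matrices (cf. the abstract's joint updates
# via error backpropagation).
x, y_true = rng.normal(size=d), 1.0
y, phi, z = forward(x)
err = y - y_true
grad_w = err * phi[:, None] * z                                        # dL/dw_r
grad_P = err * phi[:, None, None] * x[None, :, None] * w[:, None, :]   # dL/dP_r
w -= 0.1 * grad_w
P -= 0.1 * grad_P
```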
This paper proposes a dynamic evolving fuzzy system (DEFS) for streaming data prediction. DEFS utilises the enhanced data potential and prediction errors of individual local models as the main criteria for fuzzy rule generation. A vital feature of the proposed system is its novel rule merging scheme, which can self-adjust its tolerance towards the degree of similarity between two similar fuzzy rules according to the size of the rule base. To better handle shifts and drifts in the data patterns, a novel rule quality measure based on both the utility values and the prediction accuracy of individual fuzzy rules is further introduced to help DEFS identify less activated fuzzy rules with poorer descriptive capabilities, and thereby maintain a healthier fuzzy rule base by removing these stale rules. Importantly, the thresholds used by DEFS are self-adaptive to the input data. The adaptive thresholds help DEFS precisely capture the underlying structure and dynamically changing patterns of streaming data, enabling the system to perform accurate approximate reasoning. Numerical examples based on several popular benchmark problems show the superior performance of DEFS over state-of-the-art evolving fuzzy systems. The prediction performance of the proposed method is at least 2.88% better than the best-performing comparative EFSs on each individual regression benchmark problem considered in this study, and the average performance improvement across all the numerical experiments is approximately 30%.
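The self-adjusting rule-merging idea can be sketched as below; the linear tolerance schedule, the Euclidean similarity measure and the support-weighted merge are illustrative assumptions rather than the DEFS formulation.

```python
import numpy as np

def merge_similar_rules(centres, supports, base_tol=0.5, growth=0.05):
    """centres: (n_rules, d) rule centres; supports: samples per rule.
    The merging tolerance loosens as the rule base grows, so a crowded
    rule base is consolidated more aggressively."""
    centres, supports = list(centres), list(supports)
    merged = True
    while merged and len(centres) > 1:
        merged = False
        tol = base_tol * (1.0 + growth * len(centres))  # self-adjusting
        for i in range(len(centres)):
            for j in range(i + 1, len(centres)):
                if np.linalg.norm(centres[i] - centres[j]) < tol:
                    total = supports[i] + supports[j]
                    # Replace the pair by a support-weighted average.
                    centres[i] = (supports[i] * centres[i]
                                  + supports[j] * centres[j]) / total
                    supports[i] = total
                    del centres[j], supports[j]
                    merged = True
                    break
            if merged:
                break
    return np.array(centres), supports
```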
In this paper, a novel autonomous centreless algorithm is proposed for data partitioning. The proposed algorithm first constructs the nearest-neighbour affinity graph and identifies the local peaks of data density to build micro-clusters. Unlike the vast majority of partitional clustering algorithms, the proposed algorithm does not rely on singleton prototypes, namely centres or medoids of the micro-clusters, to partition the data space. Instead, these micro-clusters are directly utilised to attract nearby data samples to form shape-free Voronoi tessellations, making the algorithm centreless and robust to noisy data. A fusion scheme is further implemented to fuse data clouds with higher intra-cluster similarity together to attain a more compact partitioning of the data. The proposed algorithm is able to perform data partitioning on a chunk-wise basis and is highly computationally efficient with the default distance measure. Therefore, it is suitable for both static data partitioning in offline scenarios and streaming data partitioning in online scenarios. Numerical examples on a variety of benchmark datasets demonstrate the efficacy of the proposed algorithm.
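The density-peak and micro-cluster step can be sketched roughly as follows; the Gaussian kernel density estimate and the single-peak attraction used here are placeholders for the paper's formulation.

```python
import numpy as np

def micro_clusters(X, k=5):
    """Identify local density peaks via a k-nearest-neighbour graph and
    attract every sample to its nearest peak, forming shape-free cells."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    density = np.exp(-D ** 2).sum(axis=1)       # simple kernel density
    nn = np.argsort(D, axis=1)[:, 1:k + 1]      # k nearest neighbours
    # A sample is a local peak if no neighbour is denser.
    peaks = np.where([density[i] >= density[nn[i]].max()
                      for i in range(len(X))])[0]
    # Each sample joins the cell of its nearest peak ("data clouds"; the
    # paper attracts samples to whole micro-clusters, not single peaks).
    assign = peaks[np.argmin(D[:, peaks], axis=1)]
    return peaks, assign
```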
Anomaly detection from data streams is a hotly studied topic in the machine learning domain. It is widely considered a challenging task because the underlying patterns exhibited by streaming data may dynamically change at any time. In this paper, a new algorithm is proposed to detect anomalies autonomously in streaming data. The proposed algorithm is nonparametric and does not require any threshold to be preset by users. Its algorithmic procedure is composed of three complementary stages. First, potentially anomalous samples that represent patterns highly different from the others are identified from the data stream based on data density. Then, these potentially anomalous samples are clustered online using the evolving autonomous data partitioning algorithm. Finally, true anomalies are identified from the minor clusters with the smallest numbers of samples associated with them. Numerical examples based on three benchmark datasets demonstrate the potential of the proposed algorithm as a highly effective approach for anomaly detection from data streams.
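A minimal offline sketch of the three-stage pipeline is given below. The kernel density estimate, the k-means stand-in for the evolving autonomous data partitioning algorithm, and the mean-minus-one-standard-deviation candidate threshold are all illustrative assumptions; the published algorithm is nonparametric and threshold-free.

```python
import numpy as np

def detect_anomalies(X, n_clusters=3):
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    density = np.exp(-D ** 2).sum(axis=1)
    # Stage 1: flag samples whose density falls well below the average.
    candidates = np.where(density < density.mean() - density.std())[0]
    if len(candidates) <= n_clusters:
        return candidates
    # Stage 2: cluster the candidates (k-means as a stand-in).
    P = X[candidates].astype(float)
    rng = np.random.default_rng(0)
    C = P[rng.choice(len(P), n_clusters, replace=False)]
    for _ in range(20):
        a = np.linalg.norm(P[:, None] - C[None], axis=2).argmin(axis=1)
        for j in range(n_clusters):
            if np.any(a == j):
                C[j] = P[a == j].mean(axis=0)
    # Stage 3: the smallest non-empty cluster is reported as anomalous.
    sizes = np.bincount(a, minlength=n_clusters)
    sizes[sizes == 0] = len(P) + 1
    return candidates[a == sizes.argmin()]
```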
Tropical cyclones (TCs), through their intensive wind-pump effect, induce sea surface temperature cooling (SSTC) in the upper ocean. SSTC is a pronounced indicator of TC evolution and oceanic conditions. However, there are few effective methods for accurately approximating the amplitude of the spatial structure of TC-induced SSTC. This study proposes a novel explainable machine learning framework to model and interpret the amplitude of the spatial structure of SSTC over the northwest Pacific (NWP). In particular, 12 predictors related to TC characteristics and pre-storm ocean states are considered as inputs. A composite analysis technique is used to characterize the amplitude of the spatial structure of SSTC across the TC track. Extreme gradient boosting (XGBoost) is utilized to predict the amplitude of SSTC from the 12 predictors. To better interpret the ocean-atmosphere interaction, the SHapley Additive exPlanations (SHAP) method is further employed to identify the contributions of the predictors in determining the amplitude of the TC-induced SSTC, bringing attribute-oriented explainability to the proposed method. The results showed that the proposed method could accurately predict the amplitude of the spatial structure of SSTC for different TC intensity groups and outperformed a numerical model. The proposed method also serves as an effective tool for reconstructing composite maps of both the interannual and seasonal evolution of the SSTC spatial structure. The study offers insight into applying machine learning to model and interpret the responses of oceanic conditions triggered by extreme weather events (e.g., TCs).
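This pipeline translates naturally into the real xgboost and shap libraries. The synthetic data, hyperparameters and placeholder column names below are assumptions; the paper's 12 predictors and preprocessing are not reproduced here.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 12)),
                 columns=[f"predictor_{i}" for i in range(1, 13)])
y = rng.normal(size=500)  # stand-in for the SSTC amplitude target

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles, attributing
# each prediction to the individual predictors.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per predictor as a global importance ranking
# (shap.summary_plot(shap_values, X) gives the usual graphical view).
importance = np.abs(shap_values).mean(axis=0)
for name, imp in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")
```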
Fuzzy systems provide a formal and practically popular methodology for modelling nonlinear problems with inherent uncertainties, offering both strong performance and model interpretability. In particular, semi-supervised boosting is widely recognised as a powerful approach for creating stronger ensemble classification models in the absence of sufficient labelled data, without introducing any modification to the employed base classifiers. However, the potential of fuzzy systems in semi-supervised boosting has not yet been systematically explored. In this study, a novel semi-supervised boosting algorithm devised for zero-order evolving fuzzy systems is proposed. Throughout sample weight updating and ensemble output generation, it accounts for both the consistency among predictions made by individual base classifiers at successive boosting iterations and their respective levels of confidence in those predictions. In so doing, the base classifiers are empowered to gradually focus more on challenging samples that are otherwise hard to generalise, enabling the development of more precise integrated classification boundaries. Numerical evaluations on a range of benchmark problems are carried out, demonstrating the efficacy of the proposed semi-supervised boosting algorithm for constructing ensemble fuzzy classifiers with high accuracy.
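A generic sketch of confidence-weighted semi-supervised boosting follows. The paper's algorithm is built on zero-order evolving fuzzy systems; shallow scikit-learn decision trees and a crude error-doubling weight update are used here purely as placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def semi_supervised_boost(X_l, y_l, X_u, rounds=10):
    """X_l, y_l: labelled data; X_u: unlabelled pool."""
    X, y, w = X_l, y_l, np.ones(len(X_l))
    ensemble = []
    for _ in range(rounds):
        clf = DecisionTreeClassifier(max_depth=2)
        clf.fit(X, y, sample_weight=w)
        # Pseudo-label the unlabelled pool, keeping the classifier's
        # confidence so the next round can discount uncertain labels.
        proba = clf.predict_proba(X_u)
        pseudo, conf = proba.argmax(axis=1), proba.max(axis=1)
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, clf.classes_[pseudo]])
        # Up-weight labelled samples the current model still gets wrong
        # (the "focus on challenging samples" idea in the abstract).
        w_l = np.where(clf.predict(X_l) != y_l, 2.0, 1.0)
        w = np.concatenate([w_l, conf])
        ensemble.append(clf)
    return ensemble
```

Prediction with the resulting ensemble would then aggregate the members' outputs, for example by confidence-weighted majority voting.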