Dr Wenwu Wang

Research Interests

Research Grants

I appreciate the financial support for my research from the following bodies (since 2008): Engineering and Physical Sciences Research Council (EPSRC), Ministry of Defence (MOD), Defence Science and Technology Laboratory (DSTL), Home Office (HO), Royal Academy of Engineering (RAENG), European Commission (EC), Samsung Electronics Research Institute UK (SAMSUNG), National Natural Science Foundation of China (NSFC), the University Research Support Fund (URSF), and the Ohio State University (OSU). [Total award to Surrey where I am a PI/CI: approximately £5.4M (as PI £1.2M, as CI £4.2M). Total value of the grant portfolios on which I am a PI or CI: approximately £15M.]

  1. 01/2015-01/2019, "MacSeNet: machine sensing training network", EC (Horizon 2020, Marie Curie Actions - Innovative Training Network). [jointly with INRIA (France), University of Edinburgh (UK), Technical University of Munich (Germany), EPFL (Switzerland), Computer Technology Institute (Greece), Institute of Telecommunications (Portugal), Tampere University of Technology (Finland), Fraunhofer IDMT (Germany), Cedar Audio Ltd (Cambridge, UK), Audio Analytic (Cambridge, UK), VisioSafe SA (Switzerland), and Noiseless Imaging Oy (Finland)]
  2. 02/2015-09/2015, "Array processing exploiting sparsity for submarine hull mounted arrays", Atlas Elektronik & MOD (MarCE scheme)
  3. 03/2015-09/2015, "Speech enhancement based on lip tracking", EPSRC (impact acceleration account). [jointly with SAMSUNG (UK)]
  4. 10/2014-10/2018, "SpaRTaN: Sparse representation and compressed sensing training network", EC (FP7, Marie Curie Actions - Initial Training Network). [jointly with University of Edinburgh (UK), EPFL (Switzerland), Institute of Telecommunications (Portugal), INRIA (France), VisioSafe SA (Switzerland), Noiseless Imaging Oy (Finland), Tampere University of Technology (Finland), Cedar Audio Ltd (Cambridge, UK), and Fraunhofer IDMT (Germany)] (project website)
  5. 01/2014-01/2019, "S3A: future spatial audio for an immersive listener experience at home", EPSRC (programme grant). [jointly with University of Southampton, University of Salford, and BBC.] (project website)
  6. 04/2013-04/2018, "Signal processing solutions for a networked battlespace", EPSRC and DSTL (signal processing call). [jointly with Loughborough University, University of Strathclyde, and Cardiff University.] (project website)
  7. 10/2013-03/2014, "Enhancing speech quality using lip tracking", SAMSUNG (industrial grant).
  8. 12/2012-12/2013, "Audio-visual cues based attention switching for machine listening", MILES and EPSRC (feasibility study). [jointly with School of Psychology and Department of Computing.]
  9. 11/2012-07/2013, "Audio-visual blind source separation", NSFC (international collaboration scheme). [jointly with Nanchang University, China.]
  10. 12/2011-03/2012, "Enhancement of audio using video", HO (pathway to impact). [jointly with University of East Anglia.]
  11. 10/2010-10/2013, "Audio and video based speech separation for multiple moving sources within a room environment", EPSRC (responsive mode). [jointly with Loughborough University.]
  12. 10/2009-10/2012, "Multimodal blind source separation for robot audition", EPSRC and DSTL (signal processing call). (project website)
  13. 05/2008-06/2008, "Convolutive non-negative sparse coding", RAENG (international travel grant).
  14. 02/2008-06/2008, "Convolutive non-negative matrix factorization", URSF (small grant).
  15. 02/2008-03/2008, "Computational audition", OSU (visiting scholarship).

Research Team

Postdoc Research Fellows

  • Dr Mark Barnard (09/2014 - ): Visual tracking for future spatial audio (Co-supervisor. Co-supervised with Prof Adrian Hilton and Dr Philip Jackson)
  • Dr Qingju Liu (04/2014 - ): Source separation and objectification for future spatial audio (Primary supervisor. Co-supervised with Dr Philip Jackson and Prof Adrian Hilton)
  • Dr Cemre Zor (04/2013 - ): Statistical anomaly detection (Primary supervisor. Co-supervised with Prof Josef Kittler)
  • Dr Swati Chandna (05/2013 - ): Bootstrapping for robust source separation (Primary supervisor. Co-supervised with Dr Philip Jackson)

  • Dr Mark Barnard (10/2010 - 12/2013): Audio-visual speech separation of multiple moving sources (Primary supervisor. Co-supervised with Prof Josef Kittler. External Collaborators: Prof Jonathon Chambers, Loughborough University; Dr Sangarapillai Lambotharan, Loughborough University; Prof Christian Jutten, Grenoble, France; and Dr Bertrand Rivet, Grenoble, France)
  • Dr Qingju Liu (01/2013 - 03/2014): Word spotting from noisy mixtures & lip tracking for voice enhancement

PhD Students

  • Pengming Feng: Multi-target tracking (Co-supervisor. Co-supervised with Prof Jonathon Chambers and Dr Syed Mohsen Naqvi)
  • Waqas Rafique: Acoustic source separation (Co-supervisor. Co-supervised with Prof Jonathon Chambers and Dr Syed Mohsen Naqvi)
  • Luca Remaggi: Informed acoustic source separation (Co-supervisor. Co-supervised with Dr Philip Jackson)
  • Jing Dong: Analysis model based sparse representations for denoising (Primary supervisor. Co-supervised with Dr Philip Jackson; External Collaborator: Dr Wei Dai, Imperial College London)
  • Volkan Kilic: Robust audio visual tracking of multiple moving sources for robot audition (Primary supervisor. Co-supervised with Prof Josef Kittler and Dr Mark Barnard)
  • Shahrzad Shapoori: Tensor factorization in EEG signal processing (Co-supervisor. Co-supervised with Dr Saeid Sanei, Department of Computing)
  • Amran Abdul Hadi: Audio-visual fusion for convolutive source separation (Co-supervisor. Co-supervised with Dr Saeid Sanei, Department of Computing)
  • Atiyeh Alinaghi: Joint sound source localisation and separation (Co-supervisor. Co-supervised with Dr Philip Jackson)

  • Dr Marek Olik (PhD defended in December 2014): Personal sound zone reproduction with room reflections (Co-supervisor. Co-supervised with Dr Philip Jackson)
  • Dr Syed Zubair (PhD awarded in June 2014): Dictionary learning for signal classification (Primary supervisor. Co-supervised with Dr Philip Jackson; Internal collaborator: Dr Fei Yan; External collaborator: Dr Wei Dai, Imperial College London)
  • Dr Philip Coleman (PhD awarded in May 2014): Loudspeaker array processing for personal sound zone reproduction (Co-supervisor. Co-supervised with Dr Philip Jackson)
  • Dr Qingju Liu (PhD awarded in October 2013): Multimodal blind source separation for robot audition (Primary supervisor. Co-supervised with Dr Philip Jackson, Prof Josef Kittler; External collaborator: Prof Jonathon Chambers, Loughborough University) [Qingju Liu was the winner of the Best Solution Award on the DSTL Challenge Workshop for the signal processing challenge "Undersampled Signal Recognition", announced on the SSPD 2012 conference, London, September 25-27, 2012.]
  • Dr Tao Xu (PhD awarded in June 2013): Dictionary learning for sparse representations with applications to blind source separation (Primary supervisor. Co-supervised with Dr Philip Jackson; External collaborator: Dr Wei Dai, Imperial College London)
  • Dr Rakkrit Duangsoithong (PhD awarded in Oct 2012): Feature selection and causal discovery for ensemble classifiers (Co-supervisor; Co-supervised with Dr Terry Windeatt)
  • Dr Tariqullah Jan (PhD awarded in Feb 2012): Blind convolutive speech separation and dereverberation (Primary Supervisor; Co-Supervised with Prof Josef Kittler; External collaborator: Prof DeLiang Wang, The Ohio State University)

Academic Visitors

  • Dr Xiaorong Shen (02/2015 - ): Associate Professor, Beihang University, Beijing, China. Topic: Audio-visual source detection, localization and tracking.
  • Mr Hatem Deif (02/2015 - ): PhD student, Brunel University, London, UK. Topic: Single channel audio source separation.
  • Mr Jian Guan (10/2014 - ): PhD student, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, China. Topic: Approximate message passing and belief propagation.
  • Dr Yang Yu (04/2014 - ): Associate Professor, Northwestern Polytechnical University, Xi'an, China. Topic: Underwater acoustic source localisation and tracking with sparse array and deep learning.

  • Mr Jamie Corr (10/2014 - 10/2014): PhD student, Strathclyde University, Glasgow, UK. Topic: Underwater acoustic data processing with polynomial matrix decomposition.
  • Dr Xionghu Zhong (07/2014 - 07/2014): Independent Research Fellow, Nanyang Technological University, Singapore. Topic: Acoustic source tracking.
  • Xiaoyi Chen (10/2012 - 09/2013): PhD student, Northwestern Polytechnical University, Xi'an, China. Topic: Convolutive blind source separation of underwater acoustic mixtures.
  • Dr Ye Zhang (12/2012 - 08/2013): Associate Professor, Nanchang University, Nanchang, China. Topic: Analysis dictionary learning and source separation.
  • Victor Popa (04/2013 - 07/2013): PhD student, University Politehnica of Bucharest, Bucharest, Romania. Topic: Audio source separation.
  • Dr Stefan Soltuz (10/2008 - 07/2009): Research Scientist, Tiberiu Popoviciu Institute of Numerical Analysis, Romania. Topic: Non-negative matrix factorization for music audio separation (Primary supervisor. Co-supervised with Dr Philip Jackson)
  • Yanfeng Liang (MSc, 05/2009): MSc student, Harbin Engineering University, Harbin, China. Topic: Adaptive signal processing for clutter removal in radar images (Co-supervisor. Co-supervised with Prof Jonathon Chambers, Loughborough University)

MSc Students

  • Denise Chew (MSc, 2014, awarded Distinction); Project: Audio inpainting
  • Yan Yin (MSc, 2014); Project: Audio super-resolution
  • Dalton Whyte (MSc, 2014); Project: Audio retrieval using deep learning
  • Dan Hua (MSc, 2013, awarded Distinction); Project: Super-resolution audio based on sparse signal processing
  • Dichao Lu (MSc, 2013); Project: Polyphonic pitch tracking of music
  • Xiao Han (MSc, 2012, awarded Distinction); Project: Underdetermined reverberant speech separation
  • Jian Han (MSc, 2012, awarded Distinction); Project: Microphone array based acoustic tracking of multiple moving speakers (co-supervised with Dr Mark Barnard)
  • Tianyu Feng (MSc, 2012); Project: Multi-pitch estimation and tracking
  • Yuli Ling (MSc, 2012); Project: Audio event detection from sound mixtures
  • Danyang Shen (MSc, 2012); Project: Audio-visual tracking of multiple moving speakers (co-supervised with Dr Mark Barnard)
  • Kai Song (MSc, 2012); Project: Environment recognition from sound scenes (co-supervised with Dr Fei Yan)
  • Xinpu Han (MSc, 2012); Project: Compressed sensing for natural image coding
  • Steven Grima (MSc, 2011, awarded Distinction); Project: Multimodal tracking of multiple moving sources (co-supervised with Dr Mark Barnard)
  • Anil Lal (MSc, 2011, awarded Distinction); Project: Monaural music sound separation using spectral envelope template and isolated note information
  • Xi Luo (MSc, 2011, awarded Distinction); Project: Reverberant speech enhancement
  • Yunyi Wang (MSc, 2011); Project: Compressed sensing for image coding
  • Ritesh Agarwal (MSc, 2011); Project: Multiple pitch tracking
  • Yichen Li (MSc, 2011); Project: Environmental sound recognition (co-supervised with Dr Fei Yan)
  • Tengxu Yang (MSc, 2011); Project: Ideal binary mask estimation in computational auditory scene analysis
  • Jin Ke (MSc, 2011); Project: Audio-visual tracking and localisation of moving speakers (co-supervised with Dr Mark Barnard)
  • Zijian Zhang (MSc, 2011); Project: Convolutive blind source separation of speech mixtures
  • Hafiz Mustafa (MSc, 2010); Project: Single channel music sound separation

BSc Students

  • Xiao Cao (BSc, 2014); Project: Real-time speech separation demonstration

Research Collaborations

Academic:

  • Loughborough University (UK)
  • RIKEN (Japan)
  • Ohio State University (USA)
  • Imperial College London (UK)
  • Cardiff University (UK)
  • Strathclyde University (UK)
  • University of East Anglia (UK)
  • Nanchang University (China)
  • Northwestern Polytechnical University (China)
  • RMIT University (Australia)
  • Gipsa-lab (France)
  • Nanyang Technological University (Singapore)

Industrial:

  • Dstl
  • BBC
  • Thales
  • QinetiQ
  • Texas Instruments
  • Stellar
  • Digital Barriers
  • Selex Galileo
  • PrismTech
  • Steepest Ascent

Teaching

2012/2013

  • EEM.ivc - Image and Video Compression (Spring 2013)
  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2012)
  • EE2.mpr - Media (Audio-Visual) Processing (Spring 2013)
  • EE1.pro - Programming: Labs & Marking (Spring 2013)

2011/2012

  • EEM.ivc - Image and Video Compression (Spring 2012)
  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2011)
  • EE2.mpr - Media (Audio-Visual) Processing (Autumn 2011)
  • EE1.pro - Programming: Labs & Marking (Autumn 2011 & Spring 2012)
  • EE1.eps - EDPS: Basic Computing Skills (Autumn 2011)

2010/2011

  • EEM.ivc - Image and Video Compression (Spring 2011)
  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2010)
  • EE1.pro - Programming: Labs & Marking (Autumn 2010 & Spring 2011)
  • EE1.eps - EDPS: Basic Computing Skills (Autumn 2010)

2009/2010

  • EEM.ivc - Image and Video Compression (Spring 2010)
  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2009)
  • EE1.pro - Programming: Labs & Marking (Autumn 2009 & Spring 2010)
  • EE1.eps - EDPS: Basic Computing Skills (Autumn 2009)

2008/2009

  • EEM.ivc - Image and Video Compression (Spring 2009)
  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2008)
  • EE1.pro - Introduction to Programming: Labs & Marking (Autumn 2008 & Spring 2009)

2007/2008

  • EEM.sap - Speech and Audio Processing & Coding (Autumn 2007)
  • EE1.pca - C Programming Labs (Autumn 2007 & Spring 2008)

Note: EEM denotes MSc-level modules; EE1 denotes first-year undergraduate modules.

Departmental Duties

Selected Recent Activities

  • Organising Committee Member, CISP 2013, London, December, 2-3, 2013.
  • Program Committee Member, BMVC 2013, Bristol, UK, Sept 9-13, 2013.
  • Program Committee Member, SIP 2013, Banff, Canada, July 17-19, 2013.
  • Special Session Co-Chair (with Jonathon Chambers and Zoran Cvetkovic), DSP 2013, Santorini, Greece, July, 1-3, 2013.
  • Program Committee Member, ICICIP 2013, Beijing, China, June, 09-11, 2013.
  • Program Committee Member, ICASSP 2013, Vancouver, Canada, May, 26-31, 2013.
  • Tutorial Speaker (with Wei Dai and Boris Mailhe), ICASSP 2013, "Dictionary Learning for Sparse Representations: Algorithms and Applications", Vancouver, Canada, May 26-31, 2013.
  • Program Committee Member, SENSORNETS 2013, Barcelona, Spain, February 19-21, 2013.
  • External PhD Examiner, PhD Thesis: "Sparse Approximation and Dictionary Learning with Applications to Audio Signals", Queen Mary University of London, December 2012.
  • Independent Expert, European Commission, grant evaluation, November 2012.
  • Session Chair, SSPD 2012, "Sensor Arrays", London, UK, 25-27 September, 2012.
  • Program Committee Member, ISCSLP 2012, Hong Kong, China, December 5-8, 2012.
  • Program Committee Member, SSPD 2012, London, UK, 25-27 September, 2012.
  • Session Chair, EUSIPCO 2012, "P-ML-1: Machine Learning", Bucharest, Romania, 27 - 31 August, 2012.
  • Session Co-Chair (with Ali Taylan Cemgil), ICASSP 2012, "MLSP-L3: Applications in Audio, Speech, and Image Processing", Kyoto, Japan, 25-30 March, 2012.
  • Program Committee Member, S+SSPR 2012, Hiroshima, Japan, 7 - 9 November, 2012.
  • Program Committee Member, CISP 2012, Chongqing, China, 16-18 October, 2012.
  • Program Committee Member, UKCI 2012, Edinburgh, UK, 5-11 September, 2012.
  • Area Chair, EUSIPCO 2012, Bucharest, Romania, 27 - 31 August, 2012.
  • Program Committee Member, BMVC 2012, Guildford, UK, 3 - 7 September, 2012.
  • Program Committee Member, EUSIPCO 2012, Bucharest, Romania, 27 - 31 August, 2012.
  • Program Committee Member, SIP 2012, Honolulu, USA, 20 - 22 August, 2012.
  • Program Committee Member, ISNN 2012, Shenyang, China, 11-14 July, 2012.
  • Program Committee Member, ICSAI 2012, Yantai, China, 19-21 May, 2012.
  • Program Committee Member, ICASSP 2012, Kyoto, Japan, 25-30 March, 2012.
  • Program Committee Member, ICIST 2012, Wuhan, China, 23-25 March, 2012.
  • Program Committee Member, SENSORNETS 2012, Rome, Italy, 24-26 February, 2012.
  • Internal PhD Examiner, PhD Thesis: "Novel Tensor Factorization Based Approaches for Blind Source Separation", Department of Computing, University of Surrey, December 2011.
  • Session Chair, EUSIPCO 2011, "Multichannel Acoustic Processing I", Barcelona, Spain, 29 August -2 Sept, 2011.
  • Program Committee Member, SIP 2011, Dallas, USA, 14-16 December, 2011.
  • Program Committee Member, CISP 2011, Shanghai, China, 15-17 October, 2011.
  • Technical Committee Member, SSPD 2011, London, UK, 28-29 September, 2011.
  • Program Committee Member, UKCI 2011, Manchester, UK, September, 2011.
  • Special Session Co-Chair (with Jonathon Chambers and Bertrand Rivet), EUSIPCO 2011, "Multimodal (Audio-Visual) Speech Separation", Barcelona, Spain, 29 August -2 Sept, 2011.
  • Program Committee Member, BMVC 2011, Dundee, UK, 29 August -2 Sept, 2011.
  • Headstart Project Leader, School Outreach, Guildford, 17-20 July, 2011.
  • Program Committee Member, SIPA 2011, Crete, Greece, 22-24 June, 2011.
  • Program Committee Member, CSIE 2011, Changchun, China, 17-19 June, 2011.
  • Program Committee Member, ISNN 2011, Guilin, China, 29 May- 1 June, 2011.
  • Program Committee Member, ICIST 2011, Nanjing, China, 26-28 March, 2011.
  • Grant Reviewer, EPSRC, first grant proposal, March, 2011.
  • External PhD Examiner, School of Engineering, University of Edinburgh, 2010.
  • Grant Reviewer, PASCAL2, internal visiting proposal, August, 2010.
  • Headstart Project Leader, School Outreach, Guildford, 19-22 July, 2010.
  • Technical Committee Member, Conference on Sensor Signal Processing for Defence (SSPD 2010), London, UK, 29-30 September, 2010.
  • Program Committee Member, BMVC 2010, Aberystwyth, UK, 31 August - 3 September, 2010.
  • Program Committee Member, IEEE WCSE 2010, Wuhan, China, December 19-20, 2010.
  • Program Committee Member, IWACI 2010, Suzhou, China, August 25-27, 2010.
  • Program Committee Member, SIP 2010, Maui, Hawaii, USA, August 23-25, 2010.
  • Program Committee Member, S+SSPR 2010, Cesme, Turkey, August 18-20, 2010.
  • Program Committee Member, ISNN 2010, Shanghai, China, June 6-9, 2010.
  • Publicity chair, IEEE International Workshop on Statistical Signal Processing (SSP 2009), Cardiff, UK, Aug. 31- Sept. 3, 2009.
  • Program Co-chair, IEEE Global Congress on Intelligent Systems (GCIS 2009), Xiamen, China, May 19-21, 2009.
  • Program Committee Member, SIP 2009, Honolulu, Hawaii, USA, August 17-19, 2009.
  • Program Committee Member, IEEE WCSE 2009, Xiamen, China, May 19-21, 2009.
  • Session Chair, ICA Research Network International Workshop (ICARN 2008), Liverpool, UK, September 25-26, 2008.
  • Chair, oral session "Unsupervised learning III", IEEE WCCI 2008, Hong Kong, China, June 1-6, 2008.
  • Guest editor, special issue "Advances in Nonnegative Matrix and Tensor Factorization", Computational Intelligence and Neuroscience (Hindawi), edited by A. Cichocki, M. Morup, P. Smaragdis, W. Wang, and R. Zdunek, May 2008.
  • Program Committee Member, SIP 2008, Kailua-Kona, Hawaii, USA, August 18-20, 2008.
  • Technical Committee Member, IEEE WCCI 2008, Hong Kong, China, June 1-6, 2008.

Invited Talks

  • W. Wang, "Machine Audition at CVSSP", in UK & IE Speech Conference, Birmingham, UK, December 17-18, 2012.
  • W. Wang, "Dictionary Learning Algorithms in Sparse Representations and Signal Processing," (Organizer: Dr Wei Liu),Department of Eletronic and Electrical Engineering , Sheffield University, October 24, 2012.
  • W. Wang, "Dictionary Learning Algorithms in Signal Processing," (Organizer: Dr Lu Gan), School of Engineering and Design, Brunel University, August 1, 2012.
  • W. Wang, "Adaptive Dictionary Learning Algorithms for Image Denoising, Source Separation, and Visual Tracking," (Organizer: Dr Andrew Aubrey), Cardiff School of Computer Science and Informatics, Cardiff University, May 24, 2012.
  • W. Wang, "Dictionary Learning Algorithms and Their Applications in Source separation, Speaker Tracking, and Image Denoising," (Organizer: Prof Mark Plumbley), School of Electronic Engineering and Computer Science, Queen Mary University of London, April 25, 2012.
  • W. Wang, "Audio and Audio-Visual Source Separation," (Organizer: Dr Xiaorong Shen), School of Automation Science and Electrical Engineering, Beihang University, Beijing, September 20, 2011.
  • T. Xu and W. Wang, "Compressive Sensing," (Organizer: Prof. Anthony Ho), Department of Computer Science, University of Surrey, Guildford, January 11, 2010.
  • W. Wang, "Multimodal Blind Source Separation for Robot Audition," (Organizer: Dr. Tania Stathaki), MOD University Defence Research Centre Launch & Theme Meeting, Imperial College London, London, November 5, 2009.
  • W. Wang, "Two-microphone Speech Separation Based on Convolutive ICA and Ideal Binary Mask Coupled with Cepstral Smoothing," (Organizer: Prof. Francis Rumsey), Institute of Sound Recording (IoSR), University of Surrey, Guildford, October 21, 2008.
  • W. Wang, "Convolutive ICA and NMF for Audio Source Separation and Perception," (Organizers: Prof. Vladimir M. Sloutsky & Prof. DeLiang Wang), Center for Cognitive Science, Ohio State University, Columbus, April 11, 2008.
  • W. Wang, "Audio Source Separation and Perception," (Organizer: Prof. DeLiang Wang), Perception and Neurodynamics Laboratory (PNL), Department of Computer Science and Engineering, Ohio State University, Columbus, March 07, 2008.
  • W. Wang, "Intelligent Data Fusion Based Blind Source Separation," (Organizer: Dr Nathan Wood), Royal Academy of Engineering, London, April 11, 2005.
  • W. Wang and J.A. Chambers, "Frequency Domain Blind Source Separation," IEE Seminar on Blind Source Separation in Biomedicine (Organizer: Dr. Christopher J. James), British Institute of Radiology, London, 1 Dec. 2004.
  • W. Wang, "Frequency Domain BSS and its Associated Permutation Problem," Contract Researchers Conference at Cardiff School of Engineering (Organizer: Dr. Adrian Porch), Cardiff University, Cardiff, July 16, 2004.
  • W. Wang, "Blind Signal Processing and Speech Enhancement," Series Forum for Celebration of the 50th Anniversary of Harbin Engineering University (Organizer: Prof. Yanling Hao), Harbin, Apr. 11, 2003.
  • W. Wang, S. Sanei, and J.A. Chambers, "Has the Permutation Problem in Transform Domain BSS Been Solved?," IEE Workshop on Independent Component Analysis: Generalizations, Algorithms and Applications (Organizer: Dr. Mike Davies), Queen Mary University of London, London, Dec. 20, 2002.

Tutorial Talks

  • W. Dai, W. Wang, and B. Mailhe, ICASSP 2013, "Dictionary Learning for Sparse Representations: Algorithms and Applications", Vancouver, Canada, May 26-31, 2013.

Contact Me

E-mail:
Phone: 01483 68 6039

Find me on campus
Room: 04 BB 01

Publications

Highlights

  • Wang W. (2011) Preface of Machine Audition: Principles, Algorithms and Systems. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems Information Science Reference , pp. xv-xxi.

    Abstract

    "This book covers advances in algorithmic developments, theoretical frameworks, andexperimental research findings to assist professionals who want an improved ...

  • Jan T, Wang W, Wang D. (2011) 'A Multistage Approach to Blind Separation of Convolutive Speech Mixtures'. Speech Communication, 53 (4), pp. 524-539.
  • Wang W. (2010) Instantaneous versus Convolutive Non-negative Matrix Factorization: Models, Algorithms and Applications to Audio Pattern Separation. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems Information Science Reference Article number 15 , pp. 353-370.
  • Jan T, Wang W. (2010) Cocktail Party Problem: Source Separation Issues and Computational Methods. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems New York, USA : Information Science Reference Article number 3 , pp. 61-79.
  • Wang W. (2010) Machine Audition: Principles, Algorithms and Systems. New York, USA : Information Science Reference
  • Zhou S, Wang W. (2009) IEEE/WRI Global Congress on Intelligent Systems Proceedings. USA : IEEE Computer Society
  • Wang W, Cichocki A, Chambers JA. (2009) 'A multiplicative algorithm for convolutive non-negative matrix factorization based on squared euclidean distance'. IEEE Transactions on Signal Processing, 57 (7), pp. 2858-2864.
  • Wang W, Cichocki A, Mørup M, Smaragdis P, Zdunek R. (2008) 'Advances in nonnegative matrix and tensor factorization'. Hindawi Publishing Corporation Computational Intelligence and Neuroscience, 2008 Article number 852187
  • Wang W, Luo Y, Chambers JA, Sanei S. (2008) 'Note onset detection via nonnegative factorization of magnitude spectrum'. Hindawi EURASIP Journal on Advances in Signal Processing, Article number 231367

Journal articles

  • Liang J, Hu Q, Zhu P, Wang W. (2017) 'Efficient Multi-modal Geometric Mean Metric Learning'. Elsevier Pattern Recognition,
    [ Status: Accepted ]

    Abstract

    With the fast development of information acquisition, there is a rapid growth of multi-modality data, e.g., text, audio, image and even video, in the fields of health care, multimedia retrieval and scientific research. Confronted with the challenges of clustering, classification or regression with multi-modality information, it is essential to effectively measure the distance or similarity between objects described with heterogeneous features. Metric learning, aimed at finding a task-oriented distance function, is a hot topic in machine learning. However, most existing algorithms lack efficiency for high-dimensional multi-modality tasks. In this work, we develop an effective and efficient metric learning algorithm for multi-modality data, i.e., Efficient Multi-modal Geometric Mean Metric Learning (EMGMML). The proposed algorithm learns a distinctive distance metric for each view by minimizing the distance between similar pairs while maximizing the distance between dissimilar pairs. To avoid overfitting, the optimization objective is regularized by symmetrized LogDet divergence. EMGMML is very efficient in that there is a closed-form solution for each distance metric. Experiment results show that the proposed algorithm outperforms the state-of-the-art metric learning methods in terms of both accuracy and efficiency.
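
The closed-form step is the distinctive part of this approach. As a rough single-view illustration of the geometric-mean metric idea (a sketch of the underlying GMML construction, not the authors' EMGMML code; the helper names and toy data are mine):

```python
# Sketch: a metric as the geodesic midpoint between inv(S) and D on the SPD
# manifold, where S and D are scatter matrices of similar/dissimilar pairs.
# Small distances for similar pairs, large for dissimilar ones - no iterative
# solver needed. The paper's per-view treatment and symmetrized LogDet
# regularization are omitted here.
import numpy as np
from scipy.linalg import sqrtm, inv

def pair_scatter(X, pairs):
    """Sum of outer products (x_i - x_j)(x_i - x_j)^T over index pairs."""
    d = X.shape[1]
    S = np.zeros((d, d))
    for i, j in pairs:
        diff = (X[i] - X[j])[:, None]
        S += diff @ diff.T
    return S

def geometric_mean_metric(X, similar, dissimilar, reg=1e-6):
    d = X.shape[1]
    S = pair_scatter(X, similar) + reg * np.eye(d)
    D = pair_scatter(X, dissimilar) + reg * np.eye(d)
    S_half = sqrtm(S)
    S_inv_half = inv(S_half)
    # Closed form: A = S^(-1/2) (S^(1/2) D S^(1/2))^(1/2) S^(-1/2)
    A = S_inv_half @ sqrtm(S_half @ D @ S_half) @ S_inv_half
    return np.real(A)  # sqrtm may introduce negligible imaginary parts

# Toy usage with random 5-D features and a few labelled pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
A = geometric_mean_metric(X, similar=[(0, 1), (2, 3)], dissimilar=[(0, 4), (5, 6)])
```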

  • Xu Y, Huang Q, Wang W, Foster P, Sigtia S, Jackson PJB, Plumbley MD. (2017) 'Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging'. IEEE/ACM Transactions on Audio, Speech and Language Processing, 25 (6), pp. 1230-1241.
    [ Status: Accepted ]

    Abstract

    Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-Filter Banks (MFBs) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains the state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task while EER of the first prize of this challenge is 0.17.
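
For readers less familiar with this setup, the following is a minimal sketch of chunk-level multi-label tagging with a feed-forward DNN, sigmoid outputs and dropout, as the abstract describes; the layer sizes, chunk shape and tag count are assumptions for illustration, not the paper's exact architecture (and the DAE-based feature learning is omitted):

```python
# Minimal multi-label audio tagging sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn

n_frames, n_mels, n_tags = 91, 40, 7     # hypothetical chunk size / tag count

model = nn.Sequential(
    nn.Flatten(),                        # contextual MFB frames -> one vector
    nn.Linear(n_frames * n_mels, 1000), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(500, n_tags), nn.Sigmoid(),  # independent per-tag probabilities
)

criterion = nn.BCELoss()                 # target is a 0/1 vector per chunk
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(16, n_frames, n_mels)    # a batch of feature chunks
y = torch.randint(0, 2, (16, n_tags)).float()  # chunk-level tag labels
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```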

  • Wang D, Zou Y, Wang W. (2017) 'Learning soft mask with DNN and DNN-SVM for multi-speaker DOA estimation using an acoustic vector sensor'. Elsevier Journal of The Franklin Institute,
    [ Status: Accepted ]

    Abstract

    Using an acoustic vector sensor (AVS), an efficient method has been presented recently for direction-of-arrival (DOA) estimation of multiple speech sources via the clustering of the inter-sensor data ratio (AVS-ISDR). Through extensive experiments on simulated and recorded data, we observed that the performance of the AVS-DOA method is largely dependent on the reliable extraction of the target speech dominated time-frequency points (TD-TFPs) which, however, may be degraded with the increase in the level of additive noise and room reverberation in the background. In this paper, inspired by the great success of deep learning in speech recognition, we design two new soft mask learners, namely deep neural network (DNN) and DNN cascaded with a support vector machine (DNN-SVM), for multi-source DOA estimation, where a novel feature, namely, the tandem local spectrogram block (TLSB) is used as the input to the system. Using our proposed soft mask learners, the TD-TFPs can be accurately extracted under different noisy and reverberant conditions. Additionally, the generated soft masks can be used to calculate the weighted centers of the ISDR-clusters for better DOA estimation as compared with the original center used in our previously proposed AVS-ISDR. Extensive experiments on simulated and recorded data have been presented to show the improved performance of our proposed methods over two baseline AVS-DOA methods in the presence of noise and reverberation.

  • Franck A, Wang W, Fazi F. (2017) 'Sparse ℓ1-Optimal Multiloudspeaker Panning and Its Relation to Vector Base Amplitude Panning'. IEEE Transactions on Audio, Speech and Language Processing, 25 (5), pp. 996-1010.

    Abstract

    Panning techniques, such as vector base amplitude panning (VBAP), are a widely-used practical approach for spatial sound reproduction using multiple loudspeakers. Although limited to a relatively small listening area, they are very efficient and offer good localisation accuracy, timbral quality as well as a graceful degradation of quality outside the sweet spot. The aim of this paper is to investigate optimal sound reproduction techniques that adopt some of the advantageous properties of VBAP, such as the sparsity and the locality of the active loudspeakers for the reproduction of a single audio object. To this end, we state the task of multi-loudspeaker panning as an ℓ1 optimization problem. We demonstrate and prove that the resulting solutions are exactly sparse. Moreover, we show the effect of adding a nonnegativity constraint on the loudspeaker gains in order to preserve the locality of the panning solution. Adding this constraint, ℓ1-optimal panning can be formulated as a linear program. Using this representation, we prove that unique ℓ1-optimal panning solutions incorporating a nonnegativity constraint are identical to VBAP using a Delaunay triangulation for the loudspeaker setup. Using results from linear programming and duality theory, we describe properties and special cases, such as solution ambiguity, of the VBAP solution.
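
The linear-program formulation the abstract refers to takes only a few lines to set up. A sketch under an assumed 2D loudspeaker layout (illustrative, not the paper's code):

```python
# Nonnegative l1-optimal panning as a linear program: with loudspeaker unit
# vectors as the columns of L and a target direction p, solve
#   min 1^T g   subject to   L g = p,  g >= 0.
import numpy as np
from scipy.optimize import linprog

angles = np.deg2rad([-45.0, 0.0, 45.0, 110.0])     # hypothetical 2D layout
L = np.vstack([np.cos(angles), np.sin(angles)])    # 2 x 4 direction matrix
p = np.array([np.cos(np.deg2rad(20.0)), np.sin(np.deg2rad(20.0))])

res = linprog(c=np.ones(L.shape[1]), A_eq=L, b_eq=p, bounds=(0, None))
print("gains:", np.round(res.x, 4))
```

Because the LP has as many equality constraints as spatial dimensions, its vertex solutions activate at most that many loudspeakers (two in 2D, three in 3D), which is the exact sparsity discussed above; by the paper's main result, the active set matches the VBAP pair or triplet from a Delaunay triangulation.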

  • Guan J, Wang X, Wang W, Huang L. (2017) 'Sparse Blind Speech Deconvolution with Dynamic Range Regularization and Indicator Function'. Springer Circuits, Systems, and Signal Processing, pp. 1-16.

    Abstract

    Blind deconvolution is an ill-posed problem. To solve such a problem, prior information, such as the sparseness of the source (i.e. input) signal or channel impulse responses, is usually adopted. In speech deconvolution, the source signal is not naturally sparse. However, the direct impulse and early reflections of the impulse responses of an acoustic system can be considered as sparse. In this paper, we exploit the channel sparsity and present an algorithm for speech deconvolution, where the dynamic range of the convolutive speech is also used as the prior information. In this algorithm, the estimation of the impulse response and the source signal is achieved by alternating between two steps, namely, the ℓ1 regularized least squares optimization and a proximal operation. As demonstrated in our experiments, the proposed method provides superior performance for deconvolution of a sparse acoustic system, as compared with two state-of-the-art methods.
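
The ℓ1 regularized least squares step paired with a proximal operation is a standard pattern. A generic ISTA-style sketch for recovering a sparse impulse response when the source is known (a simplified stand-in for the paper's alternating algorithm, which also estimates the source and adds the dynamic-range prior):

```python
# Recover a sparse channel h from y = s * h by proximal gradient (ISTA):
# gradient step on the least-squares term, then soft-thresholding, which is
# the proximal operator of the l1 penalty.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_sparse_channel(y, s, h_len, lam=0.05, n_iter=300):
    # Convolution matrix: column k is the source delayed by k samples.
    S = np.column_stack([np.convolve(np.eye(h_len)[:, k], s)[:len(y)]
                         for k in range(h_len)])
    step = 1.0 / np.linalg.norm(S, 2) ** 2        # 1 / Lipschitz constant
    h = np.zeros(h_len)
    for _ in range(n_iter):
        grad = S.T @ (S @ h - y)                  # least-squares gradient
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Toy check: direct path plus two early reflections.
rng = np.random.default_rng(0)
s = rng.standard_normal(400)
h_true = np.zeros(32)
h_true[[0, 9, 17]] = [1.0, 0.5, -0.3]
y = np.convolve(s, h_true)[:400]
print(np.round(ista_sparse_channel(y, s, 32), 2))
```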

  • Dong J, Han Z, Zhao Y, Wang W, Prochazka A, Chambers J. (2017) 'Sparse Analysis Model Based Multiplicative Noise Removal with Enhanced Regularization'. Elsevier Signal Processing, 137, pp. 160-176.

    Abstract

    The multiplicative noise removal problem for a corrupted image has recently been considered under the framework of regularization based approaches, where the regularizations are typically defined on sparse dictionaries and/or total variation (TV). This framework was demonstrated to be effective. However, the sparse regularizers used so far are based overwhelmingly on the synthesis model, and the TV based regularizer may induce the stair-casing effect in the reconstructed image. In this paper, we propose a new method using a sparse analysis model. Our formulation contains a data fidelity term derived from the distribution of the noise and two regularizers. One regularizer employs a learned analysis dictionary, and the other regularizer is an enhanced TV by introducing a parameter to control the smoothness constraint defined on pixel-wise differences. To address the resulting optimization problem, we adapt the alternating direction method of multipliers (ADMM) framework, and present a new method where a relaxation technique is developed to update the variables flexibly with either image patches or the whole image, as required by the learned dictionary and the enhanced TV regularizers, respectively. Experimental results demonstrate the improved performance of the proposed method as compared with several recent baseline methods, especially for relatively high noise levels.

  • Remaggi L, Jackson PJB, Coleman P, Wang W. (2017) 'Acoustic Reflector Localization: Novel Image Source Reversion and Direct Localization Methods'. IEEE Transactions on Audio, Speech and Language Processing, 25 (2), pp. 296-309.

    Abstract

    Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as “image-source reversion”, which localizes the image source before finding the reflector position, and “direct localization”, which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.

  • Wang W, Feng P, Dlay S, Naqvi SM, Chambers J. (2017) 'Social Force Model based MCMC-OCSVM Particle PHD Filter for Multiple Human Tracking'. IEEE Transactions on Multimedia,
    [ Status: Accepted ]
  • Liang J, Hu Q, Wang W, Han Y. (2016) 'Semi-Supervised Online Multi-Kernel Similarity Learning for Image Retrieval'. IEEE Transactions on Multimedia, 19 (5), pp. 1077-1089.
    [ Status: Accepted ]

    Abstract

    Metric learning plays a fundamental role in the fields of multimedia retrieval and pattern recognition. Recently, an online multi-kernel similarity (OMKS) learning method has been presented for content-based image retrieval (CBIR), which was shown to be promising for capturing the intrinsic nonlinear relations within multimodal features from large-scale data. However, the similarity function in this method is learned only from labeled images. In this paper, we present a new framework to exploit unlabeled images and develop a semi-supervised OMKS algorithm. The proposed method is a multi-stage algorithm consisting of feature selection, selective ensemble learning, active sample selection and triplet generation. The novel aspects of our work are the introduction of classification confidence to evaluate the labeling process and select the reliably labeled images to train the metric function, and a method for reliable triplet generation, where a new criterion for sample selection is used to improve the accuracy of label prediction for unlabelled images. Our proposed method offers advantages in challenging scenarios, in particular, for a small set of labeled images with high-dimensional features. Experimental results demonstrate the effectiveness of the proposed method as compared with several baseline methods.

  • Barnard M, Wang W. (2016) 'Audio Head Pose Estimation using the Direct to Reverberant Speech Ratio'. Speech Communication,

    Abstract

    Head pose is an important cue in many applications such as speech recognition and face recognition. Most approaches to head pose estimation to date have focussed on the use of visual information of a subject's head. These visual approaches have a number of limitations such as an inability to cope with occlusions, changes in the appearance of the head, and low resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis is that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. This also has the practical advantage that most enclosed living spaces, such as meeting rooms or offices, are highly reverberant environments. In order to test this hypothesis we also present a new data set featuring 56 subjects recorded in three different rooms, with different acoustic properties, adopting 8 different head poses in 4 different room positions captured with a 16 element microphone array. As far as the authors are aware this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation. Using this data set we demonstrate that our proposed method of using the DRR for audio head pose estimation provides a significant improvement over previous methods.
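
When a room impulse response is available, the DRR itself is simple to compute. A hedged sketch (the window length and toy RIR are assumptions; the paper estimates the ratio from array recordings of speech rather than a measured RIR):

```python
# Direct-to-reverberant ratio: energy in a short window around the
# direct-path peak of an RIR versus the energy of everything else, in dB.
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_win_ms=2.5):
    peak = int(np.argmax(np.abs(rir)))            # direct-path arrival
    half = int(direct_win_ms * 1e-3 * fs / 2)
    lo, hi = max(peak - half, 0), peak + half + 1
    direct_energy = np.sum(rir[lo:hi] ** 2)
    reverb_energy = np.sum(rir[:lo] ** 2) + np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct_energy / reverb_energy)

# Toy RIR: a strong direct path plus exponentially decaying reverberation.
fs = 16000
rng = np.random.default_rng(1)
rir = 0.05 * rng.standard_normal(fs // 2) * np.exp(-np.arange(fs // 2) / (0.05 * fs))
rir[200] = 1.0                                    # direct path
print(f"DRR = {direct_to_reverberant_ratio(rir, fs):.1f} dB")
```

A speaker facing the microphone raises the direct-path energy relative to the reverberant tail, so the DRR rises, which is exactly the cue the paper exploits.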

  • Feng P, Wang W, Naqvi SM, Chambers J. (2016) 'Adaptive Retrodiction Particle PHD Filter for Multiple Human Tracking'. IEEE Signal Processing Letters,
  • Dong J, Wang W, Dai W, Plumbley MD, Han Z-F, Chambers J. (2016) 'Analysis SimCO Algorithms for Sparse Analysis Model Based Dictionary Learning'. IEEE Transactions on Signal Processing, 64 (2), pp. 417-431.
  • Kilic V, Barnard M, Wang W, Hilton A, Kittler J. (2016) 'Mean-Shift and Sparse Sampling Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking'. IEEE Transactions on Multimedia,
    [ Status: Accepted ]

    Abstract

    The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multi-speaker tracking. However, it has a heavy computational cost as surviving, spawned and born particles need to be distributed in each frame to model the state of the speakers and to estimate jointly the variable number of speakers with their states. In particular, the computational cost is mostly caused by the born particles as they need to be propagated over the entire image in every frame to detect the new speaker presence in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction of arrival (DOA) angles of the audio sources to determine when to propagate the born particles and re-allocate the surviving and spawned particles. The tracking accuracy of the AV-SMC-PHD algorithm is further improved by using a modified mean-shift algorithm to search and climb density gradients iteratively to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD based on the AV16.3, AMI and CLEAR datasets.

  • Zhao L, Hu Q, Wang W. (2015) 'Heterogeneous Feature Selection with Multi-Modal Deep Neural Networks and Sparse Group Lasso'. IEEE Transactions on Multimedia, 17 (11), pp. 1936-1948.
    [ Status: Accepted ]
  • Chen X, Wang W, Wang Y, Zhong X, Alinaghi A. (2015) 'Reverberant speech separation with probabilistic time-frequency masking for B-format recordings'. Speech Communication, 68, pp. 41-54.
  • Kiliç V, Barnard M, Wang W, Kittler J. (2015) 'Audio assisted robust visual tracking with adaptive particle filtering'. IEEE Transactions on Multimedia, 17 (2), pp. 186-200.

    Abstract

    The problem of tracking multiple moving speakers in indoor environments has received much attention. Earlier techniques were based purely on a single modality, e.g., vision. Recently, the fusion of multi-modal information has been shown to be instrumental in improving tracking performance, as well as robustness in the case of challenging situations like occlusions (by the limited field of view of cameras or by other speakers). However, data fusion algorithms often suffer from noise corrupting the sensor measurements which cause non-negligible detection errors. Here, a novel approach to combining audio and visual data is proposed. We employ the direction of arrival angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. This approach is further improved by solving a typical problem associated with the PF, whose efficiency and accuracy usually depend on the number of particles and noise variance used in state estimation and particle propagation. Both parameters are specified beforehand and kept fixed in the regular PF implementation which makes the tracker unstable in practice. To address these problems, we design an algorithm which adapts both the number of particles and noise variance based on tracking error and the area occupied by the particles in the image. Experiments on the AV16.3 dataset show the advantage of our proposed methods over the baseline PF method and an existing adaptive PF algorithm for tracking occluded speakers with a significantly reduced number of particles.

  • Gu F, Li W, Wang W. (2014) 'Fourth-order cumulant based sources number estimation from mixtures of unknown number of sources'. 2014 6th International Conference on Wireless Communications and Signal Processing, WCSP 2014,
  • Zhong X, Wang W, Naqvi M, Chng ES. (2014) 'A Bayesian performance bound for time-delay of arrival based acoustic source tracking in a reverberant environment'. FUSION 2014 - 17th International Conference on Information Fusion,
  • Kiliç V, Zhong X, Barnard M, Wang W, Kittler J. (2014) 'Audio-visual tracking of a variable number of speakers with a random finite set approach'. FUSION 2014 - 17th International Conference on Information Fusion,
  • Liu Q, Aubrey AJ, Wang W. (2014) 'Interference reduction in reverberant speech separation with visual voice activity detection'. IEEE Transactions on Multimedia, 16 (6), pp. 1610-1623.
  • Alinaghi A, Jackson PJB, Liu Q, Wang W. (2014) 'Joint Mixing Vector and Binaural Model Based Stereo Source Separation'. IEEE Transactions on Audio, Speech, & Language Processing, 22 Article number 9, pp. 1434-1448.
  • Barnard M, Wang W, Kittler J, Koniusz P, Naqvi SM, Chambers J. (2014) 'Robust multi-speaker tracking via dictionary learning and identity modeling'. IEEE Transactions on Multimedia, 16 (3), pp. 864-880.
  • Chandna S, Wang W. (2014) 'Improving model-based convolutive blind source separation techniques via bootstrap'. IEEE Workshop on Statistical Signal Processing Proceedings, , pp. 424-427.
  • Liu Q, Wang W, Jackson PJB, Barnard M, Kittler J, Chambers J. (2013) 'Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking'. IEEE Transactions on Signal Processing, 61 (22) Article number 99, pp. 5520-5535.
  • Zubair S, Yan F, Wang W. (2013) 'Dictionary learning based sparse coefficients for audio classification with max and average pooling'. Elsevier Digital Signal Processing, 23 (3), pp. 960-970.
  • Xu T, Wang W, Dai W. (2013) 'Sparse coding with adaptive dictionary learning for underdetermined blind speech separation'. Speech Communication, 55 (3), pp. 432-450.
  • Naik GR, Wang W. (2012) 'Audio analysis of statistically instantaneous signals with mixed Gaussian probability distributions'. International Journal of Electronics, 99 (10), pp. 1333-1350.
  • Liu Q, Wang W, Jackson P. (2012) 'Use of bimodal coherence to resolve the permutation problem in convolutive BSS'. Signal Processing, 92 (8), pp. 1916-1927.

    Abstract

    Recent studies show that facial information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterization of the coherence between the audio and visual speech using, e.g., a Gaussian mixture model (GMM). In this paper, we present three contributions. With the synchronized features, we propose an adapted expectation maximization (AEM) algorithm to model the audiovisual coherence in the off-line training process. To improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features. Then with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain. We test our algorithm on a multimodal speech database composed of different combinations of vowels and consonants. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS, which confirms the benefit of using visual speech to assist in separation of the audio.

  • Mohsen Naqvi S, Wang W, Khan MS, Barnard M, Chambers JA. (2012) 'Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking'. IET Signal Processing, 6 (5), pp. 466-477.
  • Dai W, Xu T, Wang W. (2012) 'Simultaneous codeword optimization (SimCO) for dictionary update and learning'. IEEE Transactions on Signal Processing, 60 (12), pp. 6340-6353.

    Abstract

    We consider the data-driven dictionary learning problem. The goal is to seek an over-complete dictionary from which every training signal can be best approximated by a linear combination of only a few codewords. This task is often achieved by iteratively executing two operations: sparse coding and dictionary update. The focus of this paper is on the dictionary update step, where the dictionary is optimized with a given sparsity pattern. We propose a novel framework where an arbitrary set of codewords and the corresponding sparse coefficients are simultaneously updated, hence the term simultaneous codeword optimization (SimCO). The SimCO formulation not only generalizes benchmark mechanisms MOD and K-SVD, but also allows the discovery that singular points, rather than local minima, are the major bottleneck of dictionary update. To mitigate the problem caused by the singular points, regularized SimCO is proposed. First and second order optimization procedures are designed to solve regularized SimCO. Simulations show that regularization substantially improves the performance of dictionary learning.
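
To make the two-operation loop concrete, here is a toy sketch of dictionary learning by alternating sparse coding with a dictionary update. The update shown is the MOD rule, which the abstract names as a special case of SimCO; SimCO itself updates the selected codewords and their nonzero coefficients simultaneously, which this sketch does not attempt:

```python
# Alternate OMP sparse coding with a MOD-style dictionary update.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def learn_dictionary(Y, n_atoms, k, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                    # unit-norm codewords
    for _ in range(n_iter):
        X = orthogonal_mp(D, Y, n_nonzero_coefs=k)    # sparse coding step
        D = Y @ X.T @ np.linalg.pinv(X @ X.T)         # MOD dictionary update
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, X

# Synthetic data: 500 signals, each a combination of 3 of 40 true atoms.
rng = np.random.default_rng(1)
D_true = rng.standard_normal((20, 40))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = np.zeros((40, 500))
for j in range(500):
    X_true[rng.choice(40, 3, replace=False), j] = rng.standard_normal(3)
Y = D_true @ X_true

D, X = learn_dictionary(Y, n_atoms=40, k=3)
print("relative residual:", np.linalg.norm(Y - D @ X) / np.linalg.norm(Y))
```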

  • Jan T, Wang W, Wang D. (2011) 'A Multistage Approach to Blind Separation of Convolutive Speech Mixtures'. Speech Communication, 53 (4), pp. 524-539.
  • Liu Q, Wang W. (2011) 'Blind source separation and visual voice activity detection for target speech extraction'. Proceedings of 2011 3rd International Conference on Awareness Science and Technology, iCAST 2011, , pp. 457-460.
  • Wang W, Cichocki A, Chambers JA. (2009) 'A multiplicative algorithm for convolutive non-negative matrix factorization based on squared euclidean distance'. IEEE Transactions on Signal Processing, 57 (7), pp. 2858-2864.
  • Wang W, Cichocki A, Mørup M, Smaragdis P, Zdunek R. (2008) 'Advances in nonnegative matrix and tensor factorization'. Hindawi Publishing Corporation Computational Intelligence and Neuroscience, 2008 Article number 852187
  • Wang W, Luo Y, Chambers JA, Sanei S. (2008) 'Note onset detection via nonnegative factorization of magnitude spectrum'. Hindawi EURASIP Journal on Advances in Signal Processing, Article number 231367
  • Luo Y, Wang W, Chambers JA, Lambotharan S, Proudler I. (2006) 'Exploitation of source nonstationarity in underdetermined blind source separation with advanced clustering techniques'. IEEE Transactions on Signal Processing, 54 (6 I), pp. 2198-2212.
  • Jafari MG, Wang W, Chambers JA, Hoya T, Cichocki A. (2006) 'Sequential blind source separation based exclusively on second-order statistics developed for a class of periodic signals'. IEEE Transactions on Signal Processing, 54 (3), pp. 1028-1040.
  • Yuan L, Wang W, Chambers JA. (2005) 'Variable step-size sign natural gradient algorithm for sequential blind source separation'. IEEE Signal Processing Letters, 12 (8), pp. 589-592.
  • Wang W, Sanei S, Chambers JA. (2005) 'Penalty function-based joint diagonalization approach for convolutive blind separation of nonstationary sources'. IEEE Transactions on Signal Processing, 53 (5), pp. 1654-1669.
  • Shoker L, Sanei S, Wang W, Chambers JA. (2005) 'Removal of eye blinking artifact from the electro-encephalogram, incorporating a new constrained blind source separation algorithm.'. Med Biol Eng Comput, England: 43 (2), pp. 290-295.
  • Wang W, Jafari M, Sanei S, Chambers J. (2004) 'Blind Separation of Convolutive Mixtures of Cyclostationary Signals'. International Journal of Adaptive Control and Signal Processing, 18 (3), pp. 279-298.

Conference papers

  • Barnard M, Wang W, Kittler J, Naqvi SM, Chambers JA. 'A dictionary learning approach to tracking'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 981-984.
  • Zermini A, Liu Q, Xu Y, Plumbley MD, Betts D, Wang W. (2017) 'Binaural and Log-Power Spectra Features with Deep Neural Networks for Speech-Noise Separation'. IEEE Proceedings of MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing, Putteridge Bury, Luton, Bedfordshire, UK: MMSP 2017 - IEEE 19th International Workshop on Multimedia Signal Processing
    [ Status: Accepted ]

    Abstract

    Binaural features of interaural level difference and interaural phase difference have proved to be very effective in training deep neural networks (DNNs), to generate time-frequency masks for target speech extraction in speech-speech mixtures. However, the effectiveness of binaural features is reduced in more common speech-noise scenarios, since the noise may over-shadow the speech in adverse conditions. In addition, the reverberation also decreases the sparsity of binaural features and therefore adds difficulties to the separation task. To address the above limitations, we highlight the spectral difference between speech and noise spectra and incorporate the log-power spectra features to extend the DNN input. Tested on two different reverberant rooms at different signal to noise ratios (SNR), our proposed method shows advantages over the baseline method using only binaural features in terms of signal to distortion ratio (SDR) and Short-Time Objective Intelligibility (STOI).

  • Chen M, Wang W, Barnard M, Chambers J. (2017) 'Wideband DoA Estimation Based on Joint Optimisation of Array and Spatial Sparsity'. Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos Island, Greece: 2017 25th European Signal Processing Conference (EUSIPCO)
    [ Status: Accepted ]

    Abstract

    We study the problem of wideband direction of arrival (DoA) estimation by joint optimisation of array and spatial sparsity. A two-step iterative process is proposed. In the first step, the wideband signal is reshaped and used as the input to derive the weight coefficients using a sparse array optimisation method. The weights are then used to scale the observed signal model, for which a compressive sensing based spatial sparsity optimisation method is used for DoA estimation. Simulations are provided to demonstrate the performance of the proposed method for both stationary and moving sources.

  • Liu Q, Wang W, Jackson PJB, Tang Y. (2017) 'A Perceptually-Weighted Deep Neural Network for Monaural Speech Enhancement in Various Background Noise Conditions'. Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos Island, Greece: 2017 25th European Signal Processing Conference (EUSIPCO)
    [ Status: Accepted ]

    Abstract

    Deep neural networks (DNN) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However, in the DNN training process, the perceptual difference between different components of the DNN output is not fully exploited, where equal importance is often assumed. To address this limitation, we have proposed a new perceptually-weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function, and has been tested on two types of output features: spectra and ideal ratio masks. Objective evaluations for both speech quality and speech intelligibility have been performed. Integration of our perceptual weight shows consistent improvement on several noise levels and a variety of different noise types.
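
A minimal sketch of the weighting idea (shapes, network and weights are assumptions for illustration; the paper's actual perceptual weight and its integration with spectra or ideal-ratio-mask targets are more involved):

```python
# Replace plain MSE with a weighted MSE so errors in perceptually important
# time-frequency bins cost more during training.
import torch

def perceptually_weighted_mse(pred, target, w):
    """w has the same shape as pred/target; larger where a bin matters more."""
    return torch.mean(w * (pred - target) ** 2)

# Usage with a toy feed-forward enhancer mapping noisy to clean log-spectra.
net = torch.nn.Sequential(
    torch.nn.Linear(257, 512), torch.nn.ReLU(), torch.nn.Linear(512, 257))
noisy, clean = torch.randn(8, 257), torch.randn(8, 257)
w = torch.rand(8, 257) + 0.5          # hypothetical perceptual weights
loss = perceptually_weighted_mse(net(noisy), clean, w)
loss.backward()
```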

  • Guan J, Wang X, Feng P, Dong J, Wang W. (2017) 'Matrix of Polynomials Model based Polynomial Dictionary Learning Method for Acoustic Impulse Response Modeling'. Proceedings of Interspeech 2017, Stockholm, Sweden: 18th Annual Conference of the International Speech Communication Association (Interspeech 2017)
    [ Status: Accepted ]

    Abstract

    We study the problem of dictionary learning for signals that can be represented as polynomials or polynomial matrices, such as convolutive signals with time delays or acoustic impulse responses. Recently, we developed a method for polynomial dictionary learning based on the fact that a polynomial matrix can be expressed as a polynomial with matrix coefficients, where the coefficient of the polynomial at each time lag is a scalar matrix. However, a polynomial matrix can equally be represented as a matrix with polynomial elements. In this paper, we develop an alternative method for learning a polynomial dictionary and a sparse representation method for polynomial signal reconstruction based on this model. The proposed methods can be used directly to operate on the polynomial matrix without having to access its coefficient matrices. We demonstrate the performance of the proposed method for acoustic impulse response modeling.

  • Xu Y, Kong Q, Huang Q, Wang W, Plumbley MD. (2017) 'Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging'. Proceedings of Interspeech 2017, Stockholm, Sweden: Interspeech 2017
    [ Status: Accepted ]

    Abstract

    Audio tagging aims to perform multi-label classification on audio chunks and is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurring acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set, with the equal error rate (EER) reduced from 0.13 to 0.11 compared with the convolutional recurrent baseline system.
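
    One common realisation of the attention idea described above is attention-weighted pooling of frame-level predictions. The sketch below is a plausible reading of such a scheme (softmax attention over time), not necessarily the exact module used in the paper:

        import numpy as np

        def attention_pooling(frame_probs, att_logits):
            # frame_probs: (n_frames, n_tags) frame-level tag probabilities
            # att_logits:  (n_frames, n_tags) unnormalised attention scores
            att = np.exp(att_logits - att_logits.max(axis=0, keepdims=True))
            att /= att.sum(axis=0, keepdims=True)         # softmax over time
            clip_probs = (att * frame_probs).sum(axis=0)  # chunk-level tags
            return clip_probs, att   # peaks in att localise the tagged events

        rng = np.random.default_rng(0)
        probs, att = attention_pooling(rng.random((240, 8)),
                                       rng.standard_normal((240, 8)))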

  • Zermini A, Wang W, Kong Q, Xu Y, Plumbley MD. (2017) 'Audio source separation with deep neural networks using the dropout algorithm'. Instituto de Telecomunicações Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017 Book of Abstracts, Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal: Signal Processing with Adaptive Sparse Structured Representations (SPARS) 2017

    Abstract

    A method based on Deep Neural Networks (DNNs) and time-frequency masking has recently been developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft time-frequency masks for the recovery/estimation of the individual audio sources. In this paper, an algorithm called 'dropout' is applied to the hidden layers, affecting the sparsity of hidden-unit activations: randomly selected neurons and their connections are dropped during the training phase, preventing feature co-adaptation. These methods are evaluated on binaural mixtures generated with Binaural Room Impulse Responses (BRIRs), accounting for a certain level of room reverberation. The results show that the proposed DNN system with randomly deleted neurons achieves higher SDR performance than the baseline method without the dropout algorithm.
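
    A minimal sketch of the dropout operation itself (inverted dropout on a hidden-layer activation), to make the mechanism concrete; the DNN architecture and dropout rates used in the paper are not reproduced here:

        import numpy as np

        def dropout(h, p, rng, training=True):
            # During training, zero each unit with probability p and rescale
            # the survivors by 1/(1-p) so the expected activation is unchanged;
            # at test time the layer is left untouched.
            if not training:
                return h
            mask = (rng.random(h.shape) >= p) / (1.0 - p)
            return h * mask

        rng = np.random.default_rng(0)
        hidden = rng.standard_normal((16, 128))    # a batch of activations
        hidden = dropout(hidden, p=0.5, rng=rng)   # sparser activations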

  • Liu Y, Wang W, Chambers J, Kilic V, Hilton ADM. (2017) 'Particle flow SMC-PHD filter for audio-visual multi-speaker tracking'. Latent Variable Analysis and Signal Separation, Grenoble, France: 13th International Conference on Latent Variable Analysis and Signal Separation, pp. 344-353.

    Abstract

    Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering has recently been exploited for audio-visual (AV) based tracking of multiple speakers, where audio data are used to inform the particle distribution and propagation in the visual SMC-PHD filter. However, the performance of the AV-SMC-PHD filter can be affected by the mismatch between the proposal and the posterior distribution. In this paper, we present a new method to improve the particle distribution, where audio information (i.e. DOA angles derived from microphone array measurements) is used to detect newborn particles and visual information (i.e. histograms) is used to modify the particles with particle flow (PF). Using particle flow has the benefit of migrating particles smoothly from the prior to the posterior distribution. We compare the proposed algorithm with the baseline AV-SMC-PHD algorithm using experiments on the AV16.3 dataset with multi-speaker sequences.

  • Kong Q, Xu Y, Wang W, Plumbley MD. (2017) 'A joint detection-classification model for audio tagging of weakly labelled data'. Proceedings of ICASSP 2017, New Orleans, USA: ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing
    [ Status: Accepted ]

    Abstract

    Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence times of the tags. The labelling of an audio clip is often based on the audio events in the clip, and no event-level label is provided to the user. Previous works using the bag-of-frames model assume the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative sounds and ignore uninformative ones. Only the informative regions are then used for classification. Experimental results on the “CHiME Home” dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing event-level labels.

  • Hamon R, Rencker L, Emiya V, Wang W, Plumbley MD. (2017) 'Assessment of musical noise using localization of isolated peaks in time-frequency domain'. ICASSP2017 Proceedings, New Orleans, USA: The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP2017
    [ Status: Accepted ]

    Abstract

    Musical noise is a recurrent issue that appears in spectral techniques for denoising or blind source separation. Due to localised errors of estimation, isolated peaks may appear in the processed spectrograms, resulting in annoying tonal sounds after synthesis known as “musical noise”. In this paper, we propose a method to assess the amount of musical noise in an audio signal, by characterising the impact of these artificial isolated peaks on the processed sound. It turns out that because of the constraints between STFT coefficients, the isolated peaks are described as time-frequency “spots” in the spectrogram of the processed audio signal. The quantification of these “spots”, achieved through the adaptation of a method for localisation of significant STFT regions, allows for an evaluation of the amount of musical noise. We believe that this will pave the way to an objective measure and a better understanding of this phenomenon.
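
    The intuition behind the assessment, isolated T-F peaks standing far above their neighbourhood, can be sketched with a local median comparison. The window and threshold below are arbitrary choices for illustration; this is not the paper's actual measure:

        import numpy as np
        from scipy.ndimage import median_filter

        def isolated_peak_score(stft, size=(3, 3), thresh_db=10.0):
            # Flag bins whose power exceeds the local median by thresh_db;
            # the fraction of flagged bins is a crude "musical noise" score.
            power = np.abs(stft) ** 2
            local = median_filter(power, size=size)
            spots = 10.0 * np.log10((power + 1e-12) / (local + 1e-12)) > thresh_db
            return spots.mean(), spots

        rng = np.random.default_rng(0)
        score, spots = isolated_peak_score(rng.standard_normal((257, 100)))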

  • Xu Y, Kong Q, Huang Q, Wang W, Plumbley MD. (2017) 'Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging'. IJCNN 2017 Conference Proceedings, Anchorage, Alaska: The 2017 International Joint Conference on Neural Networks (IJCNN 2017)
    [ Status: Accepted ]

    Abstract

    Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
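
    A toy convolutional gated recurrent tagger in the spirit of the abstract (CNN front end, GRU for long-term temporal structure, sigmoid tag outputs). Layer sizes here are invented for illustration, and the auxiliary spatial-feature CNN branch is omitted:

        import torch
        import torch.nn as nn

        class CRNNTagger(nn.Module):
            def __init__(self, n_mels=40, n_tags=7):
                super().__init__()
                # CNN extracts local spectral features; pool frequency only
                self.cnn = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d((2, 1)))
                # bidirectional GRU models the temporal structure
                self.gru = nn.GRU(16 * (n_mels // 2), 64,
                                  batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * 64, n_tags)

            def forward(self, x):              # x: (batch, 1, n_mels, frames)
                h = self.cnn(x)                # (batch, 16, n_mels//2, frames)
                b, c, f, t = h.shape
                h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
                h, _ = self.gru(h)             # (batch, frames, 128)
                return torch.sigmoid(self.out(h.mean(dim=1)))  # chunk tags

        tags = CRNNTagger()(torch.randn(2, 1, 40, 100))        # shape (2, 7)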

  • Liu Y, Wang W, Zhao Y. (2017) 'Particle Flow for Sequential Monte Carlo Implementation of Probability Hypothesis Density'. Proceedings of ICASSP 2017, New Orleans, USA: 42nd IEEE ICASSP
    [ Status: Accepted ]

    Abstract

    Target tracking is a challenging task for which no analytical solution is generally available, especially in multi-target tracking systems. To address this problem, the probability hypothesis density (PHD) filter is used, propagating the PHD instead of the full multi-target posterior. Recently, the particle flow filter based on log-homotopy has provided a new way of performing state estimation. In this paper, we propose a novel sequential Monte Carlo (SMC) implementation of the PHD filter assisted by particle flow (PF), called the PF-SMC-PHD filter. Experimental results show that our proposed filter has higher accuracy than the SMC-PHD filter and is computationally cheaper than the Gaussian mixture PHD (GM-PHD) filter.

  • Huang Q, Xu Y, Jackson PJB, Wang W, Plumbley MD. (2017) 'Fast Tagging of Natural Sounds Using Marginal Co-regularization'. Proceedings of ICASSP2017, New Orleans, USA: ICASSP2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing
    [ Status: Accepted ]

    Abstract

    Automatic and fast tagging of natural sounds in audio collections is a very challenging task, due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and augmenting the tags with additional relevant tags. Training the classifiers is implemented using marginal co-regularization, which draws the two classifiers into agreement by a joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.

  • Rencker L, Wang W, Plumbley MD. (2017) 'A greedy algorithm with learned statistics for sparse signal reconstruction'. ICASSP 2017 Proceedings, New Orleans, USA: The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2017
    [ Status: Accepted ]

    Abstract

    We address the problem of sparse signal reconstruction from a few noisy samples. Recently, a Covariance-Assisted Matching Pursuit (CAMP) algorithm has been proposed, improving the sparse coefficient update step of the classic Orthogonal Matching Pursuit (OMP) algorithm. CAMP allows the a-priori mean and covariance of the non-zero coefficients to be considered in the coefficient update step. In this paper, we analyze CAMP, which leads to a new interpretation of the update step as a maximum-a-posteriori (MAP) estimation of the non-zero coefficients at each step. We then propose to leverage this idea, by finding a MAP estimate of the sparse reconstruction problem, in a greedy OMP-like way. Our approach allows the statistical dependencies between sparse coefficients to be modelled, while keeping the practicality of OMP. Experiments show improved performance when reconstructing the signal from a few noisy samples.
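
    For reference, a compact version of the classic OMP loop that the abstract starts from. The proposed method replaces the least-squares coefficient update below with a MAP update using the learned mean and covariance, which is not reproduced here:

        import numpy as np

        def omp(D, x, k):
            # Greedily pick the atom most correlated with the residual, then
            # refit all selected coefficients by least squares.
            support, residual = [], x.copy()
            for _ in range(k):
                support.append(int(np.argmax(np.abs(D.T @ residual))))
                coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
                residual = x - D[:, support] @ coef
            z = np.zeros(D.shape[1])
            z[support] = coef
            return z

        rng = np.random.default_rng(1)
        D = rng.standard_normal((32, 64))
        D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
        x = D[:, [3, 10]] @ np.array([1.0, -0.5])  # 2-sparse test signal
        z = omp(D, x, k=2)                         # recovers the support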

  • Zermini A, Yu Y, Xu Y, Plumbley MD, Wang W. (2016) 'Deep neural network based audio source separation'. Institute of Mathematics & its Applications (IMA) Proceedings of the 11th IMA International Conference on Mathematics in Signal Processing, IET Austin Court, Birmingham, UK: 11th IMA International Conference on Mathematics in Signal Processing

    Abstract

    Audio source separation aims to extract individual sources from mixtures of multiple sound sources. Many techniques have been developed, such as independent component analysis, computational auditory scene analysis, and non-negative matrix factorisation. A method based on Deep Neural Networks (DNNs) and time-frequency (T-F) masking has recently been developed for binaural audio source separation. In this method, the DNNs are used to predict the Direction Of Arrival (DOA) of the audio sources with respect to the listener, which is then used to generate soft T-F masks for the recovery/estimation of the individual audio sources.

  • Liu Q, Yang T, Jackson PJB, Wang W. (2016) 'Predicting binaural speech intelligibility from signals estimated by a blind source separation algorithm'. INTERSPEECH 2016 Proceedings, San Francisco, US: INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association
    [ Status: Accepted ]

    Abstract

    State-of-the-art binaural objective intelligibility measures (OIMs) require individual source signals for making intelligibility predictions, limiting their usability in real-time online operations. This limitation may be addressed by a blind source separation (BSS) process, which is able to extract the underlying sources from a mixture. In this study, a speech source is presented with either a stationary noise masker or a fluctuating noise masker whose azimuth varies in a horizontal plane, at two speech-to-noise ratios (SNRs). Three binaural OIMs are used to predict speech intelligibility from the signals separated by a BSS algorithm. The model predictions are compared with listeners' word identification rate in a perceptual listening experiment. The results suggest that with SNR compensation to the BSS-separated speech signal, the OIMs can maintain their predictive power for individual maskers compared to their performance measured from the direct signals. It also reveals that the errors in SNR between the estimated signals are not the only factors that decrease the predictive accuracy of the OIMs with the separated signals. Artefacts or distortions on the estimated signals caused by the BSS algorithm may also be concerns.

  • Xu Y, Huang Q, Wang W, Plumbley MD. (2016) 'Hierarchical Learning for DNN-Based Acoustic Scene Classification'. Tampere University of Technology Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary: DCASE2016 Workshop (Workshop on Detection and Classification of Acoustic Scenes and Events), pp. 105-109.

    Abstract

    In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement in average scene classification error compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.

  • Xu Y, Huang Q, Wang W, Jackson PJB, Plumbley MD. (2016) 'Fully DNN-based Multi-label regression for audio tagging'. Tampere University of Technology Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary: DCASE2016 Workshop (Workshop on Detection and Classification of Acoustic Scenes and Events), pp. 110-114.

    Abstract

    Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is a time-consuming task, which means that few annotated resources are available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data. This is highly desirable for practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression manner. Considering that only chunk-level rather than frame-level labels are available, all or nearly all frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The DNN, regarded as an encoding function, effectively maps the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as dropout and background-noise-aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method can better utilize the long-term temporal information, with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.

  • Font F, Brookes TS, Fazekas G, Guerber M, La Burthe A, Plans D, Plumbley M, Shaashua M, Wang W, Serra X. (2016) 'Audio Commons: bringing Creative Commons audio content to the creative industries'. London, UK: 61st International Conference: Audio for Games

    Abstract

    Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of this content in their media productions. There is still a lack of familiarity and understanding of the legal context of all this open content, but there are also problems related to its accessibility. A large percentage of this content remains unreachable, either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies with which to support the ecosystem composed of content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration into the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.

  • Kong Q, Sobieraj I, Wang W, Plumbley MD. (2016) 'Deep Neural Network Baseline for DCASE Challenge 2016'. Proceedings of DCASE 2016, Budapest, Hungary: Detection and Classification of Acoustic Scenes and Events 2016
    [ Status: Accepted ]

    Abstract

    The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 81.0% using Mel + DNN against 77.2% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F-value of 12.6% using Mel + DNN against 37.0% using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained an F-value of 36.3% using Mel + DNN against 23.7% using MFCCs + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 18.9% using Mel + DNN against 20.9% using MFCCs + GMM. The DNN therefore improves on the baseline in Tasks 1, 3 and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.

  • Remaggi L, Jackson PJB, Coleman P, Wang W. (2014) 'Room boundary estimation from acoustic room impulse responses'. Edinburgh, UK: IEEE Proc. Sensor Signal Processing for Defence (SSPD 2014), Edinburgh: Sensor Signal Processing for Defence (SSPD 2014), pp. 1-5.
  • Dong J, Wang W, Dai W. (2014) 'Analysis SimCO: A new algorithm for analysis dictionary learning'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 7193-7197.
  • Dong J, Wang W. (2014) 'Analysis dictionary learning based on Nesterov's gradient with application to SAR image despeckling'. ISCCSP 2014 - 2014 6th International Symposium on Communications, Control and Signal Processing, Proceedings, pp. 501-504.
  • Zubair S, Wang W, Chambers JA. (2014) 'Discriminative tensor dictionaries and sparsity for speaker identification'. 2014 4th Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, HSCMA 2014, pp. 37-41.
  • Popa V, Wang W, Alinaghi A. (2013) 'Underdetermined model-based blind source separation of reverberant speech mixtures using spatial cues in a variational bayesian framework'. IET IET Conference Publications, London: IET Intelligent Signal Processing Conference 2013 (619 CP), pp. 1-6.

    Abstract

    In this paper, we propose a new method for underdetermined blind source separation of reverberant speech mixtures by classifying each time-frequency (T-F) point of the mixtures according to a combined variational Bayesian model of spatial cues, under the sparse signal representation assumption. We model the T-F observations by a variational mixture of circularly-symmetric complex Gaussians. The spatial cues, e.g. interaural level difference (ILD), interaural phase difference (IPD) and mixing vector cues, are modelled by a variational mixture of Gaussians. We then establish appropriate conjugate prior distributions for the parameters of all the mixtures to create a variational Bayesian framework. Using the Bayesian approach we then iteratively estimate the hyper-parameters of the prior distributions by optimizing the variational posterior distribution. The main advantage of this approach is that no prior knowledge of the number of sources is needed: the number is determined automatically by the algorithm. The proposed approach does not suffer from the overfitting problem, unlike the Expectation-Maximization (EM) algorithm, and is therefore not sensitive to initialization.

  • Alinaghi A, Jackson PJB, Wang W. (2013) 'Comparison between the statistical cues in BSS techniques and Binaural cues in CASA approaches for reverberant speech separation'. IET Conference Publications, 2013 (619 CP)

    Abstract

    Reverberant speech source separation has been of great interest for over a decade, leading to two major approaches. One of them is based on statistical properties of the signals and the mixing process, and is known as blind source separation (BSS). The other approach, computational auditory scene analysis (CASA), is inspired by the human auditory system and exploits monaural and binaural cues. In this paper these two approaches are studied and compared in more depth.

  • Kilic V, Barnard M, Wang W, Kittler J. (2013) 'Audio constrained particle filter based visual tracking'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 3627-3631.
  • Barnard M, Wang W, Kittler J. (2013) 'Audio head pose estimation using the direct to reverberant speech ratio'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 8056-8060.
  • Alinaghi A, Wang W, Jackson PJB. (2013) 'Spatial and coherence cues based time-frequency masking for binaural reverberant speech separation'. Vancouver, Canada: IEEE Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), pp. 4-4.

    Abstract

    Most binaural source separation algorithms only consider the dissimilarities between the recorded mixtures, such as interaural phase and level differences (IPD, ILD), to classify and assign the time-frequency (T-F) regions of the mixture spectrograms to each source. However, in this paper we show that the coherence between the left and right recordings can provide extra information to label the T-F units from the sources. This also reduces the effect of reverberation, which contains random reflections from different directions showing low correlation between the sensors. Our algorithm assigns the T-F regions to the original sources based on a weighted combination of IPD, ILD, the observation vector models and the estimated interaural coherence (IC) between the left and right recordings. The binaural room impulse responses measured in four rooms with various acoustic conditions have been used to evaluate the performance of the proposed method, which shows an improvement of more than 1.4 dB in signal-to-distortion ratio (SDR) in room D with T60 = 0.89 s over the state-of-the-art algorithms.
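
    The extra cue introduced above, interaural coherence, can be estimated per frequency bin from the left/right STFTs as sketched below. This time-averaged estimate is a simplification; the paper evaluates the cue per T-F region and combines it with IPD/ILD models:

        import numpy as np

        def interaural_coherence(stft_l, stft_r, eps=1e-12):
            # Magnitude-squared coherence per frequency bin, averaged over
            # frames: near 1 for coherent direct sound, lower for diffuse
            # reverberation with little left/right correlation.
            cross = (stft_l * np.conj(stft_r)).mean(axis=1)
            p_l = (np.abs(stft_l) ** 2).mean(axis=1)
            p_r = (np.abs(stft_r) ** 2).mean(axis=1)
            return np.abs(cross) ** 2 / (p_l * p_r + eps)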

  • Zhong X, Premkumar AB, Mohammadi A, Asif A, Wang W. (2013) 'Acoustic source tracking in a reverberant environment using a pairwise synchronous microphone network'. IEEE Proceedings of the 16th International Conference on Information Fusion, FUSION 2013, Istanbul: 16th International Conference on Information Fusion, pp. 953-960.
  • Liu Q, Wang W. (2013) 'Show-through removal for scanned images using non-linear NMF with adaptive smoothing'. 2013 IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP 2013 - Proceedings, pp. 650-654.
  • Barnard M, Wang W, Kittler J, Naqvi SM, Chambers J. (2013) 'Audio-visual face detection for tracking in a meeting room environment'. Proceedings of the 16th International Conference on Information Fusion, FUSION 2013, pp. 1222-1227.
  • Chen X, Alinaghi A, Wang W, Zhong X. (2013) 'Acoustic vector sensor based speech source separation with mixed Gaussian-Laplacian distributions'. IEEE 2013 18th International Conference on Digital Signal Processing, DSP 2013, Fira: 18th International Conference on Digital Signal Processing, pp. 1-5.

    Abstract

    The acoustic vector sensor (AVS) based convolutive blind source separation problem has recently been addressed under the framework of probabilistic time-frequency (T-F) masking, where both the DOA and the mixing vector cues are modelled by Gaussian distributions. In this paper, we show that the distributions of these cues vary with room acoustics, such as reverberation. Motivated by this observation, we propose a mixed model of Laplacian and Gaussian distributions to provide a better fit for these cues. The parameters of the mixed model are estimated and refined iteratively by an expectation-maximization (EM) algorithm. Experiments performed on speech mixtures in simulated room environments show that the mixed model offers average improvements of about 0.68 dB and 1.18 dB in signal-to-distortion ratio (SDR) over the Gaussian and Laplacian models, respectively. © 2013 IEEE.

  • Zhong X, Premkumar AB, Chen X, Wang W, Alinaghi A. (2013) 'Acoustic vector sensor based reverberant speech separation with probabilistic time-frequency masking'. IEEE European Signal Processing Conference, Marrakech: 21st European Signal Processing Conference, pp. 1-5.

    Abstract

    Most existing speech source separation algorithms have been developed for separating sound mixtures acquired using a conventional microphone array. In contrast, little attention has been paid to the problem of source separation using an acoustic vector sensor (AVS). We propose a new method for the separation of convolutive mixtures by incorporating the intensity vector of the acoustic field, obtained using spatially co-located microphones, which carries the direction of arrival (DOA) information. The DOA cues from the intensity vector, together with the frequency bin-wise mixing vector cues, are then used to determine the probability of each time-frequency (T-F) point of the mixture being dominated by a specific source, based on Gaussian mixture models (GMMs), whose parameters are evaluated and refined iteratively using an expectation-maximization (EM) algorithm. Finally, the probability is used to derive the T-F masks for recovering the sources. The proposed method is evaluated in simulated reverberant environments in terms of signal-to-distortion ratio (SDR), giving an average improvement of approximately 1.5 dB compared with a related T-F mask approach based on a conventional microphone setting. © 2013 EURASIP.
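
    A minimal sketch of the DOA cue from an AVS: the active intensity vector is the real part of pressure times conjugate particle velocity, and its azimuth can be read off per T-F bin. The channel layout and any calibration are assumptions here, not the paper's processing chain:

        import numpy as np

        def intensity_azimuth(p, vx, vy):
            # p, vx, vy: complex STFTs of the pressure and the x/y velocity
            # channels of the co-located sensor, all of shape (freq, time).
            ix = np.real(p * np.conj(vx))     # active intensity, x component
            iy = np.real(p * np.conj(vy))     # active intensity, y component
            return np.arctan2(iy, ix)         # per-bin DOA azimuth (radians)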

  • Zhong X, Premkumar A, Wang W. (2013) 'Direction of arrival tracking of an underwater acoustic source using particle filtering: Real data experiments'. IEEE 2013 Tencon - Spring, TENCONSpring 2013 - Conference Proceedings, Sydney, Australia: TENCON Spring Conference 2013, pp. 420-424.

    Abstract

    In the past, both the theoretical basis and the practical implementation of the particle filtering (PF) method have been extensively studied. However, its application in underwater signal processing has received much less attention. This paper introduces the PF approach for underwater acoustic signal processing; in particular, we are interested in direction of arrival (DOA) estimation using PF, and a detailed introduction from this perspective is presented. Since noise usually spreads the mainlobe of the likelihood function and causes problems in the subsequent particle resampling step, an exponentially weighted likelihood model is developed to emphasize particles in more relevant areas, so that the effect of background noise can be reduced. Real underwater acoustic data collected in the SWellEx-96 experiment are employed to demonstrate the performance of the proposed PF approaches for underwater DOA tracking. © 2013 IEEE.

  • Zhang Y, Wang H, Wang W, Sanei S. (2013) 'K-plane clustering algorithm for analysis dictionary learning'. IEEE International Workshop on Machine Learning for Signal Processing, MLSP
  • Shapoori S, Wang W, Sanei S. (2013) 'A constrained approach for extraction of pre-ictal discharges from scalp EEG'. IEEE International Workshop on Machine Learning for Signal Processing, MLSP, Southampton: International Workshop on Machine Learning for Signal Processing (MLSP)
  • Zubair S, Wang W. (2013) 'Tensor dictionary learning with sparse tucker decomposition'. IEEE 2013 18th International Conference on Digital Signal Processing, DSP 2013, Fira: 18th International Conference on Digital Signal Processing, pp. 1-6.

    Abstract

    Dictionary learning algorithms are typically derived for dealing with one- or two-dimensional signals using vector-matrix operations. Little attention has been paid to the problem of dictionary learning over high-dimensional tensor data. We propose a new algorithm for dictionary learning based on tensor factorization using a Tucker model. In this algorithm, sparseness constraints are applied to the core tensor, whose n-mode factors are learned from the input data in an alternating minimization manner using gradient descent. Simulations are provided to show the convergence and reconstruction performance of the proposed algorithm. We also apply our algorithm to the speaker identification problem and compare the discriminative ability of the dictionaries learned with those of the Tucker and K-SVD algorithms. The results show that the classification performance of the dictionaries learned by our proposed algorithm is considerably better than that of the two state-of-the-art algorithms. © 2013 IEEE.

  • Kilic V, Barnard M, Wang W, Kittler J. (2013) 'Adaptive particle filtering approach to audio-visual tracking'. IEEE European Signal Processing Conference, Marrakech: 21st European Signal Processing Conference, pp. 1-5.

    Abstract

    Particle filtering has emerged as a useful tool for tracking problems. However, the efficiency and accuracy of the filter usually depend on the number of particles and the noise variance used in the estimation and propagation functions for re-allocating these particles at each iteration. Both of these parameters are specified beforehand and kept fixed in the regular implementation of the filter, which makes the tracker unstable in practice. In this paper we are interested in the design of a particle filtering algorithm which is able to adapt the number of particles and the noise variance. The new filter, which is based on audio-visual (AV) tracking, uses information from the tracking errors to modify the number of particles and the noise variance used. Its performance is compared with a previously proposed audio-visual particle filtering algorithm with a fixed number of particles, and with an existing adaptive particle filtering algorithm, using the AV16.3 dataset with single- and multi-speaker sequences. Our proposed approach demonstrates good tracking performance with a significantly reduced number of particles. © 2013 EURASIP.

  • Zhao X, Zhou G, Dai W, Xu T, Wang W. (2013) 'Joint image separation and dictionary learning'. 2013 18th International Conference on Digital Signal Processing, DSP 2013
  • Ye Z, Wang H, Yu T, Wang W. (2013) 'Subset pursuit for analysis dictionary learning'. European Signal Processing Conference
  • Liu Q, Wang W, Jackson PJB, Barnard M. (2012) 'Reverberant Speech Separation Based on Audio-visual Dictionary Learning and Binaural Cues'. IEEE Proc. of IEEE Statistical Signal Processing Workshop (SSP), Ann Arbor, USA: IEEE Statistical Signal Processing Workshop (SSP), pp. 664-667.

    Abstract

    Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to obtain the audio mask in the time-frequency (TF) domain, for source separation of binaural mixtures. Those models are, however, often degraded by acoustic noise. In contrast, the video stream contains relevant information about the synchronous audio stream that is not affected by acoustic noise. In this paper, we present a novel method for modeling the audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal based on the learnt AV dictionary, and incorporated with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database, and observed considerable performance improvement for noise corrupted signals.

  • Zhao X, Zhou G, Dai W, Wang W. (2012) 'Weighted SimCO: A novel algorithm for dictionary update'. IET Seminar Digest, London: Sensor Signal Processing for Defence (SSPD 2012) 2012 (3), pp. 1-5.
  • Dai W, Xu T, Wang W. (2012) 'Dictionary learning and update based on simultaneous codeword optimization (SimCO)'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 2037-2040.

    Abstract

    Dictionary learning aims to adapt elementary codewords directly from training data so that each training signal can be best approximated by a linear combination of only a few codewords. Following the two-stage iterative process of sparse coding and dictionary update that is commonly used, for example, in the MOD and K-SVD algorithms, we propose a novel framework that allows an arbitrary set of codewords and the corresponding sparse coefficients to be updated simultaneously, hence termed simultaneous codeword optimization (SimCO). Under this framework, we have developed two algorithms, namely the primitive and the regularized SimCO. Simulations are provided to show the advantages of our approach over the K-SVD algorithm in terms of both learning performance and running speed. © 2012 IEEE.
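
    For context, the conventional two-stage loop that SimCO generalises can be sketched as below, with a crude one-shot sparse coder and a MOD-style least-squares update. This is only the baseline scheme; SimCO's simultaneous codeword/coefficient optimisation itself is not reproduced:

        import numpy as np

        def sparse_code(D, x, k):
            # Crude coder: keep the k atoms most correlated with x, refit by LS
            idx = np.argsort(np.abs(D.T @ x))[-k:]
            coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
            z = np.zeros(D.shape[1])
            z[idx] = coef
            return z

        def dictionary_learning(X, n_atoms, k, n_iter=20, seed=0):
            rng = np.random.default_rng(seed)
            D = rng.standard_normal((X.shape[0], n_atoms))
            D /= np.linalg.norm(D, axis=0)
            for _ in range(n_iter):
                Z = np.column_stack([sparse_code(D, x, k) for x in X.T])
                D = X @ np.linalg.pinv(Z)                 # MOD-style update
                D /= np.linalg.norm(D, axis=0) + 1e-12
            return D, Z

        rng = np.random.default_rng(0)
        X = rng.standard_normal((32, 500))                # training signals
        D, Z = dictionary_learning(X, n_atoms=64, k=3)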

  • Jan T, Wang W. (2012) 'Joint blind dereverberation and separation of speech mixtures'. 2012 EUSIPCO European Signal Processing Conference Proceedings, 20th European Signal Processing Conference, pp. 2343-2347.

    Abstract

    This paper proposes a method for jointly performing blind source separation (BSS) and blind dereverberation (BD) of speech mixtures. In most previous studies, BSS and BD have been explored separately. It is common for the performance of speech separation algorithms to deteriorate with increasing room reverberation. Also, most dereverberation algorithms rely on the availability of room impulse responses (RIRs), which are not readily accessible in practice. Therefore, in this work the dereverberation and separation methods are combined to mitigate the effects of room reverberation on the speech mixtures and hence improve the separation performance. As required by the dereverberation algorithm, a step for blind estimation of the reverberation time (RT) is used to estimate the decay rate of the reverberation directly from the reverberant speech signal (i.e., the speech mixtures), by modeling the decay as a Laplacian random process modulated by a deterministic envelope. Hence the developed algorithm works in a blind manner, i.e., it deals directly with the reverberant speech signals without explicit information about the RIRs. Evaluation results in terms of signal-to-distortion ratio (SDR) and segmental signal-to-reverberation ratio (SegSRR) reveal that with this method the performance of the separation algorithm that we developed previously can be further enhanced. © 2012 EURASIP.

  • Jan T, Wang W. (2012) 'Frequency dependent statistical model for the suppression of late reverberations'. IET Seminar Digest, London, UK: Sensor Signal Processing for Defence (SSPD 2012) 2012 (3)

    Abstract

    Suppression of late reverberation is a challenging problem in reverberant speech enhancement. A promising recent approach to this problem is to apply a spectral subtraction mask to the spectrum of the reverberant speech, where the spectral variance of the late reverberation is estimated based on a frequency-independent statistical model of the decay rate of the late reverberation. In this paper, we develop a dereverberation algorithm following a similar process. Instead of using the frequency-independent model, however, we estimate the frequency-dependent reverberation time and decay rate, and use them to estimate the spectral subtraction mask. In order to remove processing artifacts, the mask is further filtered by a smoothing function, and then applied to reduce the late reverberation in the reverberant speech. The performance of the proposed algorithm, measured by the segmental signal-to-reverberation ratio (SegSRR) and the signal-to-distortion ratio (SDR), is evaluated on both simulated and real data. Compared with the related frequency-independent algorithm, the proposed algorithm offers considerable performance improvement.
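
    A rough sketch of the frequency-dependent spectral subtraction idea: predict the late-reverberant power from past frames using a per-bin decay derived from a per-bin RT60, then build a floored subtraction gain. The decay model, constants and omitted smoothing step are simplified assumptions, not the paper's estimator:

        import numpy as np

        def late_reverb_mask(power, rt60_per_bin, hop_s, n_late=8, floor=0.1):
            # power: (freq, time) power spectrogram of the reverberant speech
            # rt60_per_bin: (freq,) frequency-dependent reverberation times (s)
            decay = 10.0 ** (-6.0 * hop_s / rt60_per_bin)  # power decay/frame
            lam = np.zeros_like(power)                     # late-reverb power
            lam[:, n_late:] = (decay[:, None] ** n_late) * power[:, :-n_late]
            # floored spectral subtraction gain per T-F bin
            return np.maximum(1.0 - lam / (power + 1e-12), floor)

        rng = np.random.default_rng(0)
        P = rng.random((257, 200)) + 1e-3
        mask = late_reverb_mask(P, rt60_per_bin=np.linspace(0.8, 0.4, 257),
                                hop_s=0.016)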

  • Jan T, Wang W. (2012) 'Blind reverberation time estimation based on Laplace distribution'. European Signal Processing Conference, Bucharest: 20th European Signalling Processing Conference, pp. 2050-2054.
  • Xu T, Wang W. (2011) 'Methods for learning adaptive dictionary in underdetermined speech separation'. IEEE Proceedings of MLSP2011, Beijing, China: 2011 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1-6.
  • Jan T, Wang W. (2011) 'Empirical mode decomposition for joint denoising and dereverberation'. European Signal Processing Conference, pp. 206-210.

    Abstract

    We propose a novel algorithm for the enhancement of noisy reverberant speech using empirical-mode-decomposition (EMD) based subband processing. The proposed algorithm is a one-microphone multistage algorithm. In the first step, noisy reverberant speech is decomposed adaptively into oscillatory components called intrinsic mode functions (IMFs) via an EMD algorithm. Denoising is then applied to selected high frequency IMFs using EMD-based minimum mean-squared error (MMSE) filter, followed by spectral subtraction of the resulting denoised high-frequency IMFs and low-frequency IMFs. Finally, the enhanced speech signal is reconstructed from the processed IMFs. The method was motivated by our observation that the noise and reverberations are disproportionally distributed across the IMF components. Therefore, different levels of suppression can be applied to the additive noise and reverberation in each IMF. This leads to an improved enhancement performance as shown in comparison to a related recent approach, based on the measurements by the signal-to-noise ratio (SNR). © 2011 EURASIP.

  • Liu Q, Wang W. (2011) 'Blind source separation and visual voice activity detection for target speech extraction'. IEEE Proceedings of 2011 3rd International Conference on Awareness Science and Technology, Dalian, China: iCAST 2011, pp. 457-460.

    Abstract

    Despite being studied extensively, the performance of blind source separation (BSS) is still limited especially for the sensor data collected in adverse environments. Recent studies show that such an issue can be mitigated by incorporating multimodal information into the BSS process. In this paper, we propose a method for the enhancement of the target speech separated by a BSS algorithm from sound mixtures, using visual voice activity detection (VAD) and spectral subtraction. First, a classifier for visual VAD is formed in the off-line training stage, using labelled features extracted from the visual stimuli. Then we use this visual VAD classifier to detect the voice activity of the target speech. Finally we apply a multi-band spectral subtraction algorithm to enhance the BSS-separated speech signal based on the detected voice activity. We have tested our algorithm on the mixtures generated artificially by the mixing filters with different reverberation times, and the results show that our algorithm improves the quality of the separated target signal. © 2011 IEEE.

  • Zubair S, Wang W. (2011) 'Audio classification based on sparse coefficients'. IET Seminar Digest, London, UK: Sensor Signal Processing for Defence (SSPD 2011) 2011 (4)
  • Naqvi S, Khan M, Chambers J, Liu Q, Wang W. (2011) 'Multimodal blind source separation with a circular microphone array and robust beamforming'. Proceedings of the 19th European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain: 19th European Signal Processing Conference (EUSIPCO-2011), pp. 1050-1054.

    Abstract

    A novel multimodal (audio-visual) approach to the problem of blind source separation (BSS) is evaluated in room environments. The main challenges of BSS in realistic environments are that: 1) sources move in complex motion patterns, and 2) the room impulse responses are long. For moving sources, the unmixing filters to separate the audio signals are difficult to calculate from only the statistical information available in a limited number of audio samples. For physically stationary sources measured in rooms with long impulse responses, the performance of audio-only BSS methods is limited. Therefore, the visual modality is utilized to facilitate the separation. The movement of the sources is detected with a 3-D tracker based on a Markov Chain Monte Carlo particle filter (MCMC-PF), and the direction of arrival information of the sources at the microphone array is estimated. A robust least squares frequency invariant data independent (RLSFIDI) beamformer is implemented to perform real-time speech enhancement. The uncertainties in source localization and direction of arrival information are also controlled by using a convex optimization approach in the beamformer design. A 16-element circular array configuration is used. Simulation studies based on objective and subjective measures confirm the advantage of beamforming-based processing over conventional BSS methods. © 2011 EURASIP.

  • Alinaghi A, Wang W, Jackson PJB. (2011) 'Integrating binaural cues and blind source separation method for separating reverberant speech mixtures'. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 209-212.

    Abstract

    This paper presents a new method for reverberant speech separation, based on the combination of binaural cues and blind source separation (BSS) for the automatic classification of the time-frequency (T-F) units of the speech mixture spectrogram. The main idea is to model the interaural phase difference, the interaural level difference and the frequency bin-wise mixing vectors by Gaussian mixture models for each source, and then to evaluate that model at each T-F point and assign the units with high probability to that source. The model parameters and the assigned regions are refined iteratively using the Expectation-Maximization (EM) algorithm. The proposed method also addresses the permutation problem of frequency-domain BSS by initializing the mixing vectors for each frequency channel. The EM algorithm starts with binaural cues, and after a few iterations the estimated probabilistic mask is used to initialize and re-estimate the mixing vector model parameters. We performed experiments on speech mixtures, and show an average improvement of about 0.8 dB in signal-to-distortion ratio (SDR) over the binaural-only baseline. © 2011 IEEE.
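
    The clustering core of such a method, fitting a Gaussian mixture to per-T-F-unit spatial cues and reading the posteriors off as soft masks, can be sketched with scikit-learn. The paper's EM additionally models mixing vectors and handles the permutation problem, which is omitted here, and the toy features below are synthetic:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        # toy (IPD, ILD) features, one row per T-F unit of the spectrogram
        cues = np.vstack([rng.normal(-1.0, 0.3, (500, 2)),   # source 1 units
                          rng.normal(+1.0, 0.3, (500, 2))])  # source 2 units

        gmm = GaussianMixture(n_components=2, covariance_type='full',
                              random_state=0).fit(cues)
        soft_masks = gmm.predict_proba(cues)  # P(source | T-F unit) per unit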

  • Liu Q, Naqvi SM, Wang W, Jackson PJB, Chambers J. (2011) 'Robust feature selection for scaling ambiguity reduction in audio-visual convolutive BSS'. European Signal Processing Conference, Barcelona, Spain: 19th European Signal Processing Conference 2011 (EUSIPCO 2011), pp. 1060-1064.

    Abstract

    Information from video has been used recently to address the issue of scaling ambiguity in convolutive blind source separation (BSS) in the frequency domain, based on statistical modeling of the audio-visual coherence with Gaussian mixture models (GMMs) in the feature space. However, outliers in the feature space may greatly degrade the system performance in both training and separation stages. In this paper, a new feature selection scheme is proposed to discard non-stationary features, which improves the robustness of the coherence model and reduces its computational complexity. The scaling parameters obtained by coherence maximization and non-linear interpolation from the selected features are applied to the separated frequency components to mitigate the scaling ambiguity. A multimodal database composed of different combinations of vowels and consonants was used to test our algorithm. Experimental results show the performance improvement with our proposed algorithm.

  • Wang W, Mustafa H. (2011) 'Single channel music sound separation based on spectrogram decomposition and note classification'. Springer Lecture Notes in Computer Science: Exploring Music Contents, Malaga, Spain: CMMR 2010: 7th International Symposium 6684, pp. 84-101.

    Abstract

    Separating multiple music sources from a single channel mixture is a challenging problem. We present a new approach to this problem based on non-negative matrix factorization (NMF) and note classification, assuming that the instruments used to play the sound signals are known a priori. The spectrogram of the mixture signal is first decomposed into building components (musical notes) using an NMF algorithm. The Mel frequency cepstrum coefficients (MFCCs) of both the decomposed components and the signals in the training dataset are extracted. The mean squared errors (MSEs) between the MFCC feature space of the decomposed music component and those of the training signals are used as the similarity measures for the decomposed music notes. The notes are then labelled to the corresponding type of instruments by the K nearest neighbors (K-NN) classification algorithm based on the MSEs. Finally, the source signals are reconstructed from the classified notes and the weighting matrices obtained from the NMF algorithm. Simulations are provided to show the performance of the proposed system. © 2011 Springer-Verlag Berlin Heidelberg.
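
    The first stage of the system, decomposing the mixture spectrogram into note-like components with NMF, looks roughly as follows with scikit-learn; the MFCC/K-NN note classification and resynthesis stages are only indicated in comments, and the random input stands in for a real magnitude spectrogram:

        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        V = np.abs(rng.standard_normal((257, 200)))   # magnitude spectrogram

        nmf = NMF(n_components=8, init='random', random_state=0, max_iter=400)
        W = nmf.fit_transform(V)             # spectral basis per note component
        H = nmf.components_                  # time activation of each component
        note_3 = np.outer(W[:, 3], H[3])     # spectrogram of one "note"
        # each note would then be labelled with an instrument (MFCC + K-NN)
        # and the per-instrument notes summed and inverted back to audio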

  • Liu Q, Wang W, Jackson PJB. (2011) 'A visual voice activity detection method with adaboosting'. IET IET Seminar Digest, London, UK: Sensor Signal Processing for Defence (SSPD 2011) 2011 (4)

    Abstract

    Spontaneous speech in videos capturing the speaker's mouth provides bimodal information. Exploiting the relationship between the audio and visual streams, we propose a new visual voice activity detection (VAD) algorithm, to overcome the vulnerability of conventional audio VAD techniques in the presence of background interference. First, a novel lip extraction algorithm combining rotational templates and prior shape constraints with active contours is introduced. The visual features are then obtained from the extracted lip region. Second, with the audio voice activity vector used in training, adaboosting is applied to the visual features, to generate a strong final voice activity classifier by boosting a set of weak classifiers. We have tested our lip extraction algorithm on the XM2VTS database (with higher resolution) and some video clips from YouTube (with lower resolution). The visual VAD was shown to offer low error rates.

  • Liu Q, Wang W, Jackson P. (2010) 'Audio-visual Convolutive Blind Source Separation'. London : IEEE Proc. Sensor Signal Processing for Defence (SSPD 2010), London, UK: Sensor Signal Processing for Defence

    Abstract

    We present a novel method for separating speech signals from their audio mixtures using audio-visual coherence. It consists of two stages: in the off-line training process, we use a Gaussian mixture model to characterise statistically the audio-visual coherence with features obtained from the training set; at the separation stage, likelihood maximization is performed on the independent component analysis (ICA)-separated spectral components. To address the permutation and scaling indeterminacies of frequency-domain blind source separation (BSS), a new sorting and rescaling scheme using the bimodal coherence is proposed. We tested our algorithm on the XM2VTS database, and the results show that our algorithm can address the permutation problem with high accuracy and mitigate the scaling problem effectively.

  • Liu Q, Wang W, Jackson PJB. (2010) 'Use of Bimodal Coherence to Resolve Spectral Indeterminacy in Convolutive BSS'. Springer Lecture Notes in Computer Science (LNCS 6365), St. Malo, France: 9th International Conference on Latent Variable Analysis and Signal Separation (formerly the International Conference on Independent Component Analysis and Signal Separation) 6365/2010, pp. 131-139.

    Abstract

    Recent studies show that visual information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterisation of the coherence between the audio and visual speech using, e.g. a Gaussian mixture model (GMM). In this paper, we present two new contributions. An adapted expectation maximization (AEM) algorithm is proposed in the training process to model the audio-visual coherence upon the extracted features. The coherence is exploited to solve the permutation problem in the frequency domain using a new sorting scheme. We test our algorithm on the XM2VTS multimodal database. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS.

  • Liu Q, Wang W, Jackson PJB. (2010) 'Bimodal Coherence based Scale Ambiguity Cancellation for Target Speech Extraction and Enhancement'. ISCA-International Speech Communication Association Proceedings of 11th Annual Conference of the International Speech Communication Association 2010, Makuhari, Japan: 11th Annual Conference of the International Speech Communication Association 2010, pp. 438-441.

    Abstract

    We present a novel method for extracting target speech from auditory mixtures using bimodal coherence, which is statistically characterised by a Gaussian mixture model (GMM) in the offline training process, using robust features obtained from the audio-visual speech. We then adjust the ICA-separated spectral components using the bimodal coherence in the time-frequency domain, to mitigate the scale ambiguities in different frequency bins. We tested our algorithm on the XM2VTS database, and the results show the performance improvement of our proposed algorithm in terms of SIR measurements.

  • Xu T, Wang W. (2010) 'Learning Dictionary for Underdetermined Blind Speech Separation Based on Compressed Sensing Method'. Proc. INSPIRE Conference on Information Representation and Estimation, London, UK: INSPIRE 2010
  • Xu T, Wang W. (2010) 'A block-based compressed sensing method for underdetermined blind speech separation incorporating binary mask'. IEEE Proceedings of 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, USA: ICASSP 2010, pp. 2022-2025.
  • Mustafa H, Wang W. (2010) 'Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification'. Proc. 7th International Symposium on Computer Music Modeling and Retrieval, Malaga, Spain: CMMR 2010
  • Xu T, Wang W. (2009) 'A compressed sensing approach for underdetermined blind audio source separation with sparse representation'. IEEE IEEE Workshop on Statistical Signal Processing Proceedings, Cardiff, UK: SSP '09, pp. 493-496.
  • Jan T, Wang W, Wang D. (2009) 'A multistage approach for blind separation of convolutive speech mixtures'. IEEE IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Taipei, Taiwan: ICASSP'09, pp. 1713-1716.
  • Soltuz SM, Wang W, Jackson PJB. (2009) 'A Hybrid Iterative Algorithm for Nonnegative Matrix Factorization'. IEEE 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, Vols 1 and 2, Cardiff, Wales: 15th IEEE/SP Workshop on Statistical Signal Processing, pp. 409-412.
  • Liang Y, Wang W, Chambers J. (2009) 'Adaptive signal processing techniques for clutter removal in radar-based navigation systems'. IEEE Conference Record of the 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA: Asilomar 2009, pp. 1855-1858.
  • Wang W. (2008) 'One Microphone Audio Source Separation Using Convolutive Non-negative Matrix Factorization with Sparseness Constraints'. Proc. 8th IMA International Conference on Mathematics in Signal Processing, Cirencester, UK: IMA 2008
  • Jan T, Wang W, Wang D. (2008) 'Binaural Speech Separation Based on Convolutive ICA and Ideal Binary Mask Coupled with Cepstral Smoothing'. Proc. 8th IMA International Conference on Mathematics in Signal Processing, Cirencester, UK: IMA 2008
  • Zou X, Wang W, Kittler J. (2008) 'Non-negative Matrix Factorization for Face Illumination Analysis'. Proc. ICA Research Network International Workshop, Liverpool, UK: ICARN 2008, pp. 52-55.
  • Wang W, Zou X. (2008) 'Non-Negative Matrix Factorization based on Projected Nonlinear Conjugate Gradient Algorithm'. Proc. ICA Research Network International Workshop, Liverpool, UK: ICARN 2008, pp. 5-8.
  • Wang W. (2008) 'Convolutive non-negative sparse coding'. IEEE Proceedings of the International Joint Conference on Neural Networks, Hong Kong: IJCNN 2008, pp. 3681-3684.
  • Zhang Y, Chambers JA, Wang W, Kendrick P, Cox TJ. (2007) 'A new variable step-size LMS algorithm with robustness to nonstationary noise'. IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Hawaii, USA: ICASSP'07 3, pp. III-1349-III-1352.
  • Wang W, Luo Y, Chambers JA, Sanei S. (2007) 'Non-negative matrix factorization for note onset detection of audio signals'. Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, Arlington, USA: MLSP 2006, pp. 447-452.
  • Wang W. (2007) 'Squared Euclidean distance based convolutive non-negative matrix factorization with multiplicative learning rules for audio pattern separation'. IEEE International Symposium on Signal Processing and Information Technology, Giza, Egypt: ISSPIT 2007, pp. 347-352.
  • Wang W, Hicks Y, Sanei S, Chambers J, Cosker D. (2005) 'Video assisted speech source separation'. IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Philadelphia, USA: ICASSP'05 V, pp. 425-428.
  • Yuan L, Sang E, Wang W, Chambers JA. (2005) 'An effective method to improve convergence for sequential blind source separation'. Springer Lecture Notes in Computer Science: Advances in Natural Computation, Part 1, Changsha, China: ICNC 2005: 1st International Conference on Natural Computation 3610, pp. 199-208.
  • Wang W, Chambers J, Sanei S. (2004) 'Subband Decomposition for Blind Speech Separation Using a Cochlear Filterbank'. Proc. IMA 6th International Conference on Mathematics in Signal Processing, Cirencester, UK: IMA 2004, pp. 207-210.
  • Wang W, Chambers J, Sanei S. (2004) 'Penalty Function Based Joint Diagonalization Approach for Convolutive Constrained BSS of Nonstationary Signals'. Technische Universität Wien Proc. 12th European Signal Processing Conference, Vienna, Austria: EUSIPCO 2004
  • Sanei S, Wang W, Chambers J. (2004) 'A Coupled HMM for Solving the Permutation Problem in Frequency Domain BSS'. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada: ICASSP 2004, pp. 565-568.
  • Wang W, Chambers JA, Sanei S. (2004) 'Penalty function approach for constrained convolutive blind source separation'. Springer Lecture Notes in Computer Science: Independent Component Analysis and Blind Signal Separation, Granada, Spain: ICA 2004: 5th International Conference 3195, pp. 661-668.
  • Chambers J, Wang W. (2004) 'Frequency domain blind source separation'. IET Seminar Digest, 2004 (10774)
  • Wang W, Sanei S, Chambers JA. (2004) 'A novel hybrid approach to the permutation problem of frequency domain blind source separation'. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3195, pp. 532-539.
  • Sanei S, Spyrou L, Wang W, Chambers JA. (2004) 'Localization of P300 sources in schizophrenia patients using constrained BSS'. Springer Lecture Notes in Computer Science: Independent Component Analysis and Blind Signal Separation, Malaga, Spain: ICA 2004: 5th International Conference 3195, pp. 177-184.
  • Wang W, Sanei S, Chambers J. (2003) 'Hybrid Scheme of Convolutive BSS and Beamforming for Speech Signal Separation Using Psychoacoustic Filtering'. Proc. International Conference on Control Science and Engineering, Harbin, China: ICCSE 2003
  • Wang W, Jafari M, Sanei S, Chambers J. (2003) 'Blind Separation of Convolutive Mixtures of Cyclostationary Sources Using an Extended Natural Gradient Method'. Proc. IEEE 7th International Symposium on Signal Processing and its Applications, Paris, France: ISSPA 2003 2, pp. 93-96.
  • Wang W, Sanei S, Chambers J. (2003) 'A Joint Diagonalization Method for Convolutive Blind Separation of Nonstationary Sources in the Frequency Domain'. Proc. 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan: ICA 2003, pp. 939-944.
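
    As promised in the bimodal-coherence abstract above, here is a minimal sketch of its modelling step: a Gaussian mixture model is fitted offline to joint audio-visual features of the target speaker, and its likelihood is then used to score separated spectral components. This is an illustration under stated assumptions, not the paper's implementation; the three-dimensional feature layout, the toy data, and all variable names are invented for the example.

        # Minimal sketch (not the paper's implementation) of modelling
        # audio-visual coherence with a GMM and scoring ICA outputs with it.
        # The feature layout and toy data are illustrative assumptions.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)

        # Toy training data: joint audio-visual feature vectors for the target
        # speaker, e.g. [log-spectral energy, lip width, lip height] per frame.
        av_features = rng.normal(size=(500, 3))

        # Offline training: statistically characterise the bimodal coherence.
        gmm = GaussianMixture(n_components=4, covariance_type='full')
        gmm.fit(av_features)

        # At separation time, score candidate ICA-separated components against
        # the model; higher log-likelihood indicates stronger audio-visual
        # coherence, which can guide the per-frequency-bin scaling correction.
        candidates = rng.normal(size=(100, 3))
        scores = gmm.score_samples(candidates)   # per-frame log-likelihoods
        print('mean AV-coherence score:', scores.mean())

    In the paper the coherence is evaluated in the time-frequency domain to resolve per-bin scale ambiguities; the sketch shows only the statistical modelling and scoring steps.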

Books

  • Wang W. (2010) Machine Audition: Principles, Algorithms and Systems. New York, USA : Information Science Reference
  • Zhou S, Wang W. (2009) IEEE/WRI Global Congress on Intelligent Systems Proceedings. USA : IEEE Computer Society

Book chapters

  • Wang W. (2011) 'Preface of Machine Audition: Principles, Algorithms and Systems'. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems Information Science Reference , pp. xv-xxi.

    Abstract

    "This book covers advances in algorithmic developments, theoretical frameworks, andexperimental research findings to assist professionals who want an improved ...

  • Jan T, Wang W. (2010) 'Cocktail Party Problem: Source Separation Issues and Computational Methods'. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems New York, USA : Information Science Reference Article number 3 , pp. 61-79.
  • Wang W. (2010) 'Instantaneous versus Convolutive Non-negative Matrix Factorization: Models, Algorithms and Applications to Audio Pattern Separation'. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems Information Science Reference Article number 15 , pp. 353-370.

Theses and dissertations

  • Dong J. (2016) Sparse analysis model based dictionary learning and signal reconstruction.
    [ Status: Approved ]

    Abstract

    Sparse representation has been studied extensively in the past decade in a variety of applications, such as denoising, source separation and classification. Earlier efforts focused on the well-known synthesis model, where a signal is decomposed as a linear combination of a few atoms of a dictionary. However, the analysis model, a counterpart of the synthesis model, did not receive much attention until recent years. The analysis model takes a different viewpoint on sparse representation: it assumes that the product of an analysis dictionary and a signal is sparse. Compared with the synthesis model, this model tends to be more expressive for representing signals, as a much richer union of subspaces can be described.

    This thesis focuses on the analysis model and aims to address its two main challenges: analysis dictionary learning (ADL) and signal reconstruction. In the ADL problem, the dictionary is learned from a set of training samples so that the signals can be represented sparsely under the analysis model, thus offering the potential to fit the signals better than pre-defined dictionaries. In existing ADL algorithms, such as the well-known Analysis K-SVD, the dictionary atoms are updated sequentially. The first part of this thesis presents two novel analysis dictionary learning algorithms that update the atoms simultaneously. Specifically, the Analysis Simultaneous Codeword Optimization (Analysis SimCO) algorithm is proposed, by adapting the SimCO algorithm originally proposed for the synthesis model. In Analysis SimCO, the dictionary is updated using optimization on manifolds, under $\ell_2$-norm constraints on the dictionary atoms. This framework allows multiple dictionary atoms to be updated simultaneously in each iteration. However, like existing ADL algorithms, Analysis SimCO may learn a dictionary containing similar atoms. To address this issue, Incoherent Analysis SimCO is proposed, which employs a coherence constraint and introduces a decorrelation step to enforce it. The competitive performance of the proposed algorithms is demonstrated in experiments on recovering synthetic dictionaries and removing additive noise from images, compared with existing ADL methods.

    The second part of this thesis studies how to reconstruct signals with learned dictionaries under the analysis model. This is demonstrated by a challenging application problem: multiplicative noise removal (MNR) of images.
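
    As a toy illustration of the analysis model described in the abstract, the sketch below alternates between hard-thresholding the analysis coefficients and taking a gradient step on the dictionary, projecting its rows back to unit l2-norm after each update. This is an assumption-laden simplification, not the Analysis SimCO algorithm from the thesis; the threshold rule, step size, and toy data are all invented for the example.

        # Toy sketch of the analysis sparse model: the product of an analysis
        # dictionary Omega and a signal x is assumed sparse, and dictionary
        # rows are kept at unit l2-norm. Not the thesis' Analysis SimCO code.
        import numpy as np

        rng = np.random.default_rng(1)
        p, n, N = 24, 16, 200                # atoms, signal dim, samples

        Omega = rng.normal(size=(p, n))
        Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)  # unit-norm rows
        X = rng.normal(size=(n, N))                            # training signals

        for _ in range(10):
            Z = Omega @ X                                 # analysis coefficients
            thr = np.quantile(np.abs(Z), 0.8)
            Z_sparse = np.where(np.abs(Z) > thr, Z, 0.0)  # sparse target
            # Gradient step on ||Omega X - Z_sparse||_F^2 w.r.t. Omega, then
            # project the rows back onto the unit sphere (the l2 constraint).
            grad = (Omega @ X - Z_sparse) @ X.T
            Omega -= 1e-3 * grad
            Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)

        print('fraction of below-threshold analysis coefficients:',
              np.mean(np.abs(Omega @ X) <= thr))

    The row normalisation here stands in for the $\ell_2$-norm constraint on dictionary atoms mentioned in the abstract; Analysis SimCO itself enforces the constraint via optimization on manifolds and updates multiple atoms per iteration.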
