Speech and audio processing

The speech and audio processing work spans a number of fundamental and applied research areas, including signal processing for separation, recognition, transcription, enhancement, coding and synthesis, as well as applications to advanced fixed and wireless communication systems. The various research activities have received funding support from EPSRC, DSTL, RAENG, BBC, EU, BAE Systems, etc.

Speech and audio separation

(Wang and Jackson)

It is well known that humans are generally skilful at isolating a speech source of interest from the sound mixtures observed in a cocktail party environment, where background noise and interfering sounds are present simultaneously. It is, however, difficult for machines to replicate such capabilities. Solutions to this problem are likely to have an impact on hearing aids and cochlear implants, automatic speech recognition in uncontrolled natural environments, advanced human-computer interaction, and security and defence related applications. Our efforts in this area have centred on the development of algorithmic solutions for the separation or extraction of speech or music sources from their mixtures, primarily using the techniques of blind source separation, independent component analysis, time-frequency masking, non-negative matrix/tensor factorization, and computational auditory scene analysis. Our interests in this direction are summarised as follows:

  • Cocktail party problem
  • Underdetermined speech separation
  • Convolutive speech separation
  • Audio-visual speech separation
  • Extraction/separation of music sources from a mixture (or mixtures)
  • Separation of singing voices from accompanying music
  • Computational auditory scene analysis
  • Psychoacoustics motivated speech and audio separation
  • Model based speech and audio separation
  • Sparse representations for speech separation
  • Microphone array techniques for speech separation
  • Sound localisation
  • Applications to hearing aids, cochlear implants
  • Applications to advanced human computer interactions
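
As an illustration of time-frequency masking, one of the techniques named above, the following is a minimal sketch of an ideal binary mask. It is not the group's own algorithm. It assumes that the magnitude spectrograms of the target and the interferer are known (an oracle setting, useful mainly as an upper bound), and all names and values (`ideal_binary_mask`, `apply_mask`, the toy spectrograms) are hypothetical.

```python
def ideal_binary_mask(target_mag, interferer_mag):
    """Keep (1.0) each time-frequency bin where the target is stronger, else drop it (0.0)."""
    return [
        [1.0 if t > i else 0.0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_mag, interferer_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Multiply the mixture spectrogram by the mask, bin by bin."""
    return [
        [m * w for m, w in zip(m_row, w_row)]
        for m_row, w_row in zip(mixture_mag, mask)
    ]


# Hypothetical magnitude spectrograms: 2 time frames x 3 frequency bins.
target = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
interferer = [[0.1, 0.7, 0.3], [0.0, 0.6, 0.4]]
# Assumption: magnitudes add in the mixture (phase is ignored in this toy case).
mixture = [[t + i for t, i in zip(t_row, i_row)]
           for t_row, i_row in zip(target, interferer)]

mask = ideal_binary_mask(target, interferer)
estimate = apply_mask(mixture, mask)
```

In practice the sources are unknown, so the mask has to be estimated from the mixture itself, for example from spatial or model-based cues.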

Speech and audio recognition

(Jackson and Wang)

Automatic speech recognition, which aims to recognise from speech recordings what has been said by the speaker, has been studied for many years. Although state-of-the-art techniques perform well on clean speech, they are still very limited when presented with noisy speech that contains background noise and interfering sounds. Recent developments in audio recognition have shifted towards the recognition and detection of sound events in general audio, such as environmental sound recognition, audio event detection from sound mixtures, and anomalous event detection from sound recordings (e.g. a patient's cough). Applications of this research include human-computer interaction, healthcare, etc. Our interests in this area include:

  • Emotion recognition from speech
  • Environmental sound recognition
  • Audio event classification and recognition
  • Anomaly audio event detection
  • Voice activity detection
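
To make the last item concrete, here is a minimal sketch of a frame-energy voice activity detector. It is a textbook baseline, not a method attributed to the group. The frame length, the threshold and the test signal are illustrative assumptions, and a fixed energy threshold of this kind is known to degrade in the noisy conditions discussed above.

```python
import math


def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def energy_vad(signal, frame_len=160, threshold=0.01):
    """Mark a frame as active when its energy exceeds a fixed threshold (assumed value)."""
    return [e > threshold for e in frame_energies(signal, frame_len)]


# Hypothetical test signal at 8 kHz: silence, a 200 Hz tone, then silence again.
fs = 8000
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(320)]
decisions = energy_vad(silence + tone + silence)
```

With 160-sample frames the signal splits into six frames, and only the two frames covering the tone are marked as active.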

Audio transcription and retrieval


Audio transcription aims to analyse an audio signal (e.g. music) and automatically detect the pitch (i.e. frequency) and the onset/offset (i.e. duration) of the notes being played, along with other parameters. Audio transcription is a challenging scientific problem that relates to multiple disciplines including, for example, signal processing, pattern recognition, machine learning, statistics and cognitive science. This topic has a wide range of applications in the multimedia search, entertainment and broadcast industries. Our interests in this area include:

  • Music transcription
  • Note onset/offset detection
  • Pitch detection and transcription
  • Polyphonic pitch tracking
  • Sparse coding of music
  • Non-negative matrix factorization for music analysis
  • Music retrieval from metadata
  • Singer identification
  • Instrument identification
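
As a simple illustration of pitch detection, the sketch below estimates the fundamental frequency of a single monophonic frame from the peak of its autocorrelation. This is a classical baseline rather than the group's method. The search range (`fmin`, `fmax`) and the test tone are assumptions, and the approach does not extend directly to the polyphonic case listed above.

```python
import math


def estimate_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Return the pitch in Hz from the strongest autocorrelation peak, or None."""
    min_lag = int(fs / fmax)
    max_lag = min(int(fs / fmin), len(frame) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation of the frame at this lag.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag if best_lag else None


# Hypothetical input: 0.1 s of a pure 200 Hz tone sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
pitch = estimate_pitch(frame, fs)
```

For real instrument or voice signals, octave errors are common with this estimator, so a practical system would add normalisation and peak-picking safeguards.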

Speech enhancement


Our activities in this area have centred on the analysis of room acoustical effects on speech recordings. In particular, we have been developing algorithms for mitigating the effects of room reverberation and ambient noise on distorted speech, using, for example, an empirical mode decomposition based subband processing technique, a frequency-dependent statistical modelling approach, and binaural hearing based techniques. Our interests in this area include:

  • Speech denoising
  • Speech dereverberation
  • Blind dereverberation
  • Room acoustics modelling
  • Reverberation time estimation
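
For reverberation time estimation, a common non-blind baseline is Schroeder backward integration of a measured room impulse response, followed by a line fit to the decay curve. The sketch below assumes the impulse response is available, which is not the case in the blind setting listed above. The fit range (-5 dB to -25 dB) and the synthetic exponential response are illustrative assumptions, not values taken from the group's work.

```python
import math


def schroeder_decay_db(ir):
    """Energy decay curve in dB, obtained by integrating the squared response backwards."""
    energy = 0.0
    edc = [0.0] * len(ir)
    for n in range(len(ir) - 1, -1, -1):
        energy += ir[n] * ir[n]
        edc[n] = energy
    total = edc[0]
    return [10.0 * math.log10(e / total) if e > 0 else float("-inf") for e in edc]


def estimate_rt60(ir, fs, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay between start_db and end_db and extrapolate to -60 dB."""
    decay = schroeder_decay_db(ir)
    pts = [(n / fs, d) for n, d in enumerate(decay) if end_db <= d <= start_db]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))  # dB per second
    return -60.0 / slope


# Hypothetical impulse response: a pure exponential decaying by 60 dB in 0.5 s.
fs = 8000
true_rt60 = 0.5
ir = [10.0 ** (-3.0 * (n / fs) / true_rt60) for n in range(fs)]
rt60 = estimate_rt60(ir, fs)
```

A measured impulse response is noisy rather than a clean exponential, so the fit range normally has to be chosen above the noise floor.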

Speech and audio synthesis


  • Speech driven visual animation
  • Spatial audio synthesis
  • Modelling of speech articulation
  • Spatial audio
  • 3D sound

Speech and audio processing activities in the I-Lab

I-Lab has traditionally been known as one of the world-leading speech coding groups, having contributed to many national and international research and development activities as well as standardisation programmes.

These standards include the ETSI GSM Full and Half rate speech and channel coding systems, the ETSI and 3GPP AMR speech and channel coding system, INMARSAT-M and mini-M, NATO STANAG, etc.

In addition to speech coding, I-Lab's research activities now cover blind source separation, advanced spatial audio coding based on ABS, speaker recognition systems, speech and data tunnelling for secure GSM/3G communication systems, simplified WFS audio rendering, and audio-video synchronisation for P2P and DVB networks, all of which are leading-edge research activities of international standing.

Find us

Centre for Vision Speech and Signal Processing
Alan Turing Building (BB)
University of Surrey