I-Lab: Audio and speech processing

The speech and audio analysis research theme in I-Lab covers a range of activities, including fundamental research and application-based specialisations.

Speech and audio signals contain many characteristics which can be extracted and exploited in a given application. I-Lab has traditionally been known as one of the world-leading speech coding groups, contributing to many national and international research and development activities as well as standardisation programmes.

These standards include the ETSI GSM full-rate and half-rate speech and channel coding systems, the ETSI/3GPP AMR speech and channel coding system, INMARSAT-M and mini-M, and NATO STANAG.

In addition to speech coding, I-Lab’s research activities now cover blind source separation, advanced spatial audio coding based on analysis-by-synthesis, speaker recognition systems, speech and data tunnelling for secure GSM/3G communication systems, simplified wave field synthesis audio rendering, and audio-video synchronisation for P2P and DVB networks. These are all leading-edge research activities of internationally recognised quality.

The following lists the topics of the current and recent research activities related to speech and audio in I-Lab.

Speech processing

This research aims to provide high quality speech at low bit rates, typically from 1.2 to 4 kb/s for narrowband speech, and 5 to 10 kb/s for wideband speech. The main applications are communications over wireless networks (e.g. GSM, 3G and 4G), satellites, or IP networks. It can also be used for storage purposes.

A baseline speech coder called Split-Band LPC (SB-LPC) has been developed in I-Lab over several years. It is based on a sinusoidal model, and can provide a variety of bit rates by varying the parameter update rates and quantisation schemes. It can operate on narrowband speech, sampled at 8 kHz, and on wideband speech, sampled at 16 kHz, for higher-quality applications.
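
As a rough, hypothetical illustration of the sinusoidal modelling principle on which SB-LPC is based (not the actual coder), the following Python sketch synthesises a voiced frame as a sum of pitch harmonics; the frame length, sampling rate and amplitude values are illustrative assumptions only.

```python
import numpy as np

def synthesise_voiced_frame(f0, harmonic_amps, fs=8000, frame_len=160):
    """Synthesise one voiced frame as a sum of pitch harmonics.

    f0            : pitch frequency in Hz (e.g. 120 Hz)
    harmonic_amps : amplitudes of the harmonics below fs/2
    fs            : sampling rate (8 kHz narrowband assumed)
    frame_len     : frame length in samples (20 ms at 8 kHz)
    """
    n = np.arange(frame_len)
    frame = np.zeros(frame_len)
    for k, a in enumerate(harmonic_amps, start=1):
        fk = k * f0
        if fk >= fs / 2:          # keep harmonics below the Nyquist frequency
            break
        frame += a * np.cos(2 * np.pi * fk * n / fs)
    return frame

# Example: a 120 Hz voiced frame with decaying harmonic amplitudes
amps = 1.0 / np.arange(1, 30)
frame = synthesise_voiced_frame(120.0, amps)
```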

This coder has been used for several applications, such as:

  • Development of a candidate for the GSM AMR competition, for the half-rate channel (11.4 kb/s total bit rate, including channel coding). The SB-LPC was used at 3 different rates, complete with channel coding and a rate adaptation scheme. The proposed system was the best performing system in the half-rate GSM channel.
  • Development of a 1.2/2.4 kb/s coder for the NATO STANAG competition. The system was completed by a front-end noise canceller, and was one of three candidates entered.
  • Development of a 4 kb/s version, aiming at providing toll-quality speech for communications applications.
  • Development of an end-to-end secure voice communication system over public GSM/3G mobile networks with a 1.2 kb/s codec.

The SB-LPC coder is also very promising for VoIP applications, as its basic principles make it far less sensitive to packet losses than the CELP-based coders currently used in such systems.

In modern daily life, people interact with machines more than ever. Most of these human-machine interactions require some form of user recognition to ensure that only the correct user is allowed to gain access. Passwords, Personal Identification Numbers (PINs) and swipe cards are the most common methods of identification between humans and machines for applications such as computer access, transaction security and data protection. Biometric recognition methods use people's physiological or behavioural characteristics to recognise them; iris, fingerprint, signature, face and voice recognition are examples of biometric methods.

Voice recognition is the process of recognising who is speaking by using characteristics of the speaker's voice. Application areas for speaker recognition include access control, transaction authentication, information retrieval, forensic applications, and prisoner monitoring. In this work, a text-independent speaker recognition system has been developed based on Gaussian Mixture Models (GMMs); the feature vectors used are Mel-frequency cepstral coefficients (MFCCs). The speaker identification accuracy of the system is more than 99 per cent (measured on the TIMIT database).
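
A minimal sketch of this text-independent GMM/MFCC approach is shown below, assuming librosa and scikit-learn are available; the model size, sampling rate and file handling are illustrative choices rather than details of the I-Lab system.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=13):
    """Load a speech file and return MFCC feature vectors (frames x coefficients)."""
    y, sr = librosa.load(wav_path, sr=8000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_speaker_models(enrolment):
    """Fit one GMM per speaker from a dict {speaker: [wav files]}."""
    models = {}
    for spk, files in enrolment.items():
        feats = np.vstack([mfcc_features(f) for f in files])
        gmm = GaussianMixture(n_components=32, covariance_type='diag')
        models[spk] = gmm.fit(feats)
    return models

def identify(models, wav_path):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    feats = mfcc_features(wav_path)
    return max(models, key=lambda spk: models[spk].score(feats))
```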

A speaker verification system has also been implemented, using a Universal Background Model (UBM). Speech transcoding modifies the speech signal, which introduces distortion and affects speaker recognition performance. Our research focuses on the effects of this distortion on speaker recognition performance, and on ways of reducing the unwanted effects introduced by speech coders.

We investigated the effects of four standard speech-coding algorithms on speaker recognition systems. The speech codecs used were AMR 12.2 (12.2 kbit/s), G.729 (8 kbit/s), G.723.1 (5.3 kbit/s), and MELP (2.4 kbit/s). When there is a mismatch between enrolment data and test data (i.e. training and testing speech come from different speech codecs), speaker recognition performance is reduced. A coder recogniser is a system that detects the specific speech codec used for the training data and then uses this knowledge during testing; this eliminates the data mismatch between training and testing sessions.

Speaker recognition systems offer good performance in ideal conditions, i.e. in the absence of background noise. Error rates as low as 1 per cent have been achieved by existing systems. However, existing speaker identification (SID) systems make no assumption about an individual disguising his or her voice. With advances in voice morphing and speech synthesis, SID systems are vulnerable to attacks by impostors who use these synthesis and transformation techniques to alter their voices. In our analysis, speaker recognition systems have been found to falter under such deliberate attacks.

In this work, a voice morphing system has been developed and used to attack the speaker recognition system, also developed at I-Lab. The performance of the speaker recognition system deteriorates against morphed voices, and in certain cases impersonation attacks with success rates of up to 83 per cent have been observed. The aim of our research in this area is to investigate how speaker recognition systems can be improved to resist such deliberate impersonation attacks.

In this regard, a system based on multiple classifiers has been developed, using different spectral representations of the speech signal. The research also aims to determine the presence of speaker-specific information in the LP residual. We use sub-band information in the spectrum of the LP residual to represent the speaker-specific characteristics. The different feature sets and the LP-residual-based information are combined in a multiple-classifier speaker recognition system. The proposed system is able to outperform the state of the art in speaker recognition when presented with morphed voices.
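
The following sketch illustrates one plausible way of deriving sub-band features from the LP residual, assuming standard LPC inverse filtering; the analysis order and band layout are illustrative and not necessarily those used in the I-Lab system.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual_subband_features(frame, order=10, n_bands=8):
    """Return log sub-band energies of the LP residual for one speech frame."""
    a = librosa.lpc(frame, order=order)          # prediction polynomial [1, -a1, ..., -ap]
    residual = lfilter(a, [1.0], frame)          # inverse-filter the frame to get the residual
    spectrum = np.abs(np.fft.rfft(residual)) ** 2
    bands = np.array_split(spectrum, n_bands)    # equal-width sub-bands of the residual spectrum
    return np.log(np.array([b.sum() for b in bands]) + 1e-12)
```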

Voice morphing refers to a set of techniques which aim to alter the characteristics of a speaker’s voice in such a way that the speech appears to have been spoken by a different speaker. Voice morphing, or voice conversion as it is generally known, is used in text-to-speech synthesis, where it can adapt or personalise the synthesised voices in corporate dialogue systems. Dubbing tasks in the entertainment industry can also be accomplished with voice conversion systems, and it can further be used for cross-language voice conversion in the film and music industries. Researchers have also used voice conversion to enhance the intelligibility and naturalness of otherwise impaired speech. These applications underline the significance of voice morphing in spoken language systems.

The system developed at I-Lab requires a parallel speech corpus from the impostor and the target speakers. During the training stage, spectral features representing the characteristics of the impostor and target speakers are extracted from the input speech. In our system, line spectral frequencies (LSFs) are used to represent the spectral properties of the two speakers. The procedure is carried out pitch-synchronously.

The unvoiced regions of the speech are filled with pseudo pitch marks at a constant rate of 125 Hz. The two recordings of the spoken text are then aligned with each other by means of dynamic time warping (DTW).

The output of the alignment process is used to train joint-probability estimates of the impostor and target speakers' distributions. This is accomplished using a Gaussian mixture model (GMM).

This impostor-target pair-specific probabilistic model is used to generate a mapping function between the impostor feature vector space and the target feature vector space. During the conversion phase, the test feature vector stream from the impostor speech is transformed using the trained mapping function, and the converted speech is synthesised pitch-synchronously.
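
The sketch below outlines the standard joint-density GMM regression on which such a mapping function is typically based, assuming DTW-aligned LSF vectors are already available; pitch-synchronous analysis and synthesis are omitted, and the code is illustrative rather than the exact I-Lab implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_joint_gmm(src_lsf, tgt_lsf, n_mix=8):
    """Fit a GMM on DTW-aligned joint [source, target] LSF vectors (frames x 2d)."""
    joint = np.hstack([src_lsf, tgt_lsf])
    return GaussianMixture(n_components=n_mix, covariance_type='full').fit(joint)

def convert(gmm, x):
    """Map one source LSF vector x into the target space (regression on the joint GMM)."""
    d = x.shape[0]
    weights = np.zeros(gmm.n_components)
    cond_means = np.zeros((gmm.n_components, d))
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m][:d], gmm.means_[m][d:]
        cov = gmm.covariances_[m]
        cov_xx, cov_yx = cov[:d, :d], cov[d:, :d]
        # posterior of component m given the source vector only
        weights[m] = gmm.weights_[m] * multivariate_normal.pdf(x, mu_x, cov_xx)
        # conditional mean of the target part given the source part
        cond_means[m] = mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x)
    weights /= weights.sum()
    return weights @ cond_means
```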

The performance of a voice conversion system depends heavily on the amount of speech data used to train the conversion function. A limited amount of training data can result in audible discontinuities in the converted speech. A method has been introduced to reduce these artefacts in the output speech by smoothing the posterior probabilities of the mixture components during the conversion stage.

The performance of the system has been investigated for different feature vector dimensions, different numbers of GMM mixture components and different amounts of training speech material. The system has been tested on the VOICES speech corpus and evaluated objectively, using the spectral distortion (SD) measure, and subjectively, using ABX tests.

The evaluations have been carried out for male-male, male-female, female-male and female-female impostor-target pairs, and successful conversion rates of up to 85 per cent have been achieved. Subjective evaluations of the GMM-PS system indicate its superior performance over traditional GMM-only systems. Future work on the voice morphing system will focus on converting the speaker-specific information found at the sub-segmental, segmental and supra-segmental levels, in addition to the conversion of the spectral envelope.

This work aims to provide a delay-guaranteed data service for real-time applications over public mobile networks such as GSM/3G systems. Although the GSM/3G data channel can be used for real-time data transmission, it suffers from a number of disadvantages such as interoperability issues and long transmission delays (1-2 seconds) with high jitter. Moreover, mobile data service may not be available everywhere and at all times.

Data tunnelling over voice channels, on the other hand, does not have such problems. Voice is the top-priority service for mobile networks and is available in most of the world, and interoperability has never been a problem. As data tunnelling uses voice channels, its end-to-end delay and jitter are constant and guaranteed. These features make data tunnelling over the voice channel highly desirable, as it makes it possible to transmit any data over the widely available voice channels of mobile networks, anytime and anywhere.

However, transmission of data over voice channels is not straightforward. Firstly, the GSM terminal’s highly efficient speech encoder and decoder (codec) are optimised for perceptual quality, not for a bit-exact match between the input and output signals.

Randomised data applied directly to the codec therefore cannot be recovered accurately after transmission. Secondly, GSM systems employ voice activity detection (VAD) to cut off non-speech signals such as white noise and silence. A series of randomised data looks much like white noise to the GSM speech codec and will fail to pass through the GSM voice channel.

To overcome these difficulties, a new modem that operates over the low-bit-rate GSM/3G voice channel has been designed at I-Lab. This modem can be used to transmit any form of digital data, e.g. encrypted speech data.

Modulation

The basic requirement for a data modulator used over the GSM/3G voice channel is that the modulated signal is able to pass through the multiple speech compression/decompression tandems in the GSM/3G system, and that the original data can be recovered with maximal accuracy. For this purpose, the modulator output is a speech-like waveform which possesses speech characteristics, allowing it to pass through the codecs and be detected at the receiver.

Typical speech characteristics preserved by speech encoders are the LSF parameters (representing the vocal tract), pitch, voiced/unvoiced classification and signal energy level. These characteristics are incorporated into the output of the modulator in such a way that the modulated signal sounds like speech. The modulator is thus a simplified speech synthesiser: the data bit stream at its input is mapped onto parameter values using codebooks, and its output is a speech-like waveform.
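
As a toy illustration of the codebook-mapping idea (deliberately much simpler than the real modem, which also encodes the spectral envelope and energy), the sketch below maps three data bits per frame onto a pitch value drawn from a hypothetical codebook and produces a pitch-pulse excitation:

```python
import numpy as np

# Hypothetical parameter codebook: each 3-bit symbol selects a pitch value (Hz)
# within the typical speech pitch range, so the waveform remains speech-like.
PITCH_CODEBOOK = np.array([90, 110, 130, 150, 170, 190, 210, 230], dtype=float)

def modulate(bits, fs=8000, frame_len=160):
    """Map a bit stream onto frames of pitch-pulse excitation (one 3-bit symbol per frame)."""
    bits = np.asarray(bits).reshape(-1, 3)
    frames = []
    for b in bits:
        idx = int(b[0]) * 4 + int(b[1]) * 2 + int(b[2])   # 3 bits -> codebook index
        period = int(round(fs / PITCH_CODEBOOK[idx]))
        frame = np.zeros(frame_len)
        frame[::period] = 1.0                              # impulse train at the selected pitch
        frames.append(frame)
    return np.concatenate(frames)
```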

Demodulation

Demodulation is the reverse of modulation. After passing through the GSM/3G network’s voice channel and the speech decoder in the mobile terminal, the speech-like waveform is recovered and input to the demodulator. The demodulator works like a simplified speech encoder, extracting the preserved speech characteristics. These characteristic parameters are then mapped back onto the corresponding data bits using the codebooks; the resulting bit stream is the output of the demodulator, i.e. the received data stream.
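
Continuing the toy example above (and reusing its hypothetical PITCH_CODEBOOK), the corresponding demodulator re-estimates the pitch of each received frame and maps it back to the nearest codebook entry to recover the bits:

```python
import numpy as np

def demodulate(signal, fs=8000, frame_len=160):
    """Recover the bit stream by re-estimating the pitch of each frame and
    mapping it to the nearest entry of PITCH_CODEBOOK (see modulation sketch)."""
    bits = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # crude pitch estimate: strongest autocorrelation peak in the speech range
        ac = np.correlate(frame, frame, mode='full')[frame_len - 1:]
        lo, hi = int(fs / 240), int(fs / 80)
        period = lo + np.argmax(ac[lo:hi])
        pitch = fs / period
        idx = int(np.argmin(np.abs(PITCH_CODEBOOK - pitch)))
        bits.extend([(idx >> 2) & 1, (idx >> 1) & 1, idx & 1])
    return np.array(bits)
```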

Applications

One typical application of data tunnelling is end-to-end secure voice communication, where the speech is transmitted as encrypted data via the voice channel. Using this technique, I-Lab has developed the world’s first end-to-end secure voice communication system over public GSM/3G network voice channels. The same technique can also be used to transmit other content, such as images and telemetry data, in encrypted or clear form, from places where data service is unavailable or unreliable to anywhere in the world, as long as voice service is available.

Performance

The complete system operates on a Texas Instruments 5510 fixed-point DSP, which executes speech encoding, encryption and modulation on the transmit side, and demodulation, decryption and speech decoding on the receive side.

With 1.2 kb/s speech carried over a 3 kb/s throughput, the bit error rate is close to zero.

The overall end-to-end delay (maximum secure voice communication delay plus GSM delay) is 200 ms at most.

Advantages of the system

The following lists the advantages of this system.

  • Low call set-up time and delay (8 seconds to set up and 200ms overall delay including 120ms for GSM-GSM system delay)
  • Good speech quality
  • End-to-end secure, for both GSM-to-GSM and GSM-to-PSTN communications
  • No interoperability issues
  • Does not require a dedicated handset or subscription: can be implemented as a plug-in module for any GSM mobile phone
  • Works with the 3G system

Real-time software is available which can be compiled for various DSPs; it is being marketed by our spin-off company MulSys Limited.

Audio signal processing

The source separation problem refers to the situation where individual sources are to be extracted from mixtures of those sources, such as extracting the conversation between two individuals in a marketplace full of different sounds, or in a cocktail party setting. The term blind source separation (BSS) refers to the condition that no knowledge of the mixing system (such as the boundaries of a street canyon) or of the sources to be extracted is available. The cocktail party effect refers to the human ability to carry out source separation in a party environment where a number of sources are present, for example picking out what someone is saying a few metres away from a mix of people talking and music playing. At I-Lab, research is being carried out on the cocktail party effect and the BSS problem.

Applications

Blind source separation has proven beneficial in a number of applications. Most prominent of these include:

  • Separating people's voices in a room, with the ability to select a voice of interest (teleconferencing)
  • CCTV systems that automatically point towards sounds of interest and listen (security/surveillance)
  • Noise and interference suppression on mobile devices
  • Separating an individual instrument from a group (sound recording)
  • Recording nerve impulses from the brain and separating them from background activity such as eye movement (medical imaging)
  • Improving the quality of sound produced by a hearing aid by separating the sources electronically (hearing aids)

Previous work

Work in this field was initially performed by Herault and Jutten in the mid-1980s. Certain assumptions had to be made about the audio sources for separation to be achieved, the main ones being the statistical independence of the sources and the non-Gaussian nature of the source distributions. Early solutions to BSS dealt with non-moving sources. Separation was achieved by exploiting the statistical independence of the acoustic sources, and it performed well in cases where dispersive effects of the acoustic boundaries (such as walls and the ground) could be ignored.

The next stage was to develop source separation systems that could operate in reverberant conditions. Statistical methods suffer from an effect known as the permutation problem when operating in such conditions: separation is performed independently in each frequency bin, and as these statistical techniques are ambiguous about the order of the separated sources, frequency bins from the various sources become mixed up. Various techniques exist to solve the permutation problem. With recent developments in microphone array processing, the permutation problem has been solved more easily by utilising the source-location information these techniques provide. Recent methods have also combined statistical and microphone array techniques to further improve the quality and speed of BSS algorithms.

Current work

B-format recordings of the sound sources are obtained using a commercially available tetrahedral microphone. This provides the pressure signal together with pressure-gradient signals along the horizontal and vertical directions, each containing a mixture of the sound sources present in the scene.

The sound is processed in blocks of time. The modified discrete cosine transform (MDCT) is applied; its overlapping and energy-compaction properties reduce edge effects across blocks.

Next, the intensity vector directions are obtained and rounded to the nearest degree; this step is referred to as localisation. The directions are then used to "beamform" towards a particular source direction, depending on the user requirement, resulting in spatial filtering that retains the desired source signal while suppressing all others. The result is converted back to the time domain as a separated sound source signal.
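
A simplified sketch of this localisation-plus-directional-filtering idea is given below; for brevity it uses an STFT rather than the MDCT described above, and the angular window, sampling rate and block size are illustrative assumptions rather than the system's actual settings.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_by_direction(w, x, y, target_deg, width_deg=20.0, fs=16000, nperseg=1024):
    """Directional filtering of a B-format recording (W pressure, X/Y gradients).

    For each time-frequency bin the intensity-vector direction is estimated from
    the cross-spectra of the pressure and pressure-gradient signals; bins whose
    direction falls within +/- width_deg of target_deg are kept, others suppressed.
    """
    _, _, W = stft(w, fs=fs, nperseg=nperseg)
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)

    # Intensity-vector direction (degrees) per time-frequency bin
    theta = np.degrees(np.arctan2(np.real(np.conj(W) * Y),
                                  np.real(np.conj(W) * X)))
    diff = np.abs((theta - target_deg + 180.0) % 360.0 - 180.0)
    mask = (diff <= width_deg).astype(float)

    _, s = istft(W * mask, fs=fs, nperseg=nperseg)
    return s
```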

Performance

Performance evaluation is carried out by measuring the signal-to-interference ratio (SIR) of the desired separated sound with respect to the other, unwanted sounds present. The test factors include the number of sound sources, the angular interval between them and the reverberation time. Two and three sound sources are considered in different experiments, using male speech, female speech, guitar and cello sounds.
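
For reference, the SIR used in such evaluations can be computed as below, assuming the target and interference components of the separated output are available from a controlled test setup:

```python
import numpy as np

def sir_db(target, interference):
    """Signal-to-interference ratio in dB between the desired separated
    component and the residual interference in the output."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))

# SIR improvement = SIR at the separator output minus SIR at its input
```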

Research is currently being carried out at I-Lab to improve the signal-to-interference ratio of the separated signals in order to optimise system performance. This is being done by coupling a single-microphone noise suppression technique to the BSS system described above.

Practical considerations

Size: Small, lightweight microphone array; ping-pong ball size or less

Speed: Real-time separation < 0.1 sec delay

3D symmetry: Localisation and separation in 360 degrees (horizontal and vertical axes)

Performance: Average SIR improvement = 14.4 dB (may go up to 26 dB)

Interface: Small number of input channels - advantageous in interfacing, storage and processing

Operation flexibility: User can select a single sound source to listen to from a mixture of sources or suppress one or more sources

The design, patent and software of this system are being licensed by the University (contact: Prof Kondoz).

One of the primary objectives of modern audio-visual media creation and reproduction techniques is realistic perception of the delivered contents by the consumer.

Spatial audio techniques in general attempt to deliver the impression of an auditory scene in which the listener can perceive the spatial distribution of the sound sources as if he or she were in the actual scene. Many commercially available spatial audio reproduction systems are specified for particular transducer configurations, such as loudspeakers or headphones, without taking into consideration the possibility of the listener’s interaction during reproduction.

For example, during the playback of a DVD movie, the reproduced spatial sound is tied to the viewpoint in the corresponding recorded scene. Interactive reproduction of spatial audio, on the other hand, allows for the listener’s movement during playback and delivers the corresponding change of the auditory scene. This feature is often found in applications such as video games and virtual reality. This study presents a new methodology to efficiently extend the interactivity of spatial audio delivery over a wider area of media creation and reproduction.

The audio capturing equipment in I-Lab allows for recording and processing of audio signals for various purposes. Binaural, ambisonic and 5.1-channel configurations are initially supported by means of a dummy head and coincident microphone arrays.

Additionally, for maximum flexibility in terms of mixing or processing, the signals from individual sound sources should be obtained separately. In case of games or virtual reality applications where artificial sources can be used, the audio signals can be prepared separately. However, when actual recordings need to be made, microphones should be used in various ways depending on the characteristics of the recorded scene.

One approach is to place individual microphones close to the sources, which enables direct capture of individual sound sources. In situations where placing individual microphones is impractical and the source directions are discrete, the blind source separation (BSS) technique developed in I-Lab enables extraction of individual audio objects from a recording made with the microphone array.

In relation to the network and communications research within the group, various spatial audio coding technologies are under investigation, which will enable bandwidth-efficient transmission of the created contents. MPEG-standardised codecs are utilised depending on the use cases, including MPEG Surround and AAC (Advanced Audio Coding).

The audio coding modules can be implemented to suit the application needs, considering the fidelity of reproduction and the allowable audio bandwidth. A new concept named Spatial Audio Object Coding (SAOC), which has recently been introduced and standardised, is under investigation as well. Based on MPEG Surround technology, it enables the mixing and transmission of audio objects, instead of the audio channels corresponding to transducers used in conventional audio coding techniques.

In order to ensure interactivity at the reproduction side, the mixer/renderer needs to be able to accept and process additional input from the end user.

This input contains the information regarding the output configuration (e.g. binaural, stereo, or 5.1-channel) and the user’s movement resulting in the change of viewpoint.

The audio decoding and rendering modules need to be able to reflect this change. One of the areas where interactivity in recorded spatial audio delivery becomes useful is production of multi-view video contents.

If changes in viewpoint are allowed through synchronous scene recordings with multiple cameras, the captured audio content should also be able to reflect the viewpoint changes. Object-oriented audio coding can provide this flexibility in spatial audio reproduction by applying the corresponding relative position changes to the separated sound sources.

Efficient coding techniques play an essential role in delivering high-quality multichannel audio, as used in home entertainment, digital audio broadcasting, computer games, music downloading and streaming services, as well as other internet applications such as teleconferencing.

The traditional approach to encoding multiple audio signals is to employ a discrete multichannel audio coder, while the state-of-the-art approach, exemplified by MPEG Surround (MPS), is known as spatial audio coding (SAC).

This approach consists of extracting spatial cues and downmixing the multiple audio channels into a mono or stereo signal. The downmixed signal is subsequently compressed by an existing audio encoder and transmitted along with the spatial parameters as side information. The analysis-by-synthesis (AbS) technique, a general method used in estimation and identification, is proposed as a framework for SAC.

A spatial synthesis block, similar to that performed at the decoder side, is embedded within the AbS-SAC encoder as a model for reconstructing multichannel audio signals.

Assuming that there is no channel error, the audio signals synthesized by the model at the encoder side will be exactly the same as the real reconstructed audio signals at the decoder side.

An error minimisation block compares the input signals with the reconstructed signals, and a closed-loop procedure is performed to find the optimum signals and parameters, defined as those giving the minimum error in the error minimisation block. The optimum downmix and residual signals, as well as the optimum spatial parameters, are then transmitted to the decoder.

The AbS framework is implemented in the MPEG Surround (MPS) architecture, where a tree of one-to-two (OTT) boxes is utilised. Because finding the optimum parameters exhaustively is an extremely complex task, a simplified structure is introduced, consisting of a modified reverse OTT (R-OTT) module and a sub-optimal trial-and-error procedure. A significant performance improvement is achieved compared with the open-loop MPS system, particularly for bit rates ranging from 200 to 800 kb/s.
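
The sketch below illustrates the closed-loop principle for a single two-channel box with a single channel-level-difference (CLD) parameter; the real AbS-MPS encoder additionally handles sub-bands, correlation parameters and residual signals, so this is only a conceptual, assumed simplification.

```python
import numpy as np

def abs_ott_encode(l, r, cld_grid_db=np.arange(-30, 31, 1.0)):
    """Closed-loop (analysis-by-synthesis) parameter selection for one
    two-to-one downmix box.

    Every quantised CLD candidate is used to re-synthesise the channel pair
    from the mono downmix, exactly as a decoder would, and the candidate
    giving the smallest waveform error against the originals is kept.
    """
    downmix = l + r
    best_cld, best_err = None, np.inf
    for cld in cld_grid_db:
        c = 10.0 ** (cld / 20.0)                  # linear level ratio between channels
        g_l = c / np.sqrt(1.0 + c ** 2)           # upmix gains derived from the CLD
        g_r = 1.0 / np.sqrt(1.0 + c ** 2)
        err = np.sum((l - g_l * downmix) ** 2) + np.sum((r - g_r * downmix) ** 2)
        if err < best_err:
            best_cld, best_err = cld, err
    return downmix, best_cld
```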

Wave field synthesis (WFS) is a practical application of Huygens' principle. The main feature of this methodology is its ability to reproduce, within a given domain, a spatial sound field equivalent to the one generated at the original source position, by using secondary sound sources. The secondary sources establish an active boundary, rendering a virtual, lifelike soundscape in the given region.

A WFS system offers some clear advantages over conventional approaches to spatial audio. For instance, when an ambisonics or 5.1 surround audio system is used to render spatial sound in a space, the intended spatial effects can generally be achieved only in close proximity to a single reference listening point. A solution based on WFS theory, on the other hand, is valid throughout the entire problem domain.

This capability of WFS is potentially very useful for applications involving listeners moving away from the reference listening position, or multiple listeners, as it enables them to experience realistic spatial sound effects everywhere in the space. WFS can therefore be considered a more general approach to spatial audio than the alternatives. It is most useful for sound performance on a large scale, in particular for delivering 3D spatial sound to a number of listeners in concert halls or cinemas.

Theoretically, the Rayleigh I integral allows the WFS system to be implemented with a series of acoustic monopole sources. Several simulation models have been developed at I-Lab for effective evaluation of the WFS algorithm in various environments. A simple example is one in which a virtual acoustic source is placed at x = 0, y = -2, outside a listening room, with a listener situated in the middle of the room.
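
A heavily simplified sketch of how per-loudspeaker driving signals might be derived for such a virtual point source is shown below; it keeps only the delay and distance-attenuation terms and omits the directivity and spectral shaping of the full Rayleigh I driving function, and all positions and parameter values are illustrative assumptions.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def wfs_driving_signals(src_signal, src_pos, speaker_positions, fs=48000):
    """Crude per-loudspeaker driving signals for a virtual point source:
    the source signal is delayed by the propagation time from the virtual
    source to each loudspeaker and attenuated with distance."""
    dists = [np.linalg.norm(np.asarray(p) - np.asarray(src_pos)) for p in speaker_positions]
    max_delay = int(round(max(dists) / C * fs))
    outputs = np.zeros((len(speaker_positions), len(src_signal) + max_delay))
    for i, dist in enumerate(dists):
        delay = int(round(dist / C * fs))        # propagation delay in samples
        gain = 1.0 / np.sqrt(max(dist, 0.1))     # simple distance attenuation
        outputs[i, delay:delay + len(src_signal)] = gain * src_signal
    return outputs

# Example from the text: virtual source at (0, -2) outside the room,
# with a line of loudspeakers along the wall y = 0
speakers = [(x, 0.0) for x in np.linspace(-2.0, 2.0, 16)]
```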

Implementation

In a practical realisation, the secondary sources can be implemented with arrays of discrete loudspeakers. Each loudspeaker signal is defined by in-house WFS software which calculates the driving functions for the virtual sources. The software has been developed in I-Lab in C++ under the MS Windows 7 environment. It is assumed that the measurement system obtains the audio signals and additional information describing the original sources beforehand.

All the loudspeaker panels are controlled individually by a PC and hardware control modules situated outside the studio. The control PC is equipped with HDSP cards. Digital signals are then fed into D/A converters.

Each of them converts 64 channels of the digital inputs into analogue outputs. Loudspeaker cables connecting the control modules with drivers are fed through the wall to each loudspeaker panel. The control software includes a series of audio APIs as libraries to implement complicated audio DSP techniques and WFS algorithms.

The WFS system is installed in a studio lab (4.2 m x 5.2 m) at the University of Surrey. A plan view of the WFS system in the lab is illustrated in figure 3. Ten loudspeaker panels, acting as secondary sources, stand close to the walls inside the studio.

Application for home users

In many practical cases, e.g. in ordinary private houses, the bulky loudspeaker arrays of typical WFS systems cannot be mounted permanently in a small room. From a practical point of view, the hardware of conventional WFS systems, especially the loudspeaker arrays, should be scaled down both in size and in number.

Unlike many other existing WFS methods, our approach aims to implement an optimised solution using a practically reduced number of drivers within a limited listening angle, corresponding to the viewing angle of a video screen.

By requiring only limited hardware resources, the proposed approach can provide a practical and cost-effective WFS system for general home users. In addition, the system can be equipped with simple and easy user interfaces. The research is supported and funded by the EU ICT Framework Programme 7 project ROMEO (Remote collaborative real-time multimedia experience over the future internet).

Contact us

Address
Centre for Vision Speech and Signal Processing
Alan Turing Building (BB)
University of Surrey
Guildford
Surrey
GU2 7XH