Dr Philip Jackson

Senior Lecturer in Machine Audition

Qualifications: MA, PhD

Phone (work): 01483 686044
Room no: 30 AB 05

Further information

Biography

Further details can be found on my personal web page.

Publications

Highlights

  • George S, Zielinski S, Rumsey F, Jackson PJB, Conetta R, Dewhirst M, Meares D, Bech S. (2010) 'Development and Validation of an Unintrusive Model for Predicting the Sensation of Envelopment Arising from Surround Sound Recordings'. Journal of the Audio Engineering Society, 58 (12), pp. 1013-1031.
  • Edge JD, Hilton A, Jackson PJB. (2009) 'Model-based synthesis of visual speech movements from 3D video'. EURASIP Journal on Audio, Speech, and Music Processing, 2009, Article number 597267, 12 pp.

    Abstract

    In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

  • Jackson PJB, Singampalli VD. (2009) 'Statistical identification of articulation constraints in the production of speech'. Speech Communication, 51 (8), pp. 695-710.
  • Shiga Y, Jackson PJB. (2008) 'Start- and end-node segmental-HMM pruning'. Electronics Letters, 44 (1), pp. 60-U77.

Journal articles

  • Liu Q, Wang W, Jackson P. (2012) 'Use of bimodal coherence to resolve the permutation problem in convolutive BSS'. Signal Processing, 92 (8), pp. 1916-1927.
  • Coleman P, Møller M, Olsen M, Olik M, Jackson P, Pedersen JA. (2012) 'Performance of optimized sound field control techniques in simulated and real acoustic environments'. Journal of the Acoustical Society of America, 131 (4, Acoustics 2012 Hong Kong), p. 3465.

    Abstract

    It is of interest to create regions of increased and reduced sound pressure ('sound zones') in an enclosure such that different audio programs can be simultaneously delivered over loudspeakers, thus allowing listeners sharing a space to receive independent audio without physical barriers or headphones. Where previous comparisons of sound zoning techniques exist, they have been conducted under favorable acoustic conditions, utilizing simulations based on theoretical transfer functions or anechoic measurements. Outside of these highly specified and controlled environments, real-world factors including reflections, measurement errors, matrix conditioning and practical filter design degrade the realizable performance. This study compares the performance of sound zoning techniques when applied to create two sound zones in simulated and real acoustic environments. In order to compare multiple methods in a common framework without unduly hindering performance, an optimization procedure for each method is first used to select the best loudspeaker positions in terms of robustness, efficiency and the acoustic contrast deliverable to both zones. The characteristics of each control technique are then studied, noting the contrast and the impact of acoustic conditions on performance.

  • Litwic Ł, Jackson PJB. (2011) 'Source localization and separation using random sample consensus with phase cues'. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 337-340.

    Abstract

    In this paper we present a system for localization and separation of multiple speech sources using phase cues. The novelty of this method is the use of Random Sample Consensus (RANSAC) approach to find consistency of interaural phase differences (IPDs) across the whole frequency range. This approach is inherently free from phase ambiguity problems and enables all phase data to contribute to localization. Another property of RANSAC is its robustness against outliers which enables multiple source localization with phase data contaminated by reverberation noise. Results of RANSAC based localization are fed into a mixture model to generate time-frequency binary masks for separation. System performance is compared against other well known methods and shows similar or improved performance in reverberant conditions.
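
    As a minimal illustration of the random sample consensus (RANSAC) idea applied to phase cues (not the published system; the single-source simplification, function names and parameters are assumptions), each iteration below hypothesises an interaural delay from one frequency bin's IPD and scores it by how many bins agree modulo 2π, so no explicit phase unwrapping is needed:

        import numpy as np

        def ransac_delay(freqs, ipds, max_delay=1e-3, n_iter=500, tol=0.2, seed=None):
            """Estimate a single interaural time delay from IPD observations (illustrative sketch)."""
            rng = np.random.default_rng(seed)
            best_tau, best_inliers = 0.0, -1
            for _ in range(n_iter):
                i = rng.integers(len(freqs))
                # candidate delays consistent with this bin's phase, over all 2*pi wraps in range
                kmax = int(np.ceil(max_delay * freqs[i])) + 1
                for k in range(-kmax, kmax + 1):
                    tau = (ipds[i] + 2 * np.pi * k) / (2 * np.pi * freqs[i])
                    if abs(tau) > max_delay:
                        continue
                    # residual phase error at every bin, wrapped to (-pi, pi]
                    res = np.angle(np.exp(1j * (ipds - 2 * np.pi * freqs * tau)))
                    inliers = np.sum(np.abs(res) < tol)
                    if inliers > best_inliers:
                        best_tau, best_inliers = tau, inliers
            return best_tau, best_inliers

        # toy usage: noisy IPDs generated by a 0.3 ms delay, wrapped to (-pi, pi]
        f = np.linspace(100, 8000, 200)
        ipd = np.angle(np.exp(1j * (2 * np.pi * f * 3e-4 + 0.1 * np.random.randn(f.size))))
        print(ransac_delay(f, ipd))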

  • George S, Zielinski S, Rumsey F, Jackson PJB, Conetta R, Dewhirst M, Meares D, Bech S. (2010) 'Development and Validation of an Unintrusive Model for Predicting the Sensation of Envelopment Arising from Surround Sound Recordings'. Journal of the Audio Engineering Society, 58 (12), pp. 1013-1031.
  • Edge JD, Hilton A, Jackson PJB. (2009) 'Model-based synthesis of visual speech movements from 3D video'. EURASIP Journal on Audio, Speech, and Music Processing, 2009, Article number 597267, 12 pp.

    Abstract

    In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

  • Jackson PJB, Singampalli VD. (2009) 'Statistical identification of articulation constraints in the production of speech'. Speech Communication, 51 (8), pp. 695-710.
  • Shiga Y, Jackson PJB. (2008) 'Start- and end-node segmental-HMM pruning'. Electronics Letters, 44 (1), pp. 60-U77.
  • Russell MJ, Zheng X, Jackson PJB. (2007) 'Modelling speech signals using formant frequencies as an intermediate representation'. IET Signal Processing, 1 (1), pp. 43-50.
  • Pincas J, Jackson PJB. (2006) 'Amplitude modulation of turbulence noise by voicing in fricatives'. Journal of the Acoustical Society of America, 120 (6), pp. 3966-3977.
  • Russell MJ, Jackson PJB. (2005) 'A multiple-level linear/linear segmental HMM with a formant-based intermediate layer'. Computer Speech and Language, 19 (2), pp. 205-225.
  • Pincas J, Jackson PJB. (2005) 'Amplitude modulation of frication noise by voicing saturates'. 9th European Conference on Speech Communication and Technology, pp. 349-352.

    Abstract

    The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.

  • Jackson PJB, Lo BH, Russell MJ. (2002) 'Data-driven, nonlinear, formant-to-acoustic mapping for ASR'. Electronics Letters, 38 (13), pp. 667-669.
  • Jackson PJB, Shadle CH. (2001) 'Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech'. IEEE Transactions on Speech and Audio Processing, 9 (7), pp. 713-726.
  • Jackson PJB, Shadle CH. (2000) 'Frication noise modulated by voicing, as revealed by pitch-scaled decomposition'. Journal of the Acoustical Society of America, 108 (4), pp. 1421-1434.

    Abstract

    A decomposition algorithm that uses a pitch-scaled harmonic filter was evaluated using synthetic signals and applied to mixed-source speech, spoken by three subjects, to separate the voiced and unvoiced parts. Pulsing of the noise component was observed in voiced frication, which was analyzed by complex demodulation of the signal envelope. The timing of the pulsation, represented by the phase of the anharmonic modulation coefficient, showed a step change during a vowel-fricative transition corresponding to the change in location of the sound source within the vocal tract. Analysis of fricatives /β, v, ð, z, ʒ, ʋ, ʕ/ demonstrated a relationship between steady-state phase and place, and f0 glides confirmed that the main cause was a place-dependent delay.

Conference papers

  • Liu Q, Wang W, Jackson P, Barnard M. (2012) 'Reverberant Speech Separation Based on Audio-visual Dictionary Learning and Binaural Cues'. Proc. IEEE Statistical Signal Processing Workshop (SSP), Ann Arbor, USA.
  • Liu Q, Wang W, Jackson PJB. (2011) 'A visual voice activity detection method with adaboosting'. IET Seminar Digest, London, UK: Sensor Signal Processing for Defence (SSPD 2011), 2011 (4).

    Abstract

    Spontaneous speech in videos capturing the speaker's mouth provides bimodal information. Exploiting the relationship between the audio and visual streams, we propose a new visual voice activity detection (VAD) algorithm, to overcome the vulnerability of conventional audio VAD techniques in the presence of background interference. First, a novel lip extraction algorithm combining rotational templates and prior shape constraints with active contours is introduced. The visual features are then obtained from the extracted lip region. Second, with the audio voice activity vector used in training, adaboosting is applied to the visual features, to generate a strong final voice activity classifier by boosting a set of weak classifiers. We have tested our lip extraction algorithm on the XM2VTS database (with higher resolution) and some video clips from YouTube (with lower resolution). The visual VAD was shown to offer low error rates.

  • Alinaghi A, Wang W, Jackson PJB. (2011) 'Integrating binaural cues and blind source separation method for separating reverberant speech mixtures'. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 209-212.

    Abstract

    This paper presents a new method for reverberant speech separation, based on the combination of binaural cues and blind source separation (BSS) for the automatic classification of the time-frequency (T-F) units of the speech mixture spectrogram. The main idea is to model interaural phase difference, interaural level difference and frequency bin-wise mixing vectors by Gaussian mixture models for each source and then evaluate that model at each T-F point and assign the units with high probability to that source. The model parameters and the assigned regions are refined iteratively using the Expectation-Maximization (EM) algorithm. The proposed method also addresses the permutation problem of the frequency domain BSS by initializing the mixing vectors for each frequency channel. The EM algorithm starts with binaural cues and after a few iterations the estimated probabilistic mask is used to initialize and re-estimate the mixing vector model parameters. We performed experiments on speech mixtures, and showed an average of about 0.8 dB improvement in signal-to-distortion ratio (SDR) over the binaural-only baseline.
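
    A toy sketch of the EM-refined masking step alone, using a single scalar cue (e.g. ILD in dB) per time-frequency unit; the paper combines IPD, ILD and mixing-vector models per frequency bin, so the names, initialisation and data below are illustrative assumptions:

        import numpy as np

        def em_tf_mask(cue, n_src=2, n_iter=50, seed=0):
            """Fit a 1-D Gaussian mixture to a per-unit cue with EM and return soft masks (sketch)."""
            rng = np.random.default_rng(seed)
            x = cue.ravel()
            mu = rng.choice(x, n_src)                  # initial means drawn from the data
            var = np.full(n_src, x.var() + 1e-6)       # initial variances
            w = np.full(n_src, 1.0 / n_src)            # mixture weights
            for _ in range(n_iter):
                # E-step: responsibility of each source for each time-frequency unit
                ll = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
                r = w * np.exp(ll)
                r /= r.sum(axis=1, keepdims=True) + 1e-12
                # M-step: update means, variances and weights
                nk = r.sum(axis=0) + 1e-12
                mu = (r * x[:, None]).sum(axis=0) / nk
                var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
                w = nk / nk.sum()
            return r.reshape(cue.shape + (n_src,))     # one soft mask per source

        # toy usage: an ILD-like cue drawn from two well-separated sources
        cue = np.concatenate([np.random.randn(500) - 6, np.random.randn(500) + 6]).reshape(50, 20)
        masks = em_tf_mask(cue)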

  • Liu Q, Naqvi SM, Wang W, Jackson PJB, Chambers J. (2011) 'Robust feature selection for scaling ambiguity reduction in audio-visual convolutive BSS'. Proc. 19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, pp. 1060-1064.
  • Liu Q, Wang W, Jackson PJB. (2010) 'Audio-visual Convolutive Blind Source Separation'. Sensor Signal Processing for Defence Conference Proceedings (SSPD 2010), London: Institution of Engineering and Technology, pp. 5-5.

    Abstract

    We present a novel method for separating speech signals from their audio mixtures using audio-visual coherence. It consists of two stages: in the off-line training process, we use the Gaussian mixture model to characterise statistically the audio-visual coherence with features obtained from the training set; at the separation stage, likelihood maximization is performed on the independent component analysis (ICA)-separated spectral components. To address the permutation and scaling indeterminacies of the frequency-domain blind source separation (BSS), a new sorting and rescaling scheme using the bimodal coherence is proposed. We tested our algorithm on the XM2VTS database, and the results show that our algorithm can address the permutation problem with high accuracy, and mitigate the scaling problem effectively.

  • Liu Q, Wang W, Jackson PJB. (2010) 'Use of Bimodal Coherence to Resolve Spectral Indeterminacy in Convolutive BSS'. Lecture Notes in Computer Science 6365, St. Malo, France: 9th International Conference on Latent Variable Analysis and Signal Separation (formerly the International Conference on Independent Component Analysis and Signal Separation), pp. 131-139.

    Abstract

    Recent studies show that visual information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterisation of the coherence between the audio and visual speech using, e.g. a Gaussian mixture model (GMM). In this paper, we present two new contributions. An adapted expectation maximization (AEM) algorithm is proposed in the training process to model the audio-visual coherence upon the extracted features. The coherence is exploited to solve the permutation problem in the frequency domain using a new sorting scheme. We test our algorithm on the XM2VTS multimodal database. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS.

  • Liu Q, Wang W, Jackson PJB. (2010) 'Bimodal Coherence based Scale Ambiguity Cancellation for Target Speech Extraction and Enhancement'. Proc. 11th Annual Conference of the International Speech Communication Association (Interspeech 2010), Makuhari, Japan, pp. 438-441.

    Abstract

    We present a novel method for extracting target speech from auditory mixtures using bimodal coherence, which is statistically characterised by a Gaussian mixture model (GMM) in the offline training process, using the robust features obtained from the audio-visual speech. We then adjust the ICA-separated spectral components using the bimodal coherence in the time-frequency domain, to mitigate the scale ambiguities in different frequency bins. We tested our algorithm on the XM2VTS database, and the results show the performance improvement with our proposed algorithm in terms of SIR measurements.

  • Jackson PJB, Dewhirst M, Conetta R, Zielinski S. (2010) 'Estimates of perceived spatial quality across the listening area'. Proceedings of the AES 38th International Conference: Sound Quality Evaluation, Piteå, Sweden, pp. 233-242.

    Abstract

    This paper describes a computational model for the prediction of perceived spatial quality for reproduced sound at arbitrary locations in the listening area. The model is specifically designed to evaluate distortions in the spatial domain such as changes in source location, width and envelopment. Maps of perceived spatial quality across the listening area are presented from our initial results.

  • Haq S, Jackson PJB. (2009) 'Speaker-dependent audio-visual emotion recognition'. Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP'09), Norwich, UK.
  • Barney A, Jackson PJB. (2009) 'A model of jet modulation in voiced fricatives'. Proc. Int. Conf. on Acoust. NAG-DAGA2009, Rotterdam, Netherlands, pp. 1733-1736.
  • Edge JD, Hilton A, Jackson PJB. (2009) 'Model-based synthesis of visual speech movements from 3D video'. Proceedings of ACM SIGGRAPH 2009: Posters, Louisiana, USA: SIGGRAPH '09
  • Soltuz SM, Wang W, Jackson PJB. (2009) 'A hybrid iterative algorithm for nonnegative matrix factorization'. Proc. 15th IEEE/SP Workshop on Statistical Signal Processing, Cardiff, Wales, pp. 409-412.
  • Jackson PJB, Singampalli VD. (2008) 'Coarticulatory constraints determined by automatic identification from articulograph data'. Proc. 8th Int. Sem. on Spch. Prod. (ISSP'08), Strasbourg, France, pp. 377-380.
  • Jackson PJB, Dewhirst M, Conetta R, Zielinski S, Rumsey F, Meares D, Bech S, George S. (2008) 'QESTRAL (Part 3): system and metrics for spatial quality prediction'. Proc. 125th Audio Engineering Society Convention, San Francisco, CA.

    Abstract

    The QESTRAL project aims to develop an artificial listener for comparing the perceived quality of a spatial audio reproduction against a reference reproduction. This paper presents implementation details for simulating the acoustics of the listening environment and the listener’s auditory processing. Acoustical modeling is used to calculate binaural signals and simulated microphone signals at the listening position, from which a number of metrics corresponding to different perceived spatial aspects of the reproduced sound field are calculated. These metrics are designed to describe attributes associated with the location, width and envelopment of a spatial sound scene. Each provides a measure of the perceived spatial quality of the impaired reproduction compared to the reference reproduction. As validation, individual metrics from listening test signals are shown to match the subjective results closely, and can be used to predict spatial quality for arbitrary signals.

  • Dewhirst M, Conetta R, Rumsey F, Jackson PJB, Zielinski S, George S, Bech S, Meares D. (2008) 'QESTRAL (Part 4): Test signals, combining metrics and the prediction of overall spatial quality'. Proc. 125th Audio Engineering Society Convention, San Francisco, CA.

    Abstract

    The QESTRAL project has developed an artificial listener that compares the perceived quality of a spatial audio reproduction to a reference reproduction. Test signals designed to identify distortions in both the foreground and background audio streams are created for both the reference and the impaired reproduction systems. Metrics are calculated from these test signals and are then combined using a regression model to give a measure of the overall perceived spatial quality of the impaired reproduction compared to the reference reproduction. The results of the model are shown to match closely the results obtained in listening tests. Consequently, the model can be used as an alternative to listening tests when evaluating the perceived spatial quality of a given reproduction system, thus saving time and expense.

  • George S, Zielinski S, Rumsey F, Conetta R, Dewhirst M, Jackson PJB, Meares D, Bech S. (2008) 'An Unintrusive Objective Model for Predicting the Sensation of Envelopment Arising from Surround Sound Recordings'. Proc. 125th AES Convention, San Francisco, CA.

    Abstract

    This paper describes the development of an unintrusive objective model, developed independently as a part of the QESTRAL project, for predicting the sensation of envelopment arising from commercially available 5-channel surround sound recordings. The model was calibrated using subjective scores obtained from listening tests that used a grading scale defined by audible anchors. For predicting subjective scores, a number of features based on Interaural Cross Correlation (IACC), Karhunen-Loeve Transform (KLT) and signal energy levels were extracted from recordings. The ridge regression technique was used to build the objective model and a calibrated model was validated using a listening test scores database obtained from a different group of listeners, stimuli and location. The initial results showed a high correlation between predicted and actual scores obtained from the listening tests.
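
    The ridge-regression step has a simple closed form; the sketch below (feature extraction is not shown, and the variable names and toy data are assumptions) maps a feature matrix to subjective envelopment scores:

        import numpy as np

        def ridge_fit(X, y, lam=1.0):
            """Closed-form ridge regression with an unpenalised intercept (illustrative)."""
            Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
            I = np.eye(Xb.shape[1])
            I[0, 0] = 0.0                                # do not shrink the intercept
            return np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)

        def ridge_predict(w, X):
            return np.column_stack([np.ones(len(X)), X]) @ w

        # toy usage: 40 recordings x 5 placeholder features against envelopment scores
        X = np.random.randn(40, 5)
        y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.1 * np.random.randn(40)
        w = ridge_fit(X, y, lam=0.5)
        predicted = ridge_predict(w, X)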

  • Conetta R, Rumsey F, Zielinski S, Jackson PJB, Dewhirst M, Bech S, Meares D, George S. (2008) 'QESTRAL (Part 2): Calibrating the QESTRAL model using listening test data'. Proc. 125th AES Convention, San Francisco, CA.

    Abstract

    The QESTRAL model is a perceptual model that aims to predict changes to spatial quality of service between a reference system and an impaired version of the reference system. To achieve this, the model required calibration using perceptual data from human listeners. This paper describes the development, implementation and outcomes of a series of listening experiments designed to investigate the spatial quality impairment of 40 processes. Assessments were made using a multi-stimulus test paradigm with a label-free scale, where only the scale polarity is indicated. The tests were performed at two listening positions, using experienced listeners. Results from these calibration experiments are presented. A preliminary study on the process of selecting stimuli is also discussed.

  • Rumsey F, Zielinski S, Jackson PJB, Dewhirst M, Conetta R, George S, Bech S, Meares D. (2008) 'QESTRAL (Part 1): Quality Evaluation of Spatial Transmission and Reproduction using an Artificial Listener'. Proc. 125th AES Convention, San Francisco, CA.

    Abstract

    Most current perceptual models for audio quality have so far tended to concentrate on the audibility of distortions and noises that mainly affect the timbre of reproduced sound. The QESTRAL model, however, is specifically designed to take account of distortions in the spatial domain such as changes in source location, width and envelopment. It is not aimed only at codec quality evaluation but at a wider range of spatial distortions that can arise in audio processing and reproduction systems. The model has been calibrated against a large database of listening tests designed to evaluate typical audio processes, comparing spatially degraded multichannel audio material against a reference. Using a range of relevant metrics and a sophisticated multivariate regression model, results are obtained that closely match those obtained in listening tests.

  • Jesus LMT, Jackson PJB. (2008) 'Frication and voicing classification'. Lecture Notes in Computer Science 5190: Computational Processing of the Portuguese Language (PROPOR 2008), Aveiro, Portugal, pp. 11-20.

    Abstract

    Phonetic detail of voiced and unvoiced fricatives was examined using speech analysis tools. Outputs of eight f0 trackers were combined to give reliable voicing and f0 values. Log-energy and Mel frequency cepstral features were used to train a Gaussian classifier that objectively labeled speech frames for frication. Duration statistics were derived from the voicing and frication labels for distinguishing between unvoiced and voiced fricatives in British English and European Portuguese.

  • Edge J, Hilton A, Jackson P. (2008) 'Parameterisation of Speech Lip Movements'. Proceedings of International Conference on Auditory-visual Speech Processing, Tangalooma, Australia: AVSP
  • Haq S, Jackson PJB, Edge J. (2008) 'Audio-visual feature selection and reduction for emotion classification'. Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP'08), Tangalooma, Australia.

    Abstract

    Recognition of expressed emotion from speech and facial gestures was investigated in experiments on an audio-visual emotional database. A total of 106 audio and 240 visual features were extracted, and features were then selected with the Plus l-Take Away r algorithm based on the Bhattacharyya distance criterion. In the second step, linear transformation methods, principal component analysis (PCA) and linear discriminant analysis (LDA), were applied to the selected features, and Gaussian classifiers were used for classification of emotions. The performance was higher for LDA features compared to PCA features. The visual features performed better than audio features, for both PCA and LDA. Across a range of fusion schemes, the audio-visual feature results were close to those of the visual features. The highest recognition rates achieved were 53% with audio features, 98% with visual features, and 98% with audio-visual features selected by Bhattacharyya distance and transformed by LDA.
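
    For a scalar feature with Gaussian class models, the Bhattacharyya-distance criterion used for ranking can be computed directly; the sketch below shows only the criterion (the Plus l-Take Away r search is not shown, and the data and names are placeholders):

        import numpy as np

        def bhattacharyya_gauss(x1, x2):
            """Bhattacharyya distance between two classes of a scalar feature, assuming Gaussians."""
            m1, m2 = x1.mean(), x2.mean()
            v1, v2 = x1.var() + 1e-12, x2.var() + 1e-12
            v = 0.5 * (v1 + v2)
            return 0.125 * (m1 - m2) ** 2 / v + 0.5 * np.log(v / np.sqrt(v1 * v2))

        # rank toy features by separability between two emotion classes (largest distance first)
        feats_a = np.random.randn(100, 6) + np.array([0.0, 1.0, 0.0, 2.0, 0.0, 0.5])
        feats_b = np.random.randn(100, 6)
        ranking = np.argsort([-bhattacharyya_gauss(feats_a[:, j], feats_b[:, j])
                              for j in range(feats_a.shape[1])])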

  • Longton JH, Jackson PJB. (2008) 'Parallel model combination and word recognition in soccer audio'. Proc. IEEE International Conference on Multimedia and Expo (ICME 2008), Hannover, Germany, pp. 1465-1468.

    Abstract

    The audio scene from broadcast soccer can be used for identifying highlights from the game. Audio cues derived from this source provide valuable information about game events, as can the detection of key words used by the commentators. In this paper we investigate the feasibility of incorporating both commentator word recognition and information about the additive background noise in an HMM structure. A limited set of audio cues, which have been extracted from data collected from the 2006 FIFA World Cup, are used to create an extension to the Aurora-2 database. The new database is then tested with various PMC models and compared to the standard baseline, clean and multi-condition training methods. It is found that incorporating SNR and noise type information into the PMC process is beneficial to recognition performance.

  • Conetta R, Dewhirst M, Rumsey F, Zielinski S, George S, Jackson PJB, Bech S, Meares D. (2008) 'Calibration of the QESTRAL model for the prediction of spatial quality'. Proceedings of the Institute of Acoustics, 30 (Part 6), Reproduced Sound 2008: Immersive Audio, pp. 280-289.
  • Singampalli VD, Jackson PJB. (2007) 'Statistical identification of critical, dependent and redundant articulators'. Proc. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, pp. 2736-2739.

    Abstract

    A compact, data-driven statistical model for identifying the roles played by articulators in the production of English phones, using 1D and 2D articulatory data, is presented. Articulators critical in the production of each phone were identified and were used to predict the pdfs of dependent articulators based on the strength of articulatory correlations. The performance of the model is evaluated on the MOCHA database using the proposed and exhaustive search techniques, and results for synthesised trajectories are presented.

  • Jackson PJB. (2007) 'Time-frequency-modulation representation of stochastic signals'. Proc. 15th International Conference on Digital Signal Processing (DSP 2007), Cardiff, pp. 639-642.

    Abstract

    When a noise process is modulated by a deterministic signal, it is often useful to determine the signal's parameters. A method of estimating the modulation index m is presented for noise whose amplitude is modulated by a periodic signal, using the magnitude modulation spectrum (MMS). The method is developed for application to real discrete signals with time-varying parameters, and extended to a 3D time-frequency-modulation representation. In contrast to squared-signal approaches, MMS behaves linearly with the modulating function allowing separate estimation of m for each harmonic. Simulations evaluate performance on synthetic signals, compared with theory, favouring a first-order MMS estimator.
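
    A rough numerical illustration of recovering a modulation index from the spectrum of a noise signal's envelope; this is a simplified stand-in for the paper's magnitude modulation spectrum (MMS) estimator, with the envelope method, constants and names chosen for the example:

        import numpy as np

        def modulation_index(x, fs, f_mod):
            """Rough estimate of modulation index m for amplitude-modulated noise (illustrative)."""
            win = int(fs / (4 * f_mod))                  # smooth over a fraction of the modulation period
            env = np.convolve(np.abs(x), np.ones(win) / win, mode='same')
            E = np.abs(np.fft.rfft(env - env.mean())) / len(env)
            freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
            k = np.argmin(np.abs(freqs - f_mod))         # bin nearest the modulation rate
            return 2 * E[k] / env.mean()                 # m is roughly 2|E(f_mod)| / E(0)

        # toy usage: white noise modulated at 120 Hz with m = 0.5
        fs, f0, m = 16000, 120.0, 0.5
        t = np.arange(2 * fs) / fs
        x = (1 + m * np.cos(2 * np.pi * f0 * t)) * np.random.randn(t.size)
        print(modulation_index(x, fs, f0))   # approaches 0.5, biased slightly low by the envelope smoothing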

  • Turkmani A, Hilton A, Jackson PJB, Edge J. (2007) 'Visual Analysis of Lip Coarticulation in VCV Utterances'. Proc. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, pp. 1281-1284.

    Abstract

    This paper presents an investigation of the visual variation of the bilabial plosive consonant /p/ in three coarticulation contexts. The aim is to provide detailed ensemble analysis to assist coarticulation modelling in visual speech synthesis. The underlying dynamics of labeled visual speech units, represented as lip shape, from symmetric VCV utterances are investigated. Variation in lip dynamics is quantitatively and qualitatively analyzed. This analysis shows that there are statistically significant differences in both the lip shape and trajectory during coarticulation.

  • Every MR, Jackson PJB. (2006) 'Enhancement of harmonic content of speech based on a dynamic programming pitch tracking algorithm'. Proc. Interspeech 2006 - 9th International Conference on Spoken Language Processing, Pittsburgh, PA, pp. 81-84.
  • Nadtoka N, Hilton A, Tena J, Edge J, Jackson PJB. (2006) 'Representing Dynamics of Facial Expression'. IET 3rd European Conference on Visual Media Production, p. 183.

    Abstract

    Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing the mocap facial dynamics in a high resolution 3D morphable model expression space. A data-driven approach to modelling of facial dynamics is presented. We propose a way to combine high quality static face scans with dynamic 3D mocap data which has lower spatial resolution in order to study the dynamics of facial expressions.

  • Pincas J, Jackson PJB. (2005) 'Amplitude modulation of frication noise by voicing saturates'. Proc. Interspeech '05, Lisbon, Portugal, pp. 349-352.

    Abstract

    The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.

  • Dewhirst M, Zielinski SK, Jackson PJB, Rumsey F. (2005) 'Objective assessment of spatial localisation attributes of surround-sound reproduction systems'. Proc. 118th AES Convention, Barcelona, Spain.
  • Ypsilos IA, Hilton A, Turkmani A, Jackson PJB. (2004) 'Speech Driven Face Synthesis from 3D Video'. Proc. 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), Thessaloniki, Greece, pp. 58-65.

    Abstract

    We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis, viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation, together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis have been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.

  • Russell MJ, Jackson PJB, Wong MLP. (2003) 'Development of articulatory-based multi-level segmental HMMs for phonetic classification in ASR'. Proceedings of EC-VIP-MC 2003 (4th EURASIP Conference on Video, Image Processing and Multimedia Communications), Vol. 2, Zagreb, Croatia, pp. 655-660.

    Abstract

    A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

  • Jackson PJB, Shadle CH. (2000) 'Frication noise modulated by voicing, as revealed by pitch-scaled decomposition'. 2nd International Conference on Voice Physiology and Biomechanics (ICVPB), Berlin, Germany; Journal of the Acoustical Society of America, 108 (4), pp. 1421-1434.
  • Jackson PJB, Shadle CH. (2000) 'Performance of the pitch-scaled harmonic filter and applications in speech analysis'. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2000), Istanbul, Turkey, pp. 1311-1314.

    Abstract

    The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their voiced and unvoiced constituents. In this paper, we evaluate its ability to reconstruct the time series of the two components accurately using a variety of synthetic, speech-like signals, and discuss its performance. These results determine the degree of confidence that can be expected for real speech signals: typically, a 5 dB improvement in the harmonics-to-noise ratio (HNR) in the anharmonic component. A selection of the analysis opportunities that the decomposition offers is demonstrated on speech recordings, including dynamic HNR estimation and separate linear prediction analyses of the two components. These new capabilities provided by the PSHF can facilitate discovering previously hidden features and investigating interactions of unvoiced sources, such as frication, with voicing.
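
    A much-simplified sketch of the pitch-scaled idea for a single frame containing an exact integer number of pitch periods: the harmonics of f0 then land on every b-th DFT bin, so harmonic and anharmonic parts can be separated by bin selection. The published PSHF adds windowing, overlap-add and power-based corrections not shown here; names and constants are illustrative:

        import numpy as np

        def pitch_scaled_split(frame, periods=4):
            """Split a frame of `periods` pitch periods into harmonic and anharmonic parts (illustrative)."""
            N = len(frame)
            X = np.fft.rfft(frame)
            harm = np.zeros_like(X)
            harm[::periods] = X[::periods]     # bins at multiples of `periods` carry the harmonics
            harm[0] = 0.0                      # exclude DC from the voiced estimate
            voiced = np.fft.irfft(harm, N)
            return voiced, frame - voiced      # harmonic part, anharmonic residual

        # toy usage: 4 periods of a 125 Hz "voiced" tone plus noise, fs = 16 kHz
        fs, f0 = 16000, 125.0
        N = int(4 * fs / f0)                   # exactly 4 pitch periods (512 samples)
        t = np.arange(N) / fs
        frame = np.sin(2 * np.pi * f0 * t) + 0.3 * np.random.randn(N)
        voiced, anharmonic = pitch_scaled_split(frame, periods=4)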

Book chapters

  • Haq S, Jackson PJB. (2010) 'Multimodal Emotion Recognition'. in Wang W (ed.) Machine Audition: Principles, Algorithms and Systems. IGI Global, Chapter 17, pp. 398-423.

    Abstract

    Recent advances in human-computer interaction technology go beyond the successful transfer of data between human and machine by seeking to improve the naturalness and friendliness of user interactions. An important augmentation, and potential source of feedback, comes from recognizing the user's expressed emotion or affect. This chapter presents an overview of research efforts to classify emotion using different modalities: audio, visual and audio-visual combined. Theories of emotion provide a framework for defining emotional categories or classes. The first step, then, in the study of human affect recognition involves the construction of suitable databases. The authors describe fifteen audio, visual and audio-visual data sets, and the types of feature that researchers have used to represent the emotional content. They discuss data-driven methods of feature selection and reduction, which discard noise and irrelevant information to maximize the concentration of useful information. They focus on the popular types of classifier that are used to decide to which emotion class a given example belongs, and methods of fusing information from multiple modalities. Finally, the authors point to some interesting areas for future investigation in this field, and conclude.

  • Jackson PJB. (2005) 'Mama and papa: the ancestors of modern-day speech science'. in Smith CUM, Arnott R (eds.) The Genius of Erasmus Darwin. Aldershot, UK: Ashgate, pp. 217-236.

Theses and dissertations

  • Jackson PJB. (2000) Characterisation of plosive, fricative and aspiration components in speech production.

    Abstract

    This thesis is a study of the production of human speech sounds by acoustic modelling and signal analysis. It concentrates on sounds that are not produced by voicing (although that may be present), namely plosives, fricatives and aspiration, which all contain noise generated by flow turbulence. It combines the application of advanced speech analysis techniques with acoustic flow-duct modelling of the vocal tract, and draws on dynamic magnetic resonance image (dMRI) data of the pharyngeal and oral cavities, to relate the sounds to physical shapes.

    Having superimposed vocal-tract outlines on three sagittal dMRI slices of an adult male subject, a simple description of the vocal tract suitable for acoustic modelling was derived through a sequence of transformations. The vocal-tract acoustics program VOAC, which relaxes many of the assumptions of conventional plane-wave models, incorporates the effects of net flow into a one-dimensional model (viz., flow separation, increase of entropy, and changes to resonances), as well as wall vibration and cylindrical wavefronts. It was used for synthesis by computing transfer functions from sound sources specified within the tract to the far field.

    Being generated by a variety of aero-acoustic mechanisms, unvoiced sounds are somewhat varied in nature. Through analysis that was informed by acoustic modelling, resonance and anti-resonance frequencies of ensemble-averaged plosive spectra were examined for the same subject, and their trajectories observed during release. The anti-resonance frequencies were used to compute the place of occlusion.

    In vowels and voiced fricatives, voicing obscures the aspiration and frication components. So, a method was devised to separate the voiced and unvoiced parts of a speech signal, the pitch-scaled harmonic filter (PSHF), which was tested extensively on synthetic signals. Based on a harmonic model of voicing, it outputs harmonic and anharmonic signals appropriate for subsequent analysis as time series or as power spectra. By applying the PSHF to sustained voiced fricatives, we found that, not only does voicing modulate the production of frication noise, but that the timing of pulsation cannot be explained by acoustic propagation alone.

    In addition to classical investigation of voiceless speech sounds, VOAC and the PSHF demonstrated their practical value in helping further to characterise plosion, frication and aspiration noise. For the future, we discuss developing VOAC within an arti
