Introduction
In February 2009 the National Research Council (NRC) Report to Congress on Strengthening Forensic Science in the United States found that:
- “[S]ome forensic disciplines are supported by little rigorous systematic research to validate the discipline’s basic premises and techniques. There is no evident reason why such research cannot be conducted” (p. 22).
- “The development of scientific research, training, technology, and databases associated with DNA analysis have resulted from substantial and steady federal support for both academic research and programs employing techniques for DNA analysis. Similar support must be given to all credible forensic science disciplines if they are to achieve the degrees of reliability needed to serve the goals of justice.” (p. 13)
Over the last decade, a small number of researchers (principally in Australia, Spain, and Switzerland) have been working on developing demonstrably valid and reliable forensic voice comparison with evidence evaluated using the same framework as is applied to the evaluation of DNA evidence.
Meanwhile in the Americas there has been little interest in this field of research.
The NRC report gives a new impetus for conducting forensic voice comparison research and holds out the hope for new funding opportunities in this area.
The 2nd Pan-American/Iberian Meeting on Acoustics provides an excellent opportunity to bring together researchers from Iberia and other parts of the world with researchers from the Americas to help foster research in this area in the Americas.
It also provides a venue for an exchange of ideas between researchers working on acoustic-phonetic and signal-processing approaches to forensic voice comparison.
Tutorial
Monday 15 November 2010 at 7:00 pm
The tutorial will present an introduction to the forensic evaluation of acoustic evidence using the same framework as is applied to the evaluation of DNA evidence.
Both acoustic-phonetic and signal-processing approaches to forensic voice comparison will be described.
The focus will be on evidence in the form of voice recordings, but the evaluative framework can also be applied to other forms of evidence, including audio recordings of other types of acoustic events. The tutorial should therefore be of value to anyone interested in forensic acoustics in general, not just forensic voice comparison.
Presenters:
- Geoffrey Stewart Morrison
- School of Language Studies, Australian National University
- School of Electrical Engineering & Telecommunications, University of New South Wales
- Daniel Ramos
- Biometric Recognition Group, Autonomous University of Madrid – Universidad Autónoma de Madrid
Both presenters are invited lecturers in the Judicial Phonetics Specialization in the Masters in Phonetics and Phonology Program of the Consejo Superior de Investigaciones Científicas [Spanish National Research Council] / Universidad Internacional Menéndez Pelayo.
They previously presented a similar tutorial at the International Speech Communication Association’s Interspeech 2008 conference.
The tutorial will be presented in English, but both presenters can also field questions in Spanish.
Lecture notes:
A pdf of the slides for the tutorial presentation will be made available on this website.
In conjunction with this tutorial, the publisher of Morrison’s new introduction to forensic voice comparison
- Morrison, G.S. (2010). Forensic voice comparison. In I. Freckelton, & H. Selby (Eds.), Expert Evidence (Ch. 99). Sydney, Australia: Thomson Reuters.
will make pdf downloads available at half-price. For more information see http://expert-evidence.forensic-voice-comparison.net/
Cost:
To partially defray the cost of the tutorial, a registration fee is charged. The fee is US$15 for registrations received by 5 October and US$25 at the meeting. For students with current ID, the fee is US$7 / US$12 respectively.
ASA Forensic Acoustics Group
On Tuesday 16 November 2010 at 7:30 pm there will be a meeting to propose the establishment of and organize a Forensic Acoustics Group within the Acoustical Society of America.
Special Session
Wednesday 17 November 2010
Sponsored by the Speech Communication Technical Committee.
Papers and posters on acoustic-phonetic and signal-processing approaches to demonstrably valid and reliable forensic evaluation of audio recordings of human voices and other acoustic events.
Invited presentations:
Wednesday 17 November 2010, 8:00 am – 11:40 am
Lecture presentations listed in order of presentation.
- Andrzej Drygajlo
- Speech Processing and Biometrics Group, Swiss Federal Institute of Technology at Lausanne – École Polytechnique Fédérale de Lausanne
- Value and interpretation of biometric evidence in forensic automatic speaker recognition
- Forensic speaker recognition (FSR) is the process of determining if a specific individual (suspected speaker) is the source of a questioned voice recording (trace). Forensic automatic speaker recognition (FASR) has proven an effective tool in the fight against crime, yet there is a constant need for more research due to the difficulties in adapting automatic methods of voice comparison to the forensic methodology that provides a coherent way of assessing and presenting recorded speech as scientific evidence. The ongoing paradigm shift in forensic speaker recognition requires biometric methods for calculating the value of the evidence and its strength, and for evaluating this strength under the operating conditions of casework. In such methods, the biometric evidence consists of the quantified degree of similarity between speaker-dependent features extracted from the trace and speaker-dependent features extracted from recorded speech of a suspect, represented by his/her model. This presentation aims at introducing deterministic and statistical automatic speaker recognition (ASR) methods that provide several ways of quantifying and presenting recorded voice as biometric evidence, as well as the assessment of its strength (likelihood ratio) in the Bayesian interpretation framework, including scoring and direct methods, compatible with interpretations in other forensic disciplines.
- Didier Meuwly
- Netherlands Forensic Institute – Nederlands Forensisch Instituut
- Forensic speaker recognition: Comparison and validation of automatic systems over 3 generations
- The first aim of this paper is to demonstrate the improvement of automatic speaker recognition systems used for forensic evaluation over a period of 12 years. The second aim consists of exploring how the results of different systems can be compared and their improvement measured. The same set of experiments is replicated on 3 different systems: the original LR-based ASR system from 1998 (ASPIC I), the EPFL ASR system of 2004 (ASPIC II) and the Agnitio ASR system (BATVOX) from 2010. The reference database, Polyphone, consisting of 2000 male and 2000 female speakers from the French part of Switzerland, and the forensic database, Polyphone-IPSC, consisting of 16 male and 16 female speakers from the same region, were used to test the following forensic conditions: spontaneous speech, disguised speech, PSTN, GSM, signal-to-noise ratio from +30 dB to –3 dB, digital and analogue recording, close set of family-related speakers. The results are visualized using Tippett plots. The refinement and calibration of the systems are measured and compared using the log-likelihood-ratio cost (Cllr). Finally, the value, in terms of forensic validation, of the methodology used and the results produced are discussed.
- Philip Rose
- School of Language Studies, Australian National University
- Combining linguistic and non-linguistic information in likelihood-ratio-based forensic voice comparison: A hybrid automatic-traditional system
- In the last decade, forensic voice comparison has experienced a remarkable paradigm shift [Morrison, Sci. Justice 49, 298–308 (2009)]. Both automatic and traditional phonetic approaches have been developed within the new paradigm. The main difference is that traditional approaches are typically local in both time and frequency domains, with features like formant frequencies extracted from linguistically comparable items (e.g., words or phonemes), whereas automatic approaches are typically global, with long-term spectral properties used and linguistic information treated as noise. Since neither makes use of all the information present, combining them could improve performance. A fully-automatic and a partially-traditional system were compared. Data were pairs of non-contemporaneous landline-telephone recordings of 60 speakers from the Japanese National Research Institute of Police Science database (net 35–40 s speech per recording). In the fully-automatic system the whole speech-active portion of the recording was analyzed using 12th order LPCCs, mean cepstral subtraction, GMM-UBM, and logistic-regression calibration. In the partially-traditional system the same procedures were applied only to tokens of [oː], [N], and [ɕ] extracted from the recordings, with logistic-regression fusion of the results. The performance of each system and the fusion of the two were compared using the log-likelihood-ratio cost (Cllr).
- Michael Jessen & Timo Becker
- Department of Speaker Identification and Audio Analysis, German Federal Police Office – Bundeskriminalamt
- Long-Term Formant Distribution as a forensic-phonetic feature
- With the Long-Term Formant Distribution (LTF) method [F. Nolan and C. Grigoras, Int. J. Speech, Lang. Law 12, 142–173 (2005)], manually-corrected LPC-based formant tracks are extracted over all vocalic portions of the recording of a speaker in which the formants F2 and F3 are sufficiently well-structured. LTF analysis has been successfully added to the inventory of phonetic features that are used in voice comparison casework. Current research in our lab has highlighted a number of advantages of the LTF method, including high inter-expert reliability (different phoneticians when using the method arrive at highly consistent results), anatomical motivation (Long-Term F2 and F3 are negatively correlated with speaker height), and language independence (LTF patterns in different languages – so far German, Russian, and Albanian – do not differ significantly). Presently, quantitative measures of inter-individual variation in case data are investigated, including Equal Error Rates (EER) and calibrated Likelihood Ratios (LR) based on Gaussian Mixture Modeling (GMM) of the formant tracking raw data [T. Becker, M. Jessen and C. Grigoras, Proc. Interspeech 2008, 1505–1508]. The final inter-individual variation results will be presented, along with results on how the LTF method compares to automatic speaker recognition, which is applied to the same data.
- Daniel Ramos, Javier González-Domínguez, & Joaquín González-Rodríguez
- Biometric Recognition Group, Autonomous University of Madrid – Universidad Autónoma de Madrid
- High-Performance session variability compensation in forensic automatic speaker recognition
- Recently, the main performance improvement in automatic speaker recognition technology has been due to session variability compensation techniques, mainly based on factor analysis (FA), which have reduced the Equal Error Rate (EER) of state-of-the-art systems by a factor of ten in less than five years (e.g., EER<2% for NIST SRE 2008 telephone speech). Moreover, once speech features are extracted, such systems are able to compute millions of comparisons a thousand times faster than real time. However, some challenges remain, because if there is a mismatch between the conditions of the FA training database and the speech used for comparison, the effectiveness of the compensation significantly decreases. This problem is especially relevant in forensic voice comparison, where the availability of speech matching operational conditions is usually sparse. In this presentation we show the impact of this effect in realistic simulated case studies. We use the Baeza - Ahumada IV database, which contains speech acquired with the Spanish Guardia Civil facilities, used in their daily work. We also present algorithms to handle sparsity in the data used for training FA models. Finally, we outline future research plans in order to improve session variability compensation performance in forensically realistic conditions.
- Geoffrey Stewart Morrison¹, Julien Epps¹, Philip Rose², Tharmarajah Thiruvaran¹, & Cuiling Zhang³
- ¹School of Electrical Engineering & Telecommunications, University of New South Wales
- ²School of Language Studies, Australian National University
- ³Department of Forensic Science & Technology, China Criminal Police University
- Measuring reliability in forensic voice comparison
- Recently there has been a great deal of concern in forensic science about validity and reliability (accuracy and precision). The log-likelihood-ratio cost (Cllr), developed for automatic speaker recognition, is increasingly applied as a standard measure of accuracy in forensic voice comparison, but so far there has been little work on developing a metric of precision within this field. Because voice data can have a large amount of intrinsic variation at the source, and likelihood ratios are typically calculated using a single suspect recording and a single offender recording, assessing the precision of a forensic-voice-comparison system is extremely important. This presentation discusses the importance of measuring precision and describes two procedures, one parametric and one non-parametric, for calculating 95% credible intervals for the likelihood ratios resulting from running tests of forensic-voice-comparison systems (in which some comparisons are known to be same-speaker comparisons and others are known to be different-speaker comparisons). Examples are drawn from both acoustic-phonetic and automatic forensic-voice-comparison systems.
- Jeff Boyczuk
- Audio and Video Analysis Section, Royal Canadian Mounted Police – Gendarmerie royale du Canada
- Factors affecting the intelligibility of recorded speech: Considerations for forensic audio “best evidence”
- Derived from a traditional common law rule, the “best evidence” standard as applied to recorded audio prescribes that an original recording, and not a duplicated or altered copy, will be presented in legal proceedings. The intent of this standard is to ensure the integrity of the original evidence is preserved, such that a court is reasonably assured it is being presented with the most complete and accurate record of the recorded evidence. However, when considering forensic audio recordings of speech, which are frequently made in adverse acoustic environments, presentation of such recordings in their original form may not afford a court the opportunity for a complete and accurate assessment of the evidence in question – namely, what words are being spoken on the recording? The current paper summarizes the technological and listener-based factors that should be considered when speech intelligibility is of prime importance in meeting the best evidence standard for presentation of forensic audio in court proceedings. Illustrative examples from recent court cases will be provided.
- Ray Bull
- Forensic Section, School of Psychology, University of Leicester
- Witnesses’/Victims’ recognition of a once-heard voice
- This presentation summarises the results of three decades of research testing the validity of lay persons’ (e.g. witnesses’/victims’) ability to recognize the voice of a once-before heard stranger (e.g. a crime perpetrator). Studies around the world have consistently found that people are usually very poor at this task, even with short delays and adequate lengths of speech. Some courts have taken notice of this research and caution witnesses accordingly; however, in many cases voice line-ups have been poorly constructed and are therefore invalid. The final part of the presentation provides an account of some court cases in which I have participated as an expert.
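Several of the abstracts above evaluate systems using the log-likelihood-ratio cost (Cllr). The following is a minimal sketch only, assuming the standard definition of Cllr from the automatic-speaker-recognition literature; the function name and the likelihood-ratio values are hypothetical and are not taken from any of the presentations.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost for a set of test comparisons.

    same_speaker_lrs: likelihood ratios from comparisons known to be
        same-speaker (ideally much greater than 1).
    different_speaker_lrs: likelihood ratios from comparisons known to be
        different-speaker (ideally much less than 1).
    Lower values indicate better accuracy.
    """
    # Penalty for same-speaker comparisons grows as the LR falls towards 0.
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs)
    ss_cost /= len(same_speaker_lrs)
    # Penalty for different-speaker comparisons grows as the LR increases.
    ds_cost = sum(math.log2(1 + lr) for lr in different_speaker_lrs)
    ds_cost /= len(different_speaker_lrs)
    return 0.5 * (ss_cost + ds_cost)


# A system that always answers LR = 1 conveys no information: Cllr is 1.0.
print(cllr([1.0], [1.0]))

# Hypothetical test results from a reasonably informative system.
print(cllr([50.0, 8.0, 120.0], [0.02, 0.3, 0.005]))
```

A system whose Cllr exceeds 1 is performing worse than one that always reports a likelihood ratio of 1, which is why values well below 1 are sought.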
Contributed presentations:
Wednesday 17 November 2010, 1:00 pm – 3:00 pm
Poster presentations listed in order of submission.
- Cuiling Zhang¹, Geoffrey Stewart Morrison²
- ¹Department of Forensic Science & Technology, China Criminal Police University
- ²School of Electrical Engineering & Telecommunications, University of New South Wales
- Accuracy and precision of forensic voice comparison using the Chinese /iao/ triphthong
- Some studies on forensic voice comparison have fitted parametric curves to the formant trajectories of diphthongs spoken in controlled phonetic environments, and have obtained results with a high degree of validity. The present study fits parametric curves to the formant trajectories of tokens of the Standard-Chinese /iao/ triphthong extracted from telephone conversations in which there was no control over the phonetic context. Two non-contemporaneous recordings from each of 60 female speakers were analysed. Likelihood ratios were calculated for a test set in which some comparisons are known to be same-speaker comparisons and others known to be different-speaker comparisons. The accuracy and precision (validity and reliability) of the test results were calculated using the log-likelihood-ratio cost (Cllr) and an estimate of their 95% credible interval respectively.
- Harry Hollien
- Speaker identification: The case for speech vector analysis
- Eugenia San Segundo Fernández
- Laboratorio de Fonética, Consejo Superior de Investigaciones Científicas – Phonetics Laboratory, Spanish National Research Council / Universidad Internacional Menéndez Pelayo
- Parametric representations of the formant trajectories of Spanish vocalic sequences for likelihood-ratio-based forensic voice comparison
- Non-contemporaneous speech samples from 30 Spanish male speakers were compared within the forensic-likelihood-ratio framework. The acoustic parameters studied were the formant trajectories of a series of vocalic sequences, /ue/ /ie/ /ia/ /ai/ (pronounced as diphthongs and in hiatus), in order to analyze their suitability for forensic voice comparison. Following Morrison [J. Acoust. Soc. Am. 125, 2387–2397 (2009)], parametric curves (polynomials and discrete cosine transforms) were fitted to these formant trajectories. The estimated coefficient values from the parametric curves were used as input to a multivariate-kernel-density formula for calculating likelihood ratios expressing the probability of obtaining the observed differences between two speech samples under two opposing hypotheses: that the samples were produced by the same speaker and that the samples were produced by different speakers. Cross-validated likelihood-ratio results from systems based on different parametric curves were calibrated and evaluated using the log-likelihood-ratio-cost function (Cllr). The cross-validated likelihood ratios from the best-performing system for each vocalic sequence were fused using logistic regression.
- Christin Kirchhübel
- Department of Electronics, University of York
- The effects of Lombard speech on vowel formant measurements
- This study analyses the effects of Lombard speech on vowel formant frequencies. Ten male native German speakers were selected from the ‘Pool 2010’ corpus, which was recorded at the Bundeskriminalamt (BKA), Germany. Spontaneous speech produced in a neutral setting and in a Lombard setting, where 80 dB of noise was played through headphones, was analysed. Measurements of F1, F2 and F3 were collected from 10 vowel categories for every speaker in both conditions. The results agree with previous findings in that F1 is consistently higher in the Lombard condition. The effect on F2 is very variable and complex. F3 was less affected than F1 and F2, but changes were present, especially for speakers with low F3s in modal speech. Differences could be observed among vowel categories. Inter-speaker variability was found to be large with respect to the size of increase in F1 and the direction and size of change in F2. The findings are discussed in light of the articulatory changes that have been associated with Lombard speech, and the implications for forensic speaker comparison are spelled out.
- Ewald Enzinger
- Acoustics Research Institute, Austrian Academy of Sciences
- Measuring the effects of adaptive multi-rate (AMR) codecs on formant tracker performance
- Several approaches to forensic speaker comparison rely on formant centre-frequency measurements as features due to their rather straightforward interpretation as resonance frequencies of the cavities of the human vocal tract. Formant tracking algorithms, mostly based on linear predictive coding (LPC), are commonly used for automatic extraction. Telephone conversations constitute a substantial amount of forensic material, which increasingly involves wireless communication channels instead of landline transmission. The effects and limitations introduced by the Adaptive Multirate (AMR) set of codecs that is used for speech transmission in GSM and UMTS networks are therefore of special interest in forensic settings. To evaluate the extent of the effects that are caused solely by the codecs, speech recordings were en- and decoded with the different bitrate levels provided by the AMR (narrowband) codec. The formant frequencies of vowel segments were extracted using different trackers and settings. The preliminary results suggest partial shifts in frequency depending on codec level and individual speakers, but no consistent trend emerges.
- Alejandro Wang
- Forensic voice comparison based on nasal formants
- Cassie Dallasarra
- Speaker Identification: Effects of noise, telephone bandwidth and word count on accuracy
- Daniel García-Romero
- Automatic speaker recognition: Advances towards informative systems
- Jeff Boyczuk¹, David Luknowsky¹, Bradford Gover², Heping Ding³
- ¹Audio and Video Analysis Section, Royal Canadian Mounted Police – Gendarmerie royale du Canada
- ²Institute for Research in Construction, National Research Council Canada – Conseil national de recherches Canada
- ³Institute for Microstructural Sciences, National Research Council Canada – Conseil national de recherches Canada
- Improving the speech intelligibility of forensic audio recordings through adaptive filtering with non-synchronous interference signals
- Forensic audio recordings are frequently made in uncontrollable acoustic environments where background sound emanating from television, radio, and video or music playback may interfere with the intelligibility of the intended “target” speech on a recording. In such cases, adaptive filtering techniques have proven highly effective in eliminating the interfering sound sources and improving intelligibility, provided that the interfering reference signal was acquired simultaneously with the target speech. However, in cases where interfering signals are acquired through a post hoc retrieval of broadcast, music or video recordings, non-linear time base differences between the original and the secondarily-acquired reference may significantly lessen the effectiveness of conventional adaptive filtering techniques in improving speech intelligibility. The current paper describes the results in applying a commercially-available adaptive filtering tool, as well as a newly developed tool, Drift-Compensated Adaptive Filtering (DCAF), for improving the intelligibility of recorded speech when utilizing a non-synchronously acquired reference signal. Listening tests show an overall improvement in speech intelligibility through the application of adaptive filtering with non-synchronous reference signals, with greater intelligibility for DCAF-processed audio recordings as compared to recordings processed with conventional adaptive filtering techniques.
- Sandra Ferrari Disner
- The fine structure of phonation as a biometric
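Two of the abstracts above fit parametric curves, including discrete cosine transforms (DCTs), to formant trajectories and use the coefficients as features. The following is a minimal sketch of such a fit, not the authors' implementation: it assumes equally spaced formant measurements, the function names fit_dct and reconstruct are hypothetical, and the F2 values are invented for illustration.

```python
import math


def fit_dct(trajectory_hz, order=2):
    """Least-squares fit of cosine basis functions 0..order to a trajectory.

    trajectory_hz: equally spaced formant measurements (Hz) across a token.
    Returns one coefficient per basis function. Because the cosine basis is
    orthogonal on this sampling grid, each coefficient can be obtained by
    projecting the trajectory onto its basis function.
    """
    n = len(trajectory_hz)
    if order >= n:
        raise ValueError("order must be smaller than the number of samples")
    coeffs = []
    for k in range(order + 1):
        basis = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
        projection = sum(y * b for y, b in zip(trajectory_hz, basis))
        norm = sum(b * b for b in basis)
        coeffs.append(projection / norm)
    return coeffs


def reconstruct(coeffs, n):
    """Evaluate the fitted curve at the n original sample positions."""
    return [
        sum(c * math.cos(math.pi * k * (i + 0.5) / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


# Invented F2 trajectory (Hz) for a single diphthong token.
f2 = [1210.0, 1260.0, 1340.0, 1450.0, 1580.0, 1700.0, 1790.0, 1840.0]
coeffs = fit_dct(f2, order=2)
print(coeffs)                      # three features summarising the token
print(reconstruct(coeffs, len(f2)))  # smoothed trajectory
```

The zeroth coefficient is the mean of the trajectory, and higher coefficients capture its overall slope and curvature, so a token with many measurements is reduced to a handful of values.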
Submissions:
The call for papers is now closed.
Registration
Registration information
Links
Acoustical Society of America
Iberoamerican Federation of Acoustics – Federación/Federação Iberoamericana de Acústica
Mexican Institute of Acoustics – Instituto Mexicano de Acústica
Forensic Voice Comparison parent website