"Millions of voiceprints quietly being harvested as latest identification tool", The Guardian (AP), 10/13/2014:
Over the telephone, in jail and online, a new digital bounty is being harvested: the human voice.
Businesses and governments around the world increasingly are turning to voice biometrics, or voiceprints, to pay pensions, collect taxes, track criminals and replace passwords.
The article lists some successful applications:
Barclays plc recently experimented with voiceprinting as an identification for its wealthiest clients. It was so successful that Barclays is rolling it out to the rest of its 12 million retail banking customers.
“The general feeling is that voice biometrics will be the de facto standard in the next two or three years,” said Iain Hanlon, a Barclays executive.
It's interesting — and a little disturbing — that the antique term "voiceprint" is being revived. This term was promoted in the early 1960s by a company aiming to show that appropriately-trained forensic experts could use sound spectrograms (in those days on paper) to identify speakers, just as they used fingerprints to identify the individuals who left them. That theory was profoundly misleading, and was effectively discredited at least from the point of view of the acceptability of such evidence in American courts — see "Authors vs. Speakers: A Tale of Two Subfields", 7/27/2011, for some details.
The Guardian/AP article goes further in promoting this bogus analogy:
Vendors say the timbre of a person’s voice is unique in a way similar to the loops and whorls at the tips of someone’s fingers.
Their technology measures the characteristics of a person’s speech as air is expelled from the lungs, across the vocal folds of the larynx, up the pharynx, over the tongue, and out through the lips, nose and teeth. Typical speaker recognition software compares those characteristics with data held on a server. If two voiceprints are similar enough, the system declares them a match.
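To make that comparison step concrete, here is a minimal sketch (mine, not any vendor's actual code) of the kind of decision such a system makes, assuming each "voiceprint" has already been reduced to a fixed-length feature vector (an i-vector or similar embedding) and that similarity is measured by something like cosine distance:

```python
# A minimal sketch of the "similar enough => match" logic described above.
# The embedding extraction, similarity measure, and threshold are all
# assumptions for illustration, not any vendor's actual method.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled_print: np.ndarray, test_print: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Declare a match if the stored and incoming voiceprints are
    'similar enough'. The threshold here is arbitrary; a real system
    would calibrate it on development data."""
    return cosine_similarity(enrolled_print, test_print) >= threshold
```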
But this is not to say that speech biometrics are invalid. As the cited post explains, in 1996 the National Institute of Standards and Technology (NIST) began holding regular formal evaluations of speaker-recognition technology, with the result that we have credible evidence of major improvements in the underlying technology, and also credible methods for evaluating the actual performance of specific algorithms as applied to specific problems.
The most recent completed NIST "Speaker Recognition Evaluation" was in 2012, and the results are available on line. A good review of the past two decades has just been published – Joaquin Gonzalez-Rodriguez, "Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014)", Loquens 1.1, 2014:
Automatic Speaker Recognition systems show interesting properties, such as speed of processing or repeatability of results, in contrast to speaker recognition by humans. But they will be usable just if they are reliable. Testability, or the ability to extensively evaluate the goodness of the speaker detector decisions, becomes then critical. In the last 20 years, the US National Institute of Standards and Technology (NIST) has organized, providing the proper speech data and evaluation protocols, a series of text-independent Speaker Recognition Evaluations (SRE). Those evaluations have become not just a periodical benchmark test, but also a meeting point of a collaborative community of scientists that have been deeply involved in the cycle of evaluations, allowing tremendous progress in a specially complex task where the speaker information is spread across different information levels (acoustic, prosodic, linguistic…) and is strongly affected by speaker intrinsic and extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology including new speaking conditions and sources of variability, and how the scientific community gave answers to those demands. Finally, NIST SREs will be shown to be not free of inconveniences, and future challenges to speaker recognition assessment will also be discussed.
Most of the NIST speaker-recognition evaluations have involved so-called "text independent" detection, where the task is to determine whether two samples of uncontrolled content come from the same speaker or not. For the biometric applications discussed in the Guardian/AP article, "text dependent" tasks seem to be more relevant, where the reference and the test utterances are supposed to be repetitions of a known phrase. An excellent recent discussion of this technology has also just been published — Anthony Larcher et al., "Text-dependent speaker verification: Classifiers, databases and RSR2015", Speech Communication 2014:
The RSR2015 database, designed to evaluate text-dependent speaker verification systems under different durations and lexical constraints has been collected and released by the Human Language Technology (HLT) department at Institute for Infocomm Research (I2R) in Singapore. English speakers were recorded with a balanced diversity of accents commonly found in Singapore. More than 151 h of speech data were recorded using mobile devices. The pool of speakers consists of 300 participants (143 female and 157 male speakers) between 17 and 42 years old making the RSR2015 database one of the largest publicly available database targeted for text-dependent speaker verification. We provide evaluation protocol for each of the three parts of the database, together with the results of two speaker verification system: the HiLAM system, based on a three layer acoustic architecture, and an i-vector/PLDA system. We thus provide a reference evaluation scheme and a reference performance on RSR2015 database to the research community. The HiLAM outperforms the state-of-the-art i-vector system in most of the scenarios.
Here's a DET ("Detection Error Tradeoff") curve from that paper:
This plot shows what happens when you adjust the algorithm to trade off misses (where the two speakers are the same but the algorithm says they're different) against false alarms (where the two speakers are different but the algorithm says they're the same). The "equal error rate" — the point on the DET curve where the two kinds of error are equally likely — is 2.47% for their best algorithm, which they claim to be better than the state-of-the-art algorithms that are likely to be used in the commercial systems discussed in the article.
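For readers who want to see how a number like that 2.47% comes about, here is a toy sketch of the calculation, using simulated same-speaker and different-speaker trial scores (not data from Larcher et al.): sweep the decision threshold, compute the miss and false-alarm rates at each setting, and find the point where they cross.

```python
# Toy computation of a DET curve and equal error rate (EER) from trial
# scores. The score distributions below are simulated for illustration.
import numpy as np

def det_points(genuine_scores, impostor_scores):
    """Return (threshold, miss_rate, false_alarm_rate) triples obtained by
    sweeping the decision threshold over all observed scores."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    points = []
    for t in np.sort(np.concatenate([genuine, impostor])):
        miss = np.mean(genuine < t)           # same speaker, but rejected
        false_alarm = np.mean(impostor >= t)  # different speaker, but accepted
        points.append((t, miss, false_alarm))
    return points

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the point on the DET curve where the two error rates are equal."""
    pts = det_points(genuine_scores, impostor_scores)
    t, miss, fa = min(pts, key=lambda p: abs(p[1] - p[2]))
    return (miss + fa) / 2

# Simulated scores: same-speaker trials score higher on average.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 5000)
impostor = rng.normal(0.0, 1.0, 5000)
print(f"EER = {equal_error_rate(genuine, impostor):.2%}")
```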
Of course, in a real-world speaker verification application, you would probably choose a different operating point, say increasing the miss probability a bit in order to let fewer imposters through. But in any case, a biometric method with this level of performance is clearly useful, though not exactly perfect.
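Continuing the toy sketch above (and reusing its det_points() function and simulated scores), choosing such an operating point just means picking the threshold that holds the false-alarm rate at or below some target, and accepting whatever miss rate comes with it:

```python
# Reuses det_points(), genuine, and impostor from the previous sketch.
# Pick the lowest threshold whose false-alarm rate meets the target,
# which gives the lowest miss rate compatible with that target.
def threshold_for_target_far(genuine_scores, impostor_scores, target_far=0.005):
    for t, miss, fa in det_points(genuine_scores, impostor_scores):
        if fa <= target_far:
            return t, miss, fa
    raise ValueError("no threshold achieves the target false-alarm rate")

t, miss, fa = threshold_for_target_far(genuine, impostor, target_far=0.005)
print(f"threshold={t:.2f}: false alarms {fa:.2%}, misses {miss:.2%}")
```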
AGNITiO is one of the vendors cited in the Guardian/AP article. There's a page on their site entitled "How accurate is voice biometric technology?", which explains the standard evaluation method clearly, but unfortunately the sentence "The charts below show the probability and accuracy of the AGNITiO voice identification engine, using both a pass phrase (text-dependent) and free-form speech" is followed, alas, by no charts… However, there's another AGNITiO page "Is speaker verification reliable?" which says that
In practice, the conditions surrounding the application, the setting of thresholds and the method of decision employed can impact the error rates in a significant way. It is not the same that voices are captured in an office environment using a land line, than to capture the same voices from a mobile (cellular) phone in situations where there is a lot of background noise such as in a car or in a public place. Therefore, it is risky to give statistics and error rates without knowledge of the concrete application. We can say, however, that for certain real applications the threshold can be set in a way that we have an false acceptance rate below 0.5% with false rejection rates of less than 5%.
Thus they are claiming to do a little bit better than the DET curve from Larcher et al. would allow, though the AGNITiO number seems to be a sort of guess about how things are likely to come out in a typical case, rather than the actual value from an independently-verifiable test.
ValidSoft, another vendor cited in the article, claims that "ValidSoft’s voice biometric engine has been independently tested and endorsed as one of the leading commercial engines available in the market today", but they don't link to the independent tests or endorsements, and I haven't been able to find any documentation of their system's performance. There's a press release "ValidSoft Voice Biometric Technology Comes of Age" which says that
ValidSoft [...] today announced that it has achieved its primary objective for its voice biometric technology of successfully being measured and compared with the world's leading commercial voice biometric providers and non-commercial voice biometrics research groups. As a result, ValidSoft clients are assured that they have access to a fully integrated, multi-layer, multi-factor authentication and transaction verification platform that can seamlessly incorporate world-leading voice biometric technology.
ValidSoft achieved its objective by successfully taking part in the prestigious National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation. This evaluation is the most important and main driver for innovation in the field of voice biometrics, particularly security.
But they don't show us their DET curves, or any other quantitative outcomes, while (as you can see in the DET curve from one of the SRE 2012 conditions) there's a fair amount of spread in system performance:
Anyhow, this is a completely different scenario from the "Deception Detection" domain. Here we know what the algorithms are, and there are independent benchmark tests that evaluate the overall performance of the state-of-the-art technology, and compare different systems in that framework. In the case of commercial speech-based deception-detection products, the algorithms are secret, and the performance claims are effectively meaningless (see "Speech-based 'lie detection'? I don't think so", 11/10/2011).
Note that Ulery et al., "Accuracy and reliability of forensic latent fingerprint decisions", PNAS 2011, found a 0.1% false positive rate with a 7.5% miss rate, so that in fact voice biometrics are now roughly as accurate as (latent) fingerprint identification, at least in some versions of the speaker-recognition task. Obviously the error rates for comparison of carefully-collected full fingerprint sets should be a great deal lower.
[h/t Simon Musgrave]