Meet the Voices Behind Text to Speech Programs

Have you ever wondered how the stunningly lifelike voices found in modern Text to Speech programs like Siri are created? How does Siri manage to say people’s names with a fair degree of accuracy? Is it as simple (or complicated, depending on how you want to look at it) as having a voice actor record every word ever?

In this fascinating video and the awesome accompanying article on The Verge, you’ll get a peek behind the scenes of how Text to Speech programs are created. The video is 10 minutes long, but worth every minute.

Here are a few highlights from the article, though I strongly advise checking out the whole thing if you’ve got time.

On the process of creating a database of sounds by having someone read a ton of contextually nonsensical sentences:

After the script is recorded with a live voice actor, a tedious process that can take months, the really hard work begins. Words and sentences are analyzed, catalogued, and tagged in a big database, a complicated job involving a team of dedicated linguists, as well as proprietary linguistic software.

When that’s complete, Nuance’s text-to-speech engine can look for just the right bits of recorded sound, and combine those with other bits of recorded sound on the fly, creating words and phrases that the actor may have never actually uttered, but that sound a lot like the actor talking, because technically it is the actor’s voice.

The official name for this type of voice building is “unit selection” or “concatenative speech synthesis.” Ward describes it as “a little like a ransom note,” but saying it’s like a ransom note, where letters are chopped up and pasted back together to form new sentences, is a radical oversimplification of how we make language.
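If you're curious what "looking for just the right bits of recorded sound" could look like in code, here is a toy sketch of the unit selection idea. It is not Nuance's method. The `Unit` class, the `join_cost` function, and the pitch values are all made up for illustration, and a real engine scores candidates on far more than pitch. The sketch only shows the basic trick: for each sound you need, there are several recorded candidates, and you pick the sequence whose neighbors fit together most smoothly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One snippet of recorded audio (hypothetical, heavily simplified)."""
    label: str    # the sound this snippet represents, e.g. "k"
    pitch: float  # assumed average pitch in Hz
    source: str   # which recorded sentence the snippet was cut from


def join_cost(a: Unit, b: Unit) -> float:
    """Assumed cost of gluing b after a: a bigger pitch jump sounds worse."""
    return abs(a.pitch - b.pitch)


def select_units(targets: list[str], database: list[Unit]) -> list[Unit]:
    """Pick one unit per target sound, minimizing the total join cost.

    Uses dynamic programming (Viterbi-style): for every candidate at each
    position, keep only the cheapest path that ends in that candidate.
    """
    candidates = [[u for u in database if u.label == t] for t in targets]
    for target, found in zip(targets, candidates):
        if not found:
            raise ValueError(f"no recorded unit for sound {target!r}")

    # best maps each candidate to (total cost so far, path ending in it)
    best = {u: (0.0, [u]) for u in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for unit in position:
            cost, path = min(
                ((c + join_cost(prev, unit), p) for prev, (c, p) in best.items()),
                key=lambda pair: pair[0],
            )
            new_best[unit] = (cost, path + [unit])
        best = new_best

    return min(best.values(), key=lambda pair: pair[0])[1]


# A tiny made-up database: two recordings of each sound in "cat".
database = [
    Unit("k", 120.0, "sentence 1"), Unit("k", 180.0, "sentence 2"),
    Unit("ae", 125.0, "sentence 1"), Unit("ae", 200.0, "sentence 3"),
    Unit("t", 130.0, "sentence 1"), Unit("t", 90.0, "sentence 2"),
]

chosen = select_units(["k", "ae", "t"], database)
print([u.source for u in chosen])  # ['sentence 1', 'sentence 1', 'sentence 1']
```

In this example the cheapest path happens to reuse snippets from a single recording, since their pitches line up. With a bigger database the winning path would usually hop between recordings, which is where the "ransom note" comparison comes from.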

On the intended effect of this method of creating Text to Speech programs:

You shouldn’t think, “I’m talking to a computer.” You shouldn’t think anything at all.

