Friday, May 24, 2013

The Great Text to Speech Rabbit Hole


This isn't related to the Minecraft city generation project some of you have been following, but I thought this would be a good place to talk about something I've been working on for a while. If you own an Apple device, you might have played with some of the text to speech (TTS) voices they include. Apple computers have had TTS for, literally, decades, but the most recent versions of the voices are pretty good. The absolute best voice, quality-wise, is one called Alex, introduced in OS X 10.5.
The reason I say Alex is the best is because the voice simulates taking a breath. It seems trivial, but that is a great auditory cue to listeners about how the speaker is organizing what they are going to say. If you are face to face with someone, there are even subtle facial cues during that breath that clue the listener in on what is going to be said next.

So, Alex breathes. But he, and the other voices, are not exactly warm-sounding. I've been putting together some weekly hour-long 'story hours' and am modeling the narration roughly off of Zero Hour and the "Tokyo Rose" aesthetic. Here's a clip:



The voice I am using is "Moira," which was added to the possible voice choices with OS X 10.7. In the clip above, I'm adding effects to distort the voice and try to make it sound like it's coming over sketchy, old equipment, but it's still clearly an inhuman voice. A friend commented on how annoying he found it, so I started thinking about ways around that.

The easy solution would be for me to just do the voice, but I hear my voice enough. If the computer voices were not good enough on their own, maybe there was an intermediate approach. TTS voices prior to those introduced in OS X 10.7 allow you to change the pitch, duration, volume, and other attributes of each phoneme (or at least the OS X version of phonemes).
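As a rough illustration of what per-phoneme control looks like from the command line, here's a small sketch that builds a macOS `say` invocation with an embedded input-mode command. The voice name, the phoneme string, and the wrapper function are my own illustrative choices, not anything from the original workflow:

```python
# Sketch: build a `say` invocation that passes embedded speech commands.
# Assumptions: macOS with the `say` command; "Victoria" stands in for any
# pre-10.7 voice; the phoneme string is approximate, not a tested example.

def say_command(text, voice="Victoria", outfile=None):
    """Build the argv for macOS's `say`; [[...]] commands pass through in text."""
    cmd = ["say", "-v", voice]
    if outfile is not None:
        cmd += ["-o", outfile]  # -o writes audio to a file instead of speaking
    cmd.append(text)
    return cmd

# [[inpt PHON]] switches the synthesizer to phoneme input for what follows.
phoneme_text = "[[inpt PHON]] hEHlOW [[inpt TEXT]]"
cmd = say_command(phoneme_text)
# import subprocess; subprocess.run(cmd)  # uncomment on a Mac to hear it
```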

Now, for some reason, when I think of testing audio, I think of a scene from the movie "10 Things I Hate About You" in which Heath Ledger sings "You're just too good to be true" by Frankie Valli. Seriously, I test mics at the start of conference calls by singing a few lines.


Like me, the Alex voice is not much of a singer. We both mean well, but sort of suck. Take a listen:



In theory, it's possible to get Alex to sing, but doing it by hand would not be fun, and I was looking for an automated way to get the TTS voices to sound a bit more human. Buried in the OS X Developer tools is an application called Repeat After Me (RAM).


The way it is supposed to work is this: you enter the text string you want the computer to say in the "Text" box. Clicking on "To Phonemes" converts the string to OS X phonemes (anything more than 100 phonemes will make later steps fail). Then, clicking on "Build Graph" generates a frequency graph of what that voice's speech rules indicate are reasonable guesses about how the phrase will be spoken.

RAM is not the most transparent application, and the documentation is not great, but each of the little dots in the "Tune" window are draggable. You can also add more drag points by holding Shift while clicking on the pitch line. Option+dragging a point will let you move that individual point. The real magic happens when you import a sound file. In this case, it's Heath Ledger singing.

I grabbed the audio of Heath Ledger singing the song and pumped it through the Sound eXchange program (SoX): first, to increase the overall volume of the track, and second, to automatically split the track into separate files based on pauses.
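Those two SoX passes might look something like this. This is a minimal sketch that only constructs the command lines; the silence thresholds and filenames are illustrative assumptions on my part, not the exact values used:

```python
# Sketch of the two SoX passes: normalize the volume, then split on
# pauses. Assumptions: SoX is installed; thresholds and filenames are
# illustrative, not the exact values from the original workflow.

def sox_normalize_cmd(infile, outfile):
    # `gain -n` normalizes the whole track up to 0 dBFS, boosting quiet audio.
    return ["sox", infile, outfile, "gain", "-n"]

def sox_split_cmd(infile, out_prefix):
    # The `silence` effect plus `: newfile : restart` starts a new numbered
    # output file (clip001.wav, clip002.wav, ...) at each detected pause.
    return ["sox", infile, out_prefix + ".wav",
            "silence", "1", "0.1", "1%", "1", "0.5", "1%",
            ":", "newfile", ":", "restart"]

# import subprocess
# subprocess.run(sox_normalize_cmd("song.wav", "loud.wav"))
# subprocess.run(sox_split_cmd("loud.wav", "clip"))
```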



That file was imported into RAM. Clicking the "Impose Durations," "Extract Pitch," and "Impose Pitch" buttons makes RAM try to make your typed text match the duration of the audio file, and uses the pitch profile of the audio file as the basis for the pitch of the typed text. Lastly, you click on "Tune," and the end product is a block of phoneme text that the pre-OS X 10.7 voices can use as the basis of how to speak.

According to RAM, to get the voice Alex to sing the phrase "You're just too good to be true" like Heath Ledger, you need the following:

[[inpt TUNE]]
~
y {D 50; P 265.0:0 162.0:41}
2AO {D 170; P 183.0:0 200.0:44}
r {D 20; P 239.0:0 245.0:55}
_
J {D 110; P 245.0:0 220.0:6 220.0:33 220.0:67}
1UX {D 120; P 227.0:0 234.0:35 237.0:85}
s {D 120; P 218.0:0 218.0:29}
t {D 20; P 218.0:0 218.0:13 218.0:63}
_
t {D 70; P 218.0:0 218.0:6 218.0:19 218.0:25 218.0:38 218.0:69}
1UW {D 220; P 232.0:0 229.0:4 227.0:37 227.0:44 229.0:74 229.0:81}
_
g {D 100; P 225.0:0 225.0:8 225.0:42}
1UH {D 150; P 237.0:0 237.0:4 237.0:11 242.0:19 245.0:26 259.0:70 259.0:85}
d {D 140; P 259.0:0 259.0:20 259.0:40}
~
t {D 120; P 259.0:0 259.0:67}
AX {D 160; P 227.0:0 227.0:5 229.0:30 227.0:75}
_
b {D 60; P 199.0:0 199.0:8 199.0:38 190.0:69}
1IY {D 170; P 195.0:0 195.0:7 198.0:37 198.0:44 200.0:52 300.0:56 400.0:85}
_
t {D 210; P 399.0:0 398.0:5 396.0:21 394.0:53}
r {D 30; P 222.0:0 222.0:5 222.0:20 222.0:35 222.0:45 222.0:75}
1UW {D 550; P 222.0:0 222.0:2 222.0:29 234.0:100}
. {D 460}
[[inpt TEXT]]
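My reading of each line above (treat this as an assumption, not official documentation): the leading token is the phoneme, `D` is its duration in milliseconds, and `P` is a list of pitch points, each written as frequency-in-Hz followed by a percent offset through the phoneme. A small parser along those lines:

```python
import re

def parse_tune_line(line):
    """Parse one TUNE phoneme line like 'y {D 50; P 265.0:0 162.0:41}'.

    Assumed format: phoneme, D = duration in ms, and an optional P list of
    pitch points as Hz:percent-offset pairs.
    """
    m = re.match(r"(\S+)\s*\{D\s+(\d+)(?:;\s*P\s+([^}]*))?\}", line)
    if m is None:
        return None  # bare separators like "_" and "~" carry no payload
    phoneme, duration, pitch_str = m.group(1), int(m.group(2)), m.group(3)
    points = []
    if pitch_str:
        for pair in pitch_str.split():
            hz, pct = pair.split(":")
            points.append((float(hz), int(pct)))
    return {"phoneme": phoneme, "duration_ms": duration, "pitch": points}
```

Under that reading, the first line of the block ("y {D 50; P 265.0:0 162.0:41}") is a 50 ms phoneme whose pitch falls from 265 Hz at the start to 162 Hz at 41% of the way through.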

Which sounds like:



This is far from a quick process, and there are no similar command-line tools that I am aware of. Enter AppleScript, System Events, GUI Scripting, and Python.

RAM is not a scriptable application, in that it is not made to allow AppleScript to easily control it. You can control applications using System Events and GUI scripting, though. The downside of doing it this way is that the computer is following a script of actions in a very braindead way, and it's impossible to use the computer for anything else, since the script is mimicking user input going to the application. So, it is doable, but feels like a Rube Goldberg machine.

I wrote a Python backend that takes as arguments the name of a file with the dialog to be spoken (in this case, one line of text for each audio file made by SoX), an output file name, and the computer voice to use. The Python script does most of the housekeeping and then passes info to a separate AppleScript that automates all of the steps I outlined above (plus saving the resulting RAM file in case you want to edit it later). For each short audio file, the Python script gets back the tuned phoneme text, then joins the pieces together into a single string of text representing the song. It sends that string to the system and saves the resulting audio to a file, effectively providing an automated workflow for taking recorded audio and a transcript and producing a computer-generated audio recording that captures some of the human reader's nuances.
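A sketch of the joining-and-rendering step described above. The function names and file handling here are my own guesses at the shape of it; the actual script's internals may differ:

```python
# Assumptions: each fragment is a full "[[inpt TUNE]] ... [[inpt TEXT]]"
# block, one per SoX clip; the `say` invocation at the end is macOS-only.

def join_tunes(fragments):
    # Strip each fragment's input-mode wrappers, then re-wrap once so the
    # whole song is a single TUNE utterance.
    bodies = []
    for frag in fragments:
        body = frag.replace("[[inpt TUNE]]", "").replace("[[inpt TEXT]]", "").strip()
        bodies.append(body)
    return "[[inpt TUNE]]\n" + "\n".join(bodies) + "\n[[inpt TEXT]]"

def render_cmd(tune_text, voice, outfile):
    # `say -o` writes the synthesized audio to a file instead of the speakers.
    return ["say", "-v", voice, "-o", outfile, tune_text]

# song = join_tunes(open(p).read() for p in clip_tune_files)
# import subprocess; subprocess.run(render_cmd(song, "Alex", "song.aiff"))
```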

Now, RAM doesn't always do a good job of getting the durations or imposing pitch. Take, for example, this:

It's clear that RAM failed to get the durations correct, and the tuning of the frequencies could be better at the start. A quick manual change results in:

When you just let the scripts run without any interference, a wholly automatically generated file sounds like:



But if you reopen the saved RAM files and manually adjust the durations a bit, you get something closer to the original. Below are a few samples of the whole process sung by the Victoria (introduced in OS X 10.0), Vicki (introduced in OS X 10.4), and Alex (introduced in OS X 10.5) voices. It is worth noting that even the older voices sound passable. Minor hand-tweaking of durations and pitch results in a robotic Heath Ledger, presented with his original version for comparison.
Victoria (OS X 10.0):

Vicki (OS X 10.4):

Alex (OS X 10.5):

Heath Ledger: