2015-02-02



In Star Trek, characters routinely talk to the ship’s computer and it talks back. This feels very natural to the viewer, because the computer is just another character in the story. However, when given the opportunity to talk to my own computer today, I still feel awkward and prefer not to… especially if other people are within earshot!

The exception to this is when I’m driving. My Lumia Windows Phone integrates nicely with my car’s Bluetooth, so as text messages arrive, I can have them read aloud and even reply to them simply by having a Star Trek-like conversation with my own ship’s computer. With Cortana, I’m not limited to just text messages – I can ask her “What is the forecast for Sunday?” and get a voice reply stating that “The forecast for Sunday shows Snow with a high of 34 and a low of 15” – all while driving down the Interstate with my phone tucked away in my coat pocket.

The key to a successful hands-free interface for your mobile app is to accept short and simple commands and to give short replies. Think of the hands-free interface as being like a walkie-talkie: while you have the button pressed and are speaking, you cannot hear what the other person may be saying, and the same is true when they are speaking. So, input commands should be short and simple to reduce the risk of the user and/or the voice recognizer messing up the command structure, and replies should be short to avoid listener fatigue, or a runaway narration that cannot be stopped (imagine if the phone starts to read an extremely long email and you have no opportunity to interrupt it).

While on the topic of replies, the content of the message should only include the most important information. Supporting information can still be presented on the screen, but it doesn’t need to be made available to a user accessing the app in hands-free mode. For example, a stock quote may give the current price, and perhaps the current trend from the day’s high or low price, but does not need to give any information about the stock itself (such as Market Cap or P/E Ratio) or the hourly price from earlier trading – that’s just noise to the listener.  As a general guideline, the responses should be targeted to play for no more than 15 seconds – your app could always be set up for the user to request additional supporting information after the first response.
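One way to stay inside that 15-second budget is to estimate how long a reply will take to speak and drop any trailing sentences that don't fit. The sketch below is only an illustration: the SpokenReply class, its method names, and the rate of 2.5 words per second are my own assumptions rather than part of any Windows API, and real voices will speak faster or slower.

```csharp
using System;

public static class SpokenReply
{
    // Assumed average speaking rate (words per second); measure on a real device and adjust.
    private const double WordsPerSecond = 2.5;

    // Rough estimate of how long the text takes to speak.
    public static double EstimateSeconds(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return 0.0;
        string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);
        return words.Length / WordsPerSecond;
    }

    // Keeps whole sentences from the start of the reply while the estimate stays
    // within maxSeconds. The first sentence is always kept, even if it runs long.
    public static string TrimToBudget(string text, double maxSeconds = 15.0)
    {
        if (string.IsNullOrWhiteSpace(text)) return string.Empty;

        string kept = string.Empty;
        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            bool lastChar = i == text.Length - 1;
            bool terminator = text[i] == '.' || text[i] == '!' || text[i] == '?';
            // A terminator only ends a sentence when whitespace follows, so "34.5" stays whole.
            bool sentenceEnd = terminator && (lastChar || char.IsWhiteSpace(text[i + 1]));
            if (!sentenceEnd && !lastChar) continue;

            string sentence = text.Substring(start, i - start + 1).Trim();
            start = i + 1;
            if (sentence.Length == 0) continue;

            string candidate = kept.Length == 0 ? sentence : kept + " " + sentence;
            if (kept.Length > 0 && EstimateSeconds(candidate) > maxSeconds) break;
            kept = candidate;
        }
        return kept;
    }
}
```

The app could call SpokenReply.TrimToBudget(replyText) just before handing the text to the speech synthesizer, while still showing the full text on the screen.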

App: Speak to Me

First, the low-hanging fruit: let’s explore how to get your app to talk to the user.

In a WinRT Universal App, it is very easy to implement Text-to-Speech (TTS) using the Windows.Media.SpeechSynthesis.SpeechSynthesizer object.

The speech is created by passing the text message to the SynthesizeTextToStreamAsync() method. The SpeechSynthesizer returns a SpeechSynthesisStream rather than playing the audio itself, so in XAML we need a MediaElement to play that stream.
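As a minimal sketch of that flow, the helper below synthesizes a message and hands the resulting stream to a MediaElement. The Speech class and SpeakAsync method are hypothetical names of my own, and the code assumes the MediaElement is declared somewhere in the page's XAML (for example, as an element named media) and that the method is called from the UI thread.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechSynthesis;
using Windows.UI.Xaml.Controls;

public static class Speech
{
    // One synthesizer shared by the app (an assumption; creating one per call also works).
    private static readonly SpeechSynthesizer Synthesizer = new SpeechSynthesizer();

    // Renders the message to an audio stream and plays it through the given MediaElement.
    public static async Task SpeakAsync(MediaElement player, string message)
    {
        SpeechSynthesisStream stream = await Synthesizer.SynthesizeTextToStreamAsync(message);

        // The stream reports its own MIME type, which the MediaElement needs in order to decode it.
        player.SetSource(stream, stream.ContentType);
        player.Play();
    }
}
```

From a page's code-behind, a call such as await Speech.SpeakAsync(media, "Hello, Captain."); would then speak the text through the page's MediaElement.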

A word of caution: WinRT on Windows Phone 8.1 only supports one MediaElement playing at a time. If you play a sound effect or other media using a MediaElement while the TTS stream is still being played, then it’s highly likely that your speech will be truncated. As a workaround, I have started playing sound effects using XAudio2 (for C# projects, I use the SharpDX library of wrappers so that I don’t have to deal with C/C++ code). This limitation does not seem to exist in Universal Apps written in HTML/JavaScript (WinJS), as they would use the HTML5 Audio element instead of the XAML MediaElement for playback. It also does not exist in Windows Store Apps (i.e., Desktop/Tablet apps), which make more simultaneous channels available to the app.

I’m Hearing Voices

Normally, your app would just use the system’s default voice. This will match the user’s preferences, should they have gone into the device’s Speech settings and selected a different “Text to Speech voice” (gender) and “Speech language” (pronunciation rules). However, your app can override the Voice property of the SpeechSynthesizer before rendering a new speech stream (e.g., to force a particular voice gender regardless of the system’s default, or to let the user select a different voice just for your app).

A simple demo of this is to cycle through all of the voices using C# and XAML, playing the next voice in the list every 10 seconds.
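A cycling demo along those lines might look like the following sketch. The VoiceCycler class and its member names are hypothetical, the spoken sentence is just a placeholder, and the code assumes it is created and started on the UI thread with a MediaElement taken from the page's XAML.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechSynthesis;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;

public sealed class VoiceCycler
{
    private readonly SpeechSynthesizer synthesizer = new SpeechSynthesizer();
    private readonly DispatcherTimer timer = new DispatcherTimer();
    private readonly MediaElement player;
    private int nextIndex;

    public VoiceCycler(MediaElement player)
    {
        this.player = player;
        timer.Interval = TimeSpan.FromSeconds(10);
        timer.Tick += async (sender, e) => await SpeakWithNextVoiceAsync();
    }

    // Speaks with the first voice right away, then moves to the next voice every 10 seconds.
    public async Task StartAsync()
    {
        await SpeakWithNextVoiceAsync();
        timer.Start();
    }

    public void Stop()
    {
        timer.Stop();
    }

    private async Task SpeakWithNextVoiceAsync()
    {
        var voices = SpeechSynthesizer.AllVoices;
        if (voices.Count == 0) return;

        // Pick the next installed voice, wrapping back to the start of the list.
        VoiceInformation voice = voices[nextIndex];
        nextIndex = (nextIndex + 1) % voices.Count;

        // Override the voice before rendering the new speech stream.
        synthesizer.Voice = voice;
        string text = "This is " + voice.DisplayName + ", speaking " + voice.Language + ".";
        SpeechSynthesisStream stream = await synthesizer.SynthesizeTextToStreamAsync(text);
        player.SetSource(stream, stream.ContentType);
        player.Play();
    }
}
```

Because only one MediaElement can play at a time on the phone, each new utterance simply replaces the previous one, so the placeholder sentence is kept well under 10 seconds.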

Take notice of the AllVoices static property of the SpeechSynthesizer, which provides a list of VoiceInformation objects, one for each voice installed on the device. Each VoiceInformation object has a Language property (a language tag, like “en-US”) and a Gender property, which can be used to find an appropriate voice for your app to use without actually knowing the voice’s name.
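For instance, a lookup by language and gender could be sketched as follows. VoicePicker and Find are hypothetical names of my own, and falling back to the system's default voice when nothing matches is an assumption about what an app would want, not something the API does for you.

```csharp
using System;
using System.Linq;
using Windows.Media.SpeechSynthesis;

public static class VoicePicker
{
    // Returns the first installed voice that matches the language tag and gender,
    // or the system's default voice when no installed voice matches.
    public static VoiceInformation Find(string language, VoiceGender gender)
    {
        VoiceInformation match = SpeechSynthesizer.AllVoices.FirstOrDefault(
            v => string.Equals(v.Language, language, StringComparison.OrdinalIgnoreCase)
                 && v.Gender == gender);

        return match ?? SpeechSynthesizer.DefaultVoice;
    }
}
```

An app would then assign the result to its synthesizer before speaking, for example synthesizer.Voice = VoicePicker.Find("en-US", VoiceGender.Female);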

Photo Credit: Star Trek IV – The Voyage Home, Paramount (1986)

The post Hands-free WinRT: Part 1 – The Talking App appeared first on Falafel Software Blog.
