Apple’s Siri personal assistant was one of the main advertised features of the iPhone 4S. It is a marriage of speaker-indepent voice-recognition technology and domain-aware AI. What sets it apart is how it’s integrated with back-end and on-device services and how it maintains context across multiple requests so it feels like you’re having a conversation with an intelligent assistant instead of barking orders at a robot minion.
Once you get past the immediate thrill of having your phone understand your commands, talk back, and crack jokes, you quickly realize how much more powerful this all could be if Siri could do more than the relatively limited range of tasks it can currently perform.
That is, until you discover that Siri is a walled garden.
The entire iOS platform is predicated on the premise that third parties can add new features to the stock system provided by Apple. But here we have a seemingly closed system in which third-parties can not participate. As usual, Apple is mumm about whether a Siri SDK will ever be released.
But the question of how third-party apps could participate in Siri has been nagging at me since I first got my hands on an iPhone 4S and started playing with it. As I thought about it more I realized that for a third-party app to play in the Siri playground, there would have to be some profound changes in the way we develop iOS apps.
To first realize the scope of what Siri could do we should walk through a few use-cases of its current capabilities.
Let’s take the case of creating an appointment. We invoke Siri by holding down the home button which brings up a standard modal interface and turns on the microphone. We speak the words “make lunch appointment with John tomorrow at noon.” Siri goes into processing mode, then generates a voice-synthesized message asking which “John” you would like to schedule a meeting with. The list of possible particpants has been garnered from the addressbook. You respond by saying, say, “John Smith”. Siri then displays a mini-calendar (which closely resembles a line entry in the standard calendar), makes note of any other conflicts, then asks for you to confirm. You respond ‘yes’ and the event has been entered into your calendar (and distributed out via iCloud or Exchange).
Let’s break down what Siri just did:
- It put up a system-modal dialog.
- It captured some audio for processing.
- It processed the audio and converted it to text (often without previous training to adapt to a specific person’s voice).
- It managed to elicit enough context form the text to determine the request had to do with a calendar action.
- It extracted the name of an individual out of the text and matched it against the addressbook.
- It determined that the request was not specific enough and responded with a request for clarification. This request contained more data (list of other Johns) based on a search of the addressbook.
- It converted the request to a synthesized voice.
- The response was again recorded and processed for text recognition. It should be noted that one of the marvels of Siri is by going through a dialog, where each individual response is considered in the context of previous requests. When Siri offers a list of Johns, your response is ‘John Smith.’ Taken out of context, it would not make any sense, but Siri maintained context from a previous request and handled the response properly.
- Once confirmed, the event was scheduled with the calendar service.
- The system modal dialog can now be dismissed.
First, let’s consider whether Siri does voice-to-text recognition on the device itself or uses server resources. This is easy to verify. Put the phone into airplane mode and Siri can no longer operate so it’s likely it uses the server to do some (or all) of its recognition.
But how did Siri recognize that this was a calendar-related request (vs. a navigation one). It appears that once the voice data has been converted to text form, that the system categorizes the request into domains. By categorizing into well-defined domains the range of actions can be quickly narrowed. Having support for a sufficiently rich set of primitives within each domain creates the illusion of freedom to ask for anything. But of course, there are limits. Use an unusual word or phrase and it soon becomes obvious the vocabulary range has boundaries.
Once the action has been determined (e.g. schedule event) a particular set of slots need to be filled (who, when, where, etc.) and for each slot Siri apparently maintains a context and engages in a dialog until a minimum viable data set has been properly obtained. For example, if we ask it to ‘make an appointement with John Smith’ Siri will ask ‘OK… when’s the appointment?’ and if we say ‘tomorrow’ it will come back and ask ‘What time is your appointment?’ Once you’ve narrowed it down to ‘noon’ it can proceed to make the appointment with the calendar service.
By all appearances it maintains a set of questions that require an answer in order for it to accomplish a task and it iterates until those answers are obtained. Note that in this context a place was not required but if offered it is used to fill out the location field of a calendar entry. Some slots are required (who and when) but others are optional (where). It should also be noted that in its current state, Siri can easily be confused if given superfluous data:
Make lunch appointment with John Smith at noon tomorrow in San Francisco wearing a purple suit.
The actual event is recognized as Purple suite with John Smith with the location marked as San Francisco wearing.
Once all the required slots have been filled the actual act of creating a calendar entry is pretty straightforward since iOS offers an internal API for accessing the calendar database.
But Siri is not limited to just calendar events. As of first release the list of services available via Siri include:
- Maps and Directions
- Clock, Alarm, and Timer
- Address Book
- Find My Friends
- Web Search
- Search via WolframAlpha
- Nearby services and reviews via Yelp
Several of these services (e.g. weather, stocks, Find My Friends, notes, and WolframAlpha) do not advertise a publicly accessible API. This means that for Siri to work with them, the service had to be modified so it could have an integration point with Siri.
It is the nature of this integration point which would be of interest to software developers and in the next post we’ll explore it a little more to see how Apple might integrate with third-parties if it chose to open up Siri to other apps.