Voice assistants are the next big thing. Some say they’re the next mobile, though I don’t know whether that’s accurate or an understatement. All the major platform companies have one, and startups building them appear faster than anyone can keep track of. The point is, they are going to be everywhere and will dominate the way we interact with our computers. Yet I hear many people questioning whether these assistants are even viable from a business perspective. The argument goes that by moving people away from screens, assistants may be diminishing traditional screen-based revenue streams. How is Google going to sell ads alongside its search results if the user is taken directly to the information they desire without ever looking at a list of results?
Content providers may indeed have a harder time turning their work into paychecks. If you’re running a blog or publication, your main business is placing ads next to your reporting. When more people move away from screens and have their news read to them by an AI instead, fewer people will see your ads. Whether people will actually do this in significant numbers remains to be seen. For the companies operating the voice assistants, however, these systems will become a gold mine. Even better, their value proposition for the customer is precisely what makes them valuable for the businesses operating them.
All it takes to cash in is a simple two-step plan:
THE PATH OF GREATEST CONVENIENCE
The first part of the story is about getting to market dominance, or at least gaining a significant share of the market, and growing the overall market volume at the same time.
Essentially, companies are trying to get as many people as possible to use their system, and to get these people to use it as much as possible. That’s why, at the moment, all weight is behind making these voice assistants useful and their interaction feel natural. We’re supposed to get used to talking to robots. The usefulness part of this puzzle is about handling relevant key use cases on the one hand, while supporting an extremely broad range of tasks on the other. As with most innovations, a few key functions are what people really want and what keeps them coming back. Still, the assistant has to be universally useful and support the user with whatever she wants to accomplish. This is especially important for voice. Without any visual cues for the available functionality, all the user is left with is trial and error. Every command the system can’t understand or act upon is disappointing for the user. Get a command wrong one too many times and you will be frustrated enough to stop using it.
So as long as no one entity creates an assistant that handles everything, there is demand, and possibly room for coexistence, for many T-shaped ones. The crucial part here, however, is acknowledging that coexistence: knowing strengths and weaknesses and facilitating actions and responses between these systems. Imagine an assistant that is great at coordinating and controlling a smart home, but has no exceptional skills in most other departments. By itself, it doesn’t seem too convenient, and you might instead turn to one that does pretty well generally but is just a bit weaker at smart-home tasks. However, if this first one is also able to facilitate between different assistants, that could become a whole different story. It could call Google Assistant for knowledge questions, Amazon’s Alexa for shopping tasks, and so on. Now that would be pretty convenient! As soon as an assistant handles core functions well enough, and delights instead of disappoints in most other, more general requests, people might actually use it to an extent and volume that makes it interesting for companies.
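As a toy illustration, this facilitation idea boils down to an intent router: classify what the user wants, handle your specialty yourself, and delegate everything else. The assistant names, keywords, and routing table below are hypothetical placeholders, not real integrations or APIs:

```python
def classify_intent(utterance: str) -> str:
    """Naive keyword-based intent classifier.

    A real assistant would use trained NLU models; this keyword lookup
    is only a stand-in to make the routing idea concrete.
    """
    text = utterance.lower()
    if any(word in text for word in ("light", "thermostat", "lock")):
        return "smart_home"
    if any(word in text for word in ("buy", "order", "shopping")):
        return "shopping"
    if any(word in text for word in ("who", "what", "when", "where")):
        return "knowledge"
    return "unknown"


# Map each intent to the assistant best suited to handle it.
# "home_assistant" stands for our own smart-home specialty;
# the others are the delegation targets mentioned above.
ROUTES = {
    "smart_home": "home_assistant",
    "shopping": "alexa",
    "knowledge": "google_assistant",
}


def route(utterance: str) -> str:
    """Return the name of the assistant that should handle this request."""
    return ROUTES.get(classify_intent(utterance), "fallback")
```

With this sketch, `route("turn on the lights")` stays in-house, while a knowledge question or shopping request gets handed off, which is exactly the coexistence the paragraph above describes.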
The part about conversations feeling natural is just as important. For conversations with machines to feel natural, two things are needed: speech synthesis and conversation flow.
Speech synthesis describes a computer producing actual “spoken” sounds from written words and data. This begins with arranging pre-recorded syllables one by one and becomes ever more complicated when incorporating important traits of our languages, such as intonation and flow. Technology has gotten really good at this, as you can hear in currently available voice assistants. While in most cases you can still easily tell that you’re talking to a robot, speech synthesis has reached a state of being good enough to hold a conversation. You can clearly understand what the machine is trying to say without the sound of it becoming a distraction.
The next big challenges in the field are about making sounds even more human. Robots are good at communicating facts, but conversation is about so much more than plain facts. We use speech to direct attention, convey emotions and carry more meaning than the individual words. Getting our robots to follow conversational conventions by producing and using all these stylistic measures correctly and effectively is the current area of focus in the field. And one where our robot friends still have a lot to learn.
Conversation flow describes, in simple terms, how well the conversation is going overall. For proper conversation flow it takes both parties to be benevolently engaged, actively listening and understanding. Let’s break that down:
- benevolently engaged: wanting the best outcome for the other party and taking action towards this goal
- listening: being focused and hearing what the other party is saying
- understanding: recognizing and comprehending both literal and contextual/tonal information
Listening here translates to microphone technology and speech-to-text transcription. While there is still a lot of room for improvement, at the basic level it’s a solved problem. The other two are where it gets complicated. When it comes to voice assistants, this means that even if they can’t do everything you want them to do, they at the very least have to understand what you mean and try their best to help you reach your goal some other way. This is where assistants go wrong at the moment. Stray a tad too far from the predefined functional path and you might as well be talking gibberish. But even when they understand what you’re saying, keeping a healthy exchange alive, offering assistance and information where you didn’t expect it, and asking relevant questions to grasp context are still huge problems that need to be solved.
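The graceful-degradation behavior described above can be sketched as a simple response policy: admit it when nothing was understood, and when a request is understood but unsupported, offer another path to the goal instead of failing silently. The intents, confidence scores, and threshold here are made up for illustration; a real assistant would get these from its NLU pipeline:

```python
# Intents this hypothetical assistant can actually act on.
SUPPORTED = {"weather", "timer", "music"}


def respond(intent: str, confidence: float) -> str:
    """Pick a response given a recognized intent and its confidence score.

    The 0.4 threshold is an arbitrary illustrative cutoff between
    "didn't understand" and "understood but may not support".
    """
    if confidence < 0.4:
        # Benevolent failure: admit the miss and invite a rephrase,
        # rather than acting on a guess.
        return "Sorry, I didn't catch that. Could you rephrase?"
    if intent not in SUPPORTED:
        # Understood, but can't act: still help the user reach the
        # goal some other way instead of a dead-end error.
        return f"I can't handle '{intent}' yet, but I can search the web for you."
    return f"Okay, handling your {intent} request."
```

The key point is the middle branch: understanding the request and offering an alternative is what separates a helpful exchange from talking gibberish back.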
Once a voice interface reaches a certain threshold of both usefulness and natural-feeling conversation, it has the potential to reach an incredible number of users, and to reach them on a more personal level than is possible now. In conversations with robots, even simple ones, humans tend to assign human traits to the machine. We read meaning, feelings, and intentions into words where there are none. This is called the ELIZA effect and is scientifically well established; this article by Chatbots Magazine explains it in simple terms. If our assistant acts nicely and reacts benevolently to our requests, we can’t help but trust that it cares about our best interests. Knowing that it’s a machine, and being aware of the contradiction, surprisingly doesn’t even diminish our trust. And here’s the fun part: combining usefulness with trust leads to heavy use and, more importantly, to people opening up and giving away more information about themselves and their wants.