For all of modern gaming’s advances, conversation is still a fairly unsophisticated affair. Starship Commander, an upcoming virtual reality game on Oculus and SteamVR, illustrates both the promise and challenge of a new paradigm seeking to remedy that: using your voice.
In an early demo, I control a starship delivering classified goods across treacherous space. Everything is controlled by my voice: flying the ship is as simple as saying “computer, use the autopilot,” while my sergeant pops up in live action video to answer questions.
At one point, my ship is intercepted and disabled by a villain, who pops onto my screen and starts grilling me. After a little back and forth, it turns out he wants a deal: “Tell you what, you take me to the Delta outpost and I’ll let you live.”
I try to shift into character. “What if I attack you?” I say. No response, just an impassive yet expectant stare. “What if I say no?” I add. I try half a dozen responses, but (perhaps because I’m playing an early build of the game, or maybe because it just can’t decipher my voice) I can’t seem to find the right phrase to unlock the next stage of play.
It’s awkward. My immersion in the game all but breaks down when my conversational partner does not reciprocate. It’s a two-way street: If I’m going to dissect the game’s dialogue closely to craft an interesting point, the game has to keep up with my dialogue too.
The situation deteriorates. The villain eventually gets fed up with my inability to carry the conversation. He blows up my ship, ending the game.
Yet there is potential for a natural back and forth conversation with characters. There are over 50 possible responses to one simple question from the sergeant — “Is there anything you’d like to know before we start the mission?” — says Alexander Mejia, the founder and creative director at Human Interact, which is designing the game. The system is powered by Microsoft’s Custom Speech Service (similar technology to Cortana), which sends players’ voice input to the cloud, parses it for true intent, and gets a response in milliseconds. Smooth voice control coupled with virtual reality means a completely hands-free, lifelike interface with almost no learning curve for someone who’s never picked up a gamepad.
Speaking certainly feels more natural than selecting one of four dialogue options from a menu, as a traditional roleplaying game might provide. It makes me more attentive in conversation: I have to pay attention to characters’ monologues, picking up on details and inconsistencies while coming up with insightful questions that might take me down a serendipitous narrative route (much like real life). No, I don’t get to precisely steer a ship to uncharted planets since voice control, after all, is not ideal for navigating physical space. But what this game offers instead is conversational exploration.
Video games have always been concerned with blurring the lines between art and real life.
Photorealistic 4K graphics, the disintegration of levels into vast open worlds, virtual reality placing players inside the skull of another person: The implicit end goal of every gaming advance seems to be to create an artificial reality indistinguishable from our own. Yet we communicate with these increasingly intelligent games using blunt tools. The joystick/buttons and keyboard/mouse combinations we use to speak to games do little to resemble the actions they represent. Even games that use lifelike controls, from the blocky plastic Time Crisis guns to Nintendo Switch Joy-Cons, still involve scrolling through menus and clicking on dialogue options. The next step is for us to talk to games.
While games that use the voice have cropped up over the years — Seaman on Sega’s Dreamcast, Lifeline on the PlayStation 2, Mass Effect 3 on the Xbox 360’s Kinect — their commands were often frustratingly clunky and audio input never seemed more than a novelty.
That may be coming to an end. Well-rated audio games such as Papa Sangre and Zombies, Run! have appeared on the iPhone. At E3 this month, Dominic Mallinson, a Sony senior vice president for research and development, counted natural language understanding among “some of the technologies that really excite us in the lab right now.”
More than anything, the rush by Microsoft, Google, Amazon and Apple to dominate digital assistants is pushing the entire voice computing field forward. In March, The Information reported that Amazon CEO Jeff Bezos wants gaming to be a “killer app” for Alexa, and the company has paid developers who produce the best-performing skills. Games are now the top category for Alexa, and the number of customers playing games on Echo devices has increased tenfold in the last year, according to an Amazon spokeswoman. “If I think back on the history of the world, there’s always been games,” says Paul Cutsinger, Amazon’s head of Alexa voice design education. “And it seems like the invention of every new technology comes along with games.”
Simply: If voice assistants become the next major computing platform, it’s logical that they will have their own games. “On most new platforms, games are one of the first things that people try,” says Aaron Batalion, a partner focused on consumer technology at venture capital firm Lightspeed Venture Partners. “It’s fun, engaging and, depending on the game mechanics, it’s often viral.” According to eMarketer, 35.6 million Americans will use a voice assistant device like Echo at least once a month this year, while 60.5 million Americans will use some kind of virtual voice assistant like Siri. The question is, what form will these new games take?
Gaming skills on Alexa today predominantly trace their lineage to radio drama (the serialized, voice-acted fiction of the early 20th century), including RuneScape whodunnit One Piercing Note, Batman mystery game The Wayne Investigation and Sherlock Holmes adventure Baker Street Experience.
Earplay, meanwhile, has emerged as a leading publisher of audio games, receiving over $10,000 from Amazon since May, according to Jon Myers, who co-founded the company in 2013. Myers describes Earplay’s work as “stories you play with your voice,” and the company crafts both its own games and the tools that enable others to do the same.
For instance, in Codename Cygnus, you play a James Bond-esque spy navigating foreign locales and villains with contrived European accents, receiving instructions via an earpiece. Meanwhile, in Half, you navigate a surreal Groundhog Day scenario, picking up clues on each playthrough to escape the infinitely repeating sequence of events.
Like a choose-your-own-adventure novel, these experiences intersperse chunks of narrative with pivotal moments where the player gets to make a decision, replying with verbal prompts. Plot the right course through an elaborate dialogue tree and you reach the end. The audio storytelling activates your imagination, yet there is little agency as a player: The story chugs along at its own pace until you reach each waypoint. You are not so much inhabiting a character or world as co-authoring a story with a narrator.
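The structure described above, chunks of narration joined by spoken decision points, can be pictured as a small graph. What follows is only a rough sketch under stated assumptions: the `Node` class, the `STORY` table, the keywords and every line of narration are invented for illustration and are not taken from Earplay's games or tools.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One chunk of narration plus the spoken keywords that lead onward."""
    narration: str
    # Hypothetical mapping: keyword the player might say -> id of the next node.
    choices: Dict[str, str] = field(default_factory=dict)


# Invented three-node story, for illustration only.
STORY = {
    "start": Node(
        "Your handler whispers in your earpiece: the train or the boat?",
        {"train": "station", "boat": "harbor"},
    ),
    "station": Node("The train leaves on time. Mission complete."),
    "harbor": Node("The boat was a trap. Mission failed."),
}


def step(node_id: str, utterance: str) -> str:
    """Return the next node id, or the same id if nothing was recognized."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for keyword, target in STORY[node_id].choices.items():
        if keyword in words:
            return target
    return node_id  # unrecognized input: stay put so the prompt can repeat


def play(utterances: List[str], start: str = "start") -> str:
    """Walk the tree with a list of transcribed utterances; return the final node id."""
    node_id = start
    for utterance in utterances:
        if not STORY[node_id].choices:  # an ending has no choices left
            break
        node_id = step(node_id, utterance)
    return node_id
```

However the player phrases a reply, only the keyword decides the branch, which is why this kind of design can feel closer to picking from a menu than to holding a conversation.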
“What you see with the current offerings from Earplay springs a lot out of what we did at Telltale Games over the last decade,” says Dave Grossman, Earplay’s chief creative officer. “I almost don’t even want to call them games. They’re sort of interactive narrative experiences, or narrative games.”
Grossman has had a long career considering storytelling in games. He is widely credited with creating the first game with voice acting all the way through — 1993’s Day of the Tentacle — and also worked on the Monkey Island series. Before arriving at Earplay, he spent a decade with Telltale Games, makers of The Wolf Among Us and The Walking Dead.
Earplay continues this genre’s bloodline: The goal is not immersion but storytelling. “I think [immersion] is an excellent thing for getting the audience involved in what you want, in making them care about it, but I don’t think it’s the be-all-end-all goal of all gaming,” says Grossman. “My primary goal is to entertain the audience. That’s what I care most about, and there are lots of ways to do that that don’t involve immersing them in anything.”
In Earplay’s games, the “possibility space” (the degree to which the user can control the world) is kept deliberately narrow. This reflects Earplay’s philosophy. But it also reflects the current limitations of audio games. It’s hard to explore physical environments in detail because you can’t see them. Because Alexa cannot talk and listen at the same time, there can be no exchange of witticisms between player and computer, only each side talking at pre-approved moments. Voice seems like a natural interface, but it’s still essentially making selections from a multiple choice menu. Radio drama may be an obvious inspiration for this new form; its overacted tropes and narrative conventions are also well-established for audiences. But right now, like radio narratives, the experience of these games still seems to be more about listening than speaking.
Untethered, too, is inspired by radio drama. Created by Numinous Games, which previously made That Dragon, Cancer, it runs on Google’s Daydream virtual reality platform, combining visuals with voice and a hand controller.
Virtual reality and voice control seem to be an ideal fit. On a practical level, speech obviates the need for novice gamers to figure out complicated button placements on a handheld controller they can’t see. On an experiential level, the combination of looking around a 360-degree environment and speaking to it naturally brings games one step closer to dissolving the fourth wall.
In the first two episodes, Untethered drops you first into a radio station in the Pacific Northwest and then into a driver’s seat, where you encounter characters whose faces you never see. Their stories slowly intertwine, but you only get to know them through their voices. Physically, you’re mostly rooted to one spot, though you can use the Daydream controller to put on records and answer calls. When given the cue, you speak: your producer gets you to record a radio commercial, and you have to mediate an argument between husband and wife in your back seat. “It’s somewhere maybe between a book and a movie because you’re not imagining every detail,” says head writer Amy Green.
The game runs off Google’s Cloud Speech platform, which recognizes voice input, and it may return 15 or 20 lines responding to whatever you might say, says Green. While those lines may steer the story in different directions, the outcome of the game is always the same. “If you never speak a word, you’re still gonna have a really good experience,” she says.
This is a similar design to Starship Commander: anticipating anything the player might say, so as to record a pre-written, voice-acted response.
“It sounds like a daunting task, but you’d be surprised at how limited the types of questions that people ask are,” says Mejia of Human Interact. “What we found out is that 99% of people, when they get in VR, and you put them in the commander’s chair and you say, ‘You have a spaceship. Why don’t you go out and do something with it?’ People don’t try to go to the fast food joint or ask what the weather’s like outside. They get into the character.”
“The script is more like a funnel, where people all want to end up in about the same place,” he adds.
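The funnel Mejia describes, in which many different phrasings collapse into a handful of anticipated intents that each trigger a pre-recorded line, can be sketched in a few lines. This is a deliberately crude illustration and not how Microsoft's or Google's speech services work: plain keyword overlap stands in for cloud-based language understanding, and the intent names, keywords and response lines are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical intents, each with keywords that hint at it.
INTENTS = {
    "agree": {"yes", "sure", "okay", "deal", "agreed"},
    "refuse": {"no", "never", "refuse"},
    "threaten": {"attack", "fire", "shoot", "destroy"},
}

# One pre-written (in a real game, voice-acted) line per intent. Invented text.
RESPONSES = {
    "agree": "Wise choice. Set a course for the outpost.",
    "refuse": "Then you leave me no option.",
    "threaten": "You are in no position to make threats.",
}

# Played when nothing matches, like an impassive, expectant stare.
FALLBACK = "..."


def classify(utterance: str) -> Optional[str]:
    """Return the intent sharing the most keywords with the utterance, or None."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    best_intent, best_score = None, 0
    for intent, keywords in INTENTS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


def respond(utterance: str) -> str:
    """Map transcribed speech to a pre-written line, or to the fallback."""
    intent = classify(utterance)
    if intent is None:
        return FALLBACK
    return RESPONSES[intent]
```

The sketch also shows where such a design is fragile: anything outside the anticipated intents falls through to the fallback, and a single mis-transcribed word can push the reply into the wrong intent entirely.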
Yet for voice games to be fully responsive to anything a user might say, traditional scripts may not even be useful. The ideal system would use “full stack AI, not just the AI determining what you’re saying and then playing back voice lines, but the AI that you can actually have a conversation with,” says Mejia. “It passes the Turing test with flying colors; you have no idea if it’s a person.”
In this world, there are no script trees, only a soup of knowledge and events that an artificial intelligence picks and prunes from, reacting spontaneously to what the player says. Instead of a tightly scripted route with little room for expression, an ideal conversation could be fluid, veering off subject and back. Right now, instead of voice games being a freeing experience, it’s easy to feel hemmed in, trapped in the worst kind of conversation — overly structured with everyone just waiting their turn to talk.
An example of procedurally generated conversation can be found in Spirit AI’s Character Engine. The system creates characters with their own motivations and changing emotional states. The dialogue is not fully pre-written, but draws on a database of information — people, places, event timeline — to string whole sentences together itself.
“I would describe this as characters being able to improvise based on the thing they know about their knowledge of the world and the types of things they’ve been taught how to say,” says Mitu Khandaker, chief creative officer at Spirit AI and an assistant professor at New York University’s Game Center. Projects using the technology are already going into production, and should appear within two years, she says. If games like Codename Cygnus and Baker Street Experience represent a more structured side of voice gaming, Spirit AI’s engine reflects its freeform opposite.
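To make the contrast with a script tree concrete, here is a toy sketch of a character that assembles its reply from a small store of facts and a changing mood instead of picking a finished line. It is an assumption-laden illustration only: it does not use or imitate Spirit AI's actual Character Engine, and the `Character` class, the `FACTS` table, the moods and all of the wording are invented.

```python
from dataclasses import dataclass

# Invented world knowledge: topic as the player might say it -> what is known.
FACTS = {
    "delta outpost": {"name": "the Delta outpost", "detail": "was abandoned after the blockade"},
    "the sergeant": {"name": "the sergeant", "detail": "signed off on this cargo run"},
}

# The character's mood colors how every sentence opens.
OPENERS = {
    "calm": "As far as I know,",
    "suspicious": "You ask a lot of questions, but",
}


@dataclass
class Character:
    name: str
    mood: str = "calm"

    def hear(self, utterance: str) -> str:
        """Update the mood, then build a sentence from known facts."""
        text = utterance.lower()
        # Hypothetical trigger: prying into classified matters raises suspicion.
        if "classified" in text:
            self.mood = "suspicious"
        opener = OPENERS[self.mood]
        for topic, fact in FACTS.items():
            if topic in text:
                return f"{opener} {fact['name']} {fact['detail']}."
        return f"{opener} I have nothing to say about that."
```

Asking the same question before and after the mood shifts produces different sentences, none of which was written out in full ahead of time. A production system would need far richer language generation than string templates, but the division of labor is the same: knowledge and emotional state on one side, phrasing on the other.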
Every game creator deals with a set of classic storytelling questions: Do they prefer to give their users liberty or control? Immersion or a well-told narrative? An experience led by the player or developer? Free will or meaning?
With the rise of vocal technology that allows us to communicate more and more seamlessly with games, these questions will become even more relevant.
“It’s nice to have this idea that there is an author, or a God, or someone who is giving meaning to things, and that the things over which I have no control are happening for a reason,” says Grossman. “There’s something sort of comforting about that: ‘You’re in good hands now. We’re telling a story, and I’m going to handle all this stuff, and you’re going to enjoy it. Just relax and enjoy that.’”
In Untethered, there were moments when I had no idea if my spoken commands meaningfully impacted the story at all. Part of me appreciated that this mimics how life actually works. “You just live your life and whatever happened that day was what was always going to happen that day,” Green says. But another part of me missed the clearly telegraphed forks in the road that indicated I was about to make a major decision. They are a kind of fantasy of perfect knowledge, of cause and effect, which don’t always appear in real life. Part of the appeal of games is that they simplify and structure the complexity of daily living.
As developers wrestle with this balance, they will create a whole new form of game: one that’s centered on complex characters over physical environments; conversation and negotiation over action and traditional gameplay. The idea of what makes a game a game will expand even further. And the voice can reduce gaming’s barrier to entry for a general audience, not to mention the visually and physically impaired (the AbleGamers Foundation estimates 33 million gamers in the US have a disability of some kind). “Making games which are more about characters means that more people can engage with them,” says Khandaker. “Not everybody is necessarily into games which are about violence or shooting but everyone understands what it is to talk to people. Everybody knows what it is to have a human engagement of some kind.”
Still, voice gaming’s ability to bring a naturalistic interface to games matters little if it doesn’t work seamlessly, and that remains the industry’s biggest point to prove. A responsive if abstract gamepad is always preferable to unreliable voice control. An elaborate dialogue tree that obfuscates a lack of true intelligence beats a fledgling AI which can’t understand basic commands.
I’m reminded of this the second time I play the Starship Commander demo. Anticipating the villain’s surprise attack and ultimatum, I’m already resigned to the only option I know will advance the story: agree to his request.
“Take me to the Delta outpost and I’ll let you live,” he says.
“Sure, I’ll take you,” I say.
This time he doesn’t stare blankly at me. “Fire on the ship,” he replies, to my surprise.
A volley of missiles and my game is over, again. I take off my headset to find David Kuelz, a writer on the game who set up the demo, laughing. He watched the computer convert my speech to text.
“It mistook ‘I’ll take you’ for ‘fuck you,’” he says. “That’s a really common response, actually.”