voice interaction design

i ran a 3-day workshop on voice interaction design for aman xaxa’s students of communication design (graduate studies) at the school of planning and architecture, bhopal. we functioned as a fast-moving studio: iterating rapidly while exploring multi-modality in voice-led human–machine conversations.

this is a record of everything we made (and discussed) in the workshop:

project 1: “ice breakers”

instead of asking students to introduce themselves, i decided to ask my questions (“what is your name” and “where are you from”) to their computers. following a discussion around turn-taking, students built simple interactions (sketches) with pre-programmed answers to those questions. they were introduced to voiceflow, and used the speak and intent blocks in it. they also learned about running prototypes, and about how utterances only need to match partially to trigger an intent.
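voiceflow does the matching internally, but to make the idea concrete, here’s a toy sketch (in python, not anything voiceflow actually exposes) of how an utterance could partially match an intent by simple word-overlap:

```python
import re

# two hypothetical intents with sample utterances (illustrative only;
# voiceflow's real matching is far more sophisticated than word-overlap)
INTENTS = {
    "ask_name": ["what is your name", "tell me your name"],
    "ask_origin": ["where are you from", "where do you come from"],
}

def tokens(text):
    """lowercase words in the text, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def match_intent(utterance):
    """return the intent whose samples share the most words with the utterance."""
    best, best_overlap = None, 0
    for intent, samples in INTENTS.items():
        for sample in samples:
            overlap = len(tokens(utterance) & tokens(sample))
            if overlap > best_overlap:
                best, best_overlap = intent, overlap
    return best

# even a partial phrase triggers the intent:
print(match_intent("your name?"))  # → ask_name
```

this is also why students saw intents fire on fragments of a sentence: the match only needs to be partial, not exact.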

project 2: “slow snake game”

in this project, students worked in groups and made an interaction lasting several turns. since the intents for a snake game were easy (“turn left”, “go right”, “keep going straight”, “go up”, etc), this project focused on multi-modality (specifically: offering visual feedback besides voice/text), and opened up discussions about personas. students were encouraged to imagine what kind of a person(ality) their game’s “snake” was, and let that be reflected in how the snake responded to intents.

since they didn’t know how to write code (and had only been introduced to some basic voiceflow blocks by then), this group explicitly defined every step of their snake game, and made the visuals for it too. it didn’t take long, though, and was good enough to test ‘a voice-controlled game’.

persona: a confused snake.

some groups used earcons (further expanding their multi-modal explorations); some used filled pauses (like “hmm”) to negotiate open-ended responses from people; and there was also a discussion on using recorded voice instead of a machine-generated voice (when feasible, of course; for example: when there are only a few standard machine responses, and voice samples can be quickly recorded and placed into a sketch/prototype).

a sloth persona, with sounds played at the end of each machine-turn. (nb. green blocks.)

while prototyping, testing and iterating, we discussed several things: the importance of affordance in context; setting and managing expectations; tapering; equipping a person to issue the right commands; giving relevant feedback (to let a person know what state the machine is in); and even some basic error handling, to enable the interaction to recover from an error elegantly.

once they’d suffered enough, i showed them how handy entities and variables can be.

a prompt instructing the snake to change direction (“please go {up}”) was answered with a mirrored confirmation (“okay, i’m going {up}”). so, this group made separate blocks for “go up” and “go north” in order to offer “going up” and “going north” commands respectively. (nb. as shown in the green blocks.)

my example to the class at the end of the day, showing how entities allow you to define ‘listen for “go {direction}”’ and respond with “okay, i’m going {direction}”. i also introduced them to variables and utterance-variants through this example.
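as a rough illustration (hypothetical python, not voiceflow’s actual api): an entity lets one handler capture any direction, and utterance-variants keep the confirmations from sounding robotic.

```python
import random
import re

# the possible values of a hypothetical {direction} entity
DIRECTIONS = {"up", "down", "left", "right", "north", "south"}

# utterance-variants for the same confirmation, picked at random per turn
CONFIRMATIONS = [
    "okay, i'm going {direction}",
    "sure, {direction} it is",
    "alright, heading {direction}",
]

def handle(utterance):
    # listen for "go {direction}" and capture the entity value
    match = re.search(r"go (\w+)", utterance.lower())
    if match and match.group(1) in DIRECTIONS:
        direction = match.group(1)               # the captured entity
        template = random.choice(CONFIRMATIONS)  # a variant, for variety
        return template.format(direction=direction)
    return "sorry, which way should i go?"

print(handle("please go north"))
```

one handler replaces a separate block per direction, which is exactly the tedium the entity-based example was meant to remove.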

we also noticed how, in a multi-modal experience (using voice, earcons, and visuals), we can trigger modes individually, together, or in a sequence. for example, when offering feedback (to denote success or error), it may be more effective for the machine to play an earcon before displaying visual feedback.

an earcon plays before the machine’s verbal response.
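a minimal sketch of that sequencing idea (illustrative python; the modes and payloads are made up):

```python
# each feedback is an ordered list of (mode, payload) events: the earcon
# fires before the visual, which fires before the spoken line.
def feedback_events(success):
    if success:
        return [("earcon", "chime"),
                ("visual", "green check"),
                ("voice", "okay!")]
    return [("earcon", "buzz"),
            ("visual", "red cross"),
            ("voice", "hmm, that didn't work")]

def render(events):
    for mode, payload in events:
        print(f"{mode}: {payload}")  # stand-in for playing/showing/speaking

render(feedback_events(False))
```

representing the turn as an ordered list makes the sequencing an explicit design decision, rather than an accident of how the blocks happen to be wired.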

this led us to a discussion about using memory to enrich interactions, instead of just building reactive or prescriptive conversational experiences.
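for example, a snake that remembers its last direction can honour “keep going” without being told where. a hypothetical sketch:

```python
class SnakeSession:
    """a tiny session memory: remembers the last direction it was given."""
    def __init__(self):
        self.last_direction = None

    def handle(self, utterance):
        utterance = utterance.lower()
        for d in ("up", "down", "left", "right"):
            if d in utterance:
                self.last_direction = d  # remember across turns
                return f"okay, going {d}"
        if "keep going" in utterance and self.last_direction:
            # memory lets a reactive turn feel conversational
            return f"still going {self.last_direction}"
        return "which way?"

s = SnakeSession()
print(s.handle("go left"))     # → okay, going left
print(s.handle("keep going"))  # → still going left
```

even one remembered variable changes the feel of the interaction: the second turn works only because the first one was kept around.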

project 3: “car find toilet”

“let’s imagine a woman driving a car, with her child in the back seat. they’re on a highway, and the child suddenly expresses her need to go to a toilet, stat. instead of the mother checking a map on her phone (which is dangerous to do while driving), can her vehicle help by directing her to a facility nearby?”

the students wrote a brief, in 5–6 sentences, establishing context (world-building, vehicle type, persons involved, where they’re going) and defining the task the mother would want to accomplish by talking to her vehicle.

first, they performed the interaction: after writing down a “happy-flow” script, they acted it out in front of each other (with one person in a team role-playing the car, and another pretending to be a passenger).

then, they built (at least) the happy-flow in software. students were encouraged to use sounds and images, manage some error-handling, and include at least a few variations in the machine’s utterances.

using animated visuals (to denote the machine’s state), maps, earcons, playful responses, etc in a context-specific manner. (nb. since these were rapid sketches, students were encouraged to use placeholder text/images/sounds within the sketch; if it was good enough to convey the idea, it was good enough.)

even within the simple happy-flows made by students, interactions broke down often (even after several tests, fixes and iterations within class). so i sketched a typical error-handling flow, and shared it with the class.
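the gist of that flow: re-prompt with progressively more help (tapering in reverse, as it were), and after a few misses, fall back to another modality instead of looping forever. a rough sketch (assumed prompts and structure, not the exact diagram i drew):

```python
# escalating no-match prompts: apology first, then a hint about what to
# say, then a graceful hand-off to the screen. (hypothetical wording.)
NO_MATCH_PROMPTS = [
    "sorry, i didn't catch that.",
    "you can say things like 'find a toilet nearby'.",
    "i'm having trouble understanding. let me show the options on screen instead.",
]

def run_turn(understood, failures):
    """one machine turn: returns (reply, new failure count)."""
    if understood:
        return "okay, routing you to the nearest rest stop.", 0
    # escalate through the prompts; stay on the last one once exhausted
    prompt = NO_MATCH_PROMPTS[min(failures, len(NO_MATCH_PROMPTS) - 1)]
    return prompt, failures + 1

reply, failures = run_turn(False, 0)
print(reply)  # first no-match: a simple apology
```

the important part is the reset on success: a recovered interaction shouldn’t carry its past failures into the next turn.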

additional resources:

in “always in” (2019), drew austin wrote—optimistically—about a future where everyone wears headphones all the time. students were asked to share their thoughts on the essay (or any specific part in it) while introducing themselves on the first day. later on, students were encouraged to read some “theory” about how machines understand what we speak to them: through an introductory article to ‘natural language understanding’ (on vux).

whatever we make is, in a way, magic (because all the complexity gets hidden away behind a seamless experience). with this in mind, i recommend genevieve bell’s talk on magic and fear and wonder and technology to any student of interaction design. also: in birth of living code: tamagotchis and teddybears (2019), anne skoug obel went through different ideas about when code/machines are perceived to be living.