Sometimes, Siri can’t even…
What formal semantics tells us
There’s a stubborn phenomenon in formal semantics that always zips right over my head: “donkey sentences.” Essentially, these are expressions containing a pronoun that sits outside the scope of its antecedent (typically an indefinite tucked inside a relative clause or the “if” part of a conditional), yet still covaries with that antecedent.
Doesn’t ring a bell? Here’s a basic example:
(1) Every artist who paints a picture frames it.
The statement above features the personal pronoun “it,” whose meaning and antecedent (“a picture”) are evident to you and me, but to those who study formal semantics…not so much. The straightforward translation into first-order logic, shown below, is incorrect: the final y falls outside the scope of ∃y and is left as a free variable:
(2) ∀x(ARTIST(x) ∧ ∃y(PICTURE(y) ∧ PAINT(x,y)) → FRAME(x,y))
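For comparison, the reading you and I actually get is usually written by promoting the indefinite to a universal quantifier that takes scope over the whole conditional, so that y is bound in the consequent. This is the standard textbook repair, written here in LaTeX with the same predicate names as (2):

```latex
\forall x \, \forall y \, \bigl( (\mathrm{ARTIST}(x) \land \mathrm{PICTURE}(y) \land \mathrm{PAINT}(x,y)) \rightarrow \mathrm{FRAME}(x,y) \bigr)
```

The puzzle is that “a picture” looks existential on the surface, yet the sentence only comes out right when it is treated as universal.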
Although this case of anaphoric reference may be somewhat convoluted, the preceding example has me pondering the related notion of deixis and its (future) role in interaction design.
Deictic expressions tell us everything about context in relation to the speaker, including tidbits about location, people, and time. When we talk about position, what words do we use? “Here” and “there” convey place. Perhaps “this street” or “that city,” more precise noun phrases that, regardless of their idiosyncrasy, provide us with spatial insight through and through. However, these points of locality are dependent on the active party, not the passive listener or reader.
Key point: Los Angeles is “here” to you in Los Angeles, but “there” to your friend in Oslo. To me, I am “I,” but to you, I am “you.” January 7, 2019, is “today” today, but on January 8, 2019, it’s “yesterday” today.
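One way to picture this is as a lookup against the context of the utterance. The Python sketch below is purely illustrative: `Context`, `resolve`, the speaker names, and the five-word vocabulary are all hypothetical, invented for this example rather than taken from any voice assistant.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Context:
    """Who is speaking, to whom, where, and when (hypothetical fields)."""
    speaker: str
    addressee: str
    location: str
    day: date


def resolve(word: str, ctx: Context) -> str:
    """Map a deictic word to its referent in the given context."""
    table = {
        "i": ctx.speaker,
        "you": ctx.addressee,
        "here": ctx.location,
        "today": ctx.day.isoformat(),
        "yesterday": (ctx.day - timedelta(days=1)).isoformat(),
    }
    key = word.lower()
    if key not in table:
        raise KeyError(f"no rule for deictic word: {word!r}")
    return table[key]


# Two made-up utterance contexts, one day and one ocean apart.
la = Context("Alex", "Sam", "Los Angeles", date(2019, 1, 7))
oslo = Context("Sam", "Alex", "Oslo", date(2019, 1, 8))

print(resolve("here", la), "/", resolve("here", oslo))
print(resolve("today", la), "/", resolve("yesterday", oslo))
```

The same word resolves differently in each context, while “today” in the first and “yesterday” in the second land on the same date. Nothing in the word itself tells you that; only the context does.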
Applying deixis to voice assistants
We ask Siri to call our moms, play the new Ariana song, tell us a joke. Alexa can switch on your foyer lights with ease, ignite the fireplace on a bone-chilling night, and power up the vacuum cleaner. Okay, great. But what if I point? Snap? Whistle? Dart my eyes at my TV? Nothing but automated apologies.
Why is that? The former are voice-activated requests. This is what Siri and Alexa do for a living: react and respond to your coherent, logical, syntactically accurate commands, questions, and statements. By gesticulating at the television, you’re adding a nonverbal element while erasing vital linguistic context clues and replacing definite nouns with pronouns. This is why deixis matters.
Here’s a real-life scenario involving motion. You’re walking past 7-Eleven and ask your device, “How much does a 12-ounce Slurpee® cost there?” Upon posing the question, you simultaneously glance over at 7-Eleven a mere 10 feet from you. In this instance, 7-Eleven has been reduced to the deictic adverb “there” and assigned an eye movement by the user. Fast-forward years from now and your gizmo might catch that brief instance of eye contact with 7-Eleven, spitting out “$0.80” on the cent. Unspooling it today, however, the deictic expression “there,” among others, is beyond cryptic to your voice assistant. Instead, your device will likely accommodate your inquiry by pulling up thousands of web results, which is its workaround: what it was programmed to do when interpreting muddled directives.
Understandably, gesture is tough due to the utter complexity of kinetics and motion detection. There are certainly aspects of deixis that voice recognition software can parse here and now, though principally in a unilateral fashion and with little intricacy. Try asking Siri where Portland is located. It will map Portland for you. Easy! Follow up with a question about the weather: “What’s the weather there this week?” You’re presented with the weekly forecast. Good! Now, revert to your first question using only a pronoun: “Where is it?” Here, “it” refers to the city of Portland, yet “it” escapes Siri: “I don’t have any places to show.” Siri cannot alter course and discern that “it” is Portland, even though Portland acted as the antecedent in round two.
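What the Portland exchange calls for is some memory of the last place mentioned. Here is a minimal sketch of that idea in Python, under heavy assumptions: `Dialogue`, `KNOWN_PLACES`, and `PLACE_PRONOUNS` are hypothetical names, the gazetteer holds three cities, and none of this reflects how Siri is actually built.

```python
import re
from typing import Optional

KNOWN_PLACES = {"Portland", "Oslo", "Los Angeles"}  # toy gazetteer (assumed)
PLACE_PRONOUNS = {"it", "there"}


class Dialogue:
    """Remembers the most recently named place across turns."""

    def __init__(self) -> None:
        self.last_place: Optional[str] = None

    def referent(self, utterance: str) -> Optional[str]:
        """Return the place an utterance is about, if one can be found."""
        lowered = utterance.lower()
        # An explicit place name wins and becomes the new antecedent.
        for place in KNOWN_PLACES:
            if place.lower() in lowered:
                self.last_place = place
                return place
        # Otherwise, fall back on the antecedent when a pronoun appears.
        words = set(re.findall(r"[a-z]+", lowered))
        if words & PLACE_PRONOUNS:
            return self.last_place
        return None


dialogue = Dialogue()
for turn in ["Where is Portland?",
             "What's the weather there this week?",
             "Where is it?"]:
    print(turn, "->", dialogue.referent(turn))
```

In this toy version, “there” and “it” both fall back on the same stored antecedent, so the third question resolves just as the second does. A real system would also have to decide when the antecedent has gone stale, and which of several candidates a pronoun picks out.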
Proximity, distality, and spaces far more confusing
Determiners can be incredibly bewildering, too, even outside the field of voice recognition. Whilst intended to modify a noun, the English demonstrative determiners “this” and “that” are often used interchangeably by native speakers…and not always accurately. Remember that prescriptivists assert “this” should be used to indicate proximity, whereas “that” ought to signify distance. Semantics complicates this quandary even further in English, where determiners may modify concrete nouns (“that turnip”) or abstract nouns (“those nightmares”).
And then we throw bilingualism into the mix. Although English uses a two-way referential distinction (“proximal” and “distal”), numerous languages encompass three or more layers for spatial deixis. Korean, for example, has a locution for an object near the speaker, a term for one near the listener, and yet another to denote something distant from both the speaker and the listener. (Fun fact: West Greenlandic has a whopping 12 layers, even a mode for objects that are invisible to the eye!) Ultimately, how L2 speakers carry these distinctions over into their non-native English will vary from language to language and person to person, a tumultuous reality for UXers and designers alike.
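To see why the mismatch matters, here is a deliberately simplified sketch of the two systems side by side. The function name `demonstrative` and its boolean inputs are hypothetical; “i,” “geu,” and “jeo” are the usual romanizations of the three Korean demonstratives, and real usage depends on far more than raw distance.

```python
def demonstrative(system: str, near_speaker: bool, near_listener: bool) -> str:
    """Pick a demonstrative for an object under a toy spatial rule."""
    if system == "english":  # two-way: proximal vs. distal
        return "this" if near_speaker else "that"
    if system == "korean":  # three-way: speaker, listener, neither
        if near_speaker:
            return "i"
        return "geu" if near_listener else "jeo"
    raise ValueError(f"unknown system: {system!r}")


# Compare the two systems for each position of the object.
for near_speaker, near_listener in [(True, False), (False, True), (False, False)]:
    english = demonstrative("english", near_speaker, near_listener)
    korean = demonstrative("korean", near_speaker, near_listener)
    print(f"speaker={near_speaker} listener={near_listener}: {english} / {korean}")
```

Under this toy rule, both “geu” and “jeo” collapse into English “that,” so a distinction the speaker makes effortlessly in one language simply has nowhere to go in the other.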
These shortcomings, whether they pertain to gesture, ambiguity, or the many other factors that I failed to mention above, demonstrate how far we are from the pinnacle of voice interaction when it comes to deictic expressions and spatial orientation. Computational linguists and NLP professionals are essential in streamlining the interface for speech recognition software, such as by defining referents, as are designers by means of user research, complex usability testing, and GIS application development. Whether we examine this stumbling block from a linguistic standpoint or from that of design, ensuring deixis is wholly addressed for technology will, no doubt, remain a knotty endeavor.
For those who work in HCI or UX, where do you believe studies will take us in 2020?
Zac Gipson is a senior management consulting analyst at Accenture Federal Services in Washington, D.C. His primary research interests center on machine translation, natural language processing, and voice user interface. He is also an OCB natural bodybuilder who is teaching others how to build lean muscle on a plant-based diet.