Author: Michelin
What is the standard configuration for an intelligent cockpit? Touch screen? Head-up display? Or voice assistant?
Perhaps in a future of full autonomy, cars will need neither touch screens nor voice assistants, because people will no longer be needed to drive. In the current stage of "over-intelligent and under-automated" vehicles, however, we have to admit that voice has become the optimal solution for in-cabin interaction, and a focal point for automakers designing the cockpit.
From the early invisible, mechanical voice assistants, to NOMI, NIO's round, visible anthropomorphic character, and then to the warm, friendly names automakers now give their assistants, voice systems have steadily evolved along the path of personification.
By 2021, that personification had shifted from appearance to substance: at the beginning of the year, BMW's new-generation iDrive 8.0 used a gentle, knowledgeable voice to promote active emotional interaction; more recently, XPeng launched a new voice system, Xiao P, whose soft "little sister" voice aims at a more personal emotional experience.
Are the personification and animation of voice interaction a selling point, or an inevitable trend as vehicles grow more intelligent?
Both users and cars need a personified "it"
The increasing personification of voice assistants is a choice made jointly by automakers and users, and a balance between emotion and function.
Why do we need voice assistants in the cabin at all? Beyond the direct need for communication with users, as performance improves, in-cabin voice interaction increasingly plays the role of a platform.
Through this platform, users can reach the cabin's software functions: opening map navigation, playing music, searching for information. They can also control in-car hardware: opening and closing windows, adjusting the air conditioning, moving the seats. As physical buttons gradually disappear from the cabin, the car's functionality needs a personified carrier to orchestrate it all.
On the XPeng P7, whose voice system was recently upgraded over the air, saying "I'm a little hot" to Xiao P from the front passenger seat makes the assistant lower the passenger-side air-conditioning temperature and raise the fan speed; saying "navigate to xxx, map scale one kilometer" from the driver's seat makes it start navigation and zoom the map to a one-kilometer scale. Here the voice assistant acts like a personified butler, orchestrating software and hardware functions.
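The zone-aware behavior described above — the same utterance acting on the speaker's side of the cabin — can be sketched as a simple mapping from utterance plus speaker zone to cabin actions. All names and the action strings are hypothetical; a real system would route these through vehicle-domain controllers.

```python
# Hypothetical sketch of zone-aware intent handling: the same utterance
# triggers actions on the speaker's side of the cabin. Action names are
# illustrative, not a real vehicle API.

def handle_utterance(text: str, zone: str) -> list[str]:
    """Map a free-form utterance plus speaker zone to cabin actions."""
    actions = []
    if "hot" in text.lower():
        # Cool down the speaker's zone: lower temperature, raise fan speed.
        actions.append(f"ac/{zone}/temperature_down")
        actions.append(f"ac/{zone}/fan_speed_up")
    return actions

print(handle_utterance("I'm a little hot", zone="front_passenger"))
# ['ac/front_passenger/temperature_down', 'ac/front_passenger/fan_speed_up']
```

A rules table like this is only the last step; production systems put a trained intent classifier in front of it, but the zone-routing idea is the same.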
Beyond serving as a platform, personified speech can also bring a warmer, more cordial emotional experience.
According to the MOS (Mean Opinion Score) voice-evaluation standard used by Microsoft, the closer a synthesized voice is to a human voice, the more comfortable listeners tend to feel.
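MOS itself is a simple metric: listeners rate each voice sample on a 1 (bad) to 5 (excellent) scale, and the scores are averaged. A minimal sketch, with illustrative ratings rather than real evaluation data:

```python
# Minimal sketch of a Mean Opinion Score (MOS) calculation.
# MOS averages listener ratings on a 1 (bad) to 5 (excellent) scale;
# the sample ratings below are illustrative, not real test results.

def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings into a single MOS value."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

# Hypothetical listening-test results for two synthesized voices:
robotic_voice = [3, 3, 2, 4, 3]
human_like_voice = [4, 5, 4, 5, 4]

print(mean_opinion_score(robotic_voice))     # 3.0
print(mean_opinion_score(human_like_voice))  # 4.4
```

The higher-scoring voice is the one listeners judged closer to natural human speech.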
In a semi-enclosed space like the cabin, drivers on long-distance or high-speed trips may grow bored and fatigued, or frustrated by traffic congestion. A comfortable voice can not only carry out functional commands but also soothe emotions and relieve fatigue; this is why voice systems on the market favor cute cartoon voices or gentle female voices.
XPeng's Xiao P voice system is designed to avoid mechanical, monotonous replies, instead answering playfully after each command with "好哒" ("okay~"), "好滴" ("got it"), or "欧克" ("OK"), which makes in-cabin conversation less boring.
Voice assistants inevitably call to mind the film "Her", in which the male protagonist is consoled and redeemed by the seductive, humorous AI Samantha, and then falls in love with her. Falling in love with a voice system may seem remote, but it is not hard to imagine that a voice as considerate as a human being could make people happy, and even dependent on it.
Perhaps in the future, the value of a car will not only be reflected in its brand, design, speed, and driving experience, but also in the sense of intimacy brought by the voice assistant in the cabin. “This is the voice of an old friend” may become a bonus when buying a car.
More important than sounding like a person is performing like one
If a humanized voice makes people comfortable, why not make the voice fully human-like from the start?
Purely human-sounding voices are not hard to achieve technically. In the navigation apps we use every day, celebrity navigation voices are generated from recorded voice samples via TTS speech synthesis, to strikingly realistic effect. But a navigation voice is one-way output; it is not a bidirectional interaction that must recognize voice commands, execute tasks, and give feedback.
In-cabin voice interaction is far more complex. First, the voice system must quickly and accurately extract command keywords from users' free-form instructions. Second, relying on the integration of in-car hardware and software, it must correctly route those commands to other functional modules, sometimes across vehicle domains. Only then comes the final step: giving the user feedback in a human-like voice. All three stages must work in concert to ensure safe driving and a smooth interactive experience.
For instance, take the same seat-adjustment command, "put down the driver's seat", spoken to Xiao P. While the car is stationary, the system reclines the seat automatically. Once the car starts moving, it rejects the command: automatic seat adjustment is forbidden while driving, avoiding the danger of a sudden recline.
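The safety gate in this example amounts to checking vehicle state before dispatching a recognized command. A minimal sketch, with all names hypothetical (a real system would send signals to a seat ECU rather than return strings):

```python
# Hypothetical sketch of a voice-command dispatcher with a driving-state
# safety gate: seat adjustment is allowed only while the car is stationary.

from dataclasses import dataclass

@dataclass
class VehicleState:
    speed_kmh: float

# Commands that must be refused while the vehicle is moving.
BLOCKED_WHILE_MOVING = {"recline_driver_seat"}

def dispatch(command: str, state: VehicleState) -> str:
    """Return the spoken feedback for a recognized command."""
    if command in BLOCKED_WHILE_MOVING and state.speed_kmh > 0:
        return "Sorry, the seat cannot be adjusted while driving."
    if command == "recline_driver_seat":
        # In a real system this would signal the seat controller.
        return "OK, reclining the driver's seat."
    return "Sorry, I didn't understand that."

print(dispatch("recline_driver_seat", VehicleState(speed_kmh=0)))
print(dispatch("recline_driver_seat", VehicleState(speed_kmh=60)))
```

The point of the design is that the safety check sits between recognition and execution, so even a correctly understood command can be refused when the vehicle state makes it unsafe.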
Even with safety hazards eliminated, if the first two stages are done poorly, a personified voice alone can backfire on user experience. In one experiment by the Media Effectiveness Lab at the University of Pennsylvania, intelligent voice customer service was split into two types, robotic and human-like. Volunteers, unaware of which type they were talking to, rated their experience afterward. For the same interactions, the robotic service averaged 80 points, while the human-like service scored only 60.
The reason is simple: when a voice system pretends to be human, users subconsciously judge it by human standards. To manage user expectations, a personified voice must therefore be paired with "personified" capability.
Whether in BMW's iDrive 8.0 or XPeng's latest Xiao P, voice personification is built on deep neural networks that make synthesis more lifelike and natural-language recognition more robust, combined with multimodal interaction: continuous dialogue, multi-turn dialogue, free-form speech recognition, "what you see is what you can say" control, and coordination with other modalities such as the camera and touch screen. Together, these let the voice sound personified while still doing its job.
As intelligence levels rise, voice systems are slowly moving from cold and mechanical to personified. Making them genuinely warm and emotional, however, means clearing several hurdles, such as handling dialects and individual speech habits. A voice system must not only master the common patterns of spoken commands through deep learning; it also needs a smart "brain" that can learn individual behavior and serve personalized needs.

In our experience with XPeng's Xiao P, saying "cancel full screen" while playing a video caused the system to switch off the central control screen altogether. Clearly the assistant still needs to become more considerate, coordinating with in-cabin sensors and other interaction methods and using multimodal observation to deliver a more thoughtful experience.
In BMW's iDrive 8.0, a previewed feature has the voice system cooperate with in-car cameras to capture users' expressions and interpret commands more accurately, letting the system understand users beyond their words.
Of course, just as realistic AI face-swapping carries potential risks in gray areas, AI ethics and safety will become unavoidable questions once voice assistants are "humanized" past a certain point.
Final thoughts
The personification of voice assistants brings users full-scenario service: sound is merely the carrier; interaction is the essence. In the future, when a voice assistant can meticulously adjust the cabin environment, plan trips, arrange activities, and even anticipate commands I have not yet spoken, perhaps it will also sense the tedium of my long, high-speed drives and offer a mode that performs like Guo Degang's crosstalk.
This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.