Author: Mr. Yu
From the user’s perspective, in-car intelligent voice is arguably the most immediate part of the experience, and it is one of the key yardsticks by which the public judges whether a smart car is truly “smart”.
Industry data shows that by 2021 the adoption rate of intelligent voice interaction in Chinese passenger cars had reached 86%. Clearly, an excellent voice experience has become one of the most valuable product attributes. From OEMs to suppliers, the industry has kept its focus on speech, with players racing ever faster to overtake one another.
Perhaps by tacit understanding, or perhaps because it has long been mutually beneficial, in 2022 we see a new trend among the top players in in-car voice: greater practicality, and a sharper focus on strengthening core capabilities to improve the experience.
On a busy workday afternoon before the Spring Festival, we talked with the AI product team at XPeng Motors to confirm the changes we had observed in in-car voice and to resolve some long-standing questions, including but not limited to:
- Why does the XPeng team insist on building visible-and-speakable functionality and keep investing resources to refine it?
- Why are practitioners no longer as enthusiastic about packaging concepts and staging technology demos as they were a few years ago?
When the ceiling of speech rises again
For today’s intelligent cabin, voice is not only a common means of human-computer interaction but also a platform that links cabin functions and perception and delivers differentiated capabilities and services. Some automakers have chosen the self-development route, building a voice interaction framework that fits their own positioning and keeping the initiative to define the user experience in their own hands.
XPeng Motors has been striving for exactly this. Even if it has not yet reshaped the mindset of every user, XPeng, long distinguished by high-level intelligence, has won broad, undeniable recognition in both the market and the industry with its high-standard intelligent voice experience and its very open application ecosystem.
Since the launch of the XPeng P7 in 2020, with its all-scenario voice interaction system 1.0 (continuous dialogue, semantic interruption, visible-and-speakable, and more), XPeng has become a benchmark that many competitors chase. Do you remember the unveiling of the XPeng G9? Beyond media and opinion leaders exclaiming “the ceiling of speech recognition”, I also noticed that in several articles speech professionals expressed varying degrees of concern about the new features of “Full-Scenario Voice 2.0”, especially the Full-time Dialogue function.
The upgraded “Full-Scenario Voice 2.0” brings three key new features: Rapid Dialogue, Full-time Dialogue, and Multi-person Dialogue. In the system framework, each of the three has its own independent switch and takes effect once the user turns it on.
Rapid Dialogue: significantly improved wake-up and response speeds make up the Rapid Dialogue function. According to official information, less than 300 milliseconds elapse between the end of the user’s speech and the interface animation responding. A post by XPeng Motors mentioned that some media outlets issued more than 40 consecutive commands in one minute while testing the XPeng G9. Subjectively, drawing on past testing experience across different acoustic environments, whether with in-car voice or home smart speakers, we can say emphatically that the XPeng G9 has reset our expectations for speech response speed.
Full-time Dialogue: with the Full-time Dialogue switch on, the voice assistant “Small P” listens continuously, and no wake word is needed to start a conversation. In most cases, the user can simply issue a command and have it executed. If the phrasing is unusual, or Small P cannot confirm that the utterance is directed at it and therefore does not respond, the user can say “Small P” within 5 seconds, after which Small P will recognize and execute the command it had not responded to.
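The retroactive wake behavior described above can be sketched as a small state machine: silently remember an utterance that was not acted on, and execute it after the fact if the wake word arrives within the 5-second window. This is a minimal illustrative sketch, not XPeng’s implementation; the class, command set, and return strings are all hypothetical.

```python
import time

RETRO_WINDOW_S = 5.0  # the article's 5-second window for adding "Small P"

class FullTimeDialogue:
    """Toy sketch of wake-word-free command handling with a
    retroactive wake window (names and API are illustrative)."""

    def __init__(self, known_commands, clock=time.monotonic):
        self.known_commands = set(known_commands)
        self.clock = clock
        self.pending = None  # (utterance, timestamp) not yet acted on

    def hear(self, utterance):
        now = self.clock()
        if utterance in self.known_commands:
            # Clear, recognized command: execute immediately, no wake word.
            self.pending = None
            return f"EXECUTE: {utterance}"
        if utterance.lower() == "small p":
            # Wake word shortly after an ignored utterance: act on it
            # retroactively if it is still inside the window.
            if self.pending and now - self.pending[1] <= RETRO_WINDOW_S:
                cmd, _ = self.pending
                self.pending = None
                return f"EXECUTE: {cmd}"
            return "LISTENING"
        # Ambiguous speech possibly not directed at the assistant:
        # stay silent, but remember it in case a wake word follows.
        self.pending = (utterance, now)
        return "IGNORED"
```

The design choice worth noting is that rejection is not final: the assistant keeps a short memory so a false rejection costs the user only two extra words.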
Multi-person Dialogue: with both the Full-time Dialogue and Multi-person Dialogue switches on, Full-time Dialogue covers the entire vehicle. With a full car, occupants in every seat can interact with Small P by voice, alternately or simultaneously, without interfering with one another. For example, the driver says “turn on the seat heater”, the passenger says “me too”, and both requests are carried out, achieving a higher-level cross-sound-zone, multi-round dialogue effect.

From the voice interaction architecture down to basic voice capabilities, XPeng sticks to a fully self-developed strategy, which gives it maximum freedom in defining the product. Outside observers regard this as XPeng’s most unshakable moat at the cockpit product level.
In a sense, voice is almost the easiest product area in which to stumble into pitfalls. Launching a feature is not the end, not even a milestone, but a starting point: going from nothing to usable is not the same as going from usable to good. If the experience is poor, users’ limited patience will not grant a clumsy voice assistant many second chances.
With features such as full-cabin multi-zone recognition and multi-person dialogue, the driver no longer has to act as the car’s sole interaction center. Good multi-zone recognition, in turn, places high demands on sound pickup and discrimination across the car’s multiple audio zones.
For a long time, wake-word-free interaction was a double-edged sword that discouraged many imaginative product managers: tuned a touch too sensitive, and the car gains an emotionless chatterbox that may greet every user utterance with annoying comments or misunderstandings; a touch too insensitive, and people cannot make sense of it at all, unsure whether the little assistant behind the screen is absent-minded or has “run away from home”.
In this crowded field, different runners show different styles: some release first and optimize via OTA, some polish the experience carefully before shipping, and some subtract, releasing features bit by bit. Whether preemptive or deliberate, none of these choices is inherently right or wrong, because different business units differ in strength and in how they make decisions.
To needs that are still expanding, the industry keeps offering its answers.
On the XPeng G9, we see the new style of “Full-Scenario Voice 2.0”: more efficient, more casual, and more convenient.
Behind this stands the group mentioned most often in this interview after “users”: the R&D team.
The race for dialogue response speed is a representative example. XPeng’s R&D team targets the industry’s fastest response, shoring up its weak points and optimizing online service latency through stream processing. With improvements in hardware and algorithms, ASR (automatic speech recognition) accuracy has risen significantly. While the user is still speaking, XPeng’s AI assistant Small P predicts the upcoming command in real time, at millisecond granularity, so that the moment the utterance ends it can respond quickly and smoothly and execute the command.

And behind the Full-time Dialogue feature, the XPeng team set standards that can almost be called “self-torturing”, such as holding Full-time Dialogue’s miss and false-rejection rates to a few parts per ten thousand, a bar hundreds of times stricter than the standard for continuous dialogue…
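The streaming idea described above can be illustrated in miniature: instead of waiting for the final transcript, narrow down candidate commands as partial ASR hypotheses arrive, so the action is already prepared when end-of-speech is detected. This is a minimal sketch under that assumption; the command list, class, and prefix-matching shortcut are hypothetical, far simpler than a production NLU pipeline.

```python
class StreamingResponder:
    """Toy sketch of predicting a command from streaming partial ASR
    hypotheses so the response can fire the instant speech ends.
    (Prefix matching stands in for real incremental NLU.)"""

    def __init__(self, commands):
        self.commands = commands
        self.candidates = list(commands)

    def partial(self, hypothesis):
        # Called once per streaming partial transcript: shrink the
        # candidate set to commands consistent with what was heard.
        self.candidates = [c for c in self.commands
                           if c.startswith(hypothesis)]
        if len(self.candidates) == 1:
            # Only one command remains: start preparing it early.
            return f"PREPARE: {self.candidates[0]}"
        return None

    def end_of_speech(self, final_transcript):
        # By now the work is (ideally) already prepared, so the
        # visible latency is just this final confirmation step.
        self.candidates = list(self.commands)
        if final_transcript in self.commands:
            return f"EXECUTE: {final_transcript}"
        return "REJECT"
```

The point of the sketch is the overlap: recognition and preparation run concurrently with the user’s speech, which is how a sub-300 ms perceived response becomes plausible.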
Today, we still have no way of knowing what battles the R&D team fought before this feature launched. What we do know is that after Full-time Dialogue officially went live, the performance measured online was far better than the original target.
“Bring convenience to users and leave the problems to R&D.”
During the interview, a member of the XPeng team summed up their work philosophy with a simple joke.
As the voice experience evolves and upgrades, the advantage of XPeng’s full self-development keeps showing. A seemingly aggressive product strategy does not mean rough planning and development. On the contrary, as with the product logic we observed on the XPeng P7, the XPeng team treats the misrecognition rate, false-rejection rate, and miss rate as key metrics, setting strict technical targets and pursuing them with near-obsessive determination.
Take its persistence with the visible-and-speakable feature. In the XPeng team’s view, as an advanced voice capability, visible-and-speakable largely resolves the trade-off between the user’s attention and execution accuracy when facing the screen, without any physical contact. It reduces the user’s intention to a voice command, and every interface in the system becomes a voice-friendly action menu. This not only lowers the barriers and costs of operation but also makes users more confident in voice interaction.
The facts bear this out. Research by domestic institutions shows that XPeng owners generally have a solid understanding of the visible-and-speakable function and are more willing to use voice as a routine interaction method in appropriate scenarios.
In fact, it is not only XPeng; the whole in-car voice industry is becoming more pragmatic.

In recent years, as social networks and especially short-video platforms have flourished, who hasn’t seen a few funny videos about intelligent cars? From comedian-style voice commands to so-called emotional interaction, we have to admit these clips have helped popularize basic knowledge of intelligent cars among the general public.
For a long time, the in-car voice industry was keen on making voice assistants proficient at “reading emotions”, testing whether the AI’s IQ and EQ were up to par in every sentence. For example, when a user says “it’s a bit hot today”, the assistant is expected not merely to turn on the air conditioning and ventilation right away; it must also offer a few witty, internet-flavored words of care to satisfy people.
Now, let us look at this issue rationally.
For a statement like “it’s a bit hot today”, should we treat it as a command or simply as an expression of feeling? Is it addressed to the voice assistant or to the other passengers? Should the car activate the air conditioning and other comfort functions, or do nothing? And if it misunderstands, will people end up helpless and annoyed?
Over-sensitivity puts unnecessary pressure on others. That is not only a principle of interpersonal communication but also something an AI should understand deeply.
Fast forward to 2022. In both product and promotion, there seems to be a certain tacit understanding within the car voice industry. We see more and more voice products on mass-produced models focusing on strengthening basic skills.
The interaction flow has become simpler and smoother, and voice assistants have become better at understanding, with higher tolerance for imperfect commands. More thought is given to the convenience of everyone in the car: independent audio zones, the ability to interrupt at any time, and better handling of crossing instructions, all in the name of convenience.
To sum up, China’s in-car voice industry has come a long way; to borrow the Olympic motto: faster, higher, stronger. Showy skill demos help with publicity, but automakers and suppliers have gradually recognized the tension between skills and fundamental abilities. Skills may yield no value beyond the buzz; only ability does.
The XPeng Motors team believes that excessive showmanship in voice products blindly inflates user expectations, creating a vicious cycle: the more a brand promises, the higher expectations climb and the more freely users speak, producing more ambiguous requests. The boundaries of the in-vehicle assistant keep expanding, problems multiply, and the odds of a stable user experience shrink. As this repeats, trust in the feature erodes, and with it trust in the whole vehicle.
In short, excessive showcasing of technology raises user expectation without providing actual value and creates obstacles.
Across the industry, though, the focus on core capabilities and user experience is stronger than ever. That is a sign of growing maturity.
Conclusion
In fact, not only the XPeng Motors team but every professional we interviewed said that their ultimate goal is a human-like assistant that can truly fulfill the role of an intelligent-cabin voice assistant.
Admittedly, flashy but entertaining expressions of in-vehicle voice will still appear in the future. Meanwhile, the more pragmatic approach will continue at the product level, and the voice interaction experience will keep growing more convenient and richer.
This is because more and more voice professionals are taking “building user trust through stable, reliable use of the product” as the underlying logic of their work.
This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.