The Despised AI Voice (II) - Fashionable, Show-off and Rolling in Laughter

Hello everyone, I’m Mr. Yu, who has recently had a lot of opinions on intelligent voice technology.

Not long ago, in the first article of this series, “The Despised AI Voice (I) – Not Just Because It Looks Like an Idiot”, we discussed the current status, shortcomings, pseudo-needs and underlying causes of voice technology.

Overall, intelligent voice technology is a well-packaged product concept, but there are still many areas that need improvement.

Because we recognized this from the beginning, we didn’t limit our discussion to the automotive industry.

This time we return to the question, “If intelligent voice technology is so good, why do so many people still dislike it?”, and dig into the stories behind it with friends from different links of the automotive industry chain, setting aside prejudices and stereotypes.

To better understand the full picture and the essence of the problem, I’ve invited friends from the automotive industry to chat with me, and I’ll present our discussions as close to the original dialogue as possible. This is not a formal interview, and there will be some personal observations and thoughts along the way.

Since the article involves many individuals’ career experiences and personal opinions, anonymity is requested, and this time I’ll refer to my anonymous friend as Mr. K.

The second protagonist, Mr. K, is an expert in automotive voice technology. According to his introduction, he has worked as a vehicle voice operation specialist for many years in a top company in the voice industry before entering the automotive industry.

To quote Mr. Leung Man-tao’s slogan from his program “Eight Points” – “There is no guarantee of success, and it may not be useful. For practitioners, it is more important to continue thinking.”

What follows is the transcript of the conversation, with Mr. Yu@GeekCar as myself and the other person being Mr. K.

Image source: Unsplash

Mr. K:

I read your article discussing why intelligent voice technology is stupid, and found it very interesting.

The previous Mr. K was a senior cabin product manager, right? This time, as someone who has worked in the voice business at a major automotive company, I may have different opinions from the previous Mr. K on many issues.

Mr. Yu@GeekCar:

If you are willing to express different opinions, that would be great.

Actually, Mr. K has also told me privately that he hopes to see this series continue and see what kind of insights people from different links in the industry will have.

Mr. K:

You started by discussing what intelligent speech is, but let me be more direct: I don’t think a car that relies on speech can be considered an intelligent car. In other words, intelligent speech is not a necessary condition for intelligent cars.

What is the premise of discussing intelligent cars? It’s autonomous driving, isn’t it?

Think about it. Once L4 and even L5 autonomous driving are available, there is no need for driver interaction. So why would I need speech?

Mr. Yu@GeekCar:

So you mean, the reason we rely on speech now and even consider it a selling point is because users cannot detach themselves from driving behaviour.

I mentioned in my report on Robotaxi that the cabin of a commercial autonomous driving vehicle is completely empty, and even the devices are highly customized, which is a possibility. Essentially, it is based on “I do whatever I want to do,” rather than others deciding what I do in the cabin.

Image Source: Unsplash

Mr. K:

Yes, following your argument, there won’t even be a screen at that time, and I can use my phone to do everything in the car. All interactions and needs can be completed through the phone. So why do I need speech?

In the past decade, it was commonplace to put a bracket on the car and attach the phone. Now, speech recognition has become a standard feature of new cars, and we may be able to put down our phones to some extent.

So, is speech recognition a transitional product or a trend?

If you ask me, speech recognition is a trend, and it is very clear. But the nature of speech recognition has changed and has become a fashionable thing for people. We need to distinguish clearly between fashion and trend.

Mr. Yu@GeekCar:

I think we can expand on the topic of fashion and trends.

Everyone can easily feel that some interior designers tend to “hide” wireless chargers in inconspicuous places, probably to make you touch your phone less while driving and use the car entertainment system more.

Mr. K:

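Let’s come back to that when we sum up later; it will make more sense after we’ve talked things through. In real-world use, users run into all kinds of situations, some of which the development team never anticipated when it planned its scenarios. If the voice assistant can only handle part of them, then whenever users hit a case it cannot handle, it looks like an idiot and fails to meet their needs. So making a voice assistant truly smart takes more than polishing the high-frequency scenarios; the more complex scenarios need to be explored and studied too. We call all car buyers “users”, but users are also highly segmented.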

For example, some drive trucks, some drive pickups. Some drive luxury cars or even ride in luxury cars, while others drive alone in mini cars for short commutes. Different groups have different demands for voice controls and different areas of focus.

So my second point is that good voice products actually require operations.

Let me give a simple example. Before, iFlytek had something similar to an intelligent speaker called Alpha Egg. Do you remember?

Mr. Yu @ GeekCar:

I remember the intelligent speaker resembling an early education device.

IFLYTEK Alpha Egg S

Mr.K:

Yes, this thing still sells very well, but not to adults. Who is the target audience? From young children to children in compulsory education.

Its core function is answering the questions children ask it every day, but there are bound to be times when it can’t answer, right?

All of these unanswered questions are sent back to the platform, where editors review them and tell the AI how to answer in the future.

For example, someone asked who Mr. Yu was at GeekCar and Alpha Egg didn’t answer today. Then, someone went to edit it in a few days, and everyone who asked this question would get the answer later.
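The loop Mr. K describes, in which missed questions flow back to a platform and an editor's answer then reaches everyone, can be sketched roughly as follows. This is a minimal Python illustration and not how iFlytek's platform is actually built: the `AnswerPlatform` class, its method names, and the placeholder answer text are all hypothetical.

```python
from collections import Counter


class AnswerPlatform:
    """Toy model of an operations backend for a voice assistant (hypothetical)."""

    def __init__(self):
        self.answers = {}            # curated question -> answer pairs
        self.unanswered = Counter()  # how often each unanswered question was asked

    @staticmethod
    def _normalize(question):
        # Crude normalization so trivially different phrasings share one entry.
        return " ".join(question.lower().strip(" ?!.").split())

    def ask(self, question):
        key = self._normalize(question)
        if key in self.answers:
            return self.answers[key]
        # No answer yet: log the miss so an operator can review it later.
        self.unanswered[key] += 1
        return None

    def review_queue(self):
        # Most frequently missed questions come first.
        return [question for question, _ in self.unanswered.most_common()]

    def add_answer(self, question, answer):
        # An operator supplies the answer; every later asker receives it.
        key = self._normalize(question)
        self.answers[key] = answer
        self.unanswered.pop(key, None)


platform = AnswerPlatform()
first_try = platform.ask("Who is Mr. Yu at GeekCar?")    # None, and the miss is logged
platform.add_answer("Who is Mr. Yu at GeekCar?", "placeholder answer")
second_try = platform.ask("who is mr. yu at geekcar")    # "placeholder answer"
```

Sorting the review queue by frequency is one simple way to make sure the questions most users stumble on get edited first.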

Mr. Yu @ GeekCar:

That reminds me of the line from the last article: “As many intelligences as there are humans.”

Mr.K:

Yes, that is what I want to say.

Of course, I know everyone is joking or self-mocking when they say that, but if it is taken seriously, it will be considered as erasing the value and hard work of technology workers who create algorithms.

In fact, the essence of voice operations is to train a voice that is like a child together with users. You tell it what is correct, it remembers it, and then it will tell everyone the next time it encounters it. In fact, this efficiency is not low, and it doesn’t require a particularly deep level of human intervention to complete. So, I want to refute the previous point made by Mr. K.

Voice is like a child. The more you teach it, the smarter it becomes and the more it can be used. So why is operations so important? Operations are actually teaching it, and later on, teams of dozens or even hundreds of people are working together to teach it how to do things.

The industry says that XPeng and NIO’s voice control is well made. Why? Because there are people behind the scenes handling these details.

Mr.Yu@GeekCar:

Sounds a lot like a character-raising game.

Mr.K:

Yes, that accurately describes it. The visible process includes recording, transcription, and semantic understanding. Each step has its own difficulties.

Semantics is especially demanding. A large group of people has to work on semantic understanding, much of it through manual labeling, and that group includes many linguists.

As you probably know, the leading speech companies have about a dozen linguists working with them.

Why? Because language is too complex, and it takes linguists to summarize the rules. If you rely solely on people handling cases one by one, the system won’t improve, and no rules will ever be distilled. It is the linguists’ job to discover those rules.

Let me give you a very easy-to-understand example. The sound “妈妈 (mama)” means “mother” all over the world, and this is the rule.

The linguists on the speech team discover and define these rules, and the programmers implement them.

Then there is the environment and the understanding of context, which is very interesting. For example, “麻个 (ma ge)” in the Hefei dialect means “tomorrow”, which is a local peculiarity. It is hard to interpret in isolation, but it can be handled well if semantic understanding is done well. That is why doing dialect speech systematically is so powerful.

Mr.Yu@GeekCar:

That reminds me of a joke: what does “卧槽 (wo cao)” mean when a Beijinger says it? Depending on the intonation, tone, and stress, is it cursing? Surprise? Disdain? It is very complicated.

Mr.K:

Yes, this example is very typical. Helping AI to understand these things is what the linguists in the team have to do.

Let’s continue. AI has understood the semantics, and what needs to be done next is to issue instructions. If speech wants to connect with the entire cabin, it needs to send instructions to all relevant components. The key problem in the middle is which controllers speech can access and which cannot. Just like some models will actively block the permission for the car computer to transmit instructions or access controllers for safety.

For example, speech can never access the steering wheel, gearbox, or brakes. You cannot tell it to “shift gears”, right? Anything related to safety is off limits to voice, which is basic logic and also why people say voice is not omnipotent. This is often why cockpit reviews find the voice function incomplete: voice cannot reach the components it ought to reach.
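The access question can be pictured as a simple allowlist check that runs before any instruction leaves the voice system. The Python sketch below is only an illustration under assumed names: the controller names, the two sets, and the `dispatch` function are hypothetical, and a real vehicle would define its own controllers and enforce the rule much deeper in its architecture.

```python
# Hypothetical controller names; a real vehicle defines its own.
VOICE_ALLOWED = {"windows", "seats", "climate", "media"}
SAFETY_CRITICAL = {"steering", "gearbox", "brakes"}


def dispatch(controller, action):
    """Forward a voice command only if the controller is open to voice."""
    if controller in SAFETY_CRITICAL:
        # Safety-related components are never reachable by voice.
        return f"refused: {controller} is safety-critical"
    if controller not in VOICE_ALLOWED:
        # Not dangerous, but this model has not opened the controller to voice.
        return f"refused: voice has no access to {controller}"
    return f"sent: {controller} -> {action}"


shift_result = dispatch("gearbox", "shift up")
climate_result = dispatch("climate", "set 22C")
```

The second refusal branch corresponds to the "incomplete" feeling in reviews: the request is harmless, but the controller was never exposed to the voice system.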

Let me give you another example. Nowadays, people have a fast pace of work and life, and many people like to take a nap in the car. Including new players, many cars have launched a nap mode early on, right? When I want to take a rest, I will tell the voice assistant to enter nap mode, rather than give it fragmented and cumbersome instructions such as closing the windows, reclining the seats, adjusting the air conditioning temperature, and playing soothing music step by step.

From this, we can see that the core purpose of voice is still to play the role of “assistant” and allocate the corresponding software and hardware resources, especially the resources in the cockpit. Some people say that intelligence is just putting a phone in the car. I know this is a joke, but even as a joke, this statement cannot withstand scrutiny.
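The nap mode example amounts to one phrase expanding into several component-level instructions. A minimal Python sketch of that "assistant allocates resources" idea follows. The `SCENES` table, the `expand` function, and the individual steps are assumptions made for illustration and are not taken from any automaker's implementation.

```python
# Hypothetical scene table: one spoken phrase maps to several component actions.
SCENES = {
    "nap mode": [
        ("windows", "close"),
        ("seats", "recline"),
        ("climate", "set comfortable temperature"),
        ("media", "play soothing music"),
    ],
}


def expand(utterance):
    """Return the (controller, action) steps for the first scene phrase found."""
    text = utterance.lower()
    for phrase, steps in SCENES.items():
        if phrase in text:
            return list(steps)
    # No known scene in the utterance: nothing to expand.
    return []


nap_steps = expand("Enter nap mode")
```

Here the user says one sentence and the assistant issues four instructions, which is the efficiency gain the example is meant to show.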

Image source: Unsplash

Mr. Yu @ GeekCar:

If that’s the case, why are the effects that voice can achieve uneven?

Mr. K:

There are many reasons. Let me give you a simple example.

A major reason why people think voice is not useful is that the supplier develops a standard version and then gives it to the automaker. The automaker does not invest in operations, and just installs it as a function on the car and then ignores it. “Anyway, I have it.” In fact, this is an irresponsible approach.

So what happens? A car was released two years ago with its voice assistant in a sorry state. Two years later, nothing has changed and nothing has improved. Leaving it unresponsive and unable to adapt actually misleads users.

Mr. Yu @ GeekCar:

Understood. What you describe reminds me of many websites that handle “official business.” Leaving aside the outdated design and clumsy structure, some of them require a particular browser version just to complete the process. Otherwise the site isn’t compatible, you get stuck at some step, and everything you have already filled in and uploaded is wasted.

Mr. K:

Right. If you look back a few years, some automakers had an interesting mindset: they treated voice as a trendy thing to have. “I can’t afford not to have what my competitors have.” It is like seeing someone dye their hair green and wanting to do the same, without asking why it’s green. In the automotive industry, you can see that Tesla has adopted an integrated body, and many domestic car manufacturers are starting to follow suit.

Indeed, this can improve productivity and save a lot of money on molds and welding costs. But everyone needs to think clearly about why they are doing this. Tesla’s intelligent driving is already very good, and it can avoid many minor collisions, so the integrated body is completely fine. Can our traditional cars do the same? Do they have this ability? If everyone is going to integrate the body, what if there is a collision? How will it be fixed? Will it be like Tesla, where repairing a car that costs three hundred thousand yuan after a collision costs more than two hundred thousand yuan?

Returning to the issue we are discussing, what is the core purpose of voice control?

Mr.Yu@GeekCar:

If you ask me this way, I think it is definitely not because you don’t have to use your hands. After all, humans have been driving for the past hundred years before voice control was invented.

Image Source: Unsplash

Mr.K:

This was a point that you didn’t talk about in your first dialogue, but we can discuss it today.

In short, I believe the core value is improved efficiency, in both interaction and command execution. Voice control is really a combo move, to borrow a term from gaming. Interactions that used to take several steps can now be done with a single sentence. That gain in efficiency is where voice really creates value.

Why do we need to be cautious and not hype up voice control too much? If you have to tell your onboard AI to turn the volume up or down, it’s faster and more accurate to just do it with your fingers. How do you quantify the command for the seat to go up or down with voice control? Physical buttons are still more convenient and intuitive, right?

At this point, you will once again link voice control with the so-called virtual image.

Mr.Yu@GeekCar:

Isn’t the virtual image of the onboard AI also fashionable? Even as someone who doesn’t use TikTok or Xiaohongshu, I often see pictures of NOMI, the one carefully dressed up by NIO car owners, circulating on social networks. It’s quite interesting.

Mr.K:

Then why do you think onboard AI needs to be visualized?

Mr.Yu@GeekCar:

I can quote the initial reasoning behind NOMI by NIO, which is to solve the awkwardness of people shouting at the air in the car.

Mr.K:

To be likable, I would say. Nowadays, cars come with AI assistants and virtual images. Car manufacturers spend millions of dollars on the design, and even allow QQ Show-style customization to make it more exquisite.

NOMI is an interesting design because it is made up of emoticons: simple, yet leaving plenty of room for imagination. People get the point quickly. As Guo Degang says in his crosstalk routines, it’s the “beauty of imagination.” It doesn’t need to be too specific. If you think it’s good-looking, then it’s good-looking to you.

Because people’s aesthetics are very finely divided. Some people like pink, and some people like blue. Some people prefer long faces, while others prefer round faces. The more specific something is, the more difficult it is to appeal to a larger audience. So, NOMI’s design is advantageous because it uses emoticons to express emotions, which corresponds to users naturally and easily.

For example, some people like the beautiful anime girl, while others like handsome men, and some like anthropomorphic non-human forms. Everyone’s aesthetics are very specific. However, when a very specific image appears in the car, it means that the mysterious feeling brought by the vague image must be lost, and the good impression of some people who do not like this image must also be abandoned.

That’s why I keep saying NOMI is cleverly designed because this design does not enter into the realm of highly specific aesthetic preferences. Users may not care what the voice assistant in the car looks like, but they know very clearly what they do not like.

I am just giving an example to show the choices car manufacturers make for voice assistants just to stay trendy. They are now combining virtual images and speech together, and even emphasizing virtual images. In reality, this is not necessary. The essence of voice is simple, efficient, and one-button direct access.

Mr.Yu@GeekCar:

What’s your opinion on showing off by using voice? Car manufacturers always have communication demands, and showing off is a powerful tool for creating topics.

Mr.K:

We just talked about trends. Showing off means that you are trendy, and I want to be more trendy than you, just like those people on the street who wear exaggerated clothes and modify their car exhaust pipes to make it spectacular.

Of course, my analogy may be a bit extreme, but the point is much the same. If speech is a tool that emphasizes efficiency and accuracy, shouldn’t we get the details and features right? Identifying which scenarios are truly needed in the car and refining them is more reliable than trying to cover every scenario at once. This is what you said last time when discussing pseudo-needs: small and sophisticated, not big and comprehensive.

For example, Beijing has Guijie and Hefei has Lejie. The pronunciations are similar and the characters are hard to write, but voice navigation does not confuse the two, right? It is this constant refinement of details, making high-frequency scenarios better and meeting the needs of most people, that brings in more and more users, which in turn means the back-end operations staff receive more and more cases.

The more users use it, the more problems the operations staff encounter, and the more they solve and optimize, so the experience keeps getting better. The better the experience, the more users trust it and the more they use it. This is a positive cycle that builds up slowly.

Conversely, suppose my intelligent speech covers every scenario but refines none of them. Users try it twice, and if it doesn’t seem intelligent, they never use it again. The back end receives no data, and operations becomes out of the question.

Mr.Yu@GeekCar:

Don’t you think that the reason why intelligent voice is so popular in China is directly related to the competition between car companies and their industry anxiety about product power?

Mr.K:

I recently read an article that said Chinese automakers have gone crazy. How crazy? Instead of launching one car every few years, they launch several similarly positioned cars every year. Some Chinese models priced at 60,000 to 70,000 yuan already match the electronic equipment of foreign mid-range cars. They integrate everything: L2+ assisted driving, speech, and a large screen.

Once all of this is standard, young users will think: your car doesn’t even have lane change assistance, or even speech? Isn’t that ridiculous?

The post-90s and even post-00s generations are about to become the main car-buying group, and you will find that their focus is different from that of earlier users.

Mr.Yu@GeekCar:

When you mention this, I remember the term “thousand-yuan phone”, which is rarely used now. With smartphones, we often talk about high-end brands and flagship models. For smartphones under RMB 2,000, consumers’ expectations are not as demanding as for high-end models. Such a phone does not need strong performance or good design, only the necessary features and reliable operation.

As time goes on and hardware performance improves, brands’ understanding of user experience also improves. Today’s smartphones under RMB 1,000 offer experiences that are not much worse than flagship models from several generations ago.

Mr.K:

Therefore, Chinese automakers and foreign car brands have already taken different paths. Many foreign companies see the Chinese market only as a place to make money, rather than a place to create demand.

The large user base in China means that product forms are born and developed quickly, and within the framework of laws and regulations, many things are easier to experiment with. This is why digitalization is a tool for Chinese automakers to make a comeback, which cannot be denied.

OK, so everyone rolls up their sleeves and competes. One domestic brand launches a function or a product form, another domestic automaker follows, then three more automakers similar to that brand, and then even more across the industry.

Now our phones can navigate, and our cars can navigate too. What did navigation look like before? If your car had I-Call, you could call a human agent at the call center and ask them to set up navigation for you or recommend a restaurant. Of course, E-Call and B-Call, which provide important and emergency remote support, are different in nature, and not all brands can offer them.

Mr.Yu@GeekCar:

To a certain extent, users today are the beneficiaries of the industry’s consolidation?

Mr.K:

Yes, looking back in time, many years ago, SAIC Roewe first equipped a certain car with voice recognition, and soon Chery, JAC, Changan, and GAC followed. When the 4G communication era came, data traffic became super cheap, and then we started using online voice recognition. With access to cloud computing power, higher recognition rates, better effects, and richer experiences were all possible.

Looking back at this development process, we can see that the value of voice recognition is, first, to solve users’ needs, and second, to improve their driving experience. Only when these are achieved, whether as an interaction form or as a product, can it be widely adopted.

Therefore, now that voice recognition is established across the industry, we are seeing things like multiple sound zones, “see it and say it” control, and emotionally rich voice technology.

Mr.Yu@GeekCar:

So to sum up, what we call “intelligent voice” is essentially like a mouse, keyboard, or game controller, a means of communication between users and systems, rather than something significantly more advanced.

Of course, I do not underestimate the technical complexity of voice recognition. As with computer vision, it is one of the most complex aspects of artificial intelligence.

However, it is important to be cautious about excessive hype about certain types of interactions or conveying unrealistic expectations to users. There is no such thing as unconditional love, nor is there acceptance without adequate justification.

Mr.K:

Yes, we see the contradiction between showcasing and basic capabilities, which is actually a game between trendiness and practicality.

When it comes to showcasing and capabilities, we should always choose capabilities over showcasing. Because showcasing does not generate any value, whereas capabilities do. No matter how exquisite the packaging of tea is, if the quality of the tea leaves is poor, you cannot sell it at a high price. Even if you sold it, users would only feel cheated. Therefore, at the end of the day, it is all about exploring how to achieve practical value.

In conclusion

Thank you to everyone who has read this far. To be honest, this content is not short.

To my surprise, a phrase of self-deprecation that has been widely spread, “As many intelligences as there are humans”, has been challenged.

In today’s world, we are too accustomed to being moved by stories. However, as discussed in this dialogue, users are not so forgiving. No one cares about the story behind the product. So, after telling so many stories, did they really move the users or just ourselves?

Is the discussion with Mr.K about trends, practicality, and value closer to the essence of this topic?

I don’t have the answers to these questions, but Mr.K in the next discussion may have different perspectives based on his experiences and thoughts.

This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.