Exploring the Evolutionary Path of Multi-Modal Interaction in the Intelligent Cockpit: From the Perspective of “Third Space” Development

Author | Zhang Desai, Huang Zhen, Gao Ziwei

Despite chip shortages, pandemic disruptions, rising raw-material prices, and various external geopolitical factors, the Chinese passenger car market staged a V-shaped recovery in 2022. More encouragingly, China’s domestic brands and many new car makers are keeping pace with changes in Chinese consumer demand and, against the backdrop of industrial convergence, are increasingly favored by young consumers thanks to strong technological features such as intelligent cockpits and autonomous driving.

In our research visits, we found that multi-modal interaction is becoming increasingly important to many automakers as they plan new models. This is consistent with our earlier finding that major automakers and technology companies develop intelligent cockpits around the “Third Space” along three general paths: friendly interaction and service, space and personalization, and interconnectivity.

Building on that research, this article analyzes the significance, content, current status, and challenges of multi-modal interaction in intelligent cockpit development, and outlines the likely evolution of cockpit interaction given the current state of intelligent cockpits and autonomous driving.

Advantages of Multi-Modal Interaction vs. Single Modality

“Modality” is a biological concept proposed by the German physiologist Helmholtz: the channel through which an organism receives information via its sensory organs and experience. We believe human-machine interaction in the cockpit is realized mainly through four modalities: visual, auditory, tactile, and olfactory. So what advantages does using multiple modalities offer over a single modality?

On the one hand, multi-modal interaction can improve interaction accuracy. When voice interaction is used alone, for example, noise, echo, and misrecognition are unavoidable; by supplementing it with other sensor information, such as images, eye movements, facial expressions, blood pressure, and heart rate, multiple information sources can be fused to reduce the error rate.
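To make the fusion idea concrete, below is a minimal late-fusion sketch in Python: each modality scores candidate intents independently, and the scores are combined with per-modality reliability weights. The function name, intents, and weights are all illustrative assumptions, not any vendor’s actual system.

```python
# A minimal late-fusion sketch: each modality scores candidate intents
# independently, and the scores are combined with per-modality
# reliability weights. Names and weights are illustrative assumptions.

def fuse_intent_scores(modality_scores, weights):
    """Combine per-modality intent scores into one decision.

    modality_scores: dict like {"voice": {"open_window": 0.6, ...}, ...}
    weights: per-modality reliability, e.g. lowered for voice when
             the measured cabin noise level is high.
    """
    fused = {}
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 1.0)
        for intent, score in scores.items():
            fused[intent] = fused.get(intent, 0.0) + w * score
    return max(fused, key=fused.get)

# Example: noisy speech alone is nearly a tie, but gaze and gesture
# cues (e.g. the driver glancing at the window switch) tip the decision.
scores = {
    "voice":   {"open_window": 0.45, "open_sunroof": 0.55},
    "gaze":    {"open_window": 0.80, "open_sunroof": 0.20},
    "gesture": {"open_window": 0.70, "open_sunroof": 0.30},
}
weights = {"voice": 0.5, "gaze": 0.3, "gesture": 0.2}  # noise-adjusted
print(fuse_intent_scores(scores, weights))  # -> "open_window"
```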

On the other hand, from a usability perspective, complementary modalities can deliver the needed information more conveniently and efficiently by combining the strengths of each modality. For example, setting a navigation destination through traditional visual and tactile interaction requires touch buttons, text input, screen swipes, or knob rotation; with voice interaction integrated, specifying the destination by voice and confirming it from on-screen options significantly reduces the time needed to complete the setting.

The Significance of Multi-Modal Interaction in the Cockpit: Functionality and High-Tech Aesthetics

For end consumers, the significance of multi-modal interaction in the cockpit can be summarized in two aspects: functionality and high-tech aesthetics.

Functionality refers to the fact that, in the fast-growing new energy vehicle segment, traditional interaction methods such as instrument panels cannot handle the diverse, information-rich display needs of power output, range, and battery status, while active safety displays, navigation, online entertainment, and smart services place further new demands on interaction technology.

High-tech aesthetics refers to consumers’ subjective perception of innovative cockpit functions, which strongly shapes how new-generation consumers perceive the whole-vehicle experience and make purchase decisions. Combining multiple modalities can markedly enhance the sense of advanced technology across cockpit scenarios and improve the travel experience of drivers and passengers.

For automobile manufacturers, parts suppliers, and technology companies, expanding from manufacturing into higher value-added services, innovating business models, and creating more added value has been a major direction the industry has explored in recent years, for example, developing new profit points through “hardware standardization + paid OTA upgrades”. Thanks to its functional richness and technical depth, multi-modal interaction in the cockpit is expected to become an effective lever for building competitiveness in this direction.

Current Multi-Modal Interaction Content in the Cockpit: Safety Information and Entertainment Information

At the current stage of intelligent driving, cockpit interaction content cannot be treated as a single category; it divides into safe-driving information and entertainment information. Safe-driving information includes the vehicle condition, road condition, and environmental information drivers need to complete the driving task. Entertainment information includes content such as movies and games for non-driving occupants, or entertainment interactions carried out by the driver when not driving. Surveying the cockpit designs of current mainstream automakers for these two types of content, we summarize in the following sections the current status and challenges of each modality in cockpit multi-modal interaction.

3.1 The Interaction Technology Path for Safe-Driving Information: Challenges and Solutions

Several technology paths that combine modalities have emerged to address the challenges of safe-driving information, including:

A) Visual + Voice

In practice, the running state of voice interaction is usually invisible: without fusion with another modality, it is hard for the user to tell what state an issued command is in. NIO’s Nomi, for example, supplements voice with an anthropomorphic expressive avatar that shows feedback such as listening, happiness, or a thumbs up, strengthening the visual connection between driver and car and increasing the driver’s sense of companionship and trust in voice interaction.

B) Visual + Tactile

When the lane-keeping alert of the XPeng P7 is active, the steering wheel vibrates to warn the driver that the car is crossing the lane line, reducing how often the driver must glance down at the dashboard while driving.

C) Voice + Gesture

Leapmotor’s Shadow gesture control supports three dynamic gestures (left-right swipe, up-down swipe, and forward-backward push) as well as five static gestures, covering common functions such as confirmation, selfies, answering or rejecting calls, and play/pause. Combined with the four-sound-zone voice system, it realizes a voice + gesture interaction mode, as sketched below.
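To illustrate the pattern (not Leapmotor’s actual API), here is a minimal sketch of voice + gesture dispatch: a static-gesture lookup table combined with a sound-zone ID, so the same command can be resolved per seat. All gesture and zone names are assumptions for illustration.

```python
# A minimal sketch of gesture dispatch scoped by sound zone.
# Gesture names, actions, and zone IDs are illustrative assumptions.

GESTURE_ACTIONS = {
    "ok_sign":   "confirm",
    "v_sign":    "take_selfie",
    "palm_up":   "answer_call",
    "palm_down": "reject_call",
    "fist":      "toggle_play_pause",
}

def dispatch(gesture: str, sound_zone: str) -> str:
    """Resolve a static gesture into an action for the active sound zone."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is None:
        return "ignored: unknown gesture"
    # The sound zone (e.g. which mic zone heard the wake word) scopes
    # the action to one seat, avoiding cross-seat conflicts.
    return f"{action} @ {sound_zone}"

print(dispatch("fist", "front_left"))   # toggle_play_pause @ front_left
print(dispatch("wave", "rear_right"))   # ignored: unknown gesture
```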

3.2 Analysis of Interaction Modalities for Safe-Driving Information: Safety and Accuracy

At the current level of intelligent driving, the driver’s hand-brain-eye resources must stay focused on acquiring safe-driving information. The visual modality will remain the main channel for vehicle condition, road condition, and environmental information, with other modalities as supplements. More modalities are therefore not necessarily better for safe-driving information; safety and accuracy of the interaction design must come first. For example, the driver’s line of sight cannot leave the direction of travel for long, and the hands need to stay on the steering wheel. The bottleneck of cockpit multi-modal interaction for safe-driving information is the conflict between increasingly diverse tasks and the limited hand-brain-eye resources available: drivers need to complete complex tasks while driving safely, such as checking the remaining range, finding nearby charging piles, and setting a charging pile as the navigation destination.

Based on this analysis, we believe the visual + auditory path is currently the preferred path for safe-driving information interaction. As visual-modality HUD technology, electronic rearview mirrors, and DMS technology develop, drivers will be able to obtain more driving-related information without lowering their heads. Voice-interaction features that do not consume driver resources, such as “what you see is what you can say”, sound source localization, and continuous dialogue, are already widely used in new models released this year. According to research data from EO Intelligence, the accuracy of in-vehicle voice recognition rose from 60% in 2011 to 98% in 2021. Needs such as phone calls and music while driving can be met through the voice modality without excessively occupying the driver’s visual attention, and the voice system can verify identity through voiceprint recognition plus visual-modality perception to improve interaction security. By contrast, interaction modalities such as eye tracking and gestures are still exploratory and relatively single-purpose.
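As an illustration of one of these building blocks, here is a minimal sketch of sound source localization by time difference of arrival (TDOA) between two cabin microphones, the basic idea behind attributing speech to a seat. Production systems use microphone arrays and more robust estimators such as GCC-PHAT; the mic spacing, sample rate, and sign convention below are illustrative assumptions.

```python
# A minimal two-microphone TDOA sketch: cross-correlate the channels,
# convert the best-aligning lag to a delay, and map the delay to a
# bearing under a far-field approximation. Constants are illustrative.

import numpy as np

SAMPLE_RATE = 16_000      # Hz
MIC_SPACING = 0.3         # metres between the two mics
SPEED_OF_SOUND = 343.0    # m/s

def estimate_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Bearing in degrees; 0 = straight ahead, negative = toward left mic."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    delay = lag / SAMPLE_RATE  # t_left - t_right, in seconds
    sin_theta = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Synthetic check: the right channel hears the same burst 5 samples
# later, so the source should resolve to the left side (negative angle).
rng = np.random.default_rng(0)
burst = rng.standard_normal(1024)
left = np.concatenate([burst, np.zeros(5)])
right = np.concatenate([np.zeros(5), burst])
print(f"estimated bearing: {estimate_bearing(left, right):.1f} degrees")
```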

3.3 Interaction Technology Path for Cabin Entertainment Information

Regarding the challenges of entertainment information, the current modality-combination technology paths include: A) Visual + Auditory; B) Visual + Auditory + Tactile; and C) Visual + Auditory + Tactile + Olfactory.

A. Visual + Auditory

Take the AITO M7 as an example: through collaboration with a KTV music-library app, combined with the cabin’s audio and video hardware, it can turn the cockpit into a “mobile karaoke room”.

B. Visual + Auditory + Tactile

Take the Li Auto L9 as an example: its co-pilot screen and rear-seat entertainment screen support projection from external devices, connecting directly to a Switch, mobile phones, and tablets and turning the cabin into a “mobile gaming space”.

C. Visual + Auditory + Tactile + Olfactory

Take the XPeng G9 as an example: it creates a “5D Music Cockpit” through multiple screens linked with ambient lighting for the visual modality, a four-sound-zone speech assistant for the auditory modality, music-rhythm seats for the tactile modality, and a fragrance system for the olfactory modality.

3.4 Analysis of the Entertainment Information Interaction Path: Rich Multisensory Experience

In the fast-paced mobile internet era, end users have grown accustomed to fragmented entertainment, that is, real-time, personalized interactive experiences. In cabin entertainment scenarios, this habit shapes users’ expectations of multi-modal interaction as well. Against this background, and unlike the high accuracy and efficiency demanded of safe-driving information, consumers care more about a rich entertainment experience in the cabin. According to IHS Markit survey results, the level of cockpit technology configuration has become the second most important purchase factor, after safety configuration, for the new generation of consumers who grew up with smartphones, even outweighing traditional factors such as power, space, and price.

We therefore believe visual + auditory + tactile + olfactory is currently the preferred path for entertainment information interaction. The visual modality can add new interaction scenarios through clearer display technology (e.g., high-resolution screens, linked screens, and projection); the auditory modality can offer personalized interaction for passengers in different seats through sound source localization; in the tactile modality, waveguide ultrasonic haptic technology has matured, so touch is no longer limited to screens; and the olfactory modality offers personalized scents and scent algorithms. Together these channels highlight the intelligence, technology, and personalization of cabin entertainment.

Prospects for Multi-Modal Interaction in the Cabin

In the previous section, we divided cabin interaction into safe-driving information and entertainment information and analyzed each separately.

First, for safe-driving information interaction, we believe safety and accuracy of the interaction design should be the top priority; therefore, the visual + auditory path is currently the preferred one.

Second, for entertainment information interaction, consumers’ attention centers on a rich multisensory experience; our analysis therefore suggests that visual + auditory + tactile + olfactory is currently the preferred path.

We should be aware that cockpit multi-modal interaction is still at an early stage. Today, interactions in each modality are usually initiated by the driver: the driver wakes the system and describes a need, and the system responds through command input, computation, and feedback, a passive process consisting mainly of Yes/No exchanges.

We therefore expect the further development of multi-modal interaction to unfold along two lines: the judgment and execution of ambiguous intent, and the evolution from passive to active interaction.

As multi-modal interaction develops, multi-modal perception, identifying drivers and passengers through facial expressions, eye movements, and body posture via the visual-modality DMS, combining speech-intonation signals in the auditory modality, and incorporating back-pressure distribution and heart-rate signals from seat sensors in the tactile modality, can serve two purposes. First, it can judge and execute ambiguous user intents such as “play some upbeat music” or “adjust the seat to be more comfortable”. Second, by combining its understanding of the occupants with the current situation, the cockpit can “perceive and understand” fatigue or negative emotions and proactively launch functions such as music playback and seat vibration to relieve them, turning the interaction from passive to active.
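To illustrate the passive-to-active shift (not any automaker’s actual implementation), here is a minimal rule-based sketch: multi-modal cues are blended into a fatigue estimate, and comfort functions launch proactively once it crosses assumed thresholds. All signal names, weights, and thresholds are illustrative.

```python
# A minimal sketch of active interaction: fuse multi-modal cues into a
# fatigue estimate and trigger comfort functions without a wake word.
# Signal names, weights, and thresholds are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CabinPerception:
    eyelid_closure: float  # 0..1, from the DMS camera (visual)
    yawn_rate: float       # yawns per minute (visual)
    voice_energy: float    # 0..1, liveliness of recent speech (auditory)
    heart_rate: int        # bpm, from a seat sensor (tactile)

def fatigue_score(p: CabinPerception) -> float:
    """Blend cues into a single 0..1 fatigue estimate (weights assumed)."""
    score = (0.4 * p.eyelid_closure
             + 0.3 * min(p.yawn_rate / 3.0, 1.0)
             + 0.2 * (1.0 - p.voice_energy)
             + 0.1 * (1.0 if p.heart_rate < 55 else 0.0))
    return min(score, 1.0)

def proactive_actions(p: CabinPerception) -> list[str]:
    """Decide, without being asked, which comfort functions to launch."""
    score = fatigue_score(p)
    if score > 0.7:
        return ["suggest_rest_stop", "open_fresh_air", "seat_vibration"]
    if score > 0.4:
        return ["play_upbeat_music", "lower_cabin_temperature"]
    return []

drowsy = CabinPerception(eyelid_closure=0.8, yawn_rate=4.0,
                         voice_energy=0.1, heart_rate=52)
print(proactive_actions(drowsy))  # escalated response for high fatigue
```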

We believe this kind of development will give users a richer experience, further raise the intelligence of the cockpit, and bring it closer to the “third space” concept. The resulting changes in products and business models deserve close attention and anticipation.

(The authors work for the Faurecia (China) Automotive Seating Business.)

This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.