Author | Zhu Shiyun
Editor | Wang Lingfang
Urban automated driving is more challenging than imagined. Today, the distance between two human takeovers is generally just over one kilometer; fully automated driving is widely expected to arrive only around 2030; even Tesla’s current valuation does not price in FSD (Full Self-Driving).
Despite this, Tesla (overseas), XPeng Motors, and WEY have all announced mass production of urban driving-assistance functions.
Is it the qualifier “assistance” that makes them so fearless?
On September 13th, HAOMO.AI, the autonomous driving technology company under Great Wall Motors and the technology supplier behind WEY’s City NOH, presented at its AI DAY an autonomous driving 3.0 technology roadmap built on big data and big models, laying out a technically feasible path to mass-producing urban automated driving capability.
The 3.0 Era of Big Models and Big Data
“We believe the development of autonomous driving technology over the past decade can be divided into three stages: the hardware-driven 1.0 era, the software-driven 2.0 era, and the data-driven 3.0 era, which is just beginning and will keep evolving,” said Gu Weihao, CEO of HAOMO.AI, at HAOMO AI DAY. The defining features of the 3.0 era, he said, are big models and big data.
A so-called “big model” is an artificial intelligence model with billions, trillions, or even tens of trillions of parameters; it supports more complex functions, delivers higher output precision and accuracy, is capable of self-supervised learning, and generalizes strongly across tasks.
Until recently, perception in automated driving mostly followed a small-model pattern: each sensor collected data independently and fed small models tailored to specific tasks (pedestrian recognition, vehicle recognition, lane recognition), with fusion happening only at the result level. By 2020, Transformer-based large models built on the attention mechanism dominated NLP and began making major breakthroughs in CV. In 2021, Tesla showed at its AI DAY a BEV (Bird’s Eye View) perception space produced by a Transformer-based algorithm, kicking off the adoption of large models in mass-produced autonomous driving.
A large model instead unifies the raw data from multiple sensors, even across modalities, before outputting perception results. Its capacity to process massive data is also what makes it possible for autonomous driving systems to handle extremely complex urban road conditions.
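As a rough illustration of this fusion idea, the sketch below (PyTorch, with illustrative names and sizes; not HAOMO’s or Tesla’s actual code) shows how learnable BEV queries can cross-attend to flattened features from several cameras at once, so that fusion happens before, not after, the perception output:

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Toy early-fusion module: BEV grid queries attend to all camera features."""
    def __init__(self, num_bev_queries=2500, dim=256, heads=8):
        super().__init__()
        # One learnable query per BEV grid cell (e.g. a 50 x 50 grid).
        self.bev_queries = nn.Parameter(torch.randn(num_bev_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (batch, num_cams * tokens_per_cam, dim) -- image features
        # from all cameras, flattened and already projected to `dim`.
        b = cam_feats.size(0)
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        # Each BEV cell gathers evidence from every camera view at once,
        # the step that separate per-task small models never perform.
        bev, _ = self.attn(q, cam_feats, cam_feats)
        return bev  # (batch, num_bev_queries, dim)

feats = torch.randn(2, 6 * 196, 256)      # e.g. 6 cameras, 14x14 tokens each
print(BEVCrossAttention()(feats).shape)   # torch.Size([2, 2500, 256])
```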
However, the large model is not perfect.
On the one hand, a large model that delivers high precision and accuracy requires an enormous amount of training data, and that data must be sufficiently diverse.
Gu Weihao stated: “In terms of training data scale, autonomous driving mileage should reach at least 100 million kilometers; in terms of diversity, sensor data of different types, resolutions, angles, and scenes are all highly valuable for training large models.”
“So we have reason to believe that assisted driving is the road to autonomous driving. Only with assisted driving systems pre-installed at scale can we collect data of sufficient scale and diversity.”
On the other hand, attention-based large models spend a great deal of “attention” on weak associations (parameters with little relevance to the desired output). As a result, a Transformer can require 100 times the computing power of a CNN while only about 7% of that compute is effective (highly relevant to the desired result), which drives up training costs and complicates deployment, especially on vehicles with limited compute and power budgets.
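To see why so little of the compute ends up “effective,” consider a toy measurement (purely illustrative, not HAOMO’s analysis): softmax attention gives every key some weight, so when query-key scores are not sharply peaked, most of the attention mass, and hence most of the quadratic compute, goes to weakly associated tokens:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(1024, 1024)             # random query-key logits
weights = torch.softmax(scores, dim=-1)      # full dense attention map

# How much attention mass lands on the top 7% of keys for each query?
k = int(0.07 * weights.size(-1))
topk_mass = weights.topk(k, dim=-1).values.sum(-1).mean().item()
print(f"top 7% of keys hold {topk_mass:.0%} of the attention mass")
# The rest is spread thinly over weak associations -- compute that sparse
# attention and similar tricks try to reclaim on power-limited vehicle chips.
```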
“So, under the trend of large models, we believe that three problems need to be addressed: how to reduce the cost of autonomous driving through low-carbon supercomputing, how to improve the calculation efficiency of vehicle-side models, and how to improve the computing efficiency of vehicle-side chips,” said Gu Weihao.
According to the company, HAOMO’s assisted driving mileage has exceeded 17 million kilometers, and its data intelligence system MANA has accumulated more than 310,000 hours of learning, the equivalent of 40,000 years of virtual driving experience. Its autonomous delivery vehicles for last-mile logistics have also delivered nearly 90,000 orders to nearby users.
HAOMO’s Large-Model Training Method
Even if 100 million kilometers of autonomous driving data can be collected, how do you turn it into qualified teaching material for neural networks and train large models to mass-production standards within acceptable time and cost?
Unlike small models, which need preset learning objectives and supervised learning, large models are capable of self-supervised learning. This reduces the need for data annotation and, to some extent, solves the high cost, long cycle, and limited accuracy of manual labeling.
According to Gu Weihao, HAOMO chose to unify the backbone network across all perception tasks, train it on unlabeled data, and then lock it, while the remaining parts of the model were trained on labeled samples. “Experiments show that, compared with using labeled samples alone, this method improves training efficiency by more than 3 times, with a marked gain in accuracy.”
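A minimal sketch of that recipe in PyTorch (the architecture and task heads are assumptions for illustration; HAOMO has not published its code): pre-train a shared backbone on unlabeled data, lock it, then fit lightweight task heads on labeled samples:

```python
import torch
import torch.nn as nn

# Shared perception backbone used by every task.
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

# Stage 1 (unlabeled data): a self-supervised objective such as contrastive
# learning or masked reconstruction would update `backbone` here -- omitted.

# Stage 2: lock the backbone so labeled data only trains the task heads.
for p in backbone.parameters():
    p.requires_grad = False

heads = nn.ModuleDict({
    "lane": nn.Linear(64, 4),        # hypothetical lane-attribute head
    "vehicle": nn.Linear(64, 10),    # hypothetical vehicle-class head
})
opt = torch.optim.Adam(heads.parameters(), lr=1e-3)

x, y = torch.randn(8, 3, 64, 64), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(heads["vehicle"](backbone(x)), y)
loss.backward()   # gradients flow only into the heads
opt.step()
```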
The new challenge, however, is that once training data reaches tens of millions of kilometers, the model’s sensitivity to new scenarios declines and it runs into catastrophic forgetting: training on new datasets erases knowledge learned from old data, so accuracy measured on the old data drops sharply.
To address this, HAOMO built an incremental learning platform. During training, the new model’s outputs are required to stay as consistent as possible with the old model’s while fitting the new data as well as possible. “Compared with the conventional approach of retraining on the full dataset, we can reach the same accuracy with more than 80% less compute, and convergence is more than 6 times faster.”
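The consistency requirement described here amounts to distilling the old model into the new one while the new data is being fitted. A hedged sketch (loss form and weighting are illustrative assumptions):

```python
import copy
import torch
import torch.nn.functional as F

def incremental_step(new_model, old_model, x_new, y_new, opt, alpha=0.5):
    """One update that fits new data while staying close to the old model."""
    old_model.eval()
    with torch.no_grad():
        old_out = old_model(x_new)              # frozen "memory" of old knowledge
    new_out = new_model(x_new)
    fit_loss = F.cross_entropy(new_out, y_new)  # fit the new data well
    keep_loss = F.mse_loss(new_out, old_out)    # keep outputs consistent
    loss = fit_loss + alpha * keep_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = torch.nn.Linear(16, 4)
old = copy.deepcopy(model)          # snapshot taken before new-data training
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
incremental_step(model, old, x, y, opt)
```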
Once training matures, it is time for large models to work the magic of the Transformer architecture. Urban scenes are more complicated and changeable, and the high-precision-map approach relied on in highway scenarios no longer applies. An autonomous driving system has to perceive and understand the road ahead the way a human driver does in order to make driving decisions. The ability to build a driving space that spans a window of time has therefore become a prerequisite for deploying autonomous driving in cities.
In perception, HAOMO has applied a time-sequence Transformer that integrates frames of information across a period of time to eliminate jitter and keep perception results continuously stable, then performs real-time virtual mapping in BEV space within that spatiotemporal context. This yields more accurate and stable perception of lane lines and better judgment of obstacles. With this real-time perception capability, HAOMO can already handle blurred road markings, complex intersections, and roundabouts, needing only the relatively reliable topological information of an ordinary navigation map, much as a human driver does.
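The sketch below shows the basic mechanics of such time-sequence fusion (layout and dimensions are assumptions, not HAOMO’s design): self-attention over a short history of per-cell BEV features smooths single-frame jitter into a stable current estimate:

```python
import torch
import torch.nn as nn

frames, cells, dim = 8, 2500, 256      # 8-frame history over a 50x50 BEV grid
temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

# Treat each BEV cell as a batch element and its frame history as a sequence.
bev_history = torch.randn(cells, frames, dim)
fused, _ = temporal_attn(bev_history, bev_history, bev_history)
current = fused[:, -1]                 # jitter-smoothed latest frame
print(current.shape)                   # torch.Size([2500, 256])
```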
Development of a Human-like Decision Model
Putting large perception models into practice has clearly helped HAOMO accumulate ideas and skills for applying AI to decision-making. In the pre-3.0 era, autonomous driving decision systems consisted mainly of hand-written logic code, built from rigid rules, which left vehicles unable to “adapt to circumstances” on the road. Such rigid decision-making can work on relatively simple highways, but in urban traffic it badly hurts both traffic efficiency and user experience.
HAOMO has drawn on multimodal large-model methods to better solve the cognition problem. Specifically, it mines large volumes of human driving behavior to build its own autonomous driving scenario library. From the actual driving behavior of massive numbers of drivers in typical scenarios, the company constructs task prompts and trains a large pre-trained driving decision model based on spatiotemporal attention, achieving controllable and interpretable autonomous driving decisions. “In complex urban environments, HAOMO’s NOH can not only choose the safest route for the actual situation but also learn human driving traits to produce the most reasonable behavior sequences and parameters, giving the driver a more natural experience,” said Gu Weihao.
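HAOMO has not published this decision model, but the ingredients it names (task prompts plus attention over scene information) could be wired together roughly as in this toy sketch, where every name and size is hypothetical:

```python
import torch
import torch.nn as nn

dim, n_behaviors = 128, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2)
prompts = nn.Embedding(10, dim)     # 10 hypothetical task prompts
head = nn.Linear(dim, n_behaviors)  # scores over discrete driving behaviors

scene = torch.randn(1, 64, dim)                  # 64 scene tokens (agents, lanes, ...)
task = prompts(torch.tensor([[3]]))              # e.g. an "unprotected left turn" prompt
out = encoder(torch.cat([task, scene], dim=1))   # prompt conditions the scene
print(head(out[:, 0]).shape)                     # torch.Size([1, 16])
```

Conditioning on an explicit prompt token is one way to keep the decision controllable and interpretable: the same scene can be decoded under different task intents.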
City Autonomous Driving Enhancement
With the large-model training method and runtime working mode in place, HAOMO has begun adding the series of “new skills” needed to productize urban autonomous driving. Beyond interpreting urban traffic-control systems such as traffic lights, HAOMO is upgrading the vehicle-side perception system with dedicated recognition of other vehicles’ signal lights, such as brake lights and turn signals, so the vehicle can better predict the motion intentions of traffic participants.
To keep simulation training from being invalidated by deviations from the real environment, HAOMO has worked with Alibaba and the Deqing government to record actual traffic at intersections using roadside devices, then import the data into the simulation engine via a log2world method to serve as the simulation environment for debugging and validating the autonomous driving model on intersection scenarios. “Of course, most recordings are highly repetitive. We use traffic environment entropy to score the value of each scene and select high-value scenes as simulation test cases, which greatly improves the pass rate of the whole product,” said Gu Weihao.
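HAOMO has not disclosed how traffic environment entropy is computed; one plausible reading (an assumption for illustration) is Shannon entropy over the distribution of events observed in a recorded scene, which ranks diverse intersections above repetitive ones:

```python
import math
from collections import Counter

def scene_entropy(events):
    """Shannon entropy (bits) of the event mix observed in one recording."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

routine = ["car_straight"] * 95 + ["car_left"] * 5
busy = ["car_straight", "car_left", "pedestrian_cross", "cyclist",
        "bus_right", "car_u_turn"] * 10

print(scene_entropy(routine))  # low entropy: repetitive, low test value
print(scene_entropy(busy))     # high entropy: diverse, worth simulating
```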
In addition, HAOMO officially announced the first supercomputing center built by a Chinese autonomous driving company, the HAOMO Supercomputing Center, designed to support large models with billions of parameters and training sets of a million clips while cutting overall training costs by a factor of 200.
At the event, HAOMO chairman Zhang Kai announced the company’s “five rules for winning the second half of intelligent driving”: always put safety first in developing intelligent driving products; make product experience king; iterate rapidly, driven by data from users’ real scenarios; integrate perceptual intelligence with cognitive intelligence; and empower customers with an open mind to advance the industry together.
Reference:
“What are big models? Superlarge models? Foundation Models?” by ZOMI on Zhihu.
This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.