What are the challenges of urban navigation intelligent driving? Written on the occasion of the release of the XPeng and Huawei-ARCFOX urban NOA features.

What is Urban Navigation Intelligent Driving?

Image 1 The debut of Huawei-ARCFOX navigation intelligent driving at the 2021 Shanghai Auto Show, with flexible, continuous maneuvers under extremely complex road conditions. Source: an earlier article from "Garage 42"

In this article, we discuss urban navigation intelligent driving, which refers to point-to-point intelligent assisted driving within a specific city area (generally within the scope of high-precision maps or within the geographic fence of a versioned ADS map).

Editor’s note: Recently, many companies have begun to try to get rid of their reliance on high-precision maps in urban navigation intelligent driving solutions, that is, the so-called “heavy perception, light map” approach.

Compared with the simple lane keeping and car following of basic systems such as Tesla’s Autopilot (similar to XPeng’s LCC and Huawei-ARCFOX’s ICA), which do not include point-to-point navigation and cannot handle bends or traffic lights, urban navigation intelligent driving requires human intervention less often. It provides a smoother, more flexible, and more coherent overall experience, one that is closer to what ordinary consumers understand as “autonomous driving” (see Image 1).

High-precision maps / ADAS maps can provide a huge number of rich layers and many kinds of information. From the perspective of what ordinary consumers can actually perceive, the most significant changes in urban navigation driving assistance can be summarized in the following four aspects:

For the purpose of description, “ego” is used to refer to a vehicle equipped with urban navigation intelligent driving features.

Image 2 Association between traffic lights and lanes

Figure 3 HD map with massive amounts of information

For this feature, the implementation technologies of the various companies are similar: perception (including object detection, tracking, and multi-sensor fusion) and prediction use the deep learning technology stack that became popular after 2014, while localization, control, and HD maps evolved from the robotics technology stack. The names of the mass-produced products differ slightly from company to company: XPeng calls it NGP (urban), Huawei-ARCFOX calls it NCA (urban), NIO calls it NOP (urban), and Great Wall calls it NOH.

N: Navigation, often promoted as point-to-point for urban intelligent driving scenarios

For discussion purposes, they are collectively referred to here as urban navigation intelligent driving. The release of this feature has been delayed for a long time. XPeng previously promised to release urban NGP early this year, Huawei-ARCFOX promised to release urban NCA at the end of last year, and NIO’s urban NOP has yet to appear. Personally, I do not think NIO will release urban NOP before the end of this year.

Therefore, if the difficulty of implementing current autonomous driving technology is ranked, urban navigation intelligent driving technology is undoubtedly the “Mount Everest”.

Why did XPeng and Huawei-ARCFOX release almost simultaneously after waiting for so long?

Recently, some cities in China have opened up high-precision maps of urban areas as pilot projects. XPeng and Huawei-ARCFOX therefore started the first battle over urban navigation intelligent driving. On September 17, XPeng announced on its public account that it had started to push urban NGP to some P5 users in Guangzhou, and it invited a large number of media outlets to test drive and build momentum. Then, on September 23, Huawei-ARCFOX announced on its public account that it had pushed the full version of its Shenzhen urban navigation intelligent driving feature to some users, snatching the title of the world’s first release of high-level urban navigation intelligent driving.

Figure 4 XPeng and Huawei-ARCFOX compete for the first release of urban navigation intelligent driving

However, I suspect that both parties follow the same strategy: neither dares to release the feature to a large number of ordinary consumers at once. Instead, they give priority to extremely friendly users. My guess is that no more than a few dozen ordinary consumers have received access from each company, so anyone who was selected might as well go buy a lottery ticket.

Meanwhile, companies such as NIO, Ideanomics, and Momenta, which have also emphasized urban intelligent driving, have yet to ship a mass-production product.

Why are companies so cautious about releasing urban navigation intelligent driving?

Because urban navigation intelligent driving is really difficult.

First of all, let me state the conclusion. If the average level of human drivers is used as the baseline for comparison, each company’s current urban navigation intelligent driving still needs further polishing in terms of safety, experience, and efficiency.

Urban navigation intelligent driving is where corner cases gather and where the long tail runs deepest. Robotaxi companies generally have the most luxurious sensor configurations because they face little cost pressure. Their styling requirements are lower than those of mass-produced cars, so their sensor layout is also the best (the iconic flowerpot-shaped Lidar housing). Even so, the MPI (the distance driven per intervention) of such a fleet in urban areas is usually only in the dozens.

Figure 5 Typical sensor layout of RoboTaxi company

Here, some readers may pull out the California DMV disengagement report. The MPI figures in that report are less a comparison of each company’s technical level than a comparison of each company’s integrity. Still, the disengagement reasons that some companies list in the spreadsheet are worth reading, and they are written very sincerely.

In other words, even with the best algorithms and the best sensor layout in China, a system used for a daily commute could require an emergency takeover about once a day on average (the average commuting distance in first-tier cities is about 30 km per day).

In mass-produced cars, the driver’s seat is occupied by ordinary consumers, not by experienced company test drivers. Their ability to anticipate system behavior and their proficiency in vehicle control are far below those of test drivers who are “trained” by their own system every day. It is therefore understandable that each company is very cautious about releasing urban navigation intelligent driving.

This also means that if 100 cars have urban intelligent driving activated, the potential number of emergency takeovers is about 100 per day. With 1,000 cars, it is about 1,000 per day. With numbers like these, it is unrealistic to expect ordinary consumers to rescue the situation in an emergency the way such test drivers can.
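The fleet arithmetic above can be written down as a small sketch. The function name and the value of 30 km per intervention are assumptions for illustration only; the latter is chosen so that one commute of about 30 km yields roughly one takeover, matching the estimate in the text.

```python
def expected_daily_takeovers(fleet_size: int,
                             km_per_car_per_day: float,
                             km_per_intervention: float) -> float:
    """Expected takeovers per day across a fleet (hypothetical helper)."""
    if km_per_intervention <= 0:
        raise ValueError("km_per_intervention must be positive")
    # Each car drives km_per_car_per_day and needs one takeover
    # every km_per_intervention kilometres on average.
    return fleet_size * km_per_car_per_day / km_per_intervention


if __name__ == "__main__":
    # Assumed values: 30 km commute, 30 km per intervention.
    for fleet in (1, 100, 1000):
        print(fleet, expected_daily_takeovers(fleet, 30.0, 30.0))
```

The estimate scales linearly with fleet size, which is why a rollout to even a few hundred ordinary drivers changes the risk picture.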

However, there is no need to be too pessimistic. In complex urban areas, the MPI of Tesla’s FSD is in the single digits. Interested readers can go to YouTube and count for themselves. Note that the road conditions in many Tesla test videos on YouTube are actually worse than those in Chinese urban areas, so choose the videos appropriately when counting.

What is challenging about urban navigation with intelligent driving?

HD-MAP

High-precision maps are actually less a technical challenge than a cost problem. A high-precision map can be thought of as an alien world. The ego drives in this alien world: it projects or matches the targets it “sees” (people, cars, bicycles, and traffic signals) onto that world, thereby “understands” it, and in the end manages to travel in the real world.

In fact, this principle is not fundamentally different from that of household robot vacuums. Autonomous driving today is the culmination of artificial intelligence (especially computer vision) and robotics.

Figure 5 The ego projects/matches the perceived target information in the real world to the high-precision map, "understands," and travels in the alien world.

This will bring two problems:

The degree of correspondence between the expression of this alien world and the real physical world directly determines the level of urban navigation with intelligent driving.

That is, the quality of the HD-MAP production process, the richness of reality elements reflected, and the accuracy of matching real-world human driving habits directly determine the level of urban navigation with intelligent driving.

Take the match between road elements in this alien world and human driving habits in the real physical world as an example. In the high-precision map, the path that the ego follows is called a reference lane, which can be thought of as a train track. In most cases the ego travels along this reference lane; only in rare cases does it perform “human-like” maneuvers such as obstacle avoidance.

The quality of reference lane production in maps (especially in complex topologies such as intersections and merge lanes) directly determines how reasonable and safe the ego’s driving behavior is. However, production quality differs between companies, and the degree of optimization of the strategy space within a lane (or an even larger strategy space) varies widely. The “human-like” maneuvers often look “idiotic” or even dangerous. In fact, on road sections where autonomous driving is frequently tested, a little attention is enough to tell whether a vehicle is in autonomous driving mode.

Figure 6 Ego driving in reference lane in HD-MAP

From this point of view, it is more accurate to say that urban navigation autonomous driving is a “smart” train rather than a “smart” car, driven along the “tracks” laid by the HD map. Where the “tracks” are laid, the “train” runs. If the “tracks” are poorly laid, the “train” will not run well.

Figure 7 The tracks of the "intelligent driving train"

Similar issues arise in other road scenarios, such as whether traffic light countdown information is present in the HD map, how topology is handled at intersections without traffic lights, and how topology is built for roundabouts (in the real world, driving around a roundabout is very complicated).

It is very difficult to express established driving habits that have become a part of human society in high-precision maps in a rational and human-like way. Automated algorithmic processing often falls short of expectations, and still requires a lot of manual checking and refinement. This greatly increases the production cost of high-precision maps.

Synchronizing changes in real-world road conditions to the high-precision map in the alien world in real time is extremely difficult.

Maintaining the “freshness” of a high-precision map is extremely costly, and currently, urban navigation autonomous driving technology cannot handle this situation well, which may lead to serious consequences.

This is also the key reason why Tesla’s Elon Musk frequently criticizes Waymo’s high-precision map approach.

Figure 8 "We briefly barked up the tree of high precision lane line [maps], but decided it wasn't a good idea." (Elon Musk)

The full tool chain for producing high-precision maps is complex and has not been completely mastered by players in China. In addition, because urban road infrastructure in China changes frequently, the cost of keeping HD maps fresh becomes almost immeasurable.

That is why it is called a developing country. Before I got into autonomous driving, I never realized that urban infrastructure in China changed so frequently.

In extreme cases, when segment A undergoes road repairs and the department in charge of map information learns about it through active or passive channels, an update workflow for segment A is started: dispatching map collection vehicles to collect data, transmitting the data, fusing and producing the map, updating the tile for segment A, internal testing and bug fixing, and finally commercial release and a DOTA push to the vehicle. The whole process is intricate, and by the time it finishes, segment A may turn out to have been repaired and restored to its original state.

On the one hand, even if various map vendors spare no effort, it is very difficult to update the HD maps within a certain period (such as one week or even one day). On the other hand, if the “freshness” of the high-precision maps is insufficient in urban navigation and intelligent driving, the system will struggle to deal with the situation.

For example, in the aforementioned case of repairing segment A, when a large bend is added to the original straight road, the Ego may still follow the “straight road” in the high-precision map when approaching this bend, which may cause danger.

The above video shows an accident involving a vehicle from a certain manufacturer. The feature in use at the time may not have been urban navigation intelligent driving. However, the accident scenario is very similar to the crash of American Airlines Flight 965, and a scenario like this may have serious consequences when the freshness of the high-precision map is insufficient.

How to deal with insufficient map freshness has become a common problem for urban navigation intelligent driving and for many robotaxi companies. At present, the solutions are similar: real-time perceived static road information is used as verification. When the real-time perceived information (mainly lane lines, road structure topology, hard curbs, traffic lights, etc.) differs significantly from the information provided by the high-precision map, downstream handling is triggered, such as prompting the driver to take over.
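The verification idea described here can be illustrated with a minimal sketch. It is not any manufacturer’s actual logic: the function names, the representation of a lane line as matched sample points in the ego frame, and the 0.5 m threshold are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in metres, ego frame (assumed)


def mean_gap_m(map_line: Sequence[Point],
               perceived_line: Sequence[Point]) -> float:
    """Mean distance between matched samples of the two lane lines."""
    if not map_line or len(map_line) != len(perceived_line):
        raise ValueError("lines need the same non-zero number of samples")
    total = sum(math.dist(m, p) for m, p in zip(map_line, perceived_line))
    return total / len(map_line)


def check_map_freshness(map_line: Sequence[Point],
                        perceived_line: Sequence[Point],
                        max_gap_m: float = 0.5) -> str:
    """Return a downstream action; the threshold is a hypothetical value."""
    if mean_gap_m(map_line, perceived_line) > max_gap_m:
        return "REQUEST_TAKEOVER"
    return "MAP_OK"


if __name__ == "__main__":
    # The map still says "straight road"; perception sees a new bend.
    straight = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
    bend = [(0.0, 0.0), (10.0, 0.5), (20.0, 2.0), (30.0, 4.5)]
    print(check_map_freshness(straight, straight))
    print(check_map_freshness(straight, bend))
```

A production system would compare many element types (topology, curbs, traffic lights) and would have to weigh perception noise against map error, which is exactly where the difficulty lies.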

For example, in some OEM’s SOR/RFQ, the system is required to continue driving according to the real-time perceived static road information instead of directly exiting the intelligent driving system when it detects that the HD-map freshness is insufficient.

However, in many cases this approach is a paradox: it essentially means that the confidence in the vehicle’s real-time perceived information is higher than the confidence in the pre-built HD map data, and that the former can even verify or replace the latter as input to the downstream prediction and control modules. If we already have a real-time perception source that is “stronger than the high-precision map”, why do we need the high-precision map at all?

Another promising approach is to build maps from real-time crowdsourced perception, similar to Mobileye’s REM, Huawei’s Roadcode-RT, and what Tesla presented at its 2021 AI Day. XPeng is believed to have a similar technology roadmap.

A common misconception is that Tesla does not use HD maps. Tesla indeed does not use an “HD map” generated from Lidar point clouds, but it defines a new map format, as described at Tesla AI Day 2021 and in the paper “RoadMap: A Light-Weight Semantic Map for Visual Localization towards Autonomous Driving” by the talented young researcher Qin Tong from Huawei’s Intelligent Driving Department (content link: https://www.zhihu.com/zvideo/1389638431253868544).

After all, in an interview last year, the former President of Huawei’s Intelligent Driving Department clearly analyzed the various problems of traditional HD maps (then called Roadcode-HD by Huawei), which can serve as a lesson that helps everyone avoid the same mistakes. At present, the names of these maps vary among companies, but essentially they all run SLAM on individual vehicles, upload the data to the cloud after a certain degree of fusion, and finally fuse the map data at the fleet level in the cloud. However, due to legal restrictions in China, there is no large-scale commercial deployment of this method at present (at least I have not seen one; please let me know in the comments if there is).

Perception

(Object Detection, Tracking, Multi-Sensor Fusion)

In terms of perception, urban areas contain a large number of unsolvable corner cases (which some people call “weak scenarios”): all kinds of occlusion, all sorts of traffic participants, a wide variety of traffic light styles in each city (even across districts of the same city), and construction scenes and road obstacles that are never the same twice. A single missed detection means a forced disengagement.

Figure 9 Various strange traffic lights. There is even a dedicated Git repository for them: https://github.com/Charmve/OpenCC

Although both XPeng and Huawei-ARCFOX use Lidar in this urban release, Lidar is not a perfect sensor.

In fact, in many situations (direct sunlight, rainwater on the road surface, highly reflective obstacles, black taxis on rainy days, blind spots at extremely close range, and smoke, dust, and fog in the rain), Lidar performance is generally poor or even “very poor”. Most Lidars typically run at 10 Hz, and to suppress false alarms almost all manufacturers’ algorithms filter over 1-2 frames, which is quite fatal in extreme urban scenes such as “ghost probes” (a pedestrian suddenly emerging from behind an occluding object). Detecting moving targets, general obstacles, and so on with Lidar also requires a lot of computing power, which is a challenge given the limited SoC computing power currently available.
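To see why filtering 1-2 frames at 10 Hz matters, here is a rough back-of-the-envelope sketch. The 40 km/h ego speed and the function names are assumptions for illustration; the 10 Hz rate and the 1-2 frame filter come from the text above.

```python
def confirmation_delay_s(sensor_hz: float, filtered_frames: int) -> float:
    """Extra delay before a new target is reported to downstream modules."""
    if sensor_hz <= 0:
        raise ValueError("sensor_hz must be positive")
    return filtered_frames / sensor_hz


def distance_travelled_m(speed_kmh: float, delay_s: float) -> float:
    """Distance the ego covers during the delay, at constant speed."""
    return speed_kmh / 3.6 * delay_s


if __name__ == "__main__":
    ego_speed_kmh = 40.0  # assumed urban speed, not from the article
    for frames in (1, 2):
        delay = confirmation_delay_s(10.0, frames)
        metres = distance_travelled_m(ego_speed_kmh, delay)
        print(f"{frames} frame(s): {delay:.1f} s, {metres:.2f} m")
```

Under these assumptions the ego covers roughly one to two metres before the target is even confirmed, on top of any planning and braking latency.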

As for multi-sensor fusion, I think anyone who has really worked on fusion knows that whether it is late fusion, feature-level fusion, or early fusion, it fundamentally cannot eliminate missed detections or false alarms, and the fused result may even be worse than the detection performance of a single sensor. This was stated clearly in AK’s talk at Tesla AI Day 2021.

Every severe accident related to intelligent driving systems that you can think of occurred, without exception, inside an ROI covered by multiple redundant sensors with sensor fusion. We all know the results.
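A toy calculation shows why fusion trades one error type for the other instead of removing both. This is only a sketch of two simple late-fusion rules; the per-sensor recall and false-alarm rates are made-up numbers, and the sensors are assumed to err independently, which is optimistic in practice.

```python
from typing import Tuple


def fuse(recall_a: float, fa_a: float,
         recall_b: float, fa_b: float,
         rule: str) -> Tuple[float, float]:
    """Return (recall, false_alarm_rate) of two independent sensors.

    rule "or":  report a target if either sensor reports it.
    rule "and": report a target only if both sensors report it.
    """
    if rule == "or":
        recall = 1.0 - (1.0 - recall_a) * (1.0 - recall_b)
        false_alarm = 1.0 - (1.0 - fa_a) * (1.0 - fa_b)
    elif rule == "and":
        recall = recall_a * recall_b
        false_alarm = fa_a * fa_b
    else:
        raise ValueError("rule must be 'or' or 'and'")
    return recall, false_alarm


if __name__ == "__main__":
    # Hypothetical sensors: camera (0.95, 0.02) and Lidar (0.90, 0.03).
    for rule in ("or", "and"):
        recall, false_alarm = fuse(0.95, 0.02, 0.90, 0.03, rule)
        print(rule, round(recall, 4), round(false_alarm, 4))
```

With these numbers the "or" rule misses less but raises more false alarms than either sensor alone, while the "and" rule does the opposite and ends up with lower recall than either single sensor. Neither rule reaches zero misses and zero false alarms.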

Figure 10: Accidents within the ROI with redundant forward-facing sensors

Assume that:

1. Perception: we have unlimited computing power, use a high-performance SoC, and stack redundant sensors to achieve perfect perception at any cost;

2. HD map: within a limited region, all map elements are manually annotated to provide fully human-like semantic information, and the map is updated in real time, giving a perfect HD map.

Can we then solve urban navigation intelligent driving? The answer is no, because its most difficult part is not the HD map or perception, but prediction and game theory.

Prediction

Prediction takes the perceived target information (heading, speed, acceleration, historical observations, target category, etc.), supplements it with high-precision map information (for example, if another car is in a straight-through lane and the traffic light is green, it will very probably continue straight through the intersection), and applies traffic regulations or common sense (for example, that VRUs, vulnerable road users, will obey traffic lights) to predict the motion of the target traffic participants (as a trajectory, reachable set, or occupancy grid) over the next 3 to X seconds.

However, the example above is about the simplest prediction scenario there is, and countless counterexamples exist: a car in a straight-through lane that turns left instead, or VRUs who ignore the traffic lights and suddenly appear from behind an obstruction. There are also harder scenarios, such as unprotected left turns in urban areas.
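As a point of reference, the simplest possible predictor just extrapolates the perceived state. The sketch below is a constant-velocity baseline, not the method of any company mentioned here; the function name, the 3-second horizon, and the 0.5-second step are assumptions for illustration. It ignores the map, traffic rules, and intent, which is precisely why it fails on the counterexamples discussed in this section.

```python
from typing import List, Tuple

# (time offset in s, x in m, y in m)
Waypoint = Tuple[float, float, float]


def predict_constant_velocity(x: float, y: float,
                              vx: float, vy: float,
                              horizon_s: float = 3.0,
                              step_s: float = 0.5) -> List[Waypoint]:
    """Extrapolate a target's position assuming velocity never changes."""
    if step_s <= 0 or horizon_s <= 0:
        raise ValueError("horizon_s and step_s must be positive")
    steps = int(round(horizon_s / step_s))
    return [(k * step_s, x + vx * k * step_s, y + vy * k * step_s)
            for k in range(1, steps + 1)]


if __name__ == "__main__":
    # A car heading straight along x at 10 m/s.
    for waypoint in predict_constant_velocity(0.0, 0.0, 10.0, 0.0):
        print(waypoint)
```

If that car actually turns left from the straight-through lane, every waypoint after the turn begins is wrong, and nothing in the baseline can signal that.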

Figure 11. The car is on a straight lane but turns left instead of going straight.

Figure 12. A VRU appears suddenly from an obstruction, which is not particularly extreme.

The original video is extremely frightening… I put the link below.

https://www.bilibili.com/video/BV1Ug411a7aG/?spmidfrom=333.999.0.0&vd_source=a37d790330008054e1e6d0d77131fe1d

In highway scenarios, where only motor vehicles are involved, cars generally follow the rules and overall intention prediction is manageable. In urban areas, however, predicting the behavior of low-speed VRUs such as pedestrians and cyclists is roughly equivalent to “fortune-telling”. Yet predicting the behavior of vulnerable road users is the highest priority for ensuring the safety of the ego vehicle in urban scenarios.

Admittedly, we can squeeze the limited SoC computing power to run things like pedestrian skeleton estimation or even face detection, providing more information for prediction and improving its precision and recall. However, this can at best moderately enhance the prediction module. To accurately predict VRU behavior from gestures, facial expressions, and the like, we might need to add Lie to Me to the training set for prediction models.

Figure 13. The TV series Lie to Me.

Interactive Prediction/Social Game

(Reaction Prediction or Social Interaction)

Take the Huawei-ARCFOX navigation intelligent driving video from the 2021 Shanghai Auto Show as an example. A pedestrian crossing the road clearly stops after noticing the ego car, and then, a moment later, starts again and crosses in front of it.

The video shows a typical nested interactive game between the ego car and various traffic participants. This game naturally exists between human drivers and other participants as well. Similar problems include the game between the ego car and other cars in ramp merge scenarios and in unprotected left turn scenarios. It can be said that across the entire urban driving environment in which humans participate, extreme scenarios cannot be exhausted, and human game strategies are almost impossible to express in formulas.
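Even the most simplified formal model hints at the difficulty. The sketch below treats the ego-versus-pedestrian standoff as a one-shot two-player game and enumerates its pure-strategy Nash equilibria. The payoff numbers are entirely hypothetical, and a one-shot 2x2 game is far simpler than the nested, repeated interaction seen in the video.

```python
from itertools import product
from typing import Dict, List, Tuple

ACTIONS = ("go", "yield")

# Hypothetical payoffs as (ego, pedestrian); larger is better.
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("go", "go"): (-100, -100),    # both proceed: collision
    ("go", "yield"): (2, 0),       # ego passes first
    ("yield", "go"): (0, 2),       # pedestrian crosses first
    ("yield", "yield"): (-1, -1),  # both wait: deadlock, time lost
}


def pure_nash_equilibria(
        payoffs: Dict[Tuple[str, str], Tuple[int, int]]
) -> List[Tuple[str, str]]:
    """Action pairs where neither player gains by deviating alone."""
    equilibria = []
    for ego, ped in product(ACTIONS, repeat=2):
        ego_pay, ped_pay = payoffs[(ego, ped)]
        ego_best = all(ego_pay >= payoffs[(alt, ped)][0] for alt in ACTIONS)
        ped_best = all(ped_pay >= payoffs[(ego, alt)][1] for alt in ACTIONS)
        if ego_best and ped_best:
            equilibria.append((ego, ped))
    return equilibria


if __name__ == "__main__":
    print(pure_nash_equilibria(PAYOFFS))
```

With these payoffs there are two equilibria, one where the ego goes and one where the pedestrian goes, so the model alone cannot say who moves first. The stop-then-go hesitation in the video is what resolving that ambiguity looks like in practice.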

In fact, both VRUs and human drivers exhibit many behaviors that fall outside legal norms and even social norms.

This also means that as long as human participants exist in urban intelligent driving scenarios, it is impossible to achieve urban navigation intelligent driving “safely”, let alone the “driver out” operation that robotaxis aim for.

While perception and prediction problems can be solved with data-driven models that have seen enough data, I personally believe that the interaction game has exceeded the scope of AI’s current ability and may require the so-called “strong AI” to solve it.

It is precisely because urban navigation intelligent driving poses so many challenges that many of these problems are still being researched in academia. They are open problems that have not yet been clearly defined. Given this, and despite the significant delay in release, it is understandable that XPeng and Huawei-ARCFOX are cautious and have opened the feature to only a very small number of ordinary consumers.

How does the system strategy of urban navigation intelligent driving differ from that of the highway navigation intelligent driving already released by various companies?

First of all, let me state the conclusion: if the highest priority of highway navigation intelligent driving is “ensuring one’s own safety”, then the highest priority of urban navigation intelligent driving is “ensuring the safety of others (especially VRUs)”.

For highway navigation intelligent driving, however each company promotes it, the strategy is, bluntly, that driving experience matters more than driving safety. Each company can claim that it is an L2 system and that the primary responsibility lies with the driver.

In highway scenarios, traffic participants are relatively controllable and there are no VRUs. Basically there are only vehicles, most of which follow the traffic rules. It is then only necessary to ensure that the ego shows no sudden or aggressive lateral or longitudinal behavior. Tesla, for example, often ignores a large number of risk scenarios (even ones that the system has already identified and displayed).

When driving a Tesla, you can see that even when many vehicles in adjacent lanes cross the line and intrude into the ego lane, and even when the perception of other vehicles is inaccurate and their headings swing severely, the ego does not react but keeps its original speed and path.

However, once in the city, in urban navigation intelligent driving scenarios, however loudly a company insists that it is an L2 system with L2 responsibilities, the claim rings hollow.

The reason is that in urban scenarios, any catastrophic accident involving a VRU is almost impossible for a car manufacturer or supplier to bear. In effect, this already falls within the requirements of L4 autonomous vehicles.

Figure 16 Elaine Herzberg, 49, was pushing her bike across the street in Tempe, Arizona when she was hit and killed by a self-driving Uber car on road tests in 2018. This incident directly led to Uber's road test license being revoked, and finally the entire autonomous driving team was disbanded.

In the scenarios that urban navigation intelligent driving must cover, safety no longer means only the safety of the ego, but also the safety of the various VRUs in the city that interact with the ego. This requires the overall system to be tuned for safety first and driving experience second.

This is fundamentally different from the design principles of the highway navigation intelligent driving strategies that companies released earlier. It affects the reporting strategy of the perception and prediction modules (how to trade off precision and recall?), the handling strategy of the game and planning-and-control modules (whether to play a game against vulnerable road users at all? how to quantify how aggressive or conservative scene handling should be?), and how the system keeps the driver alert and ready to take over at any time while preserving the continuity of the driving experience. It reflects the insight of the CTO or technical leadership into the entire autonomous driving system, their technical foresight, and their full-system perspective.
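The precision/recall trade-off in the reporting strategy can be made concrete with a small sketch. The detection scores and labels below are fabricated toy data, and the two thresholds are arbitrary; the point is only the direction of the trade-off when the reporting threshold moves.

```python
from typing import List, Tuple

# (confidence score, is_real_target) for hypothetical detections.
DETECTIONS: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.35, True), (0.20, False),
]


def precision_recall(scored: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when only scores >= threshold are reported."""
    tp = sum(1 for score, real in scored if score >= threshold and real)
    fp = sum(1 for score, real in scored if score >= threshold and not real)
    fn = sum(1 for score, real in scored if score < threshold and real)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.75, 0.30):
        precision, recall = precision_recall(DETECTIONS, threshold)
        print(threshold, round(precision, 3), round(recall, 3))
```

On this toy data the high threshold reports no false targets but misses two real ones, while the low threshold finds every real target at the cost of false alarms. A safety-first urban strategy leans toward the second setting and pays for it in driving experience.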

The data feedback loop that has been popular in recent years is, in essence, still an attempt to use an “L2” mindset to meet “L4” requirements in urban areas. It tries to solve urban navigation intelligent driving with data from weak scenarios that have already led to accidents or near accidents, which amounts to “mending the fold after the sheep is lost”.

As mentioned above, the algorithms and methodologies for intelligent driving at different levels (L2 vs. L4) should differ greatly. The scenarios faced by urban navigation intelligent driving are essentially long-tail, small-sample problems, and they are open-set rather than closed-set problems. This requires the overall algorithm design to provide far more worst-case protection than a simple data feedback loop does, and the overall benchmark should also lean toward the worst case. It also means that the test case distribution of the whole system should be weighted toward the worst case, in line with the overall tuning strategy: safety > driving experience.
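One way to read “the benchmark should lean toward the worst case” is to report the weakest scenario rather than the average. The sketch below contrasts the two; the scenario names and pass rates are invented for illustration and do not come from any real test campaign.

```python
from statistics import mean
from typing import Dict, Tuple

# Hypothetical per-scenario pass rates from a test campaign.
PASS_RATES: Dict[str, float] = {
    "car_following": 0.999,
    "protected_left_turn": 0.99,
    "occluded_pedestrian": 0.80,
    "construction_zone": 0.85,
}


def average_score(pass_rates: Dict[str, float]) -> float:
    """Average-case view: hides weak scenarios behind strong ones."""
    return mean(pass_rates.values())


def worst_case_score(pass_rates: Dict[str, float]) -> Tuple[str, float]:
    """Worst-case view: the weakest scenario defines the score."""
    scenario = min(pass_rates, key=pass_rates.get)
    return scenario, pass_rates[scenario]


if __name__ == "__main__":
    print("average:", round(average_score(PASS_RATES), 3))
    print("worst case:", worst_case_score(PASS_RATES))
```

The average looks comfortable while the worst-case view points straight at the scenario most likely to hurt a VRU, which is the behavior a safety-first benchmark is meant to have.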

Of course, although the data feedback loop is not a silver bullet for urban navigation intelligent driving, it is still quite useful for tasks such as pre-annotation.

Under these circumstances, what should ordinary consumers pay attention to?

To put it bluntly, when manufacturers (including Robotaxi companies) say that they cannot claim L4 and can only claim L2, that is purely a hedge against laws and regulations. However dazzling the carmakers’ sales pitches and external promotion may be, here is what you need to know:

1. At present, while highway navigation intelligent driving can indeed relieve some of the burden of driving, no manufacturer’s urban navigation intelligent driving will relieve your driving burden. In certain scenarios it will even increase the “tension” of driving.

To put it plainly, the ego will produce some “weird maneuvers” beyond what ordinary consumers expect, and these can directly cause accidents.

Always pay attention to the road conditions and hold the steering wheel tightly!

2. Unlike mature L1 functions such as AEB and ACC, urban navigation intelligent driving has no production standard in China or in the industry as a whole. The level of each manufacturer varies greatly, so do not assume that because you have experienced manufacturer A’s feature, manufacturer B’s product is just as good, or that because you have watched online media reviews or demos, the production vehicle from the same manufacturer will perform at a similar level.

There is still much room for improvement in some manufacturers’ intelligent driving functions, and not only in urban intelligent driving.

Please always pay attention to the road conditions and keep your hands on the steering wheel!

As mentioned at the beginning of this article, because the experience of urban navigation intelligent driving is more coherent and the vehicle control is more flexible, after using it for a period of time, consumers will gradually trust this system, and it is precisely when humans gradually trust machines that accidents are prone to occur.

No matter what, please read the user manual provided by the manufacturer carefully, especially the part about intelligent driving, and become familiar with the relevant operations of the system and make it muscle memory. Familiarize yourself with the manufacturer’s weak scenario descriptions for intelligent driving and remind yourself continuously when using this function.

In any case, please always pay attention to the road conditions and keep your hands on the steering wheel!

Regarding this point, I specifically checked the user manuals of XPeng and Huawei-ARCFOX. They list and describe in great detail the problems that may occur, the situations the system cannot handle, and the various usage restrictions, errors, and warnings. Reading them even made me reluctant to drive. However, I believe this is a responsible attitude towards consumers. The corresponding sections in some other manufacturers’ manuals are indescribably poor. I even wonder whether these manufacturers have done systematic, large-scale generalization testing, and whether they themselves know how many scenarios their systems cannot handle.

Figure 19 The dense limitations and warnings in the XPeng P5 manual, and the similarly lengthy chapters in the Huawei-ARCFOX manual

Conclusion

In the next five years, urban navigation intelligent driving will see explosive growth in China, and every company will race to roll out this function. As practitioners in this industry, each of us should keep a clear head, be responsible to customers, consumers, and the industry, and help bring intelligent driving into thousands of households.

This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.