Lane detection issues in autonomous driving
Author: Captain Jack
There is an interesting phenomenon in autonomous driving: there has not been much substantial progress in functionality in the 2-3 years since Tesla pushed NoA, apart from Tesla publicly testing FSD Beta in North America and a few Chinese manufacturers releasing demos of urban autonomous driving. Most car companies are still striving to reach NoA-level capability, and the core reason is technology.
Since Tesla's AI Days, the application of perception algorithms and neural network models in autonomous driving has been pushed to a peak, as if without distinctive technology it were impossible for autonomous driving to move forward.
However, both Tesla and Google have done a great deal of work on basic detection technology, and only with sufficient accumulation can we know how to arrange algorithms, chips, and neural networks sensibly. Improving the accuracy of target detection and the smoothness of vehicle control is in itself enough to improve the availability and safety of the feature.
Benchmark distortion
Currently, for 2D lane-line detection, our overall approach is fairly traditional: pure vision, no map. Here are some of the technical problems encountered.
There are many open-source lane detection datasets on the market, and the corresponding benchmark numbers look promising, with some datasets reporting over 90% accuracy. However, these benchmarks are somewhat distorted, and a model that performs well on them is not necessarily usable. It is fair to say that most of the algorithms on these benchmarks have overfit to the datasets of a particular era; newer datasets simply have not yet been designed for newer algorithms to overfit.
Some examples:
- Some row-wise lane-line methods cannot detect sharply tilted lateral lane lines, let alone U-turns and small roundabouts.
- Some model designs fix the number of lane lines, which causes problems: at an intersection, the near side may need N lines while the opposing lanes need another M; in scenes where the count changes from few to many, or many to few, everything has to be squeezed into that fixed set.
- Models that classify in space discretize continuous spatial relationships, causing severe jitter in the detection results at the boundaries of the category intervals. Worse, classification and business logic are bound together: pixel jitter is not a big deal, but when the business logic jitters, the vehicle can only fly around. A minimal sketch of this quantization jitter follows the list.
- Many algorithms bake in strong priors, which restricts them to highway scenarios, because even slightly more complicated scenes cannot satisfy the priors.
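To make the classification-jitter point concrete, here is a small illustrative sketch, not taken from any particular lane-detection paper: the image width, bin count, and noise level are made-up numbers. It snaps a slowly drifting lane point to the center of its classification bin and compares the frame-to-frame jumps against a regression-style continuous output.

```python
# Illustrative sketch of row-wise classification jitter (hypothetical numbers).
import numpy as np

IMG_WIDTH = 1920                    # hypothetical image width in pixels
NUM_BINS = 100                      # number of column bins in a row-wise classifier
BIN_WIDTH = IMG_WIDTH / NUM_BINS    # 19.2 px per bin

def classify_column(x_px: float) -> float:
    """Snap a lateral position to the center of its classification bin."""
    bin_idx = int(np.clip(x_px // BIN_WIDTH, 0, NUM_BINS - 1))
    return (bin_idx + 0.5) * BIN_WIDTH

# A lane point drifting slowly across a bin boundary (random walk, ~1 px per frame).
true_x = 960.0 + np.cumsum(np.random.default_rng(0).normal(0.0, 1.0, 50))
classified = np.array([classify_column(x) for x in true_x])

print("max frame-to-frame jump, regression-style :", np.abs(np.diff(true_x)).max())
print("max frame-to-frame jump, classification   :", np.abs(np.diff(classified)).max())
```

The continuous output moves by at most a couple of pixels per frame, while the classified output jumps by a full bin width (about 19 px here) every time the point crosses a boundary; when such a jump feeds directly into business logic, the downstream behavior jumps with it.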
This raises a very basic question: is SOTA really suitable for problems that are not strictly defined?
Before blindly adopting SOTA, it is important to first look at the definition and characteristics of the problem.
From a macro-narrative perspective, the benefits of priors diminish with increasing data. From an engineering perspective, it is important to find a balance and timing.
Model
The methods we currently use are quite traditional. Apart from some changes to the training recipe and an increase in data, the overall metrics are no worse than past SOTA methods, and some are even better (not a fair comparison, since the data has changed over time).
The whole training process involves few tricks; the idea of relying on some magical trick to create a sensation is not realistic.
Most models have not been “fully trained” on public datasets, and the increase in data can still improve the model’s performance.
As for the model structure itself, deployment-platform constraints mean it relies on conventional operators. Some new operators, or operators that are inefficient on the target platform, can indeed improve certain metrics, but the performance cost is completely out of proportion to the benefit. The relationship between model capacity and metrics is also an S-curve; given the target platform's compute, the most popular super-large models can only be used for offline auxiliary applications.
3D Ranging
The traditional method for lane line ranging is ground plane IPM, which is also a strong prior algorithm. Influencing factors:
Intrinsic parameters: the result depends on intrinsic calibration and distortion correction, but these are basically stable.
Extrinsic parameters: in actual driving, the camera extrinsics change constantly, affected by vehicle motion, cargo distribution, and road conditions. The actual camera height and angle therefore deviate from the static calibration, and can be adjusted dynamically using some real-world priors; however, these dynamic adjustments cannot be used in scenarios that do not satisfy the priors.
Ground planarity: this assumption fails 99% of the time, since the road surface has lateral curvature, not to mention longitudinal curvature.
For extrinsics, lidar or an IMU can be used for online correction.
For the ground, there is not much that fundamentally solves the problem. Traditional filtering can only smooth the past; it cannot predict the future. The lane-line parallelism constraint is only usable in highway scenarios.
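As a rough illustration of why the flat-ground assumption and extrinsic drift matter, here is a minimal pinhole-camera IPM sketch. The focal length, camera height, and 0.3 degree pitch error are hypothetical values, not figures from this article; the point is only that a small pitch error produces a longitudinal error that grows roughly quadratically with distance.

```python
# Minimal flat-ground IPM sketch with a pinhole camera model (hypothetical numbers).
import numpy as np

FX = FY = 1000.0          # focal lengths in pixels (hypothetical)
CX, CY = 960.0, 540.0     # principal point (hypothetical)
CAM_HEIGHT = 1.5          # camera height above the ground in meters (hypothetical)

def ipm_ground_point(u: float, v: float, pitch: float) -> np.ndarray:
    """Back-project pixel (u, v) onto a flat ground plane.

    Camera frame: x right, y down, z forward. `pitch` is the assumed tilt of
    the camera, applied to the viewing ray before intersecting it with the
    plane lying CAM_HEIGHT below the camera. Returns (lateral, forward) in meters.
    """
    ray_cam = np.array([(u - CX) / FX, (v - CY) / FY, 1.0])
    c, s = np.cos(pitch), np.sin(pitch)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    ray = rot_x @ ray_cam
    scale = CAM_HEIGHT / ray[1]          # where the ray has dropped by CAM_HEIGHT
    point = scale * ray
    return np.array([point[0], point[2]])

# Ground points truly 30 m and 100 m ahead, imaged with a perfectly level
# camera, then back-projected with a 0.3 degree pitch-calibration error.
for true_z in (30.0, 100.0):
    v = CY + FY * CAM_HEIGHT / true_z    # image row where that ground point appears
    _, est_z = ipm_ground_point(CX, v, np.radians(0.3))
    print(f"true {true_z:5.1f} m -> estimated {est_z:6.1f} m (error {est_z - true_z:+.1f} m)")
```

With these made-up parameters, a few tenths of a degree of pitch drift already produces meter-level error at 30 m and tens of meters at 100 m, which is why online extrinsic correction and ground modeling matter so much for longitudinal accuracy.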
If a model is used to solve the problem, it predicts the ground height; once the ground height is predicted, the 3D relationship can be computed directly. The actual bottleneck of current vision algorithms is 3D measurement, and if that is done well, lidar will fall behind. However, solving it depends on the entire system, including vehicle design and the downstream algorithms' tolerance for error.
In other words, many aspects of traditional lane detection, including the algorithms, the representations, and the way downstream modules consume them, cannot be transferred directly to urban scenarios.
The error in traditional lane detection is inevitable, especially the longitudinal error.
Regarding models, although many strong prior models look good in terms of indicators, they are actually more prone to problems in practical use, especially when strong priors are placed in image space.
Experience
BEV Model
Thanks to Tesla, BEV models have exploded in popularity, and some even call them the key technology for solving autonomous driving. However, anyone who has actually done industrial R&D knows that, despite its many advantages, the BEV model is definitely not a silver bullet.
The role of the BEV model is to provide a unified space that makes it easy to fuse different tasks and sensors, and this space is also consistent with physical reality.
This brings several points:
- Fusion becomes easier. In BEV space, traditional post-hoc fusion can essentially be abandoned.
- BEV connects more closely to downstream modules, mainly planning and control, opening up the possibility of end-to-end. Even if prediction and planning are not run directly on BEV, its natural physical form can still serve as an auxiliary feature map.
- In the future, if new representations emerge, BEV space may itself be replaced. BEV expresses physical space well, but it is not good at end-to-end modeling along the time dimension. Current approaches include adding pose information, Tesla's Spatial RNN, and models built with cross attention; all of them introduce obstacles that make end-to-end training less convenient. A minimal sketch of the cross-attention variant follows this list.
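For the cross-attention route mentioned above, here is a minimal PyTorch sketch of one possible way to fuse the current BEV features with the previous frame's. Shapes, sizes, and the omission of ego-motion warping are simplifications for illustration; this is not Tesla's design or any specific published model.

```python
# Sketch of temporal BEV fusion via cross attention (simplified, hypothetical sizes).
import torch
import torch.nn as nn

class BevTemporalCrossAttention(nn.Module):
    """Current BEV cells attend to the previous frame's BEV cells."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_now: torch.Tensor, bev_prev: torch.Tensor) -> torch.Tensor:
        """bev_now, bev_prev: (B, C, H, W) BEV feature maps.

        In a real system bev_prev would first be warped into the current ego
        frame using the relative pose; that step is omitted here.
        """
        b, c, h, w = bev_now.shape
        q = bev_now.flatten(2).transpose(1, 2)    # (B, H*W, C): current cells as queries
        kv = bev_prev.flatten(2).transpose(1, 2)  # (B, H*W, C): previous cells as keys/values
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)              # residual keeps the current frame dominant
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Usage with random tensors standing in for two consecutive BEV frames.
model = BevTemporalCrossAttention()
bev_now, bev_prev = torch.randn(1, 128, 32, 32), torch.randn(1, 128, 32, 32)
print(model(bev_now, bev_prev).shape)   # torch.Size([1, 128, 32, 32])
```

The pose warping and recurrent accumulation that a production system would need are exactly the pieces that make fully end-to-end training along the time dimension awkward.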
The BEV model is not a silver bullet; it just provides a space closer to the physical world than images, which opens up more possibilities for subsequent fusion and planning.
Possibility of E2E Model
One possibility the previous point brings is the introduction of E2E models. Compared with traditional planning, the raw features inside the network can be used directly. At least for common scenarios, neural networks have many advantages, including compute utilization, smoothness, faster response, and tolerance to noise.
From the perception perspective, exposing more raw features also gives downstream modules more room for imagination.
As for neural networks, many people criticize them for being black boxes, handling exceptions poorly, only covering the training set, and requiring large amounts of data. These are all real problems, but traditional hand-tuned approaches cannot handle them effectively either: nesting piles of code and continuously patching for each scenario encountered also runs into difficulties of understanding, control, and exception handling.
A company can keep stacking strategies with traditional methods, slowly climbing from 50 points to 60. With an end-to-end model, it can simply stack data and reach 65 points at lower time cost and with a faster iteration cycle.
Moreover, it is clear that on overall metrics, traditional methods are simply not as strong as neural-network methods.
4D Data
Thanks to Tesla’s AI Day, many companies are starting to annotate 4D data.
So-called 4D data mainly addresses static elements. For dynamic targets, the added time dimension introduces many alignment problems, so traditional annotation methods remain more suitable for them.
Map makers have inherent advantages in 4D annotation:
4D annotation is essentially the production of static elements from visual sensor data.
A large part of the toolchain overlaps with high-precision map production, especially since most companies will not abandon lidar the way Tesla did.
Utilizing the combination of lidar and camera sensors and then using the production tools of high-precision maps is a very smooth technical path.
For map makers, only visual 3D reconstruction needs to be added; the rest, including calibration, removal, alignment, SLAM positioning, annotation tools, and even annotation rules, can be reused directly.
Of course, this is only from a technical point of view, and whether this inherent advantage has any meaning in terms of business and safety has not been taken into consideration.
Some problems with 4D
Accuracy Issue
My basic point of view is still that pursuing so-called centimeter-level accuracy is meaningless except for PR purposes. Even at long distances, errors at the level of meters are acceptable, and what is important is the stability of the error.
Pursuing absolute centimeter-level precision with 4D data is impossible in real scenarios. There are many sources of error: sensor calibration and synchronization, positioning and SLAM optimization, the annotation itself, and algorithmic errors along the pipeline.
Occlusion
4D data still needs to consider static occlusion.
In some scenarios, elements that the camera cannot see end up in the annotation because of temporal fusion, and these occluded elements hurt the model during training.
Dynamic occlusion causes similar problems. In a crowded traffic jam, for example, the camera sees only the surrounding vehicles and not the ground, so ground labels give the model no usable clues. A sketch of filtering such labels follows.
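One common-sense way to reduce this problem is to check, per frame, whether each fused static label is actually visible before using it as supervision. The sketch below is a hypothetical example using a pinhole projection and an observed depth map (for instance from lidar); the camera parameters and tolerance are made up.

```python
# Hedged sketch: drop time-fused static labels that are occluded in a given frame.
import numpy as np

FX = FY = 1000.0
CX, CY = 960.0, 540.0
IMG_W, IMG_H = 1920, 1080

def visible_mask(points_cam: np.ndarray, depth_map: np.ndarray,
                 depth_tol: float = 0.5) -> np.ndarray:
    """Return a boolean mask of label points that are actually visible.

    points_cam: (N, 3) points in the camera frame (x right, y down, z forward).
    depth_map:  (IMG_H, IMG_W) observed depth per pixel, np.inf where unknown.
    A point is kept if it projects into the image and is not more than
    depth_tol meters behind the observed surface at that pixel.
    """
    x, y, z = points_cam.T
    in_front = z > 0.1
    u = np.clip((FX * x / np.maximum(z, 0.1) + CX).astype(int), 0, IMG_W - 1)
    v = np.clip((FY * y / np.maximum(z, 0.1) + CY).astype(int), 0, IMG_H - 1)
    observed = depth_map[v, u]
    not_occluded = z <= observed + depth_tol
    return in_front & not_occluded

# Toy example: a wall of observed depth 20 m in front of the camera.
depth = np.full((IMG_H, IMG_W), 20.0)
labels = np.array([[0.0, 1.0, 10.0],    # lane point 10 m ahead -> visible
                   [0.0, 1.0, 40.0]])   # lane point 40 m ahead -> behind the wall
print(visible_mask(labels, depth))      # [ True False]
```

In the traffic-jam case, the surrounding vehicles fill the depth map, so almost all fused ground labels would be filtered out of that frame's supervision rather than forcing the model to hallucinate them.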
Sensors
Sensor Advantages and Disadvantages
Here we will only briefly discuss lidar and cameras, as millimeter-wave radar is not accurate enough at this stage.
Lidar
Many people like to talk about lidar's tolerance of bad weather. In my own understanding, however, that tolerance is not high.
Disadvantages of Lidar:
- Rain, snow, and fog will basically cause problems, and dust in industrial scenes will also have an impact.
- There will be problems when faced with objects that can reflect or refract light, such as water surfaces and mirrors.
- Poor at small targets and at long range, because of low point density. On maximum detection range it is thoroughly beaten by a camera; whether a small child on the far side of a large intersection can be detected is also questionable.
Advantages are also very clear:
- Active optical sensor, so it also works at night. (The downside: without algorithmic mitigation, lidars on different cars may interfere with each other as they become more common.)
- Absolute ranging accuracy advantage.
Camera
For a long time, cameras have played the role of backup sensors, but now there are also technical solutions that rely on “model + computing power” to support cameras as the main sensors.
Disadvantages:
- Strongly affected by lighting
- Large ranging error
Advantages:
- Information density
- Cost advantage
Setting aside the negative impact of lighting, the differences between cameras and lidar are not that large.
As for ranging error, camera ranging is certainly not as good as lidar, but in practice it does not need lidar-level precision. As long as the motion trend is stable, decimeter-level accuracy within 30 m and meter-level accuracy within 100 m is usable.
In terms of accuracy alone, the camera’s model can easily reach this standard.
Sensor synchronization
Our first sensor-synchronization scheme followed nuScenes, but the nuScenes scheme does not account for camera exposure time: even if the lidar sweep reaches the center of the camera's FOV right before exposure, there is still a gap of tens of milliseconds.
Therefore, to better align the point cloud with the left and right cameras, the exposure should be triggered before the sweep reaches the center of the FOV. Even with such a scheme there are still problems, especially when the camera is the main sensor.
In our experience, a certain amount of time offset in sensor synchronization does not matter much; at least on our own private data, calibration error is definitely larger than synchronization error. Beyond that, the model itself has some tolerance. If conditions are limited, soft synchronization can be used as a stopgap; a minimal sketch follows.
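As an illustration of what soft synchronization can look like, here is a small sketch that matches each camera frame to the nearest lidar sweep by timestamp and reports the residual offset. The frame rates, latency, and tolerance value are invented for the example.

```python
# Minimal "soft synchronization" sketch: nearest-timestamp matching (illustrative values).
import numpy as np

def soft_sync(cam_ts: np.ndarray, lidar_ts: np.ndarray,
              max_offset_s: float = 0.05):
    """For each camera timestamp, return the index of the nearest lidar sweep
    and the residual time offset; pairs beyond max_offset_s are flagged."""
    cam_ts = np.asarray(cam_ts)
    lidar_ts = np.sort(np.asarray(lidar_ts))
    idx = np.searchsorted(lidar_ts, cam_ts)
    idx = np.clip(idx, 1, len(lidar_ts) - 1)
    # Choose the closer of the two neighboring sweeps.
    left_closer = (cam_ts - lidar_ts[idx - 1]) < (lidar_ts[idx] - cam_ts)
    idx = idx - left_closer.astype(int)
    offset = cam_ts - lidar_ts[idx]
    ok = np.abs(offset) <= max_offset_s
    return idx, offset, ok

# 30 Hz camera vs 10 Hz lidar, with a 12 ms constant latency on the camera side.
cam = np.arange(0.0, 1.0, 1 / 30) + 0.012
lidar = np.arange(0.0, 1.0, 1 / 10)
idx, offset, ok = soft_sync(cam, lidar)
print(np.round(offset[:5], 3), ok.all())
```

The remaining offset can then be compensated, or simply tolerated, by interpolating the ego pose between the two timestamps; as noted above, in practice the calibration error usually dominates anyway.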
Possibility of pure camera
From a practical standpoint, the technical solution of pure camera is actually very competitive.
The discussion of whether a pure-camera solution is viable is not about whether pure cameras can support L4: no manufacturer dares to claim that adding a pile of expensive sensors will deliver strictly defined L4 within a few years. Since full L4 is not feasible anyway, a camera solution that handles 90% of driving scenarios is more practical than a car that costs tens of thousands more, carries a pile of sensors, and handles 95% of scenarios (or perhaps just 91%).
Adding a dozen or more cameras, still requiring the driver to take over at any time, and spending a lot of money to run it makes the latter look foolish, even if its probability of a takeover is lower.
This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.