Under the "confidence model" of autonomous driving, is a claim of 3,000 km of unmanned driving meaningful?

Autonomous Driving and Confidence

Author: Shumi.King

Inspired by Professor Wu Jun’s lecture “Information Theory 40 Lectures – 16 | Confidence: What Mathematical Mistakes Did Musk Make?”, this article organizes the related content and briefly discusses the concept of confidence in autonomous driving.

After reading this article, you will understand:

  • Under the confidence model, what would it take to convincingly claim the 3,000 kilometers of unmanned driving advertised by some autonomous driving demonstration projects?

  • At work, someone benchmarked the Tesla AI parking function: out of 5 parking attempts, 4 succeeded, and they concluded that Tesla’s parking success rate is only 80% and its parking capability is mediocre. How unreliable is this conclusion?

What Mathematical Mistakes Did Musk Make?

The following content is a selection from Professor Wu Jun’s lecture “Information Theory 40 Lectures – 16|Confidence: What Mathematical Mistakes Did Musk Make?”

In 2016, Tesla had its first fatal accident involving the use of its driver-assistance function. The media began to question the company’s technology, and most public opinion held Tesla responsible.

To defuse the PR crisis, Tesla, as the provider of the driver-assistance technology, first explained that the driver’s prolonged failure to keep his hands on the steering wheel was the main cause of the fatal accident.

However, the media countered that since Tesla provides the driver-assistance function, an accident that happens while the driver is using it shows that the technology is not up to par.

Whether the technology is up to par is hard to settle. Tesla’s CEO, Musk, then said:

Tesla’s fatal accident occurred after its Autopilot function had been driven 130 million miles, whereas in the United States there is, on average, one traffic fatality for every 93 million miles driven.

Therefore, Tesla’s accident probability is lower than the average level.

This statement led some scientists to ridicule Musk, saying that he had not learned the concept of “confidence” in statistics at all.

Confidence is the key concept of this lecture; it helps you judge whether a piece of information is reliable. We often say we should summarize our experiences and lessons, yet most people do not do so correctly, and they often make the same mistake Musk did.

So where did Musk go wrong?

To understand this, it is important to realize that a major car accident is a random event, and you do not know when the next one will happen. Only when the amount of statistical data is large enough does it make sense to judge which car is safer from the results.

Otherwise, by Musk’s logic, if Tesla has another accident soon, won’t the accident rate simply double? Should we then say the technology is not good enough, or just that Tesla was unlucky?

To help you better understand this point, let’s look at the following example. If you stand at the gate of Tsinghua University and count everyone entering and exiting over a full day, you might find that 4,543 male students and 2,386 female students passed through. From this you can roughly conclude that “the ratio of male to female students at this school is about 2:1”. Of course, you cannot claim the ratio is exactly 4,543:2,386, because whether any particular person leaves campus on a given day is random and determined by many incidental factors, and male and female students may also differ slightly in how much they want or need to go out. Still, a rough ratio based on nearly 7,000 samples is one that no one will challenge.

However, in another scenario things are different. If you get up early on May 3rd and watch the West Gate of Tsinghua University for just two minutes, and you see 3 female students and 1 male student pass through, can you conclude that 3/4 of the students at the school are female? Obviously this conclusion will not be accepted, because everyone will feel it could be a completely random coincidence. Perhaps if you watch for two minutes again on May 4th, all four people passing through the gate will be male, and you certainly cannot then conclude that “this school has only male students”.
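A minimal sketch of why the two counts carry such different weight: the counts below are the ones quoted above, while the simple normal-approximation (Wald) confidence interval is my own illustrative choice; it is crude for tiny samples, which is exactly the point.

```python
# Compare the uncertainty of the two gate-counting scenarios with a
# 95% normal-approximation (Wald) confidence interval for a proportion.
import math

def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion p = successes / n."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Full-day count: 4,543 male students out of 6,929 observed.
lo, hi = wald_interval(4543, 4543 + 2386)
print(f"share of male students, n=6929: {lo:.3f} .. {hi:.3f}")  # roughly 0.64 .. 0.67

# Two-minute count: 1 male student out of 4 observed.
lo, hi = wald_interval(1, 4)
print(f"share of male students, n=4:    {lo:.3f} .. {hi:.3f}")  # roughly -0.17 .. 0.67
```

With nearly 7,000 samples the interval is about one percentage point wide; with 4 samples it spans essentially everything (and even dips below zero, a symptom of how little the data constrain the estimate).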

Regarding the safety of Tesla’s assisted driving, Morgan Stanley later made an estimate: at the current rate of fatal accidents in the United States, it would take about 10 billion miles of driving to show, with sufficient statistical significance, that the assisted driving function is safer.

So what is statistical significance?

Let’s take another simple example. If you flip a coin 14 times and it comes up heads 8 times and tails 6 times, how confident can you be that the coin is not fair and is more likely to land heads up? That degree of confidence is statistical significance.

There are many ways to measure statistical significance, including the t-test. It can tell us how likely it is that a seemingly biased outcome is caused by randomness rather than by a real bias.

Going back to the coin-flipping example: with 8 heads out of 14 flips, can we say the coin is biased toward heads?

There are two possibilities: there is indeed a bias, or it is caused by chance. What is the probability of these two situations?

Mathematically, the probability of the former is 57%, and of the latter 43%.

In other words, it is possible that the coin is biased, but we cannot be sure. We can, however, quantify how sure we are, and that is statistical significance.

Going back to this problem, the statistical significance is 57%, and the statistical significance of the opposite conclusion, “this coin is fine”, is 43%. In statistics, we generally hold that conclusions with a confidence level below 95% cannot be trusted. In engineering, including drug trials, a confidence level of 95% or above is usually required.

So how do we increase confidence? The usual method is to collect more samples.

If we maintain the ratio of 8:6 for heads and tails, the more times we toss the coin, the more confident we can be that the coin is not balanced.

According to the t-test formula, tossing the coin about 140 times is enough to reach a 95% confidence level. And if we toss it thousands of times, the confidence level can exceed 99%.

That is to say, after 140 tosses with the same 8:6 ratio (80 heads to 60 tails), we are 95% sure that the imbalance of the coin caused the deviation, and luck accounts for only the remaining 5%.
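As a quick illustration, here is a minimal sketch of “keep the 8:6 ratio and toss more”. It uses a one-sided z-test for a proportion against a fair coin, which is my choice of method for the sketch; the exact percentages depend on the test used, but the way confidence grows with the number of tosses is the point.

```python
# Confidence that the coin favors heads, as a function of the number of
# tosses, holding the observed 8:6 heads-to-tails ratio fixed. At these
# sample sizes the difference between a t-test and a z-test is negligible.
import math

def confidence_heads_biased(n_tosses, heads_ratio=8 / 14):
    """One-sided confidence that the coin favors heads at the 8:6 ratio."""
    se = math.sqrt(0.5 * 0.5 / n_tosses)           # standard error under a fair coin
    z = (heads_ratio - 0.5) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Phi(z), the one-sided confidence

for n in (140, 1_400):
    print(n, f"{confidence_heads_biased(n):.1%}")
# 140 tosses -> about 95%; 1,400 tosses -> well above 99%
```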

Returning to the Tesla example: according to Morgan Stanley, proving that the Autopilot function is safer than human driving would require accumulating nearly 100 fatal accidents, from which it follows that a total of roughly 10 billion miles of driving is needed.

Of course, they measured the confidence level with another tool, the z-test, which is similar to the t-test. For autonomous vehicles, roughly this much test mileage would have to be accumulated.

As of now, even Waymo, the Google autonomous driving project that started road testing earliest, is still far from 10 billion miles.

As for the figure Musk gave, it is based on a single data point and carries essentially no confidence, which means the reliability of that information is negligible.
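To see just how little one data point constrains the rate, here is a minimal sketch. The mileage figures are the ones quoted above; modeling fatalities as a Poisson process and using the exact chi-square-based interval are my own illustrative choices.

```python
# Exact 95% confidence interval for a Poisson rate, applied to
# "1 fatality in 130 million Autopilot miles".
from scipy.stats import chi2

def poisson_rate_ci(events, exposure_miles, conf=0.95):
    """Exact two-sided confidence interval for a Poisson event rate."""
    alpha = 1 - conf
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * events) if events > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (events + 1))
    return lower / exposure_miles, upper / exposure_miles

lo, hi = poisson_rate_ci(1, 130e6)
print(f"fatalities per 100M miles: {lo * 1e8:.2f} .. {hi * 1e8:.2f}")
# ~0.02 .. ~4.3: the US baseline of roughly 1.1 per 100 million miles sits
# comfortably inside this range, so the single accident decides nothing.
```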

One common mistake people make when handling information is ignoring its confidence level and treating completely random outcomes as certainties.

Summary:

We discussed the concept of confidence level. The common mistake people make when reading news is to ignore it. For repeatable events, enough trials must be accumulated before the confidence level becomes high enough to trust.

Related problems encountered in autonomous driving

Question 1: What does it mean to achieve 3,000 km of unmanned driving under the confidence model?

Actually, this problem is the same as evaluating the safety level of Tesla’s autopilot.

A small-scale autonomous driving fleet set a phased goal of 3,000 km of unmanned (intervention-free) driving.

After a series of optimization measures, a test engineer reported excitedly: “We have had only one takeover in 3,200 km of road testing so far, so we have exceeded the 3,000 km unmanned driving target and reached 3,200 km.”

With a little thought, we can find that this test engineer made the exact same mistake as Musk.

Musk declared the safety level of Tesla’s Autopilot on the basis of one accident, and this engineer declared the task complete on the basis of one takeover. If we keep testing and another takeover happens after only 10 more kilometers, what does that say about our autonomous driving system? Following the engineer’s logic, does it mean our system only reaches about 1,600 km between interventions (1,605 km to be exact, which is not much different), far below the team’s target? Or were we simply unlucky? And if Musk’s mistake was made deliberately for commercial reasons, what is driving this engineer’s behavior? Is he, too, being driven by “interest” or by “the desire for achievement”?

Borrowing Morgan Stanley’s z-test analysis of the Tesla example: to prove that assisted driving is safer than human driving, nearly 100 fatal accidents need to be accumulated, from which they derived a requirement of about 10 billion miles of total driving. These analyses, of course, use the 95% confidence level common in engineering.

Applying the same reasoning to the 3,000 km intervention-free target: to prove that we have reached a level of 3,000 km between interventions, we need to accumulate nearly 100 interventions and show that the average mileage between them is no less than 3,000 km. That implies at least 300,000 km of total test mileage.
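One way to make this concrete, as a minimal sketch: assume takeovers arrive as a Poisson process (my modeling assumption, not the team’s stated method) and compute the 95% one-sided lower confidence bound on the mean distance between interventions. With a single takeover in 3,200 km the bound is only around 670 km; only once on the order of 100 takeovers have accumulated does the bound sit close to the observed average, which is what drives the ~300,000 km requirement.

```python
# 95% one-sided lower confidence bound on the mean km between interventions,
# assuming takeovers follow a Poisson process (illustrative assumption).
from scipy.stats import chi2

def mtbi_lower_bound(takeovers, total_km, conf=0.95):
    """Lower confidence bound on mean km between interventions."""
    # An upper bound on the takeover rate becomes a lower bound on the mean distance.
    return 2 * total_km / chi2.ppf(conf, 2 * takeovers + 2)

# The engineer's evidence: 1 takeover in 3,200 km.
print(mtbi_lower_bound(1, 3_200))      # ~675 km -- nowhere near 3,000 km
# A hypothetical run with ~100 takeovers over 360,000 km (illustrative figures):
print(mtbi_lower_bound(100, 360_000))  # ~3,050 km -- the bound now clears 3,000 km
```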

With a fleet of 20 vehicles, each testing 500 km per day, it would take 30 days to complete the 300,000 km of testing. Moreover, reaching 3,000 km between interventions requires incremental optimization of the driving system, with staged tests and verification of each round of optimization measures.

Suppose that after optimization round X1, testing verifies a level of 1,000 km between interventions, and after round X2, a level of 2,000 km. In principle these staged verifications also require, as described above for the 3,000 km target, the accumulation of nearly 100 interventions each for the evaluation to be meaningful. Correspondingly, they add roughly 10 and 20 more days of testing, respectively, to form an evaluation cycle. And if an optimization round turns out to be ineffective and the staged goal is not met, another round of optimization and testing is needed.

Even without counting repeated rounds of optimization and re-testing, it would take about two months of testing and validation (10 days + 20 days + 30 days) before we could conclude that we have reached the 3,000 km intervention-free level.
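The day counts above follow from simple arithmetic; the short sketch below just restates it, using the fleet size and daily mileage assumed in the text.

```python
# Each staged target needs roughly 100 interventions' worth of mileage,
# divided across the fleet's daily throughput.
FLEET_SIZE = 20           # vehicles (figure from the text)
KM_PER_VEHICLE_DAY = 500  # km per vehicle per day (figure from the text)
INTERVENTIONS_NEEDED = 100

def days_to_validate(target_km_between_interventions):
    total_km = INTERVENTIONS_NEEDED * target_km_between_interventions
    return total_km / (FLEET_SIZE * KM_PER_VEHICLE_DAY)

stages = [1_000, 2_000, 3_000]                  # targets after X1, X2, and final
days = [days_to_validate(km) for km in stages]  # [10.0, 20.0, 30.0]
print(days, "total:", sum(days), "days")        # about two months in total
```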

Question 2: Someone once said that in a benchmark test of the Tesla AI parking function, 4 out of 5 parking attempts succeeded, which means Tesla’s parking success rate is only 80%. How reliable is this conclusion?

Assessing a competitor’s parking capability this way is like the scenario Professor Wu Jun described in his course: standing outside Tsinghua University’s west gate for two minutes, seeing 3 female students and 1 male student pass through, and concluding that 3/4 of the students at the university are female.

Conversely, the confidence model can be used to evaluate and guide development sensibly. For example, suppose we require the false detection rate for traffic lights to be no more than 5%. If 10 false detections have already occurred in the samples tested so far, how many more tests without any false detection are needed before we can claim, with 95% confidence, that the false detection rate is no more than 5%?
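Two back-of-the-envelope checks of these questions, as a minimal sketch: the exact Clopper-Pearson interval and the brute-force search are my own choices of method, not ones prescribed by the text.

```python
from scipy.stats import beta, binom

# (a) 4 successes out of 5 parking attempts: the exact (Clopper-Pearson)
#     95% confidence interval for the true success rate.
k, n = 4, 5
lower = beta.ppf(0.025, k, n - k + 1)
upper = beta.ppf(0.975, k + 1, n - k)
print(f"parking success rate: {lower:.0%} .. {upper:.0%}")  # roughly 28% .. 99%
# The data are consistent with anything from a poor system to a near-perfect
# one, so "80%, therefore mediocre" is not a supportable conclusion.

# (b) Traffic-light false detections: 10 false detections already observed.
#     Smallest total number of tests N (with no further false detections) such
#     that the one-sided 95% upper confidence bound on the rate is <= 5%,
#     i.e. P(X <= 10 | N, p = 0.05) <= 0.05.
N = 11
while binom.cdf(10, N, 0.05) > 0.05:
    N += 1
print(N)  # on the order of a few hundred tests in total
```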

This article is a translation by ChatGPT of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.