In just two months on the market, Xiaomi launched end-to-end parking. Seven months post-launch, Xiaomi's NOA became usable nationwide. This year's hot technology trend, end-to-end + VLM enabling parking-to-parking assisted driving, was recently introduced on the Xiaomi SU7. We have already tried it ahead of release and believe it will soon be officially delivered.
The rapid rollout of smart driving features is backed by Xiaomi's early R&D planning, which skipped the rule-based era to embrace end-to-end + VLM, avoiding many pitfalls. On November 14, the day before the Guangzhou Auto Show's media day, Xiaomi Auto first showcased its parking-to-parking smart driving capability, with Lei Jun live-streaming directly from the road. During the event, Ye Hangjun, General Manager of the Autonomous Driving Division, described two goals for the coming year: mass production of parking-to-parking and data accumulation.
Why these two goals? With this question in mind, along with the many others raised by our first experience of Xiaomi's parking-to-parking smart driving, we had an in-depth conversation with the Xiaomi Auto smart driving team at a communication meeting.
In the discussion, the smart driving team specifically covered the implementation of end-to-end + VLM technology. As early as a year ago, Xiaomi's smart driving team had begun working on end-to-end. This year it first took root in parking scenarios, followed by the integration of urban smart driving and parking, that is, parking-to-parking. In the future, end-to-end will also be extended to highway smart driving.
By this time next year, VLM's next generation, VLA, may have taken shape: visual models will not only see, but also respond with actions based on what they see.
Discussing the potential of technologies beyond the Transformer, the smart driving team believes the industry has yet to see a technological leap as significant as the one from CNNs to Transformers. The focus for the coming period remains end-to-end.
During nearly two hours of conversation, the smart driving team shared numerous details about Xiaomi smart driving's technology, team, and future plans. We have organized the dialogue into written form below.
Q: Xiaomi's parking-to-parking smart driving is already end-to-end + VLM. When did Xiaomi start work on end-to-end?
A: Actually, we were already experimenting around this time last year. End-to-end parking and mechanical parking spaces were implemented fairly early on.
Q: Given the plan for end-to-end, why was a version without maps that could be used nationwide released first?
A: Mapless and end-to-end are not sequential; they are two different dimensions of the problem. Colleagues on the market or product side felt that, at this stage, mapless could already offer something to users and gather feedback. The two were not developed separately.
Q: Alongside the launch of end-to-end capability is VLM. What capabilities does VLM currently have?
A: The VLM's alert function has reached a stage where it can be productized. Its greatest strength is recognizing this vast world.
The novelty of voice broadcasts may be intriguing today and still acceptable tomorrow, but once they become a daily occurrence, users will start wondering what more you can do for them. Applications based on VLM will therefore inevitably evolve from voice broadcasts to vehicle "actions." This, in fact, is the next generation of VLM: VLA (Vision-Language-Action Model).
From VLM to VLA, there can be roughly three stages in terms of functionality:
- Currently, VLM is at the first stage: sensors perceive the environment, and the system reminds the driver through voice and text.
- In the second stage, VLM may perform protective or rerouting actions in specific scenarios.
- In the third stage, VLM evolves into VLA, where a single model directly produces trajectories (actions).
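The three stages above can be sketched as progressively more capable outputs from the same scene understanding. This is a minimal illustrative sketch; all class and function names are assumptions for the example, not Xiaomi's actual architecture.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SceneUnderstanding:
    description: str   # what the vision-language model "sees"
    risk_level: float  # 0.0 (benign) .. 1.0 (dangerous)

@dataclass
class Trajectory:
    waypoints: List[Tuple[float, float]]  # (x, y) points to follow

def stage1_alert(scene: SceneUnderstanding) -> str:
    """Stage 1: perceive the environment and remind the driver via voice/text."""
    return f"Alert: {scene.description}"

def stage2_protect(scene: SceneUnderstanding) -> Optional[str]:
    """Stage 2: trigger a protective action only in specific risky scenarios."""
    return "slow_down" if scene.risk_level > 0.7 else None

def stage3_act(scene: SceneUnderstanding) -> Trajectory:
    """Stage 3 (VLA): the model directly outputs a trajectory (action)."""
    # A real VLA would generate this path; a placeholder is returned here.
    return Trajectory(waypoints=[(0.0, 0.0), (1.0, 0.2), (2.0, 0.4)])

scene = SceneUnderstanding("construction zone ahead", risk_level=0.8)
print(stage1_alert(scene))    # a text/voice reminder
print(stage2_protect(scene))  # a protective action
print(stage3_act(scene))      # a full trajectory
```

The key design point is that each stage consumes the same perception output but escalates from information (text) to intervention (a discrete action) to full control (a trajectory).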
Q: What are the R&D plans for Xiaomi’s intelligent driving team next year?
A: Next year, Xiaomi Intelligent Driving will focus on two things. The first is end-to-end, all-scenario, parking-to-parking intelligent driving: we aim to release a test version to a group of roughly a thousand users by the end of this year, and then achieve mass production and delivery of parking-to-parking as early as possible next year.
The second goal is to accumulate effective data and make substantial breakthroughs on data within a year, getting the most out of end-to-end performance. Ultimately, the aim is to move intelligent driving from "usable" to "practical."
Q: One of the two goals for next year for the Xiaomi intelligent driving team is data accumulation. How is high-quality data defined, and what is its proportion in all intelligent driving data?
A: This is actually quite similar to the learning process of humans. For example, when learning to drive, you first go in a straight line and then learn to turn. To move from not being able to drive to being able to, you need quite a few such “positive” samples.
To move from being able to drive to driving proficiently, more “negative” samples are required. These could be dangerous situations encountered while driving or adverse weather conditions.
Therefore, high-quality data must include both “positive” samples from the driving process as well as a large number of “negative” samples.
For training positive capabilities, about 1% – 5% of the data is valuable.
For training negative capabilities, the proportion is far lower, and some of this data is rare and hard to obtain. To address this, it is necessary not only to mine data but also to augment it, for example by increasing the danger level in hazardous scenarios. The Xiaomi Intelligent Driving team's preliminary research has found that training with such data is indeed very useful.
Q: Since intelligent driving has shifted from a rule-based era to end-to-end, does this mean fewer people are needed for intelligent driving R&D, and how many people are needed?
A: To use an imperfect analogy, previously everyone wrote rules on the vehicle side, whereas now everyone writes “rules” and finds data on the cloud side. It’s actually a change in the method of knowledge input. This has the advantage of being more suitable for large-scale deployment.
In the previous rule-based era, 20 people wrote rules, but as more code was written, it became unusable because the rules would “conflict” with each other. Now, there is no problem with 200 people working on data simultaneously.
Therefore, current intelligent driving development does not eliminate the need for experts or people. Headcount may not decrease at all, because everyone has become an expert working in the cloud.
Q: Are there any technologies on the horizon that could surpass the Transformer model?
A: As of now, we haven’t seen any technological leap as significant as from CNNs to Transformers. The next 1-2 years are likely to continue along this path, pushing end-to-end production, akin to how BEV + Transformer took the industry 1-2 years to reach mass production. Contemplating the distant future is somewhat meaningless at this stage.
The entire industry is still actively exploring and experimenting, though nothing particularly groundbreaking has emerged recently. OpenAI's recent o1 model has sparked some thinking.
Q: Given Xiaomi’s relatively late start in intelligent driving, having avoided the pitfalls of the early rule-based era, does focusing directly on end-to-end offer an advantage?
A: The first version of Xiaomi's intelligent driving already used BEV + Transformer, giving Xiaomi a significant latecomer advantage. Furthermore, Xiaomi Auto leverages the wider group rather than starting from scratch. Essentially, every company has similar people; none are inherently smarter, and everyone in this industry works hard.
Q: How should we understand the “world model” in intelligent driving?
A: The human brain runs parallel deductions when performing a task. For example, when encountering an obstacle while driving, a person simultaneously evaluates options like "bypass directly," "wait in place," or "peek out," predicting the results in several "parallel scenarios." Intelligent driving similarly requires an engine to predict how the vehicle's potential actions will affect the surrounding environment over the next 3-5 seconds; this engine is the world model.
This concept stems from reinforcement learning, with the biggest challenge being the creation of an accurate world model. For effective reinforcement learning, the world model must be sufficiently realistic. This dilemma reflects the “chicken and egg” problem. Presently, creating a highly accurate world model, akin to scenes from “The Matrix,” remains challenging. If not realistic enough, the imagined scenarios become mere hallucinations, delivering incorrect outcomes.
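The "parallel scenarios" idea above amounts to scoring each candidate action by imagining its outcome a few seconds ahead and picking the best. The toy sketch below uses hand-written rewards as a stand-in for a learned world model; every value here is invented for illustration.

```python
# Candidate maneuvers the driver (or planner) considers at an obstacle.
CANDIDATES = ["bypass", "wait", "peek"]

def rollout(action: str, lane_free: bool, obstacle_clears_soon: bool) -> float:
    """Predicted reward of `action` over the next 3-5 seconds (invented values)."""
    if action == "bypass":
        return 0.9 if lane_free else 0.1  # fast, but needs a free adjacent lane
    if action == "wait":
        return 0.7 if obstacle_clears_soon else 0.2
    if action == "peek":
        return 0.5                        # gather information at a small time cost
    raise ValueError(action)

def plan(lane_free: bool, obstacle_clears_soon: bool) -> str:
    """Evaluate all candidates in 'parallel scenarios' and pick the best one."""
    return max(CANDIDATES, key=lambda a: rollout(a, lane_free, obstacle_clears_soon))

print(plan(lane_free=True, obstacle_clears_soon=False))   # bypass
print(plan(lane_free=False, obstacle_clears_soon=True))   # wait
```

The interview's point about realism maps directly onto `rollout`: if its predictions are wrong, the planner confidently picks a bad action, which is exactly the "hallucination" failure mode described.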
Q: The functional gap among different intelligent driving technologies seems to be decreasing. How will differentiation manifest in the future?
A: Differentiation lies more in the extent to which companies satisfy real user needs, rather than showcasing technical prowess without fulfilling essential user requirements.
Q: In your opinion, with end-to-end technology and nationwide operability, have companies exhausted this area?
A: The journey is far from over. Those developing algorithms might issue conservative statements, but this is just the beginning. Achieving a truly satisfying user experience will require another 1-2 years. Current experiences are akin to when BEV first emerged.
Q: Xiaomi Intelligent Driving aims to enter the industry’s top echelon. How does the company internally assess being in the “top echelon”?
A: Evaluation of intelligent driving is multidimensional, with significant emphasis on real-world usage. Metrics such as the number of user takeovers and user engagement levels are key considerations.

Q: When trying the current parking-to-parking intelligent driving, we felt the launch from traffic lights is relatively slow. What is your take on this issue?
A: Xiaomi's autonomous driving is quite fast overall, though this particular scenario may carry a bit more latency. There is still room for system-wide optimization before mass production and delivery, and the corresponding engineering work is under way.
Q: Which city does Xiaomi's intelligent driving team find the most challenging for autonomous driving?
A: On one hand, cities with difficult geography, like Chongqing. On the other, cities whose traffic infrastructure differs noticeably, such as lane layouts and traffic-light placement.
Xiaomi does not optimize separately for different cities or driving habits; the goal is for the system to ultimately become a "super driver."
This article is an AI translation of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.