VLA is one of the hottest assisted-driving technology routes of 2025: the Li Auto i8 is already in mass production and the Xpeng P7 is close behind, while Huawei says it does not use the approach.
Just yesterday, a new player joined the VLA camp. DeepRoute released a new generation of its assisted driving platform—DeepRoute IO 2.0, equipped with its self-developed VLA (Vision-Language-Action) model.
DeepRoute CEO Zhou Guang revealed that the starting point for developing the VLA model was to teach AI to learn fear: just because the sensors cannot see something does not mean there is no danger, a limitation of assisted driving in the previous end-to-end era.
Drivers who frequently use assisted driving may have experienced scenarios where the system confidently drives through situations the driver considers dangerous, such as occlusions, turns, and lane merges. More concerning, this apparent efficiency often leads drivers to over-trust the system. As more vehicle models ship with assisted driving, any such issue is magnified.
The fundamental cause is that the AI has not learned to fear.
Humans, like almost all living beings, learned fear over the course of evolution. A human driver slows down at a blind spot for fear that a pedestrian might suddenly step out, keeps a distance from trucks carrying oddly shaped loads, and understands the text on tidal-lane and bus-lane signs… Can AI do the same?
DeepRoute’s VLA model features four major functions: spatial semantic understanding, odd-shaped obstacle recognition, interpretation of textual signboards, and memory-voice vehicle control.
Among these, spatial semantic understanding is the core capability. During assisted driving, the vehicle performs semantic interpretation (Vision-Language) of the front-view camera images; when it approaches visual blind spots such as occlusions, complex intersections, and tunnels, the system makes a preventive judgment and ultimately decides to reduce speed (Action).
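To make the Vision-Language-Action loop described above concrete, here is a minimal editorial sketch in Python; every name in it (`describe_scene`, `vla_step`, the speed threshold) is a hypothetical placeholder, not DeepRoute's implementation. The point is the ordering: the model first produces a language-level reading of the scene and only then commits to an action such as slowing down.

```python
# Conceptual sketch of one Vision-Language-Action step (not DeepRoute's code;
# every name here is a hypothetical placeholder).
from dataclasses import dataclass

@dataclass
class Action:
    target_speed_mps: float  # commanded speed
    reason: str              # language-level justification (the "L" in VLA)

def describe_scene(front_camera_image) -> str:
    """Placeholder for the vision-language stage: in a real VLA model this
    would be a transformer producing a semantic reading of the frame."""
    return "stopped bus occluding a crosswalk ahead"

def vla_step(front_camera_image, current_speed_mps: float) -> Action:
    # Language: semantic interpretation of the front-view image.
    scene = describe_scene(front_camera_image)
    # Action: if the description implies an occlusion or blind spot,
    # pre-emptively reduce speed rather than waiting to "see" a hazard.
    if "occlud" in scene or "blind" in scene:
        return Action(target_speed_mps=min(current_speed_mps, 5.0), reason=scene)
    return Action(target_speed_mps=current_speed_mps, reason=scene)

print(vla_step(front_camera_image=None, current_speed_mps=12.0))
```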
Odd-shaped obstacle recognition allows the system to identify and flexibly respond to unstructured obstacles such as traffic cones and overloaded small trucks; interpretation of textual signboards enables it to read road signs and decipher the text on tidal lanes and bus-only lanes; memory-voice vehicle control supports natural-language command interaction and gradually learns user preferences for a personalized, human-like driving experience.

Zhou Guang disclosed that the DeepRoute IO 2.0 platform is compatible with “multiple modalities, multiple chips, and multiple models,” supporting both LiDAR and pure-vision versions. Based on the DeepRoute IO 2.0 platform, DeepRoute has already secured design wins for 5 vehicle models, with the first batch of mass-produced cars about to enter the market.
Conversation with Zhou Guang: The Biggest Challenge of VLA Lies in the Chain of Thought and Long Sequential Reasoning
After the brief launch event, DeepRoute CEO Zhou Guang sat for a joint interview with several media outlets, including Garage 42.
Focusing on the technical details of the DeepRoute VLA model, Zhou stated that on the NVIDIA Thor chip, VLA runs at a few Hertz, i.e. a few frames per second, achieving real-time response.
During the development of VLA, the biggest challenge was the Chain of Thought (CoT) and long sequential reasoning. Zhou believes, “This is the real core capability of VLA. Chain of Thought is a fundamental requirement for this type of architecture. Without it, it can’t be considered VLA.”
Recently, there has been significant discussion in the industry regarding whether LiDAR is necessary for assisted driving and whether the VLA route is a better solution for assisted driving. The initiators of these discussions are Musk and Huawei.
Zhou believes LiDAR currently plays an important role in the recognition of common obstacles. However, with the advancement of large model technology, vision will play an increasingly crucial role in perception, with large models expected to gradually solve tasks that currently rely on LiDAR.
So does assisted driving really need VLA? Zhou believes that to truly achieve chain-of-thought (CoT) reasoning, one has to pursue the VLA direction; only if computing power is insufficient might other paths be chosen.
In a one-hour in-depth exchange, Zhou discussed the technical details and training of the DeepRoute VLA model, responding individually to industry hot topics. We’ve organized the entire conversation, with slight edits for conciseness, for your reference.
Additional Technical Details on Mass-Produced VLA
Q: What is the target frame rate for the mass-produced VLA model?
A: Currently it runs at a few Hertz, i.e. a few frames per second. The specific figure is not convenient to disclose, but it can certainly achieve real-time response rather than taking several seconds per frame.
Q: What optimizations have been made in terms of algorithms and training for different chip platforms in the VLA model? Is there any forward-looking layout in the technical architecture?
A: The development and training of the VLA model itself are chip-independent; deployment adaptation happens after training is complete. Different chip platforms mainly affect the engineering workload of deployment and do not change the training method or model architecture.

Q: Does DeepRoute support multiple chip platforms, and if so, what is the specific scope? With the development of domestic chips (such as Horizon) and automakers’ self-developed chips, can these all be adapted? Can automakers specify chips?
A: Chip adaptation has certain requirements, such as basic computing power, bandwidth, etc. After model training is completed, distillation and quantization will be performed, and adaptation must meet basic conditions. Automakers can specify chip needs during cooperation, and the costs of adaptation (time, funds, data) are negotiable. We currently start with a certain chip, and will support more chips in the future, not limited to just one company.
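For readers unfamiliar with the “distill, then quantize, then adapt to the chip” step mentioned above, here is a generic PyTorch illustration of post-training quantization; the toy model is a stand-in, and this is not DeepRoute's deployment toolchain.

```python
# Generic post-training quantization example (PyTorch), illustrating the
# "train once, then quantize for the target chip" flow described above.
import torch
import torch.nn as nn

# Stand-in for a trained driving model's decoder head (hypothetical sizes).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 64))
model.eval()

# Convert Linear layers to dynamically quantized INT8 versions.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model keeps the same interface but with smaller weights,
# which is what makes adaptation to compute- and bandwidth-limited chips feasible.
x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 64])
```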
Q: In the industry, it seems that only DeepRoute and Li Auto clearly follow the VLA route. Some also believe that while large language models are strong in text reasoning, they are not proficient in spatial perception. What is your view on this?
A: More accurately, VLA is essentially an “end-to-end model based on GPT.” Companies currently insisting on investing in massive computing power, including Xpeng, are actually moving in this direction. For example, Tesla’s latest chip has a computing power of up to 2,500 TOPS, which would not be necessary for CNN models; only the GPT architecture requires such high parameters and computing power. CNN model parameters are limited, whereas the GPT architecture is naturally suited for expansion, which is the direction of the future.
Q: Regarding voice-controlled cars, you mentioned it is a basic function. So what is truly difficult in the VLA model?
A: The most difficult aspects are the Chain of Thought (CoT) and long-sequence reasoning, which are the core capabilities of VLA.
Q: Can VLA model quality be evaluated through the performance of the Chain of Thought?
A: The Chain of Thought is a basic requirement of this architecture; without it, a model cannot be considered VLA. The industry does not yet have a unified evaluation benchmark the way NLP does, but a dedicated benchmark built around physical scenes may be established in the future.
Q: Can the advantages and disadvantages of a VLA model be judged intuitively from the in-car interface?
A: Currently, we are still focused on solving the 0 to 1 problem. Tesla’s interaction is already very mature, but we need to ensure the core capabilities are implemented before optimizing user experience.
Q: What size of model can actually run on the car end?
A: The parameter count is not convenient to disclose at this time. However, due to limitations of automotive-grade computing power and power consumption, even GPT models installed in cars still belong to the “small model” category.
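A back-of-envelope estimate of why a model in the single-digit-billion-parameter range can respond at a few Hertz on an automotive chip; the 7B parameter count (mentioned by Zhou later only as an example), token budget, and frame rate below are illustrative assumptions, not disclosed figures.

```python
# Rough compute estimate for on-vehicle VLA inference (illustrative assumptions).
params = 7e9                   # assume a 7B-parameter model
flops_per_token = 2 * params   # ~2 FLOPs per parameter per generated token
tokens_per_frame = 200         # assumed decoded tokens per planning step
frames_per_second = 5          # "a few Hertz"

required_flops = flops_per_token * tokens_per_frame * frames_per_second
print(f"{required_flops / 1e12:.0f} TFLOPs/s")  # ~14 TFLOPs/s

# Even at low utilization and with prefill overhead, this sits comfortably
# within the thousand-TOPS class of automotive chips discussed elsewhere in
# the interview, which is why a "small" GPT-style model can still respond
# at a few frames per second.
```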
Q: Do VLA models also experience hallucinations, and how can such risks be reduced?
A: Hallucinations can occur during the pre-training phase, but post-training alignment techniques greatly mitigate this phenomenon. Mainstream large models (e.g., Doubao, Qianwen) now experience very few hallucinations, and there are already good solutions in place.
Q: As technologies like VLA and VLM progress, baseline capabilities for assisted driving improve across the board. Is there a risk of different companies’ solutions converging? How does DeepRoute maintain its differentiation?
A: End-to-end technology does indeed carry a convergence risk, with differences mainly in the pace of progress. DeepRoute laid out its defensive-driving work early, emphasizing this direction half a year ago. Accurate technological judgment is crucial, especially in a field as broad as VLA.

Q: Is the VLA model’s current frame rate, lower than some end-to-end solutions (10 – 20 frames per second), a limitation at this stage? Are there ways to compensate?
A: The impact on frame rate is essentially a latency issue. Reducing from 100 milliseconds to 50 milliseconds shows obvious benefits. It’s normal for the initial VLA frame rate to be slightly lower. A higher frame rate isn’t always better; enhanced predictive capabilities can also compensate for frame rate limitations.
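One way to read the “prediction compensates for frame rate” point: if the planner knows its own inference latency, it can plan from a state extrapolated to the moment the command will actually take effect. The sketch below is an editorial illustration using a simple constant-velocity prediction, not DeepRoute's method.

```python
# Illustrative latency compensation: plan from a predicted future state
# instead of the stale state observed when inference started.
from dataclasses import dataclass

@dataclass
class EgoState:
    position_m: float   # longitudinal position along the lane
    speed_mps: float

def extrapolate(state: EgoState, latency_s: float) -> EgoState:
    # Constant-velocity prediction over the model's inference latency.
    return EgoState(position_m=state.position_m + state.speed_mps * latency_s,
                    speed_mps=state.speed_mps)

observed = EgoState(position_m=0.0, speed_mps=15.0)
# A 200 ms inference step (5 Hz) means the car moves ~3 m before the
# command lands; planning against the extrapolated state absorbs that gap.
print(extrapolate(observed, latency_s=0.2))
```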
Q: What breakthroughs might continuous improvement of VLA’s Reasoning capability bring in the future?
A: VLA hasn’t fully achieved the chain-of-thought (COT) yet, which is a key gap. In the long run, language and reasoning capabilities are essential for full unmanned autonomous driving. For example, when encountering temporary signage like “left turn without light control,” relying on map updates isn’t enough; it needs to be understood in real-time upon the first encounter. VLA has a long way to go on this path, requiring more technical accumulation. Tesla invests ten times the computing power and parameters because the GPT architecture is a clear direction, whereas CNN can’t support this kind of expansion.
Q: What is the lowest price range of vehicles that the VLA model system can adapt to? Which models can apply it?
A: Currently, vehicles above 150,000 yuan can be adapted, while models in the 100,000 yuan range also have a chance with optimization. End-to-end solutions cost less, whereas the VLA model currently relies more on computing power. In terms of sensors, 11 cameras are becoming a standard configuration, and Tesla insists on a pure vision approach. The industry overall is improving computing power, and the next generation of chips will reach 5,000 TOPS, with even 10,000 TOPS not far off.
Q: How much more expensive is the VLA model compared to end-to-end solutions? Is there a significant cost difference?
A: The main cost difference lies in the chip; other parts are basically the same. Chip cost depends on the process node. Chip computing power has entered the thousand-TOPS era; Tesla’s 2,500 TOPS chip, for example, reaches 5,000 TOPS with dual chips.
Q: At the last motor show, you mentioned that the VLA model is not only for vehicles but will also extend to robots. Could you share more? Are they humanoid robots or unmanned? Are there any partnerships? Are the VLA models for cars and robots the same system?
A: Yes, the VLA model itself is a universal architecture, no longer tailored to specific scenarios. As our RoadAGI strategy released earlier this year states, this technology could be generalized to various mobility scenarios in the future—including neighborhoods, elevators, offices, and other indoor and outdoor environments. Many current robots still rely on remote control or line following technology, but we aim to achieve true autonomous, universal mobility.
Q: What score would you give the current version (out of 10)? What is the biggest challenge?
A: Personally, I’d give it a 6, just passing. The VLA model is still in its early stage, equivalent to a “youth phase,” but its potential ceiling is much higher than end-to-end solutions. A new generation of architecture requires new generation chip support, which is incomparable to the CNN era.
Q: Can defensive driving be achieved without the VLA architecture? Is VLA necessary?
A: Statistical methods can partially achieve defensive strategies, but complex scenarios require true reasoning capabilities. VLA, with its CoT and language reasoning, can more thoroughly solve these issues. BEV has inherent limitations in spatial understanding.
How is VLA trained?
Q: Is DeepRoute’s foundational model for VLA based on Qianwen?
A: We use various models for distillation; Qianwen is one of the stronger open-source models. We have also experimented with solutions based on Qianwen as well as our own distillation approaches. So we do not rely solely on any one model; there are elements from Qianwen, but the result is not identical to it.
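For context on the distillation mentioned here, the sketch below shows the standard knowledge-distillation loss, in which a smaller student model is trained to match the softened output distribution of a larger teacher (such as an open-source LLM); it is a generic illustration, not DeepRoute's recipe.

```python
# Generic knowledge-distillation loss (illustrative; not DeepRoute's recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then match the student to the teacher.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

student = torch.randn(4, 32000)   # student vocab logits (hypothetical sizes)
teacher = torch.randn(4, 32000)   # frozen teacher logits
print(distillation_loss(student, teacher))
```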
Q: You didn’t mention cloud-based world models and simulation data. The industry is generally going down the simulation path; how does DeepRoute address the resource constraints of inference cards?
A: The fundamental difference between VLA and the first generation of end-to-end methods lies in the model architecture shift—from CNN to GPT. Training methods, such as whether to incorporate RL, are merely strategic considerations. The CNN architecture itself cannot achieve reasoning and generalization capabilities akin to humans.
Q: Where do the training data come from? Are they from proprietary test fleets and Great Wall?
A: The data source is multifaceted: it includes proprietary test fleets, mass-produced vehicle data, and generated data. To achieve pre-training in the GPT architecture, reliance on large-scale, diverse datasets is necessary, which CNN models cannot fulfill.
Q: Regarding the training resource demands of VLA models, some vendors say tens of thousands of GPUs are needed. How does DeepRoute view this substantial resource consumption? Will it create cost pressure? And why is the industry currently emphasizing reinforcement learning in AI training?
A: Reinforcement learning is merely a means of model training, part of the post-training phase. The industry has entered the post-training era, but this is not something to overemphasize, just as GPT or Waymo would not stress reinforcement learning alone. DeepRoute has always been precise in its technical choices; VLA is a new field with many directions, and with clear technical judgment, resource consumption can be made far more efficient. In fact, the scale of GPT models for assisted-driving scenarios is relatively manageable; a 7B model, for example, does not require extremely large computing power.
Q: In terms of simulation testing, some companies have significantly reduced real-vehicle testing and increased simulation mileage. Is this the industry trend?
A: We are more focused on our technical roadmap. Simulation is one source of data; the key issue is not whether it is real or simulated, but the quality of the data. High-quality datasets are the core of model optimization.
Q: In the long term, what proportion of simulation data will be used in training? Will the ability to generate simulation data become a barrier?
A: Simulation needs to be based on real data; otherwise it cannot mimic reality effectively. Real-world data remains predominant, with simulation serving as a supplement. From the pre-training to the post-training phase, the proportion of simulation will gradually increase. The industry should watch the overall development of large models rather than confining itself to the autonomous-driving field; the essence of the technology is interconnected, much as the structure of neurons varies little across the human brain.

Views on Industry Trends
Q: Recently, Musk mentioned that “lidar will make autonomous driving increasingly unsafe.” What is your take on this?
A: Lidar currently plays an important role in general obstacle recognition. As mentioned earlier, the capabilities of large model knowledge bases can identify many unknown obstacles. I believe that with the development of large model technology, vision will increasingly play a pivotal role in perception. In the short term, lidar, being limited by technology development and dataset maturity, still has its value. In the long term, large models are expected to gradually resolve tasks that currently rely on lidar.
Q: How do you view other car manufacturers launching VLA models, such as Xpeng? What are DeepRoute’s differentiating advantages?
A: Xpeng’s VLA progress is not bad; they have produced tangible results based on the Qianwen model. The coverage of VLA is extensive, requiring precise technical judgment and continuous accumulation, unlike end-to-end which is more straightforward.
Q: From rule-based algorithms to end-to-end 1.0 to VLA models, if car enterprises or suppliers want to develop their own assisted driving systems now, can they directly move into VLA? Is it necessary to go through previous R&D stages completely? Did you foresee the limitations of end-to-end when you were developing it?
A: No stage can be skipped. From HD-map-based, to map-free, to end-to-end, to VLA models, the entire development process is indispensable; at most some phases can be shortened, but none can be bypassed entirely. As for the limitations of the VLA model, its current baseline has already surpassed the ceiling of end-to-end solutions.
Q: In recent years, the support for Transformer models by smart driving chips launched by domestic and international manufacturers has not been very good. Since VLA is a GPT-based E2E architecture, does this mean that when developing advanced intelligent assisted driving chips in the future, manufacturers must not only provide computing power of thousands of TOPS but also support Transformer models as a core design criterion?
A: Indeed. Early chips were designed primarily for CNNs; future chips will certainly offer better support for Transformers, especially optimizations for FP4 and FP6 precision.
Q: What do you think about Huawei not following the VLA route?
A: If computing power is insufficient, other paths might be chosen. However, to truly achieve chain of thought (CoT), the VLA direction is still necessary.
Q: As an industry participant, how can we collectively enlarge the smart driving pie? What support is required beyond technology?
A: Promotion needs to be rational, avoiding overpromising, especially concerning safety. Technological development takes time, and it is important to correctly manage user expectations. Regulation and industry self-discipline are also crucial.
Q: Will DeepRoute participate in the L4 competition? How is the current progress?
A: The traditional classification of autonomous driving levels is outdated. True driverless requires reasoning capabilities; pure rule-based systems cannot solve issues like “left turns allowed on red lights.”
This article is a translation by AI of a Chinese report from 42HOW. If you have any questions about it, please email bd@42how.com.