Market Report
Product Code
1613816
China's End-to-End (E2E) Autonomous Driving Industry, 2024-2025 (End-to-end Autonomous Driving Industry Report, 2024-2025)
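There are two types of E2E autonomous driving: global (one-stage) and segmented (two-stage). The former is conceptually clear and far cheaper to research and develop than the latter, because it needs no manually annotated datasets and instead relies on multimodal foundation models developed by Google, Meta, Alibaba and OpenAI. Building on the work of these large companies, global E2E autonomous driving performs far better than segmented E2E autonomous driving, but its deployment cost is very high.
Segmented E2E autonomous driving still uses a traditional CNN backbone network to extract features for perception and adopts E2E path planning. Its performance falls short of global E2E autonomous driving, but its deployment cost is lower. Even so, that cost remains very high compared with the current mainstream traditional "BEV+OCC+decision tree" solution.
Waymo EMMA, a representative of global E2E autonomous driving, uses no backbone network and feeds video directly into the multimodal foundation model at its core. UniAD is a representative of segmented E2E autonomous driving.
E2E autonomous driving research falls mainly into two groups: research conducted in simulators such as CARLA, where planned commands can actually be executed, and research based on collected real-world data, centered on imitation learning and exemplified by UniAD. E2E autonomous driving is currently characterized by open-loop operation, so the effect of executing the model's own predicted commands cannot actually be observed. Without feedback, the evaluation of open-loop autonomous driving is necessarily very limited. The two metrics most often used in the literature are L2 distance and collision rate.
L2 distance: The L2 distance between the predicted trajectory and the true trajectory is calculated to judge the quality of the predicted trajectory.
Collision rate: The probability that the predicted trajectory collides with other objects is calculated to evaluate the safety of the predicted trajectory.
The greatest appeal of E2E autonomous driving is its potential for performance improvement. The earliest E2E solution is UniAD; a paper from late 2022 reported an L2 distance of 1.03 m, which fell sharply to 0.55 m by late 2023 and to 0.22 m by late 2024. Horizon Robotics is one of the most active companies in the E2E field, and its technology development reflects the overall evolution of the E2E route. After UniAD appeared, Horizon Robotics promptly proposed VAD, which is similar in concept to UniAD but performs better. Horizon Robotics then turned to global E2E autonomous driving. Its first result was HE-Driver, which has a relatively large number of parameters. Its successor, Senna, has fewer parameters and is one of the best-performing E2E solutions.
The core of some E2E systems is still BEVFormer, which by default uses vehicle CAN bus information. This includes explicit information on the vehicle's speed, acceleration and steering angle, and it strongly influences path planning. These E2E systems still require supervised training, so massive manual annotation is indispensable and data costs are very high. Moreover, since the concept is borrowed from GPT, why not use an LLM directly? With this in mind, Li Auto proposed DriveVLM.
DriveVLM's scenario description module consists of environment description and key object recognition. Environment description focuses on general driving conditions such as weather and road conditions, and covers four parts: weather, time, road type, and lane lines. Key object recognition finds the key objects that most affect the current driving decision.
This report surveys and analyzes China's E2E autonomous driving industry, providing an overview of autonomous driving technology and development trends together with information on domestic and overseas suppliers.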
End-to-end intelligent driving research: How Li Auto went from intelligent driving follower to leader
There are two types of end-to-end autonomous driving: global (one-stage) and segmented (two-stage). The former has a clear concept and a much lower R&D cost than the latter, because it requires no manually annotated datasets and instead relies on multimodal foundation models developed by Google, Meta, Alibaba and OpenAI. Because it stands on the shoulders of these technology giants, global end-to-end autonomous driving performs much better than segmented end-to-end autonomous driving, but at an extremely high deployment cost.
Segmented end-to-end autonomous driving still uses the traditional CNN backbone network to extract features for perception, and adopts end-to-end path planning. Although its performance is not as good as global end-to-end autonomous driving, it has lower deployment cost. However, the deployment cost of segmented end-to-end autonomous driving is still very high compared with the current mainstream traditional "BEV+OCC+decision tree" solution.
As a representative of global end-to-end autonomous driving, Waymo EMMA directly inputs videos without a backbone network but with a multimodal foundation model as the core. UniAD is a representative of segmented end-to-end autonomous driving.
Based on whether feedback can be obtained, end-to-end autonomous driving research falls into two main categories: research conducted in simulators such as CARLA, where the planned instructions can actually be executed; and research based on collected real-world data, mainly imitation learning, with UniAD as the reference. Most end-to-end autonomous driving work is currently open-loop, so the effect of executing the model's own predicted instructions cannot be observed. Without feedback, the evaluation of open-loop autonomous driving is very limited. The two metrics commonly used in the literature are L2 distance and collision rate.
L2 distance: The L2 distance between the predicted trajectory and the true trajectory is calculated to judge the quality of the predicted trajectory.
Collision rate: The probability of collision between the predicted trajectory and other objects is calculated to evaluate the safety of the predicted trajectory.
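The two open-loop metrics above can be illustrated with a minimal Python sketch. It assumes trajectories are lists of 2D waypoints sampled at the same timestamps, and it approximates a collision as a waypoint coming within a fixed radius of an obstacle position at the same step. The function names, the point-obstacle model and the 1.0 m radius are illustrative assumptions, not the evaluation protocol of any particular benchmark.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def average_l2(predicted: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """Mean Euclidean (L2) distance between matching waypoints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


def collides(predicted: Sequence[Point],
             obstacles_per_step: Sequence[Sequence[Point]],
             radius: float = 1.0) -> bool:
    """True if any waypoint is closer than `radius` to an obstacle at the same step."""
    for waypoint, obstacles in zip(predicted, obstacles_per_step):
        if any(math.dist(waypoint, obstacle) < radius for obstacle in obstacles):
            return True
    return False


def collision_rate(samples: Sequence[Tuple[Sequence[Point], Sequence[Sequence[Point]]]],
                   radius: float = 1.0) -> float:
    """Fraction of (trajectory, obstacles) samples in which a collision is flagged."""
    if not samples:
        raise ValueError("at least one sample is required")
    hits = sum(1 for trajectory, obstacles in samples
               if collides(trajectory, obstacles, radius))
    return hits / len(samples)


# Hypothetical data: one predicted trajectory, one recorded trajectory,
# and two scenes (one empty, one with an obstacle near the last waypoint).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
recorded = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.6)]
clear_scene: List[List[Point]] = [[], [], []]
blocked_scene: List[List[Point]] = [[], [], [(2.0, 0.5)]]

print(average_l2(predicted, recorded))
print(collision_rate([(predicted, clear_scene), (predicted, blocked_scene)]))
```

Both numbers are computed against recorded data only, which is why they say little about how the vehicle would behave if the predicted trajectory were actually executed.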
The most attractive aspect of end-to-end autonomous driving is its potential for performance improvement. The earliest end-to-end solution is UniAD. A paper from the end of 2022 reported an L2 distance of 1.03 meters; this fell sharply to 0.55 meters by the end of 2023 and further to 0.22 meters in late 2024. Horizon Robotics is one of the most active companies in the end-to-end field, and its technology development reflects the overall evolution of the end-to-end route. After UniAD came out, Horizon Robotics promptly proposed VAD, which is similar in concept to UniAD but performs much better. Horizon Robotics then turned to global end-to-end autonomous driving. Its first result was HE-Driver, which has a relatively large number of parameters. Its successor, Senna, has fewer parameters and is also one of the best-performing end-to-end solutions.
The core of some end-to-end systems is still BEVFormer, which by default uses vehicle CAN bus information, including explicit information about the vehicle's speed, acceleration and steering angle, and this information has a significant impact on path planning. These end-to-end systems still require supervised training, so massive manual annotation is indispensable, which makes the data cost very high. Furthermore, since the concept is borrowed from GPT, why not use an LLM directly? With this in mind, Li Auto proposed DriveVLM.
The scenario description module of DriveVLM consists of environment description and key object recognition. Environment description covers general driving conditions such as weather and road conditions, and includes four parts: weather, time, road type, and lane line. Key object recognition finds the key objects that have the greatest impact on the current driving decision.
Unlike the traditional autonomous driving perception module, which detects all objects, DriveVLM focuses on recognizing the key objects in the current driving scenario that are most likely to affect the driving decision, because detecting all objects would consume enormous computing power. Thanks to the open-source foundation model and pre-training on the massive autonomous driving data accumulated by Li Auto, the VLM can detect key long-tail objects, such as road debris or unusual animals, better than traditional 3D object detectors.
For each key object, DriveVLM outputs its semantic category (c) and the corresponding 2D bounding box (b). Pre-training originates in NLP foundation models, because annotated data in NLP is scarce and very expensive to produce. Pre-training first trains on massive unannotated data to learn the structural features of language, and then uses prompts as labels to solve specific downstream tasks through fine-tuning.
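As an illustration of the scenario description output discussed above (four environment fields plus a category c and 2D box b for each key object), the structure could be represented roughly as follows. This is a sketch for orientation only: the class names, the field names, the box layout and the `to_text` helper are hypothetical and are not taken from DriveVLM itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in image pixels.
Box2D = Tuple[float, float, float, float]


@dataclass
class EnvironmentDescription:
    """The four environment parts named in the text."""
    weather: str
    time: str
    road_type: str
    lane_line: str


@dataclass
class KeyObject:
    category: str  # semantic category c
    box: Box2D     # 2D bounding box b


@dataclass
class ScenarioDescription:
    environment: EnvironmentDescription
    key_objects: List[KeyObject] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the description into plain text, one item per line."""
        env = self.environment
        lines = [
            f"Weather: {env.weather}",
            f"Time: {env.time}",
            f"Road type: {env.road_type}",
            f"Lane line: {env.lane_line}",
        ]
        for obj in self.key_objects:
            lines.append(f"Key object: {obj.category} at {obj.box}")
        return "\n".join(lines)


# Hypothetical scene with a single long-tail key object.
scene = ScenarioDescription(
    environment=EnvironmentDescription(
        weather="rainy", time="night", road_type="urban road",
        lane_line="ego lane with a left lane available"),
    key_objects=[KeyObject(category="road debris", box=(210.0, 300.0, 260.0, 340.0))],
)
print(scene.to_text())
```

Keeping only a short list of key objects, rather than every detected object, mirrors the computing-power argument made in the text.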
DriveVLM completely abandons the traditional BEVFormer algorithm as its core and adopts a large multimodal model instead. Li Auto's DriveVLM builds on Alibaba's Qwen-VL foundation model, with up to 9.7 billion parameters and a 448×448 input resolution, and runs inference on NVIDIA Orin.
How does Li Auto transform from a high-level intelligent driving follower into a leader?
At the beginning of 2023, Li Auto was still a laggard in the NOA arena. It began to devote itself to R&D of high-level autonomous driving in 2023, accomplished multiple NOA version upgrades in 2024, and launched all-scenario autonomous driving from parking space to parking space in late November 2024, thus becoming a leader in mass production of high-level intelligent driving (NOA).
Looking back at the development of Li Auto's end-to-end intelligent driving, in addition to drawing on data from its own hundreds of thousands of users, the company worked with a number of partners on R&D of end-to-end models. DriveVLM is the result of cooperation between Li Auto and Tsinghua University.
In addition to DriveVLM, Li Auto launched STR2 with Shanghai Qi Zhi Institute, Fudan University and others, proposed DriveDreamer4D with GigaStudio, the Institute of Automation of the Chinese Academy of Sciences and others, and unveiled MoE with Tsinghua University.
Mixture of Experts (MoE) Architecture
To address the excessive parameter count and computation of foundation models, Li Auto has worked with Tsinghua University to adopt an MoE architecture. Mixture of Experts (MoE) is an ensemble learning method that combines multiple specialized sub-models ("experts") into one complete model. Each expert contributes in the area in which it excels, and the mechanism that decides which expert takes part in answering a given question is called the "gating network". Because each expert can focus on a specific sub-problem, the overall model can achieve better performance on complex tasks. MoE is well suited to large datasets and copes effectively with massive data and complex features, because it can handle different sub-tasks in parallel, make full use of computing resources, and improve the training and inference efficiency of the model.
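The gating idea can be sketched in a few lines of plain Python. The sketch assumes a linear gate followed by a softmax and top-k expert selection, which is one common MoE formulation. The toy experts, the gate weights and the `moe_forward` name are hypothetical and do not describe the architecture Li Auto and Tsinghua University actually use.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Sequence[float],
                gate_weights: Sequence[Sequence[float]],
                experts: Sequence[Expert],
                top_k: int = 1) -> Tuple[Vector, List[int]]:
    """Route x to the top_k experts chosen by a linear gating network."""
    if len(gate_weights) != len(experts):
        raise ValueError("one gate weight row is required per expert")
    if not 1 <= top_k <= len(experts):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Gate score for expert i is the dot product of its weight row with x.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output: Vector = []
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the selected experts only
        contribution = [weight * v for v in experts[i](x)]
        output = contribution if not output else [a + b for a, b in zip(output, contribution)]
    return output, chosen


# Two toy experts standing in for specialized sub-models.
def double(x: Sequence[float]) -> Vector:
    return [2.0 * v for v in x]


def negate(x: Sequence[float]) -> Vector:
    return [-v for v in x]


gate = [[1.0, 0.0], [0.0, 1.0]]  # expert 0 responds to x[0], expert 1 to x[1]
print(moe_forward([3.0, 1.0], gate, [double, negate]))
print(moe_forward([0.0, 2.0], gate, [double, negate]))
```

With `top_k` smaller than the number of experts, only the selected experts are evaluated for a given input, which is where the savings in computation come from.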
STR2 Path Planner
STR2 is a motion planning solution based on Vision Transformer (ViT) and MoE. It was developed by Li Auto together with researchers from Shanghai Qi Zhi Institute, Fudan University and other universities and institutions.
STR2 is designed specifically for the autonomous driving field to improve generalization capabilities in complex and rare traffic conditions.
STR2 is an advanced motion planner that combines a Vision Transformer (ViT) encoder with an MoE causal transformer architecture to learn from and plan effectively in complex traffic environments.
The core idea of STR2 is to use MoE to address modality collapse and reward balancing through expert routing during training, thereby improving the model's generalization in unknown or rare situations.
DriveDreamer4D World Model
In late October 2024, GigaStudio teamed up with the Institute of Automation of the Chinese Academy of Sciences, Li Auto, Peking University, Technical University of Munich and other institutions to propose DriveDreamer4D.
DriveDreamer4D uses a world model as a data engine to synthesize new trajectory videos (e.g., lane change) based on real-world driving data.
DriveDreamer4D can also provide rich and diverse viewpoint data (lane change, acceleration, deceleration, etc.) for driving scenarios, strengthening closed-loop simulation capabilities in dynamic driving scenarios.
The overall structure diagram is shown in the figure. The novel trajectory generation module (NTGM) adjusts the original trajectory actions, such as steering angle and speed, to generate new trajectories. These new trajectories provide a new perspective for extracting structured information (e.g., vehicle 3D boxes and background lane line details).
Subsequently, based on the video generation capabilities of the world model and the structured information obtained by updating the trajectories, videos of new trajectories can be synthesized. Finally, the original trajectory videos are combined with the new trajectory videos to optimize the 4DGS model.
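The trajectory adjustment step performed by the NTGM can be pictured with a small sketch: take a recorded sequence of (speed, steering angle) actions, perturb it, and roll it out into a new path. The kinematic bicycle model, the 2.8 m wheelbase, the 0.1 s time step and the function names are assumptions chosen for illustration, and this is not the DriveDreamer4D implementation.

```python
import math
from typing import List, Sequence, Tuple

Action = Tuple[float, float]       # (speed in m/s, steering angle in rad)
Pose = Tuple[float, float, float]  # (x in m, y in m, heading in rad)


def rollout(actions: Sequence[Action], dt: float = 0.1, wheelbase: float = 2.8,
            start: Pose = (0.0, 0.0, 0.0)) -> List[Pose]:
    """Integrate a kinematic bicycle model to turn actions into poses."""
    x, y, heading = start
    poses = [start]
    for speed, steer in actions:
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += speed / wheelbase * math.tan(steer) * dt
        poses.append((x, y, heading))
    return poses


def perturb_actions(actions: Sequence[Action], speed_scale: float = 1.0,
                    steer_offset: float = 0.0, steer_steps: int = 0) -> List[Action]:
    """Scale every speed and add a steering offset to the first `steer_steps` actions."""
    adjusted = []
    for index, (speed, steer) in enumerate(actions):
        extra = steer_offset if index < steer_steps else 0.0
        adjusted.append((speed * speed_scale, steer + extra))
    return adjusted


# Hypothetical recording: 2 s of straight driving at 10 m/s.
original = [(10.0, 0.0)] * 20
base = rollout(original)
shifted = rollout(perturb_actions(original, steer_offset=0.05, steer_steps=5))
slower = rollout(perturb_actions(original, speed_scale=0.5))

print(base[-1])
print(shifted[-1])
print(slower[-1])
```

Each perturbed rollout yields ego poses that differ from the recording, which is the kind of new viewpoint from which structured information and new trajectory videos would then be derived.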