Our open-access dataset, ApolloScape, is part of the Apollo project, the open-source autonomous driving platform initiated by Baidu Inc. in 2017. To capture the static 3D world with high granularity, we used a mobile LIDAR scanner from Riegl to collect point clouds, which yields point cloud densities much higher than those from the Velodyne scanner used by KITTI. In addition, two high-resolution cameras at the front of the collection car were synchronized and calibrated, recording at a frame rate of 30 fps. Each camera is paired with a high-precision GPS and IMU, so that accurate camera poses are recorded on the fly. All our videos were recorded in cities in China, e.g. Beijing, Shanghai and Shenzhen.

In Tab. 1, we compare the properties of our dataset with those of existing ones. In particular, we provide high-quality 3D labels from realistic scenes for both the static background and moving objects. So far, we have labeled 50K images covering around 10 km of road at three sites in three cities. Moreover, each area was scanned repeatedly under various weather and lighting conditions. Finally, ApolloScape will be an evolving dataset, and labeled data from new cities will be added monthly. For the challenge, we plan to provide at least 200K images covering 20 km of road at five sites in three cities. In the following, we introduce the details of each challenge, all of which specifically target autonomous driving.


For all the challenges, in addition to testing accuracy (which we will use to rank the algorithms), we require participants to specify the speed of their algorithm, the platform they use, and implementation details. We encourage algorithms that run in real time, i.e. at 30 fps, and will highlight them on the leaderboard, since speed is a crucial property for practical applications.

Task 1: Vision-based fine-grained lane markings segmentation
Accurate High Definition (HD) maps with lane markings usually serve as the navigation back-end for all commercial autonomous vehicles. Currently, most HD maps are constructed manually by human labelers. In this challenge, we require participants to develop algorithms that extract all basic road elements from RGB image frames. The segmentation results can be used directly for constructing or updating HD maps. The task is challenging because lane markings are often unclear and traffic is often busy. All ground truth is generated by projecting the annotated 3D point cloud into the corresponding 2D camera views. In addition, occlusion has been carefully handled by removing all movable obstacles before annotation. Participants can also access the corresponding videos, calibrated camera parameters and relative camera poses to help with the segmentation task. We illustrate the labeled 3D lane marks and the generated 2D ground truth in Fig. 1 and Fig. 2, respectively. Training and validation data for lane-marking segmentation are already available at ApolloScape for two road scenes.
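The projection of annotated 3D points into a 2D camera view can be sketched as below. This is only an illustrative pinhole-camera sketch, not the dataset's actual tooling: the function name `project_points`, the world-to-camera convention for `R` and `t`, and the absence of lens distortion are all assumptions made for the example.

```python
def project_points(points_world, R, t, fx, fy, cx, cy, width, height):
    """Project 3D world points into pixel coordinates with a pinhole model.

    Assumptions (hypothetical, for illustration only):
      - R (3x3 nested list) and t (length 3) map world to camera coordinates
      - fx, fy, cx, cy are the calibrated intrinsics, with no lens distortion
    Returns a list of (u, v) for points in front of the camera and inside the image.
    """
    pixels = []
    for X in points_world:
        # Transform the point into the camera frame: x_cam = R * X + t
        xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:
            continue  # point lies behind the camera
        # Perspective division followed by the intrinsic mapping
        u = fx * xc[0] / xc[2] + cx
        v = fy * xc[1] / xc[2] + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels
```

In practice each projected point would carry its lane-marking label, and the per-pixel ground truth would be rasterized from those labeled projections.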
Evaluation will use per-pixel Intersection over Union (IoU), computed over the whole image. Please check the API for detailed information.
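As a rough illustration of per-pixel IoU, the sketch below computes the IoU of each class from a predicted and a ground-truth label map. The function names, the `ignore_label` value and the way classes are averaged are assumptions for this example; the official API defines the exact protocol.

```python
def per_class_iou(pred, gt, num_classes, ignore_label=255):
    """Per-class IoU between two label maps given as equally sized 2D lists.

    ignore_label is a hypothetical "void" id; ground-truth pixels with this
    value are skipped. Ground-truth ids are assumed to lie in [0, num_classes).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == ignore_label:
                continue
            if p == g:
                inter[g] += 1
                union[g] += 1
            else:
                # A mismatch enlarges the union of both classes involved
                union[g] += 1
                if 0 <= p < num_classes:
                    union[p] += 1
    # Classes absent from both maps have no defined IoU
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


def mean_iou(ious):
    """Average the defined per-class IoU values."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0
```

For example, `per_class_iou([[0, 1], [1, 1]], [[0, 1], [0, 1]], num_classes=2)` gives an IoU of 1/2 for class 0 and 2/3 for class 1.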
Figure 3: Illustration of 3D semantic maps and ground truth camera poses (yellow tetrahedron).
Task 2: Self-localization on-the-fly
Vision-based self-localization, i.e. estimating the 6-DoF camera pose from images or videos, has great potential for reducing cost compared to using LIDAR, but it remains challenging and is still an open research problem. The accuracy of a published state-of-the-art vision-based algorithm that we tested, VidLoc 4, is still far from the requirements of industrial applications, which demand an error of less than 15 cm. In addition, self-localization should run in real time and on the fly.
In this challenge, participants will be given training videos of a large scene with ground truth 6-DoF camera poses. Test videos will be recorded in the same scene, but at different times and under different traffic, weather and lighting conditions. The goal is to estimate the camera pose at each time frame. Participants will also have access to the semantically labeled 3D point cloud to assist their localization. Our metric will be the same as in PoseNet [1] and DeLS-3D [8].
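PoseNet [1] reports the median translation error and the median rotation error over all test frames. A minimal sketch of that kind of metric follows, assuming each pose is given as a position `(x, y, z)` and a quaternion `(w, x, y, z)`; this representation and the function names are assumptions for illustration, and the provided API remains the reference for the exact evaluation.

```python
import math
from statistics import median


def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions."""
    return math.dist(t_est, t_gt)


def rotation_error_deg(q_est, q_gt):
    """Angle in degrees between two rotations given as quaternions (w, x, y, z)."""
    def normalize(q):
        n = math.sqrt(sum(c * c for c in q))
        return [c / n for c in q]

    a, b = normalize(q_est), normalize(q_gt)
    # q and -q encode the same rotation, hence the absolute value
    d = abs(sum(x * y for x, y in zip(a, b)))
    d = min(1.0, d)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(d))


def median_pose_errors(estimates, ground_truth):
    """Median translation and rotation errors over paired (position, quaternion) poses."""
    t_errs = [translation_error(e[0], g[0]) for e, g in zip(estimates, ground_truth)]
    r_errs = [rotation_error_deg(e[1], g[1]) for e, g in zip(estimates, ground_truth)]
    return median(t_errs), median(r_errs)
```

The translation error is in the units of the input positions, so positions in meters give an error that can be compared directly with the 15 cm requirement.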
In Fig. 3, we highlight the ground truth camera poses. All ground truth poses are already available with the released data at ApolloScape. You may train and validate your algorithms with our provided API.
Task 3: 3D car instance understanding
For self-driving cars, it is important to detect other vehicles, pedestrians, riders, etc. The system must understand the 3D relationship of each object in each image frame, especially those surrounding or near the self-driving vehicle. In this challenge, the algorithm is required to detect the cars in a given video and, from a single image, reconstruct and estimate their 3D shape. Two labeled images are illustrated in Fig. 4. A sample dataset of 1000 images has been released.
We will evaluate based on average precision (AP) with respect to the recovered vehicles' 3D bounding boxes, 3D shape and pose. This is similar to 2D instance detection and segmentation, but in 3D space. Please check the provided API for detailed information.
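To illustrate how an AP score is typically computed, the sketch below ranks detections by confidence and integrates the precision-recall curve. It assumes each detection has already been marked as a true or false positive by some matching criterion; the actual 3D matching criteria (bounding box, shape and pose thresholds) are defined by the provided API and are not reproduced here. The function name and the all-point interpolation are choices made for this example.

```python
def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one set of detections.

    scores: confidence of each detection
    is_true_positive: whether each detection matched a ground-truth car
                      (the matching itself is assumed to be done elsewhere)
    num_gt: total number of ground-truth cars
    """
    if num_gt == 0:
        return 0.0
    # Visit detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing as recall grows (all-point interpolation)
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum precision weighted by each increase in recall
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A benchmark of this kind would usually average such AP values over several matching thresholds, but that choice is left to the official evaluation code.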

[1] Kendall Alex, Grimes Matthew, Cipolla Roberto, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization 2015, ICCV

[2] Brostow Gabriel J, Fauqueur Julien, Cipolla Roberto, Semantic object classes in video: A high-definition ground truth database 2009, PR

[3] Geiger Andreas, Lenz Philip, Urtasun Raquel, Are we ready for autonomous driving? The KITTI vision benchmark suite 2012, CVPR

[4] Cordts Marius, Omran Mohamed, Ramos Sebastian, Rehfeld Timo, Enzweiler Markus, Benenson Rodrigo, Franke Uwe, Roth Stefan, Schiele Bernt, The Cityscapes Dataset for Semantic Urban Scene Understanding 2016, CVPR

[5] Wang Shenlong, Bai Min, Mattyus Gellert, Chu Hang, Luo Wenjie, Yang Bin, Liang Justin, Cheverie Joel, Fidler Sanja, Urtasun Raquel, TorontoCity: Seeing the World with a Million Eyes 2017, ICCV

[6] Ros German, Sellart Laura, Materzynska Joanna, Vazquez David, Lopez Antonio, The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes 2016, CVPR

[7] Richter Stephan R, Hayder Zeeshan, Koltun Vladlen, Playing for benchmarks 2017, ICCV

[8] Wang Peng, Yang Ruigang, Cao Binbin, Xu Wei, Lin Yuanqing, DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map 2018, CVPR