Data Challenges in Autonomous Driving and the Need for Data Tags

Autonomous driving has taken the world by storm in recent years. While self-driving cars are still in their infancy, they hold the promise of a future where commuting is safer, faster, and more convenient. However, the development and deployment of autonomous driving features have been hampered by several challenges, chief among them being the need for vast amounts of data to train, develop, test, and retrain such features. In this blog, we will examine the data challenges in developing and validating autonomous driving technology and the need for Drive Data Tags. 

Before we dive into the data challenges faced by autonomous driving, it is essential to understand the different levels of autonomy in cars. The Society of Automotive Engineers (SAE) has defined six levels of driving automation, ranging from L0 (no automation) to L5 (a fully autonomous vehicle).

For example, L3 vehicles can drive themselves on highways, but the driver needs to take over when exiting the highway or navigating through a city. L4 vehicles can drive without human intervention in specific conditions, such as on predetermined routes or in designated areas. L5 represents full autonomy, where the vehicle can operate in any condition without human intervention.

Real-world incidents caused by failures in AD deployment

In recent years, several reported incidents have involved autonomous driving (AD) features that failed, either because of perception issues or because the deployed system did not function as expected, resulting in accidents or near-misses. Some of the reported incidents include:

  • In 2018, a pedestrian was hit and killed by a self-driving car in Arizona. The car failed to detect the pedestrian, and the backup driver was distracted and failed to intervene.  
  • Another example is the incident that occurred in Chandler, Arizona, in May 2018, where a Waymo self-driving car was involved in a collision with a motorcyclist. The cause of this accident was identified as a perception issue, where the car’s sensors failed to detect the motorcyclist.  
  • In 2022, a Zoox self-driving car rear-ended a tractor-trailer.
  • The 2022 AV disengagement report from the California DMV lists the documented AD system disengagements of vehicles licensed for AD testing. Many entries cite perception issues in detecting objects and their state: the ADS incorrectly detecting the state of a traffic signal, which resulted in a motion plan that required disengagement by the safety driver; stop sign and crosswalk overruns; lane exit and merge failures; objects that the perception system failed to detect correctly; and lane markers that it failed to identify or classify by type. Most of the root causes stem from incorrect perception detection and classification outputs.

These incidents show the complexity and scale of challenges of deploying perception algorithms in current autonomous driving features and the need for extended development and testing before they can be deployed on the roads. 

It is also essential that vehicles already deployed actively register the failure cases they encounter in the field, so that the risk can be mitigated and the affected features can be redeveloped, tested, and redeployed. Tesla, for example, has described its framework for tackling this issue and the scale of automation the process needs.

Need for Massive Real-World Diverse Data Collection

Developing Level 3 (L3), Level 4 (L4), and Level 5 (L5) AD functionality requires large, diverse, high-quality datasets to train, develop, and test autonomous driving features, and these datasets need to be comprehensive enough to capture the complexities of real-world driving scenarios. Data collection involves capturing data from various sensors and cameras mounted on the vehicle; this data must then be curated and selected for accurate labeling before it can be used to train the algorithms.

The cost of collecting data is significant. It involves planning the routes and timing of the collection drives, transferring the collected data from the test vehicles to cloud environments or data farms, and setting up data organization and storage while adhering to privacy laws. Even at the end of the data collection planning phase, you cannot yet be sure that you will have the completeness and balance of data needed to develop and test your function algorithms.

The downstream challenges of data collection 

The challenge with massive data collection is that not all data is useful or relevant. The sheer volume can be overwhelming, and it is hard to sift through it to find the most critical information or to confirm that the data needed to curate a dataset for development and testing is complete. An intermediate step is therefore needed to understand the quality and composition of the data before a well-balanced selection is sent for labeling. This step saves time and cost in the downstream work of labeling, training, and deployment. Without it, all collected data, including a large share of redundant data, is stored and transferred only to be discarded later, or a new collection cycle becomes necessary, and either outcome increases development costs significantly.

It’s worth noting that the usefulness of data can vary widely depending on factors such as the conditions under which the data was collected, the type of environment the vehicle was driving in, and the accuracy of the sensors used. Therefore, it’s important to carefully evaluate and curate the data before using it for training algorithms. 

How to resolve data collection challenges 

A systematic way to resolve these problems is to set up an intermediate step between the vast data that has been collected and the data that will be used for labeling, development, and validation. This step should provide data insights that let a user filter the vast collected datasets down to the critical data needed to progress the development and validation of autonomous driving functionality, and that show data collection planners where further collection is required. This is where a data tagging solution helps.
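
As a rough illustration of the kind of insight such an intermediate step could provide, the sketch below counts how many recorded clips fall into each combination of two descriptive attributes and reports the combinations that are under-represented. The clip records, the attribute names (`weather`, `lighting`), and the `min_clips` threshold are hypothetical and chosen only for this example; a real pipeline would read this information from its own data catalog.

```python
from collections import Counter
from itertools import product

# Hypothetical clip metadata, invented for illustration only.
clips = [
    {"id": "clip_001", "weather": "clear", "lighting": "day"},
    {"id": "clip_002", "weather": "clear", "lighting": "day"},
    {"id": "clip_003", "weather": "rain", "lighting": "day"},
    {"id": "clip_004", "weather": "clear", "lighting": "night"},
]


def coverage_gaps(clips, attributes, values, min_clips):
    """Return the attribute combinations that have fewer than min_clips clips."""
    # Count how many clips exist for each observed combination.
    counts = Counter(tuple(clip[a] for a in attributes) for clip in clips)
    # Enumerate every combination we would like to have covered.
    wanted = product(*(values[a] for a in attributes))
    return {combo: counts[combo] for combo in wanted if counts[combo] < min_clips}


gaps = coverage_gaps(
    clips,
    attributes=("weather", "lighting"),
    values={"weather": ["clear", "rain"], "lighting": ["day", "night"]},
    min_clips=2,
)

for combo, count in gaps.items():
    print(combo, "has only", count, "clip(s)")
```

A report like this could feed back into data collection planning, for example by scheduling additional drives for the combinations that come up short.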

What is data tagging?

In its simplest form, data tagging means adding metadata that describes the diverse classes of information associated with the raw data. Examples include weather, lighting, road infrastructure, road signs, and the different actors in the scene: information that helps to understand the driving scene and to extract the data points useful for developing AD features. With these tags, the user can determine which combination of data points is rich, complete, and diverse enough for the downstream work, and can view, identify, and filter specific datasets to curate for labeling and training perception algorithms or for validating the developed functionality against a given Operational Design Domain (ODD) requirement.

For example, a weather tag can be used to identify the type of weather condition at the time of data collection. A lighting tag can be used to identify the lighting condition, such as bright or dim. A road infrastructure tag can be used to identify the type of road, such as a highway or a residential street. A traffic sign tag can be used to determine the associated impact on AD path planning and motion control, for example when the vehicle encounters a speed limit sign, a stop sign, or a yield sign.
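
One way to picture these tags is as a small structured record attached to each recorded clip. The sketch below is only an assumed shape for such a record: the class name `SceneTags`, the field names, and the enumerated values are hypothetical and do not describe any particular tagging product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Weather(Enum):
    CLEAR = "clear"
    RAIN = "rain"
    SNOW = "snow"
    FOG = "fog"


class Lighting(Enum):
    BRIGHT = "bright"
    DIM = "dim"


class RoadType(Enum):
    HIGHWAY = "highway"
    RESIDENTIAL = "residential"


@dataclass(frozen=True)
class SceneTags:
    """Hypothetical metadata record describing one recorded clip."""
    clip_id: str
    weather: Weather
    lighting: Lighting
    road_type: RoadType
    # Traffic signs seen in the clip, e.g. {"stop", "speed_limit", "yield"}.
    traffic_signs: frozenset = field(default_factory=frozenset)


tags = SceneTags(
    clip_id="clip_0001",
    weather=Weather.RAIN,
    lighting=Lighting.DIM,
    road_type=RoadType.RESIDENTIAL,
    traffic_signs=frozenset({"stop", "speed_limit"}),
)

print(tags.weather.value, tags.lighting.value, sorted(tags.traffic_signs))
```

Using enumerations rather than free text keeps tag values consistent across clips, which makes later filtering and counting more reliable.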

Data tags can also help to identify critical data points that are relevant to specific ODD requirements. For example, if the ODD requires the autonomous vehicle to navigate through a busy intersection, then data tags can be used to identify data points that are relevant to this scenario, such as the location of other vehicles, pedestrians, and traffic lights. By filtering for the data associated with the required ODD elements, developers can focus on the data most relevant to the specific ODD requirements and quickly filter out the irrelevant data points.
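
The busy-intersection example can be sketched as a simple filter over tagged clips. Everything in this sketch is an assumption made for illustration: the tag keys (`road_type`, `actors`, `traffic_light`), the sample clips, and the way the ODD requirement is expressed as a dictionary.

```python
# Hypothetical tagged clips, invented for illustration only.
tagged_clips = [
    {"id": "clip_001", "road_type": "intersection",
     "actors": {"vehicle", "pedestrian"}, "traffic_light": True},
    {"id": "clip_002", "road_type": "highway",
     "actors": {"vehicle"}, "traffic_light": False},
    {"id": "clip_003", "road_type": "intersection",
     "actors": {"vehicle"}, "traffic_light": True},
    {"id": "clip_004", "road_type": "intersection",
     "actors": {"vehicle", "pedestrian", "cyclist"}, "traffic_light": True},
]

# Hypothetical ODD requirement for a busy intersection.
busy_intersection = {
    "road_types": {"intersection"},
    "required_actors": {"vehicle", "pedestrian"},
    "needs_traffic_light": True,
}


def matches_odd(clip, odd):
    """Return True if the clip's tags satisfy every element of the ODD requirement."""
    return (
        clip["road_type"] in odd["road_types"]
        # Every required actor type must be present in the clip.
        and odd["required_actors"] <= clip["actors"]
        and (clip["traffic_light"] or not odd["needs_traffic_light"])
    )


selected = [clip["id"] for clip in tagged_clips if matches_odd(clip, busy_intersection)]
print(selected)
```

Only the clips whose tags cover every required ODD element are kept; the rest can be set aside without anyone having to review the raw recordings.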


In conclusion, the development and testing of autonomous driving features require massive amounts of diverse real-world data. This data needs to cover the full complexity of real-world driving scenarios, yet not all of the vast amount of data collected during testing is useful for developing and testing perception algorithms. Data tags are therefore essential for identifying the critical data points relevant to specific ODD requirements. By using data tags to filter and curate the useful data, developers can focus on the data that matters most for developing and testing autonomous driving features and avoid spending money and resources on irrelevant data points.

The use of data tags is becoming increasingly important in the development of autonomous driving features, and it is essential that developers and researchers understand how to use data tags effectively to filter critical data. In the next blog article, we will explore in more detail the challenges in building up driving scene information tags across different ODD attributes.
