Introducing Scale’s Automotive Foundation Model

December 5, 2023

Autonomous vehicle development requires iterative improvements in perception models through a data engine. These data engines currently rely on a set of task-specific models based around a fixed taxonomy of objects and scenarios to identify. However, there are two critical limitations to existing data engines:

  1. Current models are task-specific and limited to a fixed taxonomy. As data requirements for task types and taxonomies change over time, new models must be trained from scratch.

  2. Safe deployment requires detecting a ‘long tail’ of rare events and scenarios, which are not captured in any fixed taxonomy.

Today, we are introducing Scale’s Automotive Foundation Model (AFM-1): a new model built from transformer modules and trained on diverse, large-scale street scene data. It can handle multiple vision tasks with new taxonomies without requiring fine-tuning. This model is a major step for the autonomous vehicle research community, as it is the first zero-shot model that is generally available and reliable enough to use off the shelf for experimentation. It delivers state-of-the-art performance both in zero-shot regimes and once fine-tuned on specific datasets. Check out the demo to try AFM-1 with your own images.

AFM-1 is a single model that has been trained on millions of densely labeled images across five computer vision tasks: object detection, instance segmentation, semantic segmentation, panoptic segmentation, and classification.


AFM-1 Architecture

AFM-1 is a neural network that fuses text and image features and then decodes them through a transformer into segmentation masks, detection bounding boxes, object classes, and a global image class. The model is trained on a large-scale internal dataset to maximize the similarity between language embeddings and object-based visual features. The architecture extends the alignment of language and visual features presented in OpenAI’s CLIP research to dense, object-based perception. Training in this open-vocabulary modality with a contrastively pretrained text embedding model enables two key breakthroughs:

  1. Segmentation and detection of similar concepts with reduced training data. For example, “traffic light” and “traffic signal” produce nearly identical results.

  2. Removing the need to train a new model whenever the taxonomy changes, since concepts are matched by language similarity rather than by output layers hard-coded to a fixed set of classes.
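The idea behind both points can be illustrated with a minimal sketch. This is not AFM-1's implementation: the three-dimensional vectors below are hand-written toy values standing in for a real text encoder and real object features, and the names `TOY_TEXT_EMBEDDINGS` and `classify` are hypothetical. The sketch only shows why, when labels are scored by embedding similarity, swapping the taxonomy means swapping a list of strings rather than retraining an output layer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Assumption: toy, hand-written text embeddings. A real system would obtain
# these from a contrastively pretrained text encoder.
TOY_TEXT_EMBEDDINGS = {
    "traffic light": [0.90, 0.10, 0.00],
    "traffic signal": [0.88, 0.12, 0.02],
    "pedestrian": [0.00, 0.95, 0.10],
    "emergency vehicle": [0.10, 0.00, 0.97],
}

def classify(object_feature, taxonomy):
    """Return the taxonomy label whose text embedding is closest to the feature."""
    scores = {
        label: cosine(object_feature, TOY_TEXT_EMBEDDINGS[label])
        for label in taxonomy
    }
    best = max(scores, key=scores.get)
    return best, scores

# Assumption: a toy visual feature for one detected object.
object_feature = [0.85, 0.15, 0.05]

# The same feature is labeled under two different taxonomies without retraining.
label_a, _ = classify(object_feature, ["traffic light", "pedestrian"])
label_b, _ = classify(object_feature, ["traffic signal", "emergency vehicle"])
print(label_a, "|", label_b)
```

Because “traffic light” and “traffic signal” sit close together in the toy embedding space, the same object feature matches either name, which mirrors the near-identical results described above.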

Today, changing a data taxonomy is slow and cumbersome. With AFM-1, autonomous vehicle programs can iterate on data requirements without being locked into a fixed taxonomy.

We are initially releasing detection and instance, semantic, and panoptic segmentation capabilities, with classification coming in a future release.

State of the Art Performance

AFM-1 is the most advanced vision foundation model for automotive applications in the world. The model reaches state-of-the-art results on Berkeley Deep Drive (BDD) and Cityscapes segmentation in both zero-shot and fine-tuned regimes.

AFM-1 is state of the art across zero-shot benchmarks and, when fine-tuned, on many benchmarks even without hyperparameter tuning.

Measured against historical progress, our improvement on Cityscapes segmentation is equal to four years of progress by the entire open-source community.

Source: Cityscapes Leaderboard

Currently, zero-shot AFM-1 is available in our public demo. We are also working with our customers to offer custom, fine-tuned versions of the model.

Using AFM-1 to Accelerate Autonomy Development

Autonomy programs improve perception by constructing a data engine that iteratively improves neural networks with larger quantities of high-quality data. A data engine requires collecting and labeling data, training a model on that data, evaluating where the model is failing, curating the data to improve performance, and then repeating the cycle.

Today, running a data engine requires significant iteration on task-specific models. As data requirements and taxonomies change over time, new models must be trained from scratch. This iteration not only slows development, it also drives up labeling costs.

AFM-1 will accelerate the data engine through:

Accelerated Curation
Finding the specific self-driving cases that are needed for model improvement, such as scenarios with emergency vehicles, is a significant challenge today. With AFM-1, developers can search with open-vocabulary queries, even for small objects, independent of taxonomy and without needing to fine-tune curation models to a specific taxonomy in advance.

AFM-1 detection of “emergency vehicle”
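As a rough illustration of query-based curation, the sketch below filters a set of images by the detections returned for an open-vocabulary query. Everything here is hypothetical: the `Detection` record, the `curate` function, the score threshold, and the sample data are assumptions made for the example and do not describe the AFM-1 or Nucleus API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    query: str    # open-vocabulary text query that produced this detection
    score: float  # model confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels

def box_area(box):
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)

def curate(predictions, query, min_score=0.5, max_area=None):
    """Return ids of images with at least one confident detection for `query`.

    If `max_area` is given, only detections at most that large count,
    which is one simple way to search for small objects.
    """
    selected = []
    for image_id, detections in predictions.items():
        for det in detections:
            if det.query != query or det.score < min_score:
                continue
            if max_area is not None and box_area(det.box) > max_area:
                continue
            selected.append(image_id)
            break  # one matching detection is enough for this image
    return selected

# Assumption: made-up predictions keyed by image id.
predictions = {
    "img_001": [Detection("emergency vehicle", 0.91, (100, 80, 140, 110))],
    "img_002": [Detection("emergency vehicle", 0.32, (10, 10, 300, 200))],
    "img_003": [Detection("traffic cone", 0.88, (50, 60, 70, 90))],
    "img_004": [Detection("emergency vehicle", 0.77, (0, 0, 400, 300))],
}

hits = curate(predictions, "emergency vehicle")
small_hits = curate(predictions, "emergency vehicle", max_area=5000)
print(hits, small_hits)
```

The query string is the only thing that changes between searches, so a new concept can be curated without training a dedicated curation model first.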

Ground Truth Data Labeling
Prior to AFM-1, obtaining high-quality training data for autonomous vehicle use cases required a high degree of human-in-the-loop intervention, because standard ML autolabels were not of high enough quality to iterate on the perception stack. With AFM-1, nascent perception programs can quickly experiment on a taxonomy and generate a testable model without human intervention at a fraction of the cost. For production programs, AFM-1 prelabeling and linters will be used to bring down labeling costs for our customers while reducing turnaround time.

“This is perfect timing for our roadmap. We want to incorporate auto-labeling into our solution.” Director of Perception, autonomous vehicle company
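One way a label linter could work is to compare human-drawn boxes against model predictions and flag the ones with no close match for review. The sketch below assumes that interpretation; the function names, the intersection-over-union (IoU) threshold of 0.5, and the sample boxes are illustrative choices, not a description of Scale's linters.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def lint_labels(human_boxes, model_boxes, min_iou=0.5):
    """Return indices of human boxes that no model box overlaps well enough."""
    flagged = []
    for index, human in enumerate(human_boxes):
        best = max((iou(human, m) for m in model_boxes), default=0.0)
        if best < min_iou:
            flagged.append(index)
    return flagged

# Assumption: made-up boxes for a single image.
human_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
model_boxes = [(1, 1, 10, 10)]

flagged = lint_labels(human_boxes, model_boxes)
print(flagged)
```

A flagged box is not necessarily wrong; it only marks a disagreement between annotator and model that is worth a second look.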


AFM-1 is not built for direct consumption of its inferences in safety-critical applications in autonomous vehicles. Instead, AFM-1 should be used for curation, taxonomy experimentation, pretraining, evaluation, and other offboard tasks. Further, some data domains, such as aerial imagery, do not yet work reliably.

The open-vocabulary nature of AFM-1 makes it as challenging to evaluate as generative AI models such as large language models, where the massive diversity of supported tasks must be evaluated by humans. To this end, we have released our public demo and will have a free trial period in Scale Nucleus.

In future work, we look forward to releasing more details on the evaluation of our automotive foundation models.

The Future of AFM

We believe AFM-1 is a first step in changing the paradigm of how self-driving programs approach perception. In the future, we expect our autolabels not only to become more accurate but also to work across a wider range of modalities, potentially including object attributes, 3D LiDAR data, aerial imagery, visual question answering, and other applications that push generative AI and Reinforcement Learning from Human Feedback (RLHF) into automotive. We also expect to offer substantial model performance gains for our customers as we fine-tune AFM-1 with their data. We believe the possibilities of future iterations of AFM are broad, and we’re excited to gather even more user feedback to discover new use cases that will accelerate the development of autonomous vehicles.

Using AFM

To try AFM-1 for yourself, please visit our demo environment. Further, the model will be integrated into Nucleus, our dataset management product, so that you can run the model over your data, view and query predictions in various ways, and use the resulting autolabels for model training and experimentation. If you have any questions about how AFM can be used in your self-driving program, or you’re interested in fine-tuning AFM-1 on your own data, you can reach out to your account manager or book a demo.
