What can Autonomous Vehicle (AV) tech companies learn from Generative AI?
Despite significant efforts by some of the world’s largest tech and automotive companies, the problem of Autonomous Vehicles remains unsolved. Over $100 billion has been invested to bring AVs to market, but the technology is not yet advanced enough for a commercial release.
I have recently been tinkering with and learning about large-scale, language-guided image generators (DALL·E 2, Stable Diffusion, and Google’s Imagen).
I also trained my own model based on Latent-diffusion. Some of the results are here:
The results are impressive, and for those who have been using and building neural networks for a long time, they are quite surprising. How can this be? Why are they so good? The model is capable of connecting language, history, culture, and art, and distilling them into great visualizations.
We need this kind of technology in other industries.
I’ve been working on AV tech for many years, and I can’t help but feel a little jealous. The progress in self-driving cars’ AI has thus far been limited.
It’s quite clear that safe and large-scale deployment of AVs will require models that can make similar connections between motion plans, physics, rules, and behaviors. I’d like to propose how to do that, but before that, let’s review what makes this technology so good.
Large Language Models (LLMs)
LLMs are language models that are generally tens of gigabytes in size and are trained on internet-scale text data, sometimes at the petabyte scale.
LLMs are trained to predict the next word(s) in a sentence, so their training is self-supervised: the targets come from the input text itself, with no human labels needed. The data used to train LLMs varies from the entire works of Shakespeare to C++ code snippets.
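A toy illustration of this self-supervised setup (real LLMs operate on subword tokens rather than whole words, but the principle is the same):

```python
# A minimal sketch of how self-supervised next-token training pairs are
# built from raw text: the input context and the prediction target both
# come from the same sentence, so no human labeling is needed.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Each training example pairs a context with the word that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples[:3]:
    print(context, "->", target)
# ['the'] -> quick
# ['the', 'quick'] -> brown
# ['the', 'quick', 'brown'] -> fox
```

A real training pipeline would tokenize into subwords and batch these pairs, but the supervision signal is exactly this: predict what comes next.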
Many experts believe that the reason large language models are so good is that they are able to capture long-term dependencies. In addition, they are also able to handle a lot of noise and uncertainty, which is often present in real-world data.
The mix of very large, diverse datasets with humongous neural networks containing billions of parameters makes LLMs zero-shot learners: they can adapt to new tasks without domain-tailored training data.
Here is an example of how OpenAI’s GPT-3 is able to capture the meaning of George Orwell’s Animal Farm. It shows up in the image generated when the model is prompted with ‘Warplanes Bombing George Orwell’s Animal Farm’.
CLIP (Contrastive Language-Image Pre-Training)
CLIP is a neural network trained on a variety of image-text pairs. CLIP utilizes LLMs to encode text and connects it with visual data. The following illustration is from OpenAI’s CLIP research paper.
Like LLMs, CLIP demonstrates zero-shot capabilities in downstream tasks with little to no domain-specific training data.
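A minimal sketch of this zero-shot idea, with hand-made vectors standing in for real CLIP embeddings (the actual encoders are deep networks trained so that matching image-text pairs land close together in a shared space):

```python
import numpy as np

# Toy stand-ins for CLIP's image and text encoders. These 4-d vectors are
# hand-made for illustration; real CLIP embeddings are learned.
image_embedding = np.array([0.9, 0.1, 0.0, 0.1])  # "an image of a dog"
text_embeddings = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1, 0.0]),
    "a photo of a cat": np.array([0.1, 0.9, 0.1, 0.0]),
    "a photo of a car": np.array([0.0, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot classification: pick the caption whose embedding is closest
# to the image embedding -- no task-specific training required.
scores = {cap: cosine(image_embedding, v) for cap, v in text_embeddings.items()}
best = max(scores, key=scores.get)
print(best)  # a photo of a dog
```

Swapping in a new set of candidate captions changes the "classifier" instantly, which is exactly what makes the zero-shot setting so flexible.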
CLIP has not only democratized image captioning but also allowed the collection of internet-scale image-caption pairs by anyone with access to the internet. See the amazing work at LAION.AI.
Image Synthesis and Diffusion Models
Image synthesis is a broad class of machine learning tasks with a variety of applications. It is typically done with deep generative models, such as GANs, VAEs (Variational Auto-Encoders), and autoregressive models. However, each of these models has its disadvantages: GANs often suffer from unstable training, while autoregressive models are usually slow at synthesis.
Diffusion models were originally proposed in 2015. They work by corrupting the training data by progressively adding Gaussian noise, and slowly wiping out details in the data until it becomes pure noise. Then, a neural network is trained to reverse this corruption process. Running this reversed corruption process synthesizes data from pure noise by gradually de-noising it until a clean sample is produced. There has been a recent resurgence in interest in diffusion models due to their training stability and the promising results they have achieved in terms of image and audio quality.
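The forward corruption process can be sketched in a few lines of NumPy. The linear noise schedule below is an assumption (one common choice); a real diffusion model would also train a network to reverse each step, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward (corruption) process of a diffusion model, sketched on a 1-D
# signal: at each step the data is mixed with Gaussian noise according to
# a variance schedule, so by the final step it is nearly pure noise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alphas_cum = np.cumprod(1.0 - betas)        # cumulative signal retention

x0 = np.sin(np.linspace(0, 2 * np.pi, 64))  # a "clean" data sample

def q_sample(x0, t):
    """Sample the corrupted x_t directly from x_0 in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_cum[t]) * x0 + np.sqrt(1.0 - alphas_cum[t]) * noise

x_early, x_late = q_sample(x0, 10), q_sample(x0, T - 1)
# Early steps keep most of the signal; the last step is almost pure noise.
print(np.corrcoef(x0, x_early)[0, 1], np.corrcoef(x0, x_late)[0, 1])
```

Training then amounts to teaching a network to undo one noising step at a time; sampling runs that learned reversal from pure noise back to a clean sample.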
Internet-scale datasets, LLMs, Transformers, and a stable image synthesis training paradigm have fueled the progress of text-guided image synthesis. It seemed as though there was an arms race over who could build the best-in-class generator. However, the basic ingredients needed to make these models are available to everyone (except for the massive cloud computing bill needed to train them).
There are great blog posts out there detailing how this technology works.
The technology will continue to evolve at a rapid pace, disrupting many industries from photo editing to movie making.
Now back to Self-Driving Cars
I have been working on self-driving technology for the past 8 years, and while the industry has progressed enormously, no one has been able to produce a consumer-grade solution.
Deep Learning has driven significant innovation & progress in self-driving technology, but it has not yet disrupted the entire industry or provided a comprehensive toolset to solve it.
The current tools available to the industry cannot solve complex problems such as understanding pedestrian intents, building a perfect perception system, or executing human-level decision-making. This lack of progress is not due to a lack of manpower. The industry is stuck because of the way we are solving this problem.
The current system architecture of self-driving tech is split into perception and planning stacks. This is problematic: we do not know how much engineering is needed in the perception stack to encode the universe, and writing the rules of driving is prohibitive from both an engineering and a technology point of view. Back in 2015, people knew this, but most investment went into HD maps and LiDAR, since progress there is much more attainable than a research program that cannot produce a demo.
I don’t need to convince anyone that what matters most for making self-driving cars is massive amounts of high-quality data. Diverse data helps your neural networks handle situations not seen at training time. But this is only part of the problem: if you rely on annotation to produce a high-quality dataset, you are stuck, because every little change means redoing everything from scratch.
What happens if you come up with a great idea six months from now? You can’t go back and redo all your data. The existing annotations cannot magically change: they can’t change for each new feature, and they can’t change down the road when you realize you’ve made a mistake.
Imagine changing sensors. Sensors get better with time, and every time you update your sensor stack, some of the annotations need to be redone. This is cost-prohibitive, especially when you add sensors at a later stage of the R&D program. Or imagine the motion-planning requirements change: now you need to annotate your data differently.
So how do we solve this problem?
The solution is in the making
Tesla is taking a new approach to solving the problem: rather than relying on LiDAR or HD maps, they are building a real-time, camera-based perception system followed by a planning system. They have built fast annotation systems that can label millions of images relatively quickly.
Yet Tesla has hit a ‘glass ceiling’ in its system’s performance: shipping more features or training on ever-larger datasets no longer seems to translate into better driving performance.
Wayve has published its method for solving self-driving. While the method does address some limitations of the traditional approach, it still relies on hand-crafted abstraction layers, i.e., annotation (see Figure 1).
There is a third approach that can lead to a better self-driving system. We need something that connects language, history, safety, physics, and the DMV handbook. Saying that end-to-end learning is the solution is not enough; time and again, end-to-end learning has proven fragile and uncontrollable when applied to self-driving cars.
Generative AI shows the way
The following figure is a diagram of Google’s Imagen image generator; we will use it as a guideline for building an ‘end-to-end’ AV solution.
The first stage is to fine-tune a neural network to connect driving data with text captions. The text captions here are twofold:
1- Navigational instructions (“Turn left”; “At the roundabout, take the 3rd exit”). These will be collected automatically (but carefully) using mapping APIs.
2- Driving instructions (“Stop at the red traffic light”, “Keep lane”, “Overtake the car”, etc.). This data will be generated using a zero-shot video-CLIP in the following manner:
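A toy sketch of such a labeling scheme: a video-CLIP model embeds a short driving clip and a set of candidate instruction captions into a shared space, and the best-scoring caption becomes the clip's label, with no human annotation. The embeddings below are hand-made stand-ins, not real model outputs:

```python
import numpy as np

# Candidate driving instructions to score against each driving clip.
candidate_instructions = [
    "stop at the red traffic light",
    "keep lane",
    "overtake the car ahead",
]
# Hand-made 3-d stand-ins for the text encoder's caption embeddings.
caption_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
# Hand-made stand-in for the video encoder's embedding of one clip
# (a clip of the ego car braking at a red light).
clip_embedding = np.array([0.85, 0.15, 0.05])

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine similarity between the clip and every candidate caption;
# the argmax becomes the clip's zero-shot label.
scores = normalize(caption_embeddings) @ normalize(clip_embedding)
label = candidate_instructions[int(np.argmax(scores))]
print(label)  # stop at the red traffic light
```

Run at scale over logged fleet data, this would turn raw driving video into (clip, instruction) pairs automatically, the same way LAION turned the web into image-caption pairs.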
The second stage is to extract driving trajectories from driving episodes.
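A minimal sketch of this stage, under the assumption that a driving episode is a logged sequence of global (x, y) poses: the future poses are transformed into the ego frame of the first pose, producing the kind of trajectory target a planning model could be trained to output.

```python
import numpy as np

# Assumed episode log: [x, y] positions in meters, one row per time step.
poses = np.array([
    [0.0, 0.0], [1.0, 0.1], [2.0, 0.4], [3.0, 0.9], [4.0, 1.6],
])
heading = 0.0  # ego heading at the first pose, in radians (assumed known)

# Rotate and translate the future poses into the first pose's frame,
# so the trajectory is expressed relative to the ego vehicle.
c, s = np.cos(-heading), np.sin(-heading)
R = np.array([[c, -s], [s, c]])
trajectory = (poses[1:] - poses[0]) @ R.T

print(trajectory)  # 4 ego-relative waypoints
```

Because the trajectories come straight from the vehicle's own logs, this stage needs no annotation at all, which is the whole point of the proposed pipeline.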
Putting it all together:
The system architecture presented here has the following benefits:
- There is no middleware for perception.
- The CLIP-V and the text encoder models can be trained to produce internet-scale embeddings.
- Very large datasets can be collected with little to no human interventions.
- The interface between the models is language, so no special-purpose format has to be adopted.