Technical Approach

Robust detection of moving vehicles is a critical task for any autonomously operating outdoor robot or self-driving vehicle. Most modern approaches for solving this task rely on training image-based detectors using large-scale vehicle detection datasets such as nuScenes or the Waymo Open Dataset. Providing manual annotations is an expensive and laborious exercise that does not scale well in practice. To tackle this problem, we propose a self-supervised approach that leverages audio-visual cues to detect moving vehicles in videos. Our approach employs contrastive learning for localizing vehicles in images from corresponding pairs of images and recorded audio. In extensive experiments carried out with a real-world dataset, we demonstrate that our approach provides accurate detections of moving vehicles and does not require manual annotations. We furthermore show that our model can be used as a teacher to supervise an audio-only detection model. This student model is invariant to illumination changes and thus effectively bridges the domain gap faced by models that rely exclusively on vision.

Overview of the System
We use a volume-based heuristic to classify the videos into positive, negative, and inconclusive image-spectrogram pairs. We subsequently train an audio-visual teacher model, denoted as AV-Det, on the positive and negative pairs. Encoders $T_I$ and $T_A$ embed the input image and the stacked multi-channel spectrograms into a shared feature space, from which we compute a heatmap that indicates the spatial correspondence between the image features and the audio features. We post-process the heatmap to generate bounding boxes, which can in turn be used as pseudo-labels to train an optional audio-only detector model.
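The pipeline above can be sketched as follows. Note that the RMS thresholds, the feature shapes, and the cosine-similarity form of the correspondence are illustrative assumptions, not the exact design of AV-Det; in practice the encoders $T_I$ and $T_A$ are learned networks trained with a contrastive loss:

```python
import numpy as np

def classify_pair(audio_chunk, low=0.01, high=0.05):
    """Volume-based heuristic: label an image-spectrogram pair by the RMS
    level of its audio chunk (the threshold values here are made up)."""
    rms = np.sqrt(np.mean(audio_chunk ** 2))
    if rms >= high:
        return "positive"      # loud: a vehicle is likely passing
    if rms <= low:
        return "negative"      # quiet: background only
    return "inconclusive"      # ambiguous: excluded from training

def correspondence_heatmap(img_feats, audio_emb):
    """Cosine similarity between per-location image features (H, W, C)
    and a global audio embedding (C,), giving an (H, W) heatmap."""
    img_n = img_feats / (np.linalg.norm(img_feats, axis=-1, keepdims=True) + 1e-8)
    aud_n = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return img_n @ aud_n

def heatmap_to_boxes(heatmap, thresh=0.5):
    """Post-processing: threshold the heatmap and return one (xmin, ymin,
    xmax, ymax) box per 4-connected region of above-threshold cells."""
    mask = heatmap > thresh
    visited = np.zeros_like(mask, dtype=bool)
    H, W = mask.shape
    boxes = []
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not visited[i, j]:
                # Flood fill to collect the connected region.
                stack, ys, xs = [(i, j)], [], []
                visited[i, j] = True
                while stack:
                    y, x = stack.pop()
                    ys.append(y)
                    xs.append(x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

The heatmap is computed at feature resolution, so in practice the resulting boxes would be rescaled to the input image size before being used as pseudo-labels.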




We collected a real-world video dataset of moving vehicles, the Freiburg Audio-Visual Vehicles dataset. We use an XMOS XUF216 microphone array with seven microphones in total for audio recording: six microphones are arranged in a circle with 60-degree angular spacing, and one microphone is located at the center. The array is mounted horizontally for maximum angular resolution for objects moving in the horizontal plane. To capture the images, we use a FLIR BlackFly S RGB camera and crop the images to a resolution of 400 × 1200 pixels. Images are recorded at a fixed frame rate of 5 Hz, while the audio is captured at a sampling rate of 44.1 kHz for each channel. The microphone array and the camera are mounted one above the other with a vertical offset of ca. 10 cm.
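The stacked spectrograms consumed by the audio encoder can be computed from the 7-channel recordings roughly as follows; the window length and hop size below are illustrative choices, not the exact parameters used in our pipeline:

```python
import numpy as np

SAMPLE_RATE = 44_100  # per-channel audio sampling rate, as described above
N_CHANNELS = 7        # XMOS XUF216: six circular microphones plus one center

def stacked_spectrograms(audio, nperseg=1024, hop=512):
    """Log-power spectrogram per microphone channel, stacked along a
    leading axis for use as multi-channel network input. `audio` has
    shape (N_CHANNELS, num_samples); nperseg and hop are illustrative."""
    window = np.hanning(nperseg)
    specs = []
    for ch in audio:
        # Frame the signal, window each frame, and take the real FFT.
        frames = np.stack([ch[i:i + nperseg] * window
                           for i in range(0, len(ch) - nperseg + 1, hop)])
        power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
        specs.append(np.log(power.T + 1e-10))  # (freq_bins, time_frames)
    return np.stack(specs)  # (N_CHANNELS, freq_bins, time_frames)
```

One second of 7-channel audio thus yields an array of shape (7, 513, T), which preserves the inter-channel level and phase-difference cues that make spatial localization from the circular array possible.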

For our dataset, we consider two distinct scenarios: a static recording platform and a moving recording platform. In the static platform scenario, the recording setup is placed close to a street on a fixed camera mount. In the moving platform scenario, the recording setup is handheld and is thus moved over time, while the orientation of the recording setup also changes with respect to the scene. The positional perturbations can reach 15 cm in all directions, while the angular perturbations can reach 10 deg.

We collected ca. 70 minutes of audio and video footage in nine distinct scenarios with weather conditions ranging from clear to overcast and foggy. Overall, the dataset contains more than 20k images. The recording environments comprise suburban, rural, and industrial scenes. The distance of the camera to the road varies between scenes. To evaluate detection metrics with our approach, we manually annotated more than 300 randomly selected images across all scenes with bounding boxes for moving vehicles. We also manually classified each image in the dataset according to whether it contains a moving vehicle or not. Note that static vehicles are counted as part of the background of each scene.


Please cite our work if you use the Freiburg AV-Vehicles dataset or report results based on it.

@article{zuern2022selfsupervised,
  title={Self-Supervised Moving Vehicle Detection from Audio-Visual Cues},
  author={Z{\"u}rn, Jannik and Burgard, Wolfram},
  journal={arXiv preprint arXiv:2201.12771},
  year={2022}
}