Formula-Supervised Sound Event Detection: Pre-Training Without Real Data

Keio University, AIST, Waseda University, Doshisha University
ICASSP 2025
Formula-SED

We create Formula-SED, a synthetic dataset for SED, and propose a formula-driven pre-training method that uses acoustic synthesis parameters as labels with exact timestamps.

Abstract

In this paper, we propose a novel framework for pre-training environmental sound analysis models on acoustic signals that are parametrically synthesized by formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for Sound Event Detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly hampered by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, manually annotated labels often contain noise and are strongly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, whose acoustic data are generated solely from mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground-truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly improves model accuracy and accelerates training, as evidenced by our results on the DESED dataset used in DCASE 2023 Challenge Task 4.
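For concreteness, below is a minimal Python sketch of the kind of formula-driven data generation described above: each clip is built from parametric (here, FM-style) tones, and the synthesis parameters applied at each time step are recorded as frame-level labels. The sample rate, hop size, parameter ranges, and the FM recipe itself are illustrative assumptions, not the paper's actual synthesis pipeline.

import numpy as np

SR = 16000     # sample rate (Hz); illustrative, not the paper's setting
CLIP_SEC = 10  # clip length in seconds
HOP = 160      # one label frame per 10 ms

def synth_event(dur_sec, f0, mod_freq, mod_index):
    """FM-style tone defined purely by a formula; its parameters
    (f0, mod_freq, mod_index) double as the supervision signal."""
    t = np.arange(int(dur_sec * SR)) / SR
    phase = 2 * np.pi * f0 * t + mod_index * np.sin(2 * np.pi * mod_freq * t)
    return np.hanning(len(t)) * np.sin(phase)  # smooth onset/offset

def make_clip(rng):
    """Place random synthetic events in a clip and record frame-level
    labels with exact timestamps (no human annotation, hence no noise)."""
    audio = np.zeros(CLIP_SEC * SR)
    labels = np.zeros(CLIP_SEC * SR // HOP, dtype=np.int64)  # 0 = silence
    for _ in range(rng.integers(2, 5)):
        dur = rng.uniform(0.5, 2.0)
        onset = rng.uniform(0, CLIP_SEC - dur)
        bin_id = rng.integers(1, 17)        # quantized parameter bin = label
        f0 = 100.0 * 2.0 ** (bin_id / 4.0)  # map bin to a fundamental freq
        event = synth_event(dur, f0,
                            mod_freq=rng.uniform(1.0, 8.0),
                            mod_index=rng.uniform(0.5, 4.0))
        start = int(onset * SR)
        audio[start:start + len(event)] += event
        labels[start // HOP:(start + len(event)) // HOP] = bin_id
    return audio, labels

clip, frame_labels = make_clip(np.random.default_rng(0))

Because every label is read directly off the generating formula, onsets and offsets are exact by construction; this is the property that lets the method sidestep the annotation noise and subjectivity discussed above.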

Sample Audio Clips

Six example clips generated for Formula-SED (Examples #1–#6), each shown with its spectrogram.

Overview

Overview of the proposed method. We effectively pre-train SED models using acoustic data generated solely based on mathematical formulas.
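To make the pre-training stage concrete, here is a hedged sketch of one plausible formulation: a small CRNN predicts the quantized synthesis-parameter bin at every frame, trained with frame-level cross-entropy against the timestamps recorded during synthesis. The TinyCRNN architecture, layer sizes, 17-class label space, and the random tensors standing in for log-mel features are illustrative assumptions, not the authors' exact setup.

import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Stand-in CRNN: conv front-end over log-mel frames, GRU over time,
    per-frame classifier for the quantized synthesis-parameter labels."""
    def __init__(self, n_mels=64, n_classes=17, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),           # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):                   # mel: (B, n_mels, T)
        x = self.conv(mel.unsqueeze(1))       # (B, 64, n_mels/4, T)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (B, T, 64 * n_mels/4)
        x, _ = self.gru(x)
        return self.head(x)                   # (B, T, n_classes)

model = TinyCRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# one pre-training step on a synthetic batch (shapes are illustrative)
mel = torch.randn(8, 64, 1000)               # log-mel features
labels = torch.randint(0, 17, (8, 1000))     # frame-level parameter bins
logits = model(mel)
loss = loss_fn(logits.reshape(-1, 17), labels.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()

After pre-training of this kind, the backbone would be fine-tuned on real SED data such as DESED; the point of the sketch is only the frame-level supervision derived from synthesis parameters.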

Learning Curve

Downstream training curve for baseline CRNN.

The CRNN with formula-driven pre-training converges faster and reaches better final performance than the same model trained without pre-training.

Poster

BibTeX

@INPROCEEDINGS{10888414,
  author={Shibata, Yuto and Tanaka, Keitaro and Bando, Yoshiaki and Imoto, Keisuke and Kataoka, Hirokatsu and Aoki, Yoshimitsu},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Formula-Supervised Sound Event Detection: Pre-Training Without Real Data},
  year={2025},
  pages={1-5},
  keywords={Training;Accuracy;Event detection;Noise;Supervised learning;Training data;Acoustics;Mathematical models;Timing;Synthetic data;sound event detection;pre-training without real data;environmental sound synthesis},
  doi={10.1109/ICASSP49660.2025.10888414}}