Woosh - A Sound Effect Foundation Model

Gaetan Hadjeres1, Marc Ferras1, Khaled Koutini1, Benno Weck1, Alexandre Bittar1, Thomas Hummel1, Zineb Lahrichi1, Hakim Missoum1, Joan Serrà1, Yuki Mitsufuji1,2
1Sony AI
2Sony Group Corporation

Abstract

The audio research community relies on open generative models to build upon and to use as baselines for novel approaches. In this report, we describe Woosh, the public release of the sound effect foundation model built at Sony AI, covering its architecture, training process, and evaluation against other popular open models. Optimized for sound effects, the release provides a high-quality audio encoder/decoder, a text-audio alignment model for conditioning, and text-to-audio (T2A) and video-to-audio (V2A) generative models. Distilled T2A and V2A models for fast inference are also included. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as Stable Audio Open and TangoFlux. Inference code and model weights are available at https://www.github.com/Woosh.
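The report does not document the sampler interface, but the Flow/DFlow naming suggests flow-matching models, which generate audio latents by integrating a learned velocity field from noise to data. Below is a minimal, self-contained sketch of such a sampler; the toy analytic velocity field stands in for the learned network, and all names are illustrative assumptions, not the actual Woosh API.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.

    velocity_fn is a stand-in for a learned velocity network; the real
    Woosh-Flow interface is not documented in this report.
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy rectified-flow field: transports any starting point to `target`
# along a straight line, so Euler integration recovers the target.
target = np.full(8, 0.5)
straight_line_v = lambda x, t: (target - x) / (1.0 - t)
latent = euler_sample(straight_line_v, np.zeros(8))
```

Distillation, as in Woosh-DFlow and Woosh-DVFlow, typically reduces the number of integration steps needed at inference; the exact distillation scheme is not specified here.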

Woosh-AE: Audio Encoder/Decoder

We compare our Woosh-AE audio encoder/decoder model against baselines.
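The comparisons below pair each model's reconstruction with the original spectrogram. A common quantitative counterpart is a log-magnitude spectrogram distance between the original and reconstructed waveforms; the numpy sketch below is an illustrative metric, not necessarily the one used to evaluate Woosh-AE.

```python
import numpy as np

def log_spec_distance(ref, est, n_fft=512, hop=128, eps=1e-8):
    """L1 distance between log-magnitude spectrograms of two waveforms.

    Illustrative reconstruction metric only; the report does not specify
    which metrics were used for Woosh-AE.
    """
    def spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        mags = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
        return np.log(mags + eps)
    return float(np.mean(np.abs(spec(ref) - spec(est))))

# Identical signals score zero; a noisy reconstruction scores higher.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=clean.shape)
```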

Sample 1 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Sample 2 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Sample 3 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Woosh-Flow/Woosh-DFlow: Text-to-audio Generation

We compare our private Woosh-Flow model against its public counterpart and other baselines.

Find our private model audio samples in the Woosh-Flow Private demo section.

Sample 1: Emergency vehicle driving with siren on (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 2: An engine humming and brakes squealing (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 3: Starting a motorcycle (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 4: A crowd applauds (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Woosh-VFlow/Woosh-DVFlow: Video-to-Audio Generation

We compare our models against baselines across several benchmarks. OV (only-video) means the model is conditioned on the video alone, without a caption.
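A standard way to support such optional captioning is to train with caption dropout and substitute a learned null text embedding at inference when no caption is given; whether Woosh-VFlow uses exactly this mechanism is an assumption here. A minimal sketch:

```python
import numpy as np

NULL_TEXT_EMB = np.zeros(4)  # stand-in for a learned "no caption" embedding

def build_conditioning(video_emb, text_emb=None):
    """Concatenate video and text conditioning; OV mode passes text_emb=None.

    Illustrative only: the actual Woosh-VFlow conditioning scheme is not
    documented in this report.
    """
    text = NULL_TEXT_EMB if text_emb is None else np.asarray(text_emb, dtype=float)
    return np.concatenate([np.asarray(video_emb, dtype=float), text])

full_cond = build_conditioning(np.ones(4), text_emb=np.full(4, 2.0))
ov_cond = build_conditioning(np.ones(4))  # caption omitted (OV mode)
```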

For more video samples and details, see the full video demo section.

Video Samples

Caption: A propeller plane flies over a snowy landscape, its engine roaring.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

Caption: A man in a yellow shirt punches focus mitts held by a trainer in a black shirt.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

Caption: A large audience in a formal hall gives a standing ovation with applause.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

BibTeX

@misc{hadjeres2026,
  title={Woosh: A Sound Effects Foundation Model},
  author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
  year={2026},
  eprint={2412.15322},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.15322},
}