Woosh - A Sound Effect Foundation Model

Gaetan Hadjeres1, Marc Ferras1, Khaled Koutini1, Benno Weck1, Alexandre Bittar1, Thomas Hummel1, Zineb Lahrichi1, Hakim Missoum1, Joan Serrà1, Yuki Mitsufuji1,2
1Sony AI
2Sony Group Corporation

Abstract

The audio research community relies on open generative models to build upon and to use as baselines for novel approaches. In this report, we describe Woosh, the public release of the sound effect foundation model built at Sony AI, covering its architecture, training process, and evaluation against other popular open models. Optimized for sound effects, the release provides a high-quality audio encoder/decoder, a text-audio alignment model for conditioning, and text-to-audio (T2A) and video-to-audio (V2A) generative models. Distilled T2A and V2A models for fast inference are also included. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as Stable Audio Open and TangoFlux. Inference code and model weights are available at https://www.github.com/Woosh.
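The report does not document the sampler interface, but the Flow/DFlow naming suggests flow-matching models, which generate audio latents by integrating a learned velocity field from noise to data. Below is a minimal, self-contained sketch of such a sampler; the toy analytic velocity field stands in for the learned network, and all names are illustrative assumptions, not the actual Woosh API.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.

    velocity_fn is a stand-in for a learned velocity network; the real
    Woosh-Flow interface is not documented in this report.
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy rectified-flow field: transports any starting point to `target`
# along a straight line, so Euler integration recovers the target.
target = np.full(8, 0.5)
straight_line_v = lambda x, t: (target - x) / (1.0 - t)
latent = euler_sample(straight_line_v, np.zeros(8))
```

Distillation, as in Woosh-DFlow and Woosh-DVFlow, typically reduces the number of integration steps needed at inference; the exact distillation scheme is not specified here.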

Woosh-AE: Audio Encoder/Decoder

We compare our Woosh-AE audio encoder/decoder model against baselines.
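The comparisons below pair each model's reconstruction with the original spectrogram. A common quantitative counterpart is a log-magnitude spectrogram distance between the original and reconstructed waveforms; the numpy sketch below is an illustrative metric, not necessarily the one used to evaluate Woosh-AE.

```python
import numpy as np

def log_spec_distance(ref, est, n_fft=512, hop=128, eps=1e-8):
    """L1 distance between log-magnitude spectrograms of two waveforms.

    Illustrative reconstruction metric only; the report does not specify
    which metrics were used for Woosh-AE.
    """
    def spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        mags = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
        return np.log(mags + eps)
    return float(np.mean(np.abs(spec(ref) - spec(est))))

# Identical signals score zero; a noisy reconstruction scores higher.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=clean.shape)
```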

Sample 1 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Sample 2 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Sample 3 (spectrograms): Original | SAO | Woosh-AE-Public | Woosh-AE-Private


Woosh-Flow/Woosh-DFlow: Text-to-audio Generation

We compare our private Woosh-Flow model against its public counterpart and other baselines.

Find our private model audio samples in the Woosh-Flow Private demo section.

Sample 1: Emergency vehicle driving with siren on (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 2: An engine humming and brakes squealing (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 3: Starting a motorcycle (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Sample 4: A crowd applauds (spectrograms): SAO | TangoFlux | Woosh-Flow-Public | Woosh-DFlow-Public


Woosh-VFlow/Woosh-DVFlow: Video-to-Audio Generation

We compare our models against baselines across several benchmarks. OV (only-video) means the model is conditioned on the video alone, without a caption.
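A standard way to support such optional captioning is to train with caption dropout and substitute a learned null text embedding at inference when no caption is given; whether Woosh-VFlow uses exactly this mechanism is an assumption here. A minimal sketch:

```python
import numpy as np

NULL_TEXT_EMB = np.zeros(4)  # stand-in for a learned "no caption" embedding

def build_conditioning(video_emb, text_emb=None):
    """Concatenate video and text conditioning; OV mode passes text_emb=None.

    Illustrative only: the actual Woosh-VFlow conditioning scheme is not
    documented in this report.
    """
    text = NULL_TEXT_EMB if text_emb is None else np.asarray(text_emb, dtype=float)
    return np.concatenate([np.asarray(video_emb, dtype=float), text])

full_cond = build_conditioning(np.ones(4), text_emb=np.full(4, 2.0))
ov_cond = build_conditioning(np.ones(4))  # caption omitted (OV mode)
```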

For more video samples and details, see the full video demo section.

Video Samples

Caption: A propeller plane flies over a snowy landscape, its engine roaring.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

Caption: A man in a yellow shirt punches focus mitts held by a trainer in a black shirt.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

Caption: A large audience in a formal hall gives a standing ovation with applause.

Ground Truth | MM-Audio | Woosh-VFlow | Woosh-DVFlow | MM-Audio OV | Woosh-VFlow OV | Woosh-DVFlow OV

BibTeX

@misc{hadjeres2026,
  title={Woosh: A Sound Effects Foundation Model},
  author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
  year={2026},
  eprint={2412.15322},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.15322},
}