Abstract
The audio research community relies on open generative models as baselines and as foundations for novel approaches. In this report, we describe Woosh, the public release of the sound effects foundation model built at Sony AI, covering its architecture, training process, and evaluation against other popular open models. Because the release is optimized for sound effects, it provides a high-quality audio encoder/decoder model, a text-audio alignment model for conditioning, and text-to-audio (T2A) and video-to-audio (V2A) generative models, along with distilled T2A and V2A models for fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as Stable Audio Open and TangoFlux. Inference code and model weights are available at https://www.github.com/Woosh.
Woosh-AE: Audio Encoder/Decoder
We compare our Woosh-AE audio encoder/decoder model against other baselines.
Woosh-Flow/Woosh-DFlow: Text-to-audio Generation
We compare our private Woosh-Flow model against Woosh-Flow Public and other baselines.
Audio samples from the private model are available in the Woosh-Flow Private demo section.
Woosh-VFlow/Woosh-DVFlow: Video-to-Audio Generation
We compare our models against baselines across several benchmarks. "OV" denotes a model conditioned only on video, without the caption.
For more video samples and details, see the full video demo section.
Video Samples
Caption: A propeller plane flies over a snowy landscape, its engine roaring.
Caption: A man in a yellow shirt punches focus mitts held by a trainer in a black shirt.
Caption: A large audience in a formal hall gives a standing ovation with applause.
BibTeX
@misc{hadjeres2026,
title={Woosh: A Sound Effects Foundation Model},
author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrici and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
year={2026},
eprint={2412.15322},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.15322},
}