Woosh: A Sound Effects Foundation Model

Gaetan Hadjeres1, Marc Ferras1, Khaled Koutini1, Benno Weck1, Alexandre Bittar1, Thomas Hummel1, Zineb Lahrichi1, Hakim Missoum1, Joan Serrà1, Yuki Mitsufuji1,2
1Sony AI
2Sony Group Corporation

Video-to-Audio Generation Demos

We compare our models against baselines on several benchmarks. "OV" marks a model conditioned on the video only, without the caption.

FoleyBench

Caption: A propeller plane flies low over the water while firing its guns.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: Gentle waves transitioning to the muffled, bubbling sound of being underwater.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: The moon reflects on gently rippling water at night.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: Two figures in costumes walk down a basement hallway, their footsteps echoing on the concrete floor.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A person uses a power impact wrench to loosen a large bolt on a vehicle's suspension assembly.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A propeller plane flies over a snowy landscape, its engine roaring.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A man in a yellow shirt punches focus mitts held by a trainer in a black shirt.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A person fires a handgun at metal targets at an outdoor range.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A marker scratches against paper as it is used to color a drawing of a hummingbird.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A large audience in a formal hall gives a standing ovation with applause.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A character stands in a dark forest and sobs.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

OGameData

Caption: A horse gallops at a steady pace, its hooves striking the ground rhythmically, while a person shouts in a high-pitched, urgent tone.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A continuous, low-frequency hum is present throughout the recording, accompanied by faint, intermittent high-pitched chirping sounds resembling distant birds.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A series of footsteps on a soft surface are followed by a low, sustained electronic tone.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A vehicle engine starts and accelerates, accompanied by the sound of tires rolling on pavement.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A high-pitched buzzing sound, characteristic of a bee or wasp, is heard intermittently alongside the soft rustling of movement through dry leaves or grass.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A vehicle engine starts and idles, followed by the sound of a siren approaching and then fading into the distance.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: A rhythmic thumping sound, like a heartbeat, is accompanied by the sound of footsteps on a hard surface, with a low-frequency hum in the background.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

VGGSound

Caption: car engine idling

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: alligators, crocodiles hissing

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: parrot talking

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: ambulance siren

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: whale calling

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: dog baying

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

Caption: playing tympani

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow
MM-Audio OV
Woosh_VFlow OV
Woosh_DVFlow OV

VGGSound (recaptioned)

Caption: A male voice speaks in Russian with a neutral tone, accompanied by the low rumble of a vehicle engine and a faint, high-pitched electronic beep.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A vehicle engine revs loudly as it accelerates, accompanied by the high-pitched squeal of tires losing traction on a surface, while a male voice shouts in distress.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A series of high-pitched, rapid squeaks and chirps, characteristic of a cartoonish or animated creature, are heard in quick succession.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A high-pitched, warbling electronic melody plays in a rapid, repetitive pattern, accompanied by a low, sustained electronic hum in the background.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A person shouts in excitement, followed by a woman speaking, all within a digital environment accompanied by a low-frequency hum.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A dog barks repeatedly in a series of short, sharp bursts, with faint, indistinct human speech audible in the background.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

Caption: A dramatic orchestral score with prominent timpani drums and a soaring brass melody plays, creating a tense and epic atmosphere.

Ground Truth
MM-Audio
Woosh_VFlow
Woosh_DVFlow

BibTeX

@misc{hadjeres2026,
   title={Woosh: A Sound Effects Foundation Model},
   author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
   year={2026},
   eprint={2412.15322},
   archivePrefix={arXiv},
   primaryClass={cs.CV},
   url={https://arxiv.org/abs/2412.15322},
}