Video-to-Audio Generation Demos
We compare our models against baselines on several benchmarks. OV denotes a model conditioned on the video only, without the caption.
FoleyBench
Caption: A propeller plane flies low over the water while firing its guns.
Caption: Gentle waves transitioning to the muffled, bubbling sound of being underwater.
Caption: The moon reflects on gently rippling water at night.
Caption: Two figures in costumes walk down a basement hallway, their footsteps echoing on the concrete floor.
Caption: A person uses a power impact wrench to loosen a large bolt on a vehicle's suspension assembly.
Caption: A propeller plane flies over a snowy landscape, its engine roaring.
Caption: A man in a yellow shirt punches focus mitts held by a trainer in a black shirt.
Caption: A person fires a handgun at metal targets at an outdoor range.
Caption: A marker scratches against paper as it is used to color a drawing of a hummingbird.
Caption: A large audience in a formal hall gives a standing ovation with applause.
Caption: A character stands in a dark forest and sobs.
OGameData
Caption: A horse gallops at a steady pace, its hooves striking the ground rhythmically, while a person shouts in a high-pitched, urgent tone.
Caption: A continuous, low-frequency hum is present throughout the recording, accompanied by faint, intermittent high-pitched chirping sounds resembling distant birds.
Caption: A series of footsteps on a soft surface are followed by a low, sustained electronic tone.
Caption: A vehicle engine starts and accelerates, accompanied by the sound of tires rolling on pavement.
Caption: A high-pitched buzzing sound, characteristic of a bee or wasp, is heard intermittently alongside the soft rustling of movement through dry leaves or grass.
Caption: A vehicle engine starts and idles, followed by the sound of a siren approaching and then fading into the distance.
Caption: A rhythmic thumping sound, like a heartbeat, is accompanied by the sound of footsteps on a hard surface, with a low-frequency hum in the background.
VGGSound
Caption: car engine idling
Caption: alligators, crocodiles hissing
Caption: parrot talking
Caption: ambulance siren
Caption: whale calling
Caption: dog baying
Caption: playing tympani
VGGSound (recaptioned)
Caption: A male voice speaks in Russian with a neutral tone, accompanied by the low rumble of a vehicle engine and a faint, high-pitched electronic beep.
Caption: A vehicle engine revs loudly as it accelerates, accompanied by the high-pitched squeal of tires losing traction on a surface, while a male voice shouts in distress.
Caption: A series of high-pitched, rapid squeaks and chirps, characteristic of a cartoonish or animated creature, are heard in quick succession.
Caption: A high-pitched, warbling electronic melody plays in a rapid, repetitive pattern, accompanied by a low, sustained electronic hum in the background.
Caption: A person shouts in excitement, followed by a woman speaking, all within a digital environment accompanied by a low-frequency hum.
Caption: A dog barks repeatedly in a series of short, sharp bursts, with faint, indistinct human speech audible in the background.
Caption: A dramatic orchestral score with prominent timpani drums and a soaring brass melody plays, creating a tense and epic atmosphere.
BibTeX
@misc{hadjeres2026,
title={Woosh: A Sound Effects Foundation Model},
author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrici and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
year={2026},
eprint={2412.15322},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.15322},
}