Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion

 

Proposed Dual-Conditioned Latent Diffusion Model

Figure 1. Dual-conditioned Latent Diffusion Model Architecture.

 

Abstract:

Diffusion models underpin the majority of text-to-audio (TTA) generation approaches. Some recent diffusion-based TTA methods use a large text encoder to encode the textual description of the audio to be generated; this encoding acts as a semantic condition that guides the audio generation. In this work, we build on the widely used diffusion model architecture and integrate an additional acoustic condition into the diffusion process. We adopt an auto-regressive generative audio language model that generates audio tokens conditioned on the text input; these audio tokens are then converted into acoustic latent features, which serve as the additional condition representation to improve the audio generation outcome. Compared to baseline systems, our dual-conditioned latent diffusion approach delivers better results on most metrics and remains comparable on the remaining metrics on the AudioCaps test set.
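To make the idea of two conditions concrete, the following is a minimal NumPy sketch of a single denoiser call that attends to both a semantic condition and an acoustic condition. It is an illustration only, not the model described above: the fusion strategy (concatenating the two condition sequences and applying cross-attention), the function names `cross_attention` and `dual_conditioned_denoise_step`, and all tensor shapes are assumptions, and random arrays stand in for the real text-encoder outputs and audio-language-model latents.

```python
import numpy as np

def cross_attention(query, context):
    """Scaled dot-product attention: each latent frame attends to the condition vectors."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)        # (frames, cond_len)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the condition axis
    return weights @ context                        # (frames, d)

def dual_conditioned_denoise_step(noisy_latent, semantic_cond, acoustic_cond):
    """Toy stand-in for one denoiser call that sees both conditions.

    Assumption: the two conditions are fused by concatenation along the
    sequence axis; the actual fusion mechanism is not specified here.
    """
    context = np.concatenate([semantic_cond, acoustic_cond], axis=0)
    return noisy_latent + cross_attention(noisy_latent, context)

rng = np.random.default_rng(0)
d = 8
noisy_latent = rng.standard_normal((16, d))   # 16 latent frames (hypothetical size)
semantic_cond = rng.standard_normal((5, d))   # stand-in for text-encoder outputs
acoustic_cond = rng.standard_normal((12, d))  # stand-in for latents derived from audio tokens

out = dual_conditioned_denoise_step(noisy_latent, semantic_cond, acoustic_cond)
print(out.shape)  # (16, 8)
```

In this sketch, dropping `acoustic_cond` from the concatenation would reduce the step to a text-only conditioned call, which is one way to picture the difference from a single-condition baseline.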

 

Audio samples generated by the proposed dual-conditioned latent diffusion model (DC-LDM), alongside AudioGen and TANGO samples for the same captions.

Caption: A cat is meowing.

DC-LDM

AudioGen

TANGO

Caption: A helicopter flying followed by wind heavily blowing into a microphone.

DC-LDM

AudioGen

TANGO

Caption: A series of doors sliding open as gusts of wind blows and glass clanks.

DC-LDM

AudioGen

TANGO

Caption: A bus engine running followed by a bus horn honking.

DC-LDM

AudioGen

TANGO

Caption: Clattering of a train is ongoing, a railroad crossing bell rings, and a train horn blows.

DC-LDM

AudioGen

TANGO

Caption: A large dog barks.

DC-LDM

AudioGen

TANGO

Caption: A person snoring.

DC-LDM

AudioGen

TANGO

Caption: Music in the background as a woman speaks and food fries.

DC-LDM

AudioGen

TANGO

Caption: Footsteps and scuffing occur, after which a door grinds, squeaks and clicks, an adult male speaks, and the door grinds, squeaks and clicks shut with a thump.

DC-LDM

AudioGen

TANGO

Caption: Food and oil sizzling followed by steam hissing as a man talks and light rock music plays in the background.

DC-LDM

AudioGen

TANGO

Caption: Tick-tocking by a clock.

DC-LDM

AudioGen

TANGO

Caption: Tires squeal and an engine hums as a vehicle approaches.

DC-LDM

AudioGen

TANGO

Caption: Dishes clanking as a man is talking.

DC-LDM

AudioGen

TANGO

Caption: A toilet is flushed.

DC-LDM

AudioGen

TANGO

Caption: Singing and music with faint gunshots.

DC-LDM

AudioGen

TANGO

Caption: A person whistling.

DC-LDM

AudioGen

TANGO

Caption: A man speaking followed by a crowd of people laughing then applauding.

DC-LDM

AudioGen

TANGO

Caption: A duck quacking.

DC-LDM

AudioGen

TANGO

Caption: Wind blowing and a siren rings.

DC-LDM

AudioGen

TANGO

Caption: The wind is blowing, and a person is whistling a tune.

DC-LDM

AudioGen

TANGO