Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion

Proposed Dual-Conditioned Latent Diffusion Model

Figure 1. Dual-conditioned Latent Diffusion Model Architecture.

Abstract:

Diffusion models power the majority of text-to-audio (TTA) generation approaches. Some recent diffusion-based TTA methods use a large text encoder to encode the textual description of the audio to be generated, which acts as a semantic condition to guide the audio generation. In this work, we build on the widely used diffusion model architecture and integrate an additional acoustic condition into the diffusion process. We adopt an auto-regressive generative audio language model that generates audio tokens conditioned on text inputs; these audio tokens are then converted into acoustic latent features, which serve as the additional condition representation to improve the audio generation outcome. Compared with baseline systems, our dual-conditioned latent diffusion approach delivers better results on most metrics and remains comparable on the rest on the AudioCaps test set.
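To make the dual-conditioning idea concrete, the following is a minimal toy sketch and not the implementation behind DC-LDM. It assumes that the semantic condition (text embeddings) and the acoustic condition (latent features derived from audio tokens) share one feature dimension and are joined along the token axis before a cross-attention step. The abstract does not specify how the two conditions are fused, so this fusion choice, the function names `dual_condition` and `cross_attention`, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Toy single-head dot-product attention without learned projections.

    Each query row attends over every context row and returns a weighted
    average of the context rows.
    """
    dim = len(context[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def dual_condition(text_embeddings, acoustic_latents):
    """ASSUMPTION: fuse both conditions by concatenating along the token axis."""
    if len(text_embeddings[0]) != len(acoustic_latents[0]):
        raise ValueError("both conditions must share the same feature dimension")
    return text_embeddings + acoustic_latents


# Hypothetical toy inputs: 2 text tokens, 1 acoustic latent frame, 3 noisy latent rows.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # stands in for text encoder output
acoustic_latents = [[0.5, 0.5]]              # stands in for audio-LM-derived features
noisy_latent = [[0.2, -0.1], [0.0, 0.3], [1.0, 1.0]]

context = dual_condition(text_embeddings, acoustic_latents)
attended = cross_attention(noisy_latent, context)
```

In this sketch the denoiser's latent rows can draw on both the text tokens and the acoustic frames in a single attention pass; a real system would add learned projections, multiple heads, and a noise schedule.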

Audio samples generated by the proposed dual-conditioned diffusion model (DC-LDM), alongside AudioGen and TANGO baselines.

Caption: A cat is meowing.

DC-LDM

AudioGen

TANGO

Caption: A helicopter flying followed by wind heavily blowing into a microphone.

DC-LDM

AudioGen

TANGO

Caption: A series of doors sliding open as gusts of wind blows and glass clanks.

DC-LDM

AudioGen

TANGO

Caption: A bus engine running followed by a bus horn honking.

DC-LDM

AudioGen

TANGO

Caption: Clattering of a train is ongoing, a railroad crossing bell rings, and a train horn blows.

DC-LDM

AudioGen

TANGO

Caption: A large dog barks.

DC-LDM

AudioGen

TANGO

Caption: A person snoring.

DC-LDM

AudioGen

TANGO

Caption: Music in the background as a woman speaks and food fries.

DC-LDM

AudioGen

TANGO

Caption: Footsteps and scuffing occur, after which a door grinds, squeaks and clicks, an adult male speaks, and the door grinds, squeaks and clicks shut with a thump.