AudioX: Diffusion Transformer for Anything-to-Audio Generation

A unified diffusion transformer for high-quality audio and music generation, controlled through natural language and able to take text, video, image, music, and audio inputs.

Core Features

Multi-modal Generation

Generate audio from text, video, image, music, and audio inputs

High Quality Output

Produces high-quality audio and music with natural language control

Unified Architecture

Single model for both general audio and music generation tasks

Robust Training

Multi-modal masked training strategy for robust cross-modal representations
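As a rough illustration of what a multi-modal masked training step could look like, the sketch below randomly hides whole input modalities from a training example, so that a model would have to learn to generate from whatever subset remains. This is a sketch under stated assumptions, not AudioX's actual implementation: the function name `mask_modalities`, the `keep_prob` parameter, the modality keys, and the choice to replace masked inputs with zero vectors are all hypothetical.

```python
import random

# Hypothetical modality names; the real model's input keys may differ.
MODALITIES = ("text", "video", "audio")


def mask_modalities(example, keep_prob=0.5, rng=None):
    """Randomly mask whole modalities in one training example.

    `example` maps a modality name to its embedding (a list of floats).
    Masked modalities are replaced by zero vectors of the same length
    (an assumption made for this sketch). At least one modality is
    always kept so the example still carries some conditioning signal.
    Returns the masked example and the list of kept modality names.
    """
    rng = rng or random.Random()
    present = [m for m in MODALITIES if m in example]

    # Keep each modality independently with probability keep_prob.
    kept = [m for m in present if rng.random() < keep_prob]

    # Never mask everything: fall back to one randomly chosen modality.
    if not kept and present:
        kept = [rng.choice(present)]

    masked = {
        m: list(example[m]) if m in kept else [0.0] * len(example[m])
        for m in present
    }
    return masked, kept
```

Calling `mask_modalities(example, keep_prob=0.3)` once per training step would expose a model to many different combinations of available inputs, which is the general idea behind training for robustness to missing modalities.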

Use Cases

Text-to-Audio

Generate audio from natural language descriptions

Video-to-Audio

Create synchronized audio for video content

Music Generation

Generate music with specific instruments and styles
