A unified model for high-quality audio and music generation with flexible natural language control and seamless processing of various modalities.
Generate audio from text, video, image, music, and audio inputs
Produces high-quality audio and music with natural language control
Single model for both general audio and music generation tasks
Multi-modal masked training strategy for robust cross-modal representations
Generate audio from natural language descriptions
Create synchronized audio for video content
Generate music with specific instruments and styles