AudioX architecture. This figure depicts the underlying architecture of AudioX, highlighting its diffusion transformer framework and the new multimodal masking strategy, which enables unified representation learning across the text, video and audio modalities. Credit: Tian et al.
In recent years, computer scientists have created a variety of highly capable machine learning tools for generating text, images, videos, songs and other content. Most of these computational models are designed to create content based on textual instructions provided by users.
Researchers at the Hong Kong University of Science and Technology recently introduced AudioX, a model that can generate high-quality audio and music tracks using text, video footage, images, music and audio recordings as inputs. Their model, presented in a paper published on the arXiv preprint server, relies on a diffusion transformer, an advanced machine learning algorithm that uses the so-called transformer architecture to generate content by gradually de-noising the input data it receives.
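The diffusion part of this design can be pictured as an iterative de-noising loop. The sketch below is not the authors' code; it is a minimal illustration, with an assumed placeholder denoiser and toy shapes, of how a diffusion-style sampler turns random noise into a signal while following some conditioning guidance.

```python
# Minimal illustration (not AudioX's actual sampler) of diffusion-style generation:
# start from random noise and repeatedly apply a learned denoiser, conditioned on
# a guidance signal, until an approximately clean sample remains.
import torch

def generate(denoiser, condition, steps=50, shape=(1, 16000)):
    x = torch.randn(shape)                        # pure Gaussian noise
    for t in reversed(range(steps)):
        t_frac = torch.full((shape[0],), t / steps)
        predicted_noise = denoiser(x, t_frac, condition)
        x = x - predicted_noise / steps           # crude update; real samplers use a noise schedule
    return x

# Toy usage with a dummy denoiser that ignores its conditioning:
dummy_denoiser = lambda x, t, cond: 0.1 * x
waveform = generate(dummy_denoiser, condition=None)
```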
“Our research stems from a fundamental question in artificial intelligence: how can intelligent systems achieve unified cross-modal understanding and generation?” Wei Xue, corresponding author of the paper, told Tech Xplore. “Human creation is a seamlessly integrated process, where information from different sensory channels is naturally fused by the brain. Traditional systems have often relied on specialized models, failing to capture and fuse these intrinsic connections between modalities.”
The main objective of the recent study led by Wei Xue, Yike Guo and their colleagues was to develop a unified representation learning framework. This framework would allow a single model to process information across different modalities (i.e., text, images, videos and audio tracks), instead of combining separate models that can each only process one specific type of data.
“We aim to enable AI systems to form cross-modal conceptual networks similar to the human brain,” said Xue. “AudioX, the model we have created, represents a paradigm shift, aimed at tackling the dual challenge of conceptual and temporal alignment. In other words, it is designed to resolve both conceptual alignment and temporal alignment.”
The new diffusion transformer-based model developed by the researchers can generate high-quality audio or music tracks using any input data as guidance. This ability to convert “anything” into audio opens new possibilities for the entertainment industry and creative professions, for example by allowing users to create music that matches a specific visual scene, or to use a combination of inputs (e.g., text and video) to guide the generation of the desired tracks.
“AudioX is built on a diffusion transformer architecture, but what distinguishes it is its multimodal masking strategy,” said Xue. “This strategy fundamentally reinvents how machines learn to understand the relationships between different types of information.
“By masking elements across the input modalities during training (that is, by selectively removing patches of video frames, text tokens or audio segments) and training the model to recover the missing information from the other modalities, we create a unified representation space.”
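To make the masking idea concrete, here is a minimal, hypothetical PyTorch sketch; the module names, sizes and loss are illustrative assumptions, not the authors' implementation. Tokens from the different modalities are concatenated, a random subset is replaced with a learned mask embedding, and a shared transformer is trained to reconstruct the hidden tokens from the remaining context.

```python
import torch
import torch.nn as nn

class MaskedMultimodalModel(nn.Module):
    """Toy stand-in for a unified multimodal backbone trained with masking."""
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # learned "hidden" embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)  # predicts the original token embedding

    def forward(self, tokens, mask):
        # tokens: (batch, seq, dim) -- text, video and audio tokens concatenated
        # mask:   (batch, seq) boolean, True where a token is hidden from the model
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.head(self.backbone(x))

# Toy training step: hide 30% of the joint sequence and reconstruct it.
model = MaskedMultimodalModel()
tokens = torch.randn(2, 48, 256)      # stand-in for fused text/video/audio tokens
mask = torch.rand(2, 48) < 0.3        # which tokens to mask out
reconstruction = model(tokens, mask)
loss = (reconstruction - tokens)[mask].pow(2).mean()
loss.backward()
```

In this kind of setup, the reconstruction objective forces the shared backbone to pull information from whichever modalities remain visible, which is the intuition behind the unified representation space described above.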
Overview of AudioX's capabilities. This diagram illustrates AudioX's versatile capabilities across several tasks, including text-to-audio, video-to-audio, audio inpainting, text-to-music, video-to-music and music completion. The model demonstrates strong performance in generating contextually appropriate audio for various inputs. Credit: Tian et al.
AudioX is one of the first models to combine linguistic descriptions, visual scenes and audio patterns, capturing both the semantic meaning and the rhythmic structure of these multimodal data. Its unique design allows it to establish associations between different types of data, much like the human brain integrates information picked up by the different senses (i.e., vision, hearing, taste, smell and touch).
“AudioX is by far the most comprehensive any-to-audio foundation model, with several key advantages,” said Xue. “First, it is a unified framework supporting highly diverse tasks within a single model architecture. It also enables cross-modal integration via our multimodal masked training strategy, creating a unified representation space. And it offers versatile generation capabilities, as it can handle both general audio and music generation with high quality, including through our newly curated datasets.”
In initial tests, the new model created by Xue and his colleagues was found to produce high-quality audio and music tracks, successfully incorporating text, videos, images and audio. Its most remarkable characteristic is that it does not combine different models, but rather uses a single diffusion transformer to process and integrate different types of inputs.
“AudioX supports various tasks within one architecture, ranging from text- and video-to-audio generation to audio inpainting and music completion, advancing beyond systems that typically excel only at specific tasks,” said Xue. “The model could have various potential applications, spanning film production, content creation and gaming.”
Qualitative comparison across various tasks. Credit: arXiv (2025). DOI: 10.48550/arXiv.2503.10522
AudioX could soon be further improved and deployed in a wide range of settings. For example, it could help creative professionals in the production of films, animations and content for social media.
“Imagine a filmmaker who no longer needs a Foley artist for every scene,” said Xue. “AudioX could automatically generate footsteps in the snow, creaking doors or rustling leaves based only on the visual footage. Likewise, it could be used by influencers to instantly add the perfect background music to their TikTok dance videos, or by YouTubers to enhance their travel vlogs with authentic local sounds, all generated on demand.”
In the future, AudioX could also be used by video game developers to create immersive and adaptive games, in which background sounds dynamically adapt to the players' actions. For example, as a character moves from a concrete floor onto grass, the sound of their footsteps could change, or the game's soundtrack could gradually become more tense as they approach a threat or an enemy.
“Our next planned steps include extending AudioX to long-form audio generation,” added Xue. “In addition, rather than just learning associations from multimodal data, we hope to integrate human aesthetic understanding into a reinforcement learning framework to better align with subjective preferences.”
More information:
Zeyue Tian et al, AudioX: Diffusion Transformer for Anything-to-Audio Generation, arXiv (2025). DOI: 10.48550/arXiv.2503.10522
© 2025 Science X Network