
Nvidia’s new AI model makes music from text and audio prompts


Nvidia has released a new generative audio AI model capable of creating a wide range of sounds, music, and even voices from simple text and audio prompts.

Dubbed Fugatto (short for Foundational Generative Audio Transformer Opus 1), the model can, for example, create jingles and song snippets based solely on text prompts, add or remove instruments and vocals from existing tracks, modify both the accent and emotion of a voice, and “even let people produce sounds never heard before,” per Monday’s announcement post.


“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at Nvidia. “Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.”

The company notes that music producers could use the AI model to rapidly prototype and vet song ideas in various musical styles and arrangements, or add effects and additional layers to existing tracks. The model could also be leveraged to adapt and localize the music and voiceovers of an existing ad campaign, or adjust a video game’s music on the fly as the player moves through a level.

The model is even capable of generating previously unheard sounds like barking trumpets or meowing saxophones. In doing so, it uses a technique called ComposableART to combine the instructions it learned during training.

“I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one,” Nvidia AI researcher Rohan Badlani wrote in the announcement post. “In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist.”

The Fugatto model itself uses 2.5 billion parameters and was trained on 32 H100 GPUs. Audio AI models like this are becoming increasingly common. Stability AI unveiled a similar system in April that can generate tracks up to three minutes in length, while Google’s V2A model can generate “an unlimited number of soundtracks for any video input.”

YouTube recently released an AI music remixer that generates a 30-second sample based on the input song and the user’s text prompts. Even OpenAI is experimenting in this space, having released an AI tool in April that needs just 15 seconds of sample audio to fully clone a user’s voice and vocal patterns.

Andrew Tarantola