AI learns how vision and sound are connected, without human intervention | MIT News

Humans naturally learn by making connections between sight and sound. For instance, we can watch someone playing the cello and recognize that the cellist's movements are generating the music we hear.

A new approach developed by researchers from MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.

In the longer term, this work could be used to improve robots' ability to understand real-world environments, where auditory and visual information are often closely connected.

Improving upon prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance its two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method can automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds upon a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.

They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.
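
To make one of those objectives concrete, the sketch below shows a standard contrastive alignment loss of the kind described above: matched audio and visual embeddings from the same clip are pulled together in a shared space, while mismatched pairs are pushed apart. It is a minimal illustration, not the team's implementation, and the shapes and temperature value are assumptions.

```python
# Minimal sketch of contrastive audio-visual alignment (illustrative only; not
# the authors' code). Matched audio/visual clips from the same video sit on the
# diagonal of the similarity matrix and are pulled together.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (batch, dim) embeddings of the same batch of videos."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # similarity of every audio to every visual
    targets = torch.arange(a.size(0))         # the i-th audio matches the i-th visual
    # symmetric cross-entropy: audio-to-visual and visual-to-audio directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for encoder outputs
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```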

But CAV-MAE treats the audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in just one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that occurs during just that frame.

“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” Araujo says.
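
A rough sketch of that windowing and pairing step might look like the following. The window length, feature shapes, and function names here are hypothetical; the point is only that each sampled video frame gets matched to the short slice of audio that co-occurs with it.

```python
# Illustrative sketch (assumed details, not the paper's code): split a clip's
# audio features into short windows along time and map each sampled video frame
# to the window covering the same moment, so the alignment is learned per frame
# rather than per clip.
import torch

def split_audio_into_windows(features, num_windows):
    """features: (time, freq). Returns (num_windows, time // num_windows, freq)."""
    time, freq = features.shape
    win = time // num_windows
    return features[: win * num_windows].reshape(num_windows, win, freq)

def frame_to_window(frame_times, clip_duration, num_windows):
    """Map each video-frame timestamp (seconds) to the index of its co-occurring audio window."""
    window_len = clip_duration / num_windows
    return [min(int(t / window_len), num_windows - 1) for t in frame_times]

spec = torch.randn(1000, 128)                  # e.g., 10 seconds of audio features
windows = split_audio_into_windows(spec, 10)   # one window per second -> (10, 100, 128)
indices = frame_to_window([0.5, 3.2, 9.9], clip_duration=10.0, num_windows=10)
# -> [0, 3, 9]: each frame trains against only the audio around that moment
```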

They also included architectural updates that help the model balance its two learning objectives.

Adding “wiggle room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.

In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model's learning ability.

These include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefitted overall performance,” Araujo adds.
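
One way to picture these dedicated tokens is as extra learnable vectors attached to the sequence of audio or visual patch tokens, with the outputs of the global tokens feeding the contrastive objective and the register tokens giving the reconstruction objective extra capacity. The class below is a hedged illustration under those assumptions, not the paper's architecture; the names and dimensions are invented for the example.

```python
# Toy encoder: [global | register | patch] tokens pass through a small transformer.
# Illustrative only; the real model's layout and sizes may differ.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, dim=256, num_global=1, num_register=4, depth=2, heads=4):
        super().__init__()
        self.num_global = num_global
        self.num_register = num_register
        # learnable extra tokens (zero-initialized here purely for simplicity)
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):                      # patch_tokens: (batch, n_patches, dim)
        b = patch_tokens.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        r = self.register_tokens.expand(b, -1, -1)
        x = self.encoder(torch.cat([g, r, patch_tokens], dim=1))
        global_out = x[:, : self.num_global].mean(dim=1)          # summary -> contrastive objective
        patch_out = x[:, self.num_global + self.num_register :]   # per-patch outputs -> reconstruction
        return global_out, patch_out

# toy usage: 8 clips, 196 visual (or audio) patch tokens each
enc = TokenAugmentedEncoder()
global_emb, patch_emb = enc(torch.randn(8, 196, 256))
```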

While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko says.

In the end, their enhancements improved the model's ability to retrieve videos based on an audio query and predict the class of an audio-visual scene, like a dog barking or an instrument playing.

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.
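
In practice, retrieving a video from an audio query with a model like this comes down to a nearest-neighbor search in the shared embedding space. A hedged sketch, with hypothetical variable names:

```python
# Illustrative retrieval sketch: embed the audio query and all candidate videos
# in the shared space, then rank videos by cosine similarity to the query.
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query_emb, video_embs, top_k=5):
    """audio_query_emb: (dim,); video_embs: (num_videos, dim). Returns indices of the top matches."""
    q = F.normalize(audio_query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                       # cosine similarity of each video to the query
    return torch.topk(scores, k=top_k).indices

# toy usage: a barking-dog audio clip should rank dog videos highest
ranked = retrieve_videos(torch.randn(256), torch.randn(100, 256), top_k=5)
```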

“Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward creating an audiovisual large language model.

This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
