Imagine a radiologist examining a chest X-ray from a new patient. She notices that the patient has tissue swelling but does not have an enlarged heart. To speed up the diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.
But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac-related, but with no enlarged heart there could be several underlying causes.
In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they do not understand negation: words like "no" and "not" that specify what is false or absent.
"Those negation words can have a very significant impact, and if we just use these models blindly, we may run into catastrophic consequences," says Kumail Alhamoud, an MIT graduate student and lead author of the study.
The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than random guessing. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.
They show that retraining a vision-language model with this dataset improves performance when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.
But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.
"This is a technical paper, but there are larger issues to consider. If something as fundamental as negation is broken, we shouldn't be using these models in many of the ways we are using them now, without intensive evaluation," says senior author Marzyeh Ghassemi, a professor in the Department of Electrical Engineering and Computer Science (EECS).
Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, a graduate student at MIT; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
Neglecting negation
Vision-language models (VLMs) are trained on huge collections of images and their corresponding captions, which they learn to encode as sets of numbers called vector representations. The models use these vectors to distinguish between different images.
A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
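To make the dual-encoder setup concrete, here is a minimal sketch using the openly available CLIP model through the Hugging Face Transformers library. The checkpoint name, example image, and captions are illustrative assumptions, not the specific models or data from the study.

```python
# Minimal dual-encoder sketch with CLIP (illustrative checkpoint, not the
# specific models evaluated in the study).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # hypothetical example image
captions = [
    "tissue swelling with an enlarged heart",
    "tissue swelling with no enlarged heart",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# The image encoder and the text encoder each produce a vector; matching
# image-caption pairs are trained to have similar vectors.
print(out.image_embeds.shape)  # (1, 512): one vector for the image
print(out.text_embeds.shape)   # (2, 512): one vector per caption
print(out.logits_per_image.softmax(dim=-1))  # similarity-based match scores
```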
"The captions express what is in the images; they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying 'a dog jumping over a fence, with no helicopters,'" Ghassemi says.
Because image-caption datasets don't contain examples of negation, VLMs never learn to identify it.
To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.
For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects that are not in an image and write them into the caption. Then they tested the models by prompting them with negation words to retrieve images that contain certain objects but not others.
For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that does not appear in the image, or by negating an object that does appear in the image.
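As a rough illustration of how such a multiple-choice check can be scored (the image, candidate captions, and "intended" answer below are hypothetical and not the paper's benchmark), a CLIP-style model can be asked to pick the caption whose embedding best matches the image:

```python
# Hypothetical multiple-choice negation check with a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_caption(image_path: str, candidates: list[str]) -> int:
    """Return the index of the caption the model scores highest for the image."""
    image = Image.open(image_path)
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)
    return int(logits.argmax(dim=-1))

# Options differ only in an added or negated object, mirroring the task design.
options = [
    "a dog jumping over a fence next to a cat",  # adds an object not in the image
    "a dog jumping over a fence, with no cat",   # intended answer: negates an absent object
    "a fence with no dog jumping over it",       # negates an object that is present
]
choice = pick_caption("dog.jpg", options)
print(options[choice])  # a negation-blind model often ignores the "no"
```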
The models often failed at both tasks, with negated captions reducing image retrieval performance by about 25 percent. When it came to answering multiple-choice questions, the best models achieved only about 39 percent accuracy, with several models performing at or below random chance.
One reason for this failure is a shortcut the researchers call affirmation bias: VLMs ignore negation words and instead focus on the objects in the images.
"This does not just happen for words like 'no' and 'not.' Regardless of how you express negation or exclusion, the models will simply ignore it," Alhamoud says.
This was consistent across every VLM they tested.
“A solvable problem”
Because VLMs are not typically trained on image captions that contain negation, the researchers developed datasets with negation words as a first step toward solving the problem.
Using a dataset of 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
They had to be especially careful that these synthetic captions still read naturally, or the VLMs could fail in the real world when faced with more complex captions written by humans.
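A hedged sketch of what such LLM-driven re-captioning could look like follows; the prompt wording, model name, and output handling are assumptions made for illustration, not the study's actual pipeline.

```python
# Assumed sketch of LLM re-captioning to inject negation words.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def negated_caption(original_caption: str) -> str:
    prompt = (
        f"Here is an image caption: '{original_caption}'. "
        "Name one object that is plausibly related to the scene but not "
        "mentioned, then rewrite the caption so it naturally states that this "
        "object is absent (using a word like 'no', 'without', or 'not'). "
        "Return only the rewritten caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(negated_caption("a dog jumping over a fence"))
# e.g. "a dog jumping over a fence, with no people nearby"
```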
They found that finetuning VLMs with their dataset led to performance gains across the board. It improved the models' image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.
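For readers curious what such finetuning might involve, below is a minimal sketch of a standard CLIP-style contrastive training step on image-caption pairs that include negation words. The dataset handling, hyperparameters, and model choice are placeholders, not the researchers' recipe.

```python
# Assumed sketch of a CLIP-style contrastive finetuning step.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(images, captions_with_negation):
    """One contrastive step: matched image-caption pairs lie on the diagonal."""
    inputs = processor(text=captions_with_negation, images=images,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    labels = torch.arange(len(captions_with_negation))
    # Symmetric cross-entropy over image-to-text and text-to-image similarities.
    loss = (F.cross_entropy(out.logits_per_image, labels) +
            F.cross_entropy(out.logits_per_text, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```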
"But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven't even touched how these models work, but we hope this is a signal that this is a solvable problem and that others can take our solution and improve it," Alhamoud says.
At the same time, he hopes this work encourages more users to think carefully about the problem they want a VLM to solve, and to design some test examples before deploying it.
In the future, researchers could expand on this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specialized applications, such as health care.