I’ve just stumbled across the new LLaVA AI model (Large Language and Vision Assistant), which is specifically designed for visual and language understanding. It has many applications, from describing and interpreting images to generating scientific diagrams from text, and it uses GPT-4 to generate multimodal language-image instruction-following data.
From a #GLAM perspective, I wanted to see how it handled the description of historical artworks.
I first gave it a landscape image with plenty of elements, like people, ships, water, etc. When asked what appeared in the image, the AI returned the following:
“In the image, there are several boats, a horse-drawn carriage, and people. The boats are docked in a harbor, and the horse-drawn carriage is parked nearby. The people are walking around and interacting with the boats and the carriage.”
This seems like a good start. There are plenty of things it has missed, though, such as the buildings, the river, and the flock of birds. When pressed on the issue, the model still didn’t find these elements:
“Based on the image description, the image does not contain any wildlife or architectural elements. It primarily focuses on the boats, the horse-drawn carriage, and the people in the harbour setting.”
Next, I wanted to see if I could identify image elements in a more machine-readable way, so I asked the model to return #Wikidata IDs for elements within the image. It returned the following:
Vessel (Q128466)
Ship (Q128467)
Boat (Q128468)
Horse-drawn carriage (Q128469)
Human (Q5)
Person (Q118)
At first glance this was exciting: the technology gives us the ability to tag basic elements that feature in the image. However, on inspection, all but one of the Wikidata IDs are incorrect and in fact refer to entirely different entities. So it has done well at identifying things but failed to align them with Wikidata. And much like ChatGPT’s habit of fabricating citations, it has opted to fake it rather than admit its limitations.
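Mismatches like this are easy to check programmatically. Below is a minimal Python sketch that fetches the actual English label for each QID the model returned, using Wikidata’s public wbgetentities API, and compares it with the model’s claim. The candidate list is copied from the output above; the script and its matching logic are my own illustration, not anything built into LLaVA.

```python
import requests

# Candidate QIDs and labels as claimed by LLaVA (copied from the list above)
candidate_labels = {
    "Q128466": "Vessel",
    "Q128467": "Ship",
    "Q128468": "Boat",
    "Q128469": "Horse-drawn carriage",
    "Q5": "Human",
    "Q118": "Person",
}

# Wikidata's public API can fetch the real English labels
# for a batch of item IDs in a single request.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "ids": "|".join(candidate_labels),
        "props": "labels",
        "languages": "en",
        "format": "json",
    },
    headers={"User-Agent": "qid-check/0.1 (illustrative script)"},
)
entities = resp.json().get("entities", {})

for qid, claimed in candidate_labels.items():
    entity = entities.get(qid, {})
    actual = entity.get("labels", {}).get("en", {}).get("value", "(no English label)")
    verdict = "OK" if claimed.lower() == actual.lower() else "MISMATCH"
    print(f"{qid}: model said '{claimed}', Wikidata says '{actual}' -> {verdict}")
```

Incidentally, the suspiciously consecutive numbering of Q128466 through Q128469 is itself a red flag: genuine Wikidata IDs for such unrelated concepts rarely run in sequence.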
Next up I wanted to see how LLaVA handled a portrait.
I started by asking for a description of the image:
“The image features a portrait of a woman dressed in a black and red outfit with a hat, posing for the picture. She has a mustache and wears a pearl necklace. The woman is in a Victorian style and is likely a part of a portrait series or a formal portrait session.”
Where do I begin! I would describe the outfit as black and pink, rather than red. She clearly does not have a mustache and is not dressed in Victorian style; the painting dates from 1755. Oh, and do you see a pearl necklace?
Interestingly, the model was able to correctly identify that this is an oil painting on canvas; however, the Wikidata ID request again resulted in a long list of fake IDs. Regenerating the response also produced a range of equally incorrect interpretations of the image.
I remain excited about AI as a tool for this kind of work: to create rich structured data, to provide contextual text descriptions, and even to answer on the fly the questions our users may have about our artworks. However, it is clear from just a few minutes with this latest model that we are not there yet.
Jason Evans