Google’s Gemini promotional video, released on Wednesday, has sparked controversy among AI experts.
The video seemingly depicts Google’s new AI model recognizing visual cues and engaging in real-time vocal interactions with a person.
However, Parmy Olson’s report for Bloomberg reveals that Google has acknowledged the video’s deception. In reality, the researchers fed still images into the model and edited together the successful responses, thereby partially misrepresenting the model’s actual capabilities.
A Google spokesperson said the demo was created by capturing footage in order to test Gemini’s capabilities on a variety of challenges.
They explained that Gemini was prompted with still image frames taken from that footage, along with text. Google filmed human hands performing activities and then showed the resulting still images to Gemini Ultra one at a time.
Researchers at Google interacted with the model via text, selected the best interactions, and edited them with voice synthesis to produce the video.
Currently, processing even still images and text with large language models requires significant computational power, which makes real-time video interpretation impractical. This was one of the first clues that led AI experts to suspect the video was misleading.
Olson tweeted that Google’s video gave the impression that you could show Gemini Ultra different things in real time and carry on a conversation with it. However, that is not the case.
According to a Google spokesperson, the user’s voiceover in the video consists of authentic excerpts from the prompts that were used to generate the Gemini output.
In the video titled ‘Interacting with multimodal AI: A closer look at Gemini,’ we are presented with a visual representation of the AI model’s perspective, while its responses are displayed on the right side of the screen.
During the demonstration, the researcher uses squiggly lines and ducks as visual prompts, engaging Gemini in a conversation about what it perceives. Throughout the video, we also hear Gemini Ultra’s voice providing responses to the queries.
Olson, in her Bloomberg article, notes that the video never discloses that the recognition demo used Gemini Ultra, a model that is not yet available to the public.
“This lack of transparency suggests a larger marketing strategy: Google aims to remind us of its extensive team of AI researchers and unparalleled access to data,” Olson wrote.
On a Google blog page, Gemini’s image recognition abilities are represented more accurately. They appear to be roughly on par with OpenAI’s multimodal GPT-4V (GPT-4 with vision) model, which can recognize the content of still images.
However, when the successful responses were seamlessly edited together for promotional purposes, the video created the impression that Google’s Gemini model is more capable than it actually is, generating heightened excitement among many viewers.
Chris Anderson, the curator of TED, expressed his fascination with the demo, tweeting, “The implications of this demo are truly thought-provoking. It’s not far-fetched to speculate that in the near future, a developing Gemini 2.0 could actively participate in a board meeting.
It would be capable of analyzing briefing documents, reviewing slides, comprehending everyone’s statements, and providing intelligent contributions to the discussions. Shouldn’t this be considered as a significant step towards achieving AGI?”
Grady Booch, a pioneering software engineer, responded by saying, “The demo was heavily edited to create the impression that Gemini is much more capable than it actually is. Chris, you have been misled, and it is disappointing that they would resort to such tactics.”