Gemini’s data-analyzing abilities aren’t as good as Google claims

3:30 PM PDT • June 29, 2024

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” like summarizing multiple hundred-page documents or searching across scenes in film footage.

But new research suggests that the models aren’t, in fact, very good at those things.

Two separate studies investigated how well Google’s Gemini models and others make sense out of an enormous amount of data — think “War and Peace”-length works. Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large datasets correctly; in one series of document-based tests, the models gave the right answer only 40%-50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and a co-author on one of the studies, told TechCrunch.

Gemini’s context window is lacking

A model’s context, or context window, refers to input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question — “Who won the 2020 U.S. presidential election?” — can serve as context, as can a movie script, show or audio clip. And as context windows grow, so does the size of the documents being fit into them.

The newest versions of Gemini can take in upward of 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.”) That’s equivalent to roughly 1.4 million words, two hours of video or 22 hours of audio — the largest context of any commercially available model.
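As a rough illustration of the arithmetic behind those figures, the sketch below derives the words-per-token ratio implied by the numbers above (2 million tokens for roughly 1.4 million words) and uses it to estimate how many tokens a text of a given length would need. The ratio is only an average taken from this one comparison; real tokenizers vary by language and content, and the function names are hypothetical, not part of any Gemini API.

```python
# Figures quoted above: 2 million tokens is roughly 1.4 million words.
CONTEXT_TOKENS = 2_000_000
CONTEXT_WORDS = 1_400_000

# Implied average ratio; an approximation, not a property of any real tokenizer.
WORDS_PER_TOKEN = CONTEXT_WORDS / CONTEXT_TOKENS


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words would use."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_tokens: int = CONTEXT_TOKENS) -> bool:
    """Return True if the estimated token count fits in the context window."""
    return estimate_tokens(word_count) <= context_tokens


if __name__ == "__main__":
    # Example: a 260,000-word book against a 1-million-token window.
    print(estimate_tokens(260_000))
    print(fits_in_context(260_000, context_tokens=1_000_000))
```

Under this approximation, fitting a document into the window is a simple size check; it says nothing about how well the model reasons over what it has taken in.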

In a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast — around 402 pages — for quotes containing jokes, and then find a scene in the telecast that looked similar to a pencil sketch.

VP of research at Google DeepMind Oriol Vinyals, who led the briefing, described the model as “magical.”

“[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,” he said.

That might have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on foreknowledge, and they peppered the statements with references to specific details and plot points that’d be impossible to comprehend without reading the books in their entirety.

Given a statement like “By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash — having ingested the relevant book — had to say whether the statement was true or false and explain their reasoning.

On one book around 260,000 words (~520 pages) in length, the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. Averaged across all the benchmark results, neither model managed more than a bit higher than random chance in question-answering accuracy.

“We’ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be solved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos — that is, search through and answer questions about the content in them.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
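The construction described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the function name, file names and the choice to sample distinct distractors are all assumptions made for the example. It places one target image at a random interior position so that distractors appear both before and after it.

```python
import random
from typing import List, Optional, Tuple


def build_slideshow(
    target: str,
    distractors: List[str],
    length: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Build a `length`-frame slideshow with `target` surrounded by distractors.

    Returns the ordered frames and the index where the target was placed.
    """
    if length < 3:
        raise ValueError("need at least 3 frames to have distractors on both sides")
    if len(distractors) < length - 1:
        raise ValueError("not enough distractor images")
    rng = rng or random.Random()
    frames = rng.sample(distractors, length - 1)  # distinct distractor frames
    position = rng.randrange(1, length - 1)       # interior slot only
    frames.insert(position, target)
    return frames, position


if __name__ == "__main__":
    pool = [f"distractor_{i}.jpg" for i in range(40)]  # hypothetical file names
    frames, pos = build_slideshow("cake.jpg", pool, length=25, rng=random.Random(0))
    print(len(frames), frames[pos])
```

The model would then be asked the question paired with the target image while being shown the whole sequence, so it has to locate the relevant frame before it can answer.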

Flash didn’t perform all that well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits.

“On real question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning — recognizing that a number is in a frame and reading it — might be what is breaking the model.”

Google is overpromising with Gemini

Neither of the studies has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Nevertheless, both lend weight to the charge that Google’s been overpromising, and under-delivering, with Gemini from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google’s the only model provider that’s given context window top billing in its advertisements.

“There’s nothing wrong with the simple claim, ‘Our model can take X number of tokens’ based on the objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”

Generative AI broadly speaking is coming under increased scrutiny as businesses (and investors) grow frustrated with the technology’s limitations.

In a pair of recent surveys from Boston Consulting Group, about half of the respondents — all C-suite executives — said that they don’t expect generative AI to bring about substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, generative AI dealmaking at the earliest stages has declined, plummeting 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google — which has raced, at times clumsily, to catch up to its generative AI rivals — was desperate to make Gemini’s context one of those differentiators.

But the bet was premature, it seems.

“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without the knowledge of how long context processing is implemented — and companies do not share these details — it is hard to say how realistic these claims are.”

Google didn’t respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to hyped-up claims around generative AI are better benchmarks and, in the same vein, greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (liberally cited by Google in its marketing materials), “needle in the haystack,” only measures a model’s ability to retrieve particular info, like names and numbers, from datasets, not to answer complex questions about that info.
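To make the limitation concrete, here is a minimal sketch of how a retrieval-only test of this kind might be set up. It is an assumption-laden illustration, not Google's or the researchers' implementation: the helper names, the filler text and the substring-based scoring are all hypothetical. A single planted fact is hidden in filler text, and the score only checks whether that fact comes back.

```python
import random
from typing import List, Optional, Tuple


def build_haystack(
    filler_sentences: List[str],
    needle: str,
    rng: Optional[random.Random] = None,
) -> Tuple[str, int]:
    """Insert `needle` at a random position among the filler sentences."""
    rng = rng or random.Random()
    position = rng.randrange(len(filler_sentences) + 1)
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences), position


def retrieval_correct(model_answer: str, expected: str) -> bool:
    """Score by substring match: did the answer contain the planted fact?"""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    filler = ["The weather was unremarkable that day."] * 1000
    haystack, pos = build_haystack(
        filler, "The secret number is 7481.", rng=random.Random(1)
    )
    # A real run would send `haystack` plus a question to a model;
    # the answer string here is a stand-in.
    print(retrieval_correct("The secret number is 7481.", "7481"))
```

A check like this rewards finding one sentence. It never requires the model to combine information from different parts of the input, which is the gap Saxon describes.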

“All scientists and most engineers using these models are essentially in agreement that our existing benchmark culture is broken,” Saxon said, “so it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a massive grain of salt.”

Updated 7/3: A previous version of this article stated that Gemini 1.5 Pro and 1.5 Flash’s accuracy was below random chance on the task of reasoning over long text. In fact, their accuracy was above random chance. We’ve made the correction. Google PR also sent links to studies that suggest Gemini’s long-context performance is stronger than implied here: Extended Multi-Doc QA, Video MME, longer queries subset on LMSYS, Ruler.
