Gemini’s data-analyzing abilities aren’t as good as Google claims

3:30 PM PDT • June 29, 2024

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” like summarizing multiple hundred-page documents or searching across scenes in film footage.

But new research suggests that the models aren’t, in fact, very good at those things.

Two separate studies investigated how well Google’s Gemini models and others make sense out of an enormous amount of data — think “War and Peace”-length works. Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large datasets correctly; in one series of document-based tests, the models gave the right answer only 40%-50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and a co-author on one of the studies, told TechCrunch.

Gemini’s context window is lacking

A model’s context, or context window, refers to input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question — “Who won the 2020 U.S. presidential election?” — can serve as context, as can a movie script, show or audio clip. And as context windows grow, so does the size of the documents being fit into them.

The newest versions of Gemini can take in upward of 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.”) That’s equivalent to roughly 1.4 million words, two hours of video or 22 hours of audio — the largest context of any commercially available model.
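For a rough sense of scale, the quoted figures imply about 0.7 words per token. The sketch below uses only numbers from this article (2 million tokens, 1.4 million words, and the 260,000-word book from one of the studies); the function names are illustrative, and real token counts vary by tokenizer and by text.

```python
# Back-of-the-envelope conversion using only figures quoted in the article.
CONTEXT_TOKENS = 2_000_000   # context size quoted for the newest Gemini versions
CONTEXT_WORDS = 1_400_000    # rough word equivalent quoted alongside it

# Implied average; an approximation, not a property of any specific tokenizer.
WORDS_PER_TOKEN = CONTEXT_WORDS / CONTEXT_TOKENS


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words would occupy."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_tokens: int) -> bool:
    """Check whether the estimated token count fits in a given context window."""
    return estimate_tokens(word_count) <= context_tokens


if __name__ == "__main__":
    book_words = 260_000  # length of the book used in one of the studies
    print(estimate_tokens(book_words))             # estimated tokens for the book
    print(fits_in_context(book_words, 1_000_000))  # against a 1-million-token window
```

By this estimate, a 260,000-word book takes up well under half of a 1-million-token window.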

In a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast — around 402 pages — for quotes containing jokes, and then find a scene in the telecast that looked similar to a pencil sketch.

VP of research at Google DeepMind Oriol Vinyals, who led the briefing, described the model as “magical.”

“[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,” he said.

That might have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on foreknowledge, and they peppered the statements with references to specific details and plot points that’d be impossible to comprehend without reading the books in their entirety.

Given a statement like “By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash — having ingested the relevant book — had to say whether the statement was true or false and explain their reasoning.

In a test on one book roughly 260,000 words (~520 pages) long, the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip is significantly better at answering questions about the book than Google’s latest machine learning model. Averaged across all the benchmark results, neither model managed to beat random chance in question-answering accuracy.
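One way to put a true/false score next to a coin flip is to ask how often pure guessing would do that badly or worse. The sketch below is only an illustration: the article does not say how many statements were tested on this book, so the count of 15 is a hypothetical value, chosen because 7 of 15 and 3 of 15 round to the quoted 46.7% and 20%.

```python
from math import comb


def prob_at_most(correct: int, total: int, p: float = 0.5) -> float:
    """Probability that a guesser with per-item success rate `p`
    gets `correct` or fewer answers right out of `total` items."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct + 1)
    )


# ASSUMPTION: the statement count is not given in the article; 15 is hypothetical.
N_STATEMENTS = 15
PRO_CORRECT = 7    # 7/15 rounds to 46.7%
FLASH_CORRECT = 3  # 3/15 is 20%

if __name__ == "__main__":
    print(prob_at_most(PRO_CORRECT, N_STATEMENTS))    # chance a coin scores this low or lower
    print(prob_at_most(FLASH_CORRECT, N_STATEMENTS))
```

A small value means a score that low is unlikely to come from guessing alone. With so few hypothetical items, the outcome depends heavily on the real number of statements.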

“We’ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be solved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos — that is, search through and answer questions about the content in them.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
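The construction described above can be sketched in a few lines. This is a guess at the general shape, not the researchers' code: the function name, the file names, and the rule that the target never sits first or last (so that distractors always appear both before and after it) are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def build_slideshow(
    target: str, distractors: Sequence[str], length: int, seed: int = 0
) -> Tuple[List[str], int]:
    """Return (frames, target_index): `length` frames in which `target`
    is surrounded by randomly chosen distractor images."""
    if length < 3:
        raise ValueError("need at least 3 frames to place distractors on both sides")
    if len(distractors) < length - 1:
        raise ValueError("not enough distractor images")
    rng = random.Random(seed)
    frames = rng.sample(list(distractors), length - 1)
    # ASSUMPTION: keep the target away from the first and last positions.
    target_index = rng.randrange(1, length - 1)
    frames.insert(target_index, target)
    return frames, target_index


if __name__ == "__main__":
    pool = [f"distractor_{i}.png" for i in range(40)]  # hypothetical file names
    frames, idx = build_slideshow("birthday_cake.png", pool, length=25)
    print(len(frames), idx, frames[idx])
```

The model would then be shown all the frames and asked the question that belongs to the target image, so a correct answer requires first finding the relevant frame.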

Flash didn’t perform all that well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits.

“On real question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning — recognizing that a number is in a frame and reading it — might be what is breaking the model.”

Google is overpromising with Gemini

Neither of the studies has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Nevertheless, both add fuel to the criticism that Google has been overpromising, and under-delivering, with Gemini from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that has given context window top billing in its advertisements.

“There’s nothing wrong with the simple claim, ‘Our model can take X number of tokens’ based on the objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”

Generative AI, broadly speaking, is coming under increased scrutiny as businesses (and investors) grow frustrated with the technology’s limitations.

In a pair of recent surveys from Boston Consulting Group, about half of the respondents — all C-suite executives — said that they don’t expect generative AI to bring about substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, generative AI dealmaking at the earliest stages has declined, plummeting 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google — which has raced, at times clumsily, to catch up to its generative AI rivals — was desperate to make Gemini’s context one of those differentiators.

But the bet was premature, it seems.

“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without the knowledge of how long context processing is implemented — and companies do not share these details — it is hard to say how realistic these claims are.”

Google didn’t respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to hyped-up claims around generative AI are better benchmarks and, in the same vein, greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (liberally cited by Google in its marketing materials), “needle in the haystack,” only measures a model’s ability to retrieve particular info, like names and numbers, from datasets, not its ability to answer complex questions about that info.
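To make the limitation concrete, here is a minimal sketch of what a needle-in-the-haystack check can look like. Everything in it is illustrative: the helper names, the filler text, and the "needle" are made up, and the stand-in `keyword_model` is a plain string search rather than a call to any real model.

```python
from typing import Callable, List, Sequence


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def retrieval_accuracy(
    model: Callable[[str, str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float],
) -> float:
    """Fraction of insertion depths at which the model's answer contains `expected`."""
    hits = 0
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = model(context, question)
        hits += expected.lower() in answer.lower()
    return hits / len(depths)


def keyword_model(context: str, question: str) -> str:
    """Stand-in for a model (demonstration only): returns the sentence
    that mentions 'passcode' and does no reasoning at all."""
    for sentence in context.split(". "):
        if "passcode" in sentence:
            return sentence
    return ""


if __name__ == "__main__":
    filler = ["The sky was grey that morning."] * 200   # hypothetical filler text
    needle = "The vault passcode is 4921."               # hypothetical needle
    score = retrieval_accuracy(
        keyword_model, filler, needle,
        "What is the vault passcode?", "4921", [0.0, 0.25, 0.5, 0.75, 1.0],
    )
    print(score)
```

Note that a plain keyword search passes this kind of test at every depth, which is the gap Saxon describes: a perfect retrieval score says nothing about whether a model can answer complex questions about what it found.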

“All scientists and most engineers using these models are essentially in agreement that our existing benchmark culture is broken,” Saxon said, “so it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a massive grain of salt.”
