News outlets are accusing Perplexity of plagiarism and unethical web scraping

Ambiguity around copyright laws and AI web crawlers complicates matters

In the age of generative AI, when chatbots can provide detailed answers to questions based on content pulled from the internet, the line between fair use and plagiarism, and between routine web scraping and unethical summarization, is a thin one. 

Perplexity AI is a startup that combines a search engine with a large language model to generate detailed answers, rather than just links. Unlike OpenAI’s ChatGPT and Anthropic’s Claude, Perplexity doesn’t train its own foundational AI models. Instead, it uses open or commercially available ones to turn the information it gathers from the internet into answers.

But a series of accusations in June suggests the startup’s approach borders on being unethical. Forbes called out Perplexity for allegedly plagiarizing one of its news articles in the startup’s beta Perplexity Pages feature. And Wired has accused Perplexity of illicitly scraping its website, along with other sites. 

Perplexity, which as of April was working to raise $250 million at a near-$3 billion valuation, maintains that it has done nothing wrong. The Nvidia- and Jeff Bezos-backed company says that it has honored publishers’ requests to not scrape content and that it is operating within the bounds of fair use copyright laws. 

The situation is complicated. At its heart are nuances surrounding two concepts. The first is the Robots Exclusion Protocol, a standard used by websites to indicate that they don’t want their content accessed or used by web crawlers. The second is fair use in copyright law, which sets up the legal framework for allowing the use of copyrighted material without permission or payment in certain circumstances. 

Surreptitiously scraping web content

Wired’s June 19 story claims that Perplexity has ignored the Robots Exclusion Protocol to surreptitiously scrape areas of websites that publishers do not want bots to access. Wired reported that it observed a machine tied to Perplexity doing this on its own news site, as well as across other publications under its parent company, Condé Nast. 

The report noted that developer Robb Knight conducted a similar experiment and came to the same conclusion. 

Both Wired reporters and Knight tested their suspicions by asking Perplexity to summarize a series of URLs and then watching on the server side as an IP address associated with Perplexity visited those sites. Perplexity then “summarized” the text from those URLs — though in the case of one dummy website with limited content that Wired created for this purpose, it returned text from the page verbatim. 

This is where the nuances of the Robots Exclusion Protocol come into play. 

Technically, web scraping is the use of automated software, known as crawlers, to scour the web and index or collect information from websites. Search engines like Google do this so that web pages can be included in search results. Other companies and researchers use crawlers to gather data from the internet for market analysis, academic research and, as we’ve come to learn, training machine learning models.

Web scrapers that comply with this protocol will first look for the “robots.txt” file at the root of a site to see what is permitted and what is not. Today, what is not permitted is usually scraping a publisher’s site to build massive training datasets for AI. Search engines and AI companies, including Perplexity, have stated that they comply with the protocol, but they aren’t legally obligated to do so.
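To make the protocol concrete, here is a minimal sketch of the check a compliant crawler might run before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` and `OtherBot` user-agent names, and the `example.com` URLs are all hypothetical and chosen only for illustration; they do not describe any real publisher's rules or any real company's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (an assumption for
# illustration): one named bot is blocked everywhere, everyone else is
# blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would first download
    # the file from the site's root (for example with set_url() and read()).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/private/draft"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the website itself. The crawler reads the file and decides whether to honor it, which is why compliance is voluntary.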

Perplexity’s head of business, Dmitry Shevelenko, told TechCrunch that summarizing a URL isn’t the same thing as crawling. “Crawling is when you’re just going around sucking up information and adding it to your index,” Shevelenko said. He noted that Perplexity’s IP might show up as a visitor to a website that is “otherwise kind of prohibited from robots.txt” only when a user puts a URL into their query, which “doesn’t meet the definition of crawling.” 

“We’re just responding to a direct and specific user request to go to that URL,” Shevelenko said.

In other words, if a user manually provides a URL, Perplexity says its AI isn’t acting as a web crawler but rather as a tool that helps the user retrieve and process information they requested.

But to Wired and many other publishers, that’s a distinction without a difference because visiting a URL and pulling the information from it to summarize the text sure looks a whole lot like scraping if it’s done thousands of times a day.

(Wired also reported that Amazon Web Services, one of Perplexity’s cloud service providers, is investigating the startup for ignoring the robots.txt protocol to scrape web pages that users cited in their prompts. AWS told TechCrunch that Wired’s report is inaccurate and that it told the outlet it was processing its media inquiry as it does any other report alleging abuse of the service.)

Plagiarism or fair use?

Screenshot of Perplexity Pages: Forbes accused Perplexity of plagiarizing its scoop about former Google CEO Eric Schmidt developing AI-powered combat drones. (Image credit: Perplexity)

Wired and Forbes have also accused Perplexity of plagiarism. Ironically, Wired says Perplexity plagiarized the very article that called out the startup for surreptitiously scraping its web content. 

Wired reporters said the Perplexity chatbot “produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them.” One sentence exactly reproduces a sentence from the original story; Wired says this constitutes plagiarism. The Poynter Institute’s guidelines say it might be plagiarism if the author (or AI) used seven consecutive words from the original source work.  
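The seven-consecutive-words rule of thumb is simple enough to sketch in code. The snippet below is only an illustration of that heuristic, not a tool used by Poynter, Wired or Perplexity: it lowercases both texts, strips punctuation, and reports any run of seven words that appears in both. The function name `shared_word_runs` and the sample sentences are invented for this example.

```python
import re


def shared_word_runs(source: str, candidate: str, n: int = 7) -> set[str]:
    """Return every run of n consecutive words that appears in both texts."""

    def word_runs(text: str) -> set[str]:
        # Normalize: lowercase and keep only letters, digits and straight
        # apostrophes, so punctuation differences do not hide a match.
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    return word_runs(source) & word_runs(candidate)


if __name__ == "__main__":
    # Invented sample sentences (assumptions for illustration only).
    original = ("The startup has been accused of scraping websites "
                "without permission from publishers.")
    summary = ("Critics say the startup has been accused of scraping "
               "websites without asking anyone.")
    for run in sorted(shared_word_runs(original, summary)):
        print(run)
```

A match here only flags text worth a closer look. Whether a given overlap counts as plagiarism remains an editorial judgment, as the guideline's "might" suggests.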

Forbes’ accusation centers on an investigative report it published in early June about how former Google CEO Eric Schmidt’s new venture is recruiting heavily and testing AI-powered drones with military applications. The next day, Forbes editor John Paczkowski posted on X saying that Perplexity had republished the scoop as part of its beta feature, Perplexity Pages.

Perplexity Pages, which is only available to certain Perplexity subscribers for now, is a new tool that promises to help users turn research into “visually stunning, comprehensive content,” according to Perplexity. Examples of such content on the site come from the startup’s employees, and include articles like “Beginner’s Guide to Drumming,” or “Steve Jobs: Visionary CEO.” 

“It rips off most of our reporting,” Paczkowski wrote. “It cites us, and a few that reblogged us, as sources in the most easily ignored way possible.” 

Forbes reported that many of the posts that were curated by the Perplexity team are “strikingly similar to original stories from multiple publications, including Forbes, CNBC and Bloomberg.” Forbes said the posts gathered tens of thousands of views and didn’t mention any of the publications by name in the article text. Rather, Perplexity’s articles included attributions in the form of “small, easy-to-miss logos that link out to them.”

Furthermore, Forbes said the post about Schmidt contains “nearly identical wording” to Forbes’ scoop. The aggregation also included an image created by the Forbes design team that appeared to be slightly modified by Perplexity. 

Perplexity CEO Aravind Srinivas responded to Forbes at the time by saying the startup would cite sources more prominently in the future — a solution that’s not foolproof, as citations themselves face technical difficulties. ChatGPT and other models have hallucinated links, and since Perplexity uses OpenAI models, it is likely to be susceptible to such hallucinations. In fact, Wired reported that it observed Perplexity hallucinating entire stories. 

Other than noting Perplexity’s “rough edges,” Srinivas and the company have largely doubled down on Perplexity’s right to use such content for summarizations. 

This is where the nuances of fair use come into play. Plagiarism, while frowned upon, is not technically illegal. 

According to the U.S. Copyright Office, it is legal to use limited portions of a work, including quotes, for purposes such as commentary, criticism, news reporting and scholarly reports. AI companies like Perplexity posit that providing a summary of an article is within the bounds of fair use.

“Nobody has a monopoly on facts,” Shevelenko said. “Once facts are out in the open, they are for everyone to use.”

Shevelenko likened Perplexity’s summaries to how journalists often use information from other news sources to bolster their own reporting. 

Mark McKenna, a professor of law at the UCLA Institute for Technology, Law & Policy, told TechCrunch the situation isn’t an easy one to untangle. In a fair use case, courts would weigh whether the summary uses a lot of the expression of the original article, versus just the ideas. They might also examine whether reading the summary might be a substitute for reading the article. 

“There are no bright lines,” McKenna said. “So [Perplexity] saying factually what an article says or what it reports would be using non-copyrightable aspects of the work. That would be just facts and ideas. But the more that the summary includes actual expression and text, the more that starts to look like reproduction, rather than just a summary.”

Unfortunately for publishers, unless Perplexity is reproducing the original expression at length (and apparently, in some cases, it is), its summaries might not be considered copyright infringement.

How Perplexity aims to protect itself

AI companies like OpenAI have signed media deals with a range of news publishers to access their current and archival content on which to train their algorithms. In return, OpenAI promises to surface news articles from those publishers in response to user queries in ChatGPT. (But even that has some kinks that need to be worked out, as Nieman Lab reported last week.)

Perplexity has held off from announcing its own slew of media deals, perhaps waiting for the accusations against it to blow over. But the company is “full speed ahead” on a series of advertising revenue-sharing deals with publishers. 

The idea is that Perplexity will start including ads alongside query responses, and publishers that have content cited in any answer will get a slice of the corresponding ad revenue. Shevelenko said Perplexity is also working to allow publishers access to its technology so they can build Q&A experiences and power things like related questions natively inside their sites and products. 

But is this just a fig leaf for systemic IP theft? Perplexity isn’t the only chatbot that threatens to summarize content so completely that readers fail to see the need to click out to the original source material. 

And if AI scrapers like this continue to take publishers’ work and repurpose it for their own businesses, publishers will have a harder time earning ad dollars. That means eventually, there will be less content to scrape. When there’s no more content left to scrape, generative AI systems will then pivot to training on synthetic data, which could lead to a hellish feedback loop of potentially biased and inaccurate content. 
