Posts tagged with "AI"

AI Companies Need to Be Regulated: An Open Letter to the U.S. Congress and European Parliament

Federico: Historically, technology has usually advanced in lockstep with opening up new creative opportunities for people. From word processors allowing writers to craft their next novel to digital cameras letting photographers express themselves in new ways or capture more moments, technological progress over the past few decades has sustained creators and, perhaps more importantly, spawned industries that couldn’t exist before.

Technology has enabled millions of people like myself to realize their life’s dreams and make a living out of “creating content” in a digital age.

This is all changing with the advent of Artificial Intelligence products based on large language models. Left unregulated, we believe the change may be for the worse.

Over the past two years, we’ve witnessed the arrival of AI tools and services that often use human work without consent in pursuit of faster, cheaper results. A fixation on maximizing profits above all else isn’t a surprise in a capitalist industry, but it’s highly concerning nonetheless – especially since, this time around, the majority of these AI tools have been built on a foundation of non-consensual appropriation, also known as – quite simply – digital theft.

As we’ve documented on MacStories and as other (and larger) publications have also investigated, it’s become clear that the foundation models behind different LLMs have been trained on content sourced from the open web without requesting publishers’ permission upfront. These models then power AI interfaces that can regurgitate similar content or provide answers with hidden citations that seldom prioritize driving traffic to publishers. As far as MacStories is concerned, this is limited to text scraped from our website, but we’re seeing it play out in other industries too, from design assets to photos, music, and more. And to top it all off, publishers and creators whose content was appropriated for training or crawled for generative responses (or both) can’t even ask AI companies to be transparent about which parts of their content were used. It’s a black box where original content goes in and derivative slop comes out.

We think this is all wrong.

The practices followed by the majority of AI companies are ethically unfair to publishers and brazenly walk a perilous line of copyright infringement that must be regulated. Most worryingly, if ignored, we fear that these tools may lead to a gradual erosion of the open web as we know it, diminishing individuals’ creativity and consolidating “knowledge” in the hands of a few tech companies that built their AI services on the back of web publishers and creators without their explicit consent.

In other words, we’re concerned that, this time, technology won’t open up new opportunities for creative people on the web. We fear that it’ll destroy them.

We want to do something about this. And we’re starting with an open letter, embedded below, that we’re sending on behalf of MacStories, Inc. to U.S. Senators who have sponsored AI legislation as well as Italian members of the E.U. Special Committee on Artificial Intelligence in a Digital Age.

In the letter, which we encourage other publishers to copy if they so choose, we outline our stance on AI companies taking advantage of the open web for training purposes, not compensating publishers for the content they appropriated and used, and not being transparent regarding the composition of their models’ data sets. We’re sending this letter in English today, with an Italian translation to follow in the near future.

I know that MacStories is merely a drop in the bucket of the open web. We can’t afford to sue anybody. But I’d rather hold my opinion strongly and defend my intellectual property than sit silently and accept something that I believe is fundamentally unfair for creators and dangerous for the open web. And I’m grateful to have a business partner who shares these ideals and principles with me.

With that being said, here’s a copy of the letter we’re sending to U.S. and E.U. representatives.



Apple Says It Won’t Ship Major New OS Features in the EU This Fall Due to DMA Uncertainty

A new round in the fight between the EU and Apple has been brewing for a while now. About a week ago, the Financial Times reported that unnamed sources said the EU was poised to levy significant fines against the company over a probe of Apple’s compliance with the Digital Markets Act. Then, earlier this week, in an interview with CNBC, the EU’s competition chief, Margrethe Vestager, telegraphed that Apple is facing enforcement measures:

[Apple] are very important because a lot of good business happens through the App Store, happens through payment mechanisms, so of course, even though you know I can say this is not what was expected of such a company, of course we will enforce exactly with the same top priority as with any other business.

Asked when enforcement might happen, Vestager told CNBC ‘hopefully soon.’

Apple made no comment to CNBC at the time, but today, that shoe has apparently dropped, with Apple telling the Financial Times that:

Due to the regulatory uncertainties brought about by the Digital Markets Act, we do not believe that we will be able to roll out three of these [new] features – iPhone Mirroring, SharePlay Screen Sharing enhancements, and Apple Intelligence – to our EU users this year.

Is it a coincidence that Apple made its statement to the same media outlet that reported that fines were about to be assessed? I doubt it. The more likely scenario is that Apple is using OS updates as a negotiating chip with EU regulators. Your guess is as good as mine whether the move will work. Personally, I think the tactic is just as likely to backfire. However, I’m quite confident that you’ll be hearing from me again about fines by the EU against Apple sooner rather than later.


Wired Confirms Perplexity Is Bypassing Efforts by Websites to Block Its Web Crawler

Last week, Federico and I asked Robb Knight to do what he could to block web crawlers deployed by artificial intelligence companies from scraping MacStories. Robb had already updated his own site’s robots.txt file months ago, so that’s the first thing he did for MacStories.

However, robots.txt only works if a company’s web crawler is set up to respect the file. As I wrote earlier this week, a better solution is to block them on your server, which Robb did on his personal site and wrote about late last week. The setup sends a 403 error if one of the bots listed in his server code requests information from his site.
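
For anyone who wants to try the same thing, here’s a minimal sketch of the technique, assuming an nginx server; it isn’t Robb’s exact configuration, and the list of user agents is illustrative. The map block (which lives at the http level of the configuration) flags known AI crawlers, and the server block answers them with a 403:

    # Flag requests whose User-Agent matches a known AI crawler.
    # This map goes in the http context of nginx.conf.
    map $http_user_agent $is_ai_bot {
        default             0;
        ~*GPTBot            1;
        ~*ChatGPT-User      1;
        ~*PerplexityBot     1;
        ~*Applebot-Extended 1;
    }

    server {
        # ...existing server configuration...

        # Refuse flagged crawlers before anything else is served.
        if ($is_ai_bot) {
            return 403;
        }
    }

Unlike robots.txt, the crawler doesn’t get a say in this; the server simply refuses to answer.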

Spoiler: Robb hit the nail on the head the first time.

After reading Robb’s post, Federico and I asked him to do the same for MacStories, which he did last Saturday. Once the block was in place, Federico began testing it. OpenAI’s crawler returned an error as expected, but Perplexity’s bot was still able to reach MacStories, which shouldn’t have been the case.1

Yes, I took a screenshot of Perplexity’s API documentation because I bet it changes based on what we discovered.

That began a deep dive to try to figure out what was going on. Robb’s code checked out, blocking the user agent specified in Perplexity’s own API documentation. What we discovered after more testing was that Perplexity was hitting MacStories’ server without using the user agent it said it used, effectively doing an end run around Robb’s server code.
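
For anyone who wants to reproduce that kind of test, the gist is to request the same page twice, once as a normal browser and once with the user agent Perplexity documents for its crawler, and compare the responses. Here’s a minimal sketch in Python; the URL is illustrative, and the bot string mirrors what Perplexity’s documentation listed at the time, which may have changed:

    import urllib.error
    import urllib.request

    URL = "https://www.macstories.net/"  # page the crawler should be blocked from

    AGENTS = {
        "browser": "Mozilla/5.0",
        # Perplexity's documented crawler user agent; treat it as
        # illustrative, since the documentation may have changed.
        "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
    }

    for name, agent in AGENTS.items():
        request = urllib.request.Request(URL, headers={"User-Agent": agent})
        try:
            with urllib.request.urlopen(request) as response:
                print(f"{name}: HTTP {response.status}")
        except urllib.error.HTTPError as error:
            # A working server-level block answers the bot with a 403.
            print(f"{name}: HTTP {error.code}")

A working block prints HTTP 200 for the browser and HTTP 403 for the bot. What the testing showed instead was that pages requested through Perplexity’s service were being fetched under an undeclared user agent, which no user-agent filter can catch.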

Robb wrote up his findings on his website, which promptly shot to the top slot on Hacker News and caught the eye of Dhruv Mehrotra and Tim Marchman of Wired, who were in the midst of investigating how Perplexity works. As Mehrotra and Marchman describe it:

A WIRED analysis and one carried out by developer Robb Knight suggest that Perplexity is able to achieve this partly through apparently ignoring a widely accepted web standard known as the Robots Exclusion Protocol to surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t. WIRED observed a machine tied to Perplexity—more specifically, one on an Amazon server and almost certainly operated by Perplexity—doing this on wired.com and across other Condé Nast publications.

Until earlier this week, Perplexity published in its documentation a link to a list of the IP addresses its crawlers use—an apparent effort to be transparent. However, in some cases, as both WIRED and Knight were able to demonstrate, it appears to be accessing and scraping websites from which coders have attempted to block its crawler, called Perplexity Bot, using at least one unpublicized IP address. The company has since removed references to its public IP pool from its documentation.

That secret IP address—44.221.181.252—has hit properties at Condé Nast, the media company that owns WIRED, at least 822 times in the last three months. One senior engineer at Condé Nast, who asked not to be named because he wants to “stay out of it,” calls this a “massive undercount” because the company only retains a fraction of its network logs.

WIRED verified that the IP address in question is almost certainly linked to Perplexity by creating a new website and monitoring its server logs. Immediately after a WIRED reporter prompted the Perplexity chatbot to summarize the website’s content, the server logged that the IP address visited the site. This same IP address was first observed by Knight during a similar test.

This sort of unethical behavior is why we took the steps we did to block the use of MacStories’ websites as training data for Perplexity and other companies.2 Incidents like this and the lack of transparency about how AI companies train their models have led to a lot of mistrust in the entire industry among creators who publish on the web. I’m glad we’ve been able to play a small part in revealing Perplexity’s egregious behavior, but more needs to be done to rein in these practices, including closer scrutiny by regulators around the world.

As a footnote to this, it’s worth noting that Wired also puts to rest the argument that websites should be okay with Perplexity’s behavior because it includes citations in its plagiarism. According to Wired’s story:

WIRED’s own records show that Perplexity sent 1,265 referrals to wired.com in May, an insignificant amount in the context of the site’s overall traffic. The article to which the most traffic was referred got 17 views.

That’s next to nothing for a site with Wired’s traffic, which Similarweb and other analytics services peg at over 20 million page views for that same month; 1,265 referrals works out to a mere 0.006% of Wired’s May traffic. Let that sink in, and then ask yourself whether it seems like a fair trade.


  1. Meanwhile, I was digging through bins of old videogames and hardware at a Retro Gaming Festival doing ‘research’ for NPC.↩︎
  2. Mehrotra and Marchman correctly question whether Perplexity is even an AI company because it piggybacks on other companies’ LLMs and uses them in conjunction with scraped web data to provide summaries that effectively replace the source’s content. However, that doesn’t change the fact that Perplexity is surreptitiously scraping sites while simultaneously professing to respect sites’ robots.txt files. That’s the unethical bit. ↩︎

Apple Developer Academies in Six Countries to Add AI Courses This Fall

Today, Apple announced that this fall, the company will offer a new curriculum for its Developer Academy students focused on machine learning and artificial intelligence.

According to Apple:

Beginning this fall, every Apple Developer Academy student will benefit from custom-built curriculum that teaches them how to build, train, and deploy machine learning models across Apple devices. Courses will include the fundamentals of AI technologies and frameworks; Core ML and its ability to deliver fast performance on Apple devices; and guidance on how to build and train AI models from the ground up. Students will learn from guided curriculum and project-based assignments that include assistance from hundreds of mentors and more than 12,000 academy alumni worldwide.

The new curriculum will be offered at 18 academies in Brazil, Indonesia, Italy, Saudi Arabia, South Korea, and the United States. With the company’s emphasis on Apple Intelligence at WWDC, it’s not surprising that the skills needed to implement those new features are being added to its educational efforts.


How We’re Trying to Protect MacStories from AI Bots and Web Crawlers – And How You Can, Too

Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.

If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without the consent or compensation of publishers and other creators.

Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, which is why it’s just one of four approaches we’re taking to protect our work.



Opting Out of AI Model Training

Dan Moren has an excellent guide on Six Colors that explains how to exclude your website from the web crawlers used by Apple, OpenAI, and others to train large language models for their AI products. For many sites, the process simply requires a few edits to the robots.txt file on your server:

If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that can give instructions about how automated web crawlers are allowed to interact with your site. This system enables publishers to not only entirely block their sites from crawlers, but also specify just parts of the sites to allow or disallow.
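
As a simple illustration, here’s a robots.txt that opts an entire site out of OpenAI’s and Apple’s AI training crawlers; the user agent names come from each company’s published crawler documentation and may change over time:

    # Block OpenAI's training crawler.
    User-agent: GPTBot
    Disallow: /

    # Block Apple's AI training crawler (search indexing via Applebot is unaffected).
    User-agent: Applebot-Extended
    Disallow: /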

The process is a little more complicated with something like WordPress, which MacStories uses, and Dan covers that too.

Unfortunately, as Dan explains, editing robots.txt isn’t a solution for companies that ignore the file. It’s simply a convention that doesn’t carry any legal or regulatory weight. Nor does it help with Google’s or Microsoft’s use of your website’s copyrighted content unless you’re also willing to remove your site from the biggest search engines.

Although I’m glad there is a way to block at least some AI web crawlers prospectively, it’s cold comfort. We and many other sites have years of articles that have already been crawled to train these models, and you can’t unring that bell. That said, MacStories’ robots.txt file has been updated to ban Apple’s and OpenAI’s crawlers, and we’re investigating additional server-level protections.

If you listen to Ruminate or follow my writing on MacStories, you know that I think what these companies are doing is wrong in both the moral and legal senses of the word. However, nothing captures it quite as well as this Mastodon post by Federico today:

If you’ve ever read the principles that guide us at MacStories, I’m sure Federico’s post came as no surprise. We care deeply about the Open Web, but ‘open’ doesn’t give tech companies free rein to appropriate our work to build their products.

Yesterday, Federico linked to Apple’s Machine Learning Research website, where it was disclosed that the company has indexed the web to train its model without the consent of publishers. I was as disappointed in Apple as Federico was. I also immediately thought of this 2010 clip of Steve Jobs near the end of his life, reflecting on what ‘the intersection of Technology and the Liberal Arts’ meant to Apple:

I’ve always loved that clip. It speaks to me as someone who loves technology and creates things for the web. In hindsight, I also think that Jobs was explaining what he hoped his legacy would be. It’s ironic that he spoke about ‘technology married with Liberal Arts,’ which superficially sounds like what Apple and others have done to create their AI models but couldn’t be further from what he meant. It’s hard to watch that clip now and not wonder if Apple has lost sight of what guided it in 2010.


You can follow all of our WWDC coverage through our WWDC 2024 hub or subscribe to the dedicated WWDC 2024 RSS feed.
