How We’re Trying to Protect MacStories from AI Bots and Web Crawlers – And How You Can, Too

Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.

If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without obtaining the consent of publishers and other creators or compensating them.

Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, so it’s just one of four approaches we’re taking to protect our work.

Preventing AI Crawlers Using Robots.txt

The first step, and one of the easiest to implement, is to request that the web crawlers of AI companies not crawl your site using robots.txt. The trouble with this approach is that it’s nothing more than the Internet equivalent of an “AI Bots Keep Out” sign hung on your website. It can be ignored and only works if crawlers identify themselves, which not all seem to do. That said, it’s a good first step and the first thing we did. I highly recommend Dan Moren’s article on Six Colors that I linked to last week for more information about robots.txt and details on implementing it on your site.
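
To make the idea concrete, here is a minimal sketch in Python that builds a robots.txt asking a few AI crawlers to stay away and then checks the result with the standard library's robots.txt parser. The user agent names below are an illustrative assumption, not the list MacStories uses, and the helper names and the example.com URL are hypothetical. Crawler names change over time, so check each company's documentation before relying on any list.

```python
from urllib.robotparser import RobotFileParser

# Illustrative assumption: a few publicly documented AI crawler tokens.
# This is NOT MacStories' actual list, and it may be out of date.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(user_agents):
    """Return robots.txt text that disallows the whole site for each named agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # An empty Disallow leaves every other crawler unrestricted.
    blocks.append("User-agent: *\nDisallow:")
    return "\n\n".join(blocks) + "\n"


def is_allowed(robots_txt, user_agent, url="https://example.com/article"):
    """Check how a well-behaved crawler would interpret the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Print the file contents so they can be saved as /robots.txt.
    print(build_robots_txt(AI_USER_AGENTS))
```

The check only shows what a crawler that honors robots.txt would do. As noted above, nothing forces a crawler to read the file or to identify itself honestly.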

Blocking AI Bots at Your Server

We don’t trust AI companies to respect our robots.txt file. After all, they already took our content without our consent. So, we went a step further and blocked known AI crawlers at the server level with the help of Robb Knight. Doing so requires that you know your way around a web server, but it’s more effective than simply editing your robots.txt file. If you want to learn more about configuring your site to block AI crawlers, Robb has written about the work he did for his personal site and MacStories here.
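
Robb's write-up covers the actual web server configuration. Purely as an illustration of the same idea, the sketch below refuses requests at the application layer with a small Python WSGI middleware that answers with a 403 whenever the User-Agent header contains a blocked token. The token list, the class name BlockAICrawlers, and the demo app are all hypothetical, and this is not the configuration used on MacStories.

```python
# Illustrative assumption: tokens that appear in some AI crawlers' User-Agent
# headers. This is NOT the list used on MacStories.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def is_blocked(user_agent, tokens=BLOCKED_AGENT_TOKENS):
    """Return True if the User-Agent header contains any blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in tokens)


class BlockAICrawlers:
    """Hypothetical WSGI middleware that answers blocked crawlers with a 403."""

    def __init__(self, app, tokens=BLOCKED_AGENT_TOKENS):
        self.app = app
        self.tokens = tuple(token.lower() for token in tokens)

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT"), self.tokens):
            body = b"Forbidden\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everyone else is passed through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site, used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = BlockAICrawlers(demo_app)
```

Like robots.txt, this only catches crawlers that identify themselves in the User-Agent header, but unlike robots.txt it does not depend on the crawler choosing to comply.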

Update Your Terms of Service

I also recommend having a Terms of Service for your website. The New York Times, which is currently litigating OpenAI’s LLM training practices, updated its terms of service late last summer. We used those terms as a guide to carefully define in our own Terms of Service how MacStories content, whether it’s an article, image, or podcast, can be used.

Rest assured, you have a lot of latitude for personal use of MacStories content. Nor do we have an issue with commercial uses of reasonable portions of our content, as long as those portions are properly attributed in line with the content that is used. However, we do not consent to the use of our content for AI model training.

Support Legislation Regulating AI Training

None of the above are complete solutions, which is why we support legislation regulating how AI companies train their LLMs. Last summer, media organizations from around the world signed an open letter asking lawmakers to regulate LLM training, stating:

We, the undersigned organizations, support the responsible advancement and deployment of generative AI technology, while believing that a legal framework must be developed to protect the content that powers AI applications as well as maintain public trust in the media that promotes facts and fuels our democracies.

The letter goes to the heart of something we believe, too. We’re not against artificial intelligence as a technology. Many of the tools being built are promising. However, we don’t believe that it’s right for tech companies worth billions and even trillions of dollars to be given a pass for building those tools on the backs of others’ work, especially in an economic environment where so many online media companies are struggling to survive. It’s just not right.

The solutions above aren’t perfect or foolproof, and as a result, some people have told us that we shouldn’t bother; we should just give in. In a sign of just how strapped media companies are for cash, others have cut deals with AI companies, figuring that getting something is better than nothing.

But, here’s the thing. The web is a special place. Every day, it brings people from around the world together to share their thoughts and express their creativity. That’s something nobody should take for granted, and it’s worth protecting. AI is cool and all, but it’s not worth destroying the web.
