Crawlers

A few months back, I wrote about how Google is breaking its social contract with the web, harvesting our content not in order to send search traffic to relevant results, but to feed a large language model that will spew auto-completed sentences instead.

I still think Chris put it best:

I just think it’s fuckin’ rude.

When it comes to the crawlers that are ingesting our words to feed large language models, Neil Clarke describes the situation:

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

Alas, the current situation is opt-out. The onus is on us to update our robots.txt file.

Neil handily provides the current list to add to your file. Pass it on:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

In theory you should be able to group those user agents together, but citation needed on whether that’s honoured everywhere:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /
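For what it's worth, at least Python's standard-library robots.txt parser does honour the grouped form — a quick local check (this only tells you how one parser behaves, not how every crawler does):

```python
import urllib.robotparser

# The grouped robots.txt from above, as a string.
ROBOTS = """\
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Each listed agent is denied; an unlisted browser UA is not.
for agent in ("CCBot", "ChatGPT-User", "GPTBot",
              "Google-Extended", "Omgilibot", "FacebookBot"):
    print(agent, parser.can_fetch(agent, "/"))
print("Mozilla/5.0", parser.can_fetch("Mozilla/5.0", "/"))
```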

There’s a bigger issue with robots.txt though. It too is a social contract. And as we’ve seen, when it comes to large language models, social contracts are being ripped up by the companies looking to feed their beasts.

As Jim says:

I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.

That realisation was prompted in part by Manuel Moreale’s experiment with blocking crawlers:

So, what’s the takeaway here? I guess that the vast majority of crawlers don’t give a shit about your robots.txt.

Time to up the ante. Neil’s post offers an option if you’re running Apache. Either in .htaccess or in a .conf file, you can block user agents using mod_rewrite:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|Omgilibot|FacebookBot) [NC]
RewriteRule ^ - [F]
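If you want to check which user-agent strings that alternation will catch before deploying it, you can try the same pattern locally with Python's re module (the sample strings below are made up for illustration):

```python
import re

# The user-agent alternation from the RewriteCond above;
# the [NC] flag is approximated with IGNORECASE.
pattern = re.compile(r"CCBot|ChatGPT|GPTBot|Omgilibot|FacebookBot",
                     re.IGNORECASE)

samples = [
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/118.0",
]

for ua in samples:
    verdict = "403 blocked" if pattern.search(ua) else "200 allowed"
    print(verdict, "-", ua)
```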

You’ll see that Google-Extended isn’t in that list. It isn’t a crawler. Rather, it’s the permissions model that Google have implemented for using your site’s content to train large language models: unless you opt out via robots.txt, it’s assumed that you’re totally fine with your content being used to feed their stochastic parrots.

Responses

Tracy Durnell

ADDED 4 October 2023:

Google has announced a new token you can block to exclude your website from training Bard and Vertex AI: Google-Extended. To block your site from being used to train Google’s AI products, you should include this code in your robots.txt file:

# Google AI
User-agent: Google-Extended
Disallow: /

As a standalone token, that means that we don’t need to block Google from indexing our websites to block them from using our content to train their AI products.
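Put another way, a robots.txt along these lines (a sketch — the Googlebot entry is redundant, since an agent with no matching rule is allowed by default, but it makes the intent explicit) keeps regular search indexing while opting out of AI training:

```
# Keep normal search indexing.
User-agent: Googlebot
Allow: /

# Opt out of use for training Google's AI products.
User-agent: Google-Extended
Disallow: /
```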

⭐ ADDED 11 December 2023:

Except!!!! Google-Extended applies to their products but not their generative search results. So if you don’t want your content to appear in generative search results, you still need to block Googlebot.

ORIGINAL ARTICLE (published 11 July 2023):

After thinking about it for a couple days, I’ve decided to de-index my website from Google. It’s reversible — I’m sure Google will happily reindex it if I let them — so I’m just going ahead and doing it for now. I’m not down with Google swallowing everything posted on the internet to train their generative AI models. I was pushed over the edge by posts from Jeremy Keith and Vasilis van Gemert, thanks y’all.

I don’t have Google Search Console set up for this website so I don’t know how much search traffic I get. My other blog, Cascadia Inspired, got about 200 hits in the past three months. I’m not going to cry over that — they’re mostly going to one 2015 article anyway (and probably not that helpful of a post, to my eye. Around New Year’s every year I usually get an influx of people to my ten-year-old guide to doing a creative annual review. Sorry folks, I’m sure someone else has written something better by now.) 😉

I’m going to start by pulling my websites out of Google search, then work on adding my sites to directories. Maybe I’ll even join a webring 💍✨

Adding a noindex meta tag to my WordPress header

Because my website has already been indexed by Google, I need to allow the Google bot to re-crawl the pages and see the new “noindex” instruction. So in the future I’ll also block the Googlebot crawler, but not just yet 😉

I added this code to the functions.php file of my child theme:

add_action( 'wp_head', function() {
    echo '<meta name="Googlebot" content="noindex, nofollow, noimageindex">';
} );

I figured out how to adapt this from WPExplorer. This random WordPress plugin help forum suggested another version; I don’t know which is better 🤷‍♀️

I’m not 100% on whether the noimageindex is actually helpful for Googlebot since that’s their text bot, but can’t hurt right? (Tell me if it hurts lol.) Yoast says there’s a better way to block image indexing but I’m scared of touching the .htaccess file and definitely nothing with my server 😂 (I’m on shared hosting anyway, so I think the edits I can make are limited?)
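For the record, the .htaccess route for image indexing boils down to sending an X-Robots-Tag response header for image files — roughly something like this (a sketch, assuming Apache with mod_headers enabled; not what Yoast literally publishes, and untested on shared hosting):

```
<IfModule mod_headers.c>
    <FilesMatch "\.(png|jpe?g|gif|webp)$">
        Header set X-Robots-Tag "noimageindex"
    </FilesMatch>
</IfModule>
```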

Blocking bots that collect training data for AIs (and more)

In addition, I created a robots.txt file to tell “law abiding” bots what they’re not allowed to look at. I ought to have done this before but kind of assumed it came with my WordPress install 😅 (Nope.)

AI user agents to block

There’s so many now, just copy from my robots file tbh.

ADDED 4 October 23: To block training of Google’s Bard, I blocked Google-Extended.

I specifically want to deter my website being used for training LLMs, so I blocked Common Crawl.

To block OpenAI, I blocked both user agents ChatGPT-User and GPTBot. (Added GPTBot 10 August 23)

ADDED 4 October 23: Per Neil Clarke’s article, I have also blocked Omgilibot, Omgili, and FacebookBot. (Via Jeremy Keith)

ADDED 14 February 2024: I also blocked user agents used in AI training sets: anthropic-ai, Bytespider, FacebookBot, and PerplexityBot (source)

ADDED 16 April 2024: prompted by Ethan Marcotte, I blocked several more known and suspected user agents used in AI training: Claude-Web, ClaudeBot, cohere-ai, Diffbot, YouBot, ChatGPT

Added 17 June 2024: I’ve now blocked Apple’s AI training bot Applebot-Extended (thanks for the heads-up James!) Does anyone else feel like this is getting ridiculous?

I also blocked Amazonbot and applebot to block Siri and Alexa’s “smart answers.” I believe this also excludes me from Apple search.

I’ve also now blocked Googlebot and bingbot in protest of their generative AI search results — I’ve had the code up for my pages to be deindexed by Google for over six months and I’m done waiting.
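Pulling all of the dated additions above together, the resulting robots.txt would look something like this (reconstructed from the agents named in this post, not copied from the actual file — and note that some crawlers may not honour grouped User-agent lines, in which case each agent needs its own block):

```
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: ChatGPT
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: Omgili
User-agent: FacebookBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: YouBot
User-agent: Applebot-Extended
User-agent: Applebot
User-agent: Amazonbot
User-agent: Googlebot
User-agent: bingbot
Disallow: /
```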

Dark Visitors apparently has a WordPress plugin to update your robots.txt whenever a new agent comes out, but for now I’m stickin’ with manual. I am also still wary of modifying my .htaccess file and breaking something, so it’s just my robots.txt making my stance clear — I can’t control whether companies have any sort of ethics and comply, unfortunately.

Other user agents

Searching on DuckDuckGo, I found an older article from a theme maker with specific advice for WordPress robots.txt. From there I jumped to Jeff Starr’s recommendations from 2020.

I also appreciate fellow opinionated individuals on the internet so I followed some other blocks from Rohan Kumar. I would happily take more opinionated suggestions of junk bots to block if anyone else has opinions or can point me to a list somewhere 😉

Note: this article generated a lot of interest! See a Hacker News discussion.

Syndicated to IndieWeb News

Jan

Used Jeremy Keith’s Crawlers as a bit of a guideline for my updated .htaccess and robots.txt files.

Think there might be a few typos in there—the extra space before FacebookBot, the en dash in that last RewriteRule?—so I didn’t copy these verbatim, but the idea’s the same.

That also means I’m no longer asking Google to simply not index my pages.

# Posted by Jan on Tuesday, October 3rd, 2023 at 8:16am

Jan Boddez

Quick note: I’ve recently enabled native ActivityPub, you know, using the WordPress plugin. So, if you don’t mind the occasional quirk/test spam, feel free to follow @jan.

As a result, the above post can also be found at, e.g., https://indieweb.social/@jan@jan.boddez.net/111170093361292737.

If I just boost that (which I did), how does that work?

Now, I still think this is more taxing on my lil’ web server, which could have to do with it not caching REST requests, so I’m not entirely sure which approach I’m going to go for.

# Posted by Jan Boddez on Tuesday, October 3rd, 2023 at 8:22am

Jan Boddez

@jan Re: “Followers from other servers are not displayed. Browse more on the original profile.” Maybe I should add a followers block (which I read now exists) to that author archive.

# Posted by Jan Boddez on Tuesday, October 3rd, 2023 at 8:25am

Nick Simson

In May 2022 I wrote the following:

It was important for me to understand what every line of code does on my website when I embarked on a personal site redesign. I switched my website from Jekyll to Eleventy a couple years ago. It was a nice upgrade and (most importantly) I didn’t have to relearn everything I knew about making a website.

Me, in an entry from 2022

I have been on WordPress since September, and it is a bit more of a complicated setup than my previous site. Despite using this software elsewhere for more than a decade, I don’t know what every line of code does here.

I added a lot of functionality to this website after switching to WordPress. Most of that functionality is through WordPress’ plugin ecosystem, which is one small reason why this CMS is so popular.

While I may not know what every line of code does now, I do know this: An important part of maintaining a WordPress site is understanding what every plugin does on your site, updating this software regularly, and only keeping the plugins installed that you actually use. No more, no less.

Consider this post (like all my posts) a snapshot in time. I don’t think I’ll revisit this post each year, but if I’m still running WordPress in say 2028, it might be interesting to see what I’m still using, or what new functionality comes along.

Here’s what I’m currently running on this install:

ActivityPub

ActivityPub is a plugin maintained by Matthias Pfefferle & Automattic. It makes WordPress websites operable with ActivityPub supported networks like Mastodon and creates a Fediverse profile for my website. It comes with a couple of WordPress blocks including this “follow me” block:

The other block is a dynamic block to show off the accounts that follow your website profile in the Fediverse. Here’s this post for example, in elk.zone:

Right now I don’t have a great way of linking to my site’s Fediverse profile, but here’s how it looks on micro.blog.

Advanced Editor Tools

Advanced Editor Tools by Automattic adds a ton of additional rich text formatting tools in the paragraph block, and TinyMCE (“Classic” block) should I need them.

Attachment Pages Redirect

WordPress creates an additional page called an attachment page for every image, PDF or media file you upload. I use Attachment Pages Redirect by Samuel Aguilera to redirect these media attachment pages (annoying!) to my homepage instead.

Create Block Theme

Create Block Theme is a development plugin by WordPress.org that lets you easily customize an existing theme, develop a child theme, or otherwise create a block theme for the new-ish Gutenberg site editor. I could probably remove this plugin, but I’m always making tweaks to my website, and right now this is the best way to customize your typography on a WP theme, at least until the Font Library feature comes out in version 6.5.

Hum

Hum by Will Norris automatically builds short links for every post or page on my site. I point the domain name nicks.im to my Flywheel site as a secondary domain, and now I have a personal link shortener. The shortlink for this post is nicks.im/b/tZ, for example. I use this with the ActivityPub plugin to include a shortlink for each post on my Fediverse profile.

Humans TXT

I maintain my humans.txt file from the WordPress editor thanks to the Humans TXT plugin by Till Krüss.

IndieAuth

I use IndieAuth by IndieWebCamp WordPress Outreach Club to sign in to other websites (like indieweb.org and indiebookclub.biz) with my domain name.

IndieBlocks

IndieBlocks by Jan Boddez does a number of cool things to this site and was a catalyst for me switching over to WordPress. First, it adds microformats automatically to each post as well as the block theme. IndieBlocks features custom post types for likes and notes (and the option to hide titles on these), which I use on this site. It also comes with a handful of useful blocks in the post editor and site editor. I frequently use the “Bookmark,” “Like,” “Reply,” and “Repost” blocks in different posts and utilize the “Facepile” block in my templates for webmentions. Jan’s documentation for the plugin is on Github and Indieblocks.xyz.

IndieWeb

IndieWeb by IndieWebCamp WordPress Outreach Club is a plugin that recommends and helps install other great indieweb WordPress plugins and groups them together under a nice umbrella in your dashboard.

Layout Grid

Layout Grid by Automattic gives you finer control over aligning elements to a multi-column grid in the block editor. I’m using this extensively on my Library page.

Location Weather

I’m displaying my current city’s forecast on my Now page with Location Weather by ShapedPlugin. The weather widget is powered by the OpenWeather API. The other widgets on the /now page are a Last.fm recently played tracks app and a Literal currently reading widget that are not proper WordPress plugins.

Micropub

Thanks to Micropub by IndieWebCamp, I can use third-party micropub clients to draft and publish on my website with this feature.

Multi-column Tag Map

I automatically display all tags on an index page with Multi-column Tag Map by Alan Jackson (no, not that one).

Post Type Switcher

Once in a while, I’ll have to switch a new entry from a post to a note or like, or vice versa. Post Type Switcher by Triple J Software lets me do that with ease.

Random Content

Random Content by Endo Creative powers the random footer text at the bottom of each page of my website.

Redirection

Cool URIs don’t change, but sometimes a redesign leaves you needing to create redirects. I am using the Redirection plugin by John Godley to manage my 301s.

ShortPixel Image Optimizer

ShortPixel Image Optimizer is the best plugin out there for optimizing your image files and for using next-gen formats like WebP.

Simple CSS

The WordPress Customizer will probably be phased out into the newer Site Editor as more and more people switch to Block Themes. For a long time the Customizer featured a way to add custom CSS to your theme easily without having to create and maintain a custom theme. Simple CSS by Tom Usborne is a great replacement for that feature because it gives you a nicer editor to write CSS in and your customized CSS won’t disappear if/when you change themes. Similar to the old school Art Direction plugin, you can write CSS rules that only apply to a single page or post, too.

Syndication Links

I automatically syndicate new posts and notes to my Micro.blog, Bluesky and Mastodon accounts. Syndication Links by David Shanske links these copies from the canonical source.

Travelers’ Map

I’m using Travelers’ Map by Camille Verrier to add location data to select entries and add them as points to an interactive map.

Two Factor

Two Factor is a community built plugin that adds an extra layer of security to my WordPress login.

Ultimate Markdown

Ultimate Markdown allows me to write or import posts with the Markdown syntax and convert them to rich text, including blocks in the block editor.

Webmention

I use Webmention by Matthias Pfefferle to display webmentions on my posts along with WordPress comments. Likes, boosts, and replies from Mastodon and Bluesky are also brought in as webmentions via Brid.gy.

White Label CMS

White Label CMS adds my logo to the WordPress login and /wp-admin/ screens.

WP Accessibility

I’m already using an “accessibility-ready” theme, but WP Accessibility by Joe Dolson comes with a few additional features I find helpful.

WP Dark Mode

I use WP Dark Mode to add a dark mode theme based on a user’s preference. It also comes with a little light mode/dark mode toggle I placed in the lower right corner. Does the trick pretty well, but of course there’s a premium upgrade. This plugin sends a lot of unwanted marketing notifications, though, so be warned. I may be on the lookout for a basic plugin without all the bells and whistles and nag screens.

WP Robots TXT

WP Robots TXT by George Pattihis adds a robots.txt file I can customize in my /wp-admin/ settings. In the age of crawlers and content scrapers, this is sadly necessary.

WP Toolbelt

WP Toolbelt by Ben Gillbanks is a bit of an alternative to Jetpack: it adds a lot of optional features to a WordPress site. I’m using it for the following:

  • wp-admin tweaks like bigger checkboxes, highlighting table rows on hover, etc.
  • privacy-focused replacement for Gravatar
  • removing unnecessary HTML from the site header
  • lazy loading images
  • removing IP addresses from comments
  • spam blocking
# Posted by Nick Simson on Wednesday, November 15th, 2023 at 10:00pm

5 Shares

# Shared by Fynn Becker on Monday, October 2nd, 2023 at 1:06pm

# Shared by Stuart :progress_pride: on Monday, October 2nd, 2023 at 2:38pm

# Shared by tjkendon on Monday, October 2nd, 2023 at 3:10pm

# Shared by TJ Kendon on Monday, October 2nd, 2023 at 3:10pm

# Shared by 猫の手も借りたい。 on Monday, October 2nd, 2023 at 7:03pm

12 Likes

# Liked by bkardell on Monday, October 2nd, 2023 at 1:05pm

# Liked by Fynn Becker on Monday, October 2nd, 2023 at 1:05pm

# Liked by Btrinen on Monday, October 2nd, 2023 at 2:01pm

# Liked by shaunrashid on Monday, October 2nd, 2023 at 2:38pm

# Liked by nate on Monday, October 2nd, 2023 at 3:10pm

# Liked by 猫の手も借りたい。 on Monday, October 2nd, 2023 at 9:40pm

# Liked by Professor von Explaino on Tuesday, October 3rd, 2023 at 12:07am

# Liked by Justin Myers on Tuesday, October 3rd, 2023 at 1:50am

# Liked by Nick Simson on Tuesday, October 3rd, 2023 at 5:15pm

# Tuesday, October 3rd, 2023 at 8:32pm

# Liked by Sahil 🐧 on Thursday, October 12th, 2023 at 1:53am

# Monday, November 20th, 2023 at 2:16am

2 Bookmarks

# Bookmarked by Chris Jones on Monday, October 2nd, 2023 at 12:46pm

# Bookmarked by https://desmondrivet.com/ on Monday, October 2nd, 2023 at 4:19pm

Related posts

Browser history

From a browser bug this morning, back to the birth of hypertext in 1945, with a look forward to a possible future for web browsers.

Disclosure

You’re in a desert, you see a tortoise lying on its back, and your call is very important to us.

Filters

A web by humans, for humans.

InstAI

I object.

Continuous partial ick

Voigt-Kampff.

Related links

AI Agents robots.txt Builder | Dark Visitors

A handy resource for keeping your blocklist up to date in your robots.txt file.

Though the name of the website is unfortunate with its racism-via-laziness nomenclature.

AI and Asbestos: the offset and trade-off models for large-scale risks are inherently harmful – Baldur Bjarnason

Every time you had an industry campaign against an asbestos ban, they used the same rhetoric. They focused on the potential benefits – cheaper spare parts for cars, cheaper water purification – and doing so implicitly assumed that deaths and destroyed lives, were a low price to pay.

This is the same strategy that’s being used by those who today talk about finding productive uses for generative models without even so much as gesturing towards mitigating or preventing the societal or environmental harms.

Ideas Aren’t Worth Anything - The Biblioracle Recommends

The fact that writing can be hard is one of the things that makes it meaningful. Removing this difficulty removes that meaning.

There is significant enthusiasm for this attitude inside the companies that produce and distribute media like books, movies, and music for obvious reasons. Removing the expense of humans making art is a real savings to the bottom line.

But the idea of this being an example of democratizing creativity is absurd. Outsourcing is not democratizing. Ideas are not the most important part of creation, execution is.

How do we build the future with AI? – Chelsea Troy

This is the transcript of a fantastic talk called “The Tools We Still Need to Build with AI.”

Absorb every word!

The mainstreaming of ‘AI’ scepticism – Baldur Bjarnason

  1. Tech is dominated by “true believers” and those who tag along to make money.
  2. Politicians seem to be forever gullible to the promises of tech.
  3. Management loves promises of automation and profitable layoffs.

But it seems that the sentiment might be shifting, even among those predisposed to believe in “AI”, at least in part.

Previously on this day

10 years ago I wrote Polyfills and products

Trying to write long-lasting code when you’re working in an agency.

12 years ago I wrote Scrollin’, scrollin’, scrollin’

Keep them updates scrollin’.

17 years ago I wrote España

Next stop: Asturias.

20 years ago I wrote Earth shattering

My brother-in-law lives and works in Seattle. That’s his workplace they’re talking about in this Newsweek article about Starbucks.

21 years ago I wrote Liggin'n'giggin

I’m feeling a bit fragile today after a somewhat hedonistic night out.

23 years ago I wrote Tony comes to Brighton

Tony Blair was in Brighton today for the Labour party conference. Here’s the full text of his speech. It’s pretty stirring stuff although mentioning Europe right now smacks a little of opportunism. Overall, a good speech from a great speaker.

23 years ago I wrote The W3C Patent Policy

The World Wide Web Consortium has come under a lot of fire recently for burying a proposal that would allow its recommendations to be released under a fee-paying licence.