Opened 4 months ago
#60805 new feature request
Reading Settings: add option to discourage AI services from crawling the site
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | Privacy | Keywords: | |
| Focuses: | privacy | Cc: | |
Description
I'd like to suggest a new addition to the bottom of the Reading Settings screen in the dashboard:
This new section would help site owners indicate whether or not they would like their content to be indexed by AI services and used to train future AI models.
There has been a lot of discussion about this over the past two years: content creators and site owners have asked whether their work could, and should, be used to train AI. Opinions vary, but I believe most would agree that, as a site owner, it would be nice to be able to make that choice for my own site.
In practice, I would imagine the feature working just like the Search Engines option just above it: when toggled, it would edit the site's `robots.txt` file and disallow a specific list of AI services from crawling the site.
There are typically four main approaches to discouraging AI services from crawling your site:
- You can add `robots.txt` entries matching the different user agents used by AI services, asking them not to index content via a `Disallow: /` rule.
  - This seems to be the cleanest approach, and the one AI services are most likely to respect.
  - It also has an important limitation: it relies on a list of AI user agents that would have to be kept up to date, and it would be hard for that list to ever be fully exhaustive. See an example of the user agents we would have to support below.
- You can add an `ai.txt` file to your site, as suggested by Spawning AI here.
  - However, we have no guarantee that AI services currently recognize and respect this file.
- You could add a meta tag to your site's `head`: `<meta name="robots" content="noai, noimageai" />`. This was apparently first implemented by DeviantArt.
  - I do not know whether this is actually respected by AI services, and it is not an HTML standard today. In fact, discussions for a new HTML standard are still in progress, and suggest a different tag (reference).
  - If such a standard were accepted, and if AI services agreed to use it, it may be the best implementation in the future, since we would not have to maintain a list of AI services.
- You can completely block specific user agents from accessing the site.
  - I believe we may not want to implement something that drastic in WordPress Core, since it could end up blocking real visitors. This is better left to plugins.
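To make the first approach concrete, the `robots.txt` additions would look something like the following (a short sample, not the full list of user agents):

```
# Discourage a few known AI crawlers (sample entries only).
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Each crawler needs its own `User-agent` group, which is why the approach depends on maintaining a list of agent names.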
Some plugins already implement some of the approaches above, which suggests there may be interest in including such a feature in Core:
- ChatBot Blocker
- Simple NoAI and NoImageAI
- Block AI Crawlers
- Block Chat GPT via robots.txt
- Block Common Crawl via robots.txt
- WordPress Block AI Scrapers
If we were to go with the first approach, here are some examples of the user agents we would have to support:
- Amazonbot
  - https://developer.amazon.com/support/amazonbot
- anthropic-ai
  - https://www.anthropic.com/
- Bytespider
  - https://www.bytedance.com/
- CCBot
  - https://commoncrawl.org/ccbot
- ClaudeBot
  - https://claude.ai/
- cohere-ai
  - https://cohere.com/
- FacebookBot
  - https://developers.facebook.com/docs/sharing/bot
- Google-Extended
  - https://blog.google/technology/ai/an-update-on-web-publisher-controls/
- GPTBot
  - https://platform.openai.com/docs/gptbot
- omgili
  - https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- omgilibot
  - https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- SentiBot
  - https://sentione.com/
- sentibot
  - https://sentione.com/
This list could be made filterable so folks can extend or modify it as they see fit.
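If Core went this route, the implementation could hook into the existing `robots_txt` filter, much like the Search Engines option does. Below is a rough sketch, not a definitive implementation: the `robots_txt` filter and `get_option()` are real WordPress APIs, but the option name `discourage_ai_services`, the new `ai_robots_user_agents` filter, and the function name are all hypothetical.

```
<?php
/**
 * Hypothetical sketch: append Disallow rules for AI crawlers to robots.txt.
 * The 'robots_txt' filter is an existing WordPress hook; the option name,
 * function name, and 'ai_robots_user_agents' filter are invented here.
 */
function example_discourage_ai_crawlers( $output, $public ) {
	// Imagined option saved by the new Reading Settings checkbox.
	if ( ! get_option( 'discourage_ai_services' ) ) {
		return $output;
	}

	// Filterable list, so it can be extended or modified as needed.
	$agents = apply_filters( 'ai_robots_user_agents', array(
		'Amazonbot',
		'anthropic-ai',
		'Bytespider',
		'CCBot',
		'ClaudeBot',
		'cohere-ai',
		'FacebookBot',
		'Google-Extended',
		'GPTBot',
		'omgili',
		'omgilibot',
	) );

	foreach ( $agents as $agent ) {
		$output .= "\nUser-agent: {$agent}\nDisallow: /\n";
	}

	return $output;
}
add_filter( 'robots_txt', 'example_discourage_ai_crawlers', 10, 2 );
```

A plugin or theme could then add or remove agents via `add_filter( 'ai_robots_user_agents', ... )`, which addresses the concern that no hard-coded list can stay exhaustive.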
Mockup of how such an option would look in the WordPress dashboard