Tags: user
Wednesday, July 3rd, 2024
Declare your AIndependence: block AI bots, scrapers and crawlers with a single click
This is a great move from Cloudflare. I may start using their service.
Monday, June 17th, 2024
AI Pollution – David Bushell – Freelance Web Design (UK)
AI is steeped in marketing drivel, built upon theft, and intent on replacing our creative output with a depressingly shallow imitation.
Neatnik Notes · Gotta block ’em all
While we’re playing whack-a-mole, let’s poison these rodents.
Blocking bots – Manu
Blocking the bots is step one.
Sunday, June 16th, 2024
Perplexity AI Is Lying about Their User Agent • Robb Knight
See, this is exactly why we need to poison these bots.
Wednesday, June 5th, 2024
Home-Cooked Software and Barefoot Developers
A very thought-provoking presentation from Maggie on how software development might be democratised.
Thursday, May 23rd, 2024
Speculation rules and fears
After I wrote positively about the speculation rules API I got an email from David Cizek with some legitimate concerns. He said:
I think that this kind of feature is not good, because someone else (web publisher) decides that I (my connection, browser, device) have to do work that very often is not needed. All that blurred by blackbox algorithm in the browser.
That’s fair. My hope is that the user will indeed get more say, whether that’s at the level of the browser or the operating system. I’m thinking of a prefers-reduced-data setting, much like prefers-color-scheme or prefers-reduced-motion.
But this issue isn’t something new with speculation rules. We’ve already got service workers, which allow the site author to unilaterally declare that a bunch of pages should be downloaded.
I’m doing that for Resilient Web Design—when you visit the home page, a service worker downloads the whole site. I can justify that decision to myself because the entire site is still smaller in size than one article from Wired or the New York Times. But still, is it right that I get to make that call?
So I’m very much in favour of browsers acting as true user agents—doing what’s best for the user, even in situations where that conflicts with the wishes of a site owner.
Going back to speculation rules, David asked:
Do we really need this kind of (easily turned to evil) enhancement in the current state of (web) affairs?
That question could be asked of many web technologies.
There’s always going to be a tension with any powerful browser feature. The more power it provides, the more it can be abused. Animations, service workers, speculation rules—these are all things that can be used to improve websites or they can be abused to do things the user never asked for.
Or take the elephant in the room: JavaScript.
Right now, a site owner can link to a JavaScript file that’s tens of megabytes in size, and the browser has no alternative but to download it. I’d love it if users could specify a limit. I’d love it even more if browsers shipped with a default limit, especially if that limit is related to the device and network.
I don’t think speculation rules will be abused nearly as much as client-side JavaScript is already abused.
Wednesday, February 21st, 2024
The Folly of Chasing Demographics - YouTube
I just attended this talk from Heydon at axe-con and it was great! Of course it was highly amusing, but he also makes a profound and fundamental point about how we should be going about working on the web.
Tuesday, October 17th, 2023
The Web Is For User Agency
I can get behind this:
I take it as my starting point that when we say that we want to build a better Web our guiding star is to improve user agency and that user agency is what the Web is for.
Robin dives into the philosophy and ethics of this position, but he also points to some very concrete implementations of it:
These shared foundations for Web technologies (which the W3C refers to as “horizontal review” but they have broader applicability in the Web community beyond standards) are all specific, concrete implementations of the Web’s goal of developing user agency — they are about capabilities. We don’t habitually think of them as ethical or political goals, but they are: they aren’t random things that someone did for fun — they serve a purpose. And they work because they implement ethics that get dirty with the tangible details.
Monday, October 2nd, 2023
Crawlers
A few months back, I wrote about how Google is breaking its social contract with the web, harvesting our content not in order to send search traffic to relevant results, but to feed a large language model that will spew auto-completed sentences instead.
I still think Chris put it best:
I just think it’s fuckin’ rude.
When it comes to the crawlers that are ingesting our words to feed large language models, Neil Clarke describes the situation:
It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.
Alas, the current situation is opt-out. The onus is on us to update our robots.txt file.
Neil handily provides the current list to add to your file. Pass it on:
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: FacebookBot
Disallow: /
In theory you should be able to group those user agents together, but citation needed on whether that’s honoured everywhere:
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /
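If you want to check how a well-behaved parser treats that grouped block, Python’s standard library happens to ship one. This is just a sanity-check sketch—the bot names come from Neil’s list above, and example.com is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# The grouped rules, exactly as in the snippet above.
rules = """\
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Every listed bot is blocked from every path…
for bot in ("CCBot", "ChatGPT-User", "GPTBot",
            "Google-Extended", "Omgilibot", "FacebookBot"):
    assert not parser.can_fetch(bot, "https://example.com/any/page")

# …while an unlisted agent is still allowed by default.
assert parser.can_fetch("SomeOtherBot", "https://example.com/any/page")
```

At least by this parser’s reading, the grouping is honoured—though that tells you nothing about what any given crawler actually does.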
There’s a bigger issue with robots.txt though. It too is a social contract. And as we’ve seen, when it comes to large language models, social contracts are being ripped up by the companies looking to feed their beasts.
As Jim says:
I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.
That realisation was prompted in part by Manuel Moreale’s experiment with blocking crawlers:
So, what’s the takeaway here? I guess that the vast majority of crawlers don’t give a shit about your robots.txt.
Time to up the ante. Neil’s post offers an option if you’re running Apache. Either in .htaccess or in a .conf file, you can block user agents using mod_rewrite:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|Omgilibot|FacebookBot) [NC]
RewriteRule ^ - [F]
You’ll see that Google-Extended isn’t in that list. It isn’t a crawler. Rather, it’s the permissions model that Google have implemented for using your site’s content to train large language models: unless you opt out via robots.txt, it’s assumed that you’re totally fine with your content being used to feed their stochastic parrots.