Tags: scraping

9

Wednesday, July 3rd, 2024

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

This is a great move from Cloudflare. I may start using their service.

Monday, July 1st, 2024

Wallfacing

The Dark Forest idea comes from the Remembrance of Earth’s Past books by Liu Cixin. It’s an elegant but dispiriting solution to the Fermi paradox. Maggie sums it up:

Dark forest theory suggests that the universe is like a dark forest at night - a place that appears quiet and lifeless because if you make noise, the predators will come eat you.

This theory proposes that all other intelligent civilizations were either killed or learned to shut up. We don’t yet know which category we fall into.

Maggie has described The Expanding Dark Forest and Generative AI:

The dark forest theory of the web points to the increasingly life-like but life-less state of being online. Most open and publicly available spaces on the web are overrun with bots, advertisers, trolls, data scrapers, clickbait, keyword-stuffing “content creators,” and algorithmically manipulated junk.

It’s like a dark forest that seems eerily devoid of human life – all the living creatures are hidden beneath the ground or up in trees. If they reveal themselves, they risk being attacked by automated predators.

Those of us in the cozy web try to keep our heads down, attempting to block the bots plundering our work.

I advocate for taking this further. We should fight back. Let’s exploit the security hole of prompt injections. Here are some people taking action:

I’ve taken steps here on my site. I’d like to tell you exactly what I’ve done. But if I do that, I’m also telling the makers of these bots how to circumvent my attempts at prompt injection.

This feels like another concept from Liu Cixin’s books. Wallfacers:

The sophons can overhear any conversation and intercept any written or digital communication but cannot read human thoughts, so the UN devises a countermeasure by initiating the “Wallfacer” Program. Four individuals are granted vast resources and tasked with generating and fulfilling strategies that must never leave their own heads.

So while I’d normally share my code, I feel like in this case I need to exercise some discretion. But let me give you the broad brushstrokes:

  • Every page of my online journal has three pieces of text that attempt prompt injections.
  • Each of these is hidden from view and hidden from screen readers.
  • Each piece of text is constructed on-the-fly on the server and they’re all different every time the page is loaded.

You can view source to see some examples.

I plan to keep updating my pool of potential prompt injections. I’ll add to it whenever I hear of a phrase that might potentially throw a spanner in the works of a scraping bot.

By the way, I should add that I’m doing this as well as using a robots.txt file. So any bot that injests a prompt injection deserves it.

I could not disagree with Manton more when he says:

I get the distrust of AI bots but I think discussions to sabotage crawled data go too far, potentially making a mess of the open web. There has never been a system like AI before, and old assumptions about what is fair use don’t really fit.

Bollocks. This is exactly the kind of techno-determinism that boils my blood:

AI companies are not going to go away, but we need to push them in the right directions.

“It’s inevitable!” they cry as though this was a force of nature, not something created by people.

There is nothing inevitable about any technology. The actions we take today are what determine our future. So let’s take steps now to prevent our web being turned into a dark, dark forest.

Monday, June 17th, 2024

AI Pollution – David Bushell – Freelance Web Design (UK)

AI is steeped in marketing drivel, built upon theft, and intent on replacing our creative output with a depressingly shallow imitation.

Saturday, December 30th, 2023

AI Agents robots.txt Builder | Dark Visitors

A handy resource for keeping your blocklist up to date in your robots.txt file.

Though the name of the website is unfortunate with its racism-via-laziness nomenclature.

Saturday, September 30th, 2023

Robots.txt - Jim Nielsen’s Blog

I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.

Thursday, September 21st, 2023

Monday, January 2nd, 2017

Mercury by Postlight

Readability is back, but now it’s called Mercury.

Thursday, January 16th, 2014

kimono : Turn websites into structured APIs from your browser in seconds

This tool for building ScrAPIs is an interesting development—the current trend for not providing a simple API (or even a simple RSS feed) is being interpreted as damage and routed around.

Sunday, December 9th, 2012

I Don’t Need No Stinking API: Web Scraping For Fun and Profit | Hartley Brody

A handy step-by-step guide to scraping HTML to get data out. Useful for services (—cough—Twitter—cough—) that keep changing the rules of their API use.