Pulling my site from Google over AI training (tracydurnell.com)
53 points by headalgorithm on July 14, 2023 | 96 comments



I can't understand the outrage. In practice absolutely nothing has changed.

It is reading and learning. A person would read and learn.

This has no bearing on plagiarism or copyright. A work can be considered plagiarized, or to breach copyright, even if the author doesn't recall ever having seen or come across the copyrighted/published work.

This is no different. I can write some code and use it, subconsciously referencing a work.

If I don't check my written work and put it out there, someone might have a claim against me. If I don't check the machine generated work and put it out there, someone might have a claim against me.

OpenAI, Meta, et al. are providing the model, basically a regression model or tool. I'm adding the variables or secret sauce that makes it output that set of data in that specific order, not them. It'd be like suing Parker for making the pen.


A human cannot learn from and re/produce work they view at the speed, volume, and scale that “AI” does, nor can that human be infinitely replicated and farmed out. When describing this gap in capabilities, or the consequences of “learning”, “orders of magnitude” would be a comical understatement.

Existing conventions around “learning” are built on assumptions of human scale, and the expected consequences thereof.

I can’t understand why one would expect people to go “oh it’s technically ‘learning’ I guess I’ll ignore all the consequences that weren’t present when it was just humans”.


That's completely subjective. What you describe is a spectrum. If that were true, theses would not have to be run through plagiarism scans (they are), and there would be no copyright lawsuits over similarity, such as the Ed Sheeran / Marvin Gaye case.

It's also important to note these laws were always intended to strike a fair balance between the copyright owner and the good of society as a whole. Copyright is not an end in itself.


This is such an important point that I think escapes a lot of people, both on the pro-AI and anti-AI side.

Although everyone probably wishes otherwise, there isn't really a hard objective line for determining if something is infringing or not. And that's actually a good thing! But it means that, in the end, it's up to a judge to weigh a bunch of different, rather fuzzy, factors.


Law is not a static entity; it's morphing and evolving to suit our ideals today. Learning wasn't an issue, until now it is.

We could very well distinguish between machine learning and human learning, even if they're no different from each other in principle.

And although we can't say that ML is plagiarism, at scale it breaches our other moral principles, like privacy and individual identity.

Let's say 10 years from now Google trains an ML model on (close to) all the data available in the world, be it text, visual, or audio. Would they be able to prompt this AI: "You are George Wilkinson from Colorado ** st. 23"? How close would it be to the real George whose data it was trained on?

We can't tune our human learning to this level of precision, but it is only a matter of time for the machine.


The situation you describe is exactly the one that has existed for the last two decades, with Google using that data for its search engine. If it's public, it can be read.


Public book libraries have been free to read for centuries, and yet no single human has ever read even 1% of them. We can't ignore that different scales = different consequences. (And to stock those libraries in the first place, the authors were paid.)


A human cannot perform arithmetic at the speed, volume, and scale that “computation” does, nor can that human be infinitely replicated and farmed out. When describing this gap in capabilities, or the consequences of “mathematics”, “orders of magnitude” would be a comical understatement. Existing conventions around “arithmetic” are built on assumptions of human scale, and the expected consequences thereof. I can’t understand why one would expect people to go “oh it’s technically ‘math’ I guess I’ll ignore all the consequences that weren’t present when it was just humans doing the mathematical operations”.


Do you actually think that as a society we don't view maths differently after the computer?

We absolutely do, and expectations of what's possible are completely different.

It's a good thing, but at the moment copyright over language is wielded by corporations and ignored by corporations. Either we treat copyright like maths and discard it, or we hold corporations to the same standards that they hold us to.

Do you really think that if I train my model on Mickey Mouse and then go ahead and generate "not Mickey", Disney won't try and sue the crap out of me? Almost all these complaints boil down to different standards for them versus us.


Yet we have not banned the calculator; we accept that it empowers even people who did not spend time in school learning math.


Wow. It’s like the printing press! Reproduction has hit the next jump-in-scale point.


This is a great analogy.


So what are these "consequences" you keep referring to?


My pitch:

A search engine that exclusively indexes noindex sites (you can use other sites while spidering) and builds an LLM with the results.
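A tiny sketch of the detection half of that pitch, using only the Python standard library (the spidering and LLM-building are omitted, and the function name is mine):

    import re
    import urllib.request

    def is_noindex(url: str) -> bool:
        # A page can opt out of indexing via the X-Robots-Tag header...
        with urllib.request.urlopen(url, timeout=10) as resp:
            if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
                return True
            html = resp.read(1_000_000).decode("utf-8", errors="replace")
        # ...or via the robots meta tag in the HTML itself.
        meta = re.search(
            "<meta[^>]+name=[\"']robots[\"'][^>]+content=[\"']([^\"']+)[\"']",
            html, re.IGNORECASE)
        return bool(meta and "noindex" in meta.group(1).lower())

A noindex-only search engine would then keep a page only when is_noindex() returns True.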


I rather suspect that this is already being done.


I seem to recall meta-search sites that only listed sites that had disappeared due to DMCA takedowns. There was also "unsafe search", which ran your google search twice, once with "SafeSearch" enabled, and returned the difference between the two result sets, i.e. just the porn.


I had a (terrible) idea to create a search index built on DMCA notices.


Comments are bound to be spicy on this one. I always love it when techbros say that AI learning and human learning are exactly the same, because reading one thing at a time at a biological pace and remembering takeaway ideas rather than verbatim passages is obviously exactly the same thing as processing millions of inputs at once and still being able to regurgitate sources so perfectly that verbatim copyrighted content can be spit out of an LLM that doesn't 'contain' its training material. It's even better when they get so butthurt at being called out that they have a nice little rage-cry.


They only believe it because they think their own labour isn't under threat. Think.


I'm going to savor every last iota of the Shocked Pikachu energy they emit. Especially when they realize, far too late, that building their almighty replacements gets them exactly zero kudos from the people who will inevitably control it.

I'm not going to be apologetic about it either, since the same people who think they're invulnerable also tend to espouse sadistic glee over the impending immiseration of millions due to these developments... just practicing what they preach, after all.


what do you mean, they'll get a whole $20k bonus out of it

then be competing with it for the rest of their lives as it slowly reduces their labour potential to zero


Bit of a related rant.

Just today I googled (and DuckDuckGo'd?) alternatives to Discord (because reasons). The entire search results page was "X top alternatives to Discord." It was all blog-posty kind of stuff with an "author".

And like 90% of it was written by Indian- and African-sounding names. These were clearly "content farms" with low-paid labour and bad grammar, or just authors with nothing better to do than write Yet Another Blog Post about Top Discord Alternatives. Sure, they weren't generated, but the fact that a human was involved in creating something crappy doesn't make it better or unique.

What I was actually looking for was unique content. Either an actual curated list of alternatives (NOT a blog post they update every year). Or an extract from a book where someone posted fiction about a fictional Discord user that meets aliens. Or comments in a forum, or a link to a song-lyrics website for a Weird Al parody song about Discord, a website dedicated to expounding the virtues of cutting the Discord cord, a link to a PDF where someone saved an IRC server's logs about a person switching from Discord to IRC, or an "IRC-MF do you speak it" crass website, or something. Anything but a damn content blog post by some third-world content creator or hipster blog poster from the 1st world.

What I got was garbage. Human-level garbage. Garbage that across hundreds of thousands of websites basically took a piece of content and expanded it with every known combination of words, sentences, and mini-stories and pasted it on a stupid blog post with an author.

And this garbage is what this AI is training on so we can have content farms make more copies of itself with more variations and in different languages now, all so we can pay Google et al attention-coins to magically sift through all that garbage and present us with something a little less garbage-y for us to consume.


Don't forget about also pulling your site from Bing! It would be naive to somehow trust that Microsoft won't use your site for AI training.


And Yandex!

Side note: I have friends that crawled a massive amount of the internet over several months for their own purposes. At this point it's probably impossible to exclude your site, since tons of other people probably link to your site if it's at all of value.


Yeah, this is why robots.txt is garbage. Too many crawlers completely ignore it, so I stopped bothering with it (I have it set to block all bots, but I don't expect it to be effective.)

Instead, I'd just keep an eye on my access logs and block obvious crawlers when I saw them.
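Something as simple as tallying User-Agents is a decent starting point for that. A rough sketch, assuming a standard combined-format access log (the file name and the cutoff of 20 are arbitrary):

    # Tally User-Agent strings from an access log so heavy crawlers stand out.
    import re
    from collections import Counter

    # In combined log format the User-Agent is the last quoted field on each line.
    UA_RE = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    # Anything bot-like near the top is a candidate for a server-level block.
    for agent, hits in counts.most_common(20):
        print(f"{hits:7d}  {agent}")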


The layout on this linked site is incredible: https://vasilis.nl/nerd/how-to-disagree-with-googles-privacy...


Can you pollute their data with hidden elements, or do they only scrape visible stuff?


If Google thinks a site is serving different content to googlebot vs. real users, it will stop returning that site on the SERP, because that is a malware distribution technique, among other reasons.


What if you serve the same site? Does googlebot know that some text has the same color as the background?


I don't know, that's not my game and I never worked on crawl or index, but I imagine they have that handled since invisible keyword spam at the bottom of the page was already a universal spammer technique by 1997.


> I’m going to start by pulling my websites out of Google search, then work on adding my sites to directories. Maybe I’ll even join a webring

I'm curious; this is the first time I've heard of a webring, and I'd like to learn more about these alternate discovery routes. Anyone have any concrete experience or recommendations to share?


Directory sites and webrings were artifacts of the early web, when search engines were pretty bad. They disappeared once search got better.

https://en.wikipedia.org/wiki/Webring


I miss webrings. I don't think that search getting better killed them (they are useful for reasons unrelated to the state of search), but when personal and hobbyist websites started vanishing, webrings went with them.


Ironically, once search got better for the users, there was no further need to build comprehensive sites with great content in the hopes of getting added to webrings and directories. All sites needed to worry about after that was getting backlinks.

After that, sites also stopped linking to each other.


Webrings were a great, essentially curated list of sites the authors of sites you liked thought you might also enjoy or find useful. I miss them, too.


yah, come to think of it, in the curated space this reminds me of the "awesome X" family of GitHub pages. Looks like someone compiled a bunch of them here: https://github.com/sindresorhus/awesome#databases. I have found those to be highly valuable treasure troves of rich and relevant information.


To add some context, expanding on what wolpoli mentioned:

Sure, search engines got better, and then they got worse.

Many of the things webrings promoted were 'similar sites'.

Yet it was when Google became big that it destroyed webrings and blogrolls (which were similar to webrings, but often included a variety of sites, not just similar ones).

It became known that Google penalized people for linking to other sites, for linking to 'bad neighborhoods' - it would sometimes call out sites publicly for giving 'link juice' to others for profit.

Webrings and blogrolls disappeared because of Google.

Not because search engines got better, but because Google threatened to penalize you for linking out.


I'd love to see a resurgence of them. Google is ever increasingly becoming useless as a search tool; when I see that the first page is junk, I just give up on the search, make a note of what I'm looking for, and try again later or use alternate means of information retrieval.

It feels like at some point in the last 5 years we crossed some threshold where search engines are so optimized for the commercial space and selling junk that obscure searches yield no useful information. Any keyword that is unfortunate enough to overlap with the commercial product space will just dominate your results and make them mostly useless. Even my advanced google-fu of tacking on certain phrases and other bits of language to narrow things down, which formerly gave very focused results, now feels like it's atrophying too.

It doesn't help that our culture has become so fond of word overloading: instead of creating new words and phonetic combinations, existing words get reused and now require disambiguation to say "not the product kind of this word."


don't forget Jimmy "Jimbo" Wales's Bomis webring thing, featuring the Bomis Babe Report!


Stop trying to make the rest of us feel old lol.


The author should update their robots.txt for Googlebot as well. It is not clear whether noindex means "no train" too. The entire webpage has to be read and parsed for Google to extract that meta tag, whereas robots.txt stops the crawler before it proceeds to the rest of your site.
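For reference, a minimal robots.txt along these lines (just a sketch; CCBot is Common Crawl's crawler, which the author already blocks):

    User-agent: Googlebot
    Disallow: /

    User-agent: CCBot
    Disallow: /

The trade-off is that once Googlebot is disallowed it can no longer fetch pages at all, so it will never see a per-page <meta name="robots" content="noindex"> tag.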


The author mentioned she was going to block Googlebot, just not yet, in order to make sure it can crawl the site again and pick up the 'noindex' instructions first.


As a Substack author, with “permission is not granted to use any portion of this to train an AI” at the bottom of most of my posts, it's bullshit that you have to do this sort of thing, and that it will almost certainly not work.

This must be illegal, but how are all the little bloggers going to oppose it?


Why do you think it would be illegal? You can state "permission is not granted to X" on anything you want, but that doesn't mean the law is on your side. Regular rules of copyright still apply.

P.S. Permission is not granted to downvote my comment!


[flagged]


That is not settled law. In the US, the key is whether this is derivative or transformative.

You can read a book without your brain getting owned by the author.

It's perfectly reasonable to say it should be considered copyright infringement, but such cases are in the court now.

Disclaimer: I am not a lawyer.


Vitriol aside, you need to chill for a bit and touch grass.

"Training" doesn't really have a well-defined meaning, I could use your website to train something as simple as a histogram of word counts for an AI for example. Nothing about that constitutes copyright infringement under even the loosest definition of their legal concept.

Additionally weights from training and the AI's output are two completely different matters from a legal perspective as well.


Ok, so if it's already copyright infringement then what does writing "permission is not granted" at the bottom of your post do, exactly?


I agree it is wrong and should be illegal. That being said, I do find the argument that it's no different than a human learning from and occasionally reconstructing copyrighted things compelling.


Most normal humans do not spend their time profitably selling their "occasional reconstructions of copyrighted things" at a rate of millions of users per second, which is a pretty important difference in practice.


That said, the law was not made with superhumans in mind: ones able to reproduce, slightly transformed, nearly everything they read (all the world's knowledge). A clarifying law should be created.


Is it legal to transcribe a book from memory for money? Does it matter how faithful the transcription is?


> Is it legal to transcribe a book from memory for money?

If it's an accurate transcription and you don't have permission, then it's not legal. It doesn't matter if it's for money or not (or if it's from memory or not).

> Does it matter how faithful the transcription is?

Yes, it matters. Copyright covers the specific expression of an idea, not the idea itself.


It's only illegal if a law makes it illegal.

It's not clear to me that it should be illegal.


What's right or wrong, and what's legal or illegal, are two different things. There are plenty of right things that are illegal and wrong things that are legal.


I don't see why it would be illegal, AI reading it should be no different from anyone else.


"Illegal" is too strong. But if you specifically disallow the use of your website contents from being used to train AI, then anyone doing so is violating the terms of service.

Which doesn't really mean anything.

At this point, the only defense I can think of is to not make the content publicly available. Which is what I've done.


It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later consume without visiting the original site. Fair use doctrine has long held that small pieces of copyrighted material can be reproduced, but the line is very blurry and generally has to be litigated if there's any ambiguity whatsoever. I'd bet many of the models we're currently using today will be pulled from serving the public over copyright lawsuits in the coming years.

I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on that. Eventually new models will just pop up with more carefully curated data anyway.


> I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on that.

From a US copyright law point of view, this is most likely correct. Copyright law doesn't prevent you from ingesting copyrighted works, it prevents you from distributing them.

There is also a great deal of existing case law about how different a work has to be before it's not infringing from another work anymore. There are existing rules of thumb judges go by when trying to determine if infringement occurred. They include things like the amount of difference in expression, the quantity, whether or not it's incidental, etc.

And that's not even getting into the question of fair use -- which is a whole other kettle of fish.

I suspect that the courts will deal with these issues the way that they've always dealt with these issues: on a case-by-case basis.


But you don't know what the intention of a human reader is either. It could be the same thing, too.


Sure, but that would be illegal too. I'm saying it doesn't matter who reads your website, but everyone knows exactly what GPT and Bard are going to do with the information they're "learning" from it, so they're trying to block them from reading it in the first place.


They're not doing much; they're updating probabilities in a regression model. What the users of the tool do thereafter is the question.


Many LLMs will happily recite large segments of copyrighted material word-for-word, despite the fact that it can be difficult to tell what's happening "under the hood".


Many people can do that too? It's what they do with it that's important.


> It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later

It could be illegal if the AI reproduces vast portions of it. If you could ask the LLM over a series of prompts to generate a significant portion of the content (as copyright law defines it), then yes.

As long as the AI isn't reproducing it, then I am not sure if it would count.


If I recite the vague plot of a novel or a fact I learned from an encyclopedia I'm not reproducing anything, certainly not violating copyright law.

I don't see why AI developers should be expected to think otherwise and worsen their training data over this.


Scale and position matter. Google is the conduit that connects most people to most websites, so in the EU they are considered a "gatekeeper" and need to be careful about conflicts of interest with the people and websites using their "gate". I hope American competition law catches up to the point we can recognize that market makers simply should not be participating in the markets they make (and Google search is a market maker; it's connecting "buyers" [viewers or advertisers, depending on your perspective] to "sellers" [websites or viewers, respectively]), but I digress.

The point is that Google has a certain market position that makes it very different when they "recite the vague plot of a novel or a fact they learned". The point of competition law is to "distort" free market capitalism for the betterment of society. This is one of those cases where practical considerations trump information idealism. The quality of information on the internet will go down if we stop rewarding original publishers.


I keep having to say this: an AI is not a person.


Yep, there’s a big difference in practice. If an AI could attribute and provide royalties then it may not be so different but that’s never going to happen. A big reason Bard exists is Google trying to ensure they stay profitable and relevant. They don’t care where the knowledge really comes from.


> If an AI could attribute and provide royalties then it may not be so different

But even that requires the permission of the copyright holder. Nobody is required to accept an infringing use of their work in exchange for royalties.


Don't forget to provide royalties for every synapse in your head.


This is the most braindead take that AI bros keep pushing. Your 'AI' model is NOT A PERSON and therefore it is different.


Does that include book readers for the blind? They typically have some sort of optical character recognition and benefit a user, just like an ML training dataset benefits users.

My point being: it's exceptionally hard to create laws that deny precisely what you don't want and allow precisely what you want, without quickly getting into details that bring the entire law's assumptions into question. Here, that assumption being "because an ML training system is not a person, it has no right to scan the web".


The main difference here is that these AI bots are operating with an entirely different agenda. The ethics remain to be seen, and the jury is out as to whether they will benefit the user the way they promise they will.

They also operate on a whole different scale, and instead of supplementing web content they devalue it to a degree.


The "ai bots" aren't operating with an agenda- at least as far as we can tell now, training algorithms and their scrapers do not have agency.

Basically you're assuming the agenda of the operator, saying "that's bad an shouldn't be allowed". But I see the web- except for things specifically labelled with standard copyright disclaimers- as effectively a large corpus of publicly available data, "in the market square for all to see".


This whole AI scraping argument is so silly to me. If you don't want people downloading and processing your content, then don't post it on the public internet?


They literally outline their reasoning in a link[0]. There’s a significant gap between offering information for someone to read for free (where I can see the author and choose to respect their terms, if they have any), and a huge tech company aggregating that data where it is assimilated into a model and used in a product they will profit from[1]. They are exploiting gray area regarding digital rights, copyright, etc.

[0] https://tracydurnell.com/2023/07/07/the-next-big-theft/

[1] https://www.tumblr.com/nedroidcomics/41879001445/the-interne...


It's not about not wanting people to access the content, it's about not wanting AI bots to do so. But yes, I agree, the only realistic defense we have is to remove it from the publicly-accessible web.


Honestly, if you're so terrified now of what happens to your information once you post it publicly, just cut the cord. It was ALWAYS like this.


Why the preference not to have Google train their AI using your website content?


One reason is the same as for authors who don't want their books used for training and actors who don't want their likeness used for training. If that content is valuable, it allows Google to realize that value with no return to the creator.


The naivety is in assuming that a small number of people making access hard will affect the value they're able to realize.


I bet the reasons vary from person to person. My reason is because I think that these AI systems pose too great of a risk to society, and I want to make sure that I'm not helping them in any way.


Ah, yes. More complaining about freely posting content publicly on the internet and then being upset when it's used in a way you don't want. I'm sure foreign companies and even governments are doing something similar; what will U.S. laws do to stop that?


jots down "CRAWL AND SCRAPE A WEB RING"

Got it.


> Blocking bots that collect training data for AIs (and more)

> In addition, I created a robots.txt file to tell “law abiding” bots what they’re not allowed to look at. I ought to have done this before but kind of assumed it came with my WordPress install (Nope.)

> I specifically want to deter my website being used for training LLMs, so I blocked Common Crawl.

Instead of blocking, it would be neater to present an alternative version to the crawlers (like many paywalled sites already do for SEO) that's full of dynamically generated LLM garbage. That'll help the LLMs poison themselves.
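A minimal sketch of how that could look, assuming the crawlers identify themselves honestly via User-Agent (the bot list, page contents, and Flask setup are purely illustrative):

    # Serve the real page to ordinary visitors and decoy text to suspected AI crawlers.
    from flask import Flask, request

    app = Flask(__name__)

    # Hypothetical User-Agent substrings to divert; a real list would need upkeep.
    AI_BOT_MARKERS = ("ccbot", "someotheraibot")

    REAL_PAGE = "<html><body>The actual article text.</body></html>"
    DECOY_PAGE = "<html><body>Dynamically generated nonsense would go here.</body></html>"

    @app.route("/")
    def index():
        ua = request.headers.get("User-Agent", "").lower()
        if any(marker in ua for marker in AI_BOT_MARKERS):
            return DECOY_PAGE
        return REAL_PAGE

The obvious weakness is the same one robots.txt has: it only catches bots that are honest about their User-Agent.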


I was thinking about jamming along these lines, but the problem is that it's a game of whack-a-mole -- you have to keep up on what bots are active (robots.txt doesn't really help here, and focusing on Common Crawl is insufficient).

My websites have been closed to the public since shortly after the release of ChatGPT, but I've been considering opening them up again, sort of. The not-logged-in experience being full of dynamically generated LLM poison as you suggest -- for everybody rather than trying to single out crawlers -- and you have to log in to get to the real contents of the site.


So if someone innocently reaches your site from google they will see a bunch of LLM generated misinformation?

You have a legal right to do this (assuming the LLM bullshit isn't libel) but I don't see how it could be considered a moral act.


> So if someone innocently reaches your site from google they will see a bunch of LLM generated misinformation?

Yes, although I've been blocking googlebot for years, so nobody will get to my sites through google anyway.

> I don't see how it could be considered a moral act.

I'm curious about this -- why do you think this is in any way an immoral act?

If a naïve human comes across the site, they'll quickly realize that it's not useful and move on. No harm done. How does morality enter into it?

Would your moral objections be eased if the first line on the page is something like "this page is full of machine-generated nonsense. Please ignore it"?


>If a naïve human comes across the site, they'll quickly realize that it's not useful and move on

I do not share your confidence.

>Would your moral objections be eased if the first line on the page is something like "this page is full of machine-generated nonsense. Please ignore it"?

That would certainly help.


> That would certainly help.

That seems reasonable enough. If I do this, I'll include such a disclaimer.


> Instead of blocking, it would be neater to present an alternative version to the crawlers

IIUC, if a site presents a different view of its content to the crawler than to users, the site can get de-indexed.


If that's a concern, and you're only worried about AI crawlers, then that's not a problem. Only provide the bogus pages to the AI crawlers, not to the search engine crawlers.

Assuming there's a difference, anyway. I suspect that with Google and Bing, there isn't.


An interesting thought is government or foreign actors training a generative AI. They won't abide by any noindex tag and will scrape anything and everything.


It doesn't have to be government or foreign actors. Abiding by the contents of robots.txt isn't required of anybody at all. It's merely a social convention.


Helpful links. I will be doing the same this evening.


[flagged]


"Ironically, the uniformity of the copies of Gutenberg’s Bible led many superstitious people of the time to equate printing with Satan because it seemed to be magical. Printers’ apprentices became known as the "printer’s devil." In Paris, Fust [a typographer] was charged as a witch. Although he escaped the Inquisition, other printers did not." (The Unsung Heroes, a History of Print by Dr. Jerry Waite 2001)"



