How can SE gate access to the Dump that will allow individuals access to the data while preventing "misuse" by for-profit organizations?

Question

I read this answer written by Jody Bailey:

We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community.

Given that SE content is CC BY-SA licensed, how can SE gate access to the Dump that will allow individuals access to the data while preventing "misuse" by organizations looking to profit from the work of our community?

One may have to sign some "I don't train LLMs" agreement to access the dump, but the content itself will still be CC BY-SA licensed, so I don't see how SE can tell whether the data was obtained via the dump or by regular site scraping. Any watermark in the dump is also likely to be undetectable once an AI model has been trained on it.

"we decided to stop the dump until we could put guardrails in place" reads to me like the data dumps have been stopped indefinitely, with no concrete criterion for resuming them. The answer to "how can SE gate access ..." is, they don't know "yet", but it's better for them to say they're looking for ways than to say there just won't be any more data dumps. — kaya3, Commented Jun 9, 2023 at 18:55
@kaya3 ok then what initiatives exist to create a publicly available dump of SE under the CC BY-SA 4.0 license? — Franck Dernoncourt, Commented Jun 9, 2023 at 18:56
This is probably a question for law SE or open source SE, but given section 2(a)(5)(B) of the CC-BY-SA 4.0 license, are SE legally able to restrict someone from either 'misusing' the dump, or redistributing the dump to people who would 'misuse' it? — user1937198, Commented Jun 9, 2023 at 19:01
@user1937198 I expect they would be, for the same reason that phone books are protected by copyright. SE can have an intellectual property interest in a collation or database of posts which is separate from the intellectual property interests of the authors of those posts. — kaya3, Commented Jun 9, 2023 at 19:10
@kaya3 Then Section 3(a)(4) comes into play, and that copyright must be licensed under the terms of CC-BY-SA 4 for SE distribute it as a combined work. — user1937198, Commented Jun 9, 2023 at 19:19
@user1937198 content is dual licensed. The second license allows SE to do whatever. — Franck Dernoncourt, Commented Jun 9, 2023 at 19:22
@user1937198 There doesn't seem to be a section 3(a)(4) in the text of the license, and I'm not sure what legal meaning "combined work" has (it doesn't appear in the license text). If you mean 2(a)(4) then this concerns format conversions, e.g. you're allowed to convert a licensed music file from FLAC to MP3, and doing so doesn't create a derivative work; this wouldn't apply to a database where the licensed work is just one entry among many. — kaya3, Commented Jun 9, 2023 at 19:29
@kaya3: In the US, phone books are not protected by copyright except for their "selection and arrangement," and then only to the extent that those things are "original." Even then, the copyright does not protect the actual data, just the way it is presented. You can freely take that data, rearrange it in a different way, and publish it yourself. The fact of the matter is, the US does not recognize database rights. — Kevin, Commented Jun 9, 2023 at 19:36
Whoops, I accidentally ended up on CC-BY. It's 2(a)(5)(B) that would make any SE copyright fall under CC-BY-SA if they redistribute, and 2(a)(5)(C) that stops them adding extra restrictions. — user1937198, Commented Jun 9, 2023 at 19:41
@FranckDernoncourt Good point, so at that point the whole idea of licensing by CC-BY-SA becomes effectively worthless. — user1937198, Commented Jun 9, 2023 at 19:43
@user1937198 Certainly not. The purpose of the CC BY-SA license is not to restrict SE in what it can do, but to allow others to distribute and modify the content under suitable conditions. This is unaffected by the existence of an alternative license that gives SE more permissions. — Emil Jeřábek, Commented Jun 10, 2023 at 7:58
@VanitySlug-codidact.com Quora, Yahoo Answers, etc. QA websites typically don't release dumps and don't have a decent content license. SE is (was?) the exception, amongst the major QA websites. — Franck Dernoncourt, Commented Jun 15, 2023 at 17:35
@FranckDernoncourt I was commenting on the trend of turning free API into a paid one. — Vanity Slug - codidact.com, Commented Jun 15, 2023 at 18:53
The phrasing is disingenuous. "looking to profit from the work of our community" is BS - the concern is the width of your moat. Users don't give a single thought to how their forum posts are used. Make it open or charge, but don't pretend this is anything more than a business concern. SE seems to have become a very two-faced organization. Just be honest, your marketing people aren't fooling a bunch of engineers anyhow. — Ed Swangren, Commented Jul 17 at 2:31

BryKKan · Accepted Answer · 2023-06-12 21:10:56Z

Personally, I don't think they can. At least not legally. To be clear, I imagine they might be able to avoid sharing the data dumps at all - at least in the future, once they've fulfilled their existing promises as to recent data. That is, they could make this dump, announce the change per existing policy, make the next dump, then safely stop.

But even that's a wildly counter-productive idea because, as others have mentioned, the critical data points are legally obtainable by a (far more bandwidth-intensive) crawling operation. So that's what everyone will start doing if the dumps go the way of the dinosaurs. If SE is looking for a way to drive up their hosting costs and further alienate the community all at once, this is a good move. Otherwise not so much.

Here's why I think they can't enact their "guardrail" plan:

From the CC-BY-SA 4.0¹ license

Section 1 is a list of definitions of other terms. It's not a bad idea to skim it before proceeding, but there's only one I'm going to mention here. Paragraph 1(e) defines "Effective Technological Measures" incredibly broadly. Pretty much anything that the US DMCA or "related laws" might otherwise prohibit us from bypassing is included. That arguably includes most or all flavors of DRM and access control.

This leads to some serious problems later on, specifically with paragraph 2(a)(5)(C).

No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.

Any form of technical "guardrail" which effectively gates access to the dumps or seriously attempts to hobble their utility to 3rd parties would arguably run afoul of this, especially if SE makes data access conditional on agreement to new usage restrictions. By the above, they cannot require anyone who receives even partial data dumps to agree not to republish, possibly in concert with other volunteers to reconstruct the whole.

Further, because of the "ShareAlike" portion of the license applying (more on that later), they also can't prevent anyone from transforming it into more useful data formats, including collectively scraping the site and sharing it all into a monolithic data dump of our own. Most importantly, SE is not permitted to create any technical obstacles to such unrestricted reformatting and sharing.

For content viewed through normal browsing, paragraph (A) of that same subsection automatically passes along the underlying license to do so:

Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.

We could write a simple browser plugin for users to install, which silently checks if posts on a given SE page are already in our user DB, and have the user's browser upload a copy of the page data if not. We could even do server-side validation against the SE site. Just doing validation of new/updated submissions is even more clearly fair use than direct scraping, and can't be blocked legally.
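As a rough illustration of the "check, then upload only what's missing" step, here is a minimal sketch. It is written in Python for brevity (a real browser plugin would more likely be JavaScript), and every name in it (`Post`, `SharedArchive`, `sync_page`, the `revision` counter) is hypothetical, invented for this example. It assumes each post can be identified by a numeric ID plus some revision marker, and it uses an in-memory dictionary as a stand-in for the shared DB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: int   # hypothetical: numeric ID of a question or answer
    revision: int  # hypothetical: increases whenever the post is edited
    body: str


class SharedArchive:
    """In-memory stand-in for the community-run store (illustrative only)."""

    def __init__(self) -> None:
        self._latest = {}  # post_id -> newest Post seen so far

    def missing_or_stale(self, seen: list) -> list:
        """Return posts from a page that the archive lacks or holds in an older revision."""
        return [
            p for p in seen
            if p.post_id not in self._latest
            or self._latest[p.post_id].revision < p.revision
        ]

    def upload(self, posts: list) -> None:
        """Store the given posts, keeping only the newest revision of each."""
        for p in posts:
            current = self._latest.get(p.post_id)
            if current is None or current.revision < p.revision:
                self._latest[p.post_id] = p

    def __len__(self) -> int:
        return len(self._latest)


def sync_page(archive: SharedArchive, page_posts: list) -> list:
    """Upload only the new or updated posts from one viewed page; return what was sent."""
    to_send = archive.missing_or_stale(page_posts)
    archive.upload(to_send)
    return to_send
```

Viewing the same page twice would send nothing the second time, and an edited post would be sent again because its revision is newer. Server-side validation against the live site is left out of this sketch.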

It doesn't matter either if we're talking about the Q&A content licensing directly, or SE's database(s). Section 4 makes clear that the database dumps are an "Adapted Material" under 3(b), the "ShareAlike" part of the license. And paragraph 3(b)(3) is equivalent to the restriction in Section 2:

You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.

So any recipients of the dump cannot have any special conditions imposed upon them that are stricter than the "Adapter's License" chosen by SE Inc. Practically, this is still the CC-BY-SA license, as explained in 3(b)(1):

The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.

The license to the underlying user content is effectively passed through to any recipients of a dump as well. 2(a)(5)(B):

Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.

So we're free to extract the data from the dumps and do more or less anything with it. Including sell it² to an LLM company!

Either way, if they do erect such technical barriers, notwithstanding their obligation not to do so, they aren't allowed to stop us from trying to defeat those barriers (i.e. with the plugin I suggested). Both the original creator (us in the community), and SE, by virtue of mutually providing the work under the CC-BY-SA, explicitly authorize us to circumvent such measures under 2(a)(4):

The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures.

A Possible Solution?

Overall, this policy move by SE is both unnecessary and irrational - even if we're rooting solely for their bottom line. The DB provisions of the CC-BY-SA 4.0 license are actually one of the strongest claims SE has to any form of ownership/control of the DB within the US. United States law doesn't (and possibly can't constitutionally) intrinsically provide for so-called "database rights", and the notion that SE has a stake by virtue of "choosing" the data to include in the db is belied by the fact that volunteer moderators and community members make the vast majority of those decisions - including which communities to create - and do so semi-autonomously.

The worst part is, I believe there is actually a way for SE Inc to monetize the database, to have a clearly protected copyright in their own derived value chain, and maintain the community values all at once. The product they should be selling is not the bulk database, but a specifically curated/tuned subset of the larger Q&A database which is designed to be more suitable for training models than the full raw dataset. By meaningfully discriminating on what's included in each subset db, SE will have added value to the data directly, and established a clear ownership right in each unique collection they create.

Moreover, as long as they continue to publish the full, unabridged dumps as they have in the past, they will have arguably met their obligations under the CC license for all such custom databases. That's because in this case we can prove that for any arbitrary data collection they sell, all of the licensed posts within are still freely available to all - as part of the full dump. They just happen to be mixed in with other data. This is purely coincidental, as it always has been, and thus it is neither a "Technological Measure" nor a novel condition of any kind. Only the list of which specific Q's and A's to include/exclude is needed to build a duplicate data set. However, those curation lists and the collections produced from them can be fully copyright protected for SE, provided SE produces them independently. Similarly, they can produce and package fully prepared custom dbs with only the posts which match the appropriate training filter. Each of these custom DBs can then be easily construed as a unique (and copyrightable) collection.
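To make the "only the list is needed" point concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a dump can be treated as a mapping from post ID to post body. The names `build_curated_subset`, `full_dump` and `include_ids` are made up for this example and don't correspond to anything SE actually publishes.

```python
def build_curated_subset(full_dump: dict, include_ids: set) -> dict:
    """Select the posts named by a curation list from the full public dump.

    Raises ValueError if the list names a post that is not in the public
    dump, since the curated collection is meant to contain nothing that
    isn't also publicly available.
    """
    unknown = include_ids - set(full_dump)
    if unknown:
        raise ValueError(
            f"curation list names posts absent from the public dump: {sorted(unknown)}"
        )
    # The curated collection is just a filter over the public data.
    return {post_id: full_dump[post_id] for post_id in sorted(include_ids)}
```

With a toy dump such as `{1: "Q1", 2: "A1", 3: "Q2"}` and a curation list `{1, 3}`, the result holds only posts 1 and 3, each identical to its copy in the full dump. Anyone holding the same list and the public dump could rebuild the same collection, which is why the list itself is the thing of value here.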

Provided that the public dump is entirely a superset of the custom dbs (no new data can be added by SE which is not also in the public dumps without triggering the CC SA provision), then there is no conflict with their license. SE can legally and enforceably lock down those custom DBs to the full extent otherwise allowed by law. LLM companies would of course be free to roll their own filters. But SE has a lot of advantages here as the platform host. For one, they can update and test the commercial databases in near real-time, given sufficient staff. That means by the time a new dump is released, they'd already know exactly which new questions and answers are being added to each db.

So they can release a fully viable model training product with no delay, whereas an outsider has to wait for the dump to even start work, or else scrape the site to try to keep up. If SE Inc gets to the market with a viable product quickly, and charges a fair price for their results, nobody will actually want to roll their own. Even when SE is fully cooperative, handling those dumps is a lot of work.

If only they had a larger engineering staff. Maybe about 40%³ larger? 😋

Conclusion

Permanently reenabling the dumps is probably going to be necessary anyway if the strike is to end amicably. So here's hoping they listen. 🤞

¹Content published under earlier licenses has already been pushed out with past dumps. We could examine SE's rights and obligations under those licenses as well, but since they could remove all of that content from the site and/or dump without affecting our ability to access the contributions themselves through past dumps, it seemed like a waste of time to worry about those licenses.

²Technically, the product in such a case isn't really the data, which would have to be offered under the same or compatible CC-BY-SA license, and similarly bind the recipient. But one might very well be compensated for the effort of (legally) helping bypass an "Effective Technological Measure", which amounts to the same thing from the SE side.

³After the recent 30% reduction, their staff is at 70% of its previous size. So rehiring the same staff represents an increase of ~40% (0.3/0.7 = 3/7, to be exact) from the current level.

Re "The product they should be selling is not the bulk database, but a specifically curated/tuned subset of the larger Q&A database": Yes, and they also have access to the web server logs and thus know how people ended up on a particular page (referer [sic] field with the search engine query string). And by timing of other page requests, they can probably infer which particular search engine hits were useful to the users and which ones weren't (in a statistical sense). Nobody else has access to those web server logs. — This_is_NOT_a_forum, Commented Jun 13, 2023 at 17:12

tripleee · Accepted Answer · 2023-06-13 04:33:07Z

Just to state the obvious, I find it extremely unlikely that restricting access to data dumps will in any way prevent SE data from being used to train AI models.

Thus, disabling the publishing of data dumps is simply yet another misdirected attempt to play the AI game by the company, with only, exclusively, negative consequences for the community they no longer seem to want to support or care about.

There is a narrow use case for AI practitioners who specifically want to train an AI model on Stack Exchange or Stack Overflow data alone; but those use cases are likely completely dominated by volunteers who try to build useful functionality for the community itself.

Any large AI model will want to receive data in a simple unified format, that is, they will almost certainly use (the output of) a generic HTML scraping bot anyway.

So in a few words: only pain, no gain? — NoDataDumpNoContribution, Commented Jun 13, 2023 at 5:49
