Personally, I don't think they can. At least not legally. To be clear, I imagine they might be able to avoid sharing the data dumps at all - at least in the future, once they've fulfilled their existing promises as to recent data. That is, they could make this dump, announce the change per existing policy, make the next dump, then safely stop.
But even that's a wildly counter-productive idea because, as others have mentioned, the critical data points are legally obtainable by a (far more bandwidth-intensive) crawling operation. So that's what everyone will start doing if the dumps go the way of the dinosaurs. If SE is looking for a way to drive up their hosting costs and further alienate the community all at once, this is a good move. Otherwise not so much.
Here's why I think they can't enact their "guardrail" plan:
From the CC-BY-SA 4.0¹ license
Section 1 is a list of definitions of other terms. It's not a bad idea to skim it before proceeding, but there's only one I'm going to mention here. Paragraph 1(e) defines "Effective Technological Measures" incredibly broadly. Pretty much anything that the US DMCA or "related laws" might otherwise prohibit us from bypassing is included. That arguably includes most or all flavors of DRM and access control.
This leads to some serious problems later on, specifically with paragraph 2(a)(5)(C).
No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
Any form of technical "guardrail" which effectively gates access to the dumps or seriously attempts to hobble their utility to 3rd parties would arguably run afoul of this. Especially if SE makes data access conditional on agreement to new usage restrictions. By the above, they cannot require anyone who receives even partial data dumps to agree not to republish, possibly in concert with other volunteers to reconstruct the whole.
Further, because the "ShareAlike" portion of the license applies (more on that later), they also can't prevent anyone from transforming the data into more useful formats, including collectively scraping the site and compiling it all into a monolithic data dump of our own. Most importantly, SE is not permitted to create any technical obstacles to such unrestricted reformatting and sharing.
For content viewed through normal browsing of the site, paragraph 2(a)(5)(A) automatically passes along the underlying license to do so:
Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
We could write a simple browser plugin for users to install, which silently checks if posts on a given SE page are already in our user DB, and have the user's browser upload a copy of the page data if not. We could even do server-side validation against the SE site. Just doing validation of new/updated submissions is even more clearly fair use than direct scraping, and can't be blocked legally.
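To make the idea a bit more concrete, here is a minimal sketch of the "is it already in our DB?" decision such a plugin would rely on. Everything in it is hypothetical: the `PostRecord` type, the `needs_upload` and `submit` helpers, the `revision` counter, and the in-memory dict standing in for the shared DB. A real plugin would run in the browser and talk to a server; this only models the dedup step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostRecord:
    """Hypothetical record for one post as seen on an SE page."""
    post_id: int
    revision: int  # assumption: increases each time the post is edited
    body: str


def needs_upload(db: dict[int, PostRecord], seen: PostRecord) -> bool:
    """Return True if the shared DB lacks this post or holds an older revision."""
    stored = db.get(seen.post_id)
    return stored is None or stored.revision < seen.revision


def submit(db: dict[int, PostRecord], page_posts: list[PostRecord]) -> list[int]:
    """Store only new or updated posts; return the IDs that were uploaded."""
    uploaded = []
    for post in page_posts:
        if needs_upload(db, post):
            db[post.post_id] = post
            uploaded.append(post.post_id)
    return uploaded
```

Only posts that are missing or newer than the stored copy get sent, so most page views would cost nothing beyond the lookup.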
It also doesn't matter whether we're talking about the licensing of the Q&A content directly or of SE's database(s). Section 4 makes clear that the database dumps are an "Adapted Material" under 3(b), the "ShareAlike" part of the license. And paragraph 3(b)(3) is equivalent to the restriction in Section 2:
You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.
So recipients of the dump cannot have any special conditions imposed upon them that are stricter than the "Adapter's License" chosen by SE Inc. Practically, this is still the CC-BY-SA license, as explained in 3(b)(1):
The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.
The license to the underlying user content is effectively passed through to any recipients of a dump as well. 2(a)(5)(B):
Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.
So we're free to extract the data from the dumps and do more or less anything with it, including selling it² to an LLM company!
Either way, if they do erect such technical barriers, notwithstanding their obligation not to do so, they aren't allowed to stop us from trying to defeat those barriers (i.e. with the plugin I suggested). Both the original creator (us in the community), and SE, by virtue of mutually providing the work under the CC-BY-SA, explicitly authorize us to circumvent such measures under 2(a)(4):
The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures.
A Possible Solution?
Overall, this policy move by SE is both unnecessary and irrational - even if we're rooting solely for their bottom line. The DB provisions of the CC-BY-SA 4.0 license are actually one of the strongest claims SE has to any form of ownership/control of the DB within the US. United States law doesn't (and possibly can't constitutionally) intrinsically provide for so-called "database rights", and the notion that SE has a stake by virtue of "choosing" the data to include in the DB is belied by the fact that volunteer moderators and community members make the vast majority of those decisions - including which communities to create - and do so semi-autonomously.
The worst part is, I believe there is actually a way for SE Inc to monetize the database, to have a clearly protected copyright in their own derived value chain, and maintain the community values all at once. The product they should be selling is not the bulk database, but a specifically curated/tuned subset of the larger Q&A database which is designed to be more suitable for training models than the full raw dataset. By meaningfully discriminating on what's included in each subset db, SE will have added value to the data directly, and established a clear ownership right in each unique collection they create.
Moreover, as long as they continue to publish the full, unabridged dumps as they have in the past, they will have arguably met their obligations under the CC license for all such custom databases. That's because in this case we can prove that for any arbitrary data collection they sell, all of the licensed posts within are still freely available to all - as part of the full dump. They just happen to be mixed in with other data. This is purely coincidental, as it always has been, and thus it is neither a "Technological Measure" nor a novel condition of any kind. Only the list of which specific Q's and A's to include/exclude is needed to build a duplicate data set. However, those curation lists and the collections produced from them can be fully copyright protected for SE, provided SE produces them independently. Similarly, they can produce and package fully prepared custom dbs with only the posts which match the appropriate training filter. Each of these custom DBs can then be easily construed as a unique (and copyrightable) collection.
Provided that the public dump is entirely a superset of the custom dbs (no new data can be added by SE which is not also in the public dumps without triggering the CC SA provision), then there is no conflict with their license. SE can legally and enforceably lock down those custom DBs to the full extent otherwise allowed by law. LLM companies would of course be free to roll their own filters. But SE has a lot of advantages here as the platform host. For one, they can update and test the commercial databases in near real-time, given sufficient staff. That means by the time a new dump is released, they'd already know exactly which new questions and answers are being added to each db.
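The superset condition is easy to state mechanically. Below is a minimal sketch, assuming (purely for illustration) that both the public dump and a custom DB can be modeled as a mapping from post ID to post body. The `curate` and `is_subset_of_dump` names are made up here and say nothing about how SE actually stores or ships its data.

```python
def curate(public_dump: dict[int, str], keep_ids: set[int]) -> dict[int, str]:
    """Build a custom DB containing only the posts named in the curation list."""
    missing = keep_ids - set(public_dump)
    if missing:
        # A curation list may only select posts that are already public.
        raise ValueError(f"posts absent from the public dump: {sorted(missing)}")
    return {pid: public_dump[pid] for pid in sorted(keep_ids)}


def is_subset_of_dump(custom_db: dict[int, str], public_dump: dict[int, str]) -> bool:
    """True if every post in the custom DB appears unchanged in the public dump."""
    return all(public_dump.get(pid) == body for pid, body in custom_db.items())
```

In this model the only thing SE adds is `keep_ids`, the curation list itself, which is exactly the part the argument above says they could protect.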
So they can release a fully viable model training product with no delay, whereas an outsider has to wait for the dump to even start work, or else scrape the site to try to keep up. If SE Inc gets to the market with a viable product quickly, and charges a fair price for their results, nobody will actually want to roll their own. Even when SE is fully cooperative, handling those dumps is a lot of work.
If only they had a larger engineering staff. Maybe about 40%³ larger? 😋
Conclusion
Permanently reenabling the dumps is probably going to be necessary anyway if the strike is to end amicably. So here's hoping they listen. 🤞
¹Content published under earlier licenses has already been pushed out with past dumps. We could examine SE's rights and obligations under those licenses as well, but since they could remove all of that content from the site and/or dump without affecting our ability to access the contributions themselves through past dumps, it seemed like a waste of time to worry about those licenses.
²Technically, the product in such a case isn't really the data, which would have to be offered under the same or compatible CC-BY-SA license, and similarly bind the recipient. But one might very well be compensated for the effort of (legally) helping bypass an "Effective Technological Measure", which amounts to the same thing from the SE side.
³After the recent 30% reduction, their staff is at 70% of its former size. So rehiring the same staff represents an increase of ~40% (3/7, or about 43%, to be exact) from the current level.
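For anyone who wants to check the arithmetic in that last footnote, here is a tiny sketch using exact fractions. The `rehire_increase` helper is just an illustrative name, and the 30% figure is the one quoted above.

```python
from fractions import Fraction


def rehire_increase(reduction: Fraction) -> Fraction:
    """Fractional growth needed to return to the original headcount after a cut."""
    remaining = 1 - reduction  # share of the original staff still employed
    return reduction / remaining


increase = rehire_increase(Fraction(3, 10))
print(increase, f"{float(increase):.1%}")  # prints: 3/7 42.9%
```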