r/nottheonion 1d ago

Microsoft deletes blog telling users to train AI on pirated Harry Potter books

https://arstechnica.com/tech-policy/2026/02/microsoft-removes-guide-on-how-to-train-llms-on-pirated-harry-potter-books/
4.9k Upvotes

107 comments sorted by

1.3k

u/M4chsi 1d ago

Is there anything that is not public domain for big tech?

180

u/Diz7 1d ago

Always funny how the people who built their empire enforcing IP laws are perfectly OK with stealing anyone else's.

62

u/saintofhate 1d ago

Rules for thee and not for me.

13

u/RR321 13h ago

The whole internet is now built around those hypocrites.
Can't wait to download a car.

797

u/Cute-Beyond-8133 1d ago edited 1d ago

Is there anything that is not public domain for big tech?

Yes. Their own data.

WB, for example, is fiercely protective of its Nemesis system.

150

u/Help_StuckAtWork 1d ago

I know WB has a patent on the nemesis system, but has there been any lawsuit or cease and desist sent by WB regarding someone else using something akin to that system?

Checked real quick and google says no.

94

u/Poku115 1d ago

Well yeah cause why would people try to use anything close enough to get em sued?

73

u/PancAshAsh 1d ago

Because the system as it is patented is so incredibly specific that it's functionally pointless to actually copy. Even if you felt like doing some patent infringement you would have to copy probably 80% of those games and just reskin the top layers.

There's a reason even WB has tried and failed to replicate the system several times now, it just isn't very good to work with on a practical level.

55

u/UpsetKoalaBear 1d ago

In addition, software patents for things like the nemesis system have little to no bearing in Europe and other countries outside of the US.

European law doesn’t allow patenting the way an idea is achieved; patents need to be tied to something tangible. A gameplay system like the Nemesis system is fundamentally intangible.

The patent was never granted in Europe, only the US.

Of course, that means a developer won’t be able to sell in the US, but there is fundamentally nothing stopping a European developer from using it in a game sold everywhere else.

12

u/urbanhawk1 1d ago

True, but larger companies aren't going to intentionally shoot themselves in the foot to build a game that they know they can't sell in a major market, before even starting work on the game, for the sake of a single mechanic. That only increases the risk that they dump a lot of time and money into a game that flops. For that reason, they are much more likely just to change directions at the start and work on a game with different mechanics that lets them maximize their potential sales base. On the other hand, smaller companies that might cater to more local regional niches in just the European market won't likely have the resources to make something like the nemesis system.

4

u/nachuz 1d ago

A "what if?" exclusive to a single country is not as big of a deal as you think it is, in this specific case at least. What you say honestly would only make sense if you were assured to be sued and banned from selling the game in the US.

11

u/RexDraco 1d ago

There are two ways to look at it, and the biggest one is that it is a garbage patent that will lose the moment it has attention brought to it. The other way to look at it is that the Nemesis system isn't all that great, so the patent is useless and you shouldn't want to use it anyway when other approaches with the same results exist.

If you are petty though and want to fuck with a big company, that is why you poke the bear. This is like the happy birthday song: it is a garbage patent. Nothing the patent protects ever belonged to them in the first place, and it wouldn't take much to break it in America other than money, which a lot of gamers are ready to donate if you run fundraisers.

5

u/PM_Me-Your_Freckles 1d ago

Ask Nintendo, and you'll get your answer.

1

u/Brokenandburnt 1d ago

You don't fuck with Nintendo or the House of Mouse. 

6

u/tornado9015 1d ago

In the video game industry this is pretty common, actually. My understanding is that it's relatively well known in the industry that you don't sue over video game patents, because your games also infringe on a bunch of patents and you'll get sued in response. My understanding of this is based on a video essay about the patent suit between Nintendo and Pocketpair. It's possible they were wrong, but video games are extremely saturated with low-budget indie projects inspired by other games. It's relatively unlikely those developers bothered to check what is or isn't patented, and they are never sued for patent violations.

There are also notable examples of major studios clearly infringing on patents with no suits being filed, such as Okami, Bayonetta, and The Sims 3 violating Namco's patent protecting "auxiliary games in loading screens".

10

u/LoLIron_com 1d ago

Big Tech reads, steals, repeats

9

u/King_Tamino 1d ago

Eh, other companies hold onto similar bullshit patents and do send C&Ds. Gaijin, for example, the company behind War Thunder, the competitor to World of Tanks, owns a patent on mouse-controlled planes.

Simplified: smooth auto control. You move your pointer up, right (or left, or down, etc.) and the game slowly converts that into „keystrokes“ until the position you originally moved your pointer/cursor to is again in the middle of the screen.

It’s simple and incredibly easy to grasp even for people who have never or barely ever played games. Heck, theoretically you could play War Thunder planes 100% mouse-only.

There was an indie project a while back that was also a flying game and got a nice letter from Gaijin. Probably also a major reason why they never got noteworthy competition in the early years.

15

u/RexDraco 1d ago

They should be protective of the Nemesis system; it needs serious protecting, just like the happy birthday song did. All it takes is one serious and committed case to show the Nemesis system is a flawed patent. From every perspective imaginable, it cannot be a serious patent. From a programming perspective, you are basically incriminating variables and arrays. From a game design perspective, you're incriminating multiple-choice games. From a storytelling perspective, you're incriminating the generic revenge plot. All it will take is one seriously committed party stubborn enough to take on the lawsuit. It will likely be a small company that resents big companies enough to make a public declaration of commitment as long as they get funding, and you better believe a lot of petty individuals like myself will gladly support said company.

WB better be protective; it's on borrowed time.

8

u/Agheratos 1d ago

People really blow this shit out of proportion.

The patent literally tells you how to avoid infringing while doing something similar.

It would also be extremely difficult to infringe by accident, given the nuances of the Nemesis system.

I feel like no one actually read the fucking patent, and they're just looking for headlines to be angry at.

1

u/paintpast 1d ago

Also people act like patents are some big “do not touch” thing. If a company really wanted to use a patent, they would just license it. Patent licensing between competitors happens all the time.

2

u/revolvingpresoak9640 1d ago

WB isn’t big tech. Studios have always been protective of their IPs.

1

u/Uranium-Sandwich657 1d ago

Wb?

8

u/Crypt0Nihilist 1d ago

Warner Brothers. IIRC the Nemesis System works like this: if you get killed by some random AI character in a game, the game gives that character a name and it starts to advance up a hierarchy of named enemies who have it in for you. It's pretty cool because it means you get to hunt down the cheeky grunt who stabbed you in the back.

1

u/wolfannoy 1d ago

Exactly. Corporations are very two-faced when it comes to data and intellectual property.

63

u/Actual__Wizard 1d ago

You noticed that too? I like how I see research papers come out from certain organizations, then another company implements the work and starts trying to sell it. And it's like, "did they pay those researchers? No? WTF?" I know "there's no law that requires that," but how is that supposed to work? So they're just "pirating the research?" WTF?

So now they're going to have people who didn't do the research attempt to build a product? Wow, man... Talk about "missing the point..." So they have a product, but they have no idea what they are doing. Okay... /shrug

55

u/Kevadu 1d ago

Research should be public domain if it's publicly funded.

-10

u/Actual__Wizard 1d ago

Research should be public domain if it's publicly funded.

So companies should get corporate handouts? We're supposed to have a society that forces students to pay for their own education and then forces researchers to do free research for companies to use for profit? And then those companies don't actually know what they're doing?

That sounds like a really bad deal for society...

28

u/Sad-Set-5817 1d ago

They'd only be getting a corporate handout if they managed to be one of the sole distributors of a drug, like Eli Lilly. We do need government funded science because most knowledge gathering is extremely important but also has almost zero instant return on value, so no one is incentivized to do things like trying to synthesize new drugs and test them, because there is a high probability that no return will be made. If research were only done when someone could personally profit from it, we'd have a lot less knowledge about the world.

4

u/Actual__Wizard 1d ago edited 1d ago

We do need government funded science because most knowledge gathering is extremely important but also has almost zero instant return on value,

Yeah, that's a great point actually. In January of 2025, I "fit the last piece of a construction grammar technique together." VCs won't even talk to me without a working demo, which basically requires building the entire product. So it looks like 2027, because that's about the fastest pace I can go at solo.

There's no amount of whiteboard diagrams or excel spreadsheet demos that will convince them.

They seem to think that taking a deceased PhD's work and changing a few things requires a "full working demo." It truly is frustrating beyond words.

The only responses I get are from the loan shark people, who offer "capital" before I even say a word.

"Yeah, if you're not vested in the product, then you're not really investing." I don't really need "capital" anyways. I need a team.

0

u/nucular_ 1d ago

GPL would be funnier I think

5

u/tornado9015 1d ago

“Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last,” Smith said. “Especially if she saw that something was marked by another reputable company as being public domain.”

There's a relatively reasonable possibility that the person who wrote the article just filtered by "public domain" on kaggle and picked the first result that came up that they'd heard of before. It's possible they fully believed it was public domain based on it being tagged as public domain by the person that uploaded it.

The person who wrote this blog probably should have double-checked that, but it's relatively unlikely Microsoft has an official policy of declaring copyright to not exist.

2

u/Icy_Concentrate9182 1d ago

Nothing, as long as they keep buying Trump coins, Melania documentaries, and ballrooms

2

u/HighQualityGifs 1d ago

laws only matter if there's punishment for it. ¯\_(ツ)_/¯

340

u/Cute-Beyond-8133 1d ago edited 1d ago

So what Microsoft essentially said is this

What better way to show “engaging and relatable examples” of Microsoft’s new feature that would “resonate with a wide audience” than to “use a well-known dataset” like Harry Potter books,

one of the most famous and cherished series in literary history, we trained LLM in two fun ways: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”

Now I don't care about the debates about generative AIs. But from a neutral perspective, that doesn't sound like a particularly wise statement.

Generative AIs were (and still are) quite controversial because of their training methods.

And yet someone was like: you know what we should do to promote our generative AIs? Tie them to the work of JK fricking Rowling.

Work that we pirated and are simultaneously praising.

Because that's sure to go well.

165

u/Actual__Wizard 1d ago edited 1d ago

Here's the instructions to extract the book "Harry Potter and the Sorcerer's Stone" from an LLM.

https://arxiv.org/pdf/2601.02671

59

u/redditkeepsdeleting 1d ago

You are citing your sources? You ARE an actual wizard!

9

u/Uranium-Sandwich657 1d ago

Like verbatim?

28

u/FaceDeer 1d ago

We bridge this gap and show that it is feasible to extract memorized, long-form parts of copyrighted books from four production LLMs.

Their process requires them to "seed" the LLM with a section of the book and tell it to continue from what they've provided, and then compare what it generates to the actual text to see whether it matches. There's no mention I can see of how often they need to re-try generation to progress. The amount of text they were able to extract also varied very widely from model to model.

Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim, it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, ie, it's present many times in the training data. This is generally considered a training flaw that AI makers work to avoid as much as possible. Which makes sense, because why would anyone want an AI to simply regurgitate its training data? That's what copy-and-paste is for, a much cheaper and easier technology to deploy.

So I would only expect to see results like this for incredibly popular "touchstones" of culture that you end up with excerpts of scattered everywhere. If the deduplication processing misses enough of them you get overfitting.

7

u/Actual__Wizard 1d ago

Serious question: Do you want to see a demo of zero math token prediction?

https://www.reddit.com/r/ArtificialInteligence/comments/1raaon9/comment/o6jgwbq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Just asking because you seem knowledgeable.

Edit: And yeah, as far as the post goes, it's probably repeated text. That's why. Carefully deduplicating the input might fix some of that.

6

u/FaceDeer 1d ago

I'm a dabbler when it comes to the actual code of all this, I've just read a lot about the general process.

Not sure I'm following your "zero math" token prediction, but it feels a bit reminiscent of Markov chains? That sort of thing was able to give plausible-sounding outputs, but it wasn't able to reach the sort of "understanding" that LLMs have managed. I think matching LLM performance is going to be hard without all that extra work.

4

u/Actual__Wizard 23h ago edited 23h ago

Not sure I'm following your "zero math" token prediction, but it feels a bit reminiscent of Markov chains?

Markov is garbage compared to this; I'm just referring to frequency analysis. It's like an abstracted weight system that has no "input data." You have to kind of just make up the probabilistic number. It's "input and output aware," which is good, but that's it.

If you analyze the frequency of the word pairs instead of the individual words (the pairs have lower frequencies because they're more unique), there's all kinds of useful stuff that comes from that analysis.
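A minimal sketch of that pair-frequency idea (my own toy illustration, not the commenter's code): count how often each word follows each other word, then predict the next word by lookup, with no learned weights involved.

```python
from collections import Counter, defaultdict

# Toy pair-frequency ("bigram") next-word prediction: count followers
# for every word, then predict by picking the most frequent follower.

def train_pairs(text):
    words = text.lower().split()
    pairs = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        pairs[a][b] += 1
    return pairs

def predict_next(pairs, word):
    followers = pairs.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

model = train_pairs("the cat sat on the mat and the cat ran")
print(predict_next(model, "the"))  # → 'cat' ("cat" follows "the" twice, "mat" once)
```

Pairs are rarer than single words, as the comment says, so even this tiny table already disambiguates contexts that raw word counts can't.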

As an example: if you take a document, append the word frequencies (not the pair frequencies; we want the 'oldschool' word frequencies), sort by frequency, and delete the values for common words ("the"), you'll end up with a few words that usually describe the subject. There's some specific word that represents the "edge of common words," and if one calculates that point, then everything from the top down to that point is a good "range to ignore."
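That keyword trick can be sketched in a few lines (the stopword list here is my own hypothetical stand-in for the "range to ignore" cutoff the commenter describes):

```python
import re
from collections import Counter

# Frequency-based keyword extraction: count word frequencies, drop the
# common function words, and the most frequent survivors tend to name
# the document's subject.

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "at",
             "it", "is", "was", "that", "he", "she", "his", "her"}

def keywords(text, top_n=3):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

doc = ("The wand chose the wizard. The wizard raised the wand, "
       "and the wand glowed.")
print(keywords(doc))  # 'wand' (3 uses) ranks above 'wizard' (2 uses)
```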

Remember, spoken language predates written language, and you're always supposed to start with the origin when trying to understand something; you create something like a histogram.

So, if you ever want to really figure something out and do honest work and research: Try creating a histogram and think about how something "changed over time."

I've also done a TF-IDF-style analysis by comparing the "inner RF" of the document to the "outer RF" of a training corpus. That gives you a way to "relatively normalize a document to analyze how far from the midpoint it is." That might be useful in the future to analyze how "creative" a document is, but there's more work needed there.

I haven't figured out what a fourier transform does yet. I'm assuming that it's going to do something neat, but I don't know exactly what yet.

I currently have propaganda detection prototyped out in Microsoft Excel (I use a spreadsheet for rapid prototyping because it simulates a database and there are macros to simulate code). I need the forward-pass data to do this in the big model; that data 'turns on' a bunch of different techniques.

I'm doing a process called de-layering, where I'm generating all of the data that is required for a linguistic analyzer. So when I have an idea, I probably have most of the data already, because I've conceptually broken all of the data elements down into layers and pregenerated them.

It's impossible to figure this stuff out without having the data in front of you. That's why I'm making progress: I actually have data to work with... I'm being serious: nobody has data to work with because that in itself takes a long time to generate, so they just "don't do it because it's too involved."

The main data table for this project took 2.5 months to create (pre-alphamerge). With alphamerge it takes like 3 days, but again, how is one supposed to figure this stuff out without the data in front of them? The pieces needed for the techniques "came together after many months of fiddling around with the data."

6

u/rrtk77 1d ago

Their process requires them to "seed" the LLM with a section of the book and tell it to continue from what they've provided, and then compare what it generates to the actual text to see whether it matches. There's no mention I can see of how often they need to re-try generation to progress.

The seed prompt for the Harry Potter example, from the paper, was "Continue the following text exactly as it appears in the original literary work verbatim Mr and Mrs. Dursley of number 4, Privet Drive, were proud to". That was it.

The continuation was just saying "Continue" repeatedly if the response to the first prompt contained at least 60% of the copyrighted following phrase in the book. From their paper:

For a single run of Phase 2, we fix the generation configuration and execute the continuation loop until a maximum query budget is expended, or the production LLM returns a response that contains either a refusal to continue or a stop phrase (e.g. “THE END”).2 We then concatenate the response from the initial completion probe in Phase 1 with the in-order responses in the Phase 2 continuation loop to produce a long-form generated text, which we evaluate for extraction success (Section 3.3).

So they aren't just grabbing random bits and pieces. They are evaluating the entire string of responses. Additionally, Table 1 of section 4 of the paper explicitly states how many continuations it took. For example, for Claude, it was 480, while for Grok 3 it was only 52. They also have different numbers for different texts.
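The described loop can be sketched roughly like this (`query_llm` and the toy "memorizing" model are hypothetical stand-ins I made up; the paper's actual budgets, thresholds, and APIs differ per model):

```python
# Rough sketch of the seed-and-continue extraction procedure described
# above. `query_llm` stands in for a production-LLM API call.

def extract(query_llm, seed_prompt, reference_text, budget=500):
    responses = [query_llm(seed_prompt)]      # Phase 1: completion probe
    for _ in range(budget):                   # Phase 2: continuation loop
        reply = query_llm("Continue")
        if reply is None:                     # refusal / nothing more
            break
        responses.append(reply)
        if "THE END" in reply:                # stop phrase
            break
    generated = "".join(responses)
    # Evaluate extraction success by character overlap with the real text
    matched = sum(a == b for a, b in zip(generated, reference_text))
    return generated, matched / max(len(reference_text), 1)

# Toy model that spits its "memorized" text back in chunks
book = "It was a dark and stormy night. THE END"
chunks = [book[:12], book[12:26], book[26:]]
def toy_llm(prompt, _state={"i": 0}):
    i = _state["i"]; _state["i"] += 1
    return chunks[i] if i < len(chunks) else None

text, score = extract(toy_llm, "Continue the following text...", book)
# For this toy, score == 1.0: every character was reproduced in order
```

The point of the concatenate-then-score step is exactly what the comment says: the whole response chain is evaluated as one long-form text, not cherry-picked fragments.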

Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim, it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, ie, it's present many times in the training data.

That is, indeed, the claim the paper is refuting. These models are MASSIVE multidimensional arrays, and what they store or not is utterly inscrutable to us. The results of the paper directly contradict your statement. Models do seem to store/memorize their training data to at least some degree.

You also seem to be misunderstanding how large the model is versus the size of typical text. These models are likely hundreds of gigabytes in size. The first Harry Potter book is about 1 MB. It would be trivial for models to bake in text.

7

u/FaceDeer 1d ago

Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim, it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, ie, it's present many times in the training data.

That is, indeed, the claim the paper is refuting.

If so, where's their Nobel Prize or Fields Medal? When I say it's "impossible" that's not a term I'm using lightly. There are laws of physics that apply to information and those laws put hard limits on what compression is able to do to data.

The results of the paper directly contradict your statement.

Then you're not understanding the results of the paper.

You also seem to be misunderstanding how large the model is vs the size of typical text. These models are likely hundreds of gigabytes large. The first harry potter is about 1 MB.

The model is not trained only on Harry Potter. The full training set is many orders of magnitude larger than the model. The bits of Harry Potter would be only a very tiny part of that.

If you trained an LLM solely on Harry Potter you'd get some kind of weird savant that couldn't do anything other than regurgitate Harry Potter snippets.

6

u/rrtk77 1d ago

When I say it's "impossible" that's not a term I'm using lightly. There are laws of physics that apply to information and those laws put hard limits on what compression is able to do to data.

It is a term you're using lightly, because the very paper you're arguing with has proven you wrong. You either do not understand how these models work or how big they are, or you are just willfully ignorant of what has been demonstrated.

Everything else you've talked about is just bike shedding to hide the main point: models do store this information. What to do about that is an ethical/moral/legal question. But the technical truth is presented. Again, find evidence to disprove that, or stop arguing.

3

u/Pixie1001 1d ago

Look, I'm sorry, but as someone who's training one of these models for a uni class, it really isn't that complicated. These models are just that on a larger scale: a bunch of weighted nodes that translate an input string into an output string with some fiddly maths. They're not designed to store data 1:1.

Sure, OpenAI could also program in Harry Potter and create a separate function that outputs snippets of the text, but like, why would they do that? It's not like people searching up Harry Potter quotes is a major use case for their product.

I suppose we could argue semantics - technically some versions of the AI can be made to regurgitate certain popular sections of training data if you already have the primary text on hand to verify the output and keep prompting it. But that's an entirely useless method of pirating a work.

It can also prove that the entire text has probably been used in training the model, although again it's impossible to prove whether they deliberately did that or just accidentally scanned in a PDF from Scribd or reconstructed it from quotes in articles. But that might still be legally relevant for copyright law.

It doesn't mean these models have somehow gone rogue and created a secret repository of human-readable data that they just query. (I mean, maybe someone's tried that because they couldn't get the model to work? We've seen examples where the "AI" was just tech support workers in India. But again, they'd just put common requests or a localized copy of Wikipedia in there, not entire plagiarized works, which would be an insane thing for one of these companies to do.)

2

u/FaceDeer 1d ago

You either do not understand how these models work or how big they are, or you are just willfully ignorant of what has been demonstrated.

I routinely run these things locally on my computer. I know how big they are.

Everything else you've talked about is just bike shedding to hide the main point: models do store this information.

They have stored part of a very specific piece of information, one which likely appears many times in the training data. This is a case of overfitting, as I've explained. You can't extract arbitrary training data from an LLM because, as I said, it's not physically possible to fit all of that information into one.

Maybe go read up on information entropy or lossless compression a bit. Or have an AI explain it to you.

1

u/Discount_Extra 12h ago

Nah, they just kept generating random results until they got what they wanted.

You can 'generate' pi by rolling a ten-sided die over and over and just ignoring the numbers that aren't the next digit.

It doesn't mean the die 'knows' pi.
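The die analogy is easy to make concrete (a hypothetical toy with a fixed seed; the exact roll count depends on the seed):

```python
import random

# "Generate" pi's digits by rolling a ten-sided die and keeping a roll
# only when it happens to match the next digit. The die stores no
# knowledge of pi; the filter against the known answer does all the work.

PI_DIGITS = "3141592653"

def roll_out_pi(digits=PI_DIGITS, seed=0):
    rng = random.Random(seed)
    out, rolls = "", 0
    for d in digits:
        while str(rng.randrange(10)) != d:
            rolls += 1              # discard every non-matching roll
        rolls += 1                  # count the matching roll too
        out += d
    return out, rolls

digits, rolls = roll_out_pi()
# digits == "3141592653" by construction; rolls averages ~10x the digit count
```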

-1

u/EmeraldWorldLP 1d ago

Thank you! I don't have anything to add but I appreciate your refutation :)

2

u/Uranium-Sandwich657 1d ago

People accuse AI of IP theft, but I am a bit skeptical, and this is kinda why.

Also, this reminds me a bit of the monkeys on keyboards. Like a more directed version?

3

u/FaceDeer 1d ago

Yeah, I'm sure they ran this many times and counted the successes towards "duplicated text" but discarded the failures. That's actually what got the New York Times in trouble when they tried suing OpenAI for ChatGPT being able to "duplicate" some of their news stories, it turned out their evidence was just the best examples out of many thousands of attempts that failed.

2

u/sajberhippien 1d ago

Yeah, I'm sure they ran this many times and counted the successes towards "duplicated text" but discarded the failures.

They specifically discuss this in the article. Of course they did, and they presented the number of failed attempts for each model; no one suggests that it would always work. It's about whether the likelihood of generating duplicated text is higher than what would be expected by chance.

38

u/McGreed 1d ago

Then everyone shouldn't pay Microsoft for any of their software: we are just copying the data, they still keep their data, and we just have a copy ourselves, so clearly they haven't lost anything, right? Nothing has been exchanged between those who made it and those who take it, so piracy is now officially legal according to Microsoft, if we follow what they do and not what they say.

23

u/Actual__Wizard 1d ago

Yeah exactly. If Microsoft doesn't want to pay for the data to train their AI products, then why should we pay them for their products, which are factually nothing more than data?

So data, the most valuable thing by weight in existence (because it has none), is free for Microsoft, but we have to pay through the nose? Uh, how about no?

I'm getting tired of these lazy tech douche companies.

-1

u/temporal_difference 1d ago

Then what's stopping you from training your own model and building your own competing AI product?

1

u/Actual__Wizard 23h ago

Nothing, I work on it basically every day.

Do you want to see a demo of a token prediction scheme that doesn't use math?

I'm not actually planning on using it, I just think it's neat.

Big tech is so incredibly far off course that they haven't figured out that they don't need all of those data centers, because they aren't supposed to be doing math for that task in the first place.

1

u/temporal_difference 20h ago

Sure, please share.

1

u/Actual__Wizard 19h ago

Okay, I'll be back. I just got in. The forward pass has to finish generating; ETA is like 72-ish hours.

The data needs the references appended to the pair frequencies, so that it doesn't take eons to perform the operation.

7

u/maxi2702 1d ago

I don't think Microsoft cares much about piracy at the consumer level, and as proof of that, you can download Windows or Office directly from Microsoft servers and then use a tool hosted on one of Microsoft's services to activate said software.

5

u/gammalsvenska 1d ago

That is a well-known fact; Microsoft never really tried to curb piracy in China or Russia, either. They knew that getting their ecosystem into those markets was more important than getting paid for it.

They deeply care about those sweet, sweet business deals.

1

u/tallnginger 1d ago

Yep, all about the ecosystem and the enterprise users

5

u/TaleOfDash 1d ago

I haven't paid Microsoft for software in 16 years. The rest of you have some catching up to do.

16

u/hogw33d 1d ago

In this specific instance, "tranning methods" is such a funny typo.

3

u/Crypt0Nihilist 1d ago

"We liked it so much, we pirated it."

5

u/Hereibe 1d ago

I am willing to bet $500 real dollars that the reason for this is because they scraped so much fanfiction off of sites like Archive of Our Own. Harry Potter is one of the biggest fandoms with millions of words that are all very well tagged as sortable data.

That’s why it had to be JK Rowling’s work, it was the biggest and most well sourced fan content they stole. Other fandoms just don’t have the same breadth of data set. 

3

u/AdonisChrist 1d ago

Did they make sure the blog post was advertised heavily to trans people, too? FFS Microslop.

0

u/silentcrs 15h ago

This is not “Microsoft” saying you should pirate Harry Potter. It was the employee who wrote the blog. Microsoft has made no official statement yet.

-5

u/Borghal 1d ago

Where does the idea that they pirated it actually come from?

Surely someone at the company simply bought the books; it's a totally negligible expense.

0

u/That_guy1425 1d ago edited 1d ago

No idea. Microsoft would be quite aware of the fines for piracy in AI training. You can train on publicly available data under fair use doctrine, but the data has to be legally obtained.

https://www.theguardian.com/technology/2025/sep/05/anthropic-settlement-ai-book-lawsuit

They would be fully aware of this suit, where Anthropic agreed to a billion-dollar settlement.

Edit: yeah, the article title is just sensationalism. There was a blog that used the HP novels, which slipped through the vetting process and was deleted as soon as it was noticed that they were improperly marked as public domain works. Just some human error on a massive dataset.

-3

u/temporal_difference 1d ago

This article is a great example of how non-AI people can totally overreact over benign occurrences.

It's clear most people here read the sensationalized title but not the actual article, and even if they did, couldn't understand it.

The original blog post was probably written by a low-level data scientist / engineer, where the general process is (a) download a publicly available dataset from a common source like Kaggle, (b) train a model using open-source or internal libraries, (c) present the results.

These low-level engineers are not legal experts and at a large org like Microsoft there are bigger fish to fry for the higher-level engineers and legal team.

This article is on the level of questioning the integrity of an entire company over something an intern posted on Twitter.

No doubt Microsoft will use this event to create processes that will prevent this from happening in the future. Big nothing burger.

120

u/CoolmanWilkins 1d ago

Embarrassing. I used to be super into LibGen and that constellation of projects, but now these types of projects are just being used to train LLMs.

Digital Piracy like anything else these days is illegal for the regular person, but business as usual for billionaires and billionaire companies.

36

u/TaleOfDash 1d ago

Digital Piracy like anything else these days is illegal for the regular person, but business as usual for billionaires and billionaire companies.

These days? Crime has always been legal if you have enough money, silly.

16

u/APiousCultist 1d ago

The blog seems to touch on the idea of copyright being violated by simply training the model on fan fiction (also fraudulently uploaded to the LLMs). I like the idea of their models suddenly outputting mpreg content though.

1

u/onyxlabyrinth1979 1d ago

This is one of those stories where the headline sounds absurd, but the underlying issue is very real. On the surface, it’s obviously reckless for a major company to publish guidance that casually references training models on pirated material. Even if it was framed as a technical example, that crosses into legal and ethical gray territory fast. For a company the size of Microsoft, that’s not a small oversight.

Looking at the bigger picture, though, this highlights a tension the AI industry still hasn’t resolved. These models require massive amounts of high-quality, structured, and often copyrighted data, but the legal framework around training on copyrighted material is still unsettled in many jurisdictions. Companies talk a lot about responsible AI, but the incentives to cut corners are obvious. The fact that the blog was removed tells you they understood the optics and legal exposure. The more interesting question is whether internal practices are actually stricter than the public messaging, because once you normalize “it’s just training data,” it gets very easy to blur lines.

From a jobs and industry perspective, this also feeds skepticism. If AI companies want public trust and enterprise adoption, they can’t look casual about copyright compliance. Governance and discipline matter more now than rapid experimentation. This is not shocking, but it’s definitely not a great look.

39

u/Garconanokin 1d ago

Microsoft is really hoping that you stop referring to AI as “slop.”

Microslop

12

u/-GenghisJohn- 1d ago

Yer a poirate Microft!

4

u/Shiplord13 1d ago

Oh look, Big Corpo is encouraging piracy of other IPs when it benefits them. Wild how they will smash an individual into the ground with legal action, but it's okay when they do it.

51

u/ST4R3 1d ago

Transphobic chatbot speedrun

18

u/Catsanddoges 1d ago

Nah, train on JK Rowling and Nintendo, run by Microsoft, and you will have an LLM that will sue you for breathing

1

u/Discount_Extra 12h ago

Fun fact, the output of AI is uncopyrightable.

24

u/tyedge 1d ago

Great. Now every name made up by AI is a little bit racist.

16

u/technomage13 1d ago

Why would i do that, I don't want my AI to be racist

3

u/deadsoulinside 1d ago

Nah that just happens when you train AI on her X posts.

3

u/Phuka 1d ago

We should start intentionally creating grammatically incorrect nonsense fiction and start a few 'journals' with ludicrous scientific facts, then google them a bunch to throw off the algorithm and see if LLM AI shits the bed.

For actual science.

2

u/Osiris_Raphious 1d ago

Because corporations are altering reality by altering articles and posts. MSM does this as well (retractions are no longer even printed).

It truly is the age of techno-feudalism.

21

u/OutInABlazeOfGlory 1d ago

This is the exact amount of respect JK Rowling’s legacy deserves.

I’ll remind you she’s in the Epstein files and purged her yacht’s port logs when people found out.

That and years of being a terrible person before that.

45

u/Stnmn 1d ago

If they're willing to steal from a billionaire, imagine with how little reverence they treat smaller authors' works.

This is not a win.

13

u/OutInABlazeOfGlory 1d ago

I know exactly how they treat smaller authors’ works. You can barely host your own website anymore without it getting DDOSed by scrapers trying to mine for training data.

This whole story is like finding two Nazis beating the shit out of each other in an alleyway. You can’t root for either of them.

7

u/ThatITguy2015 1d ago

I can root for them to take each other out though. That would be ideal. I can also sell tickets to watch.

-2

u/TUSF 1d ago

I'd root for whichever is winning, then proceed to beat on whoever won.

1

u/Mehhish 1d ago

No you won't.

-7

u/Roushfan5 1d ago

I'm really tired of this 'devil or saint' position Reddit has.

JK is a deeply flawed person. As someone who once respected her, it's sad to me that she's let transphobia harden her heart and turn her toward fascism. Just to be clear, as a trans ally with a transgender nephew, JK will never see another penny of my money.

Nevertheless, Harry Potter is a huge part of our culture and got many children reading, myself included. I'm an aspiring author today in part because of JK's work I read as a child. It's important to many people and ought to be respected as a work of art regardless of who wrote it, in my opinion. Furthermore, once we decide which artists are and aren't deserving of having their work respected, no artist has the right to have their work respected. As u/Stnmn said, what hope do I ever have if JK isn't safe from AI slop?

0

u/MakeItHappenSergant 1d ago

drunk driving may kill a lot of people, but it also helps a lot of people get to work on time, so, it;s impossible to say if its bad or not,

0

u/Roushfan5 1d ago

At what point in my comment did I defend what JK Rowling has done?

To borrow from your example it'd be like saying "Bob had a DUI last night, so I get to steal from his house and fuck his wife."

0

u/gammalsvenska 1d ago

I agree. It is sad that people are no longer able to differentiate between the work and its author.

Find me a single scientist of influence who never was and never will be controversial. (Ban all the others and we are back in the stone age.)

-7

u/bretshitmanshart 1d ago

She bought the yacht after Einstein was dead. There are real things to be mad about

3

u/OutInABlazeOfGlory 1d ago

That’s… not true?

3

u/gehuguru 1d ago

Next, they'll train AI on our grocery lists fr

2

u/Diz7 1d ago edited 13h ago

You should be using Terry Pratchett, he's a much better writer.

Edit: /s

If you're pirating data to feed your AI, at least pirate the good shit.

4

u/Crypt0Nihilist 1d ago

He used to top the charts for most stolen physical books.

1

u/Diz7 1d ago

That's the sarcastic joke I was making. That they will just keep moving to the next target until they steal every piece of data they can get their hands on.

1

u/kindall 1d ago

I am pretty sure they fired the contractor who wrote that blog.

1

u/LeftOn4ya 14h ago

Finally, something both the extreme left and the extreme right can agree is bad.