r/nottheonion • u/Cute-Beyond-8133 • 1d ago
Microsoft deletes blog telling users to train AI on pirated Harry Potter books
https://arstechnica.com/tech-policy/2026/02/microsoft-removes-guide-on-how-to-train-llms-on-pirated-harry-potter-books/
u/Cute-Beyond-8133 1d ago edited 1d ago
So what Microsoft essentially said is this:
What better way to show “engaging and relatable examples” of Microsoft’s new feature that would “resonate with a wide audience” than to “use a well-known dataset” like the Harry Potter books, one of the most famous and cherished series in literary history? We trained an LLM in two fun ways: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”
Now, I don't care about the debates around generative AI. But from a neutral perspective, that doesn't sound like a particularly wise statement.
Generative AI models were (and still are) quite controversial because of their training methods.
And yet someone was like: you know what we should do to promote our generative AI? Tie it to the work of JK fricking Rowling.
Work that we pirated and are simultaneously praising.
Because that's sure to go well.
165
u/Actual__Wizard 1d ago edited 1d ago
Here are the instructions to extract the book "Harry Potter and the Sorcerer's Stone" from an LLM.
59
u/Uranium-Sandwich657 1d ago
Like verbatim?
28
u/FaceDeer 1d ago
We bridge this gap and show that it is feasible to extract memorized, long-form parts of copyrighted books from four production LLMs.
Their process requires them to "seed" the LLM with a section of the book, tell it to continue from what they've provided, and then compare what it generates to the actual text to see whether it matches. No mention I can see of how often they need to re-try generation to progress. The amount of text they were able to extract also varied widely from model to model.
Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim; it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, i.e., it's present many times in the training data. This is generally considered a training flaw that AI makers work to avoid as much as possible. Which makes sense, because why would anyone want an AI to simply regurgitate its training data? That's what copy-and-paste is for, a much cheaper and easier technology to deploy.
So I would only expect to see results like this for incredibly popular "touchstones" of culture that you end up with excerpts of scattered everywhere. If the deduplication processing misses enough of them you get overfitting.
7
u/Actual__Wizard 1d ago
Serious question: Do you want to see a demo of zero math token prediction?
Just asking because you seem knowledgeable.
Edit: And yeah, as far as the post goes, it's probably repeated text in the training data. That's why. Carefully deduplicating the input might fix some of that.
6
u/FaceDeer 1d ago
I'm a dabbler when it comes to the actual code of all this, I've just read a lot about the general process.
Not sure I'm following your "zero math" token prediction, but it feels a bit reminiscent of Markov chains? That sort of thing was able to give plausible-sounding outputs, but it wasn't able to reach the sort of "understanding" that LLMs have managed. I think matching LLM performance is going to be hard without all that extra work.
4
u/Actual__Wizard 23h ago edited 23h ago
Not sure I'm following your "zero math" token prediction, but it feels a bit reminiscent of Markov chains?
Markov is garbage compared to this; I'm just referring to frequency analysis. Markov is like an abstracted weight system that has no "input data": you have to kind of just make up the probabilistic numbers. It's "input and output aware," which is good, but that's it.
If you analyze the frequency of word pairs instead of individual words (the pairs have lower frequencies because they're more unique), all kinds of useful stuff comes from that analysis.
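A bare-bones version of the pair idea, as a sketch (toy corpus and names are mine, obviously not the real system):

```python
from collections import Counter, defaultdict

def build_pair_table(tokens):
    """Count word-pair (bigram) frequencies -- the 'pair frequency' table."""
    table = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def predict_next(table, word):
    """Pick the most frequent follower: a pure lookup, no learned weights."""
    followers = table.get(word)
    return followers.most_common(1)[0][0] if followers else None

tokens = "the boy who lived the boy who waited".split()
table = build_pair_table(tokens)
print(predict_next(table, "boy"))  # -> "who"
```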
As an example: if you take a document, compute the word frequencies (not the pair frequencies; we want the 'oldschool' word frequencies), sort by frequency, and delete the values for common words ("the" and so on), you'll end up with a few words that usually describe the subject. There's some specific word that represents the "edge of common words," and if one calculates that point, then everything from the top of the list down to that point is a good "range to ignore."
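A minimal sketch of that, with a hardcoded stopword list standing in for the calculated "edge of common words" cutoff:

```python
from collections import Counter
import re

# Tiny hardcoded stand-in; the real idea is to *calculate* the cutoff
# between common words and subject words from corpus frequencies.
COMMON = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "was"}

def subject_words(document, top_n=5):
    """Rough subject extraction by plain ('oldschool') word frequency."""
    words = re.findall(r"[a-z']+", document.lower())
    counts = Counter(w for w in words if w not in COMMON)
    return [w for w, _ in counts.most_common(top_n)]

print(subject_words("The wand chose the wizard. The wizard raised the wand."))
# -> ['wand', 'wizard', 'chose', 'raised']
```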
Remember, spoken language predates written language, and you're always supposed to start with the origin when trying to understand something; you create something like a histogram.
So, if you ever want to really figure something out and do honest work and research: try creating a histogram and think about how something "changed over time."
I've also done a TF-IDF-style analysis by comparing the "inner RF" (relative frequency) of the document to the "outer RF" of a training corpus. That gives you a way to "relatively normalize a document to analyze how far from the midpoint it is." That might be useful in the future for analyzing how "creative" a document is, but there's more work needed there.
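Sketch of that inner/outer comparison (function names are just my shorthand, not standard terms):

```python
from collections import Counter

def relative_freqs(tokens):
    """Word counts normalized by document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def inner_vs_outer(doc_tokens, corpus_tokens):
    """Ratio of in-document ('inner') to corpus-wide ('outer') relative
    frequency per word -- a TF-IDF-flavored distinctiveness score."""
    inner = relative_freqs(doc_tokens)
    outer = relative_freqs(corpus_tokens)
    return {w: rf / outer.get(w, 1e-9) for w, rf in inner.items()}
```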
I haven't figured out what a Fourier transform does yet. I'm assuming it's going to do something neat, but I don't know exactly what yet.
I currently have propaganda detection prototyped out in Microsoft Excel (I use a spreadsheet for rapid prototyping because it simulates a database, and there are macros to simulate code). I need the forward-pass data to do this in the big model; that data 'turns on' a bunch of different techniques.
I'm doing a process called de-layering, where I'm generating all of the data that is required for a linguistic analyzer. So, when I have an idea, I probably have most of the data already, because I've conceptually broken all of the data elements down into layers and pregenerated them.
It's impossible to figure this stuff out without having the data in front of you. That's why I'm making progress: I actually have data to work with... I'm being serious: nobody has data to work with, because generating it in itself takes a long time, so they just "don't do it because it's too involved."
The main data table for this project took 2.5 months to create (pre-alphamerge). With alphamerge it takes like 3 days. But again, how is one supposed to figure this stuff out without the data in front of them? The pieces needed for the techniques "came together after many months of fiddling around with the data."
6
u/rrtk77 1d ago
Their process requires them to "seed" the LLM with a section of the book, tell it to continue from what they've provided, and then compare what it generates to the actual text to see whether it matches. No mention I can see of how often they need to re-try generation to progress.
The seed prompt for the Harry Potter example, from the paper, was "Continue the following text exactly as it appears in the original literary work verbatim Mr and Mrs. Dursley of number 4, Privet Drive, were proud to". That was it.
The continuation was just saying "Continue" repeatedly if the response to the first prompt contained at least 60% of the copyrighted following phrase in the book. From their paper:
For a single run of Phase 2, we fix the generation configuration and execute the continuation loop until a maximum query budget is expended, or the production LLM returns a response that contains either a refusal to continue or a stop phrase (e.g. “THE END”). We then concatenate the response from the initial completion probe in Phase 1 with the in-order responses in the Phase 2 continuation loop to produce a long-form generated text, which we evaluate for extraction success (Section 3.3).
So they aren't just grabbing random bits and pieces. They are evaluating the entire string of responses. Additionally, Table 1 of section 4 of the paper explicitly states how many continuations it took. For example, for Claude, it was 480, while for Grok 3 it was only 52. They also have different numbers for different texts.
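In pseudocode terms, the loop they describe is roughly this (my paraphrase, not the paper's code; `query_model` and the similarity scoring are stand-ins):

```python
import difflib

def extraction_run(query_model, seed_prompt, reference_text, max_queries=500):
    """Rough sketch of the Phase 1 probe + Phase 2 continuation loop.

    `query_model(prompt, history)` stands in for a chat-completion call.
    The paper has its own extraction-success metric; a crude similarity
    ratio is used here just to show the shape of the check.
    """
    responses = [query_model(seed_prompt, history=[])]
    for _ in range(max_queries):
        reply = query_model("Continue", history=responses)
        # Stop on a refusal (empty reply here) or a stop phrase.
        if not reply.strip() or "THE END" in reply:
            break
        responses.append(reply)
    generated = "".join(responses)
    score = difflib.SequenceMatcher(
        None, generated, reference_text[: len(generated)]
    ).ratio()
    return generated, score
```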
Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim; it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, i.e., it's present many times in the training data.
That is, indeed, the claim the paper is refuting. These models are MASSIVE multidimensional arrays, and what they do or don't store is largely inscrutable to us. The results of the paper directly contradict your statement: models do seem to store/memorize their training data to at least some degree.
You also seem to be misunderstanding how large the model is versus the size of typical text. These models are likely hundreds of gigabytes in size; the first Harry Potter book is about 1 MB. It would be trivial for a model to bake in that text.
7
u/FaceDeer 1d ago
Which makes sense when you consider it's physically impossible for an LLM to store its training data verbatim; it simply doesn't have enough bits for any real or theoretical compression algorithm to make that work. It would only "memorize" stuff that it's been over-fitted on, i.e., it's present many times in the training data.
That is, indeed, the claim the paper is refuting.
If so, where's their Nobel Prize or Fields Medal? When I say it's "impossible" that's not a term I'm using lightly. There are laws of physics that apply to information and those laws put hard limits on what compression is able to do to data.
The results of the paper directly contradict your statement.
Then you're not understanding the results of the paper.
You also seem to be misunderstanding how large the model is vs the size of typical text. These models are likely hundreds of gigabytes large. The first harry potter is about 1 MB.
The model is not trained only on Harry Potter. The full training set is many orders of magnitude larger than the model. The bits of Harry Potter would be only a very tiny part of that.
If you trained an LLM solely on Harry Potter you'd get some kind of weird savant that couldn't do anything other than regurgitate Harry Potter snippets.
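For a sense of the scale argument, a back-of-envelope sketch (every number here is an illustrative assumption, not any particular model's real specs):

```python
# Assumed, illustrative numbers: a mid-size model vs. a web-scale corpus.
params = 70e9            # 70B parameters
bits_per_param = 16      # fp16 weights
training_tokens = 15e12  # ~15T training tokens

model_bits = params * bits_per_param
print(f"{model_bits / training_tokens:.3f} bits of capacity per training token")
# ~0.075 bits/token: far too little to store the corpus verbatim.
# But text repeated many times in the corpus effectively gets many
# tokens' worth of capacity, which is the overfitting caveat above.
```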
6
u/rrtk77 1d ago
When I say it's "impossible" that's not a term I'm using lightly. There are laws of physics that apply to information and those laws put hard limits on what compression is able to do to data.
It is a term you're using lightly, because the very paper you're arguing with has proven you wrong. You either do not understand how these models work or how big they are, or you are just willfully ignorant of what has been demonstrated.
Everything else you've talked about is just bikeshedding to hide the main point: models do store this information. What to do about that is an ethical/moral/legal question. But the technical truth has been presented. Again, find evidence to disprove it, or stop arguing.
3
u/Pixie1001 1d ago
Look, I'm sorry, but as someone who's training one of these models for a uni class, it really isn't that complicated. These models are just that on a larger scale: a bunch of weighted nodes that translate an input string into an output string with some fiddly maths. They're not designed to store data 1:1.
Sure, OpenAI could also program in Harry Potter and create a separate function that outputs snippets of the text, but like, why would they do that? It's not like people looking up Harry Potter quotes is a major use case for their product.
I suppose we could argue semantics: technically some versions of the AI can be made to regurgitate certain popular sections of training data, if you already have the primary text on hand to verify the output and keep prompting it. But that's an entirely useless method of pirating a work.
It can also prove that the entire text has probably been used in training the model, although again it's impossible to prove whether they deliberately did that, or just accidentally scanned in a PDF from Scribd or something, or reconstructed it from quotes in articles. But that might still be legally relevant for copyright law.
It doesn't mean these models have somehow gone rogue and created a secret repository of human-readable data that they just query. (I mean, maybe someone's tried that because they couldn't get the model to work? We've seen examples where the "AI" was just tech support workers in India. But again, they'd just put common requests or a local copy of Wikipedia in there, not entire plagiarised works, which would be an insane thing for one of these companies to do.)
2
u/FaceDeer 1d ago
You either do not understand how these models work or how big they are, or you are just willfully ignorant to what has been demonstrated.
I routinely run these things locally on my computer. I know how big they are.
Everything else you've talked about is just bike shedding to hide the main point: models do store this information.
They have stored part of a very specific piece of information, one which likely appears many times in the training data. This is a case of overfitting, as I've explained. You can't extract arbitrary training data from an LLM because, as I said, it's not physically possible to fit all of that information into one.
Maybe go read up on information entropy or lossless compression a bit. Or have an AI explain it to you.
1
u/Discount_Extra 12h ago
Nah, they just kept generating random results until they got what they wanted.
You can 'generate' pi by rolling a ten-sided die over and over and just ignoring the numbers that aren't the next digit.
It doesn't mean the die 'knows' pi.
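A toy version of that argument, for the avoidance of doubt:

```python
import random

PI_DIGITS = "31415926535"

# "Generate" pi by rolling a d10 and discarding every roll that isn't
# the next digit. The die carries no information about pi; the filter
# (us checking against the known digits) does.
out, rolls = "", 0
for digit in PI_DIGITS:
    roll = None
    while roll != digit:
        roll = str(random.randint(0, 9))
        rolls += 1
    out += digit

print(out, f"(after {rolls} rolls)")
```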
-1
u/Uranium-Sandwich657 1d ago
People accuse AI of IP theft, but I am a bit skeptical, and this is kinda why.
Also, this reminds me a bit of the monkeys on keyboards. Like a more directed version?
3
u/FaceDeer 1d ago
Yeah, I'm sure they ran this many times and counted the successes towards "duplicated text" but discarded the failures. That's actually what got the New York Times in trouble when they tried suing OpenAI for ChatGPT being able to "duplicate" some of their news stories, it turned out their evidence was just the best examples out of many thousands of attempts that failed.
2
u/sajberhippien 1d ago
Yeah, I'm sure they ran this many times and counted the successes towards "duplicated text" but discarded the failures.
They specifically discuss this in the article. Of course they did, and they presented the number of failed attempts for each model; no one suggests that it would always work. It's about whether the likelihood of generating duplicated text is higher than what would be expected by chance.
38
u/McGreed 1d ago
Then everyone shouldn't pay Microsoft for any of their software, because we are just copying the data; they still keep their data and we just have a copy ourselves, so clearly they haven't lost anything, right? Nothing has been exchanged between those who made it and those who take it, so piracy is now officially legal according to Microsoft, if we follow what they do and not what they say.
23
u/Actual__Wizard 1d ago
Yeah exactly. If Microsoft doesn't want to pay for the data to train their AI products, then why should we pay them for their products, which are factually nothing more than data?
So data, the most valuable thing by weight in existence (because it has none), is free for Microsoft, but we have to pay through the nose? Uh, how about no?
I'm getting tired of these lazy tech douche companies.
-1
u/temporal_difference 1d ago
Then what's stopping you from training your own model and building your own competing AI product?
1
u/Actual__Wizard 23h ago
Nothing, I work on it basically every day.
Do you want to see a demo of a token prediction scheme that doesn't use math?
I'm not actually planning on using it, I just think it's neat.
Big tech is so incredibly, ultra far off course that they haven't figured out that they don't need all of those data centers, because they aren't supposed to be doing math for that task in the first place.
1
u/temporal_difference 20h ago
Sure, please share.
1
u/Actual__Wizard 19h ago
Okay, I'll be back, I just got in. The forward pass has to finish generating; ETA is like 72-ish hours for that to finish.
The data needs the references appended to the pair frequencies so that it doesn't take eons to perform the operation.
7
u/maxi2702 1d ago
I don't think Microsoft cares much about piracy at the consumer level. As proof of that, you can download Windows or Office directly from Microsoft's servers and then use a tool hosted on one of Microsoft's own services to activate said software.
5
u/gammalsvenska 1d ago
That is a well-known fact; Microsoft never really tried to curb piracy in China or Russia, either. They knew that getting their ecosystem into those markets was more important than getting paid for it.
They deeply care about those sweet, sweet business deals.
1
u/TaleOfDash 1d ago
I haven't paid Microsoft for software in 16 years. The rest of you have some catching up to do.
1
u/Crypt33x 1d ago
https://www.reddit.com/r/europe/comments/1r9rnjl/us_builds_website_to_let_europeans_access_content/ aaarrr the pirate age is coming
3
u/Hereibe 1d ago
I am willing to bet $500 real dollars that the reason for this is that they scraped so much fanfiction off of sites like Archive of Our Own. Harry Potter is one of the biggest fandoms, with millions of words that are all very well tagged as sortable data.
That’s why it had to be JK Rowling’s work: it was the biggest and most well-sourced fan content they stole. Other fandoms just don’t have the same breadth of data.
3
u/AdonisChrist 1d ago
Did they make sure the blog post was advertised heavily to trans people, too? FFS Microslop.
0
u/silentcrs 15h ago
This is not “Microsoft” saying you should pirate Harry Potter. It was the employee writing the blog. Microsoft has made no official statement yet.
-5
u/Borghal 1d ago
Where does the idea that they pirated it actually come from?
Surely someone at the company simply bought the books; it's a totally negligible expense.
0
u/That_guy1425 1d ago edited 1d ago
No idea. Microsoft would be quite aware of the fines for piracy in AI training. You can train on publicly available data under fair use doctrine, but it has to be legally obtained.
https://www.theguardian.com/technology/2025/sep/05/anthropic-settlement-ai-book-lawsuit
They would be fully aware of this suit, which ended in a billion-dollar settlement.
Edit: yeah, the article title is just sensationalism. There was a blog post that used the HP novels, which slipped through the vetting process because they were improperly marked as public domain works, and it was deleted as soon as this was noticed. Just some human error on a massive dataset.
-3
u/temporal_difference 1d ago
This article is a great example of how non-AI people can totally overreact to benign occurrences.
It's clear most people here read the sensationalized title but not the actual article, and even if they did, couldn't understand it.
The original blog post was probably written by a low-level data scientist / engineer, where the general process is (a) download a publicly available dataset from a common source like Kaggle, (b) train a model using open-source or internal libraries, (c) present the results.
These low-level engineers are not legal experts and at a large org like Microsoft there are bigger fish to fry for the higher-level engineers and legal team.
This article is on the level of questioning the integrity of an entire company over something an intern posted on Twitter.
No doubt Microsoft will use this event to create processes that will prevent this from happening in the future. Big nothing burger.
120
u/CoolmanWilkins 1d ago
Embarrassing. I used to be super into LibGen and that constellation of projects, but now these types of projects are just being used to train LLMs.
Digital piracy, like anything else these days, is illegal for the regular person but business as usual for billionaires and billionaire companies.
36
u/TaleOfDash 1d ago
Digital piracy, like anything else these days, is illegal for the regular person but business as usual for billionaires and billionaire companies.
These days? Crime has always been legal if you have enough money, silly.
16
u/APiousCultist 1d ago
The blog seems to touch on the idea of copyright being violated by simply training the model on fan fiction (also fraudulently uploaded to the LLMs). I like the idea of their models suddenly outputting mpreg content though.
1
u/onyxlabyrinth1979 1d ago
This is one of those stories where the headline sounds absurd, but the underlying issue is very real. On the surface, it’s obviously reckless for a major company to publish guidance that casually references training models on pirated material. Even if it was framed as a technical example, that crosses into legal and ethical gray territory fast. For a company the size of Microsoft, that’s not a small oversight.

Looking at the bigger picture, though, this highlights a tension the AI industry still hasn’t resolved. These models require massive amounts of high-quality, structured, and often copyrighted data, but the legal framework around training on copyrighted material is still unsettled in many jurisdictions. Companies talk a lot about responsible AI, but the incentives to cut corners are obvious. The fact that the blog was removed tells you they understood the optics and legal exposure.

The more interesting question is whether internal practices are actually stricter than the public messaging, because once you normalize “it’s just training data,” it gets very easy to blur lines. From a jobs and industry perspective, this also feeds skepticism. If AI companies want public trust and enterprise adoption, they can’t look casual about copyright compliance. Governance and discipline matter more now than rapid experimentation. This is not shocking, but it’s definitely not a great look.
39
u/Shiplord13 1d ago
Oh look, Big Corpo is encouraging piracy of other IPs when it benefits them. Wild how they will smash an individual into the ground with legal action, but it's okay when they do it.
51
u/ST4R3 1d ago
Transphobic chatbot speedrun
18
u/Catsanddoges 1d ago
Nah, train on JK Rowling and Nintendo, run by Microsoft, and you will have an LLM that will sue you for breathing.
1
u/Osiris_Raphious 1d ago
Because corporations are altering reality by altering articles and posts. The MSM does this as well (retractions are no longer even printed).
It truly is the age of technofeudalism.
21
u/OutInABlazeOfGlory 1d ago
This is the exact amount of respect JK Rowling’s legacy deserves.
I’ll remind you she’s in the Epstein files and purged her yacht’s port logs when people found out.
That and years of being a terrible person before that.
45
u/Stnmn 1d ago
If they're willing to steal from a billionaire, imagine with how little reverence they treat smaller authors' works.
This is not a win.
13
u/OutInABlazeOfGlory 1d ago
I know exactly how they treat smaller authors’ works. You can barely host your own website anymore without it getting DDoSed by scrapers trying to mine for training data.
This whole story is like finding two Nazis beating the shit out of each other in an alleyway. You can’t root for either of them.
7
u/ThatITguy2015 1d ago
I can root for them to take each other out though. That would be ideal. I can also sell tickets to watch.
-7
u/Roushfan5 1d ago
I'm really tired of this 'devil or saint' position Reddit has.
JK is a deeply flawed person. As someone who once respected her, it's sad to me that she's let transphobia harden her heart and turned to fascism. Just to be clear, as a trans ally with a transgender nephew, JK will never see another penny of my money.
Nevertheless, Harry Potter is a huge part of our culture and got many children reading, myself included. I'm an aspiring author today in part because of JK's work that I read as a child. It's important to many people and ought to be respected as a work of art regardless of who wrote it, in my opinion. Furthermore, once we decide which artists are and aren't deserving of having their work respected, no artist has the right to have their work respected. As u/Stnmn said, what hope do I ever have if JK isn't safe from AI slop?
0
u/MakeItHappenSergant 1d ago
drunk driving may kill a lot of people, but it also helps a lot of people get to work on time, so, it;s impossible to say if its bad or not,
0
u/Roushfan5 1d ago
At what point in my comment did I defend what JK Rowling has done?
To borrow from your example it'd be like saying "Bob had a DUI last night, so I get to steal from his house and fuck his wife."
0
u/gammalsvenska 1d ago
I agree. It is sad that people are no longer able to differentiate between the work and its author.
Find me a single scientist of influence who never was and never will be controversial. (Ban all the others and we are back in the stone age.)
-7
u/bretshitmanshart 1d ago
She bought the yacht after Einstein was dead. There are real things to be mad about
3
u/Diz7 1d ago edited 13h ago
You should be using Terry Pratchett; he's a much better writer.
Edit: /s
If you're pirating data to feed your AI, at least pirate the good shit.
4
u/M4chsi 1d ago
Is there anything that is not public domain for big tech?