18,000 Words. Four Questions. Much Delegation. Little Guidance.
The EDPB issued their much-anticipated Opinion on AI Models, and boy do I have some opinions. And bonus analysis (for paid subscribers!)
Background
On 4 September 2024, the Irish Data Protection Commission (DPC) made a request under Article 64(2) GDPR for an EDPB examination of the processing of personal data in the context of the ‘creation, operation and maintenance’ (or, more simply, development and deployment) of artificial intelligence (“AI”) models, including large language models. The questions asked only for consideration of the GDPR.
Specifically, the DPC's request asked for clarity on the following issues, which I’ve summarized here:
when and how an AI model can be considered as ‘anonymous’;
the appropriateness of legitimate interest as a legal basis for processing, both in terms of development and deployment of AI models;
the consequences for subsequent processing if the development of an AI model includes unlawful processing of personal data.
On 17 December 2024, the EDPB provided its “Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models.” It’s a lengthy tome, and since it’s something I’ve been waiting on for months, I wanted to weigh in. Also, virtually everyone in the data protection world has their own hot takes, many of which are worth reading.1
I’ve broken this up into two parts: The first is a breakdown of what I like and don’t like about the EDPB’s Opinion. I’m leaving this part open for all subscribers, because I’m secretly hopeful that maybe the EDPB will read it (ha!).
The second part, which will be for paid subscribers only, will provide a full and fair breakdown of the questions asked in plain English, the EDPB’s response, and some questions and observations that, IMHO, should be considered. If you’re not a paid subscriber, consider upgrading — I’ve got a fancy 20% off deal going right now.
Before I begin with what I like, check out a picture of Rosie and Leroy snuggling in the cat perch:
What Made Sense
The EDPB said a few things I generally agree with. Firstly, the Opinion acknowledged that "the application of [the GDPR] to AI models raises systemic, abstract and novel issues." That's the understatement of the century.
Second, the Opinion noted that everyone, including the EDPB, would "greatly benefit from reaching a common position on the matters raised by this Request, such matters being central to the planned work of the EDPB in the short and medium term." (Sec. 1.2)
Third, I thought the initial questions asked by the DPC were solid. I wish they'd asked more technical and implementation questions -- a point I mentioned during the Stakeholder Discussions in November, but alas, we must play with the cards we’re given.
Fourth, I think the EDPB’s answer to the ‘what happens when an AI model is trained in a naughty way’ question was reasonably sound. I had very few questions, and no real hot takes.
But enough about what I liked -- let's get to the fun stuff, the stuff most of you probably came here to see: my cranky thoughts and observations on the rest of it.
What Baffled Me/Made Me Sad
If it wasn't apparent by the initial tone of this piece, I'm deeply confused about what the EDPB hopes this Opinion will achieve, beyond being a technical fulfillment of their consultative obligations under Article 64(2) GDPR.
The Opinion offers only 'general considerations' on the interpretation of relevant provisions of the GDPR. What's missing is any sort of clarity on how to implement any of this in practice, both for controllers and for the poor Supervisory Authorities. As Dr. M.R. Leiser noted, this is "lazy delegation" to SAs, and I’m inclined to agree.
Based on my reading, a controller realistically meeting the requirements outlined in the first two questions (anonymity or legitimate interests) is about as likely as two Sundays following one another. Aside from maybe the most limited use-cases (some very narrow-task LLMs, ML models that detect spam, image-generation AIs that only generate animals and plants), the standards outlined by the EDPB in Sections 3.2 and 3.3 seem wildly unachievable. How, for example, would OpenAI's interests ever outweigh those of the 200M EU data subjects, especially given all the externalities that a controller must consider?
After reading this, I am left wondering why the EDPB needed 35 pages to state the obvious: that nearly all general purpose LLMs and generative AI systems are de facto illegal under the GDPR. We’ve got a Lucy/Charlie Brown football scenario going on.
My question then (to the EDPB and my fellow data protection punters) is why was this Opinion written at all? I'll dig into a few theories I have later, but first, here are some high-level takeaways for everyone playing along at home:
On Anonymity
Based on my reading, there is almost no conceivable situation where a general-purpose language model, generative AI, or even most ML tools will ever meet the anonymity threshold outlined in Section 3.2. If you can think of one, leave it in the comments and I'll buy you a beer.
While it is obvious that anonymity should never be presumed, I think the EDPB should have been explicit that it's not just hard — it’s nigh impossible for any of the current commercial model families to anonymise their data. Short of scrapping everything and starting over with a much different, much smaller corpus, rigorous data quality, built-in entity disambiguation, etc. etc., the current LLM providers will always miss the ball.
Maybe this is a blessing and represents an opportunity for some plucky upstart privacy-preserving LLM to somehow spring into being, but I won't hold my breath.
On Legitimate Interests
Controllers must consider potential impacts (accidental or intentional, likely or theoretical, including malicious uses by third parties) for each processing activity along the AI model chain. (paras 80-81) This seems to imply that the 'means reasonably likely' language has been extended to unlawful activity, which appears broader than what the law requires under Article 32 GDPR.
Ambiguities in determining “reasonable expectations” of data subjects pose additional compliance hurdles. Reasonable expectations are subjective. Even assuming we're focused only on the expectations of average adults (and not kids or vulnerable individuals), the expectations of someone in Germany, vs Ireland, vs Poland may be different. The regulatory interpretations by SAs certainly are.
This all becomes trickier when the EDPB/SAs finally get around to applying this opinion to special categories/sensitive data. That will create at least two problems:
Firstly, which lawful basis applies for special categories data, particularly when most LLM data is sucked up off the internet, and not directly obtained from the data subject?
There is no legitimate interests basis for special categories data. Explicit consent for third-party or web-scraped data is an oxymoron. Article 9(2)(g) (substantial public interest) is also likely a non-starter, absent special member state laws carving out generative AI.

Second, any controller asserting Article 9(2)(e) — that the personal data was manifestly made public — is going to have a hard time given the recent Meta v. Bundeskartellamt decision. To quote the CJEU: "it is important to ascertain whether the data subject had intended, explicitly and by a clear affirmative action" to make the data public.2 Like explicit consent, I don't see how 9(2)(e) is a possible lawful basis when the majority of training data comes from stuff grabbed off the internet, almost all of it published for decidedly different contexts, long before LLMs even existed in the public consciousness.
Notwithstanding model developer intent, the recent Meta decision all but guarantees that this question will need to be sorted out sooner rather than later.
[W]here a set of data containing both sensitive data and non-sensitive data is […] collected en bloc without it being possible to separate the data items from each other at the time of collection, the processing of that set of data *must be regarded as being prohibited, within the meaning of Article 9(1) of the GDPR, if it contains at least one sensitive data item* and none of the derogations in Article 9(2) of that regulation applies.3

Meeting the necessity standard will also be challenging. While LLMs absolutely can be developed without relying on data from the internet — e.g., they could be trained on historical data, novels in the public domain, or synthetic data — the question is, will those models be as robust or effective as what currently exists? How does this impact the legitimate interests argument? More importantly, if the answer is that the models are inferior, will fines actually motivate action?
If we're applying a case-by-case approach instead of a broader prohibition/obligation re: training, I don't know how this won't be regulatory whack-a-mole forever. While the "specific circumstances of each case" language comports with earlier opinions, I'm just trying to game this out in practice. As I note later, implementing the EDPB’s Opinion in practice has the potential to be a huge regulatory headache.
On Mitigations
While I appreciate the nod to data subject rights and in particular machine unlearning techniques, I have no idea how the delay between collection/training and use (para 102) would materially benefit data subjects who are not otherwise aware that their personal data is being used to train AI models in the first place.
The other mitigation measures all suffer from problems I have mentioned as part of my machine unlearning series: namely, that LLMs do not behave like databases. There often isn't a single record or file that can be deleted, anonymised, or even identified, and the process of identifying which information contains personal data in scope of the GDPR is far more complicated than the EDPB appreciates.
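To make the 'not a database' point concrete, here's a deliberately naive sketch (my own toy illustration, with made-up names, not anything drawn from the Opinion). A record-based mental model assumes you can locate a person's data with a simple lookup; even at the training-corpus stage, a literal match misses indirect and partial references, and none of it touches what the trained model has already absorbed into its weights.

```python
import re

# Toy training snippets. The second and third refer to the same (fictional)
# person as the first, but without the literal name.
corpus = [
    "Jane Doe was appointed CFO of Example Corp in 2019.",
    "The company's new finance chief previously worked in Dublin.",
    "J. Doe (b. 1981) spoke at the industry conference.",
]

def naive_erasure_scan(documents, name):
    """Return documents that literally mention the name -- roughly the level of
    'identification' a database-style mental model assumes is sufficient."""
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]

print(naive_erasure_scan(corpus, "Jane Doe"))
# Only the first snippet matches; the indirect and abbreviated references slip
# through, and the model weights trained on all three are untouched either way.
```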
Other Stuff
Organizations developing and deploying models will really need to be up on the state of the art in machine learning, unlearning, and adversarial attacks. I know some people if you need help there.
Techniques like differential privacy are encouraged, but in practice, they are not widely implemented.
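For a sense of what 'implementing differential privacy' actually asks of a developer, here is a minimal, hypothetical sketch of DP-SGD, assuming PyTorch plus the Opacus library and a toy classifier; it is nowhere near a frontier-scale LLM training run, which is part of why the technique isn't widely deployed.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy data and model; any real training setup would differ.
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=64)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# Wrap the training objects so gradients are clipped per sample and noised (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # more noise -> stronger privacy, worse utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for epoch in range(3):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

# The accountant reports the privacy budget actually spent.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```

The per-sample clipping and added noise that buy the epsilon guarantee are exactly what cost utility and training throughput, and that trade-off gets steeper as models scale.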
For all that’s holy, someone please tell me how model developers and deployers communicate all of this technical complexity to users in a transparent, accessible, understandable, and effective manner that doesn’t end up being a 40-page notice that nobody will read. Or how this comports with the recent AG opinion in Case C-203/22 — CK v. Dun & Bradstreet Austria, where AG Richard de la Tour noted that transparency requirements for automated decision-making must, amongst other things, be precise, easily accessible, and presented in clear and simple language, tailored to the individual’s level of understanding. I don’t know how to square the circle on that.4
I'm asking for a me.

Automated decision-making, DPIAs, and the principle of data protection by design and default under Articles 22, 35, and 25(1), respectively, are noted as important considerations and essential safeguards, but they’re mentioned in passing. While it’s good that the EDPB mentioned them, it would be nice to eventually see how controllers could combine DPIAs with Fundamental Rights Impact Assessments under the EU AI Act, and how the other pieces of the puzzle fit.
Also, the only substantive mention of the EU AI Act comes in Section 2.3, where the EDPB distinguishes AI Systems (as defined in Article 3(1) of the AI Act) and 'AI Models' (Recital 97 AI Act) from the narrower definition of 'AI models' as defined in the DPC's original request. I legitimately do not understand why they muddied the definitional waters here. It just makes the opinion more confusing, but maybe that's an administrative thing I'm missing?
Why Didn't the EDPB Just Ban LLMs Already?
After scratching my head for two days, I have a few wild-ass theories for why the Opinion was written the way it was.
The EDPB was under political or external pressure to correctly interpret the law but not explicitly state 'all existing AI Models/LLMs/GenAI are illegal', even though this is likely the correct answer, based on any sane reading of the GDPR, CJEU decisions, and past guidance.
In short, it's less politically damaging to write a 35-page legal analysis that 0.001% of the population will read and fewer still will understand, compared to a two-paragraph statement that says, in effect, that LLMs are de facto unlawful in the EU, and OpenAI et al. can go to hell.

The EDPB does not want to be the bearer of bad news. They would much rather shift the unpleasant task of banning LLMs and genAI in Europe to someone else (e.g., the CJEU, SAs, the European Commission).
I think that's why there's so much emphasis on the discretion of the supervisory authorities, citations to the CJEU, and lots of hand-wavey statements about what might/could/ought to be. For what it’s worth, the EDPB includes the words 'case-by-case' 16 times, and 'discretion' or 'competence' (in relation to the Supervisory Authorities and their decision-making power) 13 times. That's an awful lot of 'fuck if we know, you guys go figure it out' language.

The EDPB is hoping that the sheer unworkability of this Opinion will motivate technical advancements and privacy-preserving business models. That's why mitigations, privacy-preserving techniques, and other suggestions are sprinkled amongst otherwise unworkable standards. This is the most optimistic interpretation, and it would be amazing if it happens.
The EDPB does not fully understand how AI models/systems work, and how the law is in conflict with how these systems work at a technical level. There's evidence of this in some of the examples cited in Section 3.3 on legitimate interests, particularly in the mitigation and data subject rights sections.
In conclusion: anyone reading this Opinion in the hopes that it will provide practical guidance to save existing AI business models is deluding themselves, IMHO. At best, it's a strong kick in the ass to developers and tech companies to go back to the drawing board and start over with privacy and data protection in mind. Or ample cover for someone else (SAs, the Courts) to kick LLMs and genAI out of the EU.
At worst, it's an exercise in frustration and a reason for everyone reading to have a stiff drink in hand.
And now, the more detailed (and less snarky) analysis.