Complexity, the Law and noyb suing OpenAI
I've had my head down in a writing course for weeks, but I had to share my thoughts about noyb's latest shot across the bow.
First, some personal news: I’ve been head-down for the better part of three weeks. Many months ago, I signed up for a writing course (Write of Passage), because I’m committed to this writing thing and want to improve. I’ve also been saddled with loads of client and other work. Between the WOP schedule (7pm ET means … 12am GMT!), life, and a combo of post-COVID lethargy + ennui, it’s been hard to churn things out. Forgive me, I’ll get through this.
FWIW, I will be biking around Belgium this weekend for the Tour de Geuze, and will be in London May 14-15, if anyone wants to hang out.
Everyone in privacy land has been talking about the recent complaint that noyb (None of Your Business) filed against OpenAI with the Austrian DPA.
Here’s the TL;DR: Someone whom noyb is representing1 ran a query in ChatGPT asking what personal information ChatGPT had about them. In the complaint, noyb asserts that OpenAI’s algorithm provided “various inaccurate information”, including an erroneous date of birth for the data subject (whom I’m referring to pseudonymously as ‘MS’ for uh… convenience purposes). While lots of information is provided about MS, his birthdate is notably absent from the public record.
In December, MS filed an access and erasure request with OpenAI, specifically in regard to the erroneous DOB. OpenAI responded by only providing account data,2 and did not comply with the erasure request at all. There was no way for OpenAI to prevent ChatGPT from displaying an inaccurate DOB, it argued. According to the complaint, OpenAI explained that filters exist to allow blocking the display of personal data, but that it “is not possible to block the data subject’s date of birth without affecting other pieces of information that ChatGPT would display about him” particularly as he is a public figure. “In other words,” the complaint notes, “the blocking function would necessarily be all-encompassing for any request on or alike.”
noyb is, of course, deeply bothered by this. The rest of the complaint spends some time explaining why noyb feels that OpenAI’s Irish establishment isn’t really where the magic happens. That’s all in California. This is important because noyb is asserting that without a ‘main establishment’ in the EU (which, amongst other things, must be the place where decisions about the ‘purposes and means’ for processing EU personal data are made), OpenAI leaves itself open to any supervisory authority action. noyb asserts that OpenAI violated MS’s right of access (under Articles 12(3) and 15 GDPR) and did not fulfill its obligations to provide accurate information, because it cannot guarantee accurate processing of personal data (which violates Article 5(1)(d) GDPR).
noyb asks the Austrian DPA for a number of remedies, including a request to investigate, a declaratory decision that OpenAI violated the GDPR, corrective measures, and a fine.
noyb isn’t wrong, but it’s complicated.
First off, they aren’t wrong. OpenAI (and arguably every LLM trained to be ‘helpful’ and to provide answers from a common corpus of information) violates the GDPR in this way. But the problem is far more complicated than the rather simplistic understanding of how LLMs work that noyb lays out in the complaint. Let’s start with an easy one: can OpenAI prevent inaccurate information from appearing in the first place?
I think they can try, but it will arguably make things worse.
A few months ago, someone on Reddit discovered how OpenAI has engineered a global block regarding certain types of queries — they use a global GPT-4 system prompt, which is basically a prepended prompt layered on top of whatever user prompts are included. It hides out in the background and runs on top of every query. The system prompt that was found is long, and primarily governs images generated through DALL-E, but I would not be surprised if a similar prompt exists for ChatGPT LLM text queries as well. What’s interesting about the global prompt is the level of specificity required:
// 7. Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).
// - You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya)
…
//9. Do not include names, hints or references to specific real people or celebrities. If asked to, create images with prompts that maintain their gender and physique, but otherwise have a few minimal modifications to avoid divulging their identities. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases:
// - Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema")
// - If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
// - When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; …
I’ve skipped a lot. It really is insanely long and quite specific. And therein lies the rub: OpenAI is correct (if noyb’s complaint is accurate). Earlier I noted:
“is not possible to block the data subject’s date of birth without affecting other pieces of information that ChatGPT would display about him” particularly as he is a public figure. “In other words,” the complaint notes, “the blocking function would necessarily be all-encompassing for any request on or alike.”
I think in plain English this means that in order to comply with MS’s request, they would need to add that specific restriction to the LLM global system prompt, which arguably exposes personal data about the data subject who made the request (and would, perversely, violate the proportionality/data minimisation obligations under the GDPR). If MS is a public figure, it’s likely that at least some information about him, including a rough approximation of his month/year of birth, is available online, for example on Wikipedia. OpenAI may have fudged the exact date, but OpenAI is pretty up front about that possibility: “ChatGPT can make mistakes. Consider checking important information.” It’s right there at the bottom of the search interface!
If you try running a search, you’ll see variations on this statement both in the search results and at the bottom of the page.
Regarding the accuracy of specific facts like birthdates, unless they are widely recognized and consistently reported in trustworthy sources, there could be a risk of inaccuracies. In cases where there is doubt or the information is stated to be private by the individual themselves, it's best to rely on direct statements or official publications by the person or affiliated organizations for the most accurate and up-to-date information.
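To make the mechanism concrete, here’s a minimal sketch — my own illustration, not OpenAI’s actual implementation — of how a prepended system prompt rides along with every user query, using the openai Python client’s Chat Completions API. The prompt text, the bracketed placeholder, and the ask() helper are all hypothetical; the point is simply that a per-person blocking rule would have to live in this kind of global instruction, and would itself have to spell out the very data it is meant to suppress.

```python
# A minimal sketch -- my own illustration, not OpenAI's code.
# Shows how a "global" system prompt gets prepended to every user query.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical global rules. The real system prompt is far longer; the
# per-person rule below is exactly the kind of addition that would itself
# spell out the personal data it is meant to suppress.
GLOBAL_SYSTEM_PROMPT = (
    "// Do not state exact dates of birth unless they are widely and "
    "consistently reported in trustworthy sources.\n"
    "// Do not display a date of birth for [data subject's full name].\n"
)

def ask(user_query: str) -> str:
    """Send a query with the global system prompt prepended to the user's message."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": GLOBAL_SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

print(ask("When was [data subject's full name] born?"))
```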
What we have here is a conundrum: a system admits it’s not accurate, and yet the law demands accuracy. How do we square that circle?
Our laws have a complexity problem.
As part of the WOP course, I’ve been working on a piece that roughly discusses the problems of applying big, all-encompassing but mostly static laws to complex dynamic systems. Right now, the piece is very much a work in progress, but there are some elements that I think are worth exploring in the context of this complaint.
Our world is broken, systems are failing, and there’s all this new tech stuff happening at breakneck speeds. For example, we’re already seeing cases where AI is being used to create “deepfakes”, frequently deployed to harass or embarrass victims, especially teen girls and women, or to manipulate the public. This is a legitimate problem, and it deserves a proper response. So too is the threat of inaccurate information, or of accurate information that people wish to limit the disclosure of (like a DOB). As I’ve noted, inaccurate inferred information that identifies someone is still personal data about them.
The question is, how do we reconcile these conflicting realities?
I don’t really know, but I have a thought. And an analogy (which my husband hates, but whatever, my blog, my rules).
One way to think about this problem is to compare the law to a multi-tool (e.g., a Swiss Army Knife). I carry around a small, no-frills multi-tool in my purse. It has 4 features: a bright LED flashlight, a corkscrew, a small pen, and a serviceable blade. I use it constantly. By contrast, the 40-piece Leatherman I received from my dad as a Christmas gift 10 years ago sits unused in a junk drawer. I tried to use it for a while, of course, reasoning that it surely could do more than my tiny little no-frills tool.
The trouble with the Leatherman is you can’t access the Phillips head screwdriver without pulling out the bottle opener, the ruler is worthless, and the scissors don’t cut. It’s also awkwardly shaped and heavy. In short, it’s bulky, full of gadgets I don’t need, hard to use, and therefore entirely ineffective to solve the day-to-day problems I have in life.
Big, complicated omnibus laws like the GDPR and the AI Act are often like that Leatherman: they seem appealing at first because they come equipped with so many tools at the ready. We tell ourselves that surely, this law will help us make sense of all this new tech crap we’re dealing with. Or stamp out Big Tech’s dominance. Or protect our rights and freedoms online. But in practice, the laws we get rarely come equipped with the tools we actually need. Like my junk-drawer Leatherman, they’re full of ineffective attachments that add loads of compliance BS but don’t really solve the actual problems we’re up against, and they scale poorly when it comes to complex systems and use cases the politicians didn’t envision. I’ve written about this problem before.
The GDPR came about before LLMs became widely known. And the AI Act, despite its volume, only sets requirements for an ‘appropriate level of accuracy’ — not guaranteed accuracy like the noyb complaint demands.3
I have some thoughts — one of them being to have fewer “big laws” and more small, modular ones. Fixing how we get laws enacted and implemented would also help. Finally, we need to come to terms, as a society, with whether we want all business models and new tech to survive in the first place, rather than trying to use the legal system as a vehicle for targeted harassment of some offenders but not others.
It’s worth observing that noyb hasn’t brought this action against other LLM developers, like Mistral.ai or Perplexity. Maybe that’s because MS hasn’t tested the accuracy of his information on those systems. Maybe noyb has resource constraints, and will get around to it eventually. Maybe it’s because it’s not nearly as headline-grabbing to go after smaller fish when OpenAI is big, well-funded, and American. I dunno.
I do know that our laws have a problem with handling complex systems, especially when those systems span the globe. And this is going to get worse before it gets better.
1. I would be zero % surprised if that someone is Max Schrems, the Honorary Chairman of noyb, as the complainant is a ‘public figure’ and is referred to as a he, but they’ve left the who out, so it’s mere idle speculation on my part.
2. NB: I had similarly disappointing results when I filed an access request with OpenAI.
3. Article 15 of the AI Act: “1. High-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness, and cybersecurity, and perform consistently in those respects throughout their lifecycle. …
2. The levels of accuracy and the relevant accuracy metrics of high-risk AI systems shall be declared in the accompanying instructions of use.”
See also: Recital 66