LLM Lies About You Are Still Personal Data
What rights should you or I have to that data? At a minimum, shouldn't we be allowed to access, correct, delete, or object to the use of hallucinated nonsense about us?
Last week was a busy AI-related week for me. I attended two different conferences that touched on different aspects of artificial intelligence and large language models, and I’ve been reading a flood of literature on the subject. The first event was hosted by DPO Organizer and the Law Society of Ireland, and was very much geared towards lawyers and policy-types. The panelists discussed the GDPR, the almost-there AI Act, and how regulators were going to act. There was also lots of lawyerly waffling, and the phrase ‘it depends’ featured repeatedly. Also, the Deputy Data Protection Commissioner Cathal Ryan was present, and was gracious enough to answer a few of my questions.
The second event was a roadshow put on by Microsoft touting all the magical (and sometimes horrifying) ways organizations could use Copilot to make everyone more efficient (including some new creepy decision-making about HR which made me gasp). The roadshow did not touch on the law much, though I will give props that one of the speakers did mention the AI Act and the GDPR and was kind enough to chat with me during the coffee break. Naturally, the Microsoft one had better swag, and way more butts in seats.
These events were both instructive, as both gave me an opportunity to explore the latest crazy thought experiment I’ve been mulling over:
Are LLM ‘hallucinations’ personal data?
Hallucinations (or confabulations) refer to responses generated by an AI or LLM which contain false or misleading information presented as fact. You’ve probably seen AI hallucinations in the wild, though sometimes they’re wickedly hard to distinguish from truth. Take, for example, this screenshot from my sister in data protection and hater of all data brokers, Heidi Saas:
Many of the points Heidi has helpfully identified in that screenshot are bullshit. But if she hadn’t highlighted them, would you know? If someone who didn’t know Heidi queried Perplexity.ai about her and believed the “botshit” it generated, how might that affect her reputation or her privacy rights? More importantly, is the botshit generated about her also personal data, and can she do anything about it?
In an attempt to get some ground truth, I posed this question to two different important constituencies:
The Deputy Data Protection Commissioner: At the DPO Organizer conference, I took the opportunity to put Mr. Ryan on the spot. He kindly deflected and noted that related questions, including whether information gobbled up in the Common Crawl was ‘manifestly made public’, are still live. This is fair, as it’s never nice to put a bureaucrat on the spot, and expecting anything profound is probably going to end in disappointment.
ChatGPT: ChatGPT was far more confident about the law than Mr. Ryan. After summarizing basic data protection law, ChatGPT suggested that “even fabricated information could potentially fall under this definition if it’s linked to an identifiable individual or household.” So there we go.
Here, I think ChatGPT isn’t spouting botshit — it’s dead on. That said, since this is not a settled question, I thought it would be educational to lay out the argument for why hallucinated data about an identifiable individual is, in fact, personal data.
Incorrect Data That Identifies You Is Still Personal Data
I’ve droned on about personal data (or personal information) before, but as a reminder, under the GDPR and other privacy/data protection laws, personal data means any information relating to an identified or identifiable person. Identifiability can be direct (e.g., a person’s name, online identifier, or email address) or indirect (a description of the person, or a unique ID paired with other details that could be combined in order to identify them). Many US laws have adopted the GDPR definition, but the California Consumer Privacy Act likes to do its own thing. It defines “personal information” as covering any information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household. Same but different, whee!
Going back to the GDPR, Article 5 defines the six foundational principles of data protection. Simply put, if you want to process personal data lawfully, you need to:
acquire the data lawfully, fairly & transparently;
process the data for a specific purpose;
only process data that’s adequate, relevant, and limited to what’s necessary to fulfill that specific purpose;
make sure the data is accurate and kept up-to-date;
keep the data for no longer than is necessary;
keep the data securely, which includes ensuring its integrity and availability.
Point number 4, which is referred to as the ‘accuracy principle’, further requires that controllers not only ensure that the data is accurate, but that inaccurate data be erased or rectified ‘without delay’. Additional text in the GDPR (Recital 39) clarifies that “Every reasonable step should be taken to ensure that personal data which are inaccurate are rectified or deleted.” Which is to say, if a controller is processing inaccurate, incomplete, or fake data about a person, they are violating the law. While the GDPR lays all of this out pretty clearly, most of these principles, including the accuracy principle, aren’t unique to the GDPR. You’ll find them littered throughout other laws around the world.1
Rights Flow From Obligations
Many data subject rights, including the right to rectify or correct inaccurate data, and the right to have irrelevant, inaccurate, or unnecessary personal data erased or ‘forgotten’, flow directly from the accuracy principle. And so do certain penalties. After all, expecting controllers to do the right thing just because some ‘principle’ exists is silly. You also need a stick.
The concept of the right to be forgotten itself arose from the landmark pre-GDPR 2014 decision by the Court of Justice of the EU, Google Spain v. Costeja González.2 In Google Spain the question was whether Mr. Costeja González could force Google to remove search results that linked to factually accurate but decades-old negative news coverage about him.3 The Court found that individuals do have a “right to be forgotten”, and that this right extends beyond merely inaccurate data to also cover data that is “inadequate, irrelevant or no longer relevant, or excessive.”4 Importantly, the individual isn’t responsible for proving that the “information in question in the list of results causes prejudice.” (para 96)
But there are also other rights under the GDPR and US data protection laws like the CCPA, the Colorado Privacy Act, and the Virginia Consumer Data Protection Act. These include the right of access to personal data, and the ability to restrict or object to specific processing. Frustratingly, while the US state privacy laws mention the existence of these rights, most don’t say much beyond the fact that the right exists, that an affected person/consumer can exercise it, and that a data controller needs to honor the request. As with the rights to erasure and rectification, these rights do not require an individual to demonstrate any sort of harm/injury (at least under the GDPR).
Now, data subject rights aren’t absolute, but certain rights (particularly access and rectification) are much more pro data-subject. Barring a few narrow restrictions,5 I’d argue that compliance with data subject rights should extend to hallucinated data.
What Should Heidi Do?
With all of that, let’s get back to the topic at hand. Based on the plain reading of these laws, it stands to reason that:
IF:
hallucinated information is being processed (collected, stored, disseminated, etc.); and
it’s processed by an entity covered by the GDPR or other data protection laws; and
the data being processed is about individuals (i.e., it makes them identifiable) and they are covered by that law;
THEN: those individuals should have the right to do something about it (access, rectification, deletion, objection) in relation to that hallucinated data.
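For the more literal-minded, the same test can be restated as a checklist in code. This is purely an illustrative sketch: the class and function names below are my own invention, not part of any compliance tool, statute, or provider API.

```python
from dataclasses import dataclass


@dataclass
class HallucinatedOutput:
    """Illustrative stand-in for one LLM response about a person."""
    is_processed: bool            # collected, stored, disseminated, etc.
    controller_is_covered: bool   # the entity falls under the GDPR or a similar law
    identifies_person: bool       # the output relates to an identifiable individual
    person_is_covered: bool       # that individual is protected by the law in question


def rights_attach(output: HallucinatedOutput) -> bool:
    """If every condition holds, data subject rights (access, rectification,
    deletion, objection) should attach to the hallucinated output."""
    return (
        output.is_processed
        and output.controller_is_covered
        and output.identifies_person
        and output.person_is_covered
    )


# A fabricated biography of an identifiable, covered person, processed by a
# covered controller: the individual should be able to exercise their rights.
print(rights_attach(HallucinatedOutput(True, True, True, True)))  # True
```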
In practice, that means being able to demand access not just to account information but also to all hallucinated outputs generated, and to rectify, delete, or in some cases object to the processing of that hallucinated information. Service providers/controllers like Microsoft, OpenAI, Google, Meta, and Perplexity are all bound by the data protection laws, but right now they’re all pretty bad at actually complying with them, especially with data subject rights.
For example, an access request made through the ChatGPT dashboard only includes my chat history and account data. I downloaded the data and checked — it does not include hallucinated outputs, details about the sources OpenAI relies on to answer questions like ‘Who is Carey Lening’ or ‘What does Carey Lening write about’ (including information provided by other users or trainers), and critically, other queries that might mention or relate to me.
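(If you want to repeat that check on your own export, a few lines of code will do it. The snippet below is a rough sketch that assumes the export still arrives as a zip containing a conversations.json file, which is what mine contained; the structure may change, and the field names here are based on that one download rather than any documented format.)

```python
import json
from pathlib import Path

# Assumed location and format of the downloaded export; adjust to your own paths.
EXPORT_FILE = Path("chatgpt-export/conversations.json")
NAME = "Carey Lening"  # the name you want to search for

conversations = json.loads(EXPORT_FILE.read_text(encoding="utf-8"))

# Walk every message in every conversation and flag any mention of the name.
hits = 0
for convo in conversations:
    for node in convo.get("mapping", {}).values():
        message = node.get("message") or {}
        parts = (message.get("content") or {}).get("parts") or []
        for part in parts:
            if isinstance(part, str) and NAME.lower() in part.lower():
                hits += 1
                print(f"[{convo.get('title', 'untitled')}] {part[:120]}")

# Whatever turns up, it only covers *my* chats, not what other users asked or were told.
print(f"{hits} mention(s) of {NAME} found")
```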
And OpenAI is considerably better than the rest. Neither Google Bard/Gemini nor Perplexity.ai seems to offer any options in relation to accessing my data. Perplexity.ai at least provides a contact email for their privacy team. I still haven’t been able to figure out who to contact at Google in relation to an access request. It’s a maze.
Now, it’s easy for those of us in Europe, but let’s examine what Heidi can do. Let’s pretend she’s based in a US state with a privacy law in place that at least recognizes access, rectification, and deletion rights. If Heidi were to send a subject access request to OpenAI, she should receive a copy of her chat history and responses. But she should also receive the result that I just queried:
… and every result (and arguably specific queries) that others have made about her on OpenAI. Unlike search results that Heidi can find by doing a search, OpenAI isn’t dumping a fixed or deterministic set of results that Heidi herself can easily find. ChatGPT (and arguably all LLMs) generate ‘creative’ results based on the probability of specific words being near other words, which can vary depending on loads of different factors. But it’s not something Heidi herself has any visibility or insight into. An access request is really the only way. Importantly, it will inform how she exercises other data subject rights, like rectification and deletion.
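To make that non-determinism concrete, here’s a toy illustration of weighted next-word sampling, which is (very loosely) what these models do under the hood. The vocabulary and probabilities are entirely invented for the example; real models work over enormous token vocabularies and far more context, but the point is the same: ask the same question twice and you may get different answers.

```python
import random

# Invented next-word probabilities for the prompt "Heidi Saas is a ..."
# (purely illustrative; no real model weights involved).
next_word_probs = {
    "lawyer": 0.45,
    "consultant": 0.25,
    "specialist": 0.20,  # plausible-sounding, and a seed for a hallucinated job title
    "professor": 0.10,
}


def sample_next_word(probs: dict[str, float]) -> str:
    """Pick one word at random, weighted by the model's probabilities."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]


# Ask the 'same' question five times; the answers can differ every run.
for _ in range(5):
    print("Heidi Saas is a", sample_next_word(next_word_probs))
```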
Additionally, Heidi should be able to demand that sites like perplexity.ai correct or rectify the fact that she did not, in fact, attend Emory University or graduate in 1997, and that she probably wasn’t a Billing and Accounts Receivable Specialist at Paradigm Tax Group (which came up when I queried her name on perplexity.ai). Absent technical impossibility, she should be able to ask for this information to be deleted or corrected.
A Quick Digression about Privacy Torts
So far, I haven’t seen many cases (outside of Alex Hanff in the EU) challenging LLM providers on hallucinated data. That isn’t to say that there haven’t been any cases, just that I’m not aware of them (yet). However, the AI Bot Case Bot reports that there are at least 4 ongoing cases lodged against companies such as OpenAI/Microsoft and Alphabet in the US that touch on privacy harms caused by false or hallucinated content from LLMs.6 Most of these cases allege a range of privacy torts, which include:
Defamation;
Publicity which places a person in a false light in the public eye;
Intrusion upon seclusion or solitude, or into private affairs.
I’m admittedly less fluent on the privacy torts, seeing as I’m 20 years out of law school and haven’t seen much success in this area, outside of the occasional defamation case. This body of law is large enough to warrant many separate blog articles, and I suspect that as these and future cases progress, we’ll get interesting developments worthy of poring over.
Unfortunately, I’m skeptical that these claims will be successful for the plaintiffs. For one, these suits are usually harder for plaintiffs’ attorneys to prove than actions grounded in statutory violations of data protection laws like the GDPR and CCPA, because tort actions usually require plaintiffs to demonstrate not only that they suffered a concrete (read: $) harm or injury, but also that the injury was caused by an intentional or negligent bad act by the defendant, which is often tough to prove.
For example, the United States has a high threshold for defamation cases involving public figures, requiring proof of “actual malice.” The actual malice test requires that a public figure plaintiff (let’s say, radio personality Mark Walters, who filed a defamation action against OpenAI) prove that a false statement was made about him with knowledge that it was false or with reckless disregard of its falsity by the defendant.7 Mr. Walters will have a hard time succeeding against OpenAI on defamation, since there’s no real sense that ChatGPT ‘knows’ what it’s saying, much less that a statement is false, and actual malice must be proven to a higher evidentiary standard (clear and convincing evidence) than the preponderance (more likely than not) standard for most torts. As such, I suspect it’d be smoother sailing for him if he demanded deletion of erroneous hallucinated results directly from OpenAI (assuming he moves to a state that has a state privacy law on the books with data subject rights).
Toward an AI-Empowered Future: Ethical Boundaries and Legal Horizons
Like most things I talk about here, we’re left with more questions than answers. Fortunately, I think I’ve laid out a path to at least answer the question of whether hallucinated data/botshit is personal data, and what individuals should be able to do about it. Now we wait for courts/regulators/businesses to finally agree with my brilliance (or not).
For example, Canada’s PIPEDA (Principle 6), Ghana’s Credit Reporting Act 2007, the US Fair Credit Reporting Act & Privacy Act of 1974, and Hong Kong’s Personal Data (Privacy) Ordinance.
Google Spain SL, Google Inc. v Agencia Española de Protección de Datos, Mario Costeja González, C-131/12, 13 May 2014. Deletion/RTBF rights have some additional limitations and gotchas under the GDPR. For example, the right to deletion only applies if the information isn’t necessary, wasn’t lawfully processed in the first place, when processing is based on consent and no other grounds apply, where the data involves children, or when a data subject has objected under certain circumstances.
In this case, Article 6 and the related data subject rights at issue concerned the precursor of the GDPR, the Data Protection Directive, Directive 95/46/EC.
Google Spain, paras 92–93. “[S]uch incompatibility may result not only from the fact that such data are inaccurate but, in particular, also from the fact that they are inadequate, irrelevant or excessive in relation to the purposes of the processing, that they are not kept up to date, or that they are kept for longer than is necessary unless they are required to be kept for historical, statistical or scientific purposes.
It follows from those requirements, laid down in Article 6(1)(c) to (e) of Directive 95/46, that even initially lawful processing of accurate data may, in the course of time, become incompatible with the directive where those data are no longer necessary in the light of the purposes for which they were collected or processed. That is so in particular where they appear to be inadequate, irrelevant or no longer relevant, or excessive in relation to those purposes and in the light of the time that has elapsed.”
Here I’m looking at the GDPR as I haven’t analyzed each of the US state law regulations. Under the GDPR exceptions include: national law, non-identifiability, intellectual property or copyright considerations, or adverse impact on others’ fundamental rights. The latter two are not catch-alls to avoid compliance. They apply narrowly, and only to the extent there’s impact on those rights. For example, if you’re processing my personal data in your database, you do not need to provide me with your full database output or source code (that’s your IP, and it potentially impacts others’ rights). You must, however, provide me with the details related to my row in the database.
The Clarkson Law Firm based in California appears to be bringing a number of these cases, including P.M. v. OpenAI LP (filed June 28, 2023, terminated Sept. 15, 2023) and J.L. v. Alphabet (filed July 11, 2023). When I was digging around I also discovered a case brought by the attorneys at Morgan & Morgan on September 5, 2023, A.T. v. OpenAI LP, which is almost a word-for-word copy of the P.M. v. OpenAI case. I dunno if those guys are working together or not, or if the folks at Morgan & Morgan just flagrantly infringed on the Clarkson Firm’s work, but man it’s weird to see. And there’s the Walters case, of course.
The standard for actual malice was first identified in New York Times Co. v. Sullivan, 376 U.S. 254 (1964). Mr. Walters’ case does not really allege defamation directly against OpenAI, despite naming them as a defendant, and it’s an odd case generally, but if you squint you’ll see he’s claiming defamation against various individuals. Walters v. OpenAI, June 5, 2023.
Really great thought provoking piece here…again 😃. So this got me thinking about libel law and defamation. First off, I’m not a lawyer and do not even play one on TV, not to mention that I’m an idiot, so lacking the mental capacity to be one even if I wanted to be. With that out of the way…
“To prove prima facie defamation, a plaintiff must show four things: 1) a false statement purporting to be fact; 2) publication or communication of that statement to a third person; 3) fault amounting to at least negligence; and 4) damages, or some harm caused to the reputation of the person or entity who is the subject of the statement.”
If a statement was made by Gemini or ChatGPT about your friend Heidi, but unlike her examples, it went a bit further to say that she’s renowned to be wholly untrustworthy and unreliable as an attorney because of several blown cases, and then it turns out that this was all a hallucination, totally fabricated, could Google or OpenAI be held liable given their clear negligence in not having taken measures (or having taken measures that still failed) to prevent users of their services from finding these outright false statements that their bots were claiming to be true? When these LLMs are under the covers of other applications, any disclaimers about the quality of their results are not usually evident. To the extent Heidi could show that she lost business after people consulted with the bots to determine if she was a suitable attorney for their needs, could she go after these companies with a credible case? Let's take it a bit further and say Heidi tested this herself, found this error, and reached out to Google or OpenAI to let them know of it. These companies claim that they don't know how this happens and that it's not practicable to fix. Then the initial case happens where she actually loses one or two clients to these false statements; can she claim defamation then, since now the companies have been made aware of the issue but have done nothing to remedy the offense? 🤔
I agree with your argument here, that hallucinated data can still be personal data. I am wondering though if there is a practical problem with this. What measures can be put in place to rectify the inaccuracies in the hallucinated data? I think this might be a bit tricky. My understanding is that the hallucination (or the inaccuracy) is not necessarily a function of inaccurate training data. It is a function of the model. It generates a probability distribution over possible outputs, learned from its training data, and uses this to predict what the best response to the prompt should be. So even if all the training data is 100% factually accurate (though I'm not sure that's possible), the model could still produce inaccurate outputs, because the nature of that output is probabilistic and not deterministic (and the model is so big and complex that it is hard to anticipate its behaviour sometimes). That is not to say that the right to rectification does not exist, for if the hallucinated data is personal data then data subjects should be able to exercise their rights. It is more a question of the practical fulfilment of the right to rectification. Curious what you think of this.