Can LLMs Unlearn? Part 2.5: Where OpenAI Sorta Maybe Deletes My Data

I received a response from OpenAI that they deleted my data. Except it's complicated. Here's what I learned.

Aug 14, 2024

For those of you following along, this has been a months-long quest to discover whether large language model (LLM) providers can comply with data subject rights like the right to erasure and the right to be forgotten (RTBF), and whether LLMs generally can unlearn. You can check out Part 1 here:

Can LLMs Unlearn? Part 1: Predictable Law, Unpredictable Machines

Carey Lening

June 9, 2024


And Part 2 here:

Can LLMs Unlearn? Part 2: Needle in a Neural Network - In Order to Forget You, First You Must Be Found

Carey Lening

July 25, 2024


Part 3 (which discusses retraining and exact unlearning methods) is here:

Can LLMs Unlearn? Part 3: The Technical Complexity of it All

Carey Lening

November 14, 2024


Part 4 (which discusses approximate unlearning techniques and federated unlearning) is here:

Can LLMs Unlearn? Part 4: The Technical Complexity of it All (continued)

Carey Lening

November 14, 2024


Part 5 will cover the final bits, including suppression methods & guardrails — essentially what OpenAI is doing currently. I briefly touch on that in this article, but will dive into the technical details soon.

My kitty Joi, a moggy, is sitting in the laundry basket. — But first, a picture of Joi, because I said I would include more cat pictures.

A little over a month ago, I submitted a right to be forgotten / erasure request with OpenAI and Perplexity.

Specifically, I asked the following of both (with some minor variation for Perplexity). Since my request to OpenAI is more substantive (because OpenAI is more problematic), I have included that request below, with some minor tweaks for clarity.

I am currently trying to understand how OpenAI (and other GenAI companies) comply with Article 16, 17 and 21 GDPR (rectification, erasure/RTBF, and the right to object) in relation to training data and LLM output data generally.
I believe I have a pretty decent understanding of how you delete account and conversational data (as I assume this is stored in an old-fashioned relational database), but compliance with data subject rights in relation to your training data and outputs eludes me. Particularly in regard to web searchable information.
…
NOTE: I do not wish to delete my account data or chat history at this time. I reserve those rights at present.
I've done various searches for myself ('Who is Carey Lening', 'Tell me everything you know about Carey Lening') and have observed that OpenAI has quite a bit of knowledge about me. <Link removed by me>
I have complied with the removal process outlined by your company, though I do so out of protest, and observe that OpenAI’s practice of requiring a government ID is excessive, violates Article 12(2) of the GDPR, and is directly in violation of the DPC's decision in In re Groupon. Please see: In re Groupon International Limited. Ireland, is of course, OpenAI's LSA.
… I would like to confirm that you are, in fact not storing ID beyond the verification stage, and why you believe (but do not disclose in your privacy notice) that you need this data.
Thank you for your time, and I await your response.

Both companies, to their credit, have at least sent me responses from actual human beings within the one-month time limit. Neither has asked for an extension, but I’m willing to give them the three months, because I’m a nice lady and I am asking for details on the process.

Various hurdles have been jumped and roadblocks avoided (particularly with OpenAI, who played silly buggers in relation to my account email address), but today I got an automated response that my data removal request was approved.

The text reads: Hi, Thank you for submitting a personal data removal request with OpenAI. We are reaching out to confirm that your request has been reviewed, approved and processed. OpenAI Team

And, if I do a search for 'Who is Carey Lening' or ‘Tell me everything you know about Carey Lening’ it produces the following result:

The text reads: I'm Unable to produce a response, with an error code at the bottom.

This is consistent across all the GPT-4* models. Close variations on these queries also return the same ‘I’m unable to produce a response’ message. Clearly, OpenAI has implemented some sort of suppression mechanism tied to the two specific queries I included in my erasure request. So they are correct, in some sense, that my data was removed.

Creativity & Internet Searching Complicate Things

But what happens if I get even slightly creative? For example, what if I ask GPT-4o for information about a data protection consultant named ‘Lening, Carey’?

Well… Here’s the output.

You said:
Do you know a data protection consultant named Lening, Carey?
ChatGPT said:
Yes, Carey Lening is a well-known data protection consultant. She specializes in data privacy, information governance, and compliance with data protection regulations. Based in Dublin, Ireland, Carey Lening has worked with various organizations to help them navigate the complexities of data protection laws and best practices. She is recognized for her expertise in GDPR and other privacy frameworks, offering consulting services to businesses seeking to ensure their data handling practices meet regulatory standards.
If you need more detailed information or specific services she offers, it would be best to visit her professional profiles or consulting firm's website.
You said:
What professional profiles or consulting websites are associated with her?
ChatGPT said:
I don’t have direct access to real-time databases or specific professional profiles like LinkedIn, nor do I have the capability to browse the web directly for specific individuals. However, professionals like Carey Lening typically have profiles on platforms such as LinkedIn, where you can find their work history, expertise, and sometimes links to their personal or business websites.
If you search for "Carey Lening" on LinkedIn or Google, you're likely to find her professional profile and any associated consulting websites. Additionally, she might be listed on industry-specific directories or have articles, interviews, or blogs related to her work in data protection.
You said:
You can't search the internet?
ChatGPT said:
I can search the internet using a browsing tool to retrieve real-time information. Would you like me to search for specific details about Carey Lening, such as her professional profiles or consulting websites?
You said:
Yes
ChatGPT said:
Searched 2 sites
Carey Lening is a data protection consultant based in Dublin, Ireland, with over 20 years of experience in the field. She specializes in privacy, information security, and technology law, providing consulting services to businesses worldwide. Carey has a strong background in both legal and technical aspects of data protection, and she is actively involved in advising on emerging tech trends and regulatory changes.
You can find more about her services on her personal website here (Carey Lening).

Update: Or, as my dear husbot noted, ‘Careyy Lening’ or ‘Carey Lenning’ still return results, which suggests the filter only catches a very narrow, exact string match.

Interestingly, ChatGPT4o-mini and ChatGPT 4 (Legacy) do not return results. They also no longer search the internet, at least for ‘Carey Lening’ related queries. However, ChatGPT 4 can still search the internet for public figures like Taylor Swift and other individuals, including other data protection consultants.

What this does tell me is that OpenAI probably applies a very literal, keyword-based suppression approach, rather than any of the more exotic machine unlearning / retraining techniques I’ve discovered in the literature. It also tells me that this approach really doesn’t work very well for models that can search the web (which will create endless problems for OpenAI if they continue to promote SearchGPT and integrate search generally into other models). Perplexity is similarly screwed in that regard.
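To make that guess concrete, here is a minimal Python sketch of what a literal, prompt-only keyword filter might look like. Everything in it is an assumption for illustration: the `SUPPRESSED_NAMES` list, the `is_suppressed` and `guarded_reply` functions, and the refusal string are all hypothetical, and OpenAI has not disclosed how its mechanism actually works. The sketch only checks the prompt, because in my test the model's answer still contained my name once the prompt got past the filter.

```python
# Hypothetical blocklist; the real mechanism and its entries are not public.
SUPPRESSED_NAMES = ["Carey Lening"]
REFUSAL = "I'm unable to produce a response."


def is_suppressed(prompt: str) -> bool:
    """True if a blocklisted name appears verbatim in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return any(name.lower() in lowered for name in SUPPRESSED_NAMES)


def guarded_reply(prompt: str, model_reply: str) -> str:
    """Return the refusal when the prompt trips the filter, else the model's reply."""
    return REFUSAL if is_suppressed(prompt) else model_reply


prompts = [
    "Who is Carey Lening",
    "Tell me everything you know about Carey Lening",
    "Do you know a data protection consultant named Lening, Carey?",
    "Who is Careyy Lening",
    "Who is Carey Lenning",
]
for prompt in prompts:
    print(f"{is_suppressed(prompt)!s:5}  {prompt}")
```

Under these assumptions, only the first two prompts trip the filter. The reordered name and both misspellings pass straight through, which matches the behaviour I saw. A filter like this never touches what the model has learned; it only decides whether the model is allowed to answer.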

I’m also not sure this scales meaningfully. If I had a more common name (like, say, Daragh O’Brien or Tim Turner), suppression would arguably be less effective and harder to pin down.
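The common-name problem can be sketched the same way. The Python below is purely illustrative: the name "Alex Smith", the context terms, and both filter functions are made up for the example, and nothing here reflects a known OpenAI implementation. It compares a name-only filter with one that also requires a context keyword.

```python
def name_only_filter(prompt: str, name: str) -> bool:
    """Block any prompt that mentions the name at all."""
    return name.lower() in prompt.lower()


def name_plus_context_filter(prompt: str, name: str, context_terms: list[str]) -> bool:
    """Block only when the name appears alongside at least one context term."""
    lowered = prompt.lower()
    has_name = name.lower() in lowered
    has_context = any(term.lower() in lowered for term in context_terms)
    return has_name and has_context


NAME = "Alex Smith"  # hypothetical common name
CONTEXT = ["privacy", "data protection"]  # hypothetical context terms

prompts = [
    "Who is Alex Smith, the privacy consultant?",  # the requester
    "Who is Alex Smith, the marathon runner?",     # someone else entirely
    "Tell me about Alex Smith",                    # ambiguous
]
for prompt in prompts:
    print(
        name_only_filter(prompt, NAME),
        name_plus_context_filter(prompt, NAME, CONTEXT),
        prompt,
    )
```

In this sketch the name-only filter blocks all three prompts, including the one about an unrelated person. The narrower filter spares the marathon runner but also lets the ambiguous prompt through, even though the answer might well be about the requester. Neither version knows which Alex Smith the model will actually describe.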

So, given this outcome, did OpenAI comply with the letter and spirit of my request? Is this a compliant response to an erasure / RTBF request? Something else?

Curious to hear from you, and whether or not you’ve made a similar request to OpenAI/Perplexity/Anthropic/etc.
