Privacy Disasters: FaceHuggers Are Eating Your Skeets
This is a post about the current mishegoss related to Hugging Face and Bluesky's firehose, and how both companies are largely fucking over their users.
So, as many folks have likely heard, Bluesky is a (newish) social network that is currently the fan-favorite Twitter replacement for non-Nazi Bar discourse.1 After Trump won the election and doomed the country to 4+ years of fascism by kakistocracy, and with Elon continuing to immolate Twitter to make the trolls happy, millions of people finally decided to ditch the birdsite.
Needless to say, it’s been a busy few weeks at Bluesky HQ. They now have around 23 million users, which is a lot for a social network that employs around 20 people. Here’s a daily record dashboard (source) just to give a flavor of the volume increase that happened right around that time. Numbers went way up on the 16th of November.
A few days ago, a thing happened—or, at this point a series of things. An employee of the popular AI community platform Hugging Face released a dataset consisting of one million Bluesky skeets (the slang term of art for ‘posts’) harvested directly from the Bluesky Firehose API. The firehose is “an authenticated stream of events used to efficiently sync user updates (posts, likes, follows, handle changes, etc.)” In plain English, it’s a way to programmatically access user content, including posts, userids (referred to as DIDs), and who users follow and block.
This is possible because everything you post on Bluesky (save for direct messages) is public.
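To make "programmatically access" concrete, here is a rough Python sketch of what a consumer of such a stream does with each event. The event shape is an assumption on my part, loosely modeled on the JSON messages from Bluesky's Jetstream service (the raw firehose uses a binary encoding), and the DID and post text are invented. Nothing here touches the network.

```python
import json
from typing import Optional

# Hypothetical event, loosely modeled on a Jetstream-style JSON message.
# The DID, record key, and post text are invented for illustration.
SAMPLE_EVENT = json.dumps({
    "did": "did:plc:aaaaaaaaaaaaaaaaaaaaaaaa",
    "kind": "commit",
    "commit": {
        "operation": "create",
        "collection": "app.bsky.feed.post",
        "rkey": "examplekey",
        "record": {"text": "hello world", "createdAt": "2024-11-16T12:00:00Z"},
    },
})


def extract_post(raw_event: str) -> Optional[dict]:
    """Return the author DID and text of a newly created post, else None."""
    event = json.loads(raw_event)
    commit = event.get("commit") or {}
    if event.get("kind") != "commit":
        return None  # e.g. handle or account status updates
    if commit.get("operation") != "create":
        return None  # updates and deletes carry no new post text
    if commit.get("collection") != "app.bsky.feed.post":
        return None  # likes, follows, blocks, and so on
    return {
        "did": event["did"],
        "rkey": commit["rkey"],
        "text": commit["record"]["text"],
    }


print(extract_post(SAMPLE_EVENT))
```

The point is how little code stands between a public stream and a table of (DID, text) rows.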
Before we dig in to the interesting data protection/privacy aspects of this, I need to do a quick primer on federation, decentralization, Bluesky, and importantly, its protocol.
Federation, Decentralization, and Protocols, Oh My!
Bluesky might visually look a lot like Twitter, but it is fundamentally different (in theory). First, it’s built on an open, federated, and decentralized protocol known as the Authenticated Transfer Protocol (ATProto for short). Federation is a way for a network of independent servers (or instances) to communicate with one another. In a microblogging or social network context (like Bluesky), each instance relies on the same underlying set of communication rules, or protocols, to share information. Unlike Twitter, Facebook, etc., each given instance is meant to run independently, and can set its own rules, policies, and moderation practices.
The easiest way to think about this is with email. Email works via a few different protocols. It’s federated in that I can send a message to your Gmail account from my proton.me account, and you’ll get it no problem, even though those two systems are owned by two separate companies and have their own bells and whistles under the hood.
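If it helps, here is a toy Python sketch of that idea. Everything in it is hypothetical (the `Instance` class, the domains, the banned-word rule); it is not how email or ATProto actually route messages. It only shows independent servers agreeing on one message format while each applies its own policy.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical independent server with its own moderation rule."""
    domain: str
    banned_words: set = field(default_factory=set)
    inbox: list = field(default_factory=list)

    def receive(self, message: dict) -> bool:
        # Each instance applies its own policy to incoming messages.
        if any(word in message["body"].lower() for word in self.banned_words):
            return False
        self.inbox.append(message)
        return True


def send(sender: str, recipient: str, body: str, network: dict) -> bool:
    """Deliver a message using the one format every instance agrees on."""
    _, domain = recipient.split("@", 1)
    message = {"from": sender, "to": recipient, "body": body}
    return network[domain].receive(message)


network = {
    "strict.example": Instance("strict.example", banned_words={"spam"}),
    "relaxed.example": Instance("relaxed.example"),
}

print(send("ana@relaxed.example", "bo@strict.example", "hi there", network))
print(send("ana@relaxed.example", "bo@strict.example", "buy spam now", network))
```

The first message is accepted and the second is refused, even though both arrive in the same shared format: the protocol is common, the rules are local.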
Bluesky is philosophically similar to another federated microblogging platform, Mastodon (which uses a different protocol known as ActivityPub). The two don’t (generally) play well together, but that doesn’t stop motivated people from trying.
Federated and decentralized services have a number of benefits over their centralized cousins. First, having a bunch of independent servers means that no single entity/ketamine-addled billionaire can turn the whole place into a Nazi bar or shut it down.
Second, users can choose which server to join based on their preferences (e.g., server rules, community culture, moderation practices) and, provided it’s federated, can still communicate with users on other servers. That helps avoid network effect problems.
Third, most federated systems prioritize account portability. That means users aren’t stuck with the first server they chose, which avoids vendor/system lock-in.
Fourth, most federated systems are designed to interoperate with other systems. The ActivityPub (AP) protocol that Mastodon uses, for example, also plays nicely with other services, like blogging software, news sites, video platforms, etc. ATProto is still new, and outside of Bluesky, there aren’t many noteworthy projects using the protocol yet, save for some custom labelers and bots, though some projects do rely on Bluesky’s API, including the aforementioned Firehose.2
Some folks have rightly questioned whether Bluesky/ATProto is actually a federated, decentralized protocol, given that the main application associated with the protocol is currently quite centralized and owned by Bluesky Social, an American benefit corporation. The developers of the protocol, who coincidentally also happen to be the developers of Bluesky, definitely think it will be though! Since that isn’t exactly germane to this post, I’m going to take the developers at their word.
Hugging Face Faceplants
So, with all of that, let’s get to the good stuff. As I mentioned earlier, a Hugging Face employee published a dataset of 1 million posts, which included not only post text, image links and other information, but also each user’s decentralized identifier, or DID. The DID is a persistent, long-term identifier assigned to every account. For example, my account handle (@priva.cat) maps to the DID did:plc:kw4kcuzwirhgox25tq3uvkfk. I can change my handle, but my DID will remain the same.
If you’ve read this blog for any length of time, you might catch on to the fact that the DID (as well as the post content itself) can be used to identify me. In fact, it’s trivial to do so, using a site called Clearsky. In other words, this is personal data that relates to me.3
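Here is a small Python sketch of why a persistent DID matters in a scraped dataset. The rows are hypothetical: the DID is the one quoted above, but the old handle and the post texts are invented. The identifier pattern is also my assumption about the did:plc format (24 lowercase base32 characters); check the DID PLC spec before relying on it.

```python
import re

# Assumption: a did:plc identifier is "did:plc:" followed by 24 lowercase
# base32 characters (a-z, 2-7).
DID_PLC = re.compile(r"did:plc:[a-z2-7]{24}")


def is_did_plc(value: str) -> bool:
    """Check whether a string looks like a did:plc identifier."""
    return DID_PLC.fullmatch(value) is not None


# Hypothetical scraped rows, captured before and after a handle change.
# The old handle and the post texts are made up.
rows = [
    {"did": "did:plc:kw4kcuzwirhgox25tq3uvkfk", "handle": "old-name.example", "text": "first post"},
    {"did": "did:plc:kw4kcuzwirhgox25tq3uvkfk", "handle": "priva.cat", "text": "second post"},
]


def posts_by_did(rows: list, did: str) -> list:
    """Collect every post text tied to one DID, ignoring the handle."""
    return [row["text"] for row in rows if row["did"] == did]


did = "did:plc:kw4kcuzwirhgox25tq3uvkfk"
print(is_did_plc(did))
print(posts_by_did(rows, did))
```

Changing a handle does nothing to unlink the rows: anyone holding the dataset can still group everything an account ever posted under one stable key.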
When Hugging Face released the dataset, a large percentage of Bluesky users collectively shit bricks. Many were concerned about copyright infringement, particularly artists. Others were annoyed that their data was being used to train LLMs. Various blocklists were created which targeted not only the HF employee, but all folks with hf.co in their handle. The HF developer who posted the dataset eventually nuked it. The CEO thanked him for nuking the dataset, but stopped short of apologizing. The Lead Ethicist of HF did apologize. Kinda.
Then some other person named Alpindale posted their own dataset of 2 million posts, to “correct the injustice” after the original dataset was nuked. I suspect it was more to annoy the people who complained, but I dunno. And new datasets keep popping up daily.
Now, it’s probably a reasonable argument to make that Bluesky Social data is public data. For one, it’s built into ATProto: all servers across the network need to know which posts (and blocks) exist in order to guarantee that posts get to the right people and, in the case of blocks, to respect the user’s request.
Bluesky and ATProto’s Privacy Notices both state this explicitly:
Profiles and Posts Are Public. The Bluesky App is a microblogging service for public conversation, so any information you add to your public profile and the information you post on the Bluesky App is public.
The Privacy Notices are also fairly comprehensive in the types of activities covered. These include provisioning the service, developing new products & services, troubleshooting, security, and “Sharing personal information with third parties as needed to provide the Bluesky App Services.” Bluesky relies on a variety of lawful bases to do this, primarily legitimate interests, consent, and performance of a contract. They are not particularly clear on which lawful basis applies to which processing activity though.
Bluesky, Parties, and Ken Lee
You might be asking yourself right now, if everything is public, why are so many people deeply butt-hurt about this? Bluesky says everywhere that posts and blocks and whatnot are all public — can’t people read?
The answer to this question is context.
I think most people using Bluesky assume that posting on a social network means that their posts are public. But “public” is contextual: it’s public within the confines of the network and for reasonable purposes (like making the platform work). Most people don’t assume that their tweets are being blasted to train large language models.
To illustrate this a bit more clearly, pretend you’re at a party. There are 100 people also at this party, so it’s a big group. You know many of these faces, but there are also some new faces, and some randos even wander in from off the street.
As parties sometimes do, everybody decides to play a game of Truth or Dare. It’s your turn, and you pick Dare. Someone says, “Sing us your favorite song.” Pretend with me that you’re a little disinhibited at this point, and you’re really getting along with everyone at the party. So you decide to go for it. You belt out Mariah Carey’s ‘Without You’, but manage to just absolutely bungle the shit out of it. Kinda like this very well-known 16-year-old internet meme.
Now, let’s say someone secretly recorded you while you badly sang your heart out, and uploaded it to YouTube, Reddit, and TikTok. Let’s say they also whacked it into a database to train a GenAI model of voices.
Now, things are a little different in this context. Bad karaoke Truth or Dare within the confines of that party is one thing, even if that disclosure is public as far as the law is concerned.4 But broadcasting your shame on YouTube and TikTok without your knowledge or consent in order to be mocked by internet strangers all over the world is something very different. Having your shame be used to train an AI model is something else entirely.
Yes, you consented to sing the song. After all, you didn’t have to participate in Truth or Dare. But the purpose (or reason) of your consent was limited to bonding with people at the party & social cohesion. You did not consent to having your rendition published online, or to have your voice be used to train an AI model. Those are different purposes.5 Context matters.
Conversations on Bluesky are a lot like a big internet party. Sure, the numbers are bigger, but at the most basic level, users agree to share information with one another publicly on the network because they want to communicate with their friends and other participants on the network. Technical details about the architecture aren’t even being considered. Should they be? I don’t know. But I do know that what caught everyone off-guard with the HF dataset debacle, and especially the dickheads who keep compiling ever-larger datasets,6 is that people are using “public” posts in a very shitty way, and that this went beyond the purposes of the consent we gave.
Training an AI model is not within the Bluesky privacy notice or within acceptable norms (no matter how much OpenAI/Google/Meta try to normalize it). Hell, Bluesky has gone so far as to commit to not using post data for training its own generative AI models. To have a bunch of oblivious or entitled jerks take Bluesky posts and use them for something that already pisses people off without even telling them first, betrays user trust in sharing things on Bluesky, just like posting your embarrassing rendition of Mariah Carey’s song on TikTok would.
Another Slap in the Face
The thing that really pisses me off more than the original disclosure is the fact that both the Bluesky Mod Team and Hugging Face admins, by doing nothing administratively (or technically), are essentially sanctioning this behavior. It’s like the host of the party encouraging the dickhead to post that video of you online.
Especially when those guys openly boast that Bluesky users deserved it, or that they don’t need to honor user requests or the law.
I had a look through Bluesky’s Developer Guidelines and Hugging Face’s Content Guidelines, and this kind of foolishness directly runs afoul of both (in addition to the law, which is also being ignored by everyone involved).
The Bluesky Guidelines are pretty clear:
All services must have a method for deleting content a user has requested to be deleted.
Developers must have a system for appropriately responding to all user reports of violations within their apps.
…
Developers should maintain reasonable security measures to protect against unauthorized access or disclosure of any end-user information or app data.
Failure to respond appropriately to known violations of Bluesky Social’s Terms of Service, Privacy Policy or Community Guidelines may result in suspension of access to services or features run on Bluesky’s infrastructure. Developers should keep records of all reports of violations and their responses, and Bluesky may at any time request such data to ensure compliance with its policies.
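For what it’s worth, honoring deletions is not technically hard. Below is a rough Python sketch of how someone holding a scraped dataset could do it. It assumes delete events arrive in a Jetstream-style JSON shape (an assumption on my part, not something the guidelines specify), and the dataset, DIDs, and record keys are all made up.

```python
# Hypothetical local dataset keyed by (author DID, collection, record key).
dataset = {
    ("did:plc:aaaa", "app.bsky.feed.post", "rkey1"): "a post the author later deleted",
    ("did:plc:aaaa", "app.bsky.feed.post", "rkey2"): "a post the author kept",
    ("did:plc:bbbb", "app.bsky.feed.post", "rkey1"): "someone else's post",
}


def apply_delete(dataset: dict, event: dict) -> bool:
    """Remove the record a delete event points at. Returns True if removed."""
    commit = event.get("commit") or {}
    if commit.get("operation") != "delete":
        return False
    key = (event["did"], commit["collection"], commit["rkey"])
    return dataset.pop(key, None) is not None


def purge_author(dataset: dict, did: str) -> int:
    """Honor a direct request from a user: drop every row tied to their DID."""
    doomed = [key for key in dataset if key[0] == did]
    for key in doomed:
        del dataset[key]
    return len(doomed)


# Assumed event shape for a deleted post (no record body, just a pointer).
delete_event = {
    "did": "did:plc:aaaa",
    "kind": "commit",
    "commit": {"operation": "delete", "collection": "app.bsky.feed.post", "rkey": "rkey1"},
}

print(apply_delete(dataset, delete_event))
print(purge_author(dataset, "did:plc:bbbb"))
print(sorted(dataset))
```

Two small functions cover both cases: a post the user deleted on the network, and a user who asks to be removed entirely. A static dump that was published once and never updated can do neither.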
Here are the relevant sections from Hugging Face’s Content Policy, which states:
"We do not tolerate the following Content on our Platform:
...
Content published without the explicit consent of the people represented;
Content that violates the privacy of a third party;
Content that violates any applicable law or regulation;" …
None of these dataset warriors have obtained consent. None of them are being transparent to users. They’re just breaking the law, ignoring Bluesky & HF’s policies, ignoring user deletion requests, and generally just being dicks. And no, there’s no ‘doing it to own the libs/haters’ exception to transparency and the purpose limitation.
What annoys me the most is that Bluesky and HF are out in the press preaching how they’re different and better than the dominant market players, but they’re all playing by the same rulebook. Both Bsky and HF pretend to care about the little guy (their users) and tout their very noble goals (decentralization! data portability! fighting concentrations of power!), but in practice, they’re barreling down the same road of enshittification as the rest of the Broligarchy. They’re just doing it at warp speed instead of gradually over a decade.
Once again, we’ve got a collective action problem that’s being ignored in favor of technological progress, big money, data extraction, and libertarian notions of ‘public data’.
It’s a shitty look. Both Bluesky and HF are acting like the host who’s egging the dickheads on, and it’s really disappointing as a user to know that this is probably what we should have expected all along.
Screw you Elon, it’s always Twitter to me.
Many, many people, most notably Christine Lemmer-Webber, have stated that Bluesky/ATProto is not, in fact, a federated/decentralized system because currently only one server, bsky.social, is operational. See: https://dustycloud.org/blog/how-decentralized-is-bluesky/
For all my American friends playing along, personal data is broad in the EU (and other privacy/data protection countries, including many states in the US!), and refers to any information relating to an identified or identifiable individual. That information can relate to them directly or indirectly. That ‘relating to’ language is broad, much broader than just name and government ID: a DID, or a comment with sufficient information about me, is personal data. Yes, even in the US.
A recent case decided by the Court of Justice of the European Union touched on similar themes: Schrems v. Meta Platforms Ireland Ltd., Case C-446/21, 4 October 2024. In that case, the court addressed a few questions, one of which was whether Facebook could target advertising to Mr. Schrems based on his sexual orientation when Schrems had disclosed this information publicly at a conference, but not on Facebook.
The court held that just because Schrems made his sexual orientation public in some contexts, that did not authorize Facebook to process this information for other purposes (aggregating and analyzing this information to target ads to him on Facebook).
This purpose thing I’m blathering on about is a core principle underscoring almost all data protection laws. It’s commonly referred to by lawyers as the ‘purpose limitation principle’. Here’s a good explanation of the purpose limitation principle from the ICO.
I don’t usually do the name-calling thing but this guy genuinely pisses me off, because I think this is mostly motivated by ‘triggering the libs’ 4chan-level bullshit. I know I shouldn’t read the comments, but the comments attached to this dataset enrage me.
Thank you to Prof. Michael Veale and @iamgregb for clarifying a few points, and correcting a 404 error, respectively! I love my readers and colleagues in this community.