Diving Deeper into DeepSeek
You're in for a treat -- a guest post by Pete (aka Crapotkin), who has been doing a deep seek into the company's ToS & API terms.
// I’m cross-posting this with Pete, who has also published it on his blog. I’m trying to get him to write more, and so if you like this post, why not subscribe and provide more encouragement — CL //

It’s been nearly two weeks since Carey wrote her last post about her deep-dive on DeepSeek, and the mania for it only seems to be getting wilder. If you’re the sort of privacy and data protection person who makes up the readership of this ‘stack, that means you’re probably hearing a lot of this sort of question:
My business really, really wants to use the new hotness that is DeepSeek. What do I do?
There’s a very quick and easy answer to that, which is that using DeepSeek’s open source model may well be a good idea, but YOLO’ing data into its API or ChatBot is most certainly not.
Showing the working to support that conclusion takes a little longer, so much so that I’ve had to break it into two posts. This one deals with why using DeepSeek’s API is… let’s just call it a much bolder move than going with one of the established providers. A post on use of the model itself will follow (reasonably) soon.
DeepSeek Is Not Your Standard Business LLM Provider
If you’re being asked to approve use of DeepSeek, there’s a fair chance that the request is coming from the engineering department. If that’s the case, they’ll most likely be wanting to use DeepSeek’s API.
In addition to its ChatBot, DeepSeek makes available an API so that developers can easily build its service into their products and tools. This is a pretty standard thing for the large LLM providers to do, and developers are getting quite used to building with those tools. APIs are made available subject to terms of use from the provider. So how much difference is there between DeepSeek and the other big providers?
Actually, quite a lot.
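Before getting into the terms, it’s worth seeing how little daylight there is at the code level. Here’s a minimal sketch (assuming DeepSeek’s OpenAI-compatible endpoint and the standard openai Python client; the API key is a placeholder):

```python
# Minimal sketch: calling DeepSeek's API via the standard `openai` client.
# DeepSeek exposes an OpenAI-compatible endpoint, so only the base URL,
# API key, and model name differ from an OpenAI integration.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",     # placeholder, issued via DeepSeek's platform
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # V3; "deepseek-reasoner" selects R1
    messages=[{"role": "user", "content": "Summarise this contract clause."}],
)
print(response.choices[0].message.content)
```

Swap the base URL and model name and this is, near enough, an OpenAI integration. The switching cost lives almost entirely in the terms, which is where the differences start.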
No Business Offering
The fundamental difference between DeepSeek and other big providers is that, whilst DeepSeek have API terms, they don’t have business terms. The line between API and business offerings can often seem pretty indistinct. Indeed, the big LLM providers tend to elide the two: OpenAI and Anthropic combine their business and API terms, and production use of Google’s models tends to fall under its Google Cloud Platform terms. A casual reader might look at DeepSeek’s “Open Platform Terms of Service” (let’s call them “platform terms”) and conclude that they were dealing with a similar set-up. The fact that the terms contemplate “enterprise developer” customers does nothing to allay that impression.
But it’s the wrong impression. Really. DeepSeek’s API is better thought of as a more nerdy outgrowth of its consumer app. The platform terms themselves are presented as a supplement to the consumer Terms of Use (consumer terms) discussed in the last two posts. Both the platform terms and the consumer terms apply, with the platform terms taking priority. This is very different to the offerings of the big US providers set out above, who tend to have a very clear division between their consumer and business terms.
Why does that matter? Broadly, because businesses tend to be able to get better standard terms. A provider can safely assume that the vast majority of their potential consumer customers are at the sharp end of the can't even problem. Businesses are, on average, more likely to read the terms (or pay someone to do it for them), to try and negotiate the terms, and to understand their legal and technical nuances. That results in some fairly predictable differences between consumer terms and business ones.
Output Liability
As previously discussed, the question of liability for outputs is a pretty important one. The whole point of an LLM is to be able to produce outputs, and many of the big current debates turn on how reliable they are. Are they accurate? Do they reflect systematic bias? Are they obscene? Could they be used as part of a nihilistic scheme to destroy the world? A business may have some limited control over what its users are putting into an LLM (more than the provider, anyway). If it knew what the output from the provider was going to be, though, it wouldn’t need an LLM in the first place. Asking the business to be liable for something it can’t control seems a little unreasonable.
This is a particularly acute problem with IP infringement. If an LLM is producing Musk/Trump slashfic or ducking questions about Tiananmen Square, it’s at least pretty easy to look at the output and see that happening. But the question of whether an LLM’s output infringes any of the copyrights in its training data is a much harder one, both legally and practically.
Take coding as a concrete example. Software development is one thing LLMs are widely agreed to be genuinely useful for, and the idea that businesses might use them to help write code is intuitively plausible and seems to have been borne out empirically. But that LLM was probably trained on masses of open source code. If a business uses an LLM to help write code, might it end up liable for infringement? Or find its codebase subject to the terms of an open source license? It’s really very hard to know until we have more caselaw.
In the face of that sort of uncertainty, a lot of businesses might opt to just wait and see how the legal questions turn out. In response to concerns like this, the big providers started making guarantees around IP in late 2023. Anthropic, Google and OpenAI now all provide their business customers with IP indemnities. Here’s an example from OpenAI’s business terms:
We agree to defend and indemnify you for any damages finally awarded by a court of competent jurisdiction and any settlement amounts payable to a third party arising out of a third party claim alleging that the Services (including training data we use to train a model that powers the Services) infringe any third party intellectual property right.
So, if you’re a business getting sued because someone says your LLM-generated content is a copy of their work, you can expect the provider to take ownership of that problem for you. If you are a consumer, however, you’re on your own. Consumer terms tend not to contain IP indemnities and, at the risk of getting repetitive, DeepSeek’s terms are consumer terms.
Model Training
It’s not just LLM output you need to worry about, though. As we saw in the last post, it’s standard practice for the big US LLM providers to train their models on the content (inputs and outputs) processed by the consumer app. As with IP, confidentiality tends to be a pretty big concern of most businesses - the sort of thing that might keep them from using LLMs at all if there was too much uncertainty. Google’s language around the use of customer data submitted through Vertex AI is thus pretty typical of the sorts of reassurances US providers try to provide: “Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.”
DeepSeek’s terms certainly don’t contain anything that categorical. What uses of customer data can DeepSeek make? I have been scratching my head about this for some time and, honestly, I still have no idea.
Here, in its entirety, is the section of the API terms dealing with content usage:
4.1 You are responsible for all Inputs you submit to our Services and corresponding Outputs. By submitting Inputs to our Services, you represent and warrant that you have all rights, licenses, and permissions that are necessary for us to process the Inputs under our Terms. You also represent and warrant that your submitting Inputs to us and corresponding Outputs will not violate our Terms, or any laws or regulations applicable to those Inputs and Outputs.
4.2 Subject to applicable law and our Terms, you have the following rights regarding the Inputs and Outputs of the Services: (1) You retain any rights, title, and interests—if any—in the Inputs you submit; (2) We assign any rights, title, and interests—if any—in the Outputs of the Services to you. (3) You may apply the Inputs and Outputs of the Services to a wide range of use cases, including personal use, academic research, derivative product development, training other models (such as model distillation), etc.
The relevant section of the consumer terms reads as follows:
4.1 You are responsible for all Inputs you submit to our Services and corresponding Outputs. By submitting Inputs to our Services, you represent and warrant that you have all rights, licenses, and permissions that are necessary for us to process the Inputs under our Terms. You also represent and warrant that your submitting Inputs to us and corresponding Outputs will not violate our Terms, or any laws or regulations applicable to those Inputs and Outputs.
4.2 Subject to applicable law and our Terms, you have the following rights regarding the Inputs and Outputs of the Services: (1) You retain any rights, title, and interests—if any—in the Inputs you submit; (2) We assign any rights, title, and interests—if any—in the Outputs of the Services to you. (3) You may apply the Inputs and Outputs of the Services to a wide range of use cases, including personal use, academic research, derivative product development, training other models (such as model distillation), etc.
4.3 In order to fulfill the requirements stipulated by laws and regulations or provide the Services specified in these Terms, and under the premise of secure encryption technology processing, strict de-identification rendering, and irreversibility to identify specific individuals, we may, to a minimal extent, use Inputs and Outputs to provide, maintain, operate, develop or improve the Services or the underlying technologies supporting the Services. If you refuse to allow us to process the data in the manner described above, you may provide feedback to us through the methods outlined in Section 10.
…
As you’ll have noticed, the terms are pretty much identical, except that the consumer terms have an additional paragraph allowing DeepSeek to use customer data. But, remember, it’s all part of one big agreement. Those consumer terms also apply to API usage, unless the API terms contradict them. Does the fact that the API terms reproduce the same language but omit the usage paragraph mean that the API terms contradict the consumer terms? Does the fact that these terms are governed by Chinese law affect the answer? My guess would be that the right to use data still stands but, frankly, I don’t have a clue. The only firm conclusion I can draw is that this is not a set of terms drafted to soothe the worries of twitchy compliance teams at large businesses.
Data Protection
US big tech, somewhat notoriously, has an uneasy relationship with EU data protection law. Put the name Google, Meta, Amazon or Microsoft into GDPRHub’s advanced search and you’ll find decisions dealing with those parties in respect of pretty much every GDPR article that a private actor can be accused of violating. For the purposes of this post, though, I’m going to confine myself to Article 28 (Processor) and Chapter V (Data Transfers).
Controllers, as we all know, determine the purposes and means of processing, and processors follow those instructions. Whilst you need to squint a bit to fit the reality of big tech’s relationship with its business customers into this formulation, tech companies certainly do provide long and expensively-drafted data processing agreements in which they, amongst other things, promise to only process customer data in accordance with the documented instructions of their customer.
What data actually counts as “customer data” in the context of LLM provision can be a bit of a fraught question. In particular, having reviewed the terms of the big providers, I’m really very confused about whether and when they consider themselves processors or controllers of user account data. Take a look at how Anthropic deals with this. They have special service-specific terms for when enterprise customers are using Claude through the enterprise interface rather than the API:

[screenshot: Anthropic’s service-specific terms, directing users to its Privacy Policy]

Seems pretty clear that they’re a controller, right? Let’s just click on that privacy policy link:

[screenshot: Anthropic’s Privacy Policy, stating that it does not apply to commercial products such as Claude for Work]

So Claude for Work customers should direct their users to a Privacy Policy that claims not to apply to users of Claude for Work? Thanks for that.
Google’s stance on user accounts is less explicitly contradictory, but it’s not something you could easily work out from its customer documentation. Their terms hold that Google is a processor of all “data provided to Google by Customer or End Users through” the applicable service. Given that user account details are going to be provided either by the business or the user, that reads as though Google should be a processor of account data too, though it’s a little ambiguous. The determinative thing for me is that, if you actually have a Google Workspace account and click on your profile, Google shows you a privacy notice, fulfils rights requests, and does a whole host of other obviously controller-ish things.
These sorts of mistakes and ambiguities might generate a certain amount of scepticism about whether US providers actually know themselves what usage rights they have over the personal data of their business customers. But it’s important not to take this too far. The DPAs of all the providers mentioned are very clear that they cover the personal data in content submitted to the services. And whilst data around accounts, access, logins etc. is both revealing and consequential, the content data is more so.
DeepSeek’s approach to data protection generates uncertainty of a much more fundamental kind. The data protection section of the terms is brief but, for me at least, kind of fascinating:
5.3 We will collect and process the personal information you provide as the data subject in accordance with the "DeepSeek Privacy Policy". However, when your end users access downstream systems, applications, or functions that you've developed based on the open platform, the processing rules for their collected personal information are not covered by this privacy policy. As the controller of personal information processing activities in that scenario, you should disclose the relevant privacy policy to your end users.
First off: are they a processor or not? They say that the customer is the data controller. And the processing of personal data contained in content submitted through the API isn’t subject to their privacy policy. So… are they a processor? If they are, why wouldn’t they say so?
The question is all the more frustrating because of the semi-fluent GDPR-speak. That’s presumably partly explained by the Chinese data protection framework, which shares a lot of its approach with the GDPR. The Personal Information Protection Law of the People’s Republic of China, I learnt while writing this, defines “personal information processor” (PIP) as something directly analogous to “controller” (they even talk about determining the “purposes and method” of processing). It also has an (undefined) concept of “entrusted party of personal information”, which is required to process personal information “as agreed [with the PIP] and shall not process personal information beyond the agreed purpose and method of processing.” So - a processor. Finally, the PIP is also required to “agree with the entrusted party on the purpose, duration, and method of entrusted processing, type and protection measures of personal information as well as the rights and obligations of both parties, and supervise the personal information processing activities of the entrusted party.” And this agreement needs to be formalised in a contract - a DPA, in other words.
While the above explains DeepSeek’s apparent familiarity with EU data protection concepts, it raises at least as many questions as it answers. Most importantly: where’s the DPA contract? It certainly sounds like DeepSeek intends to be a processor - that seems the logical implication of saying that the developer is the controller and DeepSeek isn’t. But the DPA is nowhere to be found.
Maybe the answer here is that DeepSeek considers itself a joint controller (another European data protection concept mirrored in the PIPL)? This would, presumably, mean that DeepSeek was training its models on the API data, especially if my suspicions about which set of terms controls in Section 4 are accurate. However, instead of assisting the developers or organisations using the data, DeepSeek offloads all the disclosure, handling of rights requests and, presumably, liability onto them. Clearly this would be worse from both the compliance and confidentiality points of view. But, mostly, it’s just really unclear what’s actually going on.
Obviously I’m not qualified to make any assessment of how well this works under Chinese law. From the EU/UK standpoint, though, it’s fair to say that DeepSeek’s terms are fundamentally screwed. But can you blame them? DeepSeek do not look to me like a company that has embarked on a big GDPR compliance journey - and if they did, how do we see that going, exactly?
China
By that I mean that DeepSeek is a Chinese company. Its primary market is in China, its servers are in China, and it’s subject to Chinese law. In theory, the data transfers to China story still has a lot of road left to run. The DPC has been investigating TikTok’s transfers for the past God-knows-how-long, with no sign of a decision yet. And NOYB have now got in on the action. But the EDPB has already published its thoughts on the possibility of lawful transfers to China. Those are long, detailed, and speculative in places, but I think they can fairly be summarised as follows: Chinese law gives the state broad powers of access to data and gives data subjects few enforceable safeguards against that access, which makes lawful transfers to China very hard to pull off.
It’s therefore hard to blame DeepSeek for not playing a GDPR compliance game that they have no realistic chance of winning.
The potential scope of government access is a compliance worry for businesses. But more fundamentally, it’s a confidentiality worry. Stories about industrial espionage by China have been a constant of the international business press for more than a decade. This has been heating up as the US has taken a more confrontational stance with China on trade issues, meaning that it’s probably going to get a lot worse before it gets better. Much of the reporting emphasises sophisticated measures, the increasing use of human intelligence and a whole load of other stuff designed to give CISOs sleepless nights. If a business is remotely worried about those sorts of threats, sending data to servers where the CCP can just rock up and ask for access makes things pretty easy for attackers or shadowy government types.
What’s a Business to Do?
All of the above may make it sound like I have some axe to grind against DeepSeek, as well as completely unjustified trust in US LLM providers. I really don’t. DeepSeek has one advantage over all the other providers covered, and it seems to me pretty decisive. You don’t actually have to trust their service in order to use their models: both R1 and V3 are available on open source terms. The most obvious US comparators are Meta’s Llama models, which are less powerful, less open, and subject to more restrictive license terms.
Running DeepSeek’s models on more trusted hardware can solve most of the problems outlined above. It avoids the whole ‘sending confidential or sensitive data to China’ problem, and means organisations are no longer subject to DeepSeek’s terms. They can also stop worrying about any other security problems DeepSeek may have as it continues to attract worldwide scrutiny and interest. If hosting the model yourself sounds like too heavy a lift and you’d prefer a nice API, an increasing number of vendors are now providing that service (full disclosure - I work for such a vendor).
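For a sense of what self-hosting looks like in practice, here’s a minimal sketch, assuming a machine with a capable GPU and using Hugging Face’s transformers library; one of DeepSeek’s distilled R1 variants stands in for the full R1 and V3 models, which need far more serious hardware:

```python
# Minimal sketch: running a distilled DeepSeek model on your own hardware.
# Nothing here touches DeepSeek's servers - the weights are downloaded
# once from Hugging Face and all inference happens locally.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # ~7B distilled variant
    device_map="auto",  # spread the weights across available GPU(s)
)

# Inputs never leave your own infrastructure, so DeepSeek's platform
# and consumer terms don't enter the picture at all.
out = generator(
    "Explain the difference between a controller and a processor.",
    max_new_tokens=256,
)
print(out[0]["generated_text"])
```

The particular stack doesn’t matter much - vLLM, Ollama and the various managed hosts can all serve these weights. The point is that the model runs where you choose, under the model license rather than under DeepSeek’s terms.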
Using the model in this fashion may raise other issues, both legal and technical. OpenAI have been making angry (and pretty rich) protestations about how DeepSeek may have illegally trained R1 on OpenAI’s data. Mark Zuckerberg has suggested that DeepSeek may be an engine of Chinese censorship. Equally, the reliability of US big tech is looking increasingly shaky. Xitter has been in its saprophytic tailspin for years now, and Meta has set up its own AI-enabled Ministry of Truth. On the data protection front, Trump appears to be actively trying to pick a fight over data transfers, and the general stance among the broligarchs seems to be that they’re fed up with listening to all this European nonsense about human rights.
In the next post we’ll be discussing all of that, and trying to think about what impact DeepSeek, open source models and self-hosting might have on it.