I Do a Deep Dive on DeepSeek
A reader tipped me off to a new chatbot out of China. While initial reports claim that it's the new big bad, on review, it's only slightly worse than everybody else.
Over on our members-only Slack channel (which is turning out to be a boon for my ego and a resource for surfacing blog ideas), one of my dear subscribers tipped me off to a new LLM platform — DeepSeek, a ChatGPT/Gemini/Perplexity-type chatbot that launched last month (late December 2024). According to the South China Morning Post, DeepSeek Artificial Intelligence Co., Ltd., released its latest model, DeepSeek V3, which boasts 671 billion parameters and is based on a “Mixture-of-Experts” language model that, according to its model card, adopts Multi-head Latent Attention (MLA), the DeepSeekMoE architecture, and multi-token prediction for training and efficient inference.1
I recognize that all of this reads a bit like technobabble (there’s a toy code sketch after the cost comparison below, if you want it unpacked), but DeepSeek’s innovation solves a serious problem: the bottleneck between computation and communication during the training process. That is, while GPUs are screamingly fast at performing operations (that FLOP thing mentioned in the AI Act), transferring those results into memory, or moving them around to other parts of the GPU, is much slower. This increases latency during training, which means it costs more (computationally, time-wise, and in $$$). The DeepSeek team trained V3 over two months at a total cost of $5.58 million, a fraction of the cost and training time of other models. For example:
GPT-4 Training Cost = $78.4 million
Gemini Ultra Training Cost = $191 million
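So what does “Mixture-of-Experts” actually mean? The rough idea: the model is split into many smaller “expert” networks, and a lightweight router activates only a few of them for each token, so most of those 671 billion parameters sit idle on any given forward pass. Here’s a toy sketch in Python/PyTorch (emphatically not DeepSeek’s code; it ignores the shared experts, fine-grained expert segmentation, and load-balancing tricks that DeepSeekMoE actually layers on top):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: a router scores each token and only
    the top-k experts run for it, so most parameters stay idle per token."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # the gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

The appeal of the design is that total capacity scales with the number of experts while per-token compute scales only with k; in V3’s case, only about 37 billion of the 671 billion parameters are reportedly activated per token, which is part of how training stayed (relatively) cheap.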
According to the model card and their attached research paper, DeepSeek V3 outperforms other open-source models and rivals many closed-source models, like GPT-4o and Claude 3.5. Here’s a bar chart comparing the accuracy of the V3 model against the competition on various benchmarks:
The company released its V3 model weights via HuggingFace, along with a deployable version you can run locally (GitHub), in case you’re a technical sort who wants to dig in more deeply. If you’re not, or have better things to do with your life and just want to play around, you can use the web interface, which is currently free to use. API access costs a bit more, but it is competitively priced against services like OpenAI/Anthropic.
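If you do go the API route, it speaks the OpenAI-compatible chat-completions dialect, so switching over takes only a few lines. A minimal sketch, assuming the openai Python client, an API key from DeepSeek’s platform, and the base URL and model name their docs listed at the time of writing (any of which may change):

```python
# Minimal sketch of calling DeepSeek's OpenAI-compatible API.
# Assumes: the `openai` Python client, an API key from DeepSeek's platform,
# and the base URL / model name from their docs at the time of writing.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # placeholder; keep real keys out of code
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek V3 at the time of writing
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Section 4.1 of your Terms of Use."},
    ],
)
print(response.choices[0].message.content)
```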
Now that you’ve got an idea about what DeepSeek V3 does, let’s talk about some of the curious legal quirks. Bluntly, DeepSeek's privacy policy & terms of service raise some interesting legal questions, especially when compared to other LLM providers. I’ll be comparing DeepSeek’s policies to those of five other LLM providers, just to highlight how lopsided the legal language is across this whole LLM landscape.
I’ve also got bonus content for paid subscribers, like why DeepSeek continues to wreck my brain. During the course of my research, I created a spreadsheet identifying similarities and differences in terms between the big LLM providers from a contract/policy perspective. Finally, I dig into outputs, and compare the Chinese and English privacy & ToS documents in more depth. If you’re not already a paid subscriber, consider becoming one.
What DeepSeek Knows
Firstly, the results DeepSeek managed to uncover about me were staggeringly good, detailed, and extremely current, at least when internet searching was enabled. DeepSeek does not appear to honor robots.txt files (or it’s so new that Cloudflare and others haven’t added DeepSeek’s crawlers to their respective naughty lists).
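For reference, checking robots.txt before crawling is a solved problem; Python ships a parser for it in the standard library. A quick sketch of how a polite crawler is supposed to behave (the ‘DeepSeekBot’ user-agent string is my invention; I don’t actually know what, if anything, DeepSeek’s crawler identifies itself as):

```python
# Sketch: how a polite crawler checks robots.txt before fetching a page.
# "DeepSeekBot" is a hypothetical user-agent string, purely for illustration.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

for agent in ("DeepSeekBot", "GPTBot", "*"):
    url = "https://example.com/some-article"
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
```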
I gave it my usual query (‘Can you tell me anything about Carey Lening, a data protection consultant in Ireland?’) and it was genuinely concerning how much accurate information it uncovered. I won’t share the full response, but I was creeped out / impressed that DeepSeek knew not only that I write this newsletter, but that I write multiple times a week, and it provided a concise, but non-generic summary of what I cover. It also provided a summary of, and link to, the last article I wrote only a few days ago. By comparison, other LLMs only surfaced older content (or nothing at all).
However, when internet searching was turned off, it appeared to produce a mix of real and hallucinated results for famous people (like Jonathan Zittrain), and knew nothing at all about me. Unsurprisingly, DeepSeek does not currently block or suppress results for individuals — probably because they have not yet received (m)any erasure/RTBF requests on account of being less than a month old.
We Need to Talk About ~~Kevin~~ Those Terms
So, let’s turn to DeepSeek’s terms of use (I have more analysis on the privacy policy in Part 2).
I should note that when I first began this adventure, the very first thing I did was Google ‘DeepSeek concerns’, which surfaced a few ominous warnings, like this Medium post (‘Don’t use DeepSeek-v3!’), which specifically cited ToS issues, and this Reddit thread.
Both the Medium post and the Reddit thread call out troubling language about user responsibility for inputs and outputs (Section 4.1 of the ToS), use of input & output data for further model/product development (Section 4.2, ToS), and IP ownership of inputs and outputs (Section 5.1, ToS).2 But I wasn’t sure whether, in the grand scheme of things, DeepSeek was especially worse than the competition, all of whom generally have lopsided, user-unfriendly terms.
And so, because I have no social life or boundaries, and an unhealthy obsession with reading dense legal documents, I decided to test my theory by reviewing six LLM providers’ Terms of Service/Use and Privacy Policies/Notices.3 Specifically:
OpenAI
Anthropic
Google
Perplexity
DeepSeek
ByteDance
I chose ByteDance, which released its own conversational LLM, Doubao, in 2024, because it is one of the few Chinese chatbots available both inside and outside of Mainland China.
Here’s what I discovered.
User Liability: Yikes on a Stick
Every provider, probably even Google,4 states that users are responsible for their inputs — i.e., users must ensure that they have rights and/or legal authorization to use content they upload, and that inputs don’t violate each provider’s respective ToS.
However, DeepSeek is unique in that it also makes end-users liable for outputs generated by the LLM (Section 4.1, DeepSeek ToS).
You are responsible for all Inputs you submit to our Services and corresponding Outputs. By submitting Inputs to our Services, you represent and warrant that you have all rights, licenses, and permissions that are necessary for us to process the Inputs under our Terms. You also represent and warrant that your submitting Inputs to us and corresponding Outputs will not violate our Terms, or any laws or regulations applicable to those Inputs and Outputs. … [Y]ou retain any right, title, and interest that you have in the Inputs you submit. Subject to your compliance with our Terms, we assign to you all of our right, title, and interest—if any—in Outputs.
What’s interesting here is that this explicit ‘corresponding outputs’ obligation does not seem to exist in the Chinese version of the ToS.5
Now, it’s one thing to say ‘Don’t try to generate CSAM, or copyright-infringing works, or do crimes’. It’s another thing entirely to say ‘you’re responsible for all outputs’, even if your inputs are otherwise lawful. DeepSeek’s ToS contains a massive list of prohibited uses (Sections 3.2-3.6), including:
generating discriminatory or biased content;
encouraging or providing instructions for a criminal offence;
infringing on intellectual property rights or trade secrets, committing other violations of business ethics, or using algorithms, data, or platform advantages to engage in monopolistic or unfair competition;
exploiting, harming, or exposing minors to harmful content;
using the service for dangerous purposes that may have serious harmful impacts on physical health, psychology, society, or the economy, or violate scientific and technological ethics;
engaging in other uses prohibited or restricted by laws and administrative regulations, or that may harm DeepSeek's interests.
Again, if you deliberately ask the chatbot ‘how do I make Napalm’ or ‘render an image of Elon Musk making passionate love to Donald Trump’, that’s one thing (you’re in charge of the input, after all). But what happens when otherwise innocent uses lead to infringing, harmful, or plainly mistaken outputs from the LLM, without any user intent?
In short, I have no idea how one even begins to achieve the goal of ensuring that outputs do not violate any laws or the ToS itself. Neither does DeepSeek, apparently. Here’s a snippet of what it provided when I pasted Section 4.1 and asked how to reconcile this impossible condition:
Ambiguity Around Outputs: The clause extends responsibility to the "corresponding outputs," which is more problematic. Since outputs are generated by DeepSeek's system, users cannot predict or control them with certainty. This creates a potential imbalance in responsibility, as users are being asked to warrant something they cannot fully oversee. …
Possible Interpretations:
Narrow Interpretation: DeepSeek may intend for users to only warrant that their inputs are lawful and compliant, and that they will not intentionally misuse the outputs in a way that violates laws or terms. This interpretation would limit user liability to actions within their control.
Broad Interpretation: If taken literally, the clause could imply that users are responsible for ensuring the outputs themselves comply with laws and terms, regardless of whether the outputs are unexpected or unintended. This would be an unreasonable burden on users.
DeepSeek V3 went on to provide some clarifying language, and suggested that I let the team know about the discrepancy and push for the narrow interpretation by revising the language. Who knows, maybe the lawyers over at DeepSeek HQ will run their ToS through their own chatbot, or at least better reconcile the language with the slightly more reasonable Chinese version?
IP / Ownership Issues: Always a Messy Business
Every provider, including DeepSeek, recognizes that users own their inputs/prompts/submissions. However, most require an expansive royalty-free, perpetual, irrevocable, worldwide, non-exclusive right to use and share inputs/prompts with third parties, across a wide range of business uses. This is very standard, and exists across most software licenses that I’ve come across — LLM or otherwise.
Providers were a bit more variable on outputs. Anthropic, OpenAI, Perplexity, and Google all note that users own their outputs, either directly or via assignment from the model provider. ByteDance is a little weird in this regard: its Terms and Conditions claim broad ownership over ‘content’, which it distinguishes from ‘submissions’, defined as prompts/inputs. Content includes ‘articles, text, graphics, user interfaces, visual interfaces, photographs, trademarks, logos, videos, audio, images, applications, programs, computer code and other information’.
I don’t know. There’s also no assignment or license granted over that content either, so how this gets applied in practice is beyond my feeble brain’s ability to understand.
Finally, while DeepSeek does assign rights and interests in outputs, users who publicly disseminate outputs must:
(1) proactively verify the authenticity and accuracy of the output content to avoid spreading false information; (2) clearly indicate that the output content is generated by artificial intelligence, to alert the public to the synthetic nature of the content; (3) avoid publishing and disseminating any output content that violates the usage specifications of these Terms.
Of course, many of these conditions are littered throughout other agreements as well, but usually in a more generic ‘don’t use our platforms to do crimes/spread hate’ kind of way. Given the user liability point above, point 3 seems tricky to enforce in practice.
Using Your Data: Bog Standard Use That’s Still Somehow Surprising to People?
Both the Medium post & the Reddit thread brought this up as a ‘gotcha!’, but in truth, every provider includes language granting broad rights to use inputs and outputs for business and other purposes. For example, all providers include language permitting use of inputs and outputs to ‘provide, maintain, and improve services and develop other models or services.’ There are also broad data-sharing rights, with regulators, governmental authorities, and law enforcement, as well as with third parties, including business partners, affiliates, vendors, copyright holders, and other users.
Maybe I’ve read one too many lopsided ToSes in my day, but absolutely none of this is unique. Remember folks, unless you’re an enterprise client with negotiating power, using their platforms means that you’re also giving these companies access to, and ultimately, the right to use your data pretty much however they want. Sometimes, the companies are reasonable about wielding their power (Anthropic, for example), but usually they’re grubby little bastards about the whole thing.
It’s good to keep that in mind, and not, say, upload your company’s business secrets, or share your shiny medical breakthrough or ideas with an LLM. Even if the LLM isn’t based in China.
Anyway, I share other observations that surprise me, and might surprise you, in Part 2, which will go out shortly. Remember, Part 2 is only for paid subscribers, so if you want in on that action, consider upgrading your subscription.
Also, if this was interesting to you, please leave a comment or share this post with your friends/frenemies, business colleagues, or all those AI experts out there.
For even more details on DeepSeek’s approach, this article (from a research scientist at Riot Games) offers a good summary based on DeepSeek’s earlier V2 model, while still remaining reasonably accessible to non-ML folks like me.
There’s a bit of a discrepancy — the Medium post and Reddit thread both identify Section 5.1 as the culprit (which covers IP generally). However, ownership rights to inputs and outputs are discussed primarily in Section 4.1, and to a lesser extent, Section 3.1.
To make this manageable, I enlisted the help of Google’s Notebook LM, which is one of my favorite research tools, precisely because it offers a very streamlined way to pinpoint where a relevant legal condition or term exists, using natural language across multiple, dense documents. It’s Ctrl-F on steroids, with natural language and an LLM summarizer.
I say probably because Google’s “Privacy Hub” for Gemini is distinct from, and yet sometimes integrated with AND superseded by, Google’s larger Privacy & Terms notice. I don’t know if this is intentional, but it indicates that Google gave the policy short shrift, prioritizing a rushed product release over providing an intelligible, legally compliant, transparent document. Some of the readability issues come down to the fact that the Gemini docs team used the wrong platform, but that seems like a relatively simple fix, if they actually want to fix it.
I do not speak Chinese. Therefore, I am relying on an imperfect machine translation. While I try to make allowances for that, it’s possible my interpretation may be off. Here’s the machine translation of the Chinese ToS, Section 4.2:
Unless the law stipulates otherwise or otherwise, based on the content exported [presumably, this refers to outputs —CL] or generated by this service, you will use and handle it yourself and be responsible for the behavior used. In order to avoid doubt, if the generated content contains the intellectual property rights or other legal rights and interests we enjoy, the relevant rights are still enjoyed by us, and the original ownership will not be transferred due to the synthesis of the content.
If someone wants to take a stab at translation, I would be eternally grateful. I also have lots of thoughts on the Chinese-language ToS/Privacy Policy, which is staggering in comparison to the English version. I dig into those more in Part 2.