Deep dive into DeepSeek advances (via Prasad Raje) / Feb 2025
Srinivasan Keshav posted a link to this excellent deep dive by Prasad Raje of Udemy into the advances DeepSeek R1 has made, viewed from a core-technology perspective.
- Multi-head Latent Attention (MLA). In the famous Google "Attention is all you need" paper, the attention block is responsible for a lot of the magic of LLMs but is also compute heavy [...] DeepSeek has innovated here with multi-head latent attention, which essentially reduces the size of the matrix multiplications used to generate the K,V vectors that feed the attention block. Combined with KV caching, this reduces the memory needs [...] (see the MLA sketch after this list).
- Mixture of Experts (MoE). The key idea here is that instead of feeding each token through one massive FFN, the single FFN is broken into a number of smaller FFNs and each token is routed through only a subset of them. [...] each of these smaller FFNs learns during training something specific about how to transform tokens, hence becoming an "expert". DeepSeek took MoE to a roughly 670B-parameter scale that no one had done before [...] creating 256 routed FFNs and sending each token through only 8 of them (see the MoE sketch after this list).
- Multi-token prediction (MTP): [...] you predict more than one future token at each position and back-propagate the aggregate error. The intuition is that each training step yields a denser learning signal for the model weights, thus reducing the total number of training steps needed [...] DeepSeek took this idea further, added innovations of their own (sequential vs. parallel MTP) and used this to reduce training time (a sketch follows below). -- Prasad Raje
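To make the MLA idea concrete, here is a minimal PyTorch sketch, not DeepSeek's actual implementation: it omits the decoupled rotary embeddings, query compression and causal masking that the real architecture uses, and all dimensions are illustrative. The point is that only the small latent vector `c_kv` needs to be cached per token, and full-size K and V are rebuilt from it on the fly:

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Toy multi-head latent attention: K/V are reconstructed from a small
    shared latent instead of being projected (and cached) at full size."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: this small latent is the only thing that gets KV-cached.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: rebuild full-size K and V from the latent on the fly.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, d = x.shape
        c_kv = self.kv_down(x)                       # (b, t, d_latent) -- the cached KV state
        if latent_cache is not None:                 # append to latents from earlier steps
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), c_kv              # return the latent cache for the next step
```

Caching the 128-dimensional `c_kv` per token, instead of separate full-width K and V tensors for every head, is where the memory saving comes from.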
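A similarly simplified mixture-of-experts layer follows, with a made-up 16 experts and top-2 routing rather than DeepSeek-V3's 256 routed experts and top-8, and ignoring the shared expert and load-balancing machinery the real model uses:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy MoE layer: many small FFN "experts", each token routed to only top_k of them."""
    def __init__(self, d_model=64, d_ff=128, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)     # each token picks its top_k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens that chose expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Because only 2 of the 16 experts run for any given token here (8 of 256 in DeepSeek-V3, so only about 37B of the ~670B parameters are active per token), the compute per token stays far below that of one monolithic FFN of the same total size.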
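Finally, a toy illustration of the multi-token-prediction training objective: extra heads predict tokens further ahead, and their losses are combined into one aggregate error to back-propagate. DeepSeek-V3 chains its MTP modules sequentially with their own transformer blocks; the independent linear heads here are just the simplest way to show the idea:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Toy multi-token prediction: head 0 predicts the next token, head 1 the
    token after that, and both losses are back-propagated together."""
    def __init__(self, d_model=64, vocab=1000, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

    def forward(self, hidden, targets):              # hidden: (b, t, d), targets: (b, t)
        total = 0.0
        for depth, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-depth])        # position i predicts token i + depth
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, depth:].reshape(-1),
            )
        return total / len(self.heads)               # aggregate loss over all prediction depths
```

Each training step thus pushes gradient signal from several future positions through the same hidden states, which is the "more changes per step" intuition described above.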