Tero Keski-Valkama’s Post

Tero Keski-Valkama

Helping AIs cure cancer. AI generalist working from Spain. Experience in leadership, AI research, software engineering.

A nice optimization that leverages redundancy between the key and value representations, and also between heads. It probably conveys information more effectively by correlating keys (and hence queries) with values, which would explain why it works better than multi-head attention.

Pramodith B.

AI Engineer @ LinkedIn | Posts weekly about AI

Multi-Head Latent Attention (MLA) is a technique for computing attention, newly proposed by the creators of DeepSeek-V2. MLA is lighter on the KV cache, making it easier to scale to long sequences, while also outperforming traditional Multi-Head Attention (MHA). 🚀

💡 Idea
LoRA → Low-Rank Weight Matrices
GaLore → Low-Rank Projection of Gradient Matrices
LoReFT → Low-Rank Projection of Hidden States
Using low-rank projections to make computation more efficient has been common in the past few months. MLA uses the same idea.

💻 MLA → Low-Rank Projection of the Query, Key, and Value Matrices.

📚 Deep Dive into MLA
Let's assume we have 10 attention heads per layer and each head's output dimension is 24. In traditional MHA, the KV cache for each token stores a Key tensor of shape (10 * 24) and a Value tensor of the same shape. For a sequence of length L, that's L * 2 * (10 * 24). 🧠

In MLA, the idea is to create a common "compressed tensor" called C_KV that's much smaller, i.e., << (10 * 24). This is achieved by using a low-rank weight matrix to down-project the hidden state of a token. The shape expected during self-attention is then recovered by up-projecting C_KV with another low-rank weight matrix. 🔎

So assume we want C_KV to be of size 120, where the original hidden size is (10 * 24) = 240:
C_KV = W_down[120, 240] * H[240, 1]
K = W_upK[240, 120] * C_KV[120, 1]
V = W_upV[240, 120] * C_KV[120, 1]

At runtime, we only need to cache C_KV for each token and layer rather than K and V, which is where the savings come from. 💰 Note that we introduce three new matrices that need to be learned: W_down, W_upK, and W_upV. At inference time, the W_up matrices can be fused in a one-time operation with W_Q and W_O thanks to the associativity of matrix multiplication, further reducing compute. ⚡ (Both ideas are sketched in code below.)

🌟 Wrap Up
There's a bit more nuance for RoPE-based models; the details of how to make MLA compatible with RoPE are covered in the paper.
For more, read (HTML): https://lnkd.in/e-dVS6Nc
Their repo: GitHub - deepseek-ai/DeepSeek-V2
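To make the shapes concrete, here is a minimal PyTorch sketch (not DeepSeek's actual implementation) of the down/up projection, using the toy sizes from the post: hidden size 240 = 10 heads * 24 dims, latent size 120. The names W_down, W_upK, and W_upV mirror the notation above.

```python
import torch
import torch.nn as nn

n_heads, head_dim = 10, 24
hidden = n_heads * head_dim          # 240, the full hidden size
latent = 120                         # size of the shared compressed tensor C_KV

# The three new learned matrices from the post.
W_down = nn.Linear(hidden, latent, bias=False)   # down-projection: H -> C_KV
W_upK  = nn.Linear(latent, hidden, bias=False)   # up-projection: C_KV -> K
W_upV  = nn.Linear(latent, hidden, bias=False)   # up-projection: C_KV -> V

h = torch.randn(1, 5, hidden)        # (batch, seq_len, hidden) token hidden states

c_kv = W_down(h)                     # (1, 5, 120) -- the only thing we cache
k = W_upK(c_kv)                      # (1, 5, 240) -- recreated from the cache
v = W_upV(c_kv)                      # (1, 5, 240)

# Per token the cache now holds 120 values instead of 2 * 240 = 480,
# a 4x reduction in this toy configuration.
print(c_kv.shape, k.shape, v.shape)
```

And a tiny numerical check of the fusion trick mentioned above: because (W_Q x)^T (W_upK c) = x^T (W_Q^T W_upK) c, the key up-projection can be folded into the query projection once at load time, so attention scores can be computed directly against the cached C_KV (the same argument folds W_upV into W_O). W_q, x_q, and c below are illustrative stand-ins, not names from the paper.

```python
W_q = torch.randn(hidden, hidden)                # stand-in for the query projection
x_q = torch.randn(hidden)                        # query-side hidden state
c   = torch.randn(latent)                        # a cached C_KV entry
W_upK_mat = W_upK.weight                         # (hidden, latent)

score_naive = (W_q @ x_q) @ (W_upK_mat @ c)      # up-project K, then dot product
W_fused     = W_q.T @ W_upK_mat                  # (hidden, latent), computed once
score_fused = x_q @ (W_fused @ c)                # attend directly over C_KV
print(score_naive.item(), score_fused.item())    # equal up to float rounding
```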

Sam Walker

Co-Founder | Master Prompt Engineer | @ Collaborative Dynamics - 10k+ active community

2mo

OOOO! Now THAT'S _interesting_! ... Er... trust me, non-hypernerds. It is.
