Tero Keski-Valkama’s Post

Tero Keski-Valkama

Helping AIs cure cancer. AI generalist working from Spain. Experience in leadership, AI research, software engineering.

A nice optimization that leverages redundancy between the key and value representations, and also between heads. It probably conveys information more effectively by correlating keys (and hence queries) with values, which would explain why it works better than multi-head attention.

Pramodith B.

AI Engineer @ LinkedIn | Posts weekly about AI

Multi-Head Latent Attention (MLA) is a technique for computing attention, newly proposed by the creators of DeepSeek-V2. MLA is lighter on the KV cache, making it easier to scale to long sequences, while also outperforming traditional Multi-Head Attention (MHA). 🚀

💡 Idea
LoRA → Low-Rank Weight Matrices
GaLore → Low-Rank Projection of Gradient Matrices
LoReFT → Low-Rank Projection of Hidden States
Using low-rank projections to make computation more efficient has been common in the past few months. MLA uses the same idea.

💻 MLA → Low-Rank Projection of the Query, Key, and Value Matrices.

📚 Deep Dive into MLA
Let's assume we have 10 attention heads per layer and each head's output dimension is 24. In traditional MHA, the KV cache for each token stores a Key tensor of shape (10 * 24) and a Value tensor of the same shape. For a sequence of length L, that's L * 2 * (10 * 24). 🧠

In MLA, the idea is to create a common "compressed tensor" called C_KV that's much smaller, i.e., << (10 * 24). This is achieved by using a low-rank weight matrix to down-project the hidden state of a token. The shape expected during self-attention is then recovered by up-projecting C_KV with another low-rank weight matrix. 🔎

So assume we want C_KV to be of size 120, where the original hidden size is (10 * 24) = 240:
C_KV = W_down[120, 240] * H[240, 1]
K = W_upK[240, 120] * C_KV[120, 1]
V = W_upV[240, 120] * C_KV[120, 1]

At runtime, we only need to cache C_KV for each token and layer rather than K and V, which is where the savings come from. 💰 Note that we introduce three new matrices that need to be learned: W_down, W_upK, and W_upV. At inference time, the W_up matrices can be fused in a one-time operation with W_Q and W_O thanks to the associativity of matrix multiplication, further reducing compute. ⚡ (Both ideas are sketched in code below.)

🌟 Wrap Up
There's a bit more nuance for RoPE-based models; the details of how to make MLA compatible with RoPE are covered in the paper.
For more, read (HTML): https://lnkd.in/e-dVS6Nc
Their repo: GitHub - deepseek-ai/DeepSeek-V2
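To make the shapes concrete, here is a minimal PyTorch sketch (not DeepSeek's actual implementation) of the down/up projection, using the toy sizes from the post: hidden size 240 = 10 heads * 24 dims, latent size 120. The names W_down, W_upK, and W_upV mirror the notation above.

```python
import torch
import torch.nn as nn

n_heads, head_dim = 10, 24
hidden = n_heads * head_dim          # 240, the full hidden size
latent = 120                         # size of the shared compressed tensor C_KV

# The three new learned matrices from the post.
W_down = nn.Linear(hidden, latent, bias=False)   # down-projection: H -> C_KV
W_upK  = nn.Linear(latent, hidden, bias=False)   # up-projection: C_KV -> K
W_upV  = nn.Linear(latent, hidden, bias=False)   # up-projection: C_KV -> V

h = torch.randn(1, 5, hidden)        # (batch, seq_len, hidden) token hidden states

c_kv = W_down(h)                     # (1, 5, 120) -- the only thing we cache
k = W_upK(c_kv)                      # (1, 5, 240) -- recreated from the cache
v = W_upV(c_kv)                      # (1, 5, 240)

# Per token the cache now holds 120 values instead of 2 * 240 = 480,
# a 4x reduction in this toy configuration.
print(c_kv.shape, k.shape, v.shape)
```

And a tiny numerical check of the fusion trick mentioned above: because (W_Q x)^T (W_upK c) = x^T (W_Q^T W_upK) c, the key up-projection can be folded into the query projection once at load time, so attention scores can be computed directly against the cached C_KV (the same argument folds W_upV into W_O). W_q, x_q, and c below are illustrative stand-ins, not names from the paper.

```python
W_q = torch.randn(hidden, hidden)                # stand-in for the query projection
x_q = torch.randn(hidden)                        # query-side hidden state
c   = torch.randn(latent)                        # a cached C_KV entry
W_upK_mat = W_upK.weight                         # (hidden, latent)

score_naive = (W_q @ x_q) @ (W_upK_mat @ c)      # up-project K, then dot product
W_fused     = W_q.T @ W_upK_mat                  # (hidden, latent), computed once
score_fused = x_q @ (W_fused @ c)                # attend directly over C_KV
print(score_naive.item(), score_fused.item())    # equal up to float rounding
```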

Sam Walker

Co-Founder | Master Prompt Engineer | @ Collaborative Dynamics - 10k+ active community

2mo

OOOO! Now THAT'S _interesting_! ... Er... trust me, non-hypernerds. It is.
