WebAssembly and Kubernetes Go Better Together: Matt Butcher

We sat down with Fermyon co-founder and CEO Matt Butcher to have a chat about SpinKube, a new framework for integrating WebAssembly with Kubernetes.

May 22nd, 2024 2:00pm by Raghavan "Rags" Srinivas

Featued image for: WebAssembly and Kubernetes Go Better Together: Matt Butcher

Photo by Ahmad Dirini on Unsplash.

Matt Butcher

In this interview, we delve into details about the WebAssembly (WASM) format and Kubernetes with Matt Butcher, co-founder and CEO of Fermyon. Butcher is passionate about WebAssembly, distributed systems, and artificial intelligence.

Butcher is one of the original creators, or had a hand in the creation of Helm, Brigade, Cloud Native Application Bundles, Open Application Model, Glide and Krustlet. He has written/co-written many books including “Learning Helm” and “Go in Practice.”

In this interview, Butcher talks about the motivation and evolution of serverless functions and how Wasm complements it. He touches upon the use cases of Wasm in edge computing, AI and also on the server.

He also spoke about how SpinKube, recently introduced as a serverless platform on the Kubernetes platform powered by Wasm, was an effort motivated by the adoption of Kubernetes based on a similar orchestrator that was used in the Fermyon Cloud. Finally, Butcher paints a realistic picture of the challenges to the Wasm community especially with the existing communities such as Java.

Can you please introduce yourself, and provide a brief background as to how Wasm came about? Any particular pain point in distributed systems app development that you were intent on solving with Wasm?

I spent my career in software at the intersection of distributed computing, developer tooling, and supply chain security; it wasn’t really one benefit that drew me to Wasm, but how the characteristics of the technology mapped to the problems I was trying to solve (and more generally, the problems we’ve been trying to solve as an industry): portability; startup and execution speed; strong capability-based sandboxing; and artifact size.

At Microsoft, we were looking at Azure Functions and other similar technologies and were seeing a lot of waste. Systems were sitting idle 80% (or more) just waiting for inbound requests. Virtual machines were sitting pre-warmed in a queue consuming memory and CPU, waiting to have a serverless function installed and executed. And all this wastage was done in the name of performance. We asked ourselves: What if we could find a much more efficient form of compute that could cold start much faster? Then we wouldn’t have to pre-warm virtual machines. We could drastically improve density, also, because resources would be freed more often.

So it was really all about finding a better runtime for the developer pattern that we call “serverless functions.” There are too many definitions of serverless floating around. When we at Fermyon say that, we mean the software design pattern in which a developer does not write a server, but just an event handler. The code is executed when a request comes in and shuts down as soon as the request has been handled.

Yes. The first generation of serverless gave us a good design pattern (FaaS, or serverless functions) based on event handling. But from Lambda to KNative, the implementations were built atop runtimes that were not optimized for that style of workload. Virtual machines can take minutes to start, and containers tend to take a few dozen seconds. That means you either have to pre-warm workloads or take large cold-start penalties. Pre-warming is costly, involving paying for CPU and memory while something sits idle. Cold starts, though, have a direct user impact: slow response time.

We were attracted to WebAssembly specifically because we were looking for a faster form of compute that would reduce cold start time. And at under one millisecond, WebAssembly’s cold start time is literally faster than the blink of an eye. But what surprised us was that by reducing cold start, utilizing WebAssembly’s small binary sizes, and writing an effective serverless functions executor, we have been able to also boost density. That, in turn, cuts cost and improves sustainability since this generation of serverless is doing more with fewer resources.

Delving into the intersection of Wasm and edge computing, Is Wasm primarily tailored for edge applications where efficiency and responsiveness are paramount?

Serverless functions are the main way people do edge programming. Because of its runtime profile, Wasm is a phenomenal fit for edge computing — and we have seen content delivery networks like Fastly, and more recently Cloudflare, demonstrate that. And we are beginning to see more use cases of Wasm on constrained and disconnected devices — from 5G towers to the automotive industry (with cars running Spin applications because of that efficiency and portability).

One way to think about this problem is in terms of how many applications you can run on an edge node. Containers require quite a bit of system resources. On a piece of hardware (or virtual machine) that can run about 30-50 containers, we can run upwards of 5,000 serverless Wasm apps. (Fermyon Cloud runs 3,000 user apps per node on our cluster.)

The other really nice thing about WebAssembly is that it is OS- and architecture-neutral. It can run on Windows, Linux, Mac, and even many exotic operating systems. Likewise, it can execute on Intel architecture or ARM. That means the developer can write and compile the code without having to know what the destination is. And that’s important for edge computing where oftentimes the location is decided at execution time.

Navigating the realm of enterprise software: Many enterprises still use Java. Keeping Java developers and architects in mind, can you compare and contrast Java and Wasm? Can you talk about the role of WASI in this context?

Luke Wagner, the creator of Wasm, once said to me that Java broke new ground (along with .NET) in the late 90s and over decades they refined, optimized, and improved. WebAssembly was an opportunity to start afresh on a foundation of 20 years of research and development.

Wasm is indebted to earlier bytecode languages, for sure. But there’s more to it than that. WebAssembly was built with very stringent security sandboxing needs. Deny-by-default and capabilities-based system interfaces are essential to a technology that assumes it is running untrusted code. Java and .NET take the default disposition that the execution context can “trust” the code it is running. Wasm is the opposite.

Secondly, Wasm was built with the stated goal that many different languages (ideally any language) can be compiled to Wasm. JVM and .NET both started from a different perspective that specific languages would be built to accommodate the runtime. In contrast, the very first language to get Wasm support was C. Rust, Go, Python, and JavaScript all followed along soon after. In other words, WebAssembly demonstrated that it could be a target for existing programming languages while after decades neither Java nor .NET have been able to do this broadly.

Moreover, WebAssembly was designed for speed both in terms of cold start and runtime performance. While Java has always had notoriously slow startup times, WebAssembly binaries take less than a millisecond to cold start.

But the greatest thing about WebAssembly is that if it is actually successful, it won’t be long until both Java and .NET can compile to WebAssembly. The .NET team has said they’ll fully support WebAssembly late in 2024. Hopefully, Java won’t be too far behind.

Exploring the synergy between Wasm and Kubernetes, SpinKube was released at the Cloud Native Computing Foundation‘s Kubecon EU 2024 in Paris. Was Kubernetes an afterthought? How does it impact cloud native app development including addressing deficiencies, and redefining containerized applications?

Most of us from Fermyon worked in Kubernetes for a long time. We built Helm, Brigade, OAM, OSM, and many other Kubernetes projects. When we switched to WebAssembly, we knew we’d have to start with developer tools and work our way toward orchestrators. So we started with Spin to get developers building useful things. We built Fermyon Cloud so that developers would be able to deploy their apps to our infrastructure. Behind the scenes, Fermyon Cloud runs HashiCorp’s Nomad. It has been a spectacular orchestrator for our needs. As I said earlier, we’ve been able to host 3,000 user-supplied apps per worker node in our cluster, orders of magnitude higher than what we could have achieved with containers.

But inevitably we felt the pull of Kubernetes’ gravity. So many organizations are operating mature Kubernetes clusters. It would be hard to convince them to switch to Nomad. So we worked together with Microsoft, SUSE, Liquid Reply, and others to build an excellent Kubernetes solution. Microsoft did some of the heaviest lifting when they wrote the containerd support for Spin.

SpinKube’s immediate popularity surprised even us. People are eager to try a serverless runtime that outperforms Knative and other early attempts at serverless in Kubernetes.

Let’s chat about Wasm and Artificial Intelligence. There’s a buzz about Wasm and AI. Can you expand on this? Does this mean that Large Language Models (LLMs), etc. can be moved to the edge with Wasm?

There are two parts to working with LLMs. There’s training a model, which takes huge amounts of resources and time. And there’s using (or “inferring against”) a model. This later part can benefit a lot from runtime optimization. So we have focused exclusively on serverless inferencing with LLMs.

At its core, WebAssembly is a compute technology. We usually think of that in terms of CPU usage. But it’s equally applicable to the GPUs that AI uses for inferencing. So we have been able to build SDKs inside of the WebAssembly host that allow developers to do LLM inferencing without having to concern themselves with the underlying GPU architecture or resources.

Most importantly, Fermyon makes it possible to time-slice GPU usage at a very fine level of granularity. While most AI inferencing engines lock the GPU for the duration of a server process (e.g. hours, days, or months at a time), we lock the GPU only during an inferencing operation, which is seconds or maybe a few minutes at a time. This means you can get much higher density, running more AI applications on the same GPU.

This is what makes Wasm viable on the edge, where functions run for only a few eye blinks. Those edge nodes will need only a modest number of GPUs to serve thousands of applications.

WASM roadmap: What are the gaps hampering Wasm adoption and the community-driven roadmap to address them? Anything else you want to add?

WebAssembly is still gaining new features. The main one in flight right now is asynchronous function calls across components. This would allow WebAssembly modules to talk to each other asynchronously, and this will vastly improve the parallel computing potential of Wasm.

But always the big risk with WebAssembly comes from its most audacious design goal. Wasm is only successful if it can run all major programming languages, and run them in a more or less uniform way (e.g. with support for WASI, the WebAssembly System Interface). We’ve seen many languages add first-tier WebAssembly support. Some have good implementations that are not included in the main language and require separate toolchains. That’s a minor hindrance. Others, like Java, haven’t made any notable progress. While I think things are progressing well for all major languages other than Java, until those languages have complete support, they have not reached their full potential.

Raghavan "Rags" Srinivas (@ragss) is an Architect/Advocate enabling developers to build scalable and available systems. With a background in app development and infrastructure, he has gravitated towards distributed systems. He specializes in Cloud Computing, specifically multicloud — He worked on...