Featured Article

What does ‘open source AI’ mean, anyway?

Meet the guy working to find “the definition”

Open Source Initiative (OSI) executive director Stefano Maffulli
Image Credits: Open Source Initiative (OSI) // Stefano Maffulli, OSI Executive Director

The struggle between open source and proprietary software is well understood. But the tensions permeating software circles for decades have shuffled into the artificial intelligence space, in part because no one can agree on what “open source” really means in the context of AI.

The New York Times recently published a gushing appraisal of Meta CEO Mark Zuckerberg, noting how his “open source AI” embrace had made him popular once more in Silicon Valley. By most estimations, however, Meta’s Llama-branded large language models aren’t really open source, which highlights the crux of the debate.

It’s this challenge that the Open Source Initiative (OSI) is trying to address, led by executive director Stefano Maffulli (pictured above), through conferences, workshops, panels, webinars, reports and more.

AI ain’t software code

The OSI has been a steward of the Open Source Definition (OSD) for more than a quarter of a century, setting out how the term “open source” can, or should, be applied to software. A license that meets this definition can legitimately be deemed “open source,” though the OSI recognizes a spectrum of licenses ranging from extremely permissive to not quite so permissive.

But transposing legacy licensing and naming conventions from software onto AI is problematic. Joseph Jacks, open source evangelist and founder of VC firm OSS Capital, goes as far as to say that there is “no such thing as open-source AI,” noting that “open source was invented explicitly for software source code.” Further, “neural network weights” (NNWs), a term used in the world of artificial intelligence to describe the parameters or coefficients that the network learns during the training process, aren’t in any meaningful way comparable to software.

“Neural net weights are not software source code; they are unreadable by humans, [and they are not] debuggable,” Jacks notes. “Furthermore, the fundamental rights of open source also don’t translate over to NNWs in any congruent manner.”

These inconsistencies last year led Jacks and OSS Capital colleague Heather Meeker to come up with their own definition of sorts, around the concept of “open weights.” And Maffulli, for what it’s worth, agrees with them. “The point is correct,” he told TechCrunch. “One of the initial debates we had was whether to call it open source AI at all, but everyone was already using the term.”

Meta analysis

Founded in 1998, the OSI is a not-for-profit public benefit corporation that works on a myriad of open source-related activities around advocacy, education and its core raison d’être: the Open Source Definition. Today, the organization relies on sponsorships for funding, with such esteemed donors as Amazon, Google, Microsoft, Cisco, Intel, Salesforce and Meta.

Meta’s involvement with the OSI is particularly notable right now as it pertains to the notion of “open source AI.” Despite Meta hanging its AI hat on the open-source peg, the company has notable restrictions in place regarding how its Llama models can be used: Sure, they can be used gratis for research and commercial use cases, but app developers with more than 700 million monthly users must request a special license from Meta, which it will grant purely at its own discretion.

Meta’s language around its LLMs is somewhat malleable. While the company did call its Llama 2 model open source, with the arrival of Llama 3 in April, it retreated somewhat from the terminology, using phrases such as “openly available” and “openly accessible” instead. But in some places, it still refers to the model as “open source.”

“Everyone else that is involved in the conversation is perfectly agreeing that Llama itself cannot be considered open source,” Maffulli said. “People I’ve spoken with who work at Meta, they know that it’s a little bit of a stretch.”

On top of that, some might argue that there’s a conflict of interest here: a company that has shown a desire to piggyback off the open source branding also provides finances to the stewards of “the definition”?

This is one of the reasons why the OSI is trying to diversify its funding, recently securing a grant from the Sloan Foundation, which is helping to fund its multi-stakeholder global push to reach the Open Source AI Definition. TechCrunch can reveal this grant amounts to around $250,000, and Maffulli is hopeful that this can alter the optics around its reliance on corporate funding.

“That’s one of the things that the Sloan grant makes even more clear: We could say goodbye to Meta’s money anytime,” Maffulli said. “We could do that even before this Sloan Grant, because I know that we’re going to be getting donations from others. And Meta knows that very well. They’re not interfering with any of this [process], neither is Microsoft, or GitHub or Amazon or Google — they absolutely know that they cannot interfere, because the structure of the organization doesn’t allow that.”

Working definition of open source AI

The current Open Source AI Definition draft sits at version 0.0.8 and comprises three core parts: the “preamble,” which lays out the document’s remit; the Open Source AI Definition itself; and a checklist that runs through the components required for an open source-compliant AI system.

As per the current draft, an Open Source AI system should grant the freedoms to use the system for any purpose without seeking permission; to study how the system works and inspect its components; and to modify and share the system for any purpose.

But one of the biggest challenges has been around data: Can an AI system be classified as “open source” if the company hasn’t made the training dataset available for others to poke at? According to Maffulli, it’s more important to know where the data came from, and how a developer labeled, de-duplicated and filtered it. It is also important to have access to the code that was used to assemble the dataset from its various sources.

“It’s much better to know that information than to have the plain dataset without the rest of it,” Maffulli said.

While having access to the full dataset would be nice (the OSI makes this an “optional” component), Maffulli says that it’s not possible or practical in many cases. This might be because the dataset contains confidential or copyrighted information that the developer doesn’t have permission to redistribute. Moreover, some approaches to training machine learning models, such as federated learning, differential privacy and homomorphic encryption, mean the data itself isn’t actually shared with the system.

And this perfectly highlights the fundamental differences between “open source software” and “open source AI”: The intentions might be similar, but they are not like-for-like comparable, and this disparity is what the OSI is trying to capture in its definition.

In software, source code and binary code are two views of the same artifact: They reflect the same program in different forms. But training datasets and the subsequent trained models are distinct things: You can take that same dataset, and you won’t necessarily be able to re-create the same model consistently.

“There is a variety of statistical and random logic that happens during the training that means it cannot make it replicable in the same way as software,” Maffulli added.
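A toy illustration of that point, offered as a minimal sketch rather than a picture of real model training: the hypothetical `train` function below fits a single line to a fixed dataset using stochastic gradient descent. The starting weights and the order in which samples are visited are both random, so two runs on identical data end with different weights unless the random seed is also shared.

```python
import random

def train(data, seed, epochs=5, lr=0.01):
    """Fit y ~ w*x + b with stochastic gradient descent (toy example)."""
    rng = random.Random(seed)
    # Random initialization: the first source of run-to-run variation.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples = list(data)
    for _ in range(epochs):
        # Random sample order each epoch: the second source of variation.
        rng.shuffle(samples)
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# The same dataset is used for every run.
data = [(x, 2 * x + 1) for x in range(10)]

run_a = train(data, seed=0)
run_b = train(data, seed=1)        # same data, different seed
run_a_again = train(data, seed=0)  # same data, same seed

print("seed 0:", run_a)
print("seed 1:", run_b)
print("seed 0 repeated matches:", run_a == run_a_again)
```

Publishing the data alone would not let someone reproduce `run_a`; they would also need the training code and its settings, which is the kind of information the definition's checklist is meant to capture.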

So an open source AI system should be easy to replicate, with clear instructions. And this is where the checklist facet of the Open Source AI Definition comes into play, which is based on a recently published academic paper called “The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence.”

This paper proposes the Model Openness Framework (MOF), a classification system that rates machine learning models “based on their completeness and openness.” The MOF demands that specific components of the AI model development be “included and released under appropriate open licenses,” including training methodologies and details around the model parameters.
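To make the idea of a component checklist concrete, here is a rough sketch of how one might be evaluated in code. It does not reproduce the OSI's checklist or the MOF's actual classification: the `Component` class, the component names (loosely drawn from this article) and the required/optional flags are all assumptions for illustration, apart from the full training dataset being optional.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    required: bool         # assumption: whether the checklist demands it
    released_openly: bool  # whether it ships under an open license

def missing_required(components):
    """Return names of required components that were not released openly."""
    return [c.name for c in components if c.required and not c.released_openly]

# Hypothetical AI system, described component by component.
example_system = [
    Component("training methodology", required=True, released_openly=True),
    Component("model parameters", required=True, released_openly=False),
    Component("data provenance and processing code", required=True, released_openly=True),
    Component("full training dataset", required=False, released_openly=False),
]

gaps = missing_required(example_system)
print("Missing required components:", gaps)
```

In this sketch the withheld dataset does not count against the system because it is marked optional, while the unreleased model parameters do.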

Stable condition

Stefano Maffulli presenting at the Digital Public Goods Alliance (DPGA) members summit in Addis Ababa.
Image Credits: OSI

The OSI is calling the official launch of the definition the “stable version,” much like a company will do with an application that has undergone extensive testing and debugging ahead of prime time. The OSI is purposefully not calling it the “final release” because parts of it will likely evolve.

“We can’t really expect this definition to last for 26 years like the Open Source Definition,” Maffulli said. “I don’t expect the top part of the definition — such as ‘what is an AI system?’ — to change much. But the parts that we refer to in the checklist, those lists of components depend on technology. Tomorrow, who knows what the technology will look like.”

The stable Open Source AI Definition is expected to be rubber-stamped by the OSI’s board at the All Things Open conference at the tail end of October, with the OSI embarking on a global roadshow in the intervening months spanning five continents, seeking more “diverse input” on how “open source AI” will be defined moving forward. But any final changes are likely to be little more than “small tweaks” here and there.

“This is the final stretch,” Maffulli said. “We have reached a feature complete version of the definition; we have all the elements that we need. Now we have a checklist, so we’re checking that there are no surprises in there; there are no systems that should be included or excluded.”
