Data shows public AI repos may be quietly becoming a supply chain risk

Over the past few years, Hugging Face has become the default destination for sharing machine learning models—much like PyPI or npm did for Python and JavaScript. It’s an undeniably powerful resource: a shared infrastructure for open research, rapid prototyping, and production deployment.
But with that centrality comes a familiar problem: trust at scale.
As public AI hubs grow in popularity and usage, they’re starting to resemble package managers. And with that, they inherit many of the same risks—license ambiguity, opaque updates, and security blind spots. What’s different is that model files are usually binary blobs, often large, rarely audited, and sometimes executed with the same level of trust as code.
This post explores what we found when we took a closer look at Hugging Face’s model repository ecosystem—specifically:
- how frequently models lack licenses
- how often files are flagged by Hugging Face’s own security scanners
- and what it means when a model’s metadata silently diverges from its contents
The numbers suggest a growing operational risk that isn’t being treated like one yet.
Licensing Ambiguity at Scale
One of the most basic questions an engineering team has to answer about a dependency is: “Are we allowed to use this?”
Out of 1,873,342 public repositories, approximately 1.2 million (64.4%) lack any declared license at all. In most cases this doesn't matter much, since the majority of these repositories are never downloaded or actively used. Unfortunately, the problem isn't confined to low-traffic repositories; it extends to many high-profile and actively maintained models. Of the 20,060 most-downloaded repositories (those with more than 1,000 downloads), 25.1% still have no license.
This isn't a simple artifact of the downloads threshold either.

Even among repositories with 1M+ downloads, 16% are missing license files. To be fair, this doesn't mean the model is unlicensed, but without license information in the repository itself it's not always trivial to track down which license applies, or how.
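If you want to check this for your own dependencies, the declared license (if any) is exposed through the hub's metadata. Here's a minimal sketch using huggingface_hub; the helper name is ours, the repo id is a placeholder, and attribute names assume a recent version of the library.

```python
# Minimal sketch: check whether a repo declares a license in its model card
# metadata and whether any license-looking file exists in the repo tree.
# Assumes a recent huggingface_hub; attribute names may vary across versions.
from huggingface_hub import HfApi

def declared_license(repo_id: str) -> dict:
    info = HfApi().model_info(repo_id)

    # License declared in the model card metadata (surfaces as a "license:" tag).
    cd = info.card_data
    if isinstance(cd, dict):
        card_license = cd.get("license")
    else:
        card_license = getattr(cd, "license", None) if cd else None

    # Files in the repo whose name looks like a license document.
    filenames = [s.rfilename for s in (info.siblings or [])]
    license_files = [f for f in filenames if "license" in f.lower()]

    return {"repo": repo_id, "card_license": card_license, "license_files": license_files}

print(declared_license("org-name/model-name"))  # placeholder repo id
```

A repository where card_license comes back empty, or where license_files is an empty list, is exactly the kind of ambiguity the next example illustrates.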
Take jonathandinu/face-parsing, for example. At the time of writing it has been downloaded more than a million times over the past month. Although the model card describes a research license, there's no actual license file attached to the model. Instead, we have to look to the upstream project nvidia/mit-b5, which face-parsing was originally based on. Here again we find no explicit license, just another model card with a link to yet another repository that finally contains the actual license: a custom, non-SPDX license written by NVIDIA for SegFormer. Which brings us to our second point: not all licenses are created equal.
Of the remaining ~75% of popular repositories that do have an attached license, only about 64% are permissively licensed.
License Type | Count | Share of Licensed Repos |
---|---|---|
Permissive | 9,576 | 64.25% |
Copyleft | 241 | 1.62% |
Restrictive | 1,856 | 12.45% |
Not Open-Source | 1,418 | 9.51% |
Unknown | 1,814 | 12.17% |
In other words:
- 35.75% of licensed repositories are not permissively licensed.
- Combining this with unlicensed repositories, roughly half (~52%) of popular repositories either have no license file at all or carry a non-permissive license.
For companies deploying these models, this creates a significant compliance surface. Even if a model performs as expected, its license may be incompatible with commercial use, modification, or redistribution. And given the number of hoops that can be required just to identify which license applies to which model, the risk of a mistake is non-trivial.
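For teams that want to automate this kind of triage, the bucketing in the table above can be approximated with a simple lookup over the hub's license tags. The mapping below is an illustrative assumption, not the exact classification used to produce the numbers in this post.

```python
# Illustrative sketch: bucket Hugging Face license tags into coarse categories
# similar to those in the table above. The tag-to-bucket assignments here are
# assumptions for demonstration, not the methodology behind the reported counts.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "isc"}
COPYLEFT = {"gpl-2.0", "gpl-3.0", "agpl-3.0", "lgpl-3.0"}
RESTRICTIVE = {"openrail", "creativeml-openrail-m", "cc-by-nc-4.0", "cc-by-nc-nd-4.0"}

def license_bucket(tag: str | None) -> str:
    if not tag:
        return "unlicensed"
    tag = tag.lower()
    if tag in PERMISSIVE:
        return "permissive"
    if tag in COPYLEFT:
        return "copyleft"
    if tag in RESTRICTIVE:
        return "restrictive"
    return "unknown"

print(license_bucket("apache-2.0"))  # -> "permissive"
print(license_bucket("other"))       # -> "unknown"
```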
Security Scans: Red Flags in High-Traffic Models
Licensing issues are one kind of risk. Another is the question of what’s inside the models people are downloading and executing—often automatically, and sometimes with elevated permissions.
To its credit, Hugging Face has been steadily expanding its security scanning infrastructure. As of this writing, 139,866 repositories have been marked as scanned, which amounts to about 7.5% of the total public set. Of those scanned, 29,437 repositories (1.57% of all repositories) were flagged for containing one or more files with potential security issues.
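For a single repository, the scan results can be pulled from the hub's public API. The sketch below leans on the securityStatus query parameter (the same one the huggingface_hub client exposes on model_info); the exact shape and field names of the response are assumptions and may vary, so the sketch simply filters for security-looking keys.

```python
# Rough sketch: ask the public Hugging Face API for a repo's security-scan data.
# The securityStatus query parameter and the response shape are assumptions here;
# treat field names as version-dependent.
import requests

def security_scan_info(repo_id: str) -> dict:
    url = f"https://huggingface.co/api/models/{repo_id}"
    resp = requests.get(url, params={"securityStatus": "true"}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # Keep only the parts of the response that look security-related.
    return {k: v for k, v in data.items() if "security" in k.lower()}

print(security_scan_info("org-name/model-name"))  # placeholder repo id
```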
These flags fall into three categories:
- Suspicious: uncommon or unexpected formats
- Caution: risky but common serialization formats
- Unsafe: explicitly dangerous file types (e.g., executables)
At a glance, most flagged files fall into the “caution” category—but even that category includes formats like .pkl, .pt, .bin, and .ckpt, all of which can execute arbitrary code if deserialized without care. Here’s how the flagged files break down across extensions:
Category | Most Common File Types |
---|---|
Suspicious | .onnx, .keras, .tflite |
Caution | .pt, .pth, .pkl, .bin, .ckpt |
Unsafe | .exe, .zip, .msi, .json, .pickle, .iso |
Some of these are understandable: Hugging Face allows arbitrary files, and the ML world has standardized around formats that weren’t always designed with safety in mind. But a few numbers are hard to ignore:
- 863 repositories include .bin files flagged as unsafe.
- 158 include .pt files marked unsafe—not just risky.
- 27 flagged repositories have over 100,000 downloads, and 6 of them have more than a million.
File Types in Flagged Repositories
At the repository level, some file extensions dominate. Formats like .pt and .pth—common for PyTorch models—show up in thousands of repositories marked with a “caution” flag. These aren’t inherently malicious but highlight the prevalence of opaque binary blobs in machine learning workflows. More concerning are the outliers: rare but dangerous extensions like .exe, .msi, and .iso, which appear in a handful of repositories yet carry an outsized security risk.
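Reproducing this per-extension view for a single repository only takes a few lines. Here's a rough sketch that tallies extensions from a repo's file listing with huggingface_hub; the helper name and repo id are placeholders.

```python
# Sketch: count file extensions in a single repository's file listing,
# the same kind of per-extension tally behind the breakdowns above.
# Assumes huggingface_hub; the repo id is a placeholder.
from collections import Counter
from pathlib import PurePosixPath
from huggingface_hub import HfApi

def extension_counts(repo_id: str) -> Counter:
    files = HfApi().list_repo_files(repo_id)
    return Counter(PurePosixPath(f).suffix.lower() or "<no extension>" for f in files)

print(extension_counts("org-name/model-name").most_common(10))
```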

Zooming in to count flagged files rather than repositories, the picture shifts. Certain repositories contain dozens or even hundreds of flagged files. Here, serialized formats like .bin, .pickle, and .pkl spike in volume. These formats are well known for their vulnerability to arbitrary code execution when deserialized—a serious concern if developers load weights without rigorous sandboxing.
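If these formats have to be loaded at all, it's worth being deliberate about how. A minimal sketch, assuming PyTorch 1.13+ and the safetensors package (file names are placeholders):

```python
# Minimal sketch: prefer loading mechanisms that avoid arbitrary pickle execution.
# Assumes torch >= 1.13 (for weights_only) and the safetensors package.
import torch
from safetensors.torch import load_file

# 1) Refuse arbitrary pickled objects; only plain tensors/containers are allowed.
state_dict = torch.load("model.pt", map_location="cpu", weights_only=True)

# 2) Better: use safetensors, a format that cannot embed executable payloads.
safe_state_dict = load_file("model.safetensors")
```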

Interestingly, one format shows up in both views as a recurring anomaly: ONNX (.onnx). This cross-platform model format is often flagged as suspicious but very rarely escalates to the unsafe category.
File Size Drift and Model Tampering
Beyond licensing and flagged file types, there’s another class of signal that’s less visible but potentially more telling: drift between what a file claims to be and what it actually is.
To quantify this, we queried the Hugging Face API for the reported size of every file across all public repositories and compared those values to the Content-Length header returned when downloading the file directly. In the overwhelming majority of cases, the two matched exactly—as expected. But not always.
This isn't the first time similar issues have been reported, but in our testing we identified 443 files across 98 repositories where the reported size and the actual size diverged.
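If you want to spot-check your own dependencies, the comparison is straightforward to reproduce. Here's a rough sketch assuming huggingface_hub and requests; the helper name and repo id are placeholders.

```python
# Rough sketch: compare the file sizes reported by the Hugging Face API with
# the Content-Length returned when fetching each file directly. Redirects are
# followed so the final Content-Length reflects the object actually served.
import requests
from huggingface_hub import HfApi, hf_hub_url

def size_drift(repo_id: str) -> list[dict]:
    info = HfApi().model_info(repo_id, files_metadata=True)
    drift = []
    for sibling in info.siblings or []:
        if sibling.size is None:
            continue
        url = hf_hub_url(repo_id, sibling.rfilename)
        head = requests.head(url, allow_redirects=True, timeout=30)
        served = int(head.headers.get("Content-Length", -1))
        if served != sibling.size:
            drift.append({"file": sibling.rfilename, "api": sibling.size, "served": served})
    return drift

print(size_drift("org-name/model-name"))  # placeholder repo id
```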
This kind of drift doesn’t necessarily indicate compromise, but it does violate a basic integrity assumption: that the artifact on disk is the same one described by the registry. And once that assumption breaks, a number of failure modes become possible.
Why Size Drift Matters
There are plenty of benign explanations:
- A model is re-uploaded, but the metadata isn't updated.
- A file is quantized, patched, or replaced in-place.
- The registry caches a stale file length or returns an outdated pointer.
But even in these cases, the result is the same: a mismatch between what downstream consumers think they’re getting and what they actually receive.
From a security perspective, this opens the door to several attacks:
- Hash mismatch or bypass: If a consumer validates against a stored hash based on old metadata, a new payload can slip through undetected.
- Tampered weights or backdoors: An attacker could replace a model with a new one of different size, embedding subtle poisoning or logic bombs.
- CI/CD blind spots: Pipelines relying on file size or timestamp to determine freshness may silently load compromised versions.
This risk is especially acute with serialized formats like .pt, .bin, or .pkl, where a small change in bytes can lead to arbitrary code execution during deserialization.
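One practical mitigation is to pin artifacts by content hash rather than by name, size, or timestamp, and to verify that hash before anything is deserialized. A minimal sketch; the pinned digest and file name are placeholders for whatever your pipeline records at vetting time:

```python
# Minimal sketch: verify a downloaded artifact against a pinned SHA-256 digest
# before loading it. The PINNED mapping and file name are placeholders.
import hashlib
import os

PINNED = {
    "model.safetensors": "expected-sha256-hex-digest",  # recorded at vetting time
}

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so multi-gigabyte model files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str) -> None:
    digest = sha256_file(path)
    expected = PINNED[os.path.basename(path)]
    if digest != expected:
        raise RuntimeError(f"{path}: digest {digest} does not match pinned {expected}")

verify("model.safetensors")  # raises unless the on-disk bytes match the pin
```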
Conclusion: Trust Without Guarantees
The Hugging Face ecosystem has become essential infrastructure for AI development. It enables rapid experimentation, reproducible research, and large-scale deployment in a way that would’ve been unthinkable just a few years ago.
But as usage grows, so does the trust we place in artifacts we didn’t build, audit, or even fully understand.
This post explored just a few structural concerns within that ecosystem:
- Licensing metadata is frequently missing or incomplete—even for top-tier models.
- Security scanners have flagged tens of thousands of repositories for potentially risky files, including in widely used projects.
- A small but measurable number of repositories exhibit file-level inconsistencies—enough to undermine assumptions about reproducibility and provenance.
These aren’t theoretical concerns. They’re operational ones—especially for teams putting models into production, automating deployments, or trying to meet regulatory or compliance requirements.
I’m curious how other teams are approaching this.
If you’re thinking about model security, software supply chain hygiene, or just trying to get a handle on licensing obligations across your stack, I’d love to hear from you.
Feel free to reach out: ian@ramalama.com
Source Code
If you're interested in running any of this analysis yourself, we've put together a CLI tool for generating the data used in this blog post.