I recently chatted with Andrew Nesbitt about his project, Ecosyste.ms. Ecosyste.ms catalogs open source projects by tracking packages, dependencies, repositories, and more. With this dataset, Andrew is able to provide incredible insights into the world of open source. We chat all about how Ecosyste.ms works and how he manages to wrangle all this data.

This episode is also available as a podcast. Search for “Open Source Security” in your favorite podcast player.

Peeling Back the Layers of Open Source with Ecosyste.ms

We all use open source every day in nearly everything we touch. But understanding its sheer scale and all the connections within it is a whole other challenge. I had a chat with Andrew, the developer behind Ecosyste.ms. If you’ve ever wondered about the nuts and bolts of the open source landscape, this is the project that tries to untangle it all.

So, what is Ecosyste.ms? Andrew described it as an extensive index of open source metadata. Think of it as a giant catalog, not just of software packages, but also the various versions, the dependencies between them, the repositories where their code lives, and a whole lot more. He’s trying to map the entire open source world. The primary goal is to figure out which packages are the most used and the most critical to the ecosystem. And the way to do that is to find every user of every package by mining the dependencies of everything out there.

The dataset Andrew has assembled is gigantic. The database is terabytes in size, with one table holding twenty-two billion rows. I’ve dabbled with large datasets, and managing something of this size, especially with Postgres, is seriously impressive. Andrew mentioned that Postgres has been very reliable, saving him from potential data disasters more than once.

One of the big hurdles is the diversity of package managers and how they present their data. Andrew has created a standardized schema built on the things all packages have in common. Some package managers offer neat APIs like a changes feed, making it easier to track updates. There are also a ton of data quirks. For example, NPM historically allowed case-sensitive package names, and some of those packages are still around.
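To make the idea of a standardized schema a little more concrete, here is a minimal sketch in Python. It is not how Ecosyste.ms is implemented. The `PackageRecord` fields, the two registry payload shapes, and every function name are hypothetical and exist only for illustration. The one point it tries to show is that each registry gets its own small adapter into a shared shape, and that names are keyed exactly as published so that packages differing only by case stay distinct.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PackageRecord:
    """Hypothetical common shape that every registry is mapped into."""
    registry: str
    name: str  # kept exactly as published; case is NOT folded
    latest_version: str
    repository_url: Optional[str] = None


def from_registry_a(raw: dict) -> PackageRecord:
    # Made-up payload shape: {"name": ..., "version": ..., "repo": ...}
    return PackageRecord("registry_a", raw["name"], raw["version"], raw.get("repo"))


def from_registry_b(raw: dict) -> PackageRecord:
    # Made-up payload shape: {"id": ..., "releases": [...], "source": {"url": ...}}
    return PackageRecord(
        "registry_b",
        raw["id"],
        raw["releases"][-1],  # assumes releases are listed oldest first
        raw.get("source", {}).get("url"),
    )


def index_key(record: PackageRecord) -> Tuple[str, str]:
    # Key on (registry, exact name) so names that differ only by case
    # are treated as different packages.
    return (record.registry, record.name)
```

With this shape, a package published as `MyPkg` and another published as `mypkg` in the same registry end up under different keys, which is the kind of quirk a case-folding index would silently merge.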

The numbers presented on the Ecosyste.ms website show us this scale. There are 11.4 million packages. The average package has around ten versions, which multiplies out to something on the order of a hundred million package versions. Then there are 262 million repositories in Ecosyste.ms. Most impressive, there are 22 billion dependencies, each one a link between a repository and a package it relies on. He also identified around 1.7 million maintainers, defined as individuals who have the ability to publish new versions of a package. Andrew clarified that the number of people contributing code in any fashion is likely much larger, perhaps around six million active open source developers for projects linked to packages.

A particularly insightful concept Andrew discussed is how he identifies “critical” packages. An astonishingly small number of packages, often fewer than a thousand even in registries with over a million packages, account for 80% of all downloads or dependent repositories. These are the packages whose failure or compromise would cause widespread disruption. Identifying these critical components shows where security efforts could be best focused.
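The 80% idea can be sketched in a few lines of Python. This is an illustration under assumptions, not Andrew's actual method: the function name `critical_packages`, the `share` parameter, and the input format (a plain mapping of package name to download count) are all made up here. The sketch sorts packages by downloads and keeps adding the largest ones until the running total reaches the requested share.

```python
from typing import Dict, List


def critical_packages(downloads: Dict[str, int], share: float = 0.8) -> List[str]:
    """Return the smallest set of top packages covering `share` of all downloads."""
    total = sum(downloads.values())
    if total == 0:
        return []
    target = share * total
    selected: List[str] = []
    running = 0
    # Walk packages from most to least downloaded.
    ranked = sorted(downloads.items(), key=lambda item: item[1], reverse=True)
    for name, count in ranked:
        if running >= target:
            break  # the packages chosen so far already cover the target share
        selected.append(name)
        running += count
    return selected
```

For a toy registry such as `{"a": 700, "b": 150, "c": 100, "d": 30, "e": 20}`, the function returns `["a", "b"]`, since those two packages cover 850 of the 1,000 downloads. The same approach would work with dependent repository counts in place of downloads.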

This naturally leads to an idea Andrew is developing: “blast radius.” Given a vulnerability in a specific version of a package, his data can help map out how many other open source repositories and packages depend on that piece of software. This doesn’t automatically mean every dependent is actually vulnerable, but it could provide a measure of potential impact. It allows for a data-driven approach to prioritizing which projects need attention when a new vulnerability surfaces, moving beyond just the severity score of the flaw itself to understand its potential reach across the open source landscape.
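One way to picture blast radius is as a walk over a reversed dependency graph. The Python sketch below is a simplification under stated assumptions and not the Ecosyste.ms implementation: the function name `blast_radius` and the input format are hypothetical, and it ignores version ranges entirely, so it over-counts compared with an analysis tied to one specific vulnerable version. It collects everything that depends on a package, directly or through other packages.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(dependencies: Dict[str, Set[str]], vulnerable: str) -> Set[str]:
    """Return every project that depends on `vulnerable`, directly or transitively.

    `dependencies` maps a project name to the set of packages it depends on
    directly. Versions are ignored in this simplified sketch.
    """
    # Invert the graph: for each package, which projects depend on it?
    dependents: Dict[str, Set[str]] = {}
    for project, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(project)

    affected: Set[str] = set()
    queue = deque([vulnerable])
    # Breadth-first walk up the dependents graph; `affected` also guards
    # against revisiting nodes when the graph contains cycles.
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent != vulnerable and parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

In a toy graph where `app` depends on `web`, and both `web` and `cli` depend on `parser`, calling `blast_radius(graph, "parser")` returns `{"web", "cli", "app"}`. As the paragraph above notes, being in that set signals potential exposure, not confirmed vulnerability.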

Andrew emphasized that Ecosyste.ms is an open source project, fiscally hosted on Open Collective. It started with a grant and continues with some contract work for Open Collective, building tools on top of the core data. The API is open for anyone to use without needing keys, and it is designed to be cheap to run thanks to aggressive caching. Contributions are welcome, as is financial support via their Open Collective page to help with server costs and further development.

Talking to Andrew about his work on Ecosyste.ms was a great chat. Projects like Ecosyste.ms are invaluable, not just for security researchers, but for anyone trying to make sense of the open source world. It’s a monumental undertaking, and the insights it reveals are fascinating and could be extremely useful.