Backends declared in the pyproject.toml files

Françoise CONIL

12 January 2024

This dataset

This dataset contains CSV and SQLite files with data about project build backends, extracted from Metadata about every file uploaded to PyPI.

Backends declared in the pyproject.toml files on PyPI from 2018 to 2023

The aim of this analysis is to produce statistics on the build backends declared in the pyproject.toml files of the Python projects uploaded to PyPI.

As a result, the analysis only takes into account the packages uploaded after 2018.

For each project, only the last source package is considered. It would be interesting to take the last source package of each year to see how build backend usage evolves.

Update: The statistics about build backend evolution over time have been done by Bastian Venthur in Investigating popularity of Python build backends over time.

In a source package, there may exist several pyproject.toml files (poetry, meson-python, numpy, flit, jpterm, …) which have different usages (tests, plugins, …). I analysed only the pyproject.toml that is at the root of the project.

I used the path metadata to identify the package files; I should have used the archive_path metadata.
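The root-level test can be sketched as follows. This is a minimal illustration, not the code used for the extraction: the function name `is_root_pyproject` is hypothetical, the example paths are invented, and the sketch assumes that the path is relative to the archive (as an archive_path value would be) and that the source package unpacks into a single top-level directory.

```python
from pathlib import PurePosixPath


def is_root_pyproject(archive_path: str) -> bool:
    """Return True if the path has the form <top-level-dir>/pyproject.toml.

    Assumption: the path is relative to the archive and the source package
    contains a single top-level directory (e.g. "name-version/").
    """
    parts = PurePosixPath(archive_path).parts
    return len(parts) == 2 and parts[1] == "pyproject.toml"


# Invented paths, for illustration only.
print(is_root_pyproject("example-1.0/pyproject.toml"))
print(is_root_pyproject("example-1.0/tests/fixtures/pyproject.toml"))
```

A test on the number of path components is deliberately basic: it rejects pyproject.toml files nested in test or plugin folders, but it would also reject a root file in an archive that has no top-level directory.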

The files below were produced in the indicated order; the aim was to produce pyproject_backends.db.

  1. extract-pyproject-all-versions.csv, extract-pyproject-all-versions.db: for the projects that have a pyproject.toml file and were uploaded after 2018, get:
  2. extract-pyproject-latest.csv, extract-pyproject-latest.db: for each project found in extract-pyproject-all-versions, get the data for the latest uploaded_on date (1)
  3. pyproject_backends.csv, pyproject_backends.db: the build backend found in extract-pyproject-latest.db for each project, taken only from the pyproject.toml file at the root of the project (2)

Source code for the data extraction.

(1) There are several pyproject.toml files for some projects (e.g. poetry), often in test folders.
(2) The test is quite basic, but only a few projects have several pyproject.toml files matching this test.
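Step 2, keeping only the rows of the latest uploaded_on date for each project, can be sketched with SQLite from Python. This is an illustration under assumptions, not the extraction code: the table name pyprojects and the project_name column appear in the queries of this document, but the reduced column set, the ISO date format and the sample rows are invented for the example.

```python
import sqlite3

# Invented sample rows: (project_name, uploaded_on, path).
rows = [
    ("alpha", "2021-03-01 10:00:00", "alpha-1.0/pyproject.toml"),
    ("alpha", "2023-06-15 08:30:00", "alpha-2.0/pyproject.toml"),
    ("alpha", "2023-06-15 08:30:00", "alpha-2.0/tests/pyproject.toml"),
    ("beta", "2019-11-20 12:00:00", "beta-0.1/pyproject.toml"),
]

# Keep, for each project, every row whose uploaded_on is the project's maximum.
# ISO-formatted dates compare correctly as text.
LATEST_QUERY = """
SELECT p.project_name, p.uploaded_on, p.path
FROM pyprojects AS p
JOIN (
    SELECT project_name, MAX(uploaded_on) AS latest
    FROM pyprojects
    GROUP BY project_name
) AS m
  ON p.project_name = m.project_name AND p.uploaded_on = m.latest
ORDER BY p.project_name, p.path
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pyprojects (project_name TEXT, uploaded_on TEXT, path TEXT)"
)
conn.executemany("INSERT INTO pyprojects VALUES (?, ?, ?)", rows)
latest = conn.execute(LATEST_QUERY).fetchall()
conn.close()

for row in latest:
    print(row)
```

In this sample, the project alpha keeps two rows because its latest source package contains two pyproject.toml files, which is the situation described in note (1).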

PyPI metadata further analysis

There were 120 698 “projects” with at least one pyproject.toml in their source package in extract-pyproject-latest.db, cf. the pyproject latest parquet query.

sqlite> select count(distinct project_name) from pyprojects;
120698

The posted charts showed 107 757 projects, not 120 698. This is because some projects have pyproject.toml files that are not at the root of the project or that do not contain a build backend declaration. There are 12 449 projects for which backend is NULL in pyproject_backends.db. TODO: find why the figures still differ.

sqlite> select count(project_name) FROM backends where backend is NULL;
12449
sqlite> select 120175 - 12449;
107726

After the publication of the first charts, I wanted to complete these first statistics by finding out how many projects had no source package and how many had no pyproject.toml.

I had to run a new parquet query without the filter on the presence of pyproject.toml in the package.

The new extraction gives 410 944 different “projects” (this is an extraction of the projects uploaded since 2018).

The following files have been added to the dataset:

  1. extract-project-releases-2018-and-later.csv, extract-project-releases-2018-and-later.db
    extract for each project since 2018:

Source code of this new extraction; additional information is available in its README file.

There are 410 944 distinct project_name values identified since 2018 in extract-project-releases-2018-and-later.db.

sqlite> select count(distinct project_name) from pyprojects;
410944

Of these 410 944 projects, ~ 11 % have no source package in .tar.gz format on PyPI (there are .zip and .bz2 source packages, but they account for a marginal 0.77 %).

This gives a base of 366 969 projects with a source package in .tar.gz format.

We initially found that 120 698 projects have a pyproject.toml file, which suggests that:

sqlite> select count(distinct project_name) from pyprojects where source = 'true';
366969

sqlite> select 120698 * 100.0 / 410944;
29.3709118517365

sqlite> select 120698 * 100.0 / 366969;
32.8905166376451
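The same ratios can be recomputed outside SQLite. The short sketch below uses only the counts quoted in this document; the constant names and the helper `percent` are hypothetical and exist only for the example.

```python
# Counts quoted in this document.
TOTAL_PROJECTS = 410_944   # distinct projects uploaded since 2018
TARGZ_PROJECTS = 366_969   # projects with a .tar.gz source package
WITH_PYPROJECT = 120_698   # projects with at least one pyproject.toml


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return part * 100.0 / whole


# Share of projects without a .tar.gz source package (the "~ 11 %").
no_targz = percent(TOTAL_PROJECTS - TARGZ_PROJECTS, TOTAL_PROJECTS)
# Share of projects with a pyproject.toml, relative to each base.
share_of_all = percent(WITH_PYPROJECT, TOTAL_PROJECTS)
share_of_targz = percent(WITH_PYPROJECT, TARGZ_PROJECTS)

print(f"no .tar.gz source package: {no_targz:.2f} %")
print(f"pyproject.toml, all projects: {share_of_all:.2f} %")        # 29.37 %
print(f"pyproject.toml, .tar.gz projects: {share_of_targz:.2f} %")  # 32.89 %
```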

Among the 120 698 projects that have a pyproject.toml file:

extract-pyproject-latest.db

sqlite> select 8113 * 100.0 / 107757;
7.52897723581763
sqlite> select 8113 * 100.0 / 410944;
1.97423493225354

sqlite> select 49819 * 100.0 / 107757;
46.2327273402192
sqlite> select 49819 * 100.0 / 410944;
12.123062996418

sqlite> select 33576 * 100.0 / 107757;
31.1589966313093
sqlite> select 33576 * 100.0 / 410944;
8.1704563152157

TODO: How are the remaining 67 % of projects packaged, including the 10 % that did not upload a source package?