12 January 2024
This dataset contains CSV and SQLite files describing the build backends of projects, extracted from "Metadata about every file uploaded to PyPI".
The aim of this analysis is to compute statistics on the
build backends used in pyproject.toml
files of the Python projects uploaded to PyPI.
As a result, the analysis takes into account only the packages uploaded after 2018.
For each project, only the last source package is considered. It would be interesting to take the last source package of each year to see the evolution of the build backends usage.
Update: the statistics on the evolution of build backends over time were produced by Bastian Venthur in "Investigating popularity of Python build backends over time".
In a source package, there may exist several
pyproject.toml files (poetry, meson-python, numpy, flit,
jpterm, …) which have different usages (tests, plugins, …). I analysed
only the pyproject.toml that is at the root of the
project.
I used the path metadata to identify the package
files; I should have used the archive_path metadata instead.
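The root-level check described above can be sketched in a few lines of Python. This is a minimal illustration, not the extraction code actually used: the helper name `is_root_pyproject` and the sample paths are hypothetical, and it assumes that archive_path values look like `<top-level-dir>/pyproject.toml`, with the source package unpacking into a single top-level directory.

```python
from pathlib import PurePosixPath


def is_root_pyproject(archive_path: str) -> bool:
    """Return True if archive_path points at a root-level pyproject.toml.

    Assumption: the source package unpacks into one top-level directory,
    so the root pyproject.toml sits exactly one directory deep.
    """
    parts = PurePosixPath(archive_path).parts
    return len(parts) == 2 and parts[1] == "pyproject.toml"


# Hypothetical paths, for illustration only.
print(is_root_pyproject("demo-1.0/pyproject.toml"))                 # True
print(is_root_pyproject("demo-1.0/tests/fixtures/pyproject.toml"))  # False
```

A depth-based test like this one rejects the pyproject.toml files that live in test or plugin folders, which is the case the analysis wants to exclude.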
The files below were produced in the indicated order; the aim was to produce pyproject_backends.db.
- For each project having a pyproject.toml file and having been uploaded after 2018, get:
  - project_name
  - max project_version
  - max uploaded_on
  - list of distinct project_version
  - list of distinct uploaded_on
  - list of distinct path
  - ...
- From extract-pyproject-all-versions, get the data of the latest uploaded_on date (1)
- From extract-pyproject-latest.db, for each project, look only in the pyproject.toml file at the root of the project (2)
Source code for the data extraction.
(1) There are several pyproject.toml files for some projects (e.g.
poetry), often in test folders.
(2) The test is quite basic, but few projects have
several pyproject.toml files matching this test.
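The "keep only the latest uploaded_on per project" step can be illustrated in plain Python. This is a sketch under assumptions, not the actual extraction query: the rows, the function name `latest_per_project` and the path layout are hypothetical.

```python
from datetime import datetime

# Hypothetical rows: (project_name, project_version, uploaded_on, path)
rows = [
    ("demo", "1.0", datetime(2019, 5, 1), "demo-1.0/pyproject.toml"),
    ("demo", "2.0", datetime(2023, 3, 9), "demo-2.0/pyproject.toml"),
    ("other", "0.1", datetime(2021, 1, 15), "other-0.1/pyproject.toml"),
]


def latest_per_project(rows):
    """Keep, for each project_name, the row with the most recent uploaded_on."""
    latest = {}
    for row in rows:
        name, uploaded_on = row[0], row[2]
        if name not in latest or uploaded_on > latest[name][2]:
            latest[name] = row
    return latest


for name, row in sorted(latest_per_project(rows).items()):
    print(name, row[1], row[2].date())
```

Keeping one row per project is what makes the later counts comparable to `count(distinct project_name)`.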
There were 120 698 “projects” with at least one
pyproject.toml in their source package in
extract-pyproject-latest.db, cf pyproject
latest parquet query.
sqlite> select count(distinct project_name) from pyprojects;
120698
The posted
charts showed 107 757 projects, not 120 698. This is because some
projects have pyproject.toml files that are not
at the root of the project or that do not contain a build backend
declaration. There are 12 449 projects for which backend is NULL in
pyproject_backends.db. TODO: find why the figures still
differ.
sqlite> select count(project_name) FROM backends where backend is NULL;
12449
sqlite> select 120175 - 12449;
107726
After the publication of the first charts, I wanted to know
how many projects had no source package and how many had no
pyproject.toml, in order to complete the first
statistics.
I had to run a new
parquet query removing the filter on the pyproject.toml
presence in the package.
The new extraction gives 410 944 different “projects” (this is an extraction of the projects uploaded since 2018).
The following files have been added to the dataset:
- project_name, project_version
- project_release
- suffix(project_release, '.whl') AS wheel
- suffix(project_release, '.tar.gz') AS source
- max(uploaded_on) AS max_uploaded_on
- date_part('year', max(uploaded_on)) AS max_year
- list(DISTINCT uploaded_on)
Source code of this new extraction; additional information is available in its README file.
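The `suffix(...)` expressions listed above appear to flag whether a release file is a wheel or a .tar.gz source package. A rough Python equivalent, given only as an illustration with hypothetical file names and function names, could look like this:

```python
def release_flags(project_release: str) -> dict:
    """Mirror the wheel/source flags using plain suffix checks."""
    return {
        "wheel": project_release.endswith(".whl"),
        "source": project_release.endswith(".tar.gz"),
    }


def projects_with_sdist(releases):
    """Return the project names having at least one .tar.gz release.

    `releases` is an iterable of (project_name, project_release) pairs.
    """
    return {name for name, release in releases if release_flags(release)["source"]}


# Hypothetical release files, for illustration only.
releases = [
    ("demo", "demo-1.0.tar.gz"),
    ("demo", "demo-1.0-py3-none-any.whl"),
    ("wheel-only", "wheel_only-0.3-py3-none-any.whl"),
    ("zipped", "zipped-2.1.zip"),
]

print(sorted(projects_with_sdist(releases)))  # ['demo']
```

Note that a project distributed only as .zip or .bz2 is not flagged as having a source package by this test, which matches the caveat about those formats.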
There are 410 944 distinct project_name values identified
since 2018 in
extract-project-releases-2018-and-later.db
sqlite> select count(distinct project_name) from pyprojects;
410944
Of these 410 944 projects, ~ 11 % have no source package in
.tar.gz format on PyPI (there are .zip and
.bz2 source packages, but they account for a marginal 0.77 %).
This gives a base of 366 969 projects with a source package in
.tar.gz format.
We initially found that 120 698 projects have a
pyproject.toml file. We can therefore estimate that:
- projects with a pyproject.toml file represent ~ 29.4 % of the total projects
- they represent ~ 32.9 % of the projects with a source package in .tar.gz format
sqlite> select count(distinct project_name) from pyprojects where source = 'true';
366969
sqlite> select 120698 * 100.0 / 410944;
29.3709118517365
sqlite> select 120698 * 100.0 / 366969;
32.8905166376451
Among the 120 698 projects that have a pyproject.toml file:
- setuptools represents ~ 46.2 % of the source projects with a pyproject.toml, and ~ 12.1 % of the total projects uploaded since 2018
- poetry represents ~ 31.2 % of the source projects with a pyproject.toml, and ~ 8.2 % of the total projects uploaded since 2018
- Hatchling represents ~ 7.5 % of the source projects with a pyproject.toml, and ~ 2 % of the total projects uploaded since 2018
extract-pyproject-latest.db
sqlite> select 8113 * 100.0 / 107757;
7.52897723581763
sqlite> select 8113 * 100.0 / 410944;
1.97423493225354
sqlite> select 49819 * 100.0 / 107757;
46.2327273402192
sqlite> select 49819 * 100.0 / 410944;
12.123062996418
sqlite> select 33576 * 100.0 / 107757;
31.1589966313093
sqlite> select 33576 * 100.0 / 410944;
8.1704563152157
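The same percentages can be reproduced outside the sqlite shell. The short Python sketch below reuses the counts from the queries above; the variable and function names are illustrative, and the mapping of each count to a backend is inferred from the matching percentages.

```python
TOTAL_PROJECTS = 410_944         # projects uploaded since 2018
PROJECTS_WITH_BACKEND = 107_757  # projects shown in the posted charts

# Counts taken from the sqlite queries; backend names matched by percentage.
backend_counts = {"setuptools": 49_819, "poetry": 33_576, "hatchling": 8_113}


def share(count: int, total: int) -> float:
    """Percentage of `count` within `total`, rounded to two decimals."""
    return round(count * 100.0 / total, 2)


for name, count in backend_counts.items():
    with_backend = share(count, PROJECTS_WITH_BACKEND)
    overall = share(count, TOTAL_PROJECTS)
    print(f"{name:<10} {with_backend:6.2f} % of charted projects, {overall:5.2f} % of all projects")
```

Using two denominators side by side makes explicit that each backend's share depends heavily on whether projects without a pyproject.toml are counted.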
TODO: how are the remaining ~ 67 % of projects packaged, including the ~ 10 % that did not upload a source package?