12 January 2024
This dataset contains CSV and SQLite files with data about project build backends, extracted from “Metadata about every file uploaded to PyPI”.
The aim of this analysis is to produce some statistics on the build backends used in the pyproject.toml files of the Python projects uploaded to PyPI.
As a result, the analysis takes into account only the packages uploaded after 2018.
For each project, only the last source package is considered. It would be interesting to take the last source package of each year to see how build backend usage evolves.
Update: the statistics about build backend evolution over time have been produced by Bastian Venthur in “Investigating popularity of Python build backends over time”.
In a source package, there may be several pyproject.toml files (poetry, meson-python, numpy, flit, jpterm, …) which have different usages (tests, plugins, …). I analysed only the pyproject.toml that is at the root of the project.
I used the path metadata to identify the package files; I should have used the archive_path metadata instead.
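The root-level test mentioned above can be sketched in Python as follows. This is only an illustration and not the code used for the extraction: the helper name `is_root_pyproject` is hypothetical, and I assume here that a path inside a source archive has the form `<top-level directory>/pyproject.toml`.

```python
from pathlib import PurePosixPath


def is_root_pyproject(archive_path: str) -> bool:
    """Return True if the path points at the pyproject.toml at the project root.

    Assumption (not taken from the dataset documentation): paths inside a
    source archive look like "<name>-<version>/pyproject.toml", i.e. there is
    exactly one directory level above the file.
    """
    parts = PurePosixPath(archive_path).parts
    return len(parts) == 2 and parts[1] == "pyproject.toml"


# The root file is kept, a file nested in a test folder is rejected.
print(is_root_pyproject("poetry-1.7.1/pyproject.toml"))
print(is_root_pyproject("poetry-1.7.1/tests/fixtures/simple_project/pyproject.toml"))
```

A test on the number of path components is deliberately basic; it would need adjusting if the path metadata carries additional leading components.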
The files below were produced in the indicated order; the aim was to produce pyproject_backends.db.
For each project having a pyproject.toml file and having been uploaded after 2018, get:
project_name
max project_version
max uploaded_on
list of distinct project_version
list of distinct uploaded_on
list of distinct path
...
From extract-pyproject-all-versions, get the data of the latest uploaded_on date (1).
From extract-pyproject-latest.db, for each project, use only the pyproject.toml file at the root of the project (2).
Source code for the data extraction.
(1) There are several pyproject.toml files for some projects (e.g. poetry), often in test folders.
(2) The test is quite basic, but only a few projects have several pyproject.toml files matching this test.
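The step that keeps only the latest upload of each project can be sketched with Python's sqlite3 module. This is a minimal sketch under stated assumptions: the table name `all_versions`, its columns and the demo rows are hypothetical, and the real extraction was done with a parquet query, not with this code.

```python
import sqlite3

# Hypothetical, simplified table: one row per pyproject.toml found in an upload.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE all_versions ("
    "project_name TEXT, project_version TEXT, uploaded_on TEXT, path TEXT)"
)
demo_rows = [
    ("demo-a", "1.0", "2019-05-01 10:00:00", "demo-a-1.0/pyproject.toml"),
    ("demo-a", "2.0", "2023-03-02 09:30:00", "demo-a-2.0/pyproject.toml"),
    ("demo-a", "2.0", "2023-03-02 09:30:00", "demo-a-2.0/tests/pyproject.toml"),
    ("demo-b", "0.1", "2021-11-20 08:00:00", "demo-b-0.1/pyproject.toml"),
]
conn.executemany("INSERT INTO all_versions VALUES (?, ?, ?, ?)", demo_rows)


def latest_uploads(connection):
    """Keep, for each project, the rows whose uploaded_on is the most recent.

    Timestamps are stored as ISO-like text, so max() on the text column
    matches chronological order.
    """
    query = """
        SELECT a.project_name, a.project_version, a.uploaded_on, a.path
        FROM all_versions AS a
        JOIN (
            SELECT project_name, max(uploaded_on) AS max_uploaded_on
            FROM all_versions
            GROUP BY project_name
        ) AS m
          ON a.project_name = m.project_name
         AND a.uploaded_on = m.max_uploaded_on
        ORDER BY a.project_name, a.path
    """
    return connection.execute(query).fetchall()


for row in latest_uploads(conn):
    print(row)
```

In this demo, demo-a still has two rows after the filter, which is the situation described in note (1): the latest upload can contain several pyproject.toml files, so a second filter on the root path is still needed.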
There were 120 698 “projects” with at least one pyproject.toml in their source package in extract-pyproject-latest.db (cf. the pyproject latest parquet query).
sqlite> select count(distinct project_name) from pyprojects;
120698
The posted charts showed 107 757 projects, not 120 698. This is because some projects have pyproject.toml files that are not at the root of the project or that do not contain a build backend declaration. There are 12 449 projects for which backend is NULL in pyproject_backends.db. TODO: find why the figures still differ.
sqlite> select count(project_name) FROM backends where backend is NULL;
12449
sqlite> select 120175 - 12449;
107726
After the publication of the first charts, I wanted to know how many projects had no source package and how many had no pyproject.toml, in order to complete the first statistics.
I had to run a new parquet query, removing the filter on the presence of pyproject.toml in the package.
The new extraction gives 410 944 different “projects” (this is an extraction of the projects uploaded since 2018).
The following files have been added to the dataset:
project_name
project_version
project_release
suffix(project_release, '.whl') AS wheel
suffix(project_release, '.tar.gz') AS source
max(uploaded_on) AS max_uploaded_on
date_part('year', max(uploaded_on)) AS max_year
list(DISTINCT uploaded_on)
Source code of this new extraction; additional information is available in its README file.
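The wheel and source columns can be read as simple "ends with" tests on the release file name. The sketch below reproduces that logic in Python, assuming that suffix(x, s) is true when x ends with s; the function names and the demo file names are hypothetical.

```python
def classify_release(project_release: str) -> dict:
    """Flag a release file as wheel and/or .tar.gz source from its name."""
    return {
        "wheel": project_release.endswith(".whl"),
        "source": project_release.endswith(".tar.gz"),
    }


def has_targz_source(release_files) -> bool:
    """A project counts as a source project if any of its files is a .tar.gz."""
    return any(classify_release(name)["source"] for name in release_files)


# Hypothetical file names, for illustration only.
wheel_only_project = ["demo_a-1.0-py3-none-any.whl"]
mixed_project = ["demo_b-2.1-py3-none-any.whl", "demo_b-2.1.tar.gz"]
zip_project = ["demo_c-0.3.zip"]

print(has_targz_source(wheel_only_project))
print(has_targz_source(mixed_project))
print(has_targz_source(zip_project))
```

With this test, a project that only publishes a .zip or .bz2 source archive is not counted as a .tar.gz source project.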
There are 410 944 different project_name values identified since 2018 in extract-project-releases-2018-and-later.db.
sqlite> select count(distinct project_name) from pyprojects;
410944
Of these 410 944 projects, ~ 11 % have no source package in .tar.gz format on PyPI (there are .zip and .bz2 source packages, but they are a marginal 0.77 %).
This gives a basis of 366 969 source projects with a .tar.gz format.
We initially found that 120 698 projects have a pyproject.toml file, so we can estimate that projects with a pyproject.toml file represent:
~ 29.4 % of the total projects
~ 32.9 % of the projects with a source package in .tar.gz format
sqlite> select count(distinct project_name) from pyprojects where source = 'true';
366969
sqlite> select 120698 * 100.0 / 410944;
29.3709118517365
sqlite> select 120698 * 100.0 / 366969;
32.8905166376451
Among the 120 698 projects that have a pyproject.toml file (the percentages below are computed on the 107 757 projects shown in the charts):
setuptools represents ~ 46.2 % of the source projects with a pyproject.toml, and ~ 12.1 % of the total projects uploaded since 2018
poetry represents ~ 31.2 % of the source projects with a pyproject.toml, and ~ 8.2 % of the total projects uploaded since 2018
Hatchling represents ~ 7.5 % of the source projects with a pyproject.toml, and ~ 2 % of the total projects uploaded since 2018
extract-pyproject-latest.db
sqlite> select 8113 * 100.0 / 107757;
7.52897723581763
sqlite> select 8113 * 100.0 / 410944;
1.97423493225354
sqlite> select 49819 * 100.0 / 107757;
46.2327273402192
sqlite> select 49819 * 100.0 / 410944;
12.123062996418
sqlite> select 33576 * 100.0 / 107757;
31.1589966313093
sqlite> select 33576 * 100.0 / 410944;
8.1704563152157
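The same percentages can be recomputed in one pass with a few lines of Python. The counts and totals are the ones used in the sqlite queries above; the mapping of each count to a backend name is inferred from the matching percentages, and the names `counts` and `share` are only illustrative.

```python
# Totals taken from the queries above.
PROJECTS_IN_CHARTS = 107_757   # projects with a backend, as shown in the charts
PROJECTS_SINCE_2018 = 410_944  # all projects uploaded since 2018

# Counts per backend, matched to the backends through the percentages above.
counts = {"setuptools": 49_819, "poetry": 33_576, "hatchling": 8_113}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return count * 100.0 / total


for backend, count in counts.items():
    with_pyproject = share(count, PROJECTS_IN_CHARTS)
    overall = share(count, PROJECTS_SINCE_2018)
    print(f"{backend}: {with_pyproject:.2f} % of charted projects, "
          f"{overall:.2f} % of all projects since 2018")
```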
TODO: how are the remaining 67 % of projects packaged, including the 10 % that did not upload a source package?