<img src="https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot32.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot32.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot32.pdf).)

This graph compares sorting performance for various `int32` array sizes
on one core of an AMD Ryzen 5 PRO 5650G
(2021 CPU launch; Zen 3 microarchitecture)
with overclocking disabled, Debian 12, gcc 12.2.0, clang 14.0.6
using the following libraries:
djbsort;
the [vxsort](https://github.com/damageboy/vxsort-cpp) library
[used in Microsoft's .NET](https://github.com/dotnet/runtime/pull/37159)
starting in 2020;
the [vqsort/highway](https://github.com/google/highway) library
[introduced by Google](https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html);
the [far](https://github.com/simd-sorting/fast-and-robust) library, which is the (sometimes faster) parent of vqsort;
and the [x86simdsort](https://github.com/intel/x86-simd-sort.git) library
[introduced by Intel](https://www.phoronix.com/news/Intel-AVX-512-Quicksort-Numpy).
See also the separate
[comparison page](comparison.html)
for a comparison of library features beyond speed.

The far, vqsort, vxsort, and x86simdsort libraries
use vectorized sorting networks for small lengths
and vectorized quicksort for large lengths.
Increasing the length cutoff
and calling djbsort for the base case
would improve the performance of those libraries
not just for small lengths but also for large lengths.
The point is that larger-length sorting uses smaller-length sorting as a subroutine.
Beware that combining code in this way would not achieve the
[security](security.html) and
[verification](verif.html) of djbsort.
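
To illustrate why the base case matters at every length, here is a minimal sketch of a quicksort that hands all ranges of at most `cutoff` elements to a separate small-array sorter. The names `small_sort` and `quicksort_with_cutoff` and the cutoff value are hypothetical, and Python's `sorted` merely stands in for a fast small-length sorter; this is not code from any of the libraries above.

```python
import random

def small_sort(a, lo, hi):
    # Stand-in for a fast small-length sorter (hypothetical placeholder;
    # a real library would use a vectorized sorting network here).
    a[lo:hi] = sorted(a[lo:hi])

def quicksort_with_cutoff(a, lo=0, hi=None, cutoff=256):
    """Sort a[lo:hi] in place; ranges of at most `cutoff` elements go to small_sort."""
    if hi is None:
        hi = len(a)
    while hi - lo > cutoff:
        pivot = a[random.randrange(lo, hi)]
        # Three-way partition: a[lo:lt] < pivot, a[lt:gt] == pivot, a[gt:hi] > pivot.
        lt, i, gt = lo, lo, hi
        while i < gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                gt -= 1
                a[i], a[gt] = a[gt], a[i]
            else:
                i += 1
        # Recurse into the smaller side and keep looping on the larger side.
        if lt - lo < hi - gt:
            quicksort_with_cutoff(a, lo, lt, cutoff)
            lo = gt
        else:
            quicksort_with_cutoff(a, gt, hi, cutoff)
            hi = lt
    # Every element of a large input eventually passes through this call.
    small_sort(a, lo, hi)
```

Every element of a large array ends up in exactly one `small_sort` call, so a faster base case reduces the total time at all lengths.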

The graph also includes "radixwrapper",
a recursive MSD radix sort that calls djbsort for lengths below 8192.
This also does not achieve the security and verification of djbsort.
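
As a rough sketch of the idea (not the actual radixwrapper code), a recursive MSD radix sort with a base-case cutoff could look as follows. The 8-bit digit size and the function names are assumptions made for illustration, and `sorted` stands in for djbsort.

```python
CUTOFF = 8192  # length threshold mentioned in the text

def base_sort(values):
    # Stand-in for the base-case sorter (djbsort in radixwrapper).
    return sorted(values)

def msd_radix_sort(values, shift=24, cutoff=CUTOFF):
    """Return a sorted copy of a list of int32 values."""
    # Short inputs, or inputs with no digits left to split on, go to the base case.
    if len(values) < cutoff or shift < 0:
        return base_sort(values)
    buckets = [[] for _ in range(256)]
    for v in values:
        key = v + 2**31  # bias into 0..2^32-1 so bucket order matches signed order
        buckets[(key >> shift) & 255].append(v)
    out = []
    for bucket in buckets:
        # Buckets are already ordered by the current digit; sort within each bucket.
        out.extend(msd_radix_sort(bucket, shift - 8, cutoff))
    return out
```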

There are also many slower sorting libraries.
The graph includes four of those for comparison:
"stdsort", which is `std::sort` from the installed C++ standard library;
"herf", which is radix-sorting code (with radix 2048) posted by Michael Herf in
[2001](http://stereopsis.com/radix.html)
plus a simplification to sort `int32` rather than `float32`;
["sid1607"](https://github.com/sid1607/avx2-merge-sort),
which is AVX2 merge-sorting code posted by Siddharth Santurkar in 2016;
and
["aspas"](https://github.com/vtsynergy/aspas_sort),
from a 2018 paper
"A framework for the automatic vectorization of parallel sort on x86-based processors".

The radixwrapper results are shown only for array sizes above 8192,
since for smaller array sizes radixwrapper simply calls djbsort.
The herf and aspas results are shown only for array sizes at least 64 and 16 respectively,
since it is clear that `std::sort` is preferable for smaller arrays.
Also, the sid1607 results are shown only for array sizes that are multiples of 64,
since that code crashes for other array sizes.
Note that Lukas Bergdoll reported in 2023
that "~50% of the time spent in `slice::sort`" in "clean compiling 60 crates"
was spent sorting arrays of
[length at most 20](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md).
It would be interesting to see more reports
on how the sorting time spent by applications
is distributed across array sizes and data types.

## How were the numbers collected?

The graph relies on data from
a small C++ benchmarking tool,
[sortbench](#sortbench),
used for all of the libraries.
The tool is run 8 times for each library
(which should catch any frequent VIPT effects, i.e., effects of virtually indexed, physically tagged caches).
Each run tries many array sizes,
including powers of 2 but also many more sizes
(which should catch slowdowns for inconvenient sizes).
For each array size, each run

* generates an array with random contents;
* fills 64 arrays with random permutations of the first array;
* sorts each of those 64 arrays in turn with the tested library, checking the cycle counter after each sort; and
* checks that all of the supposedly sorted arrays match the output of `std::sort`.
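
The steps above can be sketched as follows. This is only an illustration of the procedure, not the sortbench source: the function name `bench_one_size` is hypothetical, and `time.perf_counter_ns` stands in for the cycle counter.

```python
import random
import time

def bench_one_size(sort_in_place, n, copies=64, rng=None):
    """Time `sort_in_place` on `copies` random permutations of one array of length n."""
    if rng is None:
        rng = random.Random()
    # Generate an array with random int32 contents.
    first = [rng.randrange(-2**31, 2**31) for _ in range(n)]
    # Fill `copies` arrays with random permutations of the first array.
    arrays = []
    for _ in range(copies):
        a = list(first)
        rng.shuffle(a)
        arrays.append(a)
    # Sort each array in turn, reading the counter after each sort.
    readings = [time.perf_counter_ns()]
    for a in arrays:
        sort_in_place(a)
        readings.append(time.perf_counter_ns())
    # Check every supposedly sorted array against a reference sort.
    expected = sorted(first)
    for a in arrays:
        if a != expected:
            raise ValueError("output does not match the reference sort")
    # One measurement per array: the difference between successive readings.
    return [t1 - t0 for t0, t1 in zip(readings, readings[1:])]
```

For example, `bench_one_size(list.sort, 1000)` returns 64 timings for Python's built-in sort at length 1000.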

The 64 arrays are spaced by an odd number of integers, so most of them are not aligned to cache lines.
Exception: this spacing is not applied to the sid1607 library, which requires aligned arrays.
(Applications might specifically align arrays to try to save time in any case,
so it would be useful to separately benchmark the aligned case for all libraries.)
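
To see why an odd spacing leaves most arrays unaligned, here is a small worked calculation. It assumes 64-byte cache lines, 4-byte `int32` elements, and a cache-line-aligned starting address; these values and the function name are illustrative assumptions rather than details taken from sortbench.

```python
CACHE_LINE = 64  # bytes per cache line (assumed)
ITEM = 4         # bytes per int32 element

def aligned_copies(stride_items, copies=64, base=0):
    """Return the indices k whose start address base + k*stride_items*ITEM
    is a multiple of the cache-line size."""
    return [k for k in range(copies)
            if (base + k * stride_items * ITEM) % CACHE_LINE == 0]

# Odd spacing: only every 16th array starts on a cache-line boundary.
print(aligned_copies(1001))       # [0, 16, 32, 48]
# Spacing that is a multiple of 16 elements: every array is aligned.
print(len(aligned_copies(1024)))  # 64
```

With an odd spacing, array `k` is aligned only when `k` is a multiple of 16, so 60 of the 64 arrays start in the middle of a cache line.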

For each library at each size,
the graph shows
[stabilized quartiles](https://cr.yp.to/papers.html#rsrst)
of the resulting 512 cycle-count measurements (8 runs, each timing 64 arrays).
The three quartiles (first quartile, median, third quartile) are marked as a horizontal line, a small X, and a horizontal line respectively.
Measurements can vary (producing noticeable vertical distances between the horizontal lines)
because of branch-prediction effects and other pipeline effects,
or because of some sorting algorithms handling different permutations at different speeds.
The benchmarking tool does not try to find "bad" inputs.

The tool collects cycle counts using
[libcpucycles](https://cpucycles.cr.yp.to),
which on the machine mentioned above ends up using RDPMC to read an on-core cycle counter.
The plotting script subtracts the observed cycle-counting overhead
from every cycle count in the graph before drawing the graph.
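
As a simplified sketch of this post-processing, the following subtracts a fixed overhead from each measurement and then computes ordinary quartiles. Note the assumption: ordinary quartiles from Python's `statistics` module are used as a stand-in, and this does not reproduce the stabilized-quartile definition from the linked paper. The function name is hypothetical.

```python
import statistics

def plain_quartiles(cycles, overhead=0):
    """Subtract the cycle-counting overhead, then return (Q1, median, Q3).

    These are ordinary quartiles, not the stabilized quartiles used in the graph.
    """
    adjusted = [c - overhead for c in cycles]
    q1, q2, q3 = statistics.quantiles(adjusted, n=4, method="inclusive")
    return q1, q2, q3
```

A large gap between the first and third values returned corresponds to a noticeable vertical distance between the horizontal lines in the graph.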

Since CPUs typically support overclocking and typically have overclocking enabled by default
(at the expense of
[security problems](https://blog.cr.yp.to/20230609-turboboost.html)
and [decreased hardware longevity](https://ieeexplore.ieee.org/abstract/document/8920389)),
it would also be useful to measure power consumption at various clock frequencies,
and to measure time consumption when overclocking is enabled
with specified numbers of cores active simultaneously.
Enabling overclocking can produce different CPU frequencies
for different libraries with different power consumption,
changing the relative time for those libraries.

## What about `int64` arrays?

<img src="https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot64.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot64.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-cezanne/plot64.pdf).)

There are two reasons
that the speedup of vectorized libraries over `std::sort` for `int64` arrays
is smaller than the speedup over `std::sort` for `int32` arrays.
First, a 256-bit vector instruction carries out 8 `int32` operations
but only 4 `int64` operations.
Second,
the AVX2 instruction set has `min` and `max` instructions for `int32x8` but not for `int64x4`.
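
AVX2 does provide a signed 64-bit greater-than comparison (`vpcmpgtq`) and blend instructions, so one common workaround is to build `min` and `max` from a compare mask followed by a select, at the cost of extra instructions per compare-exchange. The following scalar model sketches that pattern lane by lane; the function names are hypothetical, and this is not a claim about how any particular library above implements it.

```python
MASK64 = (1 << 64) - 1

def to_signed(u):
    # Reinterpret a 64-bit pattern as a signed int64 value.
    return u - (1 << 64) if u >> 63 else u

def minmax_lane(a, b):
    """Min and max of two int64 values via a compare mask and a bitwise select."""
    m = -(a > b) & MASK64            # all ones if a > b, else all zeros
    ua, ub = a & MASK64, b & MASK64  # 64-bit patterns of the two lanes
    lo = (ub & m) | (ua & ~m & MASK64)  # select b where a > b, else a
    hi = (ua & m) | (ub & ~m & MASK64)  # select a where a > b, else b
    return to_signed(lo), to_signed(hi)

def minmax_int64x4(x, y):
    """Lane-wise min and max of two 4-lane vectors (modelled as lists)."""
    pairs = [minmax_lane(a, b) for a, b in zip(x, y)]
    return [p[0] for p in pairs], [p[1] for p in pairs]
```

For `int32x8`, the same compare-exchange needs only one `min` and one `max` instruction.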

## What about other CPUs?

Here's one core of a Broadcom BCM2712
(2023 CPU launch; Cortex-A76 microarchitecture)
in a Raspberry Pi 5
with Debian 13 (via Raspbian), gcc 14.2.0, clang 19.1.7.

<img src="https://sorting.cr.yp.to/sortbench/20260210-pi5/plot32.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-pi5/plot32.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-pi5/plot32.pdf).)

<img src="https://sorting.cr.yp.to/sortbench/20260210-pi5/plot64.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-pi5/plot64.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-pi5/plot64.pdf).)

Here's one core of an Intel Xeon E3-1220 v5
(2015 CPU launch; Skylake microarchitecture)
with Ubuntu 24.04, gcc 13.3.0, clang 18.1.3.

<img src="https://sorting.cr.yp.to/sortbench/20260210-samba/plot32.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-samba/plot32.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-samba/plot32.pdf).)

<img src="https://sorting.cr.yp.to/sortbench/20260210-samba/plot64.png">

(For higher resolution:
[SVG](https://sorting.cr.yp.to/sortbench/20260210-samba/plot64.svg),
[PDF](https://sorting.cr.yp.to/sortbench/20260210-samba/plot64.pdf).)

It would be useful to benchmark more CPUs.
Some libraries obtain a speed boost from AVX-512
(although vxsort would need some lines to change in the sortbench tool for AVX-512).
Also, vqsort supports vector instructions for PowerPC, RISC-V, and other architectures.

## Running sortbench {#sortbench}

Currently the sortbench tool
supports (1) 64-bit Intel/AMD CPUs with AVX2
and (2) 64-bit ARM CPUs.
The tool has been tested under a few versions of Debian, Raspbian, and Ubuntu,
and probably works on more Debian-derived systems.
The tool needs the following system packages:

* gcc and other compiler tools: `apt install build-essential`
* clang: `apt install clang`
* for Ubuntu 22.04, missing C++ libraries: `apt install g++-12`
* cmake: `apt install cmake`
* meson: `apt install meson`
* Python 3: `apt install python3`
* matplotlib: `apt install python3-matplotlib`
* libcpucycles: `apt install libcpucycles-dev` (or install directly from [the source](https://cpucycles.cr.yp.to))

(Combined for Ubuntu 22.04:
`apt install build-essential clang g++-12 cmake meson python3 python3-matplotlib` and install libcpucycles from source.
Combined for Debian 12:
`apt install build-essential clang cmake meson python3 python3-matplotlib` and install libcpucycles from source.
Combined for Ubuntu 24.04 or Debian 13:
`apt install build-essential clang cmake meson python3 python3-matplotlib libcpucycles-dev`.)

The sortbench tool is included in the djbsort source distribution.
It is not
[installed](install.html) as part of a djbsort binary package
(and is not incorporated into the `djbsort-speed` utility).
So the next step is to
[download and unpack](download.html) djbsort
as an unprivileged user.

You still need a network connection after this for running the sortbench tool:
the tool automatically downloads copies of the aspas, far, sid1607, vqsort, vxsort, and x86simdsort libraries.
The tool assumes you have already installed djbsort (as a home-directory installation or a system installation)
and stdsort (as part of the standard C++ library).
The tool includes herf and radixwrapper, both of which are very small.

Last step:

    cd sortbench
    ./do

Results should end up in
`plot32.pdf`, `plot32.svg`, `plot32.png`,
`plot64.pdf`, `plot64.svg`, and `plot64.png`
in the `sortbench` directory.

The tool runs benchmarks on a single core
(without restricting the number of cores used for compiling the libraries).
The benchmarks typically take 10 to 20 minutes, depending on the core speed;
plotting takes under a minute.
More time can be taken by compilation,
but overall it is not surprising if `./do` finishes in under an hour total.

If you later run `rm */skipbench; ./do`
then the libraries will not be recompiled
but the benchmarks will be re-run.
This is useful if you have enabled
[more reliable cycle counters](https://cpucycles.cr.yp.to/counters.html)
in the meantime.
If you run `rm */skiprebuild; ./do`
then the libraries will be recompiled
and the benchmarks will be re-run.

One known failure mode in `./do` is vqsort compilation running out of RAM.
If this happens, try `taskset -c 0 ./do` to limit everything to core 0.
If that still isn't enough, try adding some swap space.
(As root, to add 16GB of swap space until reboot:
`mkdir /root/swap; chmod 700 /root/swap; dd if=/dev/zero of=/root/swap/tmp1 bs=16777216 count=1024; chmod 600 /root/swap/tmp1; mkswap /root/swap/tmp1; swapon /root/swap/tmp1`.)
The vqsort compilation options are also now adjusted by sortbench to skip compiling vqsort's tests and especially googletest.

For CPUs with heterogeneous cores (such as P-cores and E-cores in many recent Intel CPUs),
you should benchmark each type of core separately.
For example,

    taskset -c 0-3 ./do
    mkdir threads0-3
    mv plot*.* threads0-3
    taskset -c 4-7 ./do
    mkdir threads4-7
    mv plot*.* threads4-7

will limit benchmarks to OS threads 0–3 first, then to OS threads 4–7.
There is no standard mechanism to figure out the partition of threads into core types, but
`grep . /sys/devices/system/cpu/cpu*/cache/index*/size`
usually makes the patterns clear.
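
As a convenience, the same sysfs files can be grouped automatically. The following is a minimal sketch assuming the Linux sysfs layout used by the `grep` command above; the function names are hypothetical, and CPUs with identical cache sizes are not guaranteed to be the same core type.

```python
import glob
import re
from collections import defaultdict

def read_cache_sizes(root="/sys/devices/system/cpu"):
    """Map each CPU number to a tuple of (cache index name, size string)."""
    sizes = defaultdict(list)
    for path in sorted(glob.glob(f"{root}/cpu[0-9]*/cache/index[0-9]*/size")):
        m = re.search(r"/cpu(\d+)/cache/(index\d+)/size$", path)
        with open(path) as f:
            sizes[int(m.group(1))].append((m.group(2), f.read().strip()))
    return {cpu: tuple(sorted(entries)) for cpu, entries in sizes.items()}

def group_by_cache_profile(sizes):
    """Group CPU numbers that report identical cache sizes."""
    groups = defaultdict(list)
    for cpu, profile in sizes.items():
        groups[profile].append(cpu)
    return sorted(sorted(cpus) for cpus in groups.values())

if __name__ == "__main__":
    # Prints one list of OS thread numbers per distinct cache profile.
    for cpus in group_by_cache_profile(read_cache_sizes()):
        print(cpus)
```

Each printed list is a candidate argument for `taskset -c`, after checking that it matches the expected core layout of the CPU.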
