Skip to content
HN On Hacker News ↗

Neoclassical C++: segmented iterators revisited (1)

▲ 24 points 2 comments by ibobev 2d ago HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with some AI-generated content detected

19 %

AI likelihood · overall

Mixed
82% human-written 18% AI-generated
SEGMENTS · HUMAN 5 of 5
SEGMENTS · AI 0 of 5
WORD COUNT 1,576
PEAK AI % 10% · §3
Analyzed
May 24
backend: pangram/v3.3
Segments scanned
5 windows
avg 315 words each
Distribution
82 / 18%
human / AI fraction
Verdict
Mixed
Pangram v3.3

Article text · 1,576 words · 5 segments analyzed

Human AI-generated
§1 Human · 5%

Introduction to segmented iterators

The legendary STL library has been an inspiration in the development of the C++ language and libraries. The understanding of the concepts and code developed by Stepanov, Lee, Musser, Austern, etc. is a treasure trove for any C++ programmer.

Stepanov’s core claim was that abstraction should have near-zero overhead. He argued that generic programming, done correctly, should not impose a noticeable runtime cost (abstraction penalty) compared to hand-written specialized code. This was not just a hope but a design principle of the STL.

Following this philosophy, in 2000 Matt Austern published the paper Segmented Iterators and Hierarchical Algorithms. The paper’s core argument was that many data structures are naturally segmented (e.g. deque), and generic algorithms that ignore that feature — treating every data structure as a uniform range of elements — are unnecessarily inefficient. A new kind of iterator abstraction, in which segmentation is explicit, could make possible to write hierarchical algorithms that exploit segmentation and reduce the abstraction penalty associated with standard iterators.

The classic motivating example is std::deque, which is internally a sequence of fixed-size arrays. Every ++it on a deque iterator may cross a block boundary, so inplicitly std::find or std::fill has to evaluate, on every step, whether the current pointer has reached the end of the current block, an indirection that makes deque iteration measurably slower and also defeats most compilers’ auto-vectorisers.

A segmented iterator could be a two-level structure, allowing an algorithm to operate on each contiguous segment efficiently (for instance, calling memset or a tight inner loop on each chunk) and only handle the segment-to-segment transitions at the outer level.

This paper remained influential but its ideas were never adopted into the C++ standard. On the other hand, some libraries have internally developed those core ideas. Some examples::

Around 2023, libc++‘s std::for_each was optimized for segmented iterators, which lead to high performance improvements and there is an ongoing effort to add more algorithms.)

Some Boost libraries, like Boost.PolyCollection, have internal algorithm implementations specially tuned for their internal segmented nature.

Austern’s paper explained

A traditional iterator presents a flat view: increment, decrement, dereference.

§2 Human · 1%

A segmented iterator additionally admits being decomposed into two coordinates:

a segment iterator that walks the outer sequence of segments (the blocks of a deque, the chunks of a chunk-list, the buckets of a chained hash table, etc.). This iterator is not dereferenceable.

a local iterator that walks the inner range inside one segment. This iterator is dereferenceable.

A hierarchical algorithm is then a generic algorithm that, when given a segmented iterator, dispatches to a two-level loop: an outer loop over the segments and, inside each segment, a simplified algorithm over a higher performance, less segmented, local_iterator that could be further segmented into new, lower-level, segment iterator/local iterator pairs. Ultimately, this recursive segmentation reaches to a non-segmentable, probably highly efficient, local_iterator.

For a deque<T> the picture is the canonical one-level segmentation case: a segment_iterator is essentially walking the block index, and the corresponding local_iterator is walking inside one block.

For a deeper container such as vector<vector<vector<T>>> the same decomposition recurses: the local_iterator produced at the outer level can be itself decomposed in a new segment_iterator (its segments are the middle vectors) and local_iterator, and the local_iterator of the innermost vector<T> is truly flat.

For transformations between iterators, local_iterators, and segment_iterators Austern proposes a segmented_iterator_traits class template that defines the primitives for segmentation-aware algorithms. Its synopsis, transcribed from the paper, is:

//General definition containing a flag identifying the iterator as non-segmented template <class Iterator> struct segmented_iterator_traits { typedef /* Type::value == false */ is_segmented_iterator; };

//The traits class is specialized for every segmented iterator type template <class Iterator> struct segmented_iterator_traits<Iterator> { // is_segmented_iterator::value == true if Iterator is segmented typedef /* unspecified */ is_segmented_iterator;

// A non-dereferenceable iterator that

§3 Human · 10%

traverses the sequence of segments typedef /* unspecified */ segment_iterator;

// An efficient dereferenceable iterator, traversing within a single segment // This could be further decomposed into a lower-level segment and local iterator typedef /* unspecified */ local_iterator;

// Return the segment that contains the current position of it static segment_iterator segment(Iterator it);

// Returns the position within the current segment of it static local_iterator local(Iterator it);

// Constructs higher-level iterator from segment + local static Iterator compose(segment_iterator s, local_iterator l);

// Returns a local iterator to the first element of the segment static local_iterator begin(segment_iterator s);

// Returns a local iterator one past the last element of the segment static local_iterator end(segment_iterator s); };Code language: C++ (cpp)

With that trait in place, the paper’s canonical example — fill over a segmented range — is written as a three-piece walk: the tail of the first segment, the whole of every interior segment, and the head of the last segment:

template <class SegIt, class T> void hierarchical_fill(SegIt first, SegIt last, const T& x) { typedef segmented_iterator_traits<SegIt> traits; typename traits::segment_iterator sf = traits::segment(first); typename traits::segment_iterator sl = traits::segment(last);

if (sf == sl) { //Single segment, fast path std::fill(traits::local(first), traits::local(last), x); } else { //Multi-segment case std::fill(traits::local(first), traits::end(sf), x); //Partial initial for (++sf; sf != sl; ++sf) //Middle "full" segments std::fill(traits::begin(sf), traits::end(sf), x); std::fill(traits::begin(sl), traits::local(last), x); //Partial final } }Code language: C++ (cpp)

The three calls to std::fill are flat loops over local_iterator, which can be as efficient as a T*. Austern’s paper claims that ~20% performance improvement could be achieved when compared to the traditional STL-like std::fill implementation.

§4 Human · 0%

However, we’ll see that with modern compilers, an efficient local_iterator allows much higher performance levels.

How Boost.Container is experimenting with segmentation

Boost.Container has started an experimental work on segmented algorithms closely following the original proposal. Time will tell if this experimental effort will lead to another Boost library proposal or it will remain as an implementation detail to speed up segmented data structures like deque, hub, etc.

You can find the ongoing work in the Boost.Container repository, including implementation, preliminary tests and very early benchmarks.

Initial experiments follow the original paper implementing algorithms that:

tag-dispatch on segmented_iterator_traits<Iter>::is_segmented_iterator.

When the iterator is effectively detected as segmented, for simple algorithms (more complex algorithms need more novel approaches), the algorithm decomposes the input range into first segment, middle segments, and last segment, calls itself recursively on each, and ultimately bottoms out on a flat loop over the local_iterator type.

When the passed iterator is not detected as segmented the algorithm collapses to a classic loop and behaves like the canonical standard library version.

Note: The multi-segment first-segment + middle-segment + end-segment approach from Austern’s paper can be simplified to single a while loop, but this can downgrade the performance for fixed size segment data structures (such as deque) as compilers can more easily auto-vectorize address ranges whose length is known at compile-time. I’ve found that single while loop with no fixed boundaries (start and end of a segment) is noticeably slower in many cases since the compiler can’t guarantee vectorization can work in every iteration.

template <class SegIt, class T> void hierarchical_fill(SegIt first, SegIt last, const T& x) { typedef segmented_iterator_traits<SegIt> traits; typename traits::segment_iterator sf = traits::segment(first); typename traits::segment_iterator sl = traits::segment(last); typename traits::local_iterator lf = traits::local(first);

while (true) { typename traits::local_iterator le = (sf == sl) ? traits::local(last) : traits::end(sf); std::fill(lf, le, x); if (sf == sl) break; lf = traits::begin(++sf); } }

The goal of this article is to show how these experimental segmented algorithms perform, starting with the classic boost::container::deque<T> container.

§5 Human · 4%

As explained in the previous blog post, in this deque implementation the block “index” is implemented as an array of T* that point to fixed-size blocks (arrays) of type T. So for Boost’s deque::iterator:

segment_iterator is defined as a T** walking between blocks.

local_iterator is defined as a plain T* that can efficiently move inside one contiguous block of memory.

Low hanging fruit: Benchmarking single-range, single-pass algorithms with deque

Fortunately, Austern’s fill example can be fully reused on many other similar algorithms. Unbounded algorithms taking length arguments instead of ranges (such as fill_n, generate_n) need a slightly more complicated segmentation handling, but essentially follow Austern’s proposal

Note: Segmentation handling can become much more complicated when the algorithm takes more than one input iterator range and/or output iterator ranges (e.g. std::merge takes two input ranges and an output iterator, with potentially three different segmented iterator types).

Algorithms like reverse that must iterate the range forward and backwards at the same time are also a challenge.

Finally, in some cases, an implementation taking advantage of a recursively segmented iterator (general case) is more complicated than handling a single-level segmentation.

Benchmark highlights (you can find the general benchmark code here):

Every test runs against the same boost::container::deque<T> with a fixed block size of 128 elements per block, with size() == 100'000.

The per-call cost is averaged over several thousand invocations.

The container is filled with contiguous positive values 0, 1, 2, …, N-1 (so all elements are non-negative, distinct and strictly ascending)

Where it makes sense the harness probes both a (hit) and a (miss) variant that refers to the answer the algorithm returns.

The selected 27 sub-bencharks (16 algorithms with “hit” + “miss” variants where applicable) are summarised in the table below.

Algorithm Hit case Miss case

all_of Predicate x ≥ 0 — true everywhere; must scan every element Predicate x ≠ N/2 the algorithm short-circuits in the middle.

any_of Predicate x == N/2 short-circuits in the middle. Predicate x < 0 full pass before answering “no”.

count Value = 0 counts a single hit but still scans the whole deque.