Bartlomiej Filipek
In this blog post, we’ll delve into the unroll<N>() template function for template unrolling, understand its mechanics, and see how it can improve your code. We’ll look at lambdas, fold expressions, and integer sequences.
Let’s get started!
In a recent article, “Vector math library codegen in Debug” (Aras’ website), Aras Pranckevičius discusses some coding techniques that help with the performance of debug code. While reading it, I came across an intriguing technique used in the Blender Math library that he covered in his text.
One interesting example was this one:
friend VecBase operator+(const VecBase &a, const VecBase &b)
{
VecBase result;
unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
return result;
}
And I’m curious how this unroll<Size>() function works under the hood.
Why unroll() Matters
Before we go into the intricacies of the unroll() function, it’s worth understanding why such a technique is valuable. In performance-critical applications, such as graphics rendering, real-time simulations, or high-frequency trading, every millisecond counts. Traditional loops, while easy to write, carry per-iteration overhead (a counter update, a comparison, and a branch) that compile-time techniques like loop unrolling can reduce or eliminate.
In short, template unrolling automates the expansion of loops during compilation, replacing iterative constructs with repetitive code blocks.
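To make the idea concrete, here is a minimal sketch of what unrolling means for a four-element addition. The names add_loop, add_unrolled and check_add are hypothetical and exist only for this illustration; both versions compute the same result, and the second one is the shape that unroll() aims to generate automatically.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: an ordinary loop with a counter, a comparison,
// and a branch on every iteration.
std::array<float, 4> add_loop(const std::array<float, 4>& a,
                              const std::array<float, 4>& b) {
    std::array<float, 4> result{};
    for (std::size_t i = 0; i < 4; ++i) {
        result[i] = a[i] + b[i];
    }
    return result;
}

// Hypothetical helper: the same work written out by hand, with no loop.
std::array<float, 4> add_unrolled(const std::array<float, 4>& a,
                                  const std::array<float, 4>& b) {
    std::array<float, 4> result{};
    result[0] = a[0] + b[0];
    result[1] = a[1] + b[1];
    result[2] = a[2] + b[2];
    result[3] = a[3] + b[3];
    return result;
}

// Small self-check: both versions agree on a sample input.
inline void check_add() {
    const std::array<float, 4> a{1.0f, 2.0f, 3.0f, 4.0f};
    const std::array<float, 4> b{5.0f, 6.0f, 7.0f, 8.0f};
    assert(add_loop(a, b) == add_unrolled(a, b));
}
```

Writing the unrolled form by hand does not scale beyond a few elements, which is why the rest of the article lets the compiler produce it.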
The unroll() Template Function
Let’s break down the unroll() function inspired by Blender’s C++ math library. This function leverages modern C++ features such as lambdas, variadic templates, and fold expressions to perform compile-time loop unrolling efficiently.
Here’s a simplified implementation of the unroll() function:
#include <cstddef>  // std::size_t
#include <utility>  // std::index_sequence, std::make_index_sequence

// Helper to implement unroll via parameter pack expansion
template<class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...); // Calls fn(0), fn(1), ..., fn(N-1)
}

// Primary unroll function
template<int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}
Breaking It Down:
- unroll_impl:
  - fn: The lambda function to execute.
  - std::index_sequence<I...>: A compile-time sequence of indices.
  - Uses the fold expression (fn(I), ...) to call fn for each index in the sequence.
- unroll:
  - N: The number of times to unroll (i.e., the size).
  - fn: The lambda function to execute.
  - Generates an index_sequence from 0 to N-1 using std::make_index_sequence<N>() and passes it to unroll_impl.

This setup ensures that the lambda fn is invoked exactly N times, each with a unique index from 0 to N-1.
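As a quick check of that behaviour, the following sketch records the indices the lambda receives. It repeats the unroll() definition so that the block compiles on its own; collect_indices is a hypothetical helper introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template<class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);  // comma fold: evaluates fn(0), fn(1), ... from left to right
}

template<int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: returns the indices in the order the lambda saw them.
template<int N>
std::vector<std::size_t> collect_indices() {
    std::vector<std::size_t> seen;
    unroll<N>([&](auto i) { seen.push_back(i); });
    return seen;
}
```

Calling collect_indices<4>() yields 0, 1, 2, 3, which matches the left-to-right evaluation order of the comma fold. For N equal to 0 the pack is empty and the lambda is never called.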
You can learn more about iteration at compile time in my other article: C++ Templates: How to Iterate through std::tuple: the Basics - C++ Stories
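As a small taste of compile-time iteration over a tuple, here is a sketch of a variation on unroll() that is not part of the code shown in this article: each index is passed as a std::integral_constant, so its value can be used as a template argument. The names unroll_constant, unroll_constant_impl and sum_tuple are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical variant of unroll_impl: each index arrives wrapped in a
// std::integral_constant, so its value is available at compile time.
template<class Fn, std::size_t... I>
void unroll_constant_impl(Fn fn, std::index_sequence<I...>) {
    (fn(std::integral_constant<std::size_t, I>{}), ...);
}

template<std::size_t N, class Fn>
void unroll_constant(Fn fn) {
    unroll_constant_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: sums the elements of a tuple of arithmetic types.
template<class... Ts>
double sum_tuple(const std::tuple<Ts...>& t) {
    double total = 0.0;
    unroll_constant<sizeof...(Ts)>([&](auto i) {
        // decltype(i)::value is a constant expression, so it can index std::get
        total += static_cast<double>(std::get<decltype(i)::value>(t));
    });
    return total;
}
```

With the plain unroll() from above, the index is an ordinary std::size_t value inside the lambda, so it cannot be passed to std::get; wrapping it in a type is one way around that.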
To illustrate the power of unroll() combined with lambdas, let’s implement a simple vector addition operation.
#include <array>
#include <cassert>
#include <iostream>

// Base vector structure with 4 components
template<typename T>
struct Vector4 {
    T x, y, z, w;

    // Element access using indices.
    // Note: the cast assumes x, y, z, w are laid out contiguously without
    // padding; this holds on common compilers but is not guaranteed by the standard.
    T& operator[](int index) {
        assert(index >= 0 && index < 4);
        return reinterpret_cast<T*>(this)[index];
    }

    const T& operator[](int index) const {
        assert(index >= 0 && index < 4);
        return reinterpret_cast<const T*>(this)[index];
    }

    // Vector addition using unroll and lambda
    Vector4 operator+(const Vector4& other) const {
        Vector4 result;
        unroll<4>([&](auto i) {
            result[i] = (*this)[i] + other[i];
        });
        return result;
    }
};
- Vector structure: Vector4 holds four components: x, y, z, and w.
- operator[]: Allows accessing components via indices 0 to 3.
- Vector addition (operator+):
  - Creates a new Vector4 named result.
  - Calls unroll<4>() with a lambda that adds corresponding components:
      result[0] = this->x + other.x
      result[1] = this->y + other.y
      result[2] = this->z + other.z
      result[3] = this->w + other.w
  - Returns the result vector.

The Blender Math code is available here: @Github commit
Using the Vector Addition
template<typename T>
std::ostream& operator<<(std::ostream& os, const Vector4<T>& v) {
    unroll<4>([&](auto i) {
        os << v[i] << " ";
    });
    return os;
}

int main() {
    Vector4<float> vec1 = {1.0f, 2.0f, 3.0f, 4.0f};
    Vector4<float> vec2 = {5.0f, 6.0f, 7.0f, 8.0f};
    Vector4<float> sum = vec1 + vec2;
    std::cout << "Sum: " << sum;
}
Play with the code @Compiler Explorer
When vec1 + vec2 is executed:
- The lambda inside operator+ is called four times (for indices 0 to 3), performing component-wise addition.
- Thanks to unroll(), there’s no loop overhead: the compiler expands these calls at compile time.
- The result is a new Vector4 containing the sums of corresponding components.

This approach not only enhances performance but also keeps the code clean and easy to understand.
unroll() isn’t the only choice for loop unrolling; here are some others worth mentioning:
If you cannot use fold expressions, then here’s a recursive solution:
// Primary template for unrolling
template<int N>
struct Unroll {
    template<typename Fn>
    static inline void apply(Fn fn) {
        Unroll<N - 1>::apply(fn); // Recurse with N-1
        fn(N - 1);                // Process the current index
    }
};

// Specialization for the base case
template<>
struct Unroll<0> {
    template<typename Fn>
    static inline void apply(Fn fn) {
        // Base case: do nothing
    }
};
And here’s a working example @Compiler Explorer
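For reference, here is a minimal sketch of how the recursive version can be called. The Unroll struct is repeated so the block compiles on its own, and visit_order is a hypothetical helper that records the order in which the indices are visited.

```cpp
#include <cassert>
#include <vector>

template<int N>
struct Unroll {
    template<typename Fn>
    static void apply(Fn fn) {
        Unroll<N - 1>::apply(fn);  // first handle indices 0..N-2
        fn(N - 1);                 // then the current index
    }
};

template<>
struct Unroll<0> {
    template<typename Fn>
    static void apply(Fn) {}  // base case: nothing to do
};

// Hypothetical helper: returns the indices in the order they were visited.
template<int N>
std::vector<int> visit_order() {
    std::vector<int> seen;
    Unroll<N>::apply([&](int i) { seen.push_back(i); });
    return seen;
}
```

Because the recursion happens before the call to fn, the indices come out in ascending order, the same order the fold-expression version produces.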
There’s also a good dot product example in the great book on templates: C++ Templates: The Complete Guide (2nd Edition)
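To connect this with the unroll() helper from this article, here is a sketch of a dot product written on top of it. This is not the version from the book; dot is a hypothetical function, and unroll() is repeated so the block compiles on its own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

template<class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);
}

template<int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: dot product of two fixed-size arrays.
template<typename T, std::size_t N>
T dot(const std::array<T, N>& a, const std::array<T, N>& b) {
    T sum{};
    // Expands to sum += a[0] * b[0]; sum += a[1] * b[1]; ...
    unroll<N>([&](auto i) { sum += a[i] * b[i]; });
    return sum;
}
```

For the arrays {1, 2, 3} and {4, 5, 6}, dot returns 1*4 + 2*5 + 3*6 = 32.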
In this text, we explored an interesting technique for unrolling repetitive and simple loop statements. Thanks to C++17 features like fold expressions, combined with templates and lambdas, the code is elegant and easy to understand.
Read more: