Bartlomiej Filipek

Practical C++17: Loop Unrolling with Lambdas and Fold Expressions

2024-9-22

In this blog post, we’ll delve into the unroll<N>() template function for compile-time loop unrolling, understand its mechanics, and see how it can improve your code. Along the way, we’ll look at lambdas, fold expressions, and integer sequences.

Let’s get started!

A little background

In a recent article, “Vector math library codegen in Debug”, Aras Pranckevičius discusses some coding techniques that help with the performance of debug code. While reading it, I came across an intriguing technique from the Blender Math library that he used in his text.

One interesting example was this one:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}

I was curious how this unroll<Size>() function works under the hood.

Why `unroll()` Matters

Before we go into the intricacies of the unroll() function, it’s good to understand why such a technique is valuable. In performance-critical applications, such as graphics rendering, real-time simulations, or high-frequency trading, every millisecond counts. Traditional loops, while easy to write, introduce runtime overhead (a counter update, a comparison, and a branch on every iteration) that can be minimized or eliminated using compile-time techniques like loop unrolling.

In short, template unrolling automates the expansion of loops during compilation, replacing iterative constructs with repetitive code blocks.
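To make the idea concrete, here is a minimal sketch of what unrolling means for a four-element addition. The function names add_loop, add_unrolled, and example_add are made up for this illustration; the second function is the straight-line form that an unrolling mechanism aims to produce automatically.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Ordinary loop: a counter update, a comparison, and a branch per iteration.
inline std::array<float, 4> add_loop(const std::array<float, 4>& a,
                                     const std::array<float, 4>& b) {
    std::array<float, 4> r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = a[i] + b[i];
    return r;
}

// The same work with the loop expanded by hand into four statements.
inline std::array<float, 4> add_unrolled(const std::array<float, 4>& a,
                                         const std::array<float, 4>& b) {
    std::array<float, 4> r{};
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
    r[3] = a[3] + b[3];
    return r;
}

// Usage (hypothetical helper): both versions compute the same sums.
inline void example_add() {
    const std::array<float, 4> a{1.0f, 2.0f, 3.0f, 4.0f};
    const std::array<float, 4> b{5.0f, 6.0f, 7.0f, 8.0f};
    assert(add_loop(a, b) == add_unrolled(a, b));
}
```

Note that an optimizing compiler will often turn the first form into the second on its own; the explicit techniques in this post matter most when you want that expansion to be part of the source rather than an optimizer decision.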

Introducing the `unroll()` Template Function

Let’s break down the unroll() function inspired by Blender’s C++ math library. This function leverages modern C++ features such as lambdas, variadic templates, and fold expressions to perform compile-time loop unrolling efficiently.

Here’s a simplified implementation of the unroll() function:

#include <cstddef>  // std::size_t
#include <utility>  // std::index_sequence, std::make_index_sequence

// Helper to implement unroll via parameter pack expansion
template<class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...); // Calls fn(0), fn(1), ..., fn(N-1)
}

// Primary unroll function
template<int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

Breaking It Down:

unroll_impl:
- fn: The lambda function to execute.
- std::index_sequence<I...>: A compile-time sequence of indices.
- Uses a fold expression over the comma operator, (fn(I), ...), to call fn once for each index in the sequence.
unroll:
- N: The number of times to unroll (i.e., the size).
- fn: The lambda function to execute.
- Generates the index sequence 0, 1, ..., N-1 with std::make_index_sequence<N>() and passes it to unroll_impl.

This setup ensures that the lambda fn is invoked exactly N times, each with a unique index from 0 to N-1.
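The call order can be checked with a small experiment. For N = 3, the fold expression (fn(I), ...) expands to (fn(0), (fn(1), fn(2))), and the built-in comma operator evaluates its left operand first, so the indices arrive in ascending order. The sketch below repeats unroll() so that it is self-contained; visited_indices and example_order are hypothetical helpers added only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);  // unary right fold over the comma operator
}

template <int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: records the indices in the order fn receives them.
template <int N>
std::vector<std::size_t> visited_indices() {
    std::vector<std::size_t> seen;
    unroll<N>([&](auto i) { seen.push_back(i); });
    assert(seen.size() == static_cast<std::size_t>(N));  // exactly N calls
    return seen;
}

// Usage: three calls, with indices 0, 1, 2 in that order.
inline void example_order() {
    const std::vector<std::size_t> expected{0, 1, 2};
    assert(visited_indices<3>() == expected);
}
```

With N = 0 the pack is empty, the fold over the comma operator collapses to void(), and fn is never called.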

You can learn more about iteration at compile time in my other article on C++ Stories: “C++ Templates: How to Iterate through std::tuple: the Basics”.
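One limitation of the version above is that the lambda receives the index as an ordinary std::size_t value, so it cannot be used as a template argument such as the N in std::get<N>. A possible variation, sketched here under the assumption that you want tuple-style access, passes each index wrapped in std::integral_constant instead. The names unroll_ct, unroll_ct_impl, join_tuple, and example_join_tuple are invented for this example and are not part of Blender’s code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical variant: each index travels in its type, not only its value.
template <class Fn, std::size_t... I>
void unroll_ct_impl(Fn fn, std::index_sequence<I...>) {
    (fn(std::integral_constant<std::size_t, I>{}), ...);
}

template <std::size_t N, class Fn>
void unroll_ct(Fn fn) {
    unroll_ct_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: streams every tuple element, followed by a space.
template <class... Ts>
std::string join_tuple(const std::tuple<Ts...>& t) {
    std::ostringstream out;
    unroll_ct<sizeof...(Ts)>([&](auto i) {
        // decltype(i)::value is a compile-time constant, so std::get accepts it.
        out << std::get<decltype(i)::value>(t) << ' ';
    });
    return out.str();
}

// Usage: elements of different types are visited in order.
inline void example_join_tuple() {
    const auto t = std::make_tuple(1, std::string("a"), 2.5);
    assert(join_tuple(t) == "1 a 2.5 ");
}
```

Because std::integral_constant converts implicitly to its value, the same lambda body can still use i as a plain index where a runtime value is enough.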

Practical Example: Vector Addition

To illustrate the power of unroll() combined with lambdas, let’s implement a simple vector addition operation.

#include <array>
#include <cassert>
#include <iostream>

// Base vector structure with 4 components
template<typename T>
struct Vector4 {
    T x, y, z, w;

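    // Element access using indices.
    // A table of pointers to members maps 0..3 to x, y, z, w without
    // relying on pointer arithmetic across separate data members.
    T& operator[](int index) {
        assert(index >= 0 && index < 4);
        constexpr T Vector4::* members[] = {&Vector4::x, &Vector4::y, &Vector4::z, &Vector4::w};
        return this->*members[index];
    }

    const T& operator[](int index) const {
        assert(index >= 0 && index < 4);
        constexpr T Vector4::* members[] = {&Vector4::x, &Vector4::y, &Vector4::z, &Vector4::w};
        return this->*members[index];
    }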

    // Vector addition using unroll and lambda
    Vector4 operator+(const Vector4& other) const {
        Vector4 result;
        unroll<4>([&](auto i) {
            result[i] = (*this)[i] + other[i];
        });
        return result;
    }
};

Vector4 Structure: Holds four components—x, y, z, and w.
operator[]: Allows accessing components via indices 0 to 3.
Addition Operator (operator+):
- Creates a new Vector4 named result.
- Calls unroll<4>() with a lambda that adds corresponding components:
  - result[0] = this->x + other.x
  - result[1] = this->y + other.y
  - result[2] = this->z + other.z
  - result[3] = this->w + other.w
- Returns the result vector.

The Blender Math code is available here: @Github commit

Using the Vector Addition

template<typename T>
std::ostream& operator<<(std::ostream& os, const Vector4<T>& v) {
    unroll<4>([&](auto i) {
        os << v[i] << " ";
    });
    return os;
}

int main() {
    Vector4<float> vec1 = {1.0f, 2.0f, 3.0f, 4.0f};
    Vector4<float> vec2 = {5.0f, 6.0f, 7.0f, 8.0f};

    Vector4<float> sum = vec1 + vec2;

    std::cout << "Sum: " << sum;
}

Play with the code @Compiler Explorer

When vec1 + vec2 is executed:

The lambda inside operator+ is called four times (for indices 0 to 3), performing component-wise addition.
Thanks to unroll(), there’s no loop counter or branch: the fold expression expands into four separate calls at compile time, which an optimizing compiler will typically inline.
The result is a new Vector4 containing the sums of corresponding components.

In optimized builds, this approach can improve performance while keeping the code clean and easy to understand.
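The same pattern extends to vectors of any fixed size, which is closer in spirit to the unroll<Size>() call in the Blender snippet at the start of this post. The sketch below is only an approximation under my own assumptions: VecN and example_vecn are hypothetical names, and storing the components in a std::array is a simplification rather than a description of Blender’s VecBase.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

template <class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);
}

template <int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical size-generic vector (not Blender's VecBase).
template <typename T, int Size>
struct VecN {
    std::array<T, Size> data{};

    T& operator[](std::size_t i) {
        assert(i < static_cast<std::size_t>(Size));
        return data[i];
    }

    const T& operator[](std::size_t i) const {
        assert(i < static_cast<std::size_t>(Size));
        return data[i];
    }

    // Component-wise addition: one lambda call per component.
    friend VecN operator+(const VecN& a, const VecN& b) {
        VecN result;
        unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
        return result;
    }
};

// Usage: a three-component addition.
inline void example_vecn() {
    const VecN<float, 3> a{{1.0f, 2.0f, 3.0f}};
    const VecN<float, 3> b{{10.0f, 20.0f, 30.0f}};
    const VecN<float, 3> c = a + b;
    assert(c[0] == 11.0f && c[1] == 22.0f && c[2] == 33.0f);
}
```

Because Size is a template parameter, the number of lambda calls is fixed separately for each instantiation, so VecN<float, 3> and VecN<float, 4> each get their own expansion.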

Other techniques

unroll() isn’t the only choice for loop unrolling; here are some others worth mentioning:

Manual Loop Unrolling: This technique involves explicitly writing out each iteration of the loop in your code. It’s straightforward and gives you complete control over the unrolling process. However, it can become tedious and error-prone for larger loops, and it may reduce code readability and maintainability.
Compiler Pragmas/Directives: Many compilers offer pragmas or directives that suggest or enforce loop unrolling. This method is easy to apply and allows the compiler to handle the complexity of unrolling. However, it is compiler-dependent, meaning not all compilers support the same pragmas, and the results may vary.
SIMD (Single Instruction, Multiple Data) Instructions: SIMD instructions enable the execution of the same operation on multiple data points simultaneously, effectively unrolling loops at the hardware level. This can lead to substantial performance improvements by utilizing the parallel processing capabilities of modern CPUs. The downside is that it requires specific knowledge of hardware instructions, making the code less portable and more complex.
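As an illustration of the pragma approach, here is a small sketch that asks GCC or Clang to unroll a fixed-size loop. It uses #pragma GCC unroll and #pragma clang loop unroll(full), which those two compilers document; other compilers get the plain loop. The functions sum4 and example_sum4 are hypothetical, and the pragma remains a request to the optimizer, so the generated code depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: sums four floats, hinting that the loop be unrolled.
inline float sum4(const float* v) {
    assert(v != nullptr);
    float total = 0.0f;
#if defined(__clang__)
    #pragma clang loop unroll(full)
#elif defined(__GNUC__)
    #pragma GCC unroll 4
#endif
    for (std::size_t i = 0; i < 4; ++i)
        total += v[i];
    return total;
}

// Usage: the result is the same with or without the hint.
inline void example_sum4() {
    const float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    assert(sum4(v) == 10.0f);
}
```

The loop body stays readable and the hint can be removed without touching the logic, which is the main attraction of this option compared to manual unrolling.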

A recursive version, C++14

If you cannot use fold expressions, then here’s a recursive solution:

// Primary template for unrolling
template<int N>
struct Unroll {
    template<typename Fn>
    static inline void apply(Fn fn) {
        Unroll<N - 1>::apply(fn); // Recurse with N-1
        fn(N - 1); // Process the current index
    }
};

// Specialization for the base case
template<>
struct Unroll<0> {
    template<typename Fn>
    static inline void apply(Fn /*fn*/) {
        // Base case: nothing left to call
    }
};

And here’s a working example @Compiler Explorer

There’s also a good example, a dot product, in the great book on templates: C++ Templates: The Complete Guide (2nd Edition).
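In the same spirit, here is a minimal sketch of a dot product built on the recursive Unroll helper from above. This is not the book’s implementation; dot and example_dot are hypothetical names, and Unroll is repeated so the snippet is self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Recursive unrolling: handle indices 0..N-2 first, then index N-1.
template <int N>
struct Unroll {
    template <typename Fn>
    static void apply(Fn fn) {
        Unroll<N - 1>::apply(fn);
        fn(N - 1);
    }
};

// Base case: nothing left to call.
template <>
struct Unroll<0> {
    template <typename Fn>
    static void apply(Fn) {}
};

// Hypothetical helper: accumulates a[i] * b[i] for every index.
template <typename T, std::size_t N>
T dot(const std::array<T, N>& a, const std::array<T, N>& b) {
    T result{};
    Unroll<static_cast<int>(N)>::apply([&](int i) { result += a[i] * b[i]; });
    return result;
}

// Usage: 1*4 + 2*5 + 3*6 = 32.
inline void example_dot() {
    const std::array<int, 3> a{1, 2, 3};
    const std::array<int, 3> b{4, 5, 6};
    assert(dot(a, b) == 32);
}
```

The lambda captures result by reference, and since Unroll copies only the closure object, every call still accumulates into the same variable.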

Summary

In this text, we explored an interesting technique for unrolling repetitive and simple loop statements. Thanks to C++17 features like fold expressions, combined with templates and lambdas, the code is elegant and easy to understand.

Have you implemented loop unrolling in your C++ projects?
Do you prefer using lambdas over traditional functors for performance-critical code?