Featured Articles

Recent tech articles that we loved...

Criteo
Criteo R&D

On October 2, the WomenDoTechToo initiative at Criteo reached a significant milestone. After successfully hosting local meetups across various locations for several years, we felt it was time to take a bold step forward by organizing a half-day conference at our headquarters in Paris. In this article, we invite you to enjoy and explore the insightful talks from that remarkable afternoon.

What’s Women Do Tech Too?

Before diving into the details of the event, let’s first establish the context and introduce the initiative.

Women Do Tech Too is a dynamic event hosted by Women@Criteo, one of our internal communities, and the R&D department. It celebrates the vibrant and diverse women’s community within Criteo and serves as a platform to showcase the incredible talent, skills, and contributions of women in the tech industry.

Why?

This event is driven by the belief that diversity and inclusion are paramount in fostering innovation and driving progress in the tech sector. By providing a space for women to share their expertise and experiences, Women Do Tech Too aims to empower and inspire others while highlighting the importance of diversity in shaping our technological landscape.

With a commitment to the Women in Tech community, Women Do Tech Too offers a stage to women speakers and welcomes all individuals who share a passion for championing diversity and equity in the tech community to attend and support these presentations. It offers an opportunity for open dialogue and collaboration, fostering a culture of respect, support, and collective growth.

The “Women do tech too” meetup is a valuable initiative that aims to create a safe space for everyone and provide opportunities for women to contribute to the tech industry. It is a platform for women to celebrate their achievements, foster professional connections, and inspire new generations of women in tech.

We have been actively organizing local tech meetups as part of this initiative. Specifically, we have successfully hosted four meetups: two in Paris, one in Barcelona, and one in Grenoble. For more insights, feel free to explore a couple of previous articles covering some of these meetups 👇

Women Do Tech Too Conference!

This year, we recognized the need to elevate the initiative and enhance its visibility. We understood that establishing a strong visual identity and creating a more impactful event aligned with the foundational pillars of our initiative was essential.

Thus, the Women Do Tech Too Conference was conceived as a half-day event featuring over ten speakers from Criteo and various other companies, all hosted in our newly designed space at our Paris office. In summary, the talks addressed a range of topics, including self-care, career advancement, future collaboration with Ada Tech School, navigating a world with and without third-party cookies, privacy in the realm of Generative AI, crafting a digital visual identity, the journey of a product manager in creating user-centric products, lessons learned from integrating a new theme into Criteo’s Design System, and the importance of taking breaks to maintain personal balance. WOW! 🤯

Now, it’s your turn to enjoy the replays and a selection of the best photos. Grab your favorite beverage and join this community of women in tech to celebrate their achievements, forge professional connections, and inspire future generations.

Agenda

From Idea to Action: WomenInTech, a safe place to find inspiration, knowledge, and role models by Alejandra Paredes, Software Development Engineer at Criteo & Estelle Thou, Software Development Engineer at Criteo.

What do we need to thrive? This keynote will explore how WomenInTech fosters a community where women at all levels, from juniors to seniors, are encouraged to take initiative and speak up. As a safe place, WomenInTech is a community where we can take the spotlight and be listened to. Let’s explore together all the possibilities of:

  • How to transform an idea into action
  • How to take advantage of the WomenInTech network
  • How YOU can be the next role model

Ensure a future of collaboration and diversity in the Tech Industry by Clara Philippot, Ada Tech School Paris Campus director.

How are we training the new generation of developers to learn and iterate from collaboration, agile methodology, and empathy?


The path of staff engineer by Paola Ducolin, Staff Software Engineer at Datadog.

Earlier this year, I was promoted to Staff Engineer at my current company, Datadog. It was a three-year-long path. In this lightning talk, I will share the journey with its ups and downs.


Story of a failure by Agnès Masson-Sibut, Engineering Program Manager at Criteo.

Working as an EPM, one of our roles is to try to avoid failure. But sometimes, for many reasons, failure is there. Bill Gates said, “It’s fine to celebrate success, but it is more important to heed the lessons of failure.” This presentation will bring us through the story of a failure and, more importantly, through the learnings out of it.


Cookies 101 by Julie Chevrier, Software Development Engineer at Criteo.

Have you ever wondered what happens after you click on a cookie consent banner and what the impact of your choice on the ads you see is? Join me to understand what exactly a cookie is and how it is used for advertising!


How to make recommendations in a world without 3rd party cookies by Lucie Mader, Senior Machine Learning Engineer at Criteo.

Depending on the browser you’re using and the website you’re visiting, the products in the ads you see might seem strange. We’ll discuss this issue and its possible relationship to third-party cookies in this talk.


Privacy in the age of Generative AI by Jaspreet Sandhu, Senior Machine Learning Engineer at Criteo.

With the advent and widespread integration of Generative AI across applications, industrial or personal, how do we prevent misuse and ensure data privacy, security, and ethical use? This talk delves into the challenges and strategies for safeguarding sensitive information and maintaining user trust in the evolving landscape of AI-driven technologies.


How to translate women’s empowerment into a brand visual identity by Camille Lannel-Lamotte, UI Designer at Criteo.

Uncover how color theory, symbolism, and language come together to shape the new brand image and get an insider’s view of the key elements that define it.


From Vision to Experience: The Product Manager’s Journey in Shaping User-Centric Products by Salma Mhenni, Senior Product Manager at Criteo.

Evolution of product managers’ roles in creating user-centric products, transitioning from initial vision to crafting meaningful user experiences.


Crafting Consistency: Integrating a new theme in Criteo’s React Design System by Claire Dochez, Software Development Engineer at Criteo.

Last year, our team integrated a new theme into Criteo’s design system. This talk will cover the journey, emphasizing the key steps, challenges faced, and lessons learned along the way.


Have a break and find YOUR own balance with the Wheel of Life! by Sandrine Planchon, Human-Minds — Coach in mental health prevention & Creator of disconnecting experiences.

When everything keeps getting faster, to the point of sometimes throwing you off balance, what about slowing down for a moment and reflecting on YOUR own need for balance in your life? The Wheel of Life can show a way to access it!


A picture is worth a thousand words.



Empowering Voices: The Women Do Tech Too Conference was originally published in Criteo Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

HomeAway
Ercument Ilhan

EXPEDIA GROUP TECHNOLOGY — DATA

How Expedia Group scales up ranking bandit problems with low latency

A person walks in front of Sheikh Zayed Grand Mosque in UAE
Photo by Junhan Foong on Unsplash

In the pursuit of delivering tailored user experiences, the strategic arrangement of user interface elements plays a pivotal role. Among the array of challenges in this endeavour, ranking problems that aim to determine the optimal order hold particular significance.

While supervised learning methods offer solid solutions, online learning by trial-and-error through multi-armed bandits presents distinct advantages such as not requiring a previously collected dataset and being able to adapt to the problem on the fly. By treating each ordering as an arm, it is straightforward to leverage multi-armed bandits to handle such scenarios.

That being said, addressing ranking problems with multi-armed bandits encounters a critical setback: the exponential increase in distinct orderings, i.e. arms, as the number of items and positions grows. For example, arranging just 8 items into 8 positions results in 40,320 unique orderings. Expanding this to 10 items and positions escalates the count to a daunting 3,628,800, highlighting the rapid expansion of the problem size.

Why does this pose a challenge? While an obvious one is the difficulty of learning the best arm among numerous options, the actual bottleneck for linear bandits emerges at the arm selection step. In this blog post, you will learn how we tackle this at Expedia Group™ to scale our ranking bandits efficiently.

Bandits with linear payoffs, also known as linear bandits, are a class of multi-armed bandit problems where the reward obtained from pulling each arm is assumed to be a linear function of some underlying features associated with that arm. In addition to allowing context information to be incorporated as an extension of the arm features to make contextual decisions, this allows some generalisation between arms. Thus, learning via feedback gathered for some arms also contributes to the learning of others. This property of linear bandits makes it possible to learn arm values even with a very large number of distinct arms, so long as there is a sufficient amount of shared features. That being said, this does not directly resolve the challenge that arises at the arm selection step.

Every time the bandit algorithm selects an arm to pull, i.e., an action to execute, it typically does so by selecting the top-scoring arm according to its model. In linear bandits, this involves computing the dot product between the (context-)arm encoded vector and the linear model’s weights for each arm and then determining the one with the maximum value. For more details on how these algorithms work particularly in our applications, the reader is referred to our other related posts: Multi-Variate Web Optimisation Using Linear Contextual Bandits, Recursive Least Squares for Linear Contextual Bandits, and How to Optimise Rankings with Cascade Bandits.

Arm selection appears in the arg max step of the widely used Thompson Sampling for Contextual Bandits with Linear Payoffs [1] algorithm shown below, which amounts to an exhaustive loop over the dot-product score of every single arm:

An algorithmic pseudocode for Thompson Sampling in contextual bandits, detailing steps for sampling, playing an arm, and observing a reward in iterations.
Figure 1: Arm selection procedure with Thompson Sampling in linear contextual bandits. [1]
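To make this concrete, here is a minimal NumPy sketch of that exhaustive selection step, assuming the context-arm encodings are available as a single matrix; the names and sizes are illustrative rather than Expedia Group’s implementation.

import numpy as np

def select_arm_exhaustive(arm_encodings, mu, cov, rng):
    """Thompson Sampling arm selection: score every arm and take the arg max.

    arm_encodings: (num_arms, num_features) binary context-arm vectors
    mu, cov: posterior mean and covariance of the linear model weights
    """
    # Sample one plausible weight vector from the posterior.
    sampled_weights = rng.multivariate_normal(mu, cov)
    # Dot product of every arm encoding with the sampled weights: one score per arm.
    scores = arm_encodings @ sampled_weights
    # Exhaustive arg max over all arms: O(num_arms * num_features) per decision.
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
arm_encodings = rng.integers(0, 2, size=(40_320, 200))   # e.g. all orderings of 8 items
best_arm = select_arm_exhaustive(arm_encodings, np.zeros(200), np.eye(200), rng)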

This step can require an excessive amount of computational budget and might be prone to high latency in responses when the number of arms is large, thus rendering it unsuitable for applications that require real-time decisions to be made.

An approach that tackles this challenge is Greedy Search [2]. By traversing the arm space to find an approximate good solution for the top-scoring arm using the Hill Climbing technique, this algorithm sidesteps exhaustive processing. However, the effectiveness of this search method in finding the top-scoring arm depends on the problem size. Linked to this, Greedy Search suffers from diminishing performance as the number of features increases, leading to longer computation times and expanded search space. Consequently, this results in suboptimal scaling in approximation quality and operational efficiency as the number of arms grows.

To address the aforementioned challenges and determine the exact top-scoring arms more efficiently, we employ a technique that we call Assignment Solver as an alternative to Greedy Search. This method leverages two key properties of the ranking with linear bandits problem:

  • Linear bandit model: Arms are represented by a set of features, and arm score computation involves a dot product between these features and the model weights.
  • Ranking problem: An item may occupy only one position at any given time, and similarly, each position may accommodate only one item.

Here is what we can infer from these two properties together: an arm’s score is a sum of contributions from its individual item-position assignments. Therefore, instead of naively computing every arm score one by one, we can find the valid assignment that maximises the summed contributions, which also identifies the top-scoring arm.

To illustrate this overall idea and process, let’s consider a simple problem with 2 different contexts (Context #1 and Context #2) and 2 items (Item #1 and Item #2) to be ordered into 2 positions (Position #1 and Position #2) where every variable is a categorical one. Let’s also say that we incorporate an encoding scheme in this setting to enable addressing it as a linear bandit as follows:

A horizontal bar divided into 15 segments showing encoding components grouped into four categories: “Bias” (containing B), “Contexts” (containing C₁ and C₂), “Positions-Items” (containing A₁₁, A₁₂, A₂₁, A₂₂), and “Interactions” (containing I₁₁₁, I₁₁₂, I₁₂₁, I₁₂₂, I₂₁₁, I₂₁₂, I₂₂₁, I₂₂₂).
Figure 2: An example encoding scheme that defines the coefficients and their positions.

where B denotes bias, C denotes the context terms, A denotes the position-item terms and I denotes interaction terms between context and position-item. Depending on the context-arm combination at decision time, this encoding scheme produces a binary vector (only consisting of 0s and 1s) by following the rules below:

  • Bias term (B): This is always 1.
  • Context terms (C): Subscript denotes which context level is associated with the term; when it is Context #1, C₁ is 1 and C₂ is 0, and vice versa.
  • Position-item terms (A): The first element of the subscript denotes the position and the second element denotes the item associated with this term, e.g. when Position #1 contains Item #2, the value of A₁₂ becomes 1. Only 2 out of these 4 terms can take the value of 1 at a time in any context-arm encoding.
  • Interaction terms (I): The first element of the subscript denotes the current context, second and third elements denote the position and item associated with this interaction term, respectively. For example, when in Context #2, if Position #1 contains Item #2 and Position #2 contains Item #1 then the terms I₂₁₂ and I₂₂₁ would be set as 1 while all other terms are 0. Similarly to A, only 2 out of these 8 terms can take the value of 1 at a time in any context-arm encoding.

For a complete example, the pair of Context #2 and the item ordering [Item #2, Item #1] (arm) would have the terms B, C₂, A₁₂, A₂₁, I₂₁₂, I₂₂₁ of the encoding scheme activated, producing the binary vector 101011000000110. The dot product of this vector with the model weights would then give us the arm score for this particular context-arm pair. In other words, this encoding tells us which coefficients of the model weights we need to take into account. Continuing from this example, we can write down the score calculations of the different orderings (arms) for Context #2 as below:

  • Score of [Item #1, Item #2] = B + C₂ + A₁₁ + A₂₂ + I₂₁₁ + I₂₂₂
  • Score of [Item #2, Item #1] = B + C₂ + A₁₂ + A₂₁ + I₂₁₂ + I₂₂₁
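As a sanity check, here is a small sketch (not the production encoder) that reproduces the 15-element binary vector and these score formulas from the index layout in Figure 2; the function name and the index arithmetic are our own illustration.

import numpy as np

# Index layout from Figure 2: [B, C1, C2, A11, A12, A21, A22, I111, ..., I222]
NUM_FEATURES = 15

def encode(context, ordering):
    """Binary context-arm encoding for the 2-context, 2-item, 2-position example."""
    x = np.zeros(NUM_FEATURES, dtype=int)
    x[0] = 1                                   # bias term B
    x[context] = 1                             # C1 at index 1, C2 at index 2
    for position, item in enumerate(ordering, start=1):
        x[2 + 2 * (position - 1) + item] = 1                       # A terms (indices 3-6)
        x[6 + 4 * (context - 1) + 2 * (position - 1) + item] = 1   # I terms (indices 7-14)
    return x

x = encode(context=2, ordering=[2, 1])          # Context #2, [Item #2, Item #1]
print("".join(map(str, x)))                     # -> 101011000000110
# score = x @ model_weights                     # dot product with the model weights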

It might be obvious now that such a mapping allows us to dissect the terms that contribute to an ordering’s (arm’s) total score for each position-item assignment. Our goal is to find the top-scoring ordering for any given context. The terms B, C₂ (or C₁ when it’s Context #1) are present in every arm and do not affect the ordering of the scores; therefore, they can be ignored:

  • Score of [Item #1, Item #2] = … + A₁₁ + A₂₂ + I₂₁₁ + I₂₂₂
  • Score of [Item #2, Item #1] = … + A₁₂ + A₂₁ + I₂₁₂ + I₂₂₁

Here, we can see the relevant terms and write the contribution of position-item assignments independently:

  • Score contribution of Item #1 being at Position #1 = A₁₁ + I₂₁₁
  • Score contribution of Item #1 being at Position #2 = A₂₁ + I₂₂₁
  • Score contribution of Item #2 being at Position #1 = A₁₂ + I₂₁₂
  • Score contribution of Item #2 being at Position #2 = A₂₂ + I₂₂₂

Now, to make it even clearer, let’s put these in the matrix form which we call the score contribution matrix:

A 2x2 matrix showing position-item combinations. Each cell contains a sum formula: top-left is A₁₁ + I₂₁₁, top-right is A₁₂ + I₂₁₂, bottom-left is A₂₁ + I₂₂₁, and bottom-right is A₂₂ + I₂₂₂. The rows are labeled “Position #1” and “Position #2” from top to bottom, while columns are labeled “Item #1” and “Item #2” from left to right.
Figure 3: Score contribution matrix for Context #2.

Let’s remember the ranking problem: an item can only be present at a single position at one time, and a position can only be occupied by one item. With this in consideration, we can think of the task of finding the best item ordering as picking one cell from each row and each column of this score contribution matrix so as to maximise the sum of the picked cell values. In this particular example, there are 2 different ways of doing such an assignment (which is equal to the number of different orderings we can generate):

  • Item #1 at Position #1 with Item #2 at Position #2
  • Item #2 at Position #1 with Item #1 at Position #2

as also illustrated below:

Two 2x2 matrices showing different item orderings and their scores. The left matrix shows Ordering: [Item #1, Item #2] with cells containing A₁₁ + I₂₁₁ (blue), A₁₂ + I₂₁₂ (crossed-out), A₂₁ + I₂₂₁ (crossed-out), and A₂₂ + I₂₂₂ (orange). The right matrix shows Ordering: [Item #2, Item #1] with cells containing A₁₁ + I₂₁₁ (crossed-out), A₁₂ + I₂₁₂ (orange), A₂₁ + I₂₂₁ (blue), and A₂₂ + I₂₂₂ (crossed-out). Below each matrix is its corresponding score formula using the non-crossed-out terms.
Figure 4: Illustration of every possible ordering and their score formulas when it’s Context #2, according to the score contribution matrix given in Figure 3.

Restructuring our problem this way helps us understand the assignment nature of the problem. However, finding the best assignment, i.e. the best ordering, by evaluating these scores exhaustively won’t provide us with any benefits over the default approach described earlier. In this simple example, there are only 2 possible assignments. In larger problem instances this number becomes very large, such as 8 items-positions having 40,320 assignments. This is where we take advantage of this reformulation: Our problem in this form is an instance of the fundamental assignment problem in combinatorial optimisation literature!

The assignment problem involves finding the best assignment of a set of tasks (items) to a set of resources (positions) in a way that optimises a certain objective function. In its most common form, it deals with assigning tasks to resources, where each task must be assigned to exactly one resource, and each resource can only be assigned one task. The objective is typically to minimise or maximise some measure of cost, such as minimising the total cost or maximising the total profit, which in our case is maximising the total assignment score.

The assignment problem has been extensively studied and various algorithms have been developed to solve it efficiently. One of the most well-known algorithms for solving the assignment problem is the Hungarian algorithm, also known as the Kuhn-Munkres algorithm [3]. This algorithm finds the optimal solution to the assignment problem in polynomial time, making it highly efficient for practical applications. Therefore, it’s a perfect method for us to utilise in this problem for real-world instances where finding solutions with a very small latency is of utmost importance.

In short, by applying the Hungarian algorithm to the score contribution matrix we build at every arm selection step, we can find the best item ordering very efficiently. This blog post won’t be covering the details of the Hungarian algorithm itself.
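As a concrete sketch, SciPy’s linear_sum_assignment (a Hungarian-style solver) can be applied directly to a score contribution matrix like the one in Figure 3; the cell values below are made-up stand-ins for the A + I weight sums.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Score contribution matrix: rows = positions, columns = items (as in Figure 3).
contribution = np.array([[0.18, 0.19],
                         [0.24, 0.20]])

# maximize=True finds the row/column assignment with the largest total score.
positions, items = linear_sum_assignment(contribution, maximize=True)
print([f"Position #{p + 1} -> Item #{i + 1}" for p, i in zip(positions, items)])
# -> ['Position #1 -> Item #2', 'Position #2 -> Item #1']
print(contribution[positions, items].sum())     # total score of the best assignment: 0.43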

Now that we have the core idea explained, here is how the algorithm can work practically:

  • At the beginning of learning, we first initialise our algorithm by determining and storing the coefficient indices according to the encoder mapping in a compatible data structure to be used as a look-up table, e.g. a 4-dimensional tensor, as shown below:
A diagram showing a tensor representation with two parts: At the top, a row of variables (B, C₁, C₂, A₁₁, A₁₂, A₂₁, A₂₂, I₁₁₁, I₁₁₂, I₁₂₁, I₁₂₂, I₂₁₁, I₂₁₂, I₂₂₁, I₂₂₂) with corresponding index numbers 0–14. Below, a stored indices tensor showing two 2x2 matrices side by side: Context #1 on the left with pairs of indices ([3,7], [4,8], [5,9], [6,10]) and Context #2 on the right with pairs ([3,11], [4,12], [5,13], [6,14]).
Figure 5: Coefficients indices look-up table.
  • Every time the bandit needs to determine the top-scoring arm for a given context, we refer to this look-up table to extract the coefficient indices for the different position-item assignments. We then build the score contribution matrix by summing up the model weights corresponding to these indices. Finally, we solve the maximum assignment problem on this matrix with the Hungarian method and obtain the top-scoring ordering (arm). An example of this step with some arbitrary model weights for Context #2 is shown below:
A diagram showing two components: At the top, a row of model weights ranging from 0.05 to 0.5 with corresponding index numbers from 0 to 14. Below, a 2x2 score contribution matrix labeled “it’s Context #2” with values: top left 0.18, top right 0.19 (highlighted in green), bottom left 0.24 (highlighted in green), and bottom right 0.2.
Figure 6: The score contribution matrix for Context #2 that is constructed with the given model weights according to the coefficients look-up table constructed in Figure 5.
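Putting the two steps above together, here is a rough end-to-end sketch of the Assignment Solver for the toy problem, using the coefficient-index layout from Figure 5; the model weights are random placeholders, not values from the post.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Coefficient-index look-up table from Figure 5:
# lookup[context][position][item] -> indices of the two weights (A and I terms)
# contributing to that position-item assignment under that context.
lookup = {
    1: np.array([[[3, 7], [4, 8]], [[5, 9], [6, 10]]]),
    2: np.array([[[3, 11], [4, 12]], [[5, 13], [6, 14]]]),
}

def top_scoring_ordering(context, weights):
    """Build the score contribution matrix and solve the maximum assignment."""
    idx = lookup[context]                        # shape (positions, items, 2)
    contribution = weights[idx].sum(axis=-1)     # sum the A and I weights per cell
    positions, items = linear_sum_assignment(contribution, maximize=True)
    return [int(i) + 1 for i in items]           # 1-based item id assigned to each position

weights = np.random.default_rng(0).uniform(0.05, 0.5, size=15)   # placeholder weights
print(top_scoring_ordering(context=2, weights=weights))          # item ordering with the highest score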

So, how does the approach we just presented perform against the baselines, the exhaustive method and Greedy Search? Its strength over the previous approaches is twofold.

First, it is much faster to determine the top-scoring arm and it scales much better as the number of arms grows. The figure below demonstrates this comparison:

A comparison of three search algorithms (Exhaustive, Greedy Search, and Assignment Solver) shown in both a bar graph and data table. The graph shows computation time vs number of items/positions (3–12), with blue bars for Exhaustive, green for Greedy Search, and grey for Assignment Solver. The table below details the exact values, normalized by Exhaustive time for 6 arms, and includes the number of possible arms for each configuration.
Figure 7: The bar graph and the table of values for the time taken by Exhaustive, Greedy Search and Assignment Solver (the method described in this blog post) methods to find the top-scoring arms. The time values are normalised by the time taken by the Exhaustive method for 3 items and positions (6 arms).

As can be seen in Figure 7, Assignment Solver is far ahead when it comes to speed. When there are 6 items and positions, it’s already ~95 times faster than the Exhaustive method and ~12 times faster than Greedy Search. At 10 items and positions, it becomes ~40 times faster than Greedy Search thanks to its polynomial time complexity resulting in better scaling. Ultimately, it enables addressing enormously large problems which would otherwise be infeasible.

Second, Assignment Solver’s top-scoring arm results are exact regardless of the size of the problem, just like what the exhaustive method would produce, whereas Greedy Search has no exactness guarantees with an approximation quality tied to the problem size.

Final remarks

This blog post described the approach we use at Expedia Group to efficiently determine the exact top-scoring arms in high-cardinality ranking bandit problems when a suitable encoding scheme is in place. The substantial gains we demonstrate highlight the importance of adopting the right solutions tailored to the problem. These solutions enable prominent techniques from the literature to be compatible with real-world use cases.

References

[1] Agrawal, S., & Goyal, N. (2013). Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, 28, 127–135.

[2] Parfenov, F. & Mitsoulis-Ntompos, P. (2021). Contextual Bandits for Webpage Module Order Optimization. In Marble-KDD 21’, Singapore, August 16, 2021.

[3] Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.

Learn about life at Expedia Group

Identifying Top-Scoring Arms in Ranking Bandits With Linear Payoffs in Real-Time was originally published in Expedia Group Technology on Medium, where people are continuing the conversation by highlighting and responding to this story.

JobTeaser
Jean-Jacques Royneau

Managing z-index is a classic challenge for front-end developers. As a project grows, display conflicts and bugs related to z-index can quickly turn into serious headaches.

I recently took some time to clean up the z-indexes in one of our projects.
In this article, I’ll share the approach I used for this cleanup (while also laying the foundation for keeping things manageable in the long run).

1. Analyzing the existing z-indexes

The first step was to perform a thorough audit of all the z-index declarations in the project. This gave me an overview of how many z-indexes we had and what values were being used.

Here’s what I found:
- August 2023: 57 declarations with 8 distinct values.
- February 2024: 87 declarations with 19 distinct values.

This evolution showed me that the use of z-indexes had increased significantly. And there was no reason to believe it would stop.

With that in mind, my goal became to centralize and standardize these values.

2. Centralizing z-indexes with CSS Custom Properties

I centralized all the z-index values in a single file using CSS custom properties.

This allowed me to have an overview of the different layers of elements on the page, helping me get a (slightly) clearer picture of the task ahead.

If you’d like to get an idea (or maybe scare yourself a bit), here’s what that first file looked like:

:root {
  --z-index-second-basement: -2;
  --z-index-basement: -1;

  --z-index-ground: 0;

  --z-index-floor: 1;
  --z-index-second-floor: 2;
  --z-index-third-floor: 3;

  --z-index-job-ads-secondary-filters--selects: 10;
  --z-index-job-ads-results-sort: 10;
  --z-index-job-ads-primary-filters: 20;

  --z-index-fo-header: 30;
  --z-index-above-fo-header: 31;
  --z-index-fo-header--dropdowns: 100;

  --z-index-career-center--login-modal--autocomplete-list: 99;

  --z-index-bo-drawer-backdrop: 999;
  --z-index-bo-drawer: 1000;

  --z-index-feature-env-switcher: 9999;

  --z-index-notifications-panel: 10000;

  --z-index-msw-tools-panel--open-button: 99999;
  --z-index-msw-tools-panel: 999999;
}

3. Understanding Stacking Contexts

While trying to make sense of each of these z-index values, I had to revisit an important CSS concept: stacking contexts.

Z-indexes are just a tool tied to this concept.

In my experience, z-indexes can be tricky to master because they interact with stacking contexts in TWO ways:
- a z-index can modify the stacking order (which is fine, that’s its purpose)
- a z-index can create a new stacking context (and this is where things can go wrong)

The real challenge isn’t just about the z-index values themselves, but also the unintentional creation of stacking contexts. Many z-indexes end up creating stacking contexts that weren’t needed, which complicates the layering of elements.

For the next phase of my work, I paid special attention to these unintentional stacking context creations.

4. Removing, Reducing, and Grouping

With a better understanding of stacking contexts, I was able to identify unnecessary or redundant z-index values.

Quite often, the natural order of elements in the DOM is enough to ensure correct layering without needing a z-index.

So, I removed z-indexes where elements were already stacking correctly based on their HTML order (and in some cases, I even reordered the HTML itself to avoid needing a z-index 😉).

I also reduced the value of some z-indexes when they were higher than necessary.

This step is crucial: I believe that the lower and fewer the z-index values are, the easier they are to keep under control over time.

This reduction also helped me achieve one of my goals: ensuring all z-index values were increments of 1 (I mean: -1, 0, 1, 2, 3…).

Why increments of 1? Because you need to control your z-indexes.

No more randomly picking a value (“There was space between 20 and 30, so I went with 27!”).

Finally, I grouped CSS custom properties that had identical values, which allowed me to identify some useful abstractions.

5. Defining a Visual Priority Order for Recurring Elements

We defined a visual hierarchy for elements that frequently appear on a page: menus, modals, dev tools, and notifications.

The order we chose is as follows:
page content < website menus < modals < dev tools < notifications.

From the previous work, the highest z-index value that belonged to page content (i.e. not including elements that appear on top) was 3.

This allowed me to make the priority logic dynamic:

:root {
  --z-index-website-menu: 4;

  --z-index-modal: calc(var(--z-index-website-menu) + 1);

  --z-index-dev-tools: calc(var(--z-index-modal) + 1);

  --z-index-notifications-panel: calc(var(--z-index-dev-tools) + 1);
}

Now, if the value of --z-index-website-menu ever needs to change (to decrease, of course!), all the following values will update automatically, without developers having to think about it.

6. Defining Utility Values

At the end of step 4, I had grouped together the CSS custom properties that had identical values. Some of these values turned out to be the recurring elements we discussed earlier.

Others didn’t correspond to a specific type of element but were often used because they’re useful for fine-tuning the stacking order in specific contexts.

I took these values to define utility custom properties:

:root {
  --z-down-in-the-current-stacking-context: -1;

  --z-reset-in-the-current-stacking-context: 0;

  --z-up-in-the-current-stacking-context: 1;
}

These values are used to manage finer adjustments in any current stacking context.

7. Discouraging Hardcoded Values

To ensure this approach remains sustainable, I added a rule to our CSS linter, Stylelint, which prevents the use of hardcoded z-index values. This rule encourages developers to either use the centralized variables or define new ones when needed.

// .stylelintrc
"rules": {
  // ...
  "scale-unlimited/declaration-strict-value": "z-index"
  // ...
}

However, be careful: it should be understood as a guideline and not a rule. It’s fine to silence the linter and use a hardcoded value. You just have to do it consciously.

Final thoughts

While all these steps seem to flow nicely in theory, that’s not always the case in real life.
In practice, I went back and forth between the different steps multiple times before reaching a satisfactory result. 😉

By centralizing values, eliminating unnecessary stacking contexts, and reducing the number of different z-index values, I was able to simplify and make z-index management more maintainable. This process lightened the codebase and made the handling of element layering more predictable.

If this approach seems helpful to you, I encourage you to take inspiration from it and apply it to your own projects. A well-structured z-index management strategy can make a real difference, making maintenance easier and avoiding subtle bugs related to element stacking.


Spring Cleaning for z-indexes was originally published in JobTeaser Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Thumbtack
Richard Demsyn-Jones

Three years ago we launched Thumbtack’s dedicated machine learning (ML) infrastructure team. Starting from a single engineer, we eventually grew this into a small team with a big impact. Today our client teams can explore generative AI or traditional ML with our tools, implement their approach through our model inference solution and our feature store, and track correctness with polished model monitoring. We experienced ups and downs getting to this point, giving us many learnings, and it’s been enough time to reflect.

The ML Infra team

Our ML Infra team has a wide scope, as the only dedicated AI/ML infrastructure team at Thumbtack. We build the services used in the online serving of ML models in Thumbtack’s product experience and manage the tools for building ML models. Model inference, feature management, Jupyter notebook environments, model monitoring, and generative AI capabilities are the main areas we work on. On top of that, ML Infra is often a unifying constant between different product teams. ML Infra acts as “connective tissue”, sharing ML building and system design knowledge across the company.

Yet, the team is small — you’d be surprised at how small it is for that scope. But we have learned how to operate as a flexible and high-leverage team.

Our path

2021

After several years without any dedicated ML infrastructure team, our need for one had grown. Product teams were moving forward quickly with more and more ML use, yet they lacked most of the shared ML tooling and infrastructure that larger tech companies might typically have. Teams built the ML infrastructure that worked for them, most of which did not generalize across teams. Most pressingly, production inference — the generation of predictions from our models in real time — had diverged into two separate architectures.

For a couple of years we knew this was an unsustainable setup, but it was only in 2021 that we built momentum around a serious effort to consolidate our ML infrastructure direction. We formed working groups around four topic areas: feature engineering, model experimentation, model inferencing, and model monitoring. Each group had an assortment of engineers and applied scientists from across the company, working together (on top of their regular responsibilities) to write a document outlining their area’s status and possibilities. With these documents in hand we had a strong case for more organized development. We considered perpetuating those working groups as a “virtual team” that could build out some of this infrastructure, but ultimately we decided to create a smaller permanent team.

2022

In 2022, we started the team with one engineer and had buy-in to grow the team. That year was both fast and slow. We were creative in finding additional engineering help. We created a “20% ML Infra” program and shared it with our broader engineering team, inviting anyone interested in the program to speak to their managers about volunteering to spend 20% of their time on ML Infra. There was substantial interest. Separately, we had also created a “Voyager” program at Thumbtack, where engineers could move to other teams for the duration of a single brief project. Between those two programs we had about five temporary collaborators who helped us in lean times. We even had part-time project management support from a colleague who was particularly interested in ML and had been involved in the creation of the team. This really helped us bridge the gap while we started on hiring, which we spent a lot of time on throughout the year.

While we felt we were on the path to success, we still had some learnings that year. We ended the year without any models in production. We had a very small team whose time was split across other responsibilities such as interviewing, communicating with stakeholders and potential client teams, answering questions about inherited legacy infrastructure, contributing to our 3 year strategy, and guiding our “20% ML Infra” and “Voyager” program contributors. While our temporary contributors were helpful, they needed ramp-up time and had to serve their home teams first, so their contributions were sporadic and often disrupted.

Where we did make progress was designing the fundamental structure of inference between our service and its interfaces with clients — those structural decisions have proven to be a good fit. We embraced a Minimum Viable Product (MVP) mindset to reach iterations that were good enough to try out. While that helped us learn quickly and make pivots, it left holes in our functionality. There was still work to be done.

2023

We framed our work areas as overlapping maturity curves, where our initial area (production inference) would be in a “Building” stage while our next area (notebooks) was in a “Trialing” stage. Then, as inference reached “Maturity”, notebooks would reach “Scaling”, and model monitoring would reach the “Building” stage. We had about five such areas in mind for our scope, and intended to typically shift them by one stage in each half.

Example of how we viewed our progress and direction (in mid-2023)

Our new team members ramped up and we were able to have a larger impact as a team, working across multiple areas. That increased capacity meant we were able to go deeper on our most confident initiatives, while laying the groundwork on future projects. In a typical half we would maintain some established software, build out something else, and scope out options for further out initiatives. It never exactly followed the timeline we laid out, and that’s normal for ambitious plans. Overall, the general trend of tools progressing and maintaining different maturity levels has held.

It took time. Our client teams varied in what they wanted, which made it hard to pick universal choices for inference, feature management, or notebook environments. Nor were those team preferences and needs stationary over time; what might have helped a team in one half might not have been the same as what they needed after another half of their own development. For example, while we explored feature stores, our client teams went deeper on improving their own data infrastructure. For some initiatives we embedded deeply into client teams to apply our tools to their uses. Over the year, we slowly earned adoption, but it was incomplete and Thumbtack remained fractured for each area covered by our core ML tools.

2024

With inference, features, notebooks, and monitoring, it was hard to earn adoption when existing teams already had patterns with which they were comfortable. The start of 2024 brought a new area where we didn’t have prior infrastructure: generative AI. We leaned heavily into enabling generative AI capabilities for teams to experiment with, and this has already resulted in enabling adoption across many use cases. Unlike earlier capability building initiatives where we might have been too soon or too late for client teams, with generative AI we have been able to time the capability building with the needs of different product teams. We were just fast enough that our client teams were able to use generative AI without being blocked by us, yet we were just-in-time enough that we have been able to adapt to the rapidly changing external environment. Rapid advances in generative AI affected our optimal choices around which models to use, whether to host internally or use external vendors, and whether to build our own ML models or use prompt engineering approaches with pre-built models. We learned a lot and had a lot of fun along the way. We were able to evangelize generative AI at the company and power a positive flywheel of more and more adoption and functionality.

We also finally realized the full adoption of our inference solution for new models. Furthermore, it meshed cleanly with our early generative AI needs. Now with high adoption of inference and generative AI, and partial adoption of our other solutions, we are on track at the 2 year mark of our 3 year strategy. We have a long future of exciting and impactful work ahead of us, and now our team has moved past the growing pains associated with creating a centralized team that enables foundational capabilities. Our engineers are all still on the team, and have each built a substantial practical expertise across a spectrum of ML infrastructure topics. Our many client teams and collaborators have made our team well known at the company, where people actively solicit our guidance in the early stages of their ML system designs.

Lessons learned

Until you have one client, you don’t have any clients

If you try to build the average of what different teams need, what you build might not work for any of them.

Nor can you build something that does everything for everyone. Aside from limiting trade-offs and extra complexity, building an idealized solution takes time — time where your clients might move on or build something good enough for themselves.

It is so hard to forecast what solutions will truly be needed by teams and have high adoption. We found a balance between designing general solutions and embedding in potential client teams. Pay attention to your users, and keep your focus on them until your product fits for them.

Timing is everything

The external technology landscape changes so quickly. When we first approached building a feature store, we found that feature store providers and open source projects were optimized for different settings than ours, and that we would have to build so much around them that we might as well build our own from scratch. Instead of picking between a false dichotomy of building something new or greatly complicating our stack to incorporate an external solution, we decided to step back. We took a pause on feature store work, which also gave us more time to understand what our client teams truly needed. A year later, we built a much more bare-bones feature store solution that we expect will serve immediate needs for a while longer, giving us more time — and external solutions more time to increase their own product maturity — before committing to a fully general solution.

You need to be doubly fortunate, building a product that solves client needs and delivering it at a time when it makes a difference for them. Prioritize more projects where you have a clear client, that client has a pressing need, and they make direct commitments to test your new features.

Sometimes we do have to invest in long timelines for technology, but often enough we can have reliable impact by doubling down on initiatives that already have traction.

Maintain the trust of your stakeholders

Throughout the history of the team, we have invested significant thought and effort in maintaining trust.

To build and maintain that trust we use a number of tactics. We communicate proactively with a wide set of informal stakeholders. We have a transparent prioritization process where we invite anyone to submit ideas and options, and where we proactively identify people and teams to talk to for their perspective. Our team communicates in open channels. We shadow the company procedures that other teams do, such as formally grading our OKRs and our overall performance despite lacking a specific stakeholder who will use that information. We set ambitious goals and hold ourselves accountable to them.

More than anything, the way we maintain trust is by working on projects with the highest opportunity for impact, with ruthless prioritization where we aren’t shy about making tough decisions.

Write a vision

Speaking of ways to maintain trust, it helps to regularly demonstrate that you think about the future. Writing a 3 year vision was helpful for us to think through where we could have impact, but it was probably more valuable as an artifact we could share broadly. It shows that we have a good grasp on what we want to accomplish.

It’s held up pretty well. That’s a credit to the vision being practical and well-informed by experience. Another Thumbtack leader has a saying that he’s never seen the second year of a two year plan. Well, we’re entering the third year of our three year plan and the document hasn’t yet become irrelevant. The largest revisions we made were as generative AI developed in different, and most notably faster, ways than we anticipated.

Reviewing and updating the vision is a helpful exercise to reevaluate our big picture direction.

There is no substitute for dedicated people

We had help from temporary collaborators early on, but their projects were often subject to pauses. Occasionally those pauses became indefinite and the projects were never picked back up, as the collaborators’ obligations to their home teams only increased as their tenure and responsibility grew. This created a lot of unreliability in our roadmap, for their projects and for the work of our permanent members. And while those temporary members built up very valuable expertise that they could bring back to their home teams, we weren’t scaling knowledge inside ML Infra itself to quite the extent we wanted.

Once we hired more full time engineers, our team really picked up momentum. Our engineers put their core focus on ML Infra, and together they accumulate knowledge that makes us ever wiser and more capable. As permanent members of the team, they can also take on relationship management with our client teams or with external companies.

We would be more than happy to take on more temporary help going forward, but in proportion with the full time capacity of the team.

When small, be flexible

With a team that started small and is still small, we couldn’t afford to lose too much time to less promising initiatives. Nor could we support everything. We learned to say “no” a lot, or to say “not right now”. In hindsight we should have slowed down our scope growth from the beginning.

The “thinness” of ML Infra as a layer between platform and product teams is a choice parameter, and can vary depending on the situation of client and partner teams. Sometimes that means that a product team can go deeper on building their own ML infrastructure. We also need to make the most of the core infrastructure provided by Thumbtack’s excellent Platform team, especially in ways where we could automatically benefit from future improvements created by the larger team. When we ultimately built our feature store solution, we built it directly on top of their new generation data management infrastructure, rather than forking off from a lower level abstraction.

We follow most of the standard operating rhythms of the company, but we tend to be more agile. We maximize the time we can spend building. We try to avoid surprising our stakeholders, so we do deliver on what we commit to, but we commit selectively. We maintain our flexibility by not overly packing our roadmap, while also judiciously limiting our tech debt accumulation.

We have a lot on our roadmap, but it’s always subject to change, and we have some modesty about its inherent unpredictability. Revision is part of the plan.

What the future holds

Looking back on the last three years since forming the team, we accomplished a lot. We unified inference and built CI/CD around it, started an exponential growth curve of generative AI adoption, built a feature store, picked and implemented a Jupyter notebook solution, and made a model monitoring solution available for early adopters. Along the way we helped with launching many product experiments, and did lots of ad hoc ML systems consulting.

In the grand scheme of things, we’re still in early stages. Our capabilities vary in their polish and adoption. We’re still a small team. We have much work ahead of us, and that’s without knowing what novel ML-driven product experiences Thumbtack will need next. The future will bring larger and grander opportunities, and we’re excited for it.

Acknowledgement

First and foremost, to members of the ML Infra team for their superb work over these years. We also wouldn’t be here without all of our talented partners throughout Thumbtack, making this an exciting place to work with lots of opportunity. Thanks to Navneet Rao, Nadia Stuart, Laura Arrubla Toro, Cassandra Abernathy, Vijay Raghavan, and Oleksandr Pryimak for reviewing this blog and providing many suggestions.


What we learned building an ML infrastructure team at Thumbtack was originally published in Thumbtack Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

8th Light
Amanda Graham

As AI continues to influence decisions that impact humans, building trustworthy AI isn’t only about creating effective models — it’s about ensuring AI systems are ethical, reliable, and resilient over time. Although traditional software development principles provide a strong foundation for managing machine learning projects, AI’s unique challenges demand even more robust practices.

MLOps, the operational backbone of AI development, adapts proven DevOps principles like version control, CI/CD, and testing to meet the specific needs of AI. However, trustworthy AI requires additional considerations — such as data drift, bias mitigation, and explainability — that go beyond traditional software development.

This article explores seven core principles of MLOps, each essential to achieving trustworthy AI. 

7 Principles for Establishing Trustworthy AI

From managing data pipelines to ensuring accountability, these components work together to support AI systems that are both high-performing and aligned with today’s ethical and regulatory standards.

1. Data Pipelines

Effective data capture, transmission, and sanitization are integral to automating trustworthy AI workflows through MLOps. Data pipelines automate the ingestion, transformation, and validation of data to ensure high-quality, consistent inputs for machine learning models. Here’s how they work:

  • Automated Data Validation: Tools like TFX Data Validation or Great Expectations can validate data for consistency, schema correctness, and distribution anomalies before it feeds into models. These tools help ensure fairness by checking for biases or imbalances in the data, such as overrepresented or underrepresented groups.
  • Feature Stores: MLOps frameworks often use feature stores to ensure consistent and reusable feature engineering across training and production environments. This enables standardized features that reduce variability in model performance.
  • Data Drift Detection: MLOps enables real-time data monitoring using tools like Evidently AI, which can trigger alerts when significant drift (a distributional shift) occurs in the input data. Drift detection supports reliability: models that perform well in training but poorly in production can be retrained when drift exceeds thresholds. This is also a natural decision point for including a human in the loop (a minimal sketch follows this list).
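To make the drift check concrete, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy as a simple stand-in for what tools like Evidently AI automate; the feature names, data, and threshold are illustrative.

import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train_df, live_df, features, p_threshold=0.05):
    """Flag numeric features whose live distribution differs from training."""
    drifted = {}
    for col in features:
        # Two-sample KS test: a small p-value suggests the distributions differ.
        _, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        drifted[col] = p_value < p_threshold
    return drifted

# Illustrative usage: alert (and loop in a human) when any feature drifts.
train_df = pd.DataFrame({"price": [100, 120, 90, 110], "nights": [1, 2, 2, 3]})
live_df = pd.DataFrame({"price": [300, 340, 280, 310], "nights": [1, 2, 2, 3]})
report = detect_drift(train_df, live_df, features=["price", "nights"])
if any(report.values()):
    print("Data drift detected:", report)   # hand off to retraining or human review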

2. Model Versioning and Governance

Version control for models in MLOps is often implemented using tools like DVC (Data Version Control) or MLflow. These tools track model artifacts and their corresponding training datasets, hyperparameters, and code, ensuring transparency in the model lifecycle.

  • Model Lineage: MLOps ensures every version of a model can be traced back to the specific datasets, features, and configurations that produced it. This lineage is critical for accountability, especially for organizations working in regulated environments like healthcare or finance.
  • Governance Frameworks: Tools like Kubeflow Pipelines or Seldon Core allow teams to set up governance workflows, automate approval processes, and ensure that models meet internal or regulatory requirements before deployment. They make sure that ethical considerations like fairness or privacy are part of the development lifecycle.
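As an illustration of that lineage, here is a hedged sketch using the MLflow tracking API on toy data (assuming mlflow and scikit-learn are installed; the run name, parameters, and dataset pointer are illustrative).

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="demo-model"):                 # illustrative run name
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("training_data", "datasets/2024-05-01")  # pointer to the dataset version

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy_score(y_val, model.predict(X_val)))

    # Logged model artifact + params + metrics form the traceable lineage of this version.
    mlflow.sklearn.log_model(model, artifact_path="model")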

3. CI/CD Pipelines

In traditional DevOps, CI/CD pipelines enable fast, automated software releases. MLOps extends this concept to ML models, creating pipelines for model retraining, testing, and deployment.

  • Continuous Integration: Tools like Jenkins or GitLab CI can be integrated with MLOps frameworks to automatically test models against predefined trustworthiness benchmarks like accuracy and bias thresholds whenever there is a code change or new data.
  • Canary Deployment: MLOps allows you to deploy models in phases, using techniques like canary releases or shadow deployments, where only a small fraction of users interact with the new model. This reduces the risk of unintended consequences and provides real-world feedback, ensuring reliability and robustness.
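One way to encode such benchmarks is as ordinary tests that the CI pipeline runs on every code or data change. Below is a pytest-style sketch on synthetic data; the thresholds, the synthetic group column, and the model are all illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACCURACY_FLOOR = 0.75   # illustrative benchmark, not a real production threshold
MAX_GROUP_GAP = 0.15    # max allowed accuracy gap between (synthetic) groups

def _validation_predictions():
    # Toy stand-in for loading the candidate model and held-out data in CI.
    X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    y_pred = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict(X_val)
    group = (X_val[:, 0] > 0).astype(int)    # synthetic stand-in for a demographic group
    return y_val, y_pred, group

def test_accuracy_floor():
    y_val, y_pred, _ = _validation_predictions()
    assert np.mean(y_val == y_pred) >= ACCURACY_FLOOR

def test_group_accuracy_gap():
    y_val, y_pred, group = _validation_predictions()
    accs = [np.mean(y_val[group == g] == y_pred[group == g]) for g in (0, 1)]
    assert max(accs) - min(accs) <= MAX_GROUP_GAP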

4. Monitoring and Alerting

Monitoring in MLOps is important for identifying issues like data drift, concept drift, and model degradation in real time.

  • Data and Model Drift Monitoring: Open-source tools like Fiddler AI or Evidently AI continuously monitor model performance and data input, flagging deviations from the norm. For example, if a model shows a decline in performance due to a shift in user behavior, an alert is triggered to retrain or adjust the model, maintaining fairness and reliability.
  • Bias Detection and Remediation: Tools like AI Fairness 360 can be used to monitor models post-deployment, ensuring that decisions are fair across all demographic groups. Automated retraining based on these fairness metrics can help mitigate biases over time, with human review providing final oversight.
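As a simplified illustration of post-deployment fairness monitoring (plain NumPy rather than AI Fairness 360’s own API), one recurring check is the demographic parity gap between groups; the threshold, predictions, and group labels are made up.

import numpy as np

DEMOGRAPHIC_PARITY_TOLERANCE = 0.10   # illustrative alerting threshold

def demographic_parity_difference(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

# Example with made-up production predictions and group labels.
rng = np.random.default_rng(1)
y_pred = rng.integers(0, 2, size=5_000)
group = rng.choice(["A", "B", "C"], size=5_000)

gap = demographic_parity_difference(y_pred, group)
if gap > DEMOGRAPHIC_PARITY_TOLERANCE:
    print(f"Fairness alert: parity gap {gap:.3f}, queue retraining and human review")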

5. Security

MLOps addresses AI security through automated pipelines that integrate DevSecOps practices into AI workflows:

  • Secure Data Handling: MLOps platforms often use encryption mechanisms like TLS for data in transit and AES for data at rest, ensuring secure handling of sensitive data in compliance with privacy regulations. 
  • Adversarial Robustness: MLOps allows continuous evaluation of models for vulnerabilities to adversarial attacks. Tools like Adversarial Robustness Toolbox (ART) can test models during deployment for adversarial examples, ensuring robustness against malicious inputs.
  • Federated Learning: In scenarios requiring data privacy, MLOps pipelines can incorporate federated learning, where models are trained across distributed datasets without centralizing the data itself. This method ensures privacy while still enabling scalable machine learning.
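As a toy illustration of the kind of robustness evaluation that the Adversarial Robustness Toolbox automates, here is a minimal FGSM-style check against a plain logistic regression model; the data, epsilon, and model are illustrative, and this is not ART’s API.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# FGSM for a logistic model: perturb each input in the direction that increases
# its loss, i.e. along sign((p - y) * w), where w are the model coefficients.
eps = 0.3
p = model.predict_proba(X_te)[:, 1]
grad = (p - y_te)[:, None] * model.coef_          # d(logistic loss)/dx per sample
X_adv = X_te + eps * np.sign(grad)

clean_acc = model.score(X_te, y_te)
adv_acc = model.score(X_adv, y_te)
print(f"clean accuracy {clean_acc:.2f} vs adversarial accuracy {adv_acc:.2f}")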

6. Transparency

Transparency in MLOps means ensuring that every action taken during model development and deployment is tracked and reproducible.

  • Model Explainability: Tools like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) are often integrated into MLOps pipelines to explain model decisions in ways that are understandable to non-technical stakeholders, ensuring trust in the system’s transparency.
  • Version Control for Models: MLOps ensures that every deployed model has detailed version control, so stakeholders can track the exact data, code, and parameters that were used. This helps build trust with users, clients, and regulators.
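For example, here is a hedged sketch using shap’s classic TreeExplainer interface on a toy model (assuming the shap package is installed; the exact output shape varies across shap versions).

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes per-feature SHAP contributions for each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])

# Each entry gives per-feature contributions for a prediction, which can be
# surfaced to non-technical stakeholders as a plain-language reason for the decision.
print(shap_values)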

7. Accountability

Accountability in MLOps is achieved through comprehensive logging, auditing, and human oversight at every step of the AI lifecycle:

  • Logging and Audit Trails: Every interaction with the model, from the data used in training to the decisions made in production, is logged. This ensures full traceability in case a model needs to be audited or a legal dispute arises.
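A minimal sketch of what such an audit record could look like, using Python’s standard logging module with JSON payloads; the field names and values are illustrative.

import json
import logging
import uuid
from datetime import datetime, timezone

audit_logger = logging.getLogger("model_audit")
logging.basicConfig(level=logging.INFO)

def log_prediction(model_version, features, prediction):
    """Append one auditable record per production decision."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,        # ties the decision back to model lineage
        "features": features,
        "prediction": prediction,
    }
    audit_logger.info(json.dumps(record))      # ship to a durable log store in practice
    return record["request_id"]

log_prediction("demo-model:3", {"price": 120, "nights": 2}, "show_offer")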

Wrap Up

You've seen both the strategic benefits of implementing trustworthy AI through MLOps and the technical components. Whether you're leading the charge from an executive level — or directly involved in AI implementation — the journey doesn't end with automation. It evolves with continuous improvement and human vigilance.

If you're ready to take the next step, let's talk about how MLOps can unlock the full potential of your AI solutions, ensuring they are not only high-performing, but also ethically responsible, and aligned with the demands of today’s regulatory landscape.

8th Light
Amanda Graham

As machine learning becomes more integrated with business processes, trustworthy AI principles have moved from a "nice to have" to a necessity, often driven by regulatory requirements. Many organizations are heavily investing in AI governance to understand and apply these principles effectively. A compelling and perhaps surprising approach to implementing trustworthy AI principles is through MLOps.

MLOps integrates DevOps principles into the lifecycle of machine learning models. (In fact, our team discussed how this looked a few years back and gave some valuable insights into scaling ML and the beginnings of MLOps.) This includes everything from data collection and model training to deployment and continuous monitoring. With MLOps, organizations have a powerful toolset for building trustworthy AI systems at scale, allowing them to automate key processes while ensuring ethical standards are met. 

However, no system is entirely hands-off. Human oversight remains a critical part of the equation in order to address and mitigate bias, but MLOps provides the backbone for trustworthy AI by embedding checks and balances directly into the workflow.

The Intersection of MLOps and Trustworthy AI

MLOps provides a solid framework for implementing and measuring key trustworthy AI principles. Here's a high-level list. (And if you are more technical, you can find more detail in The MLOps Architecture Behind Trustworthy AI Principles.) 

  • Fairness: MLOps integrates fairness checks directly into data engineering and model development, allowing for continuous automated monitoring of model performance across different demographic groups. While achieving perfect fairness is complex, MLOps ensures that fairness is no longer an afterthought — but a fundamental part of the process. Regular audits by human experts complement these automated checks to ensure nuanced issues of bias are addressed.
     
  • Reliability and Robustness: MLOps automates the detection of data drift and triggers model retraining, ensuring that AI systems remain reliable over time, even as real-world data changes. This combination of automated retraining and human expertise ensures that models continue to meet performance expectations in dynamic environments.
     
  • Privacy: With techniques such as federated learning, differential privacy, and encryption, organizations protect sensitive data while still achieving robust model performance and scalability. Automated tracking of data usage ensures that privacy regulations are met, while human oversight ensures that privacy-utility tradeoffs are carefully balanced for each use case. MLOps allows organizations to maintain strong privacy controls without sacrificing performance. Keep in mind there are tradeoffs to Privacy Enhancing Techniques/Technologies (PETs). Federated learning has longer training times and differential privacy can change the model’s usefulness and accuracy. 
     
  • Security: With MLOps, robust security measures are implemented at every stage, from data handling to model deployment. Continuous monitoring identifies potential breaches or vulnerabilities, while periodic audits and updates by security teams reinforce protections. This dual approach of automation and human vigilance ensures that AI systems remain secure and resilient against evolving threats.
     
  • Transparency: MLOps enhances transparency by providing detailed version control, tracking and reproducing all model iterations. This supports regulatory compliance and helps stakeholders understand changes made during the AI lifecycle. However, while MLOps documents the decision-making process, explaining complex models like deep learning often requires human expertise to interpret and communicate how automated decisions are made, ensuring clarity and trust. Keep in mind that while this addresses transparency behind the scenes, UX decisions must also be made to provide transparency to the users of these systems. Learn more about how to address and mitigate bias in AI using product design.
     
  • Accountability: By documenting every step of the AI lifecycle, MLOps ensures teams identify who is responsible for each decision. This traceability strengthens accountability and makes it easier to address any issues that arise. Human oversight remains essential for ethical and legal responsibility, ensuring that accountability extends beyond technical documentation to address broader governance concerns.

The Reality of MLOps Adoption

Despite the clear advantages of MLOps generally, and the added benefits of aligning AI systems with trustworthy principles, adoption has been gradual. Perceived costs and the learning curve are major hurdles, particularly for smaller companies. MLOps often requires significant organizational change, including fostering cross-functional collaboration between data scientists and operations teams, a surprisingly rare feat. Simply implementing tools isn’t enough; companies must invest in culture, processes, and training to fully realize MLOps benefits. In many ways, the same issues come into play as in DevOps, and you can learn more about overcoming those challenges in our Strategic DevOps Playbook.

The Path Forward with MLOps

MLOps is a powerful tool for scaling AI and embedding trustworthy principles, but it’s important to approach its implementation with realistic expectations. Although many aspects of the AI lifecycle can and should be automated, some challenges will always require human intervention and domain expertise. This isn’t a limitation, but something to be valued.

Implementing MLOps is not an overnight process. It requires a blend of strategic vision, operational discipline, and technical expertise. It also requires a company culture that supports continuous improvement. As organizations look to scale trustworthy AI, MLOps provides a strong foundation, but agility and ongoing refinement will be key to addressing evolving challenges and regulatory requirements.

Feel free to reach out if you'd like to discuss how MLOps can be implemented in your organization to enhance your AI solutions.

Criteo
Peter Milne

CrewAI and Criteo API — Part 1

Introduction

This article is the first in a series showing how to use CrewAI and Criteo APIs. We will see how to obtain credentials, use those credentials to get an Access Token, and use that token to call endpoints that return Accounts, Retailers, and Brands, all from a CrewAI crew.

CrewAI is a “cutting-edge framework for orchestrating role-playing autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly and tackle complex tasks.”
Criteo Retail Media API unlocks various possibilities to enhance media campaign performance from any platform. It allows you to create, launch, and monitor online marketing campaigns and provides a comprehensive view of their performance.
Together, CrewAI and Retail Media unlock the power of AI and the power of Commerce Media.

Purpose

Using CrewAI, you will become familiar with Criteo’s Retail Media APIs and how to use them as tools for large language models (LLMs), AI Agents, etc.

For more detailed articles and videos on large language models (LLMs), AI Agents, and CrewAI, see my favourite authors listed at the end of this article. (Sam Witteveen and Brandon Hancock)

Overview

We aim to use a crew of AI Agents and Tasks to retrieve Accounts, Retailers and Brands for Retail Media APIs and perform rudimentary analytics. We will get a developer account at Criteo, create Tools to access the APIs, build an Agent that uses the tools and specify Tasks that will be executed sequentially by the Crew.

All the code for this article is in Python and uses poetry as the package manager/environment manager.

Prerequisites

You will need to install the following to run the code examples:

Step-by-Step Guide

Step 1: Criteo Developer Account, Credentials and Authentication

To use the Criteo APIs, you need a developer account created in the Developer Portal by clicking the ‘Get started’ button.

This will take you to the Criteo partners dashboard. Click on ‘create a new app’ (you can see my application already defined)

You will need consent to data provided by the APIs; follow the prompts to be authorised.

Once you have consent, click ‘create a new key’ to create credentials for your application.

A file containing the credentials is automatically downloaded to your local machine. You will use these credentials to obtain an access token for each API call.

# Here is an example:

---------------------------
| Criteo Developer Portal |
---------------------------


Please store your client secret carefully on your side.
You will need it to connect to the API and this is the only time we will be able to communicate it to you.
You can find more information on our API Documentation at https://developers.criteo.com.


application_id: <application id>
client_id: <client id>
client_secret: <client secret here>
allowed_grant_types: client_credentials

Tips: Keep your credentials secret. Don’t commit them to a public repository (GitHub, GitLab, etc).

Authentication with client credentials results in an AccessToken that is valid (at the time of writing) for about 15 minutes. Call the Criteo authentication API for a valid token using your client credentials.

The following code snippet is a function that retrieves an AccessToken using client credentials and caches it for 15 minutes.

https://medium.com/media/fd85928e4ba55314984b9138fe353e03/href

Lines 15–16 retrieve the client ID and secret from environment variables (.env)

Line 17 defines the headers, specifically the content-type of application/x-www-form-urlencoded. This header value is quite important.

Lines 18–22 set up the data containing your credentials.

Line 23 executes a POST request to get an access token, and line 26 returns a structure containing the token, the token type, and an expiration time in seconds.

Example auth result as JSON:

{
"access_token": "eyJhbGciOiJSUzII ... pG5LGeb4aiuB0EKAhszojHQ",
"token_type": "Bearer",
"refresh_token": null,
"expires_in": 900
}

The rest of the code caches the result until the token expires.
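
If the embedded gist above does not render for you, here is a rough reconstruction of what such a function can look like. The token endpoint URL, the variable names, and the caching details are assumptions based on the description; the line numbers referenced earlier refer to the original gist, not to this sketch.

import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

_cached_token: dict | None = None
_expires_at: float = 0.0


def get_access_token() -> dict:
    """Fetch (and cache) a Criteo access token using client credentials."""
    global _cached_token, _expires_at
    if _cached_token and time.time() < _expires_at:
        return _cached_token

    headers = {"Content-Type": "application/x-www-form-urlencoded"}  # this header matters
    data = {
        "grant_type": "client_credentials",
        "client_id": os.environ["CRITEO_CLIENT_ID"],
        "client_secret": os.environ["CRITEO_CLIENT_SECRET"],
    }
    # Assumed token endpoint; check the Criteo developer docs for the current URL.
    response = requests.post("https://api.criteo.com/oauth2/token", headers=headers, data=data)
    response.raise_for_status()

    _cached_token = response.json()
    _expires_at = time.time() + _cached_token.get("expires_in", 900)
    return _cached_token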

Step 2: CrewAI environment setup

Clone the repository and change directory to part_1; the code used in this article lives there. A poetry project is already defined in the file pyproject.toml.

Run these commands in a terminal to install the dependencies, create/update the poetry environment and jump into the correct shell.

poetry install --no-root
poetry shell

VS Code:

If you use VS Code, check that it uses the correct virtual environment. To set the Python interpreter to your virtual environment, get the path with this command

poetry env info --path
/Users/petermilne/Library/Caches/pypoetry/virtualenvs/part-1-qwAxeBFF-py3.12

and copy the path.

Then click on the ‘Python …’ indicator in the bottom right-hand corner of VS Code.

Python interpreter in vs code

Choose: Enter interpreter path

and paste the path

Environment Variables .env

You will need to create a .env file similar to this:

CRITEO_CLIENT_ID=<your client id>
CRITEO_CLIENT_SECRET=<your client secret>
RETAIL_MEDIA_API_URL=https://api.criteo.com/2024-07/retail-media/

# only if you use Azure
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=
OPENAI_API_VERSION=2024-02-15-preview

# only if you use Groq
GROQ_API_KEY=<your groq api key>
GROQ_AI_MODEL_NAME=llama-3.1-70b-versatile

Tip: Criteo APIs are versioned by date (e.g. 2024-07), so be sure to use the current API version.

Groq or Azure OpenAI, and why not OpenAI?

Groq Cloud is a fast and inexpensive LLM service using new technology that is “powered by the Groq LPU and available as public, private, and co-cloud instances, GroqCloud redefines real-time.” It is free for developers and a great way to start with LLMs

Azure OpenAI is a private instance of an LLM service. What does “private” mean? This ensures OpenAI does not use the proprietary data you pass into the LLM to train its future models. i.e., the data from your API calls does not become part of the public domain!

Tips:

If your poetry environment is not running in the terminal, check you are in the correct directory/folder, then run:

poetry install --no-root
poetry shell

Step 3: CrewAI Tools using APIs

A tool in CrewAI is a skill or function that agents can utilize to perform various actions. This includes tools from the crewAI Toolkit and LangChain Tools, enabling everything from simple searches to complex interactions and effective teamwork among agents.

https://docs.crewai.com/core-concepts/Tools/

Our first task is to create CrewAI Tools to call the Retail Media REST APIs. We will create three simple tools to retrieve:

  • A list of Accounts accessible to the current user (user credentials)
  • A list of Brands for the accounts
  • A list of Retailers for the accounts

Each tool will use the equivalent REST API endpoint
(see: https://developers.criteo.com/retail-media/docs/account-endpoints)

Let’s discuss one of these tools: RetailersTool

https://medium.com/media/ee2f5d19659c3962dcde8082361fd045/href

Here, we have defined a class named RetailersTool that implements the tool, subclassing BaseTool from crewai_tools.

Lines 26–32 implement the _run method, which makes the call to the Retail Media API and is invoked by the agents using the tool. You can see the parameters accountId, pageIndex and pageSize passed to the REST call. The return value is the response body, which is JSON (a rough sketch of the class follows).
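
In case the embedded gist does not render, here is a hedged reconstruction of what a tool like RetailersTool can look like. The BaseTool import path, the endpoint path, and the default paging values are assumptions based on the description above; the account-endpoints documentation linked earlier has the exact API shape.

import os

import requests
from crewai_tools import BaseTool


class RetailersTool(BaseTool):
    """Tool that lists the retailers visible to a Retail Media account."""

    name: str = "Retailers Tool"
    description: str = "Retrieves the list of retailers for a given Retail Media account."

    def _run(self, accountId: str, pageIndex: int = 0, pageSize: int = 25) -> str:
        token = get_access_token()  # the cached-token helper sketched in Step 1
        url = f"{os.environ['RETAIL_MEDIA_API_URL']}accounts/{accountId}/retailers"  # assumed path
        response = requests.get(
            url,
            headers={"Authorization": f"Bearer {token['access_token']}"},
            params={"pageIndex": pageIndex, "pageSize": pageSize},
        )
        response.raise_for_status()
        return response.text  # JSON body handed back to the agent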

Step 4: Agents

An agent in CrewAI is an autonomous unit programmed to perform tasks, make decisions, and communicate with other agents. Think of an agent as a member of a team with specific skills and a particular job to do. Agents can have different roles, such as ‘Researcher’, ‘Writer’, or ‘Customer Support’, each contributing to the crew's overall goal.

https://docs.crewai.com/core-concepts/Agents/

You can think of an agent as the embodiment of a persona, or simply as a chunk of intelligent processing.

You can define agents entirely in code or in a yaml file with a little code in the crew. Using a yaml file encourages the separation of concerns and allows non-programmers to define agents’ properties.

Here, we define the agent account_manager properties in config/agents.yaml and the agent code in crew.py

Yaml snippet: agents.yaml

account_manager:
  role: >
    Account manager
  goal: >
    Provide lists of accounts, retailers and brands
  backstory: >
    You're an expert in managing accounts and retrieving information about accounts, retailers, and brands.
    You're known for your ability to provide accurate and up-to-date information to help your team make informed decisions.
    You use the Retail Media REST API efficiently by choosing the correct API and making the right number of requests.
    Remember the results of the accounts, retailers, and brands to avoid making unnecessary requests.

  verbose: True
  cache: True

The agent is designed with three key elements: role, goal, and backstory.

Role: This defines the agent’s job within the crew. In this case, the role is simply Account Manager

Goal: This specifies what the agent aims to achieve. The goal is aligned with the agent’s role and the overall objectives of the crew. Here, the goal is to provide a list of accounts, retailers and brands

Backstory: This provides depth to the agent’s persona, enriching its motivations and engagements within the crew. The backstory contextualises the agent’s role and goal, making interactions more meaningful. Here, the agent is an expert in managing accounts and has specific instructions on how to go about its responsibilities.

The LLM uses these properties as part of the prompt to configure its behaviour and core competencies.

Code snippet: crew.py

    """
Account manager agent instance created from the config file.
The function is decorated with the @agent decorator to indicate that it is an agent.
"""

@agent
def account_manager(self) -> Agent:
    return Agent(
        config=self.agents_config["account_manager"],
        llm=llm,  # if you use Azure OpenAI or Groq
    )

The actual code loads the properties from the YAML file and sets the LLM (if you are using Groq or Azure OpenAI).

Step 5: Tasks

In the crewAI framework, tasks are specific assignments completed by agents. They provide all necessary details for execution, such as a description, the agent responsible, required tools, and more, facilitating a wide range of action complexities.
Tasks within crewAI can be collaborative, requiring multiple agents to work together. This is managed through the task properties and orchestrated by the Crew’s process, enhancing teamwork and efficiency.

https://docs.crewai.com/core-concepts/Tasks/

In this example, we have three tasks:

  • accounts: Retrieve Accounts data and produce a Markdown file.
  • brands: Retrieve Brands data for a specific Account and produce a Markdown file.
  • retailers: retrieve Retailers data for a specific Account and produce a Markdown file.

Similar to Agents, you can define tasks entirely in code or in a yaml file with a little code in the crew. Similarly, using a yaml file encourages the separation of concerns and allows non-programmers to define task properties.

Here, we define the task brands properties in config/tasks.yaml and the task code in crew.py

brands:
  description: >
    Iterate through the {accounts list}, and for each {account} retrieve the Retail Media brands. Use the {account id} to get the brands.
  expected_output: >
    A list of brands for the account formatted as a table in Markdown. Here is an example of the expected output:
    | Brand ID | Brand Name |
  agent: account_manager
  context:
    - accounts

A task typically includes the following properties:

Description: This is a detailed explanation of what the task entails. It provides the purpose and the steps needed to complete the task. Here, the brands task is instructed to retrieve the Brands for each Account.

Expected Output: This defines the desired outcome of the task. It should be clear and specific. In this example, the output is a markdown table with an example.

Agent: This refers to the entity responsible for executing the task. It could be a specific person, a team, or an automated system. Here, the task is to be done by the account_manager agent.

Context: This includes any additional information or data that provides background or input for the task. It helps understand the environment or conditions under which the task should be performed. The brands task needs input from the results of the accounts task.

Code snippet: crew.py

    """
Brands task instance created from the config file.
This function is decorated with the @task decorator to indicate that it is a task.
Its job is to retrieve Brands data for a specific Account and produce a Markdown file.
"""

@task
def brands(self) -> Task:
    return Task(
        config=self.tasks_config["brands"],
        output_file="output/brands.md",
        async_execution=True,
        context=[self.accounts()],
        tools=[
            BrandsTool(),
        ],
    )

Similar to the agent configurations, the code for the tasks loads properties from the tasks.yaml file. In this example, you see that the output of the task is written to the file: output/brands.md.

Note that we have been explicit about the tool to be used to accomplish this task: BrandsTool(). This enables the agent performing the task to be more focused and less easily confused.

Step 6: The Crew

A crew in crewAI represents a collaborative group of agents working together to achieve a set of tasks. Each crew defines the strategy for task execution, agent collaboration, and the overall workflow.

https://docs.crewai.com/core-concepts/Crews/

The crew is the fabric that stitches everything together. It creates instances of the Agents, Tasks, and Tools and specifies the crew's execution details. This is where the “rubber meets the road”.

Code Snippet: crew.py

https://medium.com/media/627a90d021b53f966df5afe854489257/href

Creating the LLM

Lines 10–24 create the LLM used by the Agent. Here, you can create the LLM from Groq or Azure OpenAI, or, as we will see in later articles, you can use both for different agents on your crew.

Creating the Crew

The class Part1Crew is defined by lines 27–82 (note: some lines are omitted for brevity; complete code at: https://github.com/helipilot50/criteo-retail-media-crew-ai/blob/main/part_1/src/part_1/crew.py)

Lines 73–82 define the crew as a function/method in the class.
The process is sequential, meaning the tasks will be executed in the order they are defined. We have set the verbose flag to true to see a verbose log of activity in the file: output/part_1.log
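
Since the gist may not render here, the following is a rough, hedged sketch of what a sequential crew method in a @CrewBase class typically looks like; the names follow the crewAI conventions used above, and the exact details (including the log-file parameter) may differ from the repository code.

from crewai import Crew, Process
from crewai.project import CrewBase, agent, crew, task


@CrewBase
class Part1Crew:
    """Crew that retrieves accounts, brands and retailers via the Retail Media API."""

    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

    # ... @agent and @task methods as shown in the earlier snippets ...

    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,            # collected from the @agent-decorated methods
            tasks=self.tasks,              # collected from the @task-decorated methods
            process=Process.sequential,    # tasks run in the order they are defined
            verbose=True,
            output_log_file="output/part_1.log",  # assumed parameter for the verbose log
        )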

Step 7: Running the Crew

To run the crew, enter the following command in the terminal.

crewai run

Ensure that you are using the correct Poetry environment. Many frustrating hours, grey hairs, and expletives can be avoided if you check the environment:

poetry env info

Each task will output its results to a file; these are:

  • accounts: output/accounts.md
  • brands: output/brands.md
  • retailers: output/retailers.md

Here is a sample of the output for Retailers:

| Retailer ID | Retailer Name | Campaign Eligibilities |
|-------------|---------------|------------------------|
| 314159 | Marysons | auction, preferred |
...
| 398687 | Office Stuff | auction, preferred |
| 873908 | GAI Group | auction, preferred |

Conclusion

Summary: We have seen how to use Retail Media APIs as tools for Agents and Tasks in CrewAI. This is a deliberately simple example that walks through the setup and “plumbing” needed to connect these technologies.

Next Steps: If you haven't already done so, watch the videos by Sam and Brandon. And soon, this series will have a “Part 2”.

Code in GitHub

Additional Resources

Related Articles:

Links:

Favourite Authors:

Sam Witteveen — CEO & Co-Founder @ Red Dragon AI / Google Developer Expert for Machine Learning — Publications NeurIPS, EMNLP

https://medium.com/media/da57d18f43c83880c8924bbb07994afb/href
https://medium.com/media/b3db2340ada66fc18acd3616a40c5091/href

Brandon Hancock — CrewAI Senior Software Engineer | Content Creator on YouTube

https://medium.com/media/3fc55fa389cd28538ad7680ff7473dd0/href
https://medium.com/media/ec9d832736eb0af6dd7b3e73d3d4f6e5/href

CrewAI and Criteo API — Part 1 was originally published in Criteo Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Postmark
Matt Reibach ([email protected])

We are happy to introduce the official Postman Collection for Postmark! If you are not familiar with Postman it is a graphical interface tool that allows you to test and share API calls at the click of a button. Our Postman collection makes it easy to try out the Postmark APIs and see the results instantly without the need to write any code.

Postman is similar to Postmark’s API Explorer because it allows you to view and send prepopulated API calls. Postman also allows you to save variables that can be shared across multiple API calls. This makes it even easier to test multiple API calls together. 

If you are not familiar with Postman, here are a few things you should know:

  • In order to get started we recommend downloading the Postman desktop application.
  • A Postman Collection is a set of API endpoints and requests, along with the required authorization, headers, request bodies, and parameters for each API call.
  • Postman Collections also include a set of variables called Collection Variables that may be required for API calls to work.

Once you have installed Postman you can import the Postmark collection by clicking the Run in Postman button below. You can choose to Fork or Import the collection. 

Before you can use the collection you will need to update some of the Collection Variables. Most importantly, you will need your API tokens. Postmark makes use of two types of API tokens, depending on the endpoint: the Server API token and the Account API token. You can access your API tokens in your Postmark account.

In Postman, make sure you have the top level directory of the collection ("Postmark APIs") selected. Then click on the "Variables" tab.


The "Current value" field will be used when a variable is accessed by the collection. To get started, for the api_token variable, replace the current value with your Postmark Server token. Next, replace the account_token current value with your Postmark Account token.

We recommend getting started with the Email endpoint. Many of the other endpoints require some email data before functioning properly. Expand the Email directory in the collection window and select Send a single email.

We have prepopulated the body of this call with an example message, but you will need to change the From field to a valid sender signature for your account before the API call will be accepted. You can also take this time to experiment with the other fields available in this API call.

Once you have changed the From address to a valid sender signature, you can click the "Send" button in Postman to send the API call. The API response will be output beneath the request, and if all goes well you should see a 200 response.
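
For reference, here is a minimal Python sketch of the same request Postman sends under the hood; the token and the From/To addresses are placeholders you would replace with your own Server token, sender signature, and recipient.

import requests

response = requests.post(
    "https://api.postmarkapp.com/email",
    headers={
        "Accept": "application/json",
        "Content-Type": "application/json",
        "X-Postmark-Server-Token": "your-server-token-here",  # placeholder
    },
    json={
        "From": "sender@example.com",    # must be a valid sender signature on your account
        "To": "recipient@example.com",
        "Subject": "Hello from Postmark",
        "TextBody": "This is a test message sent via the Postmark API.",
    },
)
print(response.status_code, response.json())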

You can also verify that the API call was received successfully by logging into your Postmark account and viewing the activity feed for the server and message stream that you used to make the API call.


Airbnb
Chutian Wang

How Airbnb’s conversational AI platform powers LLM application development.

By Chutian Wang, Zhiheng Xu, Paul Lou, Ziyi Wang, Jiayu Lou, Liuming Zhang, Jingwen Qiang, Clint Kelly, Lei Shi, Dan Zhao, Xu Hu, Jianqi Liao, Zecheng Xu, Tong Chen

Introduction

Artificial intelligence and large language models (LLMs) are a rapidly evolving sector at the forefront of technological innovation. AI’s capacity for logical reasoning and task completion is changing the way we interact with technology.

In this blog post, we will showcase how we advanced Automation Platform, Airbnb’s conversational AI platform, from version 1, which supported conversational systems driven by static workflows, to version 2, which is designed specifically for emerging LLM applications. Now, developers can build LLM applications that help customer support agents work more efficiently, provide better resolutions, and quicker responses. LLM application architecture is a rapidly evolving domain and this blog post provides an overview of our efforts to adopt state-of-the-art LLM architecture to keep enhancing our platform based on the latest developments in the field.

Overview of Automation Platform

In a previous blog post, we introduced Automation Platform v1, an enterprise-level platform developed by Airbnb to support a suite of conversational AI products.

Automation Platform v1 modeled traditional conversational AI products (e.g., chatbots) into predefined step-by-step workflows that could be designed and managed by product engineering and business teams.

Figure 1. Automation Platform v1 architecture.

Challenges of Traditional Conversational AI Systems

Figure 2. Typical workflow that is supported by v1 of Automation Platform.

We saw several challenges when implementing Automation Platform v1, which may also be broadly applicable to typical conversational products:

  1. Not flexible enough: the AI products are following a predefined (and usually rigid) process.
  2. Hard to scale: product creators need to manually create workflows and tasks for every scenario, and repeat the process for any new use case later, which is time-consuming and error prone.

Opportunities of Conversational AI Driven by LLM

Our early experiments showed that LLM-powered conversation can provide a more natural and intelligent conversational experience than our current human-designed workflows. For example, with an LLM-powered chatbot, customers can engage in a natural dialogue, asking open-ended questions and explaining their issues in detail. An LLM can more accurately interpret customer queries, even capturing nuanced information from the ongoing conversation.

However, LLM-powered applications are still relatively new, and the community is improving aspects of them, such as latency and hallucination, to meet production-level requirements. So it is too early to rely on them fully for the large-scale and diverse experiences of millions of customers at Airbnb. For instance, it is more suitable to use a traditional workflow instead of an LLM to process a claim-related product that requires sensitive data and a number of strict validations.

We believe that at this moment, the best strategy is to combine them with traditional workflows and leverage the benefits of both approaches.

Figure 3. Comparison of traditional workflows and AI driven workflows

Architecture of LLM Application on Automation Platform v2

Figure 4 shows a high level overview of how Automation Platform v2 powers LLM applications.

Here is an example of a customer asking our LLM chatbot “where is my next reservation?”

  • First, the user inquiry arrives at our platform. Based on the inquiry, our platform collects relevant contextual information, such as previous chat history, user id, user role, etc.
  • After that, our platform loads and assembles the prompt using the inquiry and context, then sends it to the LLM.
  • In this example, the first LLM response will request a tool execution that makes a service call to fetch the most recent reservation of the current user. Our platform carries out this request, makes the actual service call, and then saves the response into the current context.
  • Next, our platform sends the updated context to the LLM, and the second LLM response will be a complete sentence describing the location of the user's next reservation.
  • Lastly, our platform returns the LLM response and records this round of conversation for future reference.
Figure 4. Overview of how Automation Platform v2 powers LLM application

Another important area we support is developers of LLM applications. There are several integrations between our system and developer tools to make the development process seamless. Also, we offer a number of tools like context management, guardrails, playground and insights.

Figure 5. Overview of how Automation Platform v2 powers LLM developers

In the following subsections, we will deep dive into a few key areas on supporting LLM applications including: LLM workflows, context management and guardrails.

While we won’t cover all aspects in detail in this post, we have also built other components to facilitate LLM practice at Airbnb including:

  • Playground feature to bridge the gap between development and production tech stacks by allowing prompt writers to freely iterate on their prompts.
  • LLM-oriented observability with detailed insights into each LLM interaction, like latency and token usage.
  • Enhancement to Tool management that is responsible for tools registration, the publishing process, execution and observability.

Chain of Thought Workflow

Chain of Thought is one of the AI agent frameworks that enable LLMs to reason about issues.

We implemented the concept of Chain of Thought in the form of a workflow on Automation Platform v2 as shown below. The core idea of Chain of Thought is to use an LLM as the reasoning engine to determine which tools to use and in which order. Tools are the way an LLM interacts with the world to solve real problems, for example checking a reservation’s status or checking listing availability.

Tools are essentially actions and workflows, the basic building blocks of traditional products in Automation Platform v1. Actions and workflows work well as tools in Chain of Thought because of their unified interface and managed execution environment.

Figure 6. Overview of Chain of Thought workflow

Figure 6 contains the main steps of the Chain of Thought workflow. It starts with preparing context for the LLM, including prompt, contextual data, and historical conversations. Then it triggers the logic reasoning loop: asking the LLM for reasoning, executing the LLM-requested tool and processing the tool’s outcome. Chain of Thought will stay in the reasoning loop until a result is generated.
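
As a purely illustrative sketch of that loop (not Airbnb's implementation), the flow can be expressed roughly as follows; build_prompt, call_llm, execute_tool, and the response format are hypothetical placeholders.

def chain_of_thought(inquiry: str, context: dict, max_steps: int = 10) -> str:
    """Reason in a loop: ask the LLM, execute any requested tool, feed the result back."""
    messages = build_prompt(inquiry, context)  # hypothetical: prompt + contextual data + history
    for _ in range(max_steps):
        response = call_llm(messages)          # hypothetical LLM adapter
        if response.get("tool_call"):
            tool_result = execute_tool(response["tool_call"])  # hypothetical tool manager
            messages.append({"role": "tool", "content": tool_result})
            continue
        return response["content"]             # final answer returned to the user
    return "Unable to resolve the request; escalating to a human agent."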

Figure 7. High level components powering Chain of Thought in Automation Platform

Figure 7 shows all high-level components powering Chain of Thought:

  1. CoT (Chain of Thought) IO handler: assemble the prompt, prepare contextual data, collect user input and general data processing before sending it to the LLM.
  2. Tool Manager: prepare tool payload with LLM input & output, manage tool execution and offer quality of life features like retry or rate limiting.
  3. LLM Adapter: allow developers to add customized logic facilitating integration with different types of LLMs.

Context Management

To ensure the LLM makes the best decision, we need to provide all necessary and relevant information to the LLM such as historical interactions with the LLM, the intent of the customer support inquiry, current trip information and more. For use cases like offline evaluation, point-in-time data retrieval is also supported by our system via configuration.

Given the large amount of available contextual information, developers can either statically declare the context they need (e.g. the customer name) or name a dynamic context retriever (e.g. help articles relevant to the customer's question).

Figure 8. Overall architecture of context management in Automation Platform v2

Context Management is the key component ensuring the LLM has the access to all necessary contextual information. Figure 8 shows major Context Management components:

  1. Context Loader: connect to different sources and fetch relevant context based on developers’ customizable fetching logic.
  2. Runtime Context Manager: maintain runtime context, process context for each LLM call and interact with context storage.

Guardrails Framework

LLMs are powerful text generation tools, but they can also come with issues like hallucinations and jailbreaks. This is where our Guardrails Framework comes in: a safeguarding mechanism that monitors communications with the LLM, ensuring they are helpful, relevant, and ethical.

Figure 9. Guardrails Framework architecture

Figure 9 shows the architecture of the Guardrails Framework, where engineers from different teams create reusable guardrails. During runtime, guardrails can be executed in parallel and leverage different downstream tech stacks. For example, the content moderation guardrail calls various LLMs to detect violations in communication content, and tool guardrails use rules to prevent bad executions, for example updating listings with an invalid setup.

What’s Next

In this blog, we presented the most recent evolution of Automation Platform, the conversational AI platform at Airbnb, to power emerging LLM applications.

LLM application is a rapidly developing domain, and we will continue to evolve with these transformative technologies, explore other AI agent frameworks, expand Chain of Thought tool capabilities and investigate LLM application simulation. We anticipate further efficiency and productivity gains for all AI practitioners at Airbnb with these innovations.

We’re hiring! If work like this interests you, check out our careers site.

Acknowledgements

Thanks to Mia Zhao, Zay Guan, Michael Lubavin, Wei Wu, Yashar Mehdad, Julian Warszawski, Ting Luo, Junlan Li, Wayne Zhang, Zhenyu Zhao, Yuanpei Cao, Yisha Wu, Peng Wang, Heng Ji, Tiantian Zhang, Cindy Chen, Hanchen Su, Wei Han, Mingzhi Xu, Ying Lyu, Elaine Liu, Hengyu Zhou, Teng Wang, Shawn Yan, Zecheng Xu, Haiyu Zhang, Gary Pan, Tong Chen, Pei-Fen Tu, Ying Tan, Fengyang Chen, Haoran Zhu, Xirui Liu, Tony Jiang, Xiao Zeng, Wei Wu, Tongyun Lv, Zixuan Yang, Keyao Yang, Danny Deng, Xiang Lan and Wei Ji for the product collaborations.

Thanks to Joy Zhang, Raj Rajagopal, Tina Su, Peter Frank, Shuohao Zhang, Jack Song, Navjot Sidhu, Weiping Peng, Kelvin Xiong, Andy Yasutake and Hanlin Fang’s leadership support for the Intelligent Automation Platform.


Automation Platform v2: Improving Conversational AI at Airbnb was originally published in The Airbnb Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Zendesk
Matt Venz

How we enabled self service Kafka topic provisioning

We use Kafka extensively at Zendesk. We have multiple Kafka clusters and at the time of writing this we have over a thousand topics that are replicated on each cluster.

Until recently, our solution for provisioning and updating topics was to use a GitHub repo with a JSON document containing all the topic definitions and a service that periodically processed the document and created, updated, or deleted topics according to the latest definitions.

While this was vastly more efficient than creating topics manually, it still required development teams to co-ordinate changes with the Kafka administration team and wait for those changes to be pushed to GitHub, approved, merged, and finally deployed to the clusters.

The old way of provisioning Kafka topics

This workflow slowed down development teams and increased the workload on the Kafka administration team. We wanted to empower development teams to provision their own topics, increasing the speed with which they could make changes and freeing us up to focus on more interesting and productive work.

The joy of self service

Thanks to the efforts of our infrastructure teams, we do have a Self Service interface at Zendesk that allows teams to specify the resources that their service needs in a simple YAML file along with their code. When the service is deployed, the Self Service API takes the resource definitions and creates Kubernetes resources based on the definitions.

Provisioning topics through Self Service

To provide a Self Service interface for Kafka topics, we needed to implement our own Kubernetes custom resource and a Kubernetes operator to reconcile those custom resources against the Kafka cluster.

A Kubernetes operator is a simple state machine that works by examining the specification of a resource and performing actions on a real system based on the specification. The operator is usually triggered by a change to a resource that produces an event. The operator then makes changes and continues to trigger new updates to the resource until the real system matches the Kubernetes representation.

The typical workflow for reconciling a Kafka topic resource looks something like this:

  1. A deployment triggers an update to the Kubernetes topic resource
  2. The operator receives an event with the updated resource
  3. If the topic doesn’t exist, create it based on the spec in the resource
  4. If the topic exists, update it based on the spec in the resource
  5. If the state of the topic has changed, update the resource to reflect the new state
  6. If the resource has changed, go to step 2 again
Lifecycle of the Kubernetes operator
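
In rough, illustrative pseudocode (not the actual implementation; the helper functions are hypothetical), a single reconcile pass for the workflow above looks something like this:

def reconcile(resource: dict) -> None:
    """One reconcile pass: make the real Kafka topic match the Kubernetes spec."""
    spec = resource["spec"]
    topic_name = spec.get("topicName", resource["metadata"]["name"])

    if not topic_exists(topic_name):                     # hypothetical Kafka admin helper
        create_topic(topic_name, apply_defaults(spec))   # defaults filled in by the operator
    else:
        update_topic(topic_name, apply_defaults(spec))

    observed = describe_topic(topic_name)                # hypothetical
    if observed != resource.get("status", {}).get("observedState"):
        # Updating the resource status emits a new event, so reconciliation
        # repeats until the real topic and the Kubernetes resource converge.
        update_resource_status(resource, observed)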

Making configuration easy and safe

Kafka topics have a large number of configuration options (over 30 at this point), but the majority of them can be set to a default value for most topics. To simplify things for developers, we decided to limit the number of available options to the bare minimum required.

We currently allow 9 different values to be set, but only the topic name is required to provision a new topic. All other values are set to a default by the Kubernetes operator. The spec required to create a topic can be as simple as this:

kafkaTopic:
  - name: "example-topic"
    attributes:
      topicName: "example.topic"

Naming things is hard

In the example above, the topicName attribute is required in addition to the name attribute. This is because our interface requires us to provide a name for the Kubernetes resource. Unfortunately Kubernetes naming requirements are different to those of Kafka, and Kubernetes doesn’t allow the use of . in resource names.

If the topic name is also a valid Kubernetes resource name, even the topicName attribute can be left out:

kafkaTopic:
  - name: "example-topic"

Unfortunately, in most cases we can’t do this as we encourage the use of . characters in our topic names to separate namespaces.

Sensible defaults and limits

Because we control our Kafka cluster as well as the provisioning of resources, we can provide defaults that make sense in our environment. These include values that are required by Kafka such as replication factor and partition count. Some more detailed examples are:

partition count: we default this to 2, which is quite small but still enough for many use cases. We also limit this to a maximum of 9 partitions as a cost control. If a team needs a topic with more partitions they can seek an exemption from the Kafka admin team.

replication factor: we expect this to be set to 3 for the vast majority of topics. This means one broker can go down and the data will still be safely replicated across at least 2 brokers. This can be set lower, but setting it higher would cause an error as there are only 3 brokers available.

min in sync replicas: we set this to one less than the replication factor for the sake of simplicity. We always have 3 replicas and by default each topic is replicated on all 3. In most cases we want to have at least 2 replicas in sync, which means we can still lose one more broker and the data will continue to be available. This allows us to take brokers down safely when we need to do maintenance. If the replication factor for a topic is lower, this value will be automatically adjusted.

User friendly values

Kafka’s configuration values are designed for Kafka’s needs and don’t prioritise the concerns of developers. Data size values are specified in bytes and time values are specified in milliseconds, so it’s not always easy to tell at a glance what a configuration value means.

To make things even easier for developers, we decided to support flexible duration and size units for these kinds of configuration values. For example, instead of entering a retention time of 86400000, a developer can enter the value as 1 day.

1 day → 86400000 (ms)
1 MB → 1048576 (bytes)
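
A minimal sketch of how such friendly values might be converted (the unit tables and parsing rules here are illustrative, not the actual implementation):

import re

DURATION_MS = {"ms": 1, "second": 1000, "minute": 60_000, "hour": 3_600_000, "day": 86_400_000}
SIZE_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_value(text: str, units: dict) -> int:
    """Convert '1 day' -> 86400000 or '1 MB' -> 1048576 using the given unit table."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text.strip())
    if not match:
        raise ValueError(f"Cannot parse: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit not in units:
        unit = unit.rstrip("s")  # accept simple plurals like "2 days"
    return amount * units[unit]


assert parse_value("1 day", DURATION_MS) == 86_400_000
assert parse_value("1 MB", SIZE_BYTES) == 1_048_576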

Migrating existing topics

Having done all the hard work of enabling developers to provision their own topics, we still had a huge number of topics that had been created and provisioned using the old service. Unfortunately, there was no way around the need to move all these topic definitions into the repos of the services that own them.

To make this process easier for developers, we used the existing topic definitions to generate stubs that they could add to their service definitions. This still required a small amount of manual work for the development teams, but it also made them aware of the new capabilities at their disposal.

Another thing to be aware of when managing topics in a decentralised way is that services which create topics need to be deployed before services that consume topics. This is the natural order under normal circumstances, but when deploying services onto a new cluster, it can be a bit harder to manage. When topics are centralised, the topics can all be created before any services. Now, we have to ensure services are deployed in the correct order.

Working smarter, not harder

The new and improved way of provisioning Kafka topics

With the help and collaboration of other teams and services at Zendesk, we’ve now implemented a system which empowers development teams to provision and maintain Kafka topics on their own, without requiring any input from the Kafka administration team.

At the same time, we’ve provided a more user-friendly interface and enabled sensible defaults. This approach reduces the chance of configuration errors and makes it easy for developers to know 1) what they can change and 2) what they should care about.

As the team responsible for Kafka, we can now spend more of our time improving our systems and implementing new features and capabilities, and less time dealing with simple administration tasks.


Provisioning Kafka topics the easy way! was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.
