On October 2, the WomenDoTechToo initiative at Criteo reached a significant milestone. After successfully hosting local meetups across various locations for several years, we felt it was time to take a bold step forward by organizing a half-day conference at our headquarters in Paris. In this article, we invite you to enjoy and explore the insightful talks from that remarkable afternoon.
Before diving into the details of the event, let’s first establish the context and introduce the initiative.
Women Do Tech Too is a dynamic event hosted by Women@Criteo, one of our internal communities, and the R&D department. It celebrates the vibrant and diverse women’s community within Criteo and serves as a platform to showcase the incredible talent, skills, and contributions of women in the tech industry.
This event is driven by the belief that diversity and inclusion are paramount in fostering innovation and driving progress in the tech sector. By providing a space for women to share their expertise and experiences, Women Do Tech Too aims to empower and inspire others while highlighting the importance of diversity in shaping our technological landscape.
With a commitment to the Women in Tech community, Women Do Tech Too offers a stage to women speakers and welcomes all individuals who share a passion for championing diversity and equity in the tech community to attend and support these presentations. It offers an opportunity for open dialogue and collaboration, fostering a culture of respect, support, and collective growth.
The “Women do tech too” meetup is a valuable initiative that aims to create a safe space for everyone and provide opportunities for women to contribute to the tech industry. It is a platform for women to celebrate their achievements, foster professional connections, and inspire new generations of women in tech.
We have been actively organizing local tech meetups as part of this initiative. Specifically, we have successfully hosted four meetups: two in Paris, one in Barcelona, and one in Grenoble. For more insights, feel free to explore a couple of previous articles covering some of these meetups 👇
This year, we recognized the need to elevate the initiative and enhance its visibility. We understood that establishing a strong visual identity and creating a more impactful event aligned with the foundational pillars of our initiative was essential.
Thus, the Women Do Tech Conference was conceived as a half-day event featuring over ten speakers from Criteo and various other companies, all hosted in our newly designed space at our Paris office. In summary, the talks addressed a range of topics, including self-care, career advancement, future collaboration with Ada Tech School, navigating a world with and without third-party cookies, privacy in the realm of Generative AI, crafting a digital visual identity, the journey of a product manager in creating user-centric products, lessons learned from integrating a new theme into Criteo’s Design System, and the importance of taking breaks to maintain personal balance. WOW! 🤯
Now, it’s your turn to enjoy the replays and a selection of the best photos. Grab your favorite beverage and join this community of women in tech to celebrate their achievements, forge professional connections, and inspire future generations.
From Idea to Action: WomenInTech, a safe place to find inspiration, knowledge, and role models by Alejandra Paredes, Software Development Engineer at Criteo & Estelle Thou, Software Development Engineer at Criteo.
What do we need to thrive? This keynote will explore how WomenInTech fosters a community where women at all levels, from juniors to seniors, are encouraged to take initiative and speak up. As a safe place, WomenInTech is a community where we can take the spotlight and be listened to. Let’s explore together all the possibilities it opens up.
Ensure a future of collaboration and diversity in the Tech Industry by Clara Philippot, Ada Tech School Paris Campus director.
How are we training the new generation of developers to learn and iterate from collaboration, agile methodology, and empathy?
The path of staff engineer by Paola Ducolin, Staff Software Engineer at Datadog.
Earlier this year, I was promoted to Staff Engineer at my current company, Datadog. It was a three-year-long path. In this lightning talk, I will share the journey with its ups and downs.
Story of a failure by Agnès Masson-Sibut, Engineering Program Manager at Criteo.
Working as an EPM, one of our roles is to try to avoid failure. But sometimes, for many reasons, failure is there. Bill Gates said, “It’s fine to celebrate success, but it is more important to heed the lessons of failure.” This presentation will bring us through the story of a failure and, more importantly, through the learnings out of it.
Cookies 101 by Julie Chevrier, Software Development Engineer at Criteo.
Have you ever wondered what happens after you click on a cookie consent banner, and how your choice affects the ads you see? Join me to understand what exactly a cookie is and how it is used for advertising!
How to make recommendations in a world without 3rd party cookies by Lucie Mader, Senior Machine Learning Engineer at Criteo.
Depending on the browser you’re using and the website you’re visiting, the products in the ads you see might seem strange. We’ll discuss this issue and its possible relationship to third-party cookies in this talk.
Privacy in the age of Generative AI by Jaspreet Sandhu, Senior Machine Learning Engineer at Criteo.
With the advent and widespread integration of Generative AI across applications, industrial or personal, how do we prevent misuse and ensure data privacy, security, and ethical use? This talk delves into the challenges and strategies for safeguarding sensitive information and maintaining user trust in the evolving landscape of AI-driven technologies.
How to translate women’s empowerment into a brand visual identity by Camille Lannel-Lamotte, UI Designer at Criteo.
Uncover how color theory, symbolism, and language come together to shape the new brand image and get an insider’s view of the key elements that define it.
From Vision to Experience: The Product Manager’s Journey in Shaping User-Centric Products by Salma Mhenni, Senior Product Manager at Criteo.
Evolution of product managers’ roles in creating user-centric products, transitioning from initial vision to crafting meaningful user experiences.
Crafting Consistency: Integrating a new theme in Criteo’s React Design System by Claire Dochez, Software Development Engineer at Criteo.
Last year, our team integrated a new theme into Criteo’s design system. This talk will cover the journey, emphasizing the key steps, challenges faced, and lessons learned along the way.
Have a break and find YOUR own balance with the Wheel of Life! by Sandrine Planchon, Human-Minds — Coach in mental health prevention & Creator of disconnecting experiences.
When everything keeps getting faster, to the point of sometimes throwing you off balance, what about slowing down for a moment and reflecting on YOUR own need for balance in your life? The Wheel of Life can show a way to access it!
Empowering Voices: The Women Do Tech Too Conference was originally published in Criteo Tech Blog on Medium.
In the pursuit of delivering tailored user experiences, the strategic arrangement of user interface elements plays a pivotal role. Among the array of challenges in this endeavour, ranking problems, which aim to determine the optimal order of those elements, hold particular significance.
While supervised learning methods offer solid solutions, online learning by trial-and-error through multi-armed bandits presents distinct advantages such as not requiring a previously collected dataset and being able to adapt to the problem on the fly. By treating each ordering as an arm, it is straightforward to leverage multi-armed bandits to handle such scenarios.
That being said, addressing ranking problems with multi-armed bandits encounters a critical setback: the exponential increase in distinct orderings, i.e. arms, as the number of items and positions grows. For example, arranging just 8 items into 8 positions results in 40,320 unique orderings. Expanding this to 10 items and 10 positions escalates the complexity to a daunting 3,628,800, highlighting the rapid expansion of problem size.
Why does this pose a challenge? While an obvious one is the difficulty of learning the best arm among numerous options, the actual bottleneck for linear bandits emerges at the arm selection step. In this blog post, you will learn how we tackle this at Expedia Group™ to scale our ranking bandits efficiently.
Bandits with linear payoffs, also known as linear bandits, are a class of multi-armed bandit problems where the reward obtained from pulling each arm is assumed to be a linear function of some underlying features associated with that arm. In addition to allowing context information to be incorporated as an extension of arm features to make contextual decisions, this allows some generalisation to be done between arms. Thus, learning via feedback gathered for some arms also contributes to the learning of others. This property of linear bandits makes it possible to learn arm values even with a very large number of distinct arms, so long as there is a sufficient amount of shared features. That being said, this does not directly resolve the challenge that arises at the arm selection step.
Every time the bandit algorithm selects an arm to pull, i.e., an action to execute, it typically does so by selecting the top-scoring arm according to its model. In linear bandits, this involves computing the dot product between the (context-)arm encoded vector and the linear model’s weights for each arm and then determining the one with the maximum value. For more details on how these algorithms work particularly in our applications, the reader is referred to our other related posts: Multi-Variate Web Optimisation Using Linear Contextual Bandits, Recursive Least Squares for Linear Contextual Bandits, and How to Optimise Rankings with Cascade Bandits.
The arm selection can be seen in the step that involves the arg max operation for the widely used Thompson Sampling for Contextual Bandits with Linear Payoffs [1] algorithm shown below, which is an exhaustive loop over every single arm value produced by dot product:
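A minimal sketch of that selection step, in the spirit of [1] rather than the paper's exact pseudocode, looks like this (variable names are ours):

import numpy as np

def select_arm_exhaustive(arm_features: np.ndarray, mu_hat: np.ndarray, cov: np.ndarray) -> int:
    """Thompson Sampling style selection: sample weights from the posterior,
    score every arm with a dot product, and return the index of the best one.

    arm_features has shape (n_arms, n_features): one encoded vector per arm,
    so the exhaustive loop is implicit in the matrix-vector product."""
    theta_tilde = np.random.multivariate_normal(mu_hat, cov)  # posterior sample of the weights
    scores = arm_features @ theta_tilde                       # one dot product per arm
    return int(np.argmax(scores))                             # exhaustive arg max over all arms

With 10 items and positions, arm_features already has 3,628,800 rows, which is exactly why this step dominates the latency budget.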
This step can require an excessive amount of computational budget and might be prone to high latency in responses when the number of arms is large, thus rendering it unsuitable for applications that require real-time decisions to be made.
An approach that tackles this challenge is Greedy Search [2]. By traversing the arm space to find an approximate good solution for the top-scoring arm using the Hill Climbing technique, this algorithm sidesteps exhaustive processing. However, the effectiveness of this search method in finding the top-scoring arm depends on the problem size. Linked to this, Greedy Search suffers from diminishing performance as the number of features increases, leading to longer computation times and expanded search space. Consequently, this results in suboptimal scaling in approximation quality and operational efficiency as the number of arms grows.
To address the aforementioned challenges and determine the exact top-scoring arms more efficiently, we employ a technique that we call Assignment Solver as an alternative to Greedy Search. This method leverages two key properties of the ranking with linear bandits problem:
Here is what we can infer from these together: arm scores are computed as a sum of contributions from different item-position assignments; therefore, finding the valid assignment that maximises this sum, instead of naively computing individual arm scores one by one, also determines the top-scoring arm.
To illustrate this overall idea and process, let’s consider a simple problem with 2 different contexts (Context #1 and Context #2) and 2 items (Item #1 and Item #2) to be ordered into 2 positions (Position #1 and Position #2) where every variable is a categorical one. Let’s also say that we incorporate an encoding scheme in this setting to enable addressing it as a linear bandit as follows:
where B denotes bias, C denotes the context terms, A denotes the position-item terms and I denotes interaction terms between context and position-item. Depending on the context-arm combination at decision time, this encoding scheme produces a binary vector (only consisting of 0s and 1s) by following the rules below:
For a complete example, Context #2 and item ordering of [Item#2, Item#1] (arm) pair would have the terms B, C₂, A₁₂, A₂₁, I₂₁₂, I₂₂₁ of the encoding scheme activated producing the one-hot vector 101011000000110. The dot product of this vector with the model weights would then give us the arm score for this particular context arm. In other words, this encoding tells us which coefficients of the model weights we need to take into account. Continuing from this example, we can write down the score calculations of different orderings (arms) for Context #2 as below:
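A small sketch makes this concrete; the exact index layout of the terms inside the vector is our assumption, but it reproduces the example above:

import numpy as np

# Term layout assumed for this toy example: bias, context terms, position-item
# terms A[position][item], then interaction terms I[context][position][item].
TERMS = (
    ["B", "C1", "C2"]
    + [f"A{p}{i}" for p in (1, 2) for i in (1, 2)]
    + [f"I{c}{p}{i}" for c in (1, 2) for p in (1, 2) for i in (1, 2)]
)

def encode(context: int, ordering: list[int]) -> np.ndarray:
    """One-hot encode a (context, ordering) pair; ordering[p-1] is the item at position p."""
    active = {"B", f"C{context}"}
    for position, item in enumerate(ordering, start=1):
        active.add(f"A{position}{item}")            # position-item assignment term
        active.add(f"I{context}{position}{item}")   # context interaction term
    return np.array([1 if term in active else 0 for term in TERMS])

w = np.random.default_rng(0).normal(size=len(TERMS))  # stand-in model weights
for ordering in ([1, 2], [2, 1]):
    x = encode(context=2, ordering=ordering)
    # The arm score is the dot product x @ w: the sum of the coefficients for
    # B, C2 and the A / I terms activated by this particular ordering.
    print(ordering, "".join(map(str, x)), x @ w)
# The second line prints 101011000000110 -- the one-hot vector from the example above.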
It might be obvious now that such a mapping allows us to dissect the terms that contribute to an ordering’s (arm’s) total score for each position-item assignment. Our goal is to find the top-scoring ordering for any given context. The terms B, C₂ (or C₁ when it’s Context #1) are present in every arm and do not affect the ordering of the scores; therefore, they can be ignored:
Here, we can see the relevant terms and write the contribution of position-item assignments independently:
Now, to make it even clearer, let’s put these in the matrix form which we call the score contribution matrix:
Let’s remember the ranking problem: an item can only be present at a single position at one time, and a position can only be occupied by one item. With this in consideration, we can think of finding the best item ordering task as picking a cell from each column and row from this score contribution matrix to maximise the sum of cell values we picked. In this particular example, there are 2 different ways of doing such an assignment (which is equal to the number of different orderings we can generate):
as also illustrated below:
Restructuring our problem this way helps us understand the assignment nature of the problem. However, finding the best assignment, i.e. the best ordering, by evaluating these scores exhaustively won’t provide us with any benefits over the default approach described earlier. In this simple example, there are only 2 possible assignments. In larger problem instances this number becomes very large, such as 8 items-positions having 40,320 assignments. This is where we take advantage of this reformulation: Our problem in this form is an instance of the fundamental assignment problem in combinatorial optimisation literature!
The assignment problem involves finding the best assignment of a set of tasks (items) to a set of resources (positions) in a way that optimises a certain objective function. In its most common form, it deals with assigning tasks to resources, where each task must be assigned to exactly one resource, and each resource can only be assigned one task. The objective is typically to minimise or maximise some measure of cost, such as minimising the total cost or maximising the total profit, which in our case is maximising the total assignment score.
The assignment problem has been extensively studied and various algorithms have been developed to solve it efficiently. One of the most well-known algorithms for solving the assignment problem is the Hungarian algorithm, also known as the Kuhn-Munkres algorithm [3]. This algorithm finds the optimal solution to the assignment problem in polynomial time, making it highly efficient for practical applications. Therefore, it’s a perfect method for us to utilise in this problem for real-world instances where finding solutions with a very small latency is of utmost importance.
In short, by applying the Hungarian algorithm to the score contribution matrix we build at every arm selection step, we can find the best item ordering very efficiently. This blog post won’t be covering the details of the Hungarian algorithm itself.
Now that we have the core idea explained, here is how the algorithm can work practically:
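The sketch below illustrates the idea using SciPy's Hungarian-style solver; it is our own illustration with hypothetical numbers rather than Expedia's production code. Build the score contribution matrix for the current context from the sampled weights, then solve the assignment problem to recover the top-scoring ordering.

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_ordering(score_contribution: np.ndarray) -> tuple[np.ndarray, float]:
    """score_contribution[p, i] is the summed weight contribution (A and I terms
    for the current context) of placing item i at position p. Returns the item
    chosen for each position and the total assignment score."""
    positions, items = linear_sum_assignment(score_contribution, maximize=True)
    return items, float(score_contribution[positions, items].sum())

# Hypothetical 3x3 score contribution matrix, for illustration only.
S = np.array([
    [0.40, 0.10, 0.25],
    [0.05, 0.30, 0.20],
    [0.15, 0.25, 0.10],
])
ordering, total = best_ordering(S)
print(ordering, total)  # [0 2 1] 0.85 -- the best of the 3! = 6 possible orderings

Because the solver runs in polynomial time in the number of positions, this scales to problem sizes where enumerating every ordering is hopeless.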
So, how does the approach we just presented perform against the baselines, the exhaustive method and Greedy Search? Its strength over the previous approaches is twofold.
First, it is much faster to determine the top-scoring arm and it scales much better as the number of arms grows. The figure below demonstrates this comparison:
As can be seen in Figure 7, Assignment Solver is far ahead when it comes to speed. When there are 6 items and positions, it’s already ~95 times faster than the Exhaustive method and ~12 times faster than Greedy Search. At 10 items and positions, it becomes ~40 times faster than Greedy Search thanks to its polynomial time complexity resulting in better scaling. Ultimately, it enables addressing enormously large problems which would otherwise be infeasible.
Second, Assignment Solver’s top-scoring arm results are exact regardless of the size of the problem, just like what the exhaustive method would produce, whereas Greedy Search has no exactness guarantees with an approximation quality tied to the problem size.
This blog post described the approach we use at Expedia Group to efficiently determine the exact top-scoring arms in high cardinality ranking bandit problems when there is a suitable encoding scheme incorporated. The substantial gains we demonstrate highlight the importance of adopting correct solutions tailored to the problem. These solutions enable prominent techniques from literature to be compatible with real-world use cases.
[1] Agrawal, S., & Goyal, N. (2013). Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, 28, 127–135.
[2] Parfenov, F. & Mitsoulis-Ntompos, P. (2021). Contextual Bandits for Webpage Module Order Optimization. In Marble-KDD 21’, Singapore, August 16, 2021.
[3] Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.
Identifying Top-Scoring Arms in Ranking Bandits With Linear Payoffs in Real-Time was originally published in Expedia Group Technology on Medium.
Managing z-index is a classic challenge for front-end developers. As a project grows, display conflicts and bugs related to z-index can quickly turn into serious headaches.
I recently took some time to clean up the z-indexes in one of our projects.
In this article, I’ll share the approach I used for this cleanup (while also laying the foundation for keeping things manageable in the long run).
The first step was to perform a thorough audit of all the z-index declarations in the project. This gave me an overview of how many z-indexes we had and what values were being used.
Here’s what I found:
- August 2023: 57 declarations with 8 distinct values.
- February 2024: 87 declarations with 19 distinct values.
This evolution showed me that the use of z-indexes had increased significantly. And there was no reason to believe it would stop.
With that in mind, my goal became to centralize and standardize these values.
I centralized all the z-index values in a single file using CSS custom properties.
This allowed me to have an overview of the different layers of elements on the page, helping me get a (slightly) clearer picture of the task ahead.
If you’d like to get an idea (or maybe scare yourself a bit), here’s what that first file looked like:
:root {
--z-index-second-basement: -2;
--z-index-basement: -1;
--z-index-ground: 0;
--z-index-floor: 1;
--z-index-second-floor: 2;
--z-index-third-floor: 3;
--z-index-job-ads-secondary-filters--selects: 10;
--z-index-job-ads-results-sort: 10;
--z-index-job-ads-primary-filters: 20;
--z-index-fo-header: 30;
--z-index-above-fo-header: 31;
--z-index-fo-header--dropdowns: 100;
--z-index-career-center--login-modal--autocomplete-list: 99;
--z-index-bo-drawer-backdrop: 999;
--z-index-bo-drawer: 1000;
--z-index-feature-env-switcher: 9999;
--z-index-notifications-panel: 10000;
--z-index-msw-tools-panel--open-button: 99999;
--z-index-msw-tools-panel: 999999;
}
While trying to make sense of each of these z-index values, I had to revisit an important CSS concept: stacking contexts.
Z-indexes are just a tool tied to this concept.
In my experience, z-indexes can be tricky to master because they interact with stacking contexts in TWO ways:
- a z-index can modify the stacking order (which is fine, that’s its purpose)
- a z-index can create a new stacking context (and this is where things can go wrong)
The real challenge isn’t just about the z-index values themselves, but also the unintentional creation of stacking contexts. Many z-indexes end up creating stacking contexts that weren’t needed, which complicates the layering of elements.
For the next phase of my work, I paid special attention to these unintentional stacking context creations.
With a better understanding of stacking contexts, I was able to identify unnecessary or redundant z-index values.
Quite often, the natural order of elements in the DOM is enough to ensure correct layering without needing a z-index.
So, I removed z-indexes where elements were already stacking correctly based on their HTML order (and in some cases, I even reordered the HTML itself to avoid needing a z-index 😉).
I also reduced the value of some z-indexes when they were higher than necessary.
This step is crucial because I believe that the lower and more limited z-index values are, the more we can control them over time.
This reduction also helped me achieve one of my goals: ensuring all z-index values were increments of 1 (I mean: -1, 0, 1, 2, 3…).
Why increments of 1? Because you need to control your z-indexes.
No more randomly picking a value (“There was space between 20 and 30, so I went with 27!”).
Finally, I grouped CSS custom properties that had identical values, which allowed me to identify some useful abstractions.
We defined a visual hierarchy for elements that frequently appear on a page: menus, modals, dev tools, and notifications.
The order we chose is as follows:
page content < website menus < modals < dev tools < notifications.
From the previous work, the highest z-index value that belonged to page content (i.e. not including elements that appear on top) was 3.
This allowed me to make the priority logic dynamic:
:root {
…
--z-index-website-menu: 4;
--z-index-modal: calc(var(--z-index-website-menu) + 1);
--z-index-dev-tools: calc(var(--z-index-modal) + 1);
--z-index-notifications-panel: calc(var(--z-index-dev-tools) + 1);
}
Now, if the value of --z-index-website-menu ever needs to change (to decrease, of course!), all the following values will update automatically, without developers having to think about it.
At the end of step 4, I had grouped together the CSS custom properties that had identical values. Some of these values turned out to be the recurring elements we discussed earlier.
Others didn’t correspond to a specific type of element but were often used because they’re useful for fine-tuning the stacking order in specific contexts.
I took these values to define utility custom properties:
:root {
--z-down-in-the-current-stacking-context: -1;
--z-reset-in-the-current-stacking-context: 0;
--z-up-in-the-current-stacking-context: 1;
…
}
These values are used to manage finer adjustments in any current stacking context.
To ensure this approach remains sustainable, I added a rule to our CSS linter, Stylelint, which prevents the use of hardcoded z-index values. This rule encourages developers to either use the centralized variables or define new ones when needed.
// .stylelintrc
"rules": {
// ...
"scale-unlimited/declaration-strict-value": "z-index"
// ...
}
However, be careful: it should be understood as a guideline and not a rule. It’s fine to silence the linter and use a hardcoded value. You just have to do it consciously.
While all these steps seem to flow nicely in theory, that’s not always the case in real life.
In practice, I went back and forth between the different steps multiple times before reaching a satisfactory result. 😉
By centralizing values, eliminating unnecessary stacking contexts, and reducing the number of different z-index values, I was able to simplify and make z-index management more maintainable. This process lightened the codebase and made the handling of element layering more predictable.
If this approach seems helpful to you, I encourage you to take inspiration from it and apply it to your own projects. A well-structured z-index management strategy can make a real difference, making maintenance easier and avoiding subtle bugs related to element stacking.
Spring Cleaning for z-indexes was originally published in JobTeaser Engineering on Medium.
Three years ago we launched Thumbtack’s dedicated machine learning (ML) infrastructure team. Starting from a single engineer, we eventually grew this into a small team with a big impact. Today our client teams can explore generative AI or traditional ML with our tools, implement their approach through our model inference solution and our feature store, and track correctness with polished model monitoring. We experienced ups and downs getting to this point, giving us many learnings, and it’s been enough time to reflect.
Our ML Infra team has a wide scope, as the only dedicated AI/ML infrastructure team at Thumbtack. We build services used in the online serving of ML models in Thumbtack’s product experience and manage tools for the building of ML models. Model inference, feature management, Jupyter notebook environments, model monitoring, and generative AI capabilities are the main areas we work on. On top of that, ML Infra is often a unifying constant between different product teams. ML Infra acts as “connective tissue”, sharing ML building and system design knowledge across the company.
Yet, the team is small — you’d be surprised at how small it is for that scope. But we have learned how to operate as a flexible and high leverage team.
After several years without any dedicated ML infrastructure team, our need for one had grown. Product teams were moving forward quickly with more and more ML use, yet they lacked most of the shared ML tooling and infrastructure that larger tech companies might typically have. Teams built the ML infrastructure that worked for them, most of which did not generalize across teams. Most pressingly, we had diverged production inference — the generation of predictions from our models in real time — into two separate architectures.
For a couple of years we knew this was an unsustainable setup, but it was only in 2021 that we built momentum around a serious effort to consolidate our ML infrastructure direction. We formed working groups around four topic areas: feature engineering, model experimentation, model inferencing, and model monitoring. Each group had an assortment of engineers and applied scientists from across the company, working together (on top of their regular responsibilities) to write a document outlining their area’s status and possibilities. With these documents in hand we had a strong case for more organized development. We considered perpetuating those working groups as a “virtual team” that could build out some of this infrastructure, but ultimately we decided to create a smaller permanent team.
In 2022, we started the team with one engineer and had buy-in to grow the team. That year was both fast and slow. We were creative in finding additional engineering help. We created a “20% ML Infra” program and shared it with our broader engineering team, inviting anyone interested in the program to speak to their managers about volunteering to spend 20% of their time on ML Infra. There was substantial interest. Separately, we had also created a “Voyager” program at Thumbtack, where engineers could move to other teams for the duration of single brief projects. Between those two programs we had about five temporary collaborators who helped us in lean times. We even had part-time project management support from a colleague who was particularly interested in ML and had been involved in the creation of the team. This really helped us bridge the gap while we started on hiring, which we spent a lot of time on throughout the year.
While we felt we were on the path to success, we still had some learnings that year. We ended the year without any models in production. We had a very small team whose time was split across other responsibilities such as interviewing, communicating with stakeholders and potential client teams, answering questions about inherited legacy infrastructure, contributing to our 3 year strategy, and guiding our “20% ML Infra” and “Voyager” program contributors. While our temporary contributors were helpful, they needed ramp-up time and had to serve their home teams first, so their contributions were sporadic and often disrupted.
Where we did make progress was designing the fundamental structure of inference between our service and its interfaces with clients — those structural decisions have proven to be a good fit. We embraced a Minimum Viable Product (MVP) mindset to reach iterations that were good enough to try out. While that helped us learn quickly and make pivots, it left holes in our functionality. There was still work to be done.
We framed our work areas as overlapping maturity curves, where our initial area (production inference) would be in a “Building” stage while our next area (notebooks) was in a “Trialing” stage. Then, as inference reached “Maturity”, notebooks would reach “Scaling”, and model monitoring would reach the “Building” stage. We had about five such areas in mind for our scope, and intended to typically shift them by one stage in each half.
Our new team members ramped up and we were able to have a larger impact as a team, working across multiple areas. That increased capacity meant we were able to go deeper on our most confident initiatives, while laying the groundwork on future projects. In a typical half we would maintain some established software, build out something else, and scope out options for further out initiatives. It never exactly followed the timeline we laid out, and that’s normal for ambitious plans. Overall, the general trend of tools progressing and maintaining different maturity levels has held.
It took time. Our client teams varied in what they wanted, which made it hard to pick universal choices for inference, feature management, or notebook environments. Nor were those team preferences and needs stationary over time; what might have helped a team in one half might not have been the same as what they needed after another half of their own development. For example, while we explored feature stores, our client teams went deeper on improving their own data infrastructure. For some initiatives we embedded deeply into client teams to apply our tools to their uses. Over the year, we slowly earned adoption, but it was incomplete and Thumbtack remained fractured for each area covered by our core ML tools.
With inference, features, notebooks, and monitoring, it was hard to earn adoption when existing teams already had patterns with which they were comfortable. The start of 2024 brought a new area where we didn’t have prior infrastructure: generative AI. We leaned heavily into enabling generative AI capabilities for teams to experiment with, and this has already resulted in enabling adoption across many use cases. Unlike earlier capability building initiatives where we might have been too soon or too late for client teams, with generative AI we have been able to time the capability building with the needs of different product teams. We were just fast enough that our client teams were able to use generative AI without being blocked by us, yet we were just-in-time enough that we have been able to adapt to the rapidly changing external environment. Rapid advances in generative AI affected our optimal choices around which models to use, whether to host internally or use external vendors, and whether to build our own ML models or use prompt engineering approaches with pre-built models. We learned a lot and had a lot of fun along the way. We were able to evangelize generative AI at the company and power a positive flywheel of more and more adoption and functionality.
We also finally realized the full adoption of our inference solution for new models. Furthermore, it meshed cleanly with our early generative AI needs. Now with high adoption of inference and generative AI, and partial adoption of our other solutions, we are on track at the 2 year mark of our 3 year strategy. We have a long future of exciting and impactful work ahead of us, and now our team has moved past the growing pains associated with creating a centralized team that enables foundational capabilities. Our engineers are all still on the team, and have each built a substantial practical expertise across a spectrum of ML infrastructure topics. Our many client teams and collaborators have made our team well known at the company, where people actively solicit our guidance in the early stages of their ML system designs.
If you try to build the average of what different teams need, what you build might not work for any of them.
Nor can you build something that does everything for everyone. Aside from limiting trade-offs and extra complexity, building an idealized solution takes time — time where your clients might move on or build something good enough for themselves.
It is so hard to forecast what solutions will truly be needed by teams and have high adoption. We found a balance between designing general solutions and embedding in potential client teams. Pay attention to your users, and keep your focus on them until your product fits for them.
The external technology landscape changes so quickly. When we first approached building a feature store, we found that feature store providers and open source projects were optimized for different settings than ours, and that we would have to build so much around them that we might as well build our own from scratch. Instead of picking between a false dichotomy of building something new or greatly complicating our stack to incorporate an external solution, we decided to step back. We took a pause on feature store work, which also gave us more time to understand what our client teams truly needed. A year later, we built a much more bare-bones feature store solution that we expect will serve immediate needs for a while longer, giving us more time — and external solutions more time to increase their own product maturity — before committing to a fully general solution.
You need to be doubly fortunate, building a product that solves client needs and delivering it at a time when it makes a difference for them. Prioritize more projects where you have a clear client, that client has a pressing need, and they make direct commitments to test your new features.
Sometimes we do have to invest in long timelines for technology, but often enough we can have reliable impact by doubling down on initiatives that already have traction.
Throughout the history of the team, we have invested significant thought and effort in maintaining trust.
To build and maintain that trust we use a number of tactics. We communicate proactively with a wide set of informal stakeholders. We have a transparent prioritization process where we invite anyone to submit ideas and options, and where we proactively identify people and teams to talk to for their perspective. Our team communicates in open channels. We shadow the company procedures that other teams do, such as formally grading our OKRs and our overall performance despite lacking a specific stakeholder who will use that information. We set ambitious goals and hold ourselves accountable to them.
More than anything, the way we maintain trust is by working on projects with the highest opportunity for impact, with ruthless prioritization where we aren’t shy about making tough decisions.
Speaking of ways to maintain trust, it helps to regularly demonstrate that you think about the future. Writing a 3 year vision was helpful for us to think through where we could have impact, but it was probably more valuable as an artifact we could share broadly. It shows that we have a good grasp on what we want to accomplish.
It’s held up pretty well. That’s a credit to the vision being practical and well-informed by experience. Another Thumbtack leader has a saying that he’s never seen the second year of a two year plan. Well, we’re entering the third year of our three year plan and the document hasn’t yet become irrelevant. The largest revisions we made were as generative AI developed in different, and most notably faster, ways than we anticipated.
Reviewing and updating the vision is a helpful exercise to reevaluate our big picture direction.
We had help from temporary collaborators early on, but their projects were often subject to pauses. Occasionally those projects paused indefinitely and were never picked up again, as our collaborators’ obligations to their home teams only increased over time as their tenure and responsibility grew. This created a lot of unreliability in our roadmap, for their projects and for the work of our permanent members. And while those temporary members built up very valuable expertise that they could bring back to their home teams, we weren’t scaling knowledge inside ML Infra itself to quite the extent we wanted.
Once we hired more full time engineers, our team really picked up momentum. Our engineers put their core focus on ML Infra, and together they accumulate knowledge that makes us ever wiser and more capable. As permanent members of the team, they can also take on relationship management with our client teams or with external companies.
We would be more than happy to take on more temporary help going forward, but in proportion with the full time capacity of the team.
With a team that started small and is still small, we couldn’t afford to lose too much time to less promising initiatives. Nor could we support everything. We learned to say “no” a lot, or to say “not right now”. In hindsight we should have slowed down our scope growth from the beginning.
The “thinness” of ML Infra as a layer between platform and product teams is a choice parameter, and can vary depending on the situation of client and partner teams. Sometimes that means that a product team can go deeper on building their own ML infrastructure. We also need to make the most of the core infrastructure provided by Thumbtack’s excellent Platform team, especially in ways where we could automatically benefit from future improvements created by the larger team. When we ultimately built our feature store solution, we built it directly on top of their new generation data management infrastructure, rather than forking off from a lower level abstraction.
We follow most of the standard operating rhythms of the company, but we tend to be more agile. We maximize the time we can spend building. We try to avoid surprising our stakeholders, so we do deliver on what we commit to, but we commit selectively. We maintain our flexibility by not overly packing our roadmap, while also judiciously limiting our tech debt accumulation.
We have a lot on our roadmap, but it’s always subject to change, and we have some modesty about its inherent unpredictability. Revision is part of the plan.
Looking back on the last three years since forming the team, we accomplished a lot. We unified inference and built CI/CD around it, started an exponential growth curve of generative AI adoption, built a feature store, picked and implemented a Jupyter notebook solution, and made a model monitoring solution available for early adopters. Along the way we helped with launching many product experiments, and did lots of ad hoc ML systems consulting.
In the grand scheme of things, we’re still in early stages. Our capabilities vary in their polish and adoption. We’re still a small team. We have much work ahead of us, and that’s without knowing what novel ML-driven product experiences Thumbtack will need next. The future will bring larger and grander opportunities, and we’re excited for it.
First and foremost, to members of the ML Infra team for their superb work over these years. We also wouldn’t be here without all of our talented partners throughout Thumbtack, making this an exciting place to work with lots of opportunity. Thanks to Navneet Rao, Nadia Stuart, Laura Arrubla Toro, Cassandra Abernathy, Vijay Raghavan, and Oleksandr Pryimak for reviewing this blog and providing many suggestions.
What we learned building an ML infrastructure team at Thumbtack was originally published in Thumbtack Engineering on Medium.
As AI continues to influence decisions that impact humans, building trustworthy AI isn’t only about creating effective models — it’s about ensuring AI systems are ethical, reliable, and resilient over time. Although traditional software development principles provide a strong foundation for managing machine learning projects, AI’s unique challenges demand even more robust practices.
MLOps, the operational backbone of AI development, adapts proven DevOps principles like version control, CI/CD, and testing to meet the specific needs of AI. However, trustworthy AI requires additional considerations — such as data drift, bias mitigation, and explainability — that go beyond traditional software development.
This article explores seven core principles of MLOps, each essential to achieving trustworthy AI.
From managing data pipelines to ensuring accountability, these components work together to support AI systems that are both high-performing and aligned with today’s ethical and regulatory standards.
Effective data capture, transmission, and sanitization are integral to automating trustworthy AI workflows through MLOps. Data pipelines automate the ingestion, transformation, and validation of data to ensure high-quality, consistent inputs for machine learning models. Here’s how they work:
Version control for models in MLOps is often implemented using tools like DVC (Data Version Control) or MLflow. These tools track model artifacts and their corresponding training datasets, hyperparameters, and code, ensuring transparency in the model lifecycle.
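As a hedged illustration (not tied to any specific project setup), a tracking call with MLflow might look like the sketch below, tying a model artifact to its hyperparameters, metrics, and a data version tag; the run name and tag values are illustrative.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="churn-model-v3"):            # run name is illustrative
    mlflow.set_tag("training_data_version", "v1.2")           # link the run to a dataset version
    model = LogisticRegression(C=0.5, max_iter=1_000).fit(X_train, y_train)
    mlflow.log_param("C", 0.5)                                # hyperparameters alongside the model
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")                  # versioned model artifact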
In traditional DevOps, CI/CD pipelines enable fast, automated software releases. MLOps extends this concept to ML models, creating pipelines for model retraining, testing, and deployment.
Monitoring in MLOps is important for identifying issues like data drift, concept drift, and model degradation in real time.
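As one minimal, hedged example of what such a check can look like, a two-sample Kolmogorov-Smirnov test can flag when the live distribution of a feature has drifted away from the training distribution; production monitoring stacks are typically much richer than this.

import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the live feature values look drawn from a different
    distribution than the reference (training-time) values."""
    _, p_value = ks_2samp(reference, live)
    return p_value < alpha

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # distribution seen at training time
live = rng.normal(loc=0.4, scale=1.0, size=5_000)        # shifted live traffic
print(feature_drifted(reference, live))                  # True -> trigger an alert or retraining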
MLOps addresses AI security through automated pipelines that integrate DevSecOps practices into AI workflows:
Transparency in MLOps means ensuring that every action taken during model development and deployment is tracked and reproducible.
Accountability in MLOps is achieved through comprehensive logging, auditing, and human oversight at every step of the AI lifecycle:
You've seen both the strategic benefits of implementing trustworthy AI through MLOps and the technical components. Whether you're leading the charge from an executive level — or directly involved in AI implementation — the journey doesn't end with automation. It evolves with continuous improvement and human vigilance.
If you're ready to take the next step, let's talk about how MLOps can unlock the full potential of your AI solutions, ensuring they are not only high-performing, but also ethically responsible, and aligned with the demands of today’s regulatory landscape.
As machine learning becomes more integrated with business processes, trustworthy AI principles have moved from a "nice to have" to a necessity, often driven by regulatory requirements. Many organizations are heavily investing in AI governance to understand and apply these principles effectively. A compelling and perhaps surprising approach to implementing trustworthy AI principles is through MLOps.
MLOps integrates DevOps principles into the lifecycle of machine learning models. (In fact, our team discussed how this looked a few years back and gave some valuable insights into scaling ML and the beginnings of MLOps.) This includes everything from data collection and model training to deployment and continuous monitoring. With MLOps, organizations have a powerful toolset for building trustworthy AI systems at scale, allowing them to automate key processes while ensuring ethical standards are met.
However, no system is entirely hands-off. Human oversight remains a critical part of the equation in order to address and mitigate bias, but MLOps provides the backbone for trustworthy AI by embedding checks and balances directly into the workflow.
MLOps provides a solid framework for implementing and measuring key trustworthy AI principles. Here's a high-level list. (And if you are more technical, you can find more detail in The MLOps Architecture Behind Trustworthy AI Principles.)
Despite the clear advantages of MLOps generally, and the added benefits of aligning AI systems with trustworthy principles, adoption has been gradual. Perceived costs and the learning curve are major hurdles — particularly for smaller companies. MLOps often requires significant organizational change, including fostering cross-functional collaboration between data scientists and operations teams, a surprisingly rare feat. Simply implementing tools isn’t enough; companies must invest in culture, processes, and training to fully realize MLOps benefits. In many ways, the same issues come into play as in DevOps, and you can learn more about overcoming those challenges in our Strategic DevOps Playbook.
MLOps is a powerful tool for scaling AI and embedding trustworthy principles, but it’s important to approach its implementation with realistic expectations. Although many aspects of the AI lifecycle can and should be automated, some challenges will always require human intervention and domain expertise. This isn’t a limitation, but something to be valued.
Implementing MLOps is not an overnight process. It requires a blend of strategic vision, operational discipline, and technical expertise. It also requires a company culture that supports continuous improvement. As organizations look to scale trustworthy AI, MLOps provides a strong foundation, but agility and ongoing refinement will be key to addressing evolving challenges and regulatory requirements.
Feel free to reach out if you'd like to discuss how MLOps can be implemented in your organization to enhance your AI solutions.
This article is the first in a series showing how to use CrewAI and Criteo APIs. We will see how to obtain credentials, use those credentials to get an Access Token, and use the token to call endpoints that return Accounts, Retailers, and Brands, all from a CrewAI crew.
CrewAI is a “cutting-edge framework for orchestrating role-playing autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly and tackle complex tasks.”
Criteo Retail Media API unlocks various possibilities to enhance media campaign performance from any platform. It allows you to create, launch, and monitor online marketing campaigns and provides a comprehensive view of their performance.
CrewAI and Retail Media together unlock the power of AI and the power of Commerce Media.
Using CrewAI, you will become familiar with Criteo’s Retail Media APIs and how to use them as tools for large language models (LLMs), AI Agents, etc.
For more detailed articles and videos on large language models (LLMs), AI Agents, and CrewAI, see my favourite authors listed at the end of this article. (Sam Witteveen and Brandon Hancock)
We aim to use a crew of AI Agents and Tasks to retrieve Accounts, Retailers and Brands from the Retail Media APIs and perform rudimentary analytics. We will get a developer account at Criteo, create Tools to access the APIs, build an Agent that uses the tools, and specify Tasks that will be executed sequentially by the Crew.
All the code for this article is in Python and uses poetry as the package manager/environment manager.
You will need to install the following to run the code examples:
To use the Criteo APIs, you need a developer account created in the Developer Portal by clicking the ‘Get started’ button.
This will take you to the Criteo partners dashboard. Click on ‘create a new app’ (you can see my application already defined)
You will need consent to data provided by the APIs; follow the prompts to be authorised.
Once you have consent, click ‘create a new key’ to create credentials for your application.
A file containing the credentials is automatically downloaded to your local machine. You will use these credentials to obtain an access token for each API call.
# Here is an example:
---------------------------
| Criteo Developer Portal |
---------------------------
Please store your client secret carefully on your side.
You will need it to connect to the API and this is the only time we will be able to communicate it to you.
You can find more information on our API Documentation at https://developers.criteo.com.
application_id: <application id>
client_id: <client id>
client_secret: <client secret here>
allowed_grant_types: client_credentials
Tips: Keep your credentials secret. Don’t commit them to a public repository (GitHub, GitLab, etc).
Authentication with client credentials results in an AccessToken that is valid (at the time of writing) for about 15 minutes. Call the Criteo authentication API for a valid token using your client credentials.
The following code snippet is a function that retrieves an AccessToken using client credentials and caches it for 15 minutes.
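Here is a minimal sketch of such a function; it is a reconstruction that assumes the Criteo OAuth2 token endpoint at https://api.criteo.com/oauth2/token and the environment variables from the .env file shown later.

import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()  # pulls CRITEO_CLIENT_ID / CRITEO_CLIENT_SECRET from .env

_token_cache: dict = {}

def get_access_token() -> str:
    """Fetch an access token with the client-credentials flow and cache it until it expires."""
    if _token_cache and time.time() < _token_cache["valid_until"]:
        return _token_cache["access_token"]

    headers = {"Content-Type": "application/x-www-form-urlencoded"}  # required by the token endpoint
    data = {
        "grant_type": "client_credentials",
        "client_id": os.environ["CRITEO_CLIENT_ID"],
        "client_secret": os.environ["CRITEO_CLIENT_SECRET"],
    }
    response = requests.post("https://api.criteo.com/oauth2/token", headers=headers, data=data)
    response.raise_for_status()
    payload = response.json()  # access_token, token_type, expires_in (seconds)

    _token_cache.update(
        access_token=payload["access_token"],
        valid_until=time.time() + payload["expires_in"] - 60,  # refresh a minute early
    )
    return _token_cache["access_token"]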
The function first retrieves the client ID and secret from environment variables (.env).
It then defines the request headers, specifically the content type of application/x-www-form-urlencoded. This header value is quite important.
Next, it sets up the form data containing your credentials.
Finally, it executes a POST request to get an access token and returns a structure containing the token, the token type, and an expiration time in seconds.
Example auth result as JSON:
{
"access_token": "eyJhbGciOiJSUzII ... pG5LGeb4aiuB0EKAhszojHQ",
"token_type": "Bearer",
"refresh_token": None,
"expires_in": 900
}
The rest of the code caches the result until the token expires.
Clone the repository and change the directory to part_1. The code used in this article is in this directory. Already defined is a poetry project in the file: pyproject.toml
Run these commands in a terminal to install the dependencies, create/update the poetry environment and jump into the correct shell.
poetry install --no-root
poetry shell
VS Code:
If you use VS Code, check that it uses the correct virtual environment. To set the Python interpreter to your virtual environment, get the path with this command:
poetry env info --path
/Users/petermilne/Library/Caches/pypoetry/virtualenvs/part-1-qwAxeBFF-py3.12
and copy the path.
Then click on the ‘Python ….’ in the bottom right-hand corner of VS Code.
Choose: Enter interpreter path
and paste the path
You will need to create a .env file similar to this:
CRITEO_CLIENT_ID=<your client id>
CRITEO_CLIENT_SECRET=<your client secret>
RETAIL_MEDIA_API_URL=https://api.criteo.com/2024-07/retail-media/
# only if you use Azure
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=
OPENAI_API_VERSION=2024-02-15-preview
# only if you use Groq
GROQ_API_KEY=<your groq api key>
GROQ_AI_MODEL_NAME=llama-3.1-70b-versatile
Tip: Criteo APIs are versioned by date, e.g. 2024-07; be sure to use the current API version.
Groq Cloud is a fast and inexpensive LLM service using new technology that is “powered by the Groq LPU and available as public, private, and co-cloud instances, GroqCloud redefines real-time.” It is free for developers and a great way to start with LLMs.
Azure OpenAI is a private instance of an LLM service. What does “private” mean? This ensures OpenAI does not use the proprietary data you pass into the LLM to train its future models. i.e., the data from your API calls does not become part of the public domain!
Tips:
If your poetry environment is not running in the terminal, check you are in the correct directory/folder, then run:
poetry install --no-root
poetry shell
A tool in CrewAI is a skill or function that agents can utilize to perform various actions. This includes tools from the crewAI Toolkit and LangChain Tools, enabling everything from simple searches to complex interactions and effective teamwork among agents.
https://docs.crewai.com/core-concepts/Tools/
Our first task is to create CrewAI Tools to call the Retail Media REST APIs. We will create three simple tools to retrieve Accounts, Retailers, and Brands.
Each tool will use the equivalent REST API endpoint
(see: https://developers.criteo.com/retail-media/docs/account-endpoints)
Let’s discuss one of these tools: RetailersTool
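A minimal sketch of such a tool is shown below. This is a reconstruction rather than the article's exact code; the endpoint path, the pagination parameters, and the get_access_token helper from the earlier sketch are assumptions based on the account-endpoints documentation.

import os

import requests
from crewai_tools import BaseTool

from auth import get_access_token  # caching helper sketched earlier (module name is illustrative)


class RetailersTool(BaseTool):
    """CrewAI tool that lists the retailers available to a Retail Media account."""

    name: str = "Retail Media Retailers"
    description: str = (
        "Retrieves the retailers for an account from the Criteo Retail Media API. "
        "Parameters: accountId, pageIndex, pageSize."
    )

    def _run(self, accountId: str, pageIndex: int = 0, pageSize: int = 25) -> str:
        url = f"{os.environ['RETAIL_MEDIA_API_URL']}accounts/{accountId}/retailers"
        headers = {
            "Authorization": f"Bearer {get_access_token()}",  # bearer token from the auth helper
            "Accept": "application/json",
        }
        params = {"pageIndex": pageIndex, "pageSize": pageSize}
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        return response.text  # JSON body handed back to the agent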
Here, we have defined a class named RetailersTool that implements the tool, subclassing BaseTool from crewai_tools.
The _run method implements the call to the Retail Media API and is invoked by the agents using the tool. You can see the parameters accountId, pageIndex and pageSize passed to the REST call. The response is the response body, which is JSON.
An agent in CrewAI is an autonomous unit programmed to Perform tasks, Make decisions and Communicate with other agents. Think of an agent as a member of a team with specific skills and a particular job to do. Agents can have different roles, such as ‘Researcher’, ‘Writer’, or ‘Customer Support’, each contributing to the crew's overall goal.
https://docs.crewai.com/core-concepts/Agents/
You can think of an agent as the embodiment of a persona, or as a chunk of intelligent processing.
You can define agents entirely in code or in a yaml file with a little code in the crew. Using a yaml file encourages the separation of concerns and allows non-programmers to define agents’ properties.
Here, we define the agent account_manager properties in config/agents.yaml and the agent code in crew.py
Yaml snippet: agents.yaml
account_manager:
  role: >
    Account manager
  goal: >
    Provide lists of accounts, retailers and brands
  backstory: >
    You're an expert in managing accounts and retrieving information about accounts, retailers, and brands.
    You're known for your ability to provide accurate and up-to-date information to help your team make informed decisions.
    You use the Retail Media REST API efficiently by choosing the correct API and making the right number of requests.
    Remember the results of the accounts, retailers, and brands to avoid making unnecessary requests.
  verbose: True
  cache: True
The agent is designed with three key elements: role, goal, and backstory.
Role: This defines the agent’s job within the crew. In this case, the role is simply Account Manager
Goal: This specifies what the agent aims to achieve. The goal is aligned with the agent’s role and the overall objectives of the crew. Here, the goal is to provide a list of accounts, retailers and brands
Backstory: This provides depth to the agent’s persona, enriching its motivations and engagements within the crew. The backstory contextualises the agent’s role and goal, making interactions more meaningful. Here, the agent is an expert in managing accounts and has specific instructions on how to go about its responsibilities.
The LLM uses these properties as part of the prompt to configure its behaviour and core competencies.
Code snippet: crew.py
"""
Account manager agent instance created from the config file.
The function is decorated with the @agent decorator to indicate that it is an agent.
"""
@agent
def account_manager(self) -> Agent:
return Agent(
config=self.agents_config["account_manager"]
llm=llm, # if you use Azure OpenAI or Groq
)
The actual code loads the properties from the YAML file and sets the LLM (if you are using Groq or Azure).
In the crewAI framework, tasks are specific assignments completed by agents. They provide all necessary details for execution, such as a description, the agent responsible, required tools, and more, facilitating a wide range of action complexities.
Tasks within crewAI can be collaborative, requiring multiple agents to work together. This is managed through the task properties and orchestrated by the Crew’s process, enhancing teamwork and efficiency.
https://docs.crewai.com/core-concepts/Tasks/
In this example, we have three tasks: accounts, retailers, and brands.
As with agents, you can define tasks entirely in code or in a YAML file with a little code in the crew. Again, using a YAML file encourages the separation of concerns and allows non-programmers to define task properties.
Here, we define the brands task properties in config/tasks.yaml and the task code in crew.py
brands:
  description: >
    Iterate through the {accounts list}, and for each {account} retrieve the Retail Media brands. Use the {account id} to get the brands.
  expected_output: >
    A list of brands for the account formatted as a table in Markdown. Here is an example of the expected output:
    | Brand ID | Brand Name |
  agent: account_manager
  context:
    - accounts
A task typically includes the following properties:
Description: This is a detailed explanation of what the task entails. It provides the purpose and the steps needed to complete the task. Here, the brands task is instructed to retrieve the Brands for each Account.
Expected Output: This defines the desired outcome of the task. It should be clear and specific. In this example, the output is a markdown table with an example.
Agent: This refers to the entity responsible for executing the task. It could be a specific person, a team, or an automated system. Here, the task is to be done by the account_manager agent.
Context: This includes any additional information or data that provides background or input for the task. It helps understand the environment or conditions under which the task should be performed. The brands task needs input from the results of the accounts task.
Code snippet: crew.py
"""
Brands task instance created from the config file.
This function is decorated with the @agent decorator to indicate that it is an agent.
It's job is to retrieve Brands data for a specific Account and produce a Markdown file.
"""
@task
def brands(self) -> Task:
return Task(
config=self.tasks_config["brands"],
output_file="output/brands.md",
asynch=True,
context=[self.accounts()],
tools=[
BrandsTool(),
],
)
Similar to the agent configurations, the code for the tasks loads properties from the tasks.yaml file. In this example, you see that the output of the task is written to the file: output/brands.md.
Note that we have been explicit about the tool to be used to accomplish this task: BrandsTool(). This enables the agent performing the task to be more focused and less easily confused.
A crew in crewAI represents a collaborative group of agents working together to achieve a set of tasks. Each crew defines the strategy for task execution, agent collaboration, and the overall workflow.
https://docs.crewai.com/core-concepts/Crews/
The crew is the fabric that stitches everything together. It creates instances of the Agents, Tasks, and Tools and specifies the crew's execution details. This is where the “rubber meets the road”.
Lines 10–24 create the LLM used by the agents. Here, you can create the LLM from Groq or Azure OpenAI, or, as we will see in later articles, you can use both for different agents on your crew.
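That listing isn't reproduced here, but a rough sketch of that kind of LLM setup, assuming the LangChain chat-model wrappers and typical environment variable names, looks something like this:

import os
from langchain_groq import ChatGroq
from langchain_openai import AzureChatOpenAI

# Illustrative only; the real crew.py may use different variable names and settings.
if os.environ.get("AZURE_OPENAI_API_KEY"):
    llm = AzureChatOpenAI(
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],  # assumed variable name
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-01"),
        temperature=0,
    )
else:
    llm = ChatGroq(
        model="llama3-70b-8192",  # example Groq model; reads GROQ_API_KEY from the environment
        temperature=0,
    )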
The class Part1Crew is defined by lines 27–82 (note: some lines are omitted for brevity; complete code at: https://github.com/helipilot50/criteo-retail-media-crew-ai/blob/main/part_1/src/part_1/crew.py)
Lines 73–82 define the crew as a function/method in the class.
The process is sequential, meaning the tasks will be executed in the order they are defined. We have set the verbose flag to true to see a verbose log of activity in the file: output/part_1.log
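For reference, here is a sketch of what the @crew method on those lines can look like, following the standard CrewBase pattern with the sequential process and the log file mentioned above; the agent and task methods shown earlier are omitted for brevity.

from crewai import Crew, Process
from crewai.project import CrewBase, crew


@CrewBase
class Part1Crew:
    """Sketch of the crew wiring; see the repository for the complete class."""

    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,  # gathered from the @agent-decorated methods
            tasks=self.tasks,    # gathered from the @task-decorated methods
            process=Process.sequential,  # tasks run in the order they are defined
            verbose=True,
            output_log_file="output/part_1.log",  # the verbose activity log
        )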
To run the crew, enter the following command in the terminal.
crewai run
Ensure that you are using the correct Poetry environment. Many frustrating hours, grey hairs, and expletives can be avoided if you check the environment:
poetry env info
Each task will output its results to a file; these are:
Here is a sample of the output for Retailers:
| Retailer ID | Retailer Name | Campaign Eligibilities |
|-------------|---------------|------------------------|
| 314159 | Marysons | auction, preferred |
...
| 398687 | Office Stuff | auction, preferred |
| 873908 | GAI Group | auction, preferred |
Summary: We have seen how to use the Retail Media APIs as tools used by Agents and Tasks in CrewAI. This is a simple example that walks through the setup and “plumbing” needed to connect these technologies.
Next Steps: If you haven't already done so, watch the videos by Sam and Brandon. And soon, this series will have a “Part 2”.
Sam Witteveen — CEO & Co-Founder @ Red Dragon AI / Google Developer Expert for Machine Learning — Publications NeurIPS, EMNLP
Brandon Hancock — CrewAI Senior Software Engineer | Content Creator on YouTube
CrewAI and Criteo API — Part 1 was originally published in Criteo Tech Blog on Medium.
We are happy to introduce the official Postman Collection for Postmark! If you are not familiar with Postman, it is a graphical interface tool that allows you to test and share API calls at the click of a button. Our Postman collection makes it easy to try out the Postmark APIs and see the results instantly, without the need to write any code.
Postman is similar to Postmark’s API Explorer because it allows you to view and send prepopulated API calls. Postman also allows you to save variables that can be shared across multiple API calls. This makes it even easier to test multiple API calls together.
If you are new to Postman, here are a few things you should know:
Once you have installed Postman you can import the Postmark collection by clicking the Run in Postman button below. You can choose to Fork or Import the collection.
Before you can use the collection, you will need to update some of the Collection Variables. Most importantly, you will need your API tokens. Postmark makes use of two types of API tokens, depending on the endpoint: the Server API token and the Account API token. You can access your API tokens in your Postmark account.
In Postman, make sure you have the top level directory of the collection ("Postmark APIs") selected. Then click on the "Variables" tab.
The "Current value" field will be used when a variable is accessed by the collection. To get started, for the api_token variable, replace the current value with your Postmark Server token. Next, replace the account_token current value with your Postmark Account token.
We recommend getting started with the Email endpoint. Many of the other endpoints require some email data before functioning properly. Expand the Email directory in the collection window and select Send a single email.
We have prepopulated the body of this call with an example message, but you will need to change the From field to a valid sender signature for your account before the API call will be accepted. You can also take this time to experiment with the other fields available in this API call.
Once you have changed the From address to a valid sender signature, you can click the "Send" button in Postman to send the API call. The API response will be output beneath the request, and if all goes well you should see a 200 response.
You can also verify that the API call was received successfully by logging into your Postmark account and viewing the activity feed for the server and message stream that you used to make the API call.
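If you later want to make the same call outside Postman, it boils down to a single POST to the Email API. Here is a minimal Python sketch with a placeholder token and placeholder addresses:

import requests

# Placeholders: use your real Server API token and a verified sender signature.
SERVER_TOKEN = "your-server-api-token"

response = requests.post(
    "https://api.postmarkapp.com/email",
    headers={
        "Accept": "application/json",
        "Content-Type": "application/json",
        "X-Postmark-Server-Token": SERVER_TOKEN,
    },
    json={
        "From": "sender@example.com",  # must be a valid sender signature
        "To": "recipient@example.com",
        "Subject": "Hello from the Postmark API",
        "TextBody": "This is a test message.",
    },
)
print(response.status_code, response.json())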
By Chutian Wang, Zhiheng Xu, Paul Lou, Ziyi Wang, Jiayu Lou, Liuming Zhang, Jingwen Qiang, Clint Kelly, Lei Shi, Dan Zhao, Xu Hu, Jianqi Liao, Zecheng Xu, Tong Chen
Artificial intelligence and large language models (LLMs) are a rapidly evolving sector at the forefront of technological innovation. AI’s capacity for logical reasoning and task completion is changing the way we interact with technology.
In this blog post, we will showcase how we advanced Automation Platform, Airbnb’s conversational AI platform, from version 1, which supported conversational systems driven by static workflows, to version 2, which is designed specifically for emerging LLM applications. Now, developers can build LLM applications that help customer support agents work more efficiently, provide better resolutions, and quicker responses. LLM application architecture is a rapidly evolving domain and this blog post provides an overview of our efforts to adopt state-of-the-art LLM architecture to keep enhancing our platform based on the latest developments in the field.
In a previous blog post, we introduced Automation Platform v1, an enterprise-level platform developed by Airbnb to support a suite of conversational AI products.
Automation Platform v1 modeled traditional conversational AI products (e.g., chatbots) into predefined step-by-step workflows that could be designed and managed by product engineering and business teams.
We saw several challenges when implementing Automation Platform v1, which may also be broadly applicable to typical conversational products:
Our early experiments showed that LLM-powered conversation can provide a more natural and intelligent conversational experience than our current human-designed workflows. For example, with an LLM-powered chatbot, customers can engage in a natural dialogue, asking open-ended questions and explaining their issues in detail. An LLM can more accurately interpret customer queries, even capturing nuanced information from the ongoing conversation.
However, LLM-powered applications are still relatively new, and the community is still improving aspects such as latency and hallucination to meet production-level requirements. So it is too early to fully rely on them for the large-scale, diverse experiences of millions of customers at Airbnb. For instance, it is more suitable to use a traditional workflow instead of an LLM to process a claims-related product that requires sensitive data and a number of strict validations.
We believe that at this moment, the best strategy is to combine them with traditional workflows and leverage the benefits of both approaches.
Figure 4 shows a high-level overview of how Automation Platform v2 powers LLM applications.
Here is an example of a customer asking our LLM chatbot “where is my next reservation?”
Another important area we support is the developers of LLM applications. There are several integrations between our system and developer tools to make the development process seamless. We also offer a number of tools, such as context management, guardrails, a playground, and insights.
In the following subsections, we will take a deep dive into a few key areas of supporting LLM applications, including LLM workflows, context management, and guardrails.
While we won’t cover all aspects in detail in this post, we have also built other components to facilitate LLM practice at Airbnb including:
Chain of Thought is an AI agent framework that enables LLMs to reason about issues.
We implemented the concept of Chain of Thought in the form of a workflow on Automation Platform v2 as shown below. The core idea of Chain of Thought is to use an LLM as the reasoning engine to determine which tools to use and in which order. Tools are the way an LLM interacts with the world to solve real problems, for example checking a reservation’s status or checking listing availability.
Tools are essentially actions and workflows, the basic building blocks of traditional products in Automation Platform v1. Actions and workflows work well as tools in Chain of Thought because of their unified interface and managed execution environment.
Figure 6 contains the main steps of the Chain of Thought workflow. It starts with preparing context for the LLM, including prompt, contextual data, and historical conversations. Then it triggers the logic reasoning loop: asking the LLM for reasoning, executing the LLM-requested tool and processing the tool’s outcome. Chain of Thought will stay in the reasoning loop until a result is generated.
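As an illustration of that loop (not Airbnb's actual implementation), the reasoning cycle can be sketched in a few lines of Python; call_llm, execute_tool, and the message format are placeholders.

from typing import Callable


def chain_of_thought(
    call_llm: Callable[[list[dict]], dict],    # returns a tool request or a final answer
    execute_tool: Callable[[str, dict], str],  # runs an action/workflow and returns its outcome
    context: list[dict],                       # prompt, contextual data, historical conversations
    max_steps: int = 10,
) -> str:
    """Minimal sketch of a Chain of Thought reasoning loop."""
    messages = list(context)
    for _ in range(max_steps):
        decision = call_llm(messages)          # ask the LLM to reason about the next step
        if "final_answer" in decision:
            return decision["final_answer"]    # stay in the loop until a result is generated
        observation = execute_tool(decision["tool"], decision.get("args", {}))
        messages.append({"role": "tool", "name": decision["tool"], "content": observation})
    return "No resolution within the step budget; escalate to a human agent."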
Figure 7 shows all high-level components powering Chain of Thought:
To ensure the LLM makes the best decision, we need to provide it with all necessary and relevant information, such as historical interactions, the intent of the customer support inquiry, current trip information, and more. For use cases like offline evaluation, point-in-time data retrieval is also supported by our system via configuration.
Given the large amount of available contextual information, developers can either statically declare the needed context (e.g., customer name) or name a dynamic context retriever (e.g., help articles relevant to the customer’s question).
Context Management is the key component ensuring the LLM has access to all necessary contextual information. Figure 8 shows the major Context Management components:
LLMs are powerful text generation tools, but they can also come with issues like hallucinations and jailbreaks. This is where our Guardrails Framework comes in: a safeguarding mechanism that monitors communications with the LLM, ensuring they are helpful, relevant, and ethical.
Figure 9 shows the architecture of the Guardrails Framework, where engineers from different teams create reusable guardrails. At runtime, guardrails can be executed in parallel and leverage different downstream tech stacks. For example, the content moderation guardrail calls various LLMs to detect violations in communication content, and tool guardrails use rules to prevent bad executions, for example updating listings with an invalid setup.
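Purely as an illustration (not the actual framework), a reusable guardrail can be modeled as a small interface whose checks run in parallel before a response is released:

from concurrent.futures import ThreadPoolExecutor


class Guardrail:
    """Base class for a reusable guardrail; teams subclass it and implement check()."""

    name = "guardrail"

    def check(self, llm_message: str) -> bool:
        raise NotImplementedError


class ContentModerationGuardrail(Guardrail):
    name = "content_moderation"

    def check(self, llm_message: str) -> bool:
        # A real implementation could call a moderation model or another LLM.
        return "violation" not in llm_message.lower()


def run_guardrails(guardrails: list[Guardrail], llm_message: str) -> dict[str, bool]:
    """Execute all guardrails in parallel and collect their verdicts."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda g: (g.name, g.check(llm_message)), guardrails))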
In this blog, we presented the most recent evolution of Automation Platform, the conversational AI platform at Airbnb, to power emerging LLM applications.
LLM applications are a rapidly developing domain, and we will continue to evolve with these transformative technologies, explore other AI agent frameworks, expand Chain of Thought tool capabilities, and investigate LLM application simulation. We anticipate further efficiency and productivity gains for all AI practitioners at Airbnb with these innovations.
We’re hiring! If work like this interests you, check out our careers site.
Thanks to Mia Zhao, Zay Guan, Michael Lubavin, Wei Wu, Yashar Mehdad, Julian Warszawski, Ting Luo, Junlan Li, Wayne Zhang, Zhenyu Zhao, Yuanpei Cao, Yisha Wu, Peng Wang, Heng Ji, Tiantian Zhang, Cindy Chen, Hanchen Su, Wei Han, Mingzhi Xu, Ying Lyu, Elaine Liu, Hengyu Zhou, Teng Wang, Shawn Yan, Zecheng Xu, Haiyu Zhang, Gary Pan, Tong Chen, Pei-Fen Tu, Ying Tan, Fengyang Chen, Haoran Zhu, Xirui Liu, Tony Jiang, Xiao Zeng, Wei Wu, Tongyun Lv, Zixuan Yang, Keyao Yang, Danny Deng, Xiang Lan and Wei Ji for the product collaborations.
Thanks to Joy Zhang, Raj Rajagopal, Tina Su, Peter Frank, Shuohao Zhang, Jack Song, Navjot Sidhu, Weiping Peng, Kelvin Xiong, Andy Yasutake and Hanlin Fang’s leadership support for the Intelligent Automation Platform.
Automation Platform v2: Improving Conversational AI at Airbnb was originally published in The Airbnb Tech Blog on Medium.
We use Kafka extensively at Zendesk. We have multiple Kafka clusters and at the time of writing this we have over a thousand topics that are replicated on each cluster.
Until recently, our solution for provisioning and updating topics was to use a GitHub repo with a JSON document containing all the topic definitions and a service that periodically processed the document and created, updated, or deleted topics according to the latest definitions.
While this was vastly more efficient than creating topics manually, it still required development teams to co-ordinate changes with the Kafka administration team and wait for those changes to be pushed to GitHub, approved, merged, and finally deployed to the clusters.
This workflow slowed down development teams and increased the workload on the Kafka administration team. We wanted to empower development teams to provision their own topics, increasing the speed with which they could make changes and freeing us up to focus on more interesting and productive work.
Thanks to the efforts of our infrastructure teams, we do have a Self Service interface at Zendesk that allows teams to specify the resources that their service needs in a simple YAML file along with their code. When the service is deployed, the Self Service API takes the resource definitions and creates Kubernetes resources based on the definitions.
To implement Self Service for Kafka topics, we needed to create our own Kubernetes custom resource and a Kubernetes operator to reconcile the custom resources against the Kafka cluster.
A Kubernetes operator is a simple state machine that works by examining the specification of a resource and performing actions on a real system based on the specification. The operator is usually triggered by a change to a resource that produces an event. The operator then makes changes and continues to trigger new updates to the resource until the real system matches the Kubernetes representation.
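As a rough sketch of the pattern (not our production operator), a Python operator built with the kopf framework and the Confluent admin client might create a topic like this; the broker address and spec field names are hypothetical:

import kopf
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder broker address


@kopf.on.create("kafkatopics")
def create_topic(spec, **_):
    """Reconcile a newly created custom resource by creating the Kafka topic with defaults."""
    topic = NewTopic(
        spec.get("topicName") or spec["name"],           # hypothetical spec fields
        num_partitions=spec.get("partitions", 2),        # default described below
        replication_factor=spec.get("replicationFactor", 3),
    )
    for future in admin.create_topics([topic]).values():
        future.result()  # raises on failure, so the operator will retry
    return {"topic": topic.topic}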
The typical workflow for reconciling a Kafka topic resource looks something like this:
Kafka topics have a large number of configuration options (over 30 at this point), but the majority of them can be set to a default value for most topics. To simplify things for developers, we decided to limit the number of available options to the bare minimum required.
We currently allow 9 different values to be set, but only the topic name is required to provision a new topic. All other values are set to a default by the Kubernetes operator. The spec required to create a topic can be as simple as this:
kafkaTopic:
  - name: "example-topic"
    attributes:
      topicName: "example.topic"
In the example above, the topicName attribute is required in addition to the name attribute. This is because our interface requires us to provide a name for the Kubernetes resource. Unfortunately, Kubernetes naming requirements are different to those of Kafka, and Kubernetes doesn’t allow the use of . in resource names.
If the topic name is also a valid Kubernetes resource name, even the topicName attribute can be left out:
kafkaTopic:
  - name: "example-topic"
Unfortunately, in most cases we can’t do this as we encourage the use of . characters in our topic names to separate namespaces.
Because we control our Kafka cluster as well as the provisioning of resources, we can provide defaults that make sense in our environment, including values that Kafka requires, such as replication factor and partition count. Some more detailed examples follow, with an illustrative spec sketched after the list:
partition count: we default this to 2, which is quite small but still enough for many use cases. We also limit this to a maximum of 9 partitions as a cost control. If a team needs a topic with more partitions they can seek an exemption from the Kafka admin team.
replication factor: we expect this to be set to 3 for the vast majority of topics. This means one broker can go down and the data will still be safely replicated across at least 2 brokers. This can be set lower, but setting it higher would cause an error as there are only 3 brokers available.
min in sync replicas: we set this to one less than the replication factor for the sake of simplicity. We always have 3 replicas and by default each topic is replicated on all 3. In most cases we want to have at least 2 replicas in sync, which means we can still lose one more broker and the data will continue to be available. This allows us to take brokers down safely when we need to do maintenance. If the replication factor for a topic is lower, this value will be automatically adjusted.
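To make those defaults concrete, a more fully specified topic could look something like the snippet below; the attribute names are illustrative, since the exact schema of our custom resource isn’t shown here:

kafkaTopic:
  - name: "example-topic"
    attributes:
      topicName: "example.topic"
      partitions: 2            # capped at 9 unless an exemption is granted
      replicationFactor: 3
      minInsyncReplicas: 2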
Kafka’s configuration values are designed for Kafka’s needs and don’t prioritise the concerns of developers. Data size values are specified in bytes and time values are specified in milliseconds, so it’s not always easy to tell at a glance what a configuration value means.
To make things even easier for developers, we decided to support the use of flexible duration and size units for these kinds of configuration values. For example, instead of entering a retention time of 86400000, a developer can instead enter the value as 1 day.
1 day → 86400000 (ms)
1 MB → 1048576 (bytes)
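A conversion like this is straightforward; here is a rough Python sketch of the idea (our operator’s actual implementation lives elsewhere in our stack and may differ):

import re

# Base units: milliseconds for durations, bytes for sizes.
UNIT_FACTORS = {
    "ms": 1, "second": 1_000, "minute": 60_000, "hour": 3_600_000, "day": 86_400_000,
    "b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3,
}


def to_base_units(value: str) -> int:
    """Convert values like '1 day' or '1 MB' to Kafka's base units."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]+)\s*", value)
    if not match:
        raise ValueError(f"Unrecognised value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2).lower()
    if unit not in UNIT_FACTORS and unit.endswith("s"):
        unit = unit[:-1]  # allow plural forms such as "2 days"
    return amount * UNIT_FACTORS[unit]


print(to_base_units("1 day"))  # 86400000
print(to_base_units("1 MB"))   # 1048576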
Having done all the hard work of enabling developers to provision their own topics, we still had a huge number of topics that had been created and provisioned using the old service. Unfortunately, there was no way around the need to move all these topic definitions into the repos of the services that own them.
To make this process easier for developers, we used the existing topic definitions to generate stubs that they could add to their service definitions. This still required a small amount of manual work for the development teams, but it also made them aware of the new capabilities at their disposal.
Another thing to be aware of when managing topics in a decentralised way is that services which create topics need to be deployed before services that consume topics. This is the natural order under normal circumstances, but when deploying services onto a new cluster, it can be a bit harder to manage. When topics are centralised, the topics can all be created before any services. Now, we have to ensure services are deployed in the correct order.
With the help and collaboration of other teams and services at Zendesk, we’ve now implemented a system which empowers development teams to provision and maintain Kafka topics on their own, without requiring any input from the Kafka administration team.
At the same time, we’ve provided a more user-friendly interface and enabled sensible defaults. This approach reduces the chance of configuration errors and makes it easy for developers to know 1) what they can change and 2) what they should care about.
As the team responsible for Kafka, we can now spend more of our time improving our systems and implementing new features and capabilities, and less time dealing with simple administration tasks.
Provisioning Kafka topics the easy way! was originally published in Zendesk Engineering on Medium.