Medium

When I told 4,091 writers they weren’t getting paid

2024-9-25

Jacob Bennett

Subtle database errors and how we recovered

On September 5, 2024, our team turned on the new Partner Program payments system in production.

And we immediately sent an email to every partner saying they weren’t going to get paid. 😨

This wasn’t a SEV1. But it was a very visible bug on a very sensitive part of our platform. We had dozens of tickets come in and a few passionate posts expressing how incompetent that one engineer is (🙋‍♂)️. We figured out the problem, and it ended up being more subtle than I first thought.

Some context on the Partner Program payroll system

All of the logic related to “how much money should we send a partner” is scoped to a single user at a time. By the time this runs each month, earnings data has already been calculated on a daily level. The “payroll” work amounts to a simple flow of “get the amount we owe a user, then send it to that user.”

We did add one additional piece to this processor that increased the complexity over previous iterations: If a user’s unpaid earnings are less than $10 (USD), don’t create a Pending Transfer. Instead, accrue their balance and notify them that their balance will roll over. Once a user has reached the $10 minimum, pay them their entire account balance.

Here’s a simplified snippet from the codebase (the entry point to this script is RunUserPayroll).

func (a *Service) RunUserPayroll(ctx context.Context, userID string, r model.TimeRange, batchID string) error {
// Step 1: Aggregate their earnings from last month.
err := a.createPayrollCredit(ctx, userID, r, batchID)
if err != nil {
return fmt.Errorf("creating payroll credit: %w", err)
}

// Step 2: Pay the user all of their unpaid earnings.
_, err = a.createTransferRequest(ctx, userID)
if err != nil {
return fmt.Errorf("creating pending transfer: %w", err)
}

return nil
}

func (a *Service) createPayrollCredit(ctx context.Context, userID string, r model.TimeRange, batchID string) error {
// Get the amount the user earned that we haven't rolled up yet.
credit, err := a.calculatePayrollCredit(ctx, userID, r)
if err != nil {
return fmt.Errorf("calculating payroll credit: %w", err)
}

// If the user has not earned any money, we don't need to create a credit, we can exit early
if credit.IsZero() {
return nil
}

// Roll up the user's earnings into a credit
err = a.payroll.CreatePartnerProgramMonthlyCredit(ctx, &model.PartnerProgramMonthlyCredit{
ID: uuid.New().String(),
UserID: userID,
Period: r,
CreatedAt: time.Now(),
Amount: credit,
Note: "Partner Program Monthly Credit",
}, batchID)
if err != nil {
return fmt.Errorf("creating audit credit: %w", err)
}

return nil
}

func (a *Service) createTransferRequest(ctx context.Context, userID string) (*model.Transfer, error) {
// Get the user's current balance, which will now include the credit from this payroll run
balance, err := a.accountant.GetUserAccountBalance(ctx, userID)
if err != nil {
return nil, fmt.Errorf("getting user account balance: %w", err)
}

// If the user's current balance is above the minimum transferable threshold, we can create
// a pending transfer for the user
meetsThreshold, err := balance.GreaterThanOrEqual(a.config.MinTransferableAmount)
if err != nil {
return nil, fmt.Errorf("checking if user balance meets minimum transferable threshold: %w", err)
}
if !meetsThreshold {
log.Info(ctx, "User balance is below minimum transferable threshold, no transfer created", log.Tags{"user_id": userID, "balance": logAmount(balance)})
err = a.userNotifier.NotifyUserThresholdNotMet(ctx, userID)
if err != nil {
log.Warn(ctx, "Failed to notify user of threshold not met", log.Tags{"user_id": userID, "error": err.Error()})
}
return nil, nil
}

// Everything looks good, create the transfer.
transferRequest := transfers.NewTransferRequest(balance, userID)
transfer, err := a.transfers.CreateTransferRequest(ctx, transferRequest)
if err != nil {
return nil, fmt.Errorf("creating transfer request: %w", err)
}

return transfer, nil
}

The error we ran into is already in this code snippet. Have you noticed it yet?

“The Incident”

We ran the first steps of the payroll system at 11:45am PT. As we watched the logs and metrics in Datadog, two things happened.

First, we started to see a lot of INFO-level logs that said "User balance is below minimum transferable threshold, no transfer created" (you can see the log line in the snippet above). This INFO log by itself is not cause for alarm — if a user doesn’t meet the minimum transferable threshold, this is a valid state.

While those logs were spiking, we got pinged by Fer from User Services:

This is an actual problem and a cause for alarm.

We immediately cancelled the payroll run and dug into what was going on.

The first thing we noticed was the number of users we “successfully” processed was equal to the number of INFO logs I mentioned earlier. That meant 100% of users were going through the code path for failing to meet the minimum payout threshold.

That narrowed the problem area dramatically. We immediately pulled up the code—the same snippet from above—and walked through the logic step-by-step. Another engineer and I had an “aha!” moment at the same time: We are making a database write immediately followed by a database read. This might be a race condition on the database.

The call to a.payroll.CreatePartnerProgramMonthlyCredit writes the last month’s earnings as a credit record (increase the account balance) to the RDS cluster. The call to a.accountant.GetUserAccountBalances reads from the same table to get all unpaid earnings credits.

Reading from a database you just wrote to is a safe thing to do most of the time. But when it isn’t, it can be a subtle bug to track down. In our case, the bug came from how we configured our database. Most of our production RDS clusters are configured to use at least three read replicas. This architecture allows us to scale reads and writes separately. Latency between data written to the primary node and that same data being replicated to all read replicas is low enough for most applications. In my experience it’s available in a few seconds at most.

As a side note: We didn’t catch this before production because we don’t use read replicas in non-production environments. That will probably change now.

But the Partner Program payroll system didn’t actually want separate reads and writes. We wanted the same data we just wrote to the database immediately (<2ms later). That data hadn’t been propagated from the primary node to the read replicas, so 100% of database queries were returning an empty dataset.

The steps that affected this flow were:

  1. Write the monthly credit to the primary cluster database.
  2. Read all monthly credits from the read replicas.
  3. (RDS) Replicate the monthly credit from the primary database to the read replicas.

The race condition was caused in steps 2 and 3. We were querying for data that hadn’t been replicated yet.

So what was the fix?

There’s not a one-size-fits-all solution to fix every race condition. The fix we implemented here was to first fetch all of the unpaid credits and then create the new unpaid credit for last month’s earnings.

The RunPayroll function now looks like this:

func (a *Service) RunUserPayroll(ctx context.Context, userID string, r model.TimeRange, batchID string) error {
// Step 1: Get the user's current balance.
balance, err := a.accountant.GetUserAccountBalance(ctx, userID)
if err != nil {
return nil, fmt.Errorf("getting user account balances: %w", err)
}

// Step 2: Aggregate their earnings from last month.
createdCredit, err := a.createPayrollCredit(ctx, userID, r, batchID)
if err != nil {
return fmt.Errorf("creating payroll credit: %w", err)
}

// Step 3: Add the new credit to their prior balance
balance = balance + createdCredit

// Step 4: Pay the user all of their unpaid earnings.
_, err = a.createTransferRequest(ctx, userID, balance)
if err != nil {
return fmt.Errorf("creating pending transfer: %w", err)
}

return nil
}

func (a *Service) createPayrollCredit(ctx context.Context, userID string, r model.TimeRange, batchID string) (*money.Money, error) {
// Get the amount the user earned that we haven't rolled up yet.
credit, err := a.calculatePayrollCredit(ctx, userID, r)
if err != nil {
return nil, fmt.Errorf("calculating payroll credit: %w", err)
}

// If the user has not earned any money, we don't need to create a credit, we can exit early
if credit.IsZero() {
return nil, nil
}

// Roll up the user's earnings into a credit
err = a.payroll.CreatePartnerProgramMonthlyCredit(ctx, &model.PartnerProgramMonthlyCredit{
ID: uuid.New().String(),
UserID: userID,
Period: r,
CreatedAt: time.Now(),
Amount: credit,
Note: "Partner Program Monthly Credit",
}, batchID)
if err != nil {
return nil, fmt.Errorf("creating audit credit: %w", err)
}

return credit, nil
}

func (a *Service) createTransferRequest(ctx context.Context, userID string, amount *money.Money) (*model.Transfer, error) {
// If the user's current balance is above the minimum transferable threshold, we can create
// a pending transfer for the user
meetsThreshold, err := amount.GreaterThanOrEqual(a.config.MinTransferableAmount)
if err != nil {
return nil, fmt.Errorf("checking if amount meets minimum transferable threshold: %w", err)
}
if !meetsThreshold {
log.Info(ctx, "Amount is below minimum transferable threshold, no transfer created", log.Tags{"user_id": userID, "balance": logAmount(amount)})
err = a.userNotifier.NotifyUserThresholdNotMet(ctx, userID)
if err != nil {
log.Warn(ctx, "Failed to notify user of threshold not met", log.Tags{"user_id": userID, "error": err.Error()})
}
return nil, nil
}

// Everything looks good, create the transfer.
transferRequest := transfers.NewTransferRequest(amount, userID)
transfer, err := a.transfers.CreateTransferRequest(ctx, transferRequest)
if err != nil {
return nil, fmt.Errorf("creating transfer request: %w", err)
}

return transfer, nil
}

Race conditions are subtle, and they’re hard to identify without experiencing them firsthand. If you have a method of identifying them earlier, please let me know!

And if you nerd out about fixing race conditions, we’re hiring!


When I told 4,091 writers they weren’t getting paid was originally published in Medium Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.