Avoiding Race Conditions In Concurrent AWS Lambda Functions

I haven’t blogged in a long time! Here’s a quick ramble about something somewhat interesting that I whipped up earlier today.

I write lots of buggy software. One such example of buggy software is TagBot, which is a GitHub Action that runs hourly on roughly 2000 GitHub repositories. A couple of weeks ago, I wrote a new bug that caused TagBot to crash and send hundreds of notifications and emails. I realized that this was bad, so I implemented an error handling mechanism that reported errors as GitHub issues to the TagBot repository. Something like this:

import traceback

try:
    do_all_the_stuff()
except Exception:
    # Report the full stack trace as a GitHub issue instead of crashing loudly
    open_issue(traceback.format_exc())
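
Conceptually, open_issue is just a thin wrapper around the GitHub REST API. Here's a minimal sketch; OWNER/REPO and GITHUB_TOKEN are placeholders, not the actual TagBot setup:

import os
import requests

def open_issue(body: str) -> None:
    # Create an issue via POST /repos/{owner}/{repo}/issues, using a token
    # that has permission to open issues on the reporting repository.
    response = requests.post(
        "https://api.github.com/repos/OWNER/REPO/issues",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={"title": "Automatic error report", "body": body},
    )
    response.raise_for_status()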

Ideally, this could be done with the GitHub API directly from the running GitHub Action, but the default API token does not have permission to open issues in other repositories. So I spun up a simple web server with the Serverless Framework that accepted error reports and opened TagBot issues.

# "Client" code (in GitHub Actions)
try:
    do_all_the_stuff()
except Exception:
    # Ship the stack trace to the reporting server instead of opening the issue directly
    requests.post(
        "https://my.server/report",
        json={"error": traceback.format_exc()},
    )

# Server code (in AWS Lambda)
@app.route("/report", methods=["POST"])
def handle_report():
    # Open a TagBot issue from the reported error
    open_issue(request.json)

We’ve now gone from hundreds of user notifications to hundreds of issues being opened at once on one repository for the same bug… which is not ideal. That’s easy enough to deal with, though: we just need to check for existing issues that contain a similar error before opening a new one.

@app.route("/report", methods=["POST"])
def handle_report():
    # Only open an issue if no existing issue matches the reported error
    if not is_duplicate(request.json):
        open_issue(request.json)
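
For what it’s worth, is_duplicate can be as simple as listing the open issues on the reporting repository and checking whether any of them already contain the same traceback. A rough sketch, again with OWNER/REPO and GITHUB_TOKEN as placeholders (the real implementation may compare errors differently):

import os
import requests

def is_duplicate(report: dict) -> bool:
    # List open issues and treat the report as a duplicate if any issue body
    # already contains the same traceback.
    response = requests.get(
        "https://api.github.com/repos/OWNER/REPO/issues",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        params={"state": "open"},
    )
    response.raise_for_status()
    return any(report["error"] in (issue["body"] or "") for issue in response.json())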

But what if this is happening 100 times at once? We have what’s called a race condition. Issue reporters A and B are both reporting the same error. A looks for duplicates and doesn’t find any, then proceeds to open the issue. At the same time, B looks for duplicates and doesn’t find any either, because A is not done yet. B opens an issue too, and when they’re both finished, we have multiple identical issues open, which is not good.

What can we do about this? What we need is a message queue. AWS provides this via Simple Queue Service (SQS). You can send messages to a queue, and they get routed to consumers, which in this case are Lambda functions.

import json

import boto3

# Look up the queue by name (boto3's Queue() constructor expects a queue URL)
QUEUE = boto3.resource("sqs").get_queue_by_name(QueueName="my-queue")

@app.route("/report", methods=["POST"])
def handle_report():
    # MessageBody must be a string, so serialize the report back to JSON
    QUEUE.send_message(MessageBody=json.dumps(request.json))

def handle_report_from_queue(report):
    if not is_duplicate(report):
        open_issue(report)
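
As an aside, the queue consumer's actual Lambda entry point receives a batch of SQS records rather than a single report. A minimal sketch, assuming the standard SQS event structure and the handle_report_from_queue helper above:

def queue_handler(event, context):
    # Each SQS record's body is the MessageBody string we sent from the HTTP handler
    for record in event["Records"]:
        handle_report_from_queue(json.loads(record["body"]))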

So if our HTTP handler stuffs messages into a queue instead of creating reports directly, a single consumer can read messages one at a time and handle them without race conditions. Right? Wrong. As far as I have been able to figure out, AWS does not let you limit the number of concurrent consumers of SQS messages. If you put in a bunch of messages, a bunch of Lambda functions will spawn at the same time to handle them. So we haven’t really gained anything: we had race conditions in our HTTP handlers, and now we have race conditions in our SQS handlers.

What if we could delay each message by some amount? Well, we could just sleep for a while before sending the message. But that’s wasting compute time, and I’m afraid of leaving the AWS free tier, so no thank you. SQS has our back here, because it supports delivery delay. When we send our message, we specify a number of seconds. No consumers will receive the message until that number of seconds has elapsed. So if we can use a different delivery delay for each request, then our consumers will run nicely spaced out.

The simplest way to do this is to randomize the delay. The maximum delay is 15 minutes (900 seconds), so we can reasonably expect consumers to run separately unless there is a very large number of concurrent requests.

import random

@app.route("/report", methods=["POST"])
def handle_report():
    # Spread deliveries across the 15-minute maximum delay window
    delay = random.randrange(900)
    QUEUE.send_message(MessageBody=json.dumps(request.json), DelaySeconds=delay)

But by randomizing the delay, we’ve given up something that might be important: order. Thankfully, order doesn’t matter for our application: since the reports are duplicates, it doesn’t really matter which one ends up opening the issue. If it did, we’d have to resort to something more complicated.
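
For instance, an SQS FIFO queue preserves ordering within a message group (at the cost of other features, like per-message delivery delays). A rough sketch, with placeholder queue and group names:

import hashlib
import json

import boto3

FIFO_QUEUE = boto3.resource("sqs").get_queue_by_name(QueueName="reports.fifo")

def send_ordered(report):
    body = json.dumps(report)
    FIFO_QUEUE.send_message(
        MessageBody=body,
        # Messages sharing a group ID are delivered strictly in order
        MessageGroupId="error-reports",
        # Lets SQS drop exact duplicates sent within its deduplication window
        MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
    )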

After all this, we still have no guarantees, but it’s definitely better than before. I must admit, this wouldn’t be a problem if the web API were running on a “real” server, but oh well. It’s nice not having to worry about maintaining a server.

TL;DR

AWS SQS has a useful “delivery delay” option when sending messages that allows you to control when a message becomes visible to consumers. Randomize it to minimize your risk of race conditions when running functions concurrently!

See here for the actual code that I wrote for this task, although it’s a bit noisy.

Update: This is a Bad Solution

Ok, so it turns out that this is actually an awful solution. Instead, you should use Lambda’s reserved concurrency to limit concurrent executions to 1! You get a guarantee that you’ll never have more than one instance of your Lambda running at once, and much better ordering (depending on the type of SQS queue you’re using). For applications on the Serverless Framework, it’s as simple as adding a key to your function declaration:

functions:
  myfunction:
    handler: myapp.handler
    # Never allow more than one concurrent instance of this function
    reservedConcurrency: 1

My commit updating to this configuration is here. I was totally unaware of this feature when I first wrote this post!
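
If you’re not using the Serverless Framework, the same limit can be set directly on the function through the Lambda API. A minimal sketch with boto3, where the function name is a placeholder:

import boto3

# Cap the function at a single concurrent execution
boto3.client("lambda").put_function_concurrency(
    FunctionName="my-function",
    ReservedConcurrentExecutions=1,
)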