Fixing an intermittently failing Lambda

What do you do when a production API built on top of AWS Lambda suddenly starts failing intermittently?

We recently had a production issue with an API built using API Gateway and AWS Lambda. The API was failing at a high rate but not consistently.

Our first thought was a software update. That was quickly ruled out as we hadn't released any changes to production in days.

After digging through the CloudWatch logs we worked out that the Lambda was having trouble making an API call to a third party that we use. Further tests confirmed that their API was working correctly, so they were not the cause of the problem.

For some reason the Lambda had lost network access to the outside world, but it wasn't happening for every request.

The inconsistent nature of the failure led us to believe that only some of the Lambda instances had this networking issue. Those with network access were functioning correctly.
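
One way to check a hypothesis like this is to group the failures by log stream, since each Lambda instance writes to its own CloudWatch log stream. The following is a minimal sketch, not what we actually ran. It assumes the events are dictionaries with "logStreamName" and "message" keys (the shape returned by the CloudWatch Logs filter_log_events call), and the marker string is a hypothetical error message.

```python
from collections import Counter
from typing import Iterable, Mapping


def failures_by_stream(events: Iterable[Mapping[str, str]], marker: str) -> Counter:
    """Count log events whose message contains `marker`, grouped by log stream."""
    counts: Counter = Counter()
    for event in events:
        # `marker` is a hypothetical error string, e.g. a connection timeout message.
        if marker in event["message"]:
            counts[event["logStreamName"]] += 1
    return counts
```

If the failures cluster in a handful of streams while other streams show none, that is consistent with only some instances being affected.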

If we had been using a regular server our first reaction would have been to reboot it. But how do you do that with AWS Lambda? Lambda functions are fully managed and you can't just restart them.

One suggestion was to re-deploy the SAM application. This wouldn't work because, with no changes to the template, CloudFormation would have nothing to update. Instead we settled on increasing the configured memory for the Lambda function by 128MB using the AWS console. This small configuration change was quick to make and resulted in AWS provisioning new Lambda instances almost immediately.
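
We made the change in the console, but the same thing can be scripted. The sketch below assumes a boto3 Lambda client is passed in. The helper name bump_memory and the function name "my-api-handler" are made up for illustration; get_function_configuration and update_function_configuration are the boto3 calls that read and change the memory setting.

```python
from typing import Any


def bump_memory(client: Any, function_name: str, delta_mb: int = 128) -> int:
    """Change the function's configured memory by delta_mb; return the new size in MB."""
    config = client.get_function_configuration(FunctionName=function_name)
    new_size = config["MemorySize"] + delta_mb
    # Any configuration update should cause new instances to be provisioned.
    client.update_function_configuration(
        FunctionName=function_name, MemorySize=new_size
    )
    return new_size


# Example usage (needs boto3 and AWS credentials; the function name is a placeholder):
#   import boto3
#   client = boto3.client("lambda")
#   bump_memory(client, "my-api-handler", 128)    # force fresh instances
#   bump_memory(client, "my-api-handler", -128)   # revert once things are stable
```

Passing a negative delta reverts the change afterwards, which is the same step we did by hand.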

Once the crisis was over we reduced the memory by 128MB again to keep the deployed configuration in sync with our CloudFormation template.