Optimize serverless apps with real-time troubleshooting

Serverless computing demands a new way of thinking to successfully troubleshoot apps. Carefully define error states, and maintain logs to pinpoint and solve performance issues.

Tom Nolle, Andover Intel

Published: 01 Mar 2018

Serverless computing introduces a lot of change into an IT organization. There are no fixed VMs to deploy and manage, applications are about events rather than workflows, and you pay only for the compute processing you actually perform. But these changes pale in comparison to the adjustments a dev team needs to make to troubleshoot software in a serverless world.

Troubleshooting and debugging serverless apps is essentially an extreme form of real-time development. You need to troubleshoot event-driven systems differently, because the events are both spontaneous and asynchronous. Because the timing of and between events is one of the most critical things to test, you can't just run a piece of data and see what happens.

The framework for real-time troubleshooting and debugging is fairly well understood. You test each element in isolation -- known as unit testing -- and then perform integration testing in a controlled environment, where you can inject test data in strict timing to mimic the real world. Then, you shift to a controlled test in your live environment -- again, using time-synchronized data injection. Extensive logging helps you understand if everything works, and if something doesn't work, logging helps isolate the problem.
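The timed injection step can be sketched in a few lines of Python. This is a minimal illustration, not any provider's test tooling: `inject_events`, the `(offset, payload)` event format and the `handler` callable are hypothetical names chosen for the example. The clock and sleep functions are passed in as parameters so that a test can substitute a simulated clock instead of waiting in real time.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical event format: (seconds after the start of the run, payload).
TimedEvent = Tuple[float, Dict[str, Any]]


def inject_events(
    handler: Callable[[Dict[str, Any]], Any],
    events: List[TimedEvent],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Replay events against handler at their scheduled offsets.

    Returns one record per event so that scheduled and actual
    firing times can be compared after the run.
    """
    records = []
    start = clock()
    # Fire events in time order, regardless of the order they were listed in.
    for offset, payload in sorted(events, key=lambda e: e[0]):
        delay = offset - (clock() - start)
        if delay > 0:
            sleep(delay)  # wait until the event's scheduled offset
        fired_at = clock() - start
        result = handler(payload)
        records.append(
            {
                "scheduled": offset,
                "fired_at": fired_at,
                "payload": payload,
                "result": result,
            }
        )
    return records
```

In a unit or integration test, `handler` would be the function under test, called directly. The returned records double as a simple log of when each event actually fired, which is the timing information the test is meant to exercise.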

Use this process as your baseline framework to troubleshoot and debug serverless apps. However, serverless is a more complex form of real-time development and testing because functions float unpredictably around a pool of hosting resources, rather than run in a single place. This introduces a variable delay as the function is scheduled in response to an event. It also means developers can't watch for function execution because it probably won't run on the exact resources they watch.

To account for this, adapt basic, real-time testing and debugging to this floating resource model. In addition, design serverless applications to signal any issues that you can then dive into with more detailed troubleshooting and testing.

Detect trouble early on

While user complaints or application failures clearly signal trouble, IT professionals are eager to find problems before they reach that level. In most cases, teams can achieve early detection through analysis of application logs. They can also use tools to spot conditions across multiple logs. But logging in a serverless architecture extends the period in which an application runs -- which costs you time and money. Logging times can also impact the relationship between events and processes, leading to errors. As a result, most serverless apps will run with little or no real-time logging, and IT teams will selectively enable logging when they suspect a problem.

Choose a serverless provider

If you've committed to a serverless cloud provider, explore the specific development and testing procedures for the technology. If you haven't committed, be sure to evaluate providers' development and testing facilities before you make your choice. And if you expect to use serverless in a multi-cloud environment, establish consistent development and testing approaches across serverless platforms, which can be a challenge, since testing tools and development languages vary significantly across providers.

So, other than user complaints, how will the suspicion of a problem arise? Fortunately, you can use the real-time processing in a serverless architecture to send you warnings. To process events, serverless apps are organized into systems that have a finite number of operating states. For example, an application component could be waiting for a signal, processing an event or generating a back-end transaction. Applications interpret events according to the expectations set by these operating states. So, anywhere an operating state exists, you can define one or more error states that represent an illogical or unexpected combination of previous conditions and arriving events. These states can send a message or log an incident and serve as a trigger for IT teams to more closely examine an app.
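As a rough sketch of the error-state idea, the Python below models the three example operating states as a small transition table. The state names, the event names (`signal`, `processed`, `committed`) and the `alert` callback are all assumptions made for illustration. In a real serverless function the current state would have to be read from and written to external storage, which is omitted here.

```python
from enum import Enum, auto
from typing import Callable, Dict, Tuple


class State(Enum):
    WAITING = auto()      # waiting for a signal
    PROCESSING = auto()   # processing an event
    COMMITTING = auto()   # generating a back-end transaction
    ERROR = auto()        # illogical state/event combination


# Hypothetical table of the only transitions considered legal.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.WAITING, "signal"): State.PROCESSING,
    (State.PROCESSING, "processed"): State.COMMITTING,
    (State.COMMITTING, "committed"): State.WAITING,
}


def next_state(state: State, event: str, alert: Callable[[str], None]) -> State:
    """Return the next operating state, or ERROR for an unexpected event.

    alert is called with a short message whenever an event arrives
    that the current state does not expect.
    """
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        alert(f"unexpected event {event!r} in state {state.name}")
        return State.ERROR
```

Here any state and event pair missing from the table is treated as an error state. The `alert` callback stands in for whatever message or incident log the team uses as its trigger to look more closely.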

When that occurs, carefully and selectively enable logging. Don't turn on all levels of logging across your serverless environment, as you could alter application timing so much that you can't reproduce the problem. Instead, use your error state information to enable logging where it's most likely to be helpful, and expand the logging as you zero in on the conditions that caused the problem. When you think you've found the condition, try first to reproduce it in a controlled testing framework with test data. If that doesn't work, use test data in the live serverless deployment to reproduce it. The first and most critical step to fix a serverless problem is to reproduce it dependably.
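One way to picture selective logging is with Python's standard `logging` module, where each component has its own named logger. The component names and the mapping from error states to suspect components below are invented for the example. A real deployment would more likely change log levels through the platform's configuration than in code.

```python
import logging
from typing import Dict, List

# Hypothetical component loggers for a serverless order application.
COMPONENTS: List[str] = ["orders.intake", "orders.payment", "orders.fulfillment"]

# Hypothetical mapping: which components to examine for each error state.
SUSPECTS: Dict[str, List[str]] = {
    "payment_before_order": ["orders.intake", "orders.payment"],
    "duplicate_shipment": ["orders.fulfillment"],
}


def quiet_all() -> None:
    """Normal operation: keep every component at WARNING to limit overhead."""
    for name in COMPONENTS:
        logging.getLogger(name).setLevel(logging.WARNING)


def enable_for(error_state: str) -> List[str]:
    """Turn on DEBUG logging only for components tied to the error state.

    Returns the names of the loggers that were switched to DEBUG.
    An unknown error state changes nothing and returns an empty list.
    """
    enabled = []
    for name in SUSPECTS.get(error_state, []):
        logging.getLogger(name).setLevel(logging.DEBUG)
        enabled.append(name)
    return enabled
```

Calling `quiet_all()` and then `enable_for("duplicate_shipment")` raises detail on the fulfillment component only and leaves the other components at their quieter level. Widening the investigation means adding entries to the mapping, not turning on logging everywhere.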

Usually, the combination of error state analysis and log analysis pinpoints the problem with serverless apps and enables developers to make corrections. But, even then, it's still critical to test and validate any changes to ensure they haven't broken something else in the application. How you perform these tests will depend on the specific serverless execution framework you use.
