Cloud deployments of software often pose the most ticklish error detection and repair problems. Customers are constantly using a cloud app developer’s products, at all times of day and night and across geographies. Meanwhile, it’s a safe bet that something in those releases will be breaking, and error fixes will be needed, said Brian Rue, CEO of Rollbar, which provides real-time error monitoring services for developers. The trick is detecting errors quickly, rather than waiting for customers to report them.
“You’re releasing improvements, releasing bug fixes, and that constant state of change means that you need to have a constant state of monitoring,” said Rue. “If something is broken, and you don’t find out about it until a customer writes in days later, it could easily be days or weeks before you find a way to repeat the problem. The development team gets caught up in a constant state of firefighting.”
Rue shares some best practices for error handling and making code error fixes in this article. Rue co-founded Rollbar after experiencing the problems of error handling when developing gaming apps, at first on a kitchen table in a garage with three colleagues.
The vicious circle
“Imagine a circle starting from deployment,” explained Rue. After deployment, the next thing that typically happens is an error. Your team needs to determine whether it’s a new error or a repeat of one already seen. A new one calls for alerting and prioritization. Once an error is prioritized, the developers can explore the data for it. They can discover which users the error affects, the values of the variables and other information about the cause of the error.
“Usually, that’s enough data to enable writing and deploying error fixes,” said Rue. Then it’s on to the next problem. “That wheel of release, error monitoring and error fixes is constantly spinning,” he said.
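One way to picture the "new error or repeat" decision is to give each error a fingerprint and count how often it has been seen. The sketch below is only an illustration, not Rollbar's grouping method: the names `fingerprint`, `classify` and `seen_counts` are hypothetical, and the fingerprint is assumed to be the exception type plus the function names in its traceback.

```python
import hashlib
import traceback

# Hypothetical in-memory store: fingerprint -> number of occurrences.
seen_counts = {}


def fingerprint(exc):
    """Hash the exception type plus the function names in its traceback."""
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__] + [frame.name for frame in frames]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def classify(exc):
    """Return "new" the first time a fingerprint is seen, else "repeat"."""
    key = fingerprint(exc)
    seen_counts[key] = seen_counts.get(key, 0) + 1
    return "new" if seen_counts[key] == 1 else "repeat"


def parse_age(value):
    return int(value)


results = []
for raw in ["abc", "xyz", None]:
    try:
        parse_age(raw)
    except Exception as exc:  # broad on purpose: this is the monitoring hook
        results.append(classify(exc))

print(results)  # ['new', 'repeat', 'new']
```

The two `ValueError`s share a fingerprint, so only the first would trigger an alert. The `TypeError` is a different error and starts its own count.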
Structured data is good data
The better structured the data is, the more detail the developer can discover about each code error. “Data really should be structured in terms of keys and values, as opposed to just raw strings,” said Rue. For example, suppose the cloud developer wants to log the event: “This user tried to log in and it failed.” Rather than writing the user’s details into the message text, that should be logged as a fixed message, say: “User login failed,” with the user ID attached as metadata. That way the events are easier to group, because there is just one message saying: “Login failed.” Then the cloud developer can see all those failures together and is closer to making error fixes.
“Once you have that structure, you can easily query data forward to see which logins failed. You can figure out how that correlates against other problems, and so on,” Rue said.
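The key-and-value approach Rue describes can be sketched with Python's standard `logging` module. This is a minimal illustration, not a Rollbar API: `JsonFormatter`, `ListHandler`, `log_event` and the `reason` field are hypothetical names chosen for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as JSON: a fixed message plus key/value metadata."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


class ListHandler(logging.Handler):
    """Keep formatted lines in memory so the example can query them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def log_event(logger, message, **context):
    """Log a constant message; the variable parts travel as metadata."""
    logger.warning(message, extra={"context": context})


logger = logging.getLogger("auth_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

log_event(logger, "Login failed", user_id=42, reason="bad_password")
log_event(logger, "Login failed", user_id=7, reason="account_locked")

# Because the message is constant, grouping and querying are simple filters.
events = [json.loads(line) for line in handler.lines]
failed_user_ids = [e["user_id"] for e in events if e["message"] == "Login failed"]
print(failed_user_ids)  # [42, 7]
```

Had the user ID been written into the message string, each failure would look like a distinct message and the filter would need fragile string parsing.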
Add instrumentation to apps
The core of error monitoring is tracking the application from the perspective of the application, according to Rue. So, to use it, the cloud app developer needs to be able to add the instrumentation into the application. Typically that’s as simple as installing a Ruby gem, installing a package from npm or installing a kind of Java middleware, all tools most development teams have used. “But, at a high level, this requires buy-in from the developers to identify what there is, and then make sure that each component is instrumented,” Rue said.
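To give a rough sense of what instrumenting a component means, the sketch below wraps a function so that any exception is reported before it propagates. It is an assumption-laden illustration rather than any vendor's SDK: `instrument`, `charge` and the `reports` list are hypothetical, and in a real deployment the `report` callback would hand the data to a monitoring service.

```python
import functools


def instrument(report):
    """Build a decorator that passes exception details to `report`, then re-raises."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                report({
                    "function": func.__name__,
                    "error": type(exc).__name__,
                    "message": str(exc),
                })
                raise  # the application still sees the original error

        return wrapper

    return decorator


# Hypothetical reporter: collect reports in a list instead of sending them.
reports = []


@instrument(reports.append)
def charge(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return amount


try:
    charge(-5)
except ValueError:
    pass

print(reports)
```

Re-raising keeps the application's behavior unchanged, so the instrumentation observes errors without hiding them.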