Debugging complex applications has always been a chore for developers. This tends to be especially true with open source code, and OpenStack, it seems, epitomizes that challenge.
Some OpenStack users still struggle to debug or troubleshoot even simple errors. This is partly because the only real interface for OpenStack troubleshooting is a set of log files -- one or more for each major module -- that contain terse error messages. As one user wrote in the 2017 OpenStack User Survey, admins essentially "need to trawl" through logs and the source code to determine the cause of an issue.
Ultimately, admins want more information about problems and less raw data. In the short term, it would be a step forward for OpenStack to let admins view critical issues as triggered events within a system monitoring tool, with the ability to drill down to see relevant log data.
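To make the "summary first, drill down second" idea concrete, here is a minimal Python sketch. It assumes each log line follows a "date time pid LEVEL module message" layout; actual layouts depend on how each service's logging is configured, so the pattern would need adjusting. The function name, the regular expression and the sample lines are all hypothetical and are not part of any OpenStack tool.

```python
import re
from collections import defaultdict

# Assumed line layout: "<date> <time> <pid> <LEVEL> <module> <message>".
# Real deployments may use a different layout; adjust the pattern to match.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<pid>\d+) (?P<level>[A-Z]+) "
    r"(?P<module>\S+) (?P<message>.*)$"
)


def summarize_errors(lines, levels=("ERROR", "CRITICAL")):
    """Group the messages of matching log lines by the module that emitted them."""
    by_module = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            by_module[match.group("module")].append(match.group("message"))
    return dict(by_module)


# Hypothetical sample lines, invented purely for illustration.
SAMPLE = [
    "2017-05-01 12:00:00.123 1234 INFO nova.compute.manager [req-1] Instance spawned",
    "2017-05-01 12:00:01.456 1234 ERROR nova.compute.manager [req-2] Instance failed to spawn",
    "2017-05-01 12:00:02.789 5678 ERROR neutron.agent [req-3] Port binding failed",
    "2017-05-01 12:00:03.000 1234 ERROR nova.compute.manager [req-4] Instance failed to spawn",
]

if __name__ == "__main__":
    summary = summarize_errors(SAMPLE)
    for module, messages in sorted(summary.items()):
        # Top-level view: one line per module with an error count.
        print(f"{module}: {len(messages)} error(s)")
        # Drill-down view: the individual messages behind that count.
        for message in messages:
            print(f"    {message}")
```

The per-module count plays the role of the triggered event, and the indented messages are the drill-down. A real monitoring tool would add time windows, deduplication and alert routing on top of this.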
In the long term, it would be even better to add AI and a more graphical interface that can flag issues, suggest a probable cause and offer corrective actions.
While that seems like a distant hope, at least for now, there have been some advancements around OpenStack troubleshooting and debugging tools.
Ceilometer and Monasca
As small, initial OpenStack deployments expand to operational clouds, two major OpenStack projects, Ceilometer and Monasca, have tackled some data collection issues.
Of the two projects, Ceilometer is the more advanced. It collects telemetry data from OpenStack services, stores it in a Gnocchi time-series database and makes it indexable. This enables admins to use that data for both billing and debugging.
Monasca is a multi-tenant monitoring-as-a-service tool that enables IT teams to analyze log data and set alarms and notifications. Eventually, it should provide drill-down capabilities into the Gnocchi database to accelerate fault analysis.
Two other sub-projects focus on extending Ceilometer. Aodh creates policy-driven alarms for Ceilometer-generated data, while Panko captures OpenStack state data at a point in time.
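As a rough illustration of what a policy-driven alarm does, the following Python sketch evaluates a simple threshold rule against a series of per-period measurements. It does not use Aodh's API. The class, function and field names are hypothetical, and the rule ("every one of the last N periods must exceed the threshold") is just one plausible policy chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical alarm policy: fire when recent periods all exceed a limit."""
    threshold: float
    evaluation_periods: int  # number of most recent periods that must all breach


def evaluate(rule, period_values):
    """Return the alarm state for per-period values ordered oldest to newest."""
    if len(period_values) < rule.evaluation_periods:
        # Not enough history to judge the rule either way.
        return "insufficient data"
    recent = period_values[-rule.evaluation_periods:]
    if all(value > rule.threshold for value in recent):
        return "alarm"
    return "ok"


if __name__ == "__main__":
    # Hypothetical CPU utilization percentages, one value per period.
    cpu_rule = ThresholdRule(threshold=80.0, evaluation_periods=3)
    print(evaluate(cpu_rule, [50.0, 85.0, 90.0, 95.0]))
    print(evaluate(cpu_rule, [85.0, 90.0, 70.0]))
    print(evaluate(cpu_rule, [90.0]))
```

Requiring several consecutive breaches rather than a single one is a common way to keep short spikes from triggering notifications.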
Third-party tools aid OpenStack troubleshooting
While the OpenStack projects mentioned above are a step in the right direction, they aren't enough to significantly ease ongoing debugging. For now, teams will likely need third-party add-ons to fully parse OpenStack data and respond to any issues.
Datadog is one such tool. It enables teams to track, visualize and correlate metrics from both OpenStack and their applications. This helps admins spot and address any anomalies on their cloud platform. Grafana, an open source analytics and visualization tool, lets admins view trends within Ceilometer's time-series data, while Tata Communications' cloud inspector framework adds metadata to cloud instances to accelerate log searches. The vendor also has a long-term plan to integrate AI tools to provide automated discovery and corrective action.
The use of AI to monitor, debug and take corrective action as OpenStack scales is still in its early days. But pressure continues to mount to make increased automation in OpenStack obligatory. Containers will more than quadruple OpenStack virtual instance counts, while microservices and software-defined data centers will further extend the number of IT resources admins need to track -- and the speed at which they need to address performance issues.
Eventually, we will likely see expert chatbots that can guide infrastructure tuning and debugging, and debug-as-a-service tools will increasingly enter the market. Ultimately, these AI approaches will be admins' best bet for OpenStack troubleshooting and for more responsive monitoring in general. Expect more vendors to enter this space in 2018.