Many people in IT operations wonder how cloud observability differs from the monitoring tools and approaches enterprises already have in place. Some speculate that observability is just another attempt to stir industry interest with yet another buzzword.
But this is not the case. In fact, it's critical that IT teams understand the difference between observability and monitoring.
Cloud monitoring is a process used to track the condition of workloads and to measure quantifiable parameters related to overall operations. Monitoring provides specific, granular information, but it often lacks context.
Cloud observability, on the other hand, is essentially an attribute of a well-run system. It's about determining the health of an application by interpreting and assessing the state, or status, of a workload based on its externally visible properties.
In the context of the cloud, a system is observable if its operating state can be deduced from what can be measured. Without observability, you can't fully support lifecycle management.
Two approaches to observability
In order for an element of the cloud to be observable, it must provide some indication of its operation. It also must have specific operating states that can be identified based on the information that element provides.
Conceptually, observability is divided into two camps -- methodological and operating-state. The methodological approach relies on deduction and focuses on metrics, tracing and log analysis. The operating-state approach relies on tracking and focuses on state identification and state-event relationships. In short, methodological observability is an operations task; operating-state observability is really part of DevOps.
For the rest of this article, we'll mostly focus on methodological observability, which is the dominant cloud observability model today. It's largely based on root-cause analysis, which starts with ensuring operations teams have comprehensive logs. A metric identifies an issue and alerts users to begin a trace of conditions, which then identifies a fault and determines the scope of the problem. The team then knows what happened, why it happened and how it impacted operations.
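The metric-to-trace-to-log workflow described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's implementation: the log records, the trace IDs and the latency threshold are all assumed for the example.

```python
from dataclasses import dataclass

# Hypothetical log entry: each record carries a trace ID so entries from
# different services can be stitched into a single request path.
@dataclass
class LogEntry:
    trace_id: str
    service: str
    latency_ms: int
    level: str

LATENCY_THRESHOLD_MS = 500  # assumed alerting threshold

def find_slow_traces(entries):
    """Metric step: flag trace IDs whose total latency breaches the threshold."""
    totals = {}
    for e in entries:
        totals[e.trace_id] = totals.get(e.trace_id, 0) + e.latency_ms
    return {tid for tid, total in totals.items() if total > LATENCY_THRESHOLD_MS}

def trace_fault(entries, trace_id):
    """Trace step: within one flagged trace, locate the slowest hop."""
    hops = [e for e in entries if e.trace_id == trace_id]
    return max(hops, key=lambda e: e.latency_ms)

logs = [
    LogEntry("t1", "gateway", 40, "INFO"),
    LogEntry("t1", "orders", 610, "WARN"),   # the offending service
    LogEntry("t1", "billing", 55, "INFO"),
    LogEntry("t2", "gateway", 35, "INFO"),
]

for tid in find_slow_traces(logs):
    fault = trace_fault(logs, tid)
    print(f"trace {tid}: slowest hop is {fault.service} ({fault.latency_ms} ms)")
```

The point of the sketch is the sequence: a metric flags the problem, a trace narrows it to a component, and the log entry supplies the detail -- what happened, why and with what impact.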
Log analysis in an observability strategy
Log analysis is a critical piece of the methodological puzzle. If your system records significant changes in condition across your cloud elements, you can correlate and analyze that data to trace conditions and decide the next action.
Anyone who's ever done cloud monitoring has, in some way, adopted a form of methodological observability. They've already dealt with some of the major challenges, such as when system logs are kept in different places or with different format rules, which makes finding the right information and associating log entries difficult.
For the cloud, observability is complicated by all the moving parts. An application's resources can be transient, so an issue might disappear by the time you try to examine it. However, you don't need the problem to be persistent during your analysis, because the relevant information can be recovered through log analysis.
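To make the correlation idea concrete, here is a minimal sketch of normalizing two hypothetical log formats -- one JSON, one plain text -- into common records keyed by a request ID, so entries can be reassembled in time order even after the resource that emitted them is gone. The formats, field names and request IDs are assumptions for the example.

```python
import json
import re
from datetime import datetime

# A plain-text line like: "2021-06-01T10:00:03+00:00 [r9] job failed: timeout"
TEXT_LINE = re.compile(r"^(?P<ts>\S+) \[(?P<req>[^\]]+)\] (?P<msg>.*)$")

def parse_json_line(line, source):
    """Normalize a JSON-format log line into a common record."""
    rec = json.loads(line)
    return {
        "ts": datetime.fromisoformat(rec["time"]),
        "request_id": rec["request_id"],
        "source": source,
        "message": rec["msg"],
    }

def parse_text_line(line, source):
    """Normalize a plain-text log line into the same common record."""
    m = TEXT_LINE.match(line)
    return {
        "ts": datetime.fromisoformat(m.group("ts")),
        "request_id": m.group("req"),
        "source": source,
        "message": m.group("msg"),
    }

def correlate(entries, request_id):
    """Return all normalized entries for one request, in time order."""
    hits = [e for e in entries if e["request_id"] == request_id]
    return sorted(hits, key=lambda e: e["ts"])

api_log = '{"time": "2021-06-01T10:00:01+00:00", "request_id": "r9", "msg": "request received"}'
worker_log = "2021-06-01T10:00:03+00:00 [r9] job failed: timeout"

entries = [parse_json_line(api_log, "api"), parse_text_line(worker_log, "worker")]
for e in correlate(entries, "r9"):
    print(e["ts"].isoformat(), e["source"], e["message"])
```

Normalization is the step that makes the rest possible: once timestamps and request IDs share a format, the trace survives even if the container or function that wrote the entries no longer exists.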
This involves both tracing and correlation of log entries. To ensure log analysis is at the center of an observability strategy, you need to have search and trace products, such as Scalyr, in place. Companies like Elastic, Lightstep and Splunk specialize in observability, and VMware includes observability support in Tanzu. Dynatrace has an AI observability platform, and Arize AI has recently launched an AI-based platform as well.
Observability and cloud-native deployments
All of the tools above facilitate methodological observability, but they don't fully compensate for the complexity of a cloud-native deployment. There could be many microservices linked in a service mesh, deployed by Kubernetes across elastic cloud or hybrid resources.
To get the most from log management tools -- even those aimed at cloud observability -- you need to understand how an application is deployed. Otherwise, it's difficult to know where to start and what to look for. Even AI-based platforms have to be taught the specifics of a given cloud platform to be effective.
For some, it may be easier to address cloud observability with one of the vendor suites -- such as VMware Tanzu or Red Hat OpenShift -- because of the integrated relationship between log analysis and platform tools. Cloud suites may also be the focus of a future shift toward the operating-state model of observability.
The future of cloud observability
VMware's acquisition of Blue Medora, which provides application-aware monitoring, is an example of a trend to reimagine logs in a self-contextualizing way. This is likely the endgame for the methodological approach, and this sort of product could eventually wrap log analysis in something almost AI-like. However, this type of monitoring awareness would demand explicit understanding of state.
When observability first arose as an issue, proponents of the methodological approach argued it was impossible to know the state of a complex cloud system or its major components. However, there are increasing numbers of state-aware deployment tools and techniques. The most commonly used are declarative DevOps tools like Puppet and Ansible.
Declarative systems separate state determination from methodological specifics like log analysis, but you could feasibly use log analysis to define possible states and identify the current state of a cloud application. By understanding state, reported events could be processed in context.
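One way to picture this is a small sketch in which possible operating states are defined as predicates over a snapshot of log-derived values, and the current state is whichever predicate matches first. The state names, fields and threshold logic are all invented for illustration; in practice the snapshot would be produced by log analysis.

```python
# Hypothetical operating states, each defined as a predicate over a
# snapshot of values extracted from logs. Order matters: the first
# matching state wins, with "steady" as the fallback.
STATES = {
    "degraded": lambda s: s["errors"] > 0 and s["healthy_replicas"] > 0,
    "down":     lambda s: s["healthy_replicas"] == 0,
    "scaling":  lambda s: s["pending_replicas"] > 0,
    "steady":   lambda s: True,  # fallback state
}

def identify_state(snapshot):
    """Return the first state whose predicate matches the snapshot."""
    for name, predicate in STATES.items():
        if predicate(snapshot):
            return name

# Snapshot values would come from log analysis in a real system.
snapshot = {"errors": 3, "healthy_replicas": 2, "pending_replicas": 0}
print(identify_state(snapshot))  # "degraded" under these assumed values
```

With a state established this way, an incoming event -- say, a restart -- can be interpreted differently depending on whether the application is steady, scaling or already degraded, which is the context that pure log analysis lacks.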
This same approach was proposed a decade ago for large-scale carrier networks, but implementation of state-event-based observability lags far behind that of methodological models. This may change if state-event tools are enhanced in response to the complexity of cloud-native deployments.
The methodological framework for observability will inevitably change as applications become more complex. Whether it's an advanced application of AI or a shift to operating-state modeling, cloud users can expect improvements over the next several years and should be prepared to reevaluate their approach as a result.