Top 6 complexity challenges of operating a cloud at scale

It's not unusual for enterprises to scale their applications to meet demand, but they need to be aware of the associated cloud complexity issues. Look out for these six common problems.

Zachary Flower, Freelance web developer and writer

Published: 23 Sep 2020

Embracing cloud computing is hardly a unique experience for most companies, but there's a significant difference between living in the cloud and scaling in it.

As a cloud-native organization grows in size and complexity, its IT team will inevitably encounter a host of unfamiliar problems that make managing increased demand difficult and time-consuming.

Operating a cloud at scale can be extremely hard if your team isn't familiar with the common challenges associated with managing increased demand. Explore the following six issues -- challenges IT teams normally don't face with small-scale deployments -- to ensure your staff is completely prepared to deal with cloud scaling challenges.

1. Managing development environment costs

Most applications start as small, focused services that solve a single problem -- and solve it well. But, as a company grows, the needs of its customers multiply.

One feature becomes two, which becomes four, then eight and so on. This is manageable, but only to a point. With each additional feature, the complexity of the application infrastructure expands, as do the related costs.

Figuring out how to properly manage the costs of a growing development environment is nearly as difficult as figuring out how to manage the development environment itself. In fact, the two issues are intertwined.

Most companies start by giving engineers production-like development environments on their own machines. This keeps cloud costs down and ensures developers -- remote or not -- don't lose productivity to login issues, slow internet connections or outages.

However, this becomes impractical at a certain scale, and the organization needs to create cloud-based development environments with more dedicated resources. This simplifies onboarding, because new developers no longer have to reproduce the full stack locally, but it also reintroduces the access problems. Furthermore, the use of more cloud-based resources can become very expensive over time.

Finding a balance can be challenging, but the need for productivity and efficiency should outweigh the desire to achieve production parity. Here's a great place to start: Create and support processes that enable developers to run only a subset of the application infrastructure on their own machines, while filling in the gaps with mock or shared services.
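As a rough illustration of that idea, the Python sketch below decides, for each service, whether a developer runs it locally, points at a mock or falls back to a shared instance. The service names, URLs and the `resolve_services` helper are all hypothetical; a real setup would likely read them from configuration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ServiceEndpoint:
    name: str
    mode: str  # "local", "mock" or "shared"
    url: str


# Hypothetical URLs, for illustration only.
MOCK_URLS: Dict[str, str] = {"billing": "http://localhost:9001"}
SHARED_URLS: Dict[str, str] = {
    "billing": "https://billing.dev.example.com",
    "search": "https://search.dev.example.com",
}


def resolve_services(
    all_services: Iterable[str],
    run_locally: Set[str],
    local_port_start: int = 8000,
) -> Dict[str, ServiceEndpoint]:
    """Map every service to a local, mock or shared endpoint."""
    endpoints: Dict[str, ServiceEndpoint] = {}
    port = local_port_start
    for name in all_services:
        if name in run_locally:
            # The developer runs this service on their own machine.
            endpoints[name] = ServiceEndpoint(name, "local", f"http://localhost:{port}")
            port += 1
        elif name in MOCK_URLS:
            # Prefer a cheap mock over a shared cloud instance.
            endpoints[name] = ServiceEndpoint(name, "mock", MOCK_URLS[name])
        elif name in SHARED_URLS:
            endpoints[name] = ServiceEndpoint(name, "shared", SHARED_URLS[name])
        else:
            raise ValueError(f"no local, mock or shared endpoint for {name!r}")
    return endpoints
```

A developer working only on the API layer might call `resolve_services(["api", "billing", "search"], {"api"})`, which keeps one service on the laptop and fills the gaps with a mock and a shared instance.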

2. Evaluating platform-native tools

At some point during its expansion, every company must make a choice: Use the most convenient tool or use the most portable one. Open source tools and services enable companies to jump from one cloud to another as the needs of their workloads change. However, a platform-native tool can often solve an organization's problems more efficiently.

Yet vendor lock-in is a very real problem for IT teams, as not every cloud provider can meet the needs of every customer. Regardless of which cloud provider an organization chooses, it should tackle the problem with a this-and-that approach instead of a this-or-that approach. This can help identify the right tool for the job, rather than the right tool for the cloud.

This will be even more imperative as multi-cloud environments become more common. You'll have much more flexibility if you focus on the services that solve the right problem, while investing in the tools that tie those services together.
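One common way to invest in "the tools that tie those services together" is a thin interface that application code depends on, with a provider-specific adapter behind it. The sketch below is a minimal Python example under that assumption; `BlobStore`, `InMemoryBlobStore` and `archive_report` are hypothetical names, and a real adapter for a given cloud's object storage would implement the same two methods.

```python
from typing import Dict, Protocol


class BlobStore(Protocol):
    """The only storage operations the application is allowed to rely on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in backend; a cloud-specific adapter would expose the same methods."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: BlobStore, report_id: str, body: str) -> str:
    """Application code sees only the BlobStore interface, not a provider SDK."""
    key = f"reports/{report_id}.txt"
    store.put(key, body.encode("utf-8"))
    return key
```

With this shape, moving a workload to another provider means writing one new adapter rather than touching every caller.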

3. Testing for scale

There are different types of scale and, as a product grows, it is critical to understand all the different limitations. Sure, enterprises can guard against nonperformant database queries and introduce caching early, but it takes more than defensive development practices to test for scale. It's just as important to know how an application performs under increased traffic as it is to know how it performs under increased data. But how do enterprises test it?

Build tools early on that can exercise an application infrastructure. A production-like staging environment can be costly, but if you're able to hit your application with data and traffic that are representative of production, you can identify potential bottlenecks before they become real problems. Knowing the limits of your cloud environment is crucial. A proactive approach to scale is far less costly than a reactive one.
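A tool that exercises an application does not have to be elaborate to be useful. The following Python sketch, using only the standard library, fires a callable concurrently and summarizes the latencies. The `run_load` helper and its parameters are hypothetical, and the `request` callable stands in for whatever call reaches your staging environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, Dict


def timed_call(request: Callable[[], None]) -> float:
    """Return the wall-clock duration of one request, in seconds."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start


def run_load(
    request: Callable[[], None],
    total_requests: int,
    concurrency: int,
) -> Dict[str, float]:
    """Issue total_requests calls across `concurrency` threads and summarize latency."""
    if total_requests < 2:
        raise ValueError("need at least two requests to compute percentiles")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(request), range(total_requests))
        )
    # n=100 yields 99 cut points; "inclusive" keeps them within the observed range.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "count": float(len(latencies)),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(latencies),
    }
```

Running it at increasing `concurrency` values and watching where `p95` starts to climb gives a first, approximate picture of where a bottleneck sits.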

4. Breaking knowledge silos

One of the biggest cloud scaling challenges that comes from growing an engineering team is dealing with knowledge silos. When you're small, having "the API person" or "the database person" is not only convenient, it's efficient.

Individual experts in different parts of the stack allow for more consistent development in each area. With a small enough staff size, the nuances of each area are understood almost organically by the entire team. However, as the team grows with the cloud, those subject matter experts can become bottlenecks. Any change to the experts' areas of focus almost always requires their oversight to account for the large swathes of knowledge stored in their heads.

These knowledge silos, while great for job security, are dangerous. If any of these people leave the company, an immeasurable amount of information about the background and details of a critical piece of infrastructure would be lost.

Leaning into documentation is a great place to start. However, discoverability can become a problem. To truly break up a knowledge silo, more than one person needs to be accountable for it. Documenting a process or procedure is important, but you must determine the team or department that should manage it. Properly handing the documentation off to that team is crucial.

5. Gaining and maintaining visibility

In the early stages of most applications, monitoring, log aggregation, metrics and exception tracking are easy and generally inexpensive. Most cloud providers offer cloud-native tools to address these needs.

However, if those tools aren't good enough, any number of third-party and open source services offer more than enough power to meet early demand. But, as the need to operate a cloud at scale increases, the cost of these third-party options can be monumental. Managing them can also be very time-consuming.

When it comes to gaining visibility into your application infrastructure, it's always best to start small and grow from there. Monitoring is rarely a critical architectural component, so it should be easy to replace and modify as needed. The important thing to remember is that the data that you monitor needs to be usable. Beyond that, revisiting your monitoring early and often helps you stay ahead of the curve.
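One way to keep monitoring easy to replace is to have application code emit metrics through a small interface of your own, so that swapping the backend does not touch the callers. The Python sketch below assumes that approach; `MetricsSink`, `InMemorySink` and the metric name are hypothetical.

```python
from typing import Dict, List, Protocol, Tuple


class MetricsSink(Protocol):
    """The only metrics call that application code is allowed to use."""

    def emit(self, name: str, value: float, tags: Dict[str, str]) -> None: ...


class InMemorySink:
    """Stand-in backend; a vendor or open source adapter would replace it."""

    def __init__(self) -> None:
        self.points: List[Tuple[str, float, Dict[str, str]]] = []

    def emit(self, name: str, value: float, tags: Dict[str, str]) -> None:
        # Copy the tags so later changes by the caller do not alter stored points.
        self.points.append((name, value, dict(tags)))


def record_checkout(sink: MetricsSink, latency_ms: float, region: str) -> None:
    # Consistent names and tags are what make the data usable later.
    sink.emit("checkout.latency_ms", latency_ms, {"region": region})
```

Starting with an in-memory or log-based sink and replacing it later fits the start-small advice, since only the adapter changes.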

6. Avoiding the microservices dilemma

Managing a monolithic application at scale can be complicated, which is why so many companies dive headfirst into microservices at the first sign of trouble. But is this the right thing to do?

A successful service-oriented architecture requires careful planning and consideration. Going that route too quickly can lead to even more technical debt and heartache down the road.

If you don't have experience with breaking up a monolith into multiple services, the most important thing to do is define the contracts between each service and its clients, including other microservices, ahead of time. You cannot simply spin up a new service and define the communication details as you go; this is a time to overplan.

Establish a set of standards that every service must adhere to, from monitoring to authentication to protocol. It doesn't matter if you choose Prometheus, OAuth and REST so much as it matters that you make a decision, document it and enforce it.
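Enforcement is easier when the standard is written down in a form a script can check. The Python sketch below compares a service's declared choices against an organization-wide standard. The `ServiceStandard` type, the manifest format and the `find_violations` helper are hypothetical, and the values simply reuse the examples named above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceStandard:
    monitoring: str
    authentication: str
    protocol: str


# Example values taken from the choices mentioned in the text.
ORG_STANDARD = ServiceStandard(
    monitoring="prometheus",
    authentication="oauth",
    protocol="rest",
)


def find_violations(
    manifest: Dict[str, str],
    standard: ServiceStandard = ORG_STANDARD,
) -> List[str]:
    """Return one message per field where a service deviates from the standard."""
    violations: List[str] = []
    for field_name in ("monitoring", "authentication", "protocol"):
        expected = getattr(standard, field_name)
        actual = manifest.get(field_name)  # None if the service did not declare it
        if actual != expected:
            violations.append(f"{field_name}: expected {expected!r}, got {actual!r}")
    return violations
```

A check like this could run in a build pipeline for every new service, so a deviation is caught at review time rather than in production.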
