Managing a single cloud account is a challenge all by itself. Simply adding more instances and services to one account leads to problems ranging from keeping track of what is running to hitting the quota limits set by the providers. When you hit the logical limits of that account, expanding the footprint of your cloud strategy takes more than just spinning up a new account and dropping the corporate credit card on it. Both Azure and AWS have created logical ways to organize your accounts along project, product, purpose, team, or billing boundaries. Figure out how to organize and plan for your expanding cloud environment, and then learn the basics of using AWS or Azure structures and hierarchy to manage the growth of your cloud ecosystem. Enjoy this introduction to AWS Organizations, Control Tower, and Policies, along with an overview of Azure Management, Subscriptions, and Resource Groups, to effectively manage your workload.

Keywords

Some people glaze over or their blood pressure spikes when important “corporate-speak” words are used to describe IT, technology, or development efforts. But don’t worry, an explanation of how these words apply to cloud work helps everyone get comfortable. As a cloud engineer, being comfortable with these concepts is an important piece of what’s required to be successful. These are things that may influence day-to-day cloud work, especially in regulation-driven industries like health or finance. Please note that these are my best-effort opinions on how to explain them in plain language, and the meanings can vary based on the company or organization.

Governance

Governance is creating, applying, and ensuring conformance with the policies that overarching authorities set for an organization. It’s the word used to describe all the work that’s not directly related to the end product being built. Rather, it’s coming up with the rules and guardrails necessary to deal with any regulations or requirements for what you do.

Enterprise

This is often what people say when they mean “larger than a small business” (whatever that means). Sometimes it refers to an organization that has a thousand or more employees, or maybe $1 billion in revenue. But enterprise becomes most obvious when you are trying to purchase software or licenses, you scroll over on the pricing page, and what you need falls under “Contact Sales for a Quote” instead of a box with a set price per user in it. That’s when you’re “enterprise”.

Corporate

“Corporate” in quotes usually means something that has to do with running the business. For example, an internal mechanism for communicating with employees is a “corporate comms system”, and the HR software where you submit vacation days is “corporate software”. Any time someone doesn’t want to continue an argument about a policy like dress code, they often fall back to “it’s just corporate policy”.

Policy

Policy is a plain-language statement or edict about what an organization should do. A policy ties a business requirement to something that must be applied internally. For example, an IT policy might state that all data at rest should be encrypted. But it doesn’t provide the actions or steps to accomplish that policy.

Compliance

Compliance refers to the actions associated with ensuring that the systems or people adhere to the policy. So a manual intervention step for the example above would be “someone checks the console for each virtual machine and checks to see if the disk is encrypted, then records that value in a spreadsheet”. You start to see here how manual work to apply policies can become unwieldy for a team very quickly.
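To make the contrast with the manual spreadsheet concrete, here is a minimal sketch of the same check done in code. The `VirtualMachine` record and the inventory values are hypothetical stand-ins; in practice the encryption status would come from your cloud provider’s inventory tooling rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualMachine:
    # Hypothetical record: one row of what a person would otherwise
    # type into the compliance spreadsheet by hand.
    name: str
    disk_encrypted: bool


def find_noncompliant(vms: List[VirtualMachine]) -> List[str]:
    """Return the names of machines that violate the encryption policy."""
    return [vm.name for vm in vms if not vm.disk_encrypted]


# Made-up inventory for illustration only.
inventory = [
    VirtualMachine("web-01", True),
    VirtualMachine("db-01", False),
    VirtualMachine("batch-01", True),
]

noncompliant = find_noncompliant(inventory)
print(noncompliant)  # ['db-01']
```

The check itself is trivial. The value is that it can run on a schedule against every machine instead of depending on someone clicking through a console.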

Audit

An audit validates that you are in compliance with a policy. To continue the encrypted data example, if your compliance document says that you check and record a value, an audit will take a random sample of items, see if each item is encrypted, and check that the record exists. Use audits as an opportunity to learn, improve, and automate tasks. They are not to be feared and are an excellent mechanism to get “corporate” buy-in for critical work.
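The random-sampling idea can be sketched in a few lines. Everything here is an assumption made for illustration: `actual_state` stands in for what the auditor observes on the systems, `records` stands in for the compliance spreadsheet, and `run_audit` is a hypothetical helper, not part of any audit tool.

```python
import random
from typing import Dict, List, Optional, Tuple


def run_audit(
    actual_state: Dict[str, bool],
    records: Dict[str, bool],
    sample_size: int,
    seed: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Sample items and return findings as (item name, problem) tuples."""
    rng = random.Random(seed)
    names = sorted(actual_state)
    sampled = rng.sample(names, min(sample_size, len(names)))
    findings = []
    for name in sampled:
        # Check 1: is the item actually encrypted?
        if not actual_state[name]:
            findings.append((name, "disk is not encrypted"))
        # Check 2: does the compliance record exist at all?
        if name not in records:
            findings.append((name, "no compliance record found"))
    return findings


# Made-up data: what is really deployed vs. what was written down.
actual_state = {"web-01": True, "db-01": False, "batch-01": True}
records = {"web-01": True, "db-01": False}

findings = run_audit(actual_state, records, sample_size=3, seed=42)
for name, problem in findings:
    print(f"{name}: {problem}")
```

With a sample size equal to the whole inventory, every item is checked. A real audit would usually pick a smaller sample.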

Security

Security at the level of enterprise or corporate refers to more than just infosec or cybersecurity protections. It entails physical security for offices, training folks to deal with social engineering attacks, and planning for what happens when someone forgets to pick up their laptop after a TSA screening.

Risk

People will talk about attack vectors, the surface area for attack, and the risk profile of an application or environment. For example, a data center and a public cloud have completely different risk profiles and require different kinds of planning. Most teams or organizations get external help or consulting to validate their risk assumptions and help cover gaps.

Operational

Operational refers to how the work is getting done and what anyone in a team does to accomplish those goals. It’s the direct tactics and implementation of strategies presented by leaders in an organization. It’s the day-to-day execution of the work you’re doing and how you do it. It can range from how laptops get into the hands of people who need them, to what chat/video conference software is used, to how people log into source control and deploy code to an environment.

Performance

On the side of whatever you’re building for clients, customers, or end users, performance is about the metrics that should be met or exceeded. This can mean making a website load faster, creating a new feature for customers, or basically anything else that makes more money or makes the thing you’re building better.

Optimization

Optimization is more about internal improvement: how to ensure that your work runs as efficiently as possible. This can range from how you onboard new teammates, to changing cloud VM types to save money, making the code pipeline run faster, automating manual processes, or anything else that an end user may not see but that helps a team or person do their job better.

Scalability

For anything that needs to go from one to many, like apps, services, data storage, network connectivity, or traffic, you have to plan for what that looks like. This doesn’t mean you build environments to immediately support this kind of growth. Rather, it’s planning for what you would do when the growth happens.

Reliability

How do we ensure our growing environments continue to work as the scale increases? We start looking at multiple regions, availability zones, data centers, and cloud providers. We also start looking at Content Delivery Networks or other tools built to improve distributed systems.

Cloud Engineering

Congratulations! Now that you know the scope of the work we’re thinking about, you’re basically a cloud engineer. Even if we haven’t talked about anything technical yet, it’s critical to know how larger corporate entities think about how this stuff applies to the cloud. We have to start thinking about putting a box around the cloud and giving that box to developers and teams who need to get their work done in the cloud. Our job as cloud engineers is to solve problems for cloud users and make the job as straightforward as possible for those who need to use the cloud. And if you are going from one cloud account that everyone has access to over to a larger set of accounts/subscriptions, the move requires some thought.

One of the keys to thinking through and solving the issues of an expanding cloud is to pull in expertise from other areas. If you are able to, reach out to other teams to get the help you need with these common problems:

  • Networking teams to help with IP address management, virtual network configuration, and VPN or other connectivity solutions for cloud accounts.
  • InfoSec teams to help attach SSO solutions to the cloud, configure MFA for accounts, and make sure any network connectivity is properly secured and trusted.
  • DevOps-focused teams to create build processes for cloud environments, connect managed third-party platform services to the cloud accounts, and create infrastructure as code systems that build and deploy infrastructure alongside the applications being developed.
  • Management/Leadership to think about how the organization is structured and what boundaries might be necessary in the cloud. This includes how to bill for the cloud, how to decide which teams need cloud access, and how external corporate systems might need cloud account access.

Start drawing out your boundaries, like project teams, cost/billing codes, and organizational hierarchy. Figure out what data governance needs to exist to allow access to any data stored in the cloud and how to properly handle requests for access. These ideas are pulled from the Cloud Adoption Frameworks and Well-Architected Frameworks that the cloud providers publish.

Quotas

The first thing many people run into when hitting the limits of a single cloud account is quotas. There are still physical hardware limits in the data centers that cloud providers use, and there is no limitless pool of resources a cloud user can pull from to do their job. The presentation slides above have specific examples from the general quota documentation for the service providers.

AWS Control Tower

AWS Control Tower is an orchestration platform for building, configuring, managing, and maintaining AWS accounts in a tree/directory structure. It uses a combination of services like Organizations, Service Catalog, IAM, and S3 to do this, and it does all of it without direct human intervention once you start the setup process.

Control Tower, along with Organizations, is initially free, but once you set them up and start adding accounts, there are costs associated with CloudTrail, CloudWatch, Config, IAM, Lambda, S3, Service Catalog, and SNS.

Your Control Tower is managed and configured in a Landing Zone Dashboard in one account that you’ve designated as your management account.

Service Catalog is the underlying mechanism that sets up and enrolls the accounts that you create. The steps above show how to add a single new account at a time to populate your organizational structure. There are other options for setting up many accounts at once or pulling in existing accounts from outside of the structure.

Guardrails

The orchestration that AWS provides in this space becomes pretty powerful when you start to apply Guardrails and Service Control Policies (SCPs) to Organization OUs and/or accounts. There are a number of rules that can be applied. They fall under two types of guardrails across three categories. The two types are preventive and detective. A preventive guardrail actually prevents an action from happening through a written SCP, while a detective guardrail just reports noncompliance in the Control Tower dashboard. The three categories are:

Mandatory

There are a number of mandatory guardrails that do things like prevent the deletion of certain logs or prevent changes to IAM policies. They’re designed to protect the Control Tower configuration and structure so that accounts stay managed by the tool.

Strongly Recommended

These rules are also easily applied and should be used unless there’s a specific use case where they need to be disabled. They include things like disallowing actions by the root user, preventing public access to S3, and enabling MFA for the root user.

Elective

Elective guardrails like enabling versioning in S3 are good suggestions, and should be used unless there’s a specific problem.

Service Control Policies

Service Control Policies help to set limits on resources and actions within your AWS account structures. They are written in a JSON syntax to ensure that certain conditions are met for an action to occur.
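As an illustration of the JSON shape, here is a sketch that builds an SCP as a Python dictionary and prints it as JSON. It denies launching EC2 instances unless the instance type is in an approved list. The statement ID and the approved list are assumptions chosen for this example, not a recommendation for your environment.

```python
import json

# Assumption: the approved instance types are made up for illustration.
APPROVED_INSTANCE_TYPES = ["t3.micro", "t3.small"]

instance_type_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnapprovedInstanceTypes",  # hypothetical name
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # The Deny only applies when the condition is true, i.e. when
            # the requested instance type is NOT in the approved list.
            "Condition": {
                "StringNotEquals": {
                    "ec2:InstanceType": APPROVED_INSTANCE_TYPES
                }
            },
        }
    ],
}

scp_json = json.dumps(instance_type_scp, indent=2)
print(scp_json)
```

The printed JSON is the kind of document you would attach to an OU or account. Note that an SCP only sets the outer limit of what is possible; it does not grant any permissions by itself.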

AWS Organizations

AWS Organizations creates a structure to apply policies to sets of AWS accounts, as well as to consolidate billing and manage user access through SSO. You can create and customize this structure to fit the needs of your teams and cloud users. This allows the guardrails described above to be applied with the necessary restrictions. For example, you can create an SCP that prevents every account in a certain OU from using Marketplace images, then allow an “infrastructure” OU to create AMIs that have the required software on them for teams to use. You can prevent app/dev teams from creating networking items like VPCs, and create a centralized networking account that other accounts reach through a transit gateway to connect to resources. Essentially, you’re able to turn that plain-language policy into code that is applied to your Organization.
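The OU idea can be sketched as a small tree where a deny rule attached to an OU also applies to everything beneath it. This is a deliberately simplified model and not how AWS evaluates SCPs in full (real evaluation also requires an explicit allow at every level). The `OrgUnit` class, the OU names, and the `is_denied` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class OrgUnit:
    # Hypothetical model of an OU: a name, the action patterns its
    # attached SCPs deny, and a link to the parent OU.
    name: str
    denied_actions: List[str] = field(default_factory=list)
    parent: Optional["OrgUnit"] = None


def is_denied(ou: OrgUnit, action: str) -> bool:
    """True if this OU or any ancestor OU denies the action."""
    node: Optional[OrgUnit] = ou
    while node is not None:
        for pattern in node.denied_actions:
            # Compare case-insensitively and allow wildcards like "ec2:*".
            if fnmatchcase(action.lower(), pattern.lower()):
                return True
        node = node.parent
    return False


root = OrgUnit("Root")
app_dev = OrgUnit(
    "AppDev",
    denied_actions=["ec2:CreateVpc", "ec2:CreateInternetGateway"],
    parent=root,
)
team_a = OrgUnit("TeamA", parent=app_dev)  # inherits AppDev's denies
infrastructure = OrgUnit("Infrastructure", parent=root)

print(is_denied(team_a, "ec2:CreateVpc"))          # True
print(is_denied(infrastructure, "ec2:CreateVpc"))  # False
```

Placing the networking restriction on the app/dev OU rather than on individual accounts means a new team account picks up the rule as soon as it is created under that OU.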

Azure

Azure also has a management structure to support billing boundaries, management of resources, and applying policies to subscriptions and resources. In the slides I’ve provided above, you can see that there are three tiers of management (management groups, subscriptions, and resource groups) that support the governance items we’ve described already. At a high level, Azure subscriptions sit at the same level as AWS accounts: a subscription serves as a boundary for billing or access purposes. Resource groups exist under subscriptions and are containers that hold resources you want to share the same lifecycle. If there is a group of resources that you would update, deploy, or delete all at the same time, they should live in the same resource group.

Management Groups

Azure management groups provide a structure above the subscription level to apply governance/policy rules. Subscriptions within a management group inherit the rules from the groups they’re under.
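Inheritance through the three tiers can be sketched by walking from a scope up to the top of the hierarchy and collecting every policy assignment along the way. The scope names and assignment names below are made up, and `effective_assignments` is a hypothetical helper rather than an Azure API.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: each scope maps to its parent (None at the top).
PARENT: Dict[str, Optional[str]] = {
    "mg-root": None,                     # management group
    "mg-production": "mg-root",          # nested management group
    "sub-payments": "mg-production",     # subscription
    "rg-payments-api": "sub-payments",   # resource group
}

# Hypothetical policy assignments made directly at each scope.
ASSIGNMENTS: Dict[str, List[str]] = {
    "mg-root": ["require-cost-center-tag"],
    "mg-production": ["deny-public-ip"],
    "sub-payments": [],
    "rg-payments-api": ["allowed-locations"],
}


def effective_assignments(scope: str) -> List[str]:
    """Collect assignments from the top of the hierarchy down to scope."""
    collected: List[str] = []
    current: Optional[str] = scope
    while current is not None:
        # Prepend so that the highest-level assignments come first.
        collected = ASSIGNMENTS.get(current, []) + collected
        current = PARENT[current]
    return collected


print(effective_assignments("sub-payments"))
# ['require-cost-center-tag', 'deny-public-ip']
```

The subscription has no assignments of its own in this example, yet it is still governed by both rules set on the management groups above it.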

Azure Policy

Azure Policy is written in a JSON format and allows governance features similar to those previously described, like enforcing tags on resources or preventing the use of certain resource types. Policies can also be grouped together into policy initiatives, which can be assigned to management groups, subscriptions, or resource groups.
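Here is a sketch of what the two examples mentioned above might look like as policy definitions, built as Python dictionaries and printed as JSON. The display names, the `costCenter` tag, and the blocked resource type are assumptions picked for illustration.

```python
import json

# Deny any resource that is created without a costCenter tag.
require_tag_policy = {
    "properties": {
        "displayName": "Require a costCenter tag on resources",
        "policyType": "Custom",
        "mode": "Indexed",
        "policyRule": {
            "if": {"field": "tags['costCenter']", "exists": "false"},
            "then": {"effect": "deny"},
        },
    }
}

# Deny the creation of specific resource types (public IPs here).
block_resource_type_policy = {
    "properties": {
        "displayName": "Do not allow public IP addresses",
        "policyType": "Custom",
        "mode": "All",
        "policyRule": {
            "if": {
                "field": "type",
                "in": ["Microsoft.Network/publicIPAddresses"],
            },
            "then": {"effect": "deny"},
        },
    }
}

for policy in (require_tag_policy, block_resource_type_policy):
    print(json.dumps(policy, indent=2))
```

Both definitions follow the same if/then shape: the `if` block describes the resources the rule matches, and the `then` block states the effect to apply to them.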