DevOps in Action - Hello Steel Team!

Say hello to the Steel Team. Our focus area is supporting our DevOps teams with a combination of internal tooling (including our CI/CD pipeline) as well as Docker and the new stack tooling around Kubernetes. We are the blacksmiths of RightScale.

Get off my lawn

Let me start by saying I consider myself a systems administrator. Cloud infrastructure operations to be specific. Writing infrastructure as code, integrating services, and maintaining databases are the tools of my trade. So last year when it became increasingly obvious that the RightScale Ops team in its current structure would be no more and that instead I would be part of a smaller, fully integrated and more agile DevOps team, I felt a bit like the cantankerous Clint Eastwood character from Gran Torino and just wanted to yell, “Get off my lawn!”

I have since come to see the reasoning and the effectiveness behind the restructuring to a DevOps model. The change came about due to the ever-increasing quantity and complexity of new microservices developed on ever-tighter deadlines. Despite our strong interface contracts honed over the years and built on a foundation of deploying in standardized ways via Docker containers, it was becoming increasingly difficult for our teams - from Ops to Development to QA - to keep up with the pace and breadth of development without becoming blockers. Our CTO covers this in more detail in Embracing Radical DevOps.

Accepting change is hard for many people, myself included, but I’m looking forward to the future in our new structure and with my new team.

Hello Steel Team

Hephaestus

The name of our DevOps team is the Steel Team, and our “spirit animal” is Hephaestus, the Greek god of blacksmiths. Like Hephaestus, we build the “tools of war” that other RightScale DevOps teams use to deliver features and functionality to RightScale customers. Our focus area is a combination of internal tooling (including our CI/CD pipeline) as well as the new stack tooling around Kubernetes and not-so-new-to-RightScale Docker.

We are a small and scrappy team, running with scissors every day in pursuit of awesome ways to enable the other RightScale DevOps teams to accelerate development while continually increasing stability, so that they can maintain ownership of their full stacks. This is a tough challenge, and one we tackle largely through four approaches:

1. Disposable and Repeatable

RightScale has been in the cloud computing business for a long time. At this point we’re pretty good at end-to-end lifecycle management at the instance level, through a variety of tools we have built and our overall platforms. Two core RightScale philosophies are Disposability and Repeatability:

Disposability

Disposability means that no aspect of the services we provision and manage may rely on ad hoc logic that is not tracked elsewhere. For Steel, this makes infrastructure as code and high availability (HA) table stakes. Building on this, the notion of cloud instances being launched, patched, and upgraded in place via manual operations is largely foreign to us; replacement and upgrade strategies consist of launching replacement instances and bringing them into service via policies and orchestration. A valid strategy for troubleshooting an instance that is exhibiting issues is to launch a new instance and “replace” the problem one.

For example, in our current infrastructure stacks we use combinations of service discovery and singleton instances that are provisioned and launched through RightScale Cloud Management as configured arrays. As containers are deployed on these independent instances, they register with our service discovery layers, which dutifully update load balancers and bring them into service. Any instance anywhere in this stack that fails (no longer sending monitoring metrics of any sort) is removed from the discovery catalog, and a replacement is fired up and brought into service automatically.
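The control loop amounts to something like the following Go sketch. To be clear, the discovery and provisioning calls here are hypothetical stand-ins for RightScale Cloud Management and our service discovery layer, not real APIs:

```go
package main

import (
	"log"
	"time"
)

// Instance is one member of a configured array; LastSeen is the timestamp
// of its most recent monitoring metric.
type Instance struct {
	ID       string
	LastSeen time.Time
}

// An instance that has sent no metrics for this long is treated as failed.
const staleAfter = 5 * time.Minute

// reconcile drops failed instances from the discovery catalog and fires up
// replacements, mirroring the behavior described above.
func reconcile(fleet []Instance) []Instance {
	next := make([]Instance, 0, len(fleet))
	for _, inst := range fleet {
		if time.Since(inst.LastSeen) > staleAfter {
			log.Printf("instance %s stopped reporting; replacing it", inst.ID)
			deregister(inst)           // discovery updates the load balancers
			inst = launchReplacement() // a fresh instance from the same config
			register(inst)
		}
		next = append(next, inst)
	}
	return next
}

// Hypothetical stand-ins for the provisioning and discovery layers.
func deregister(i Instance)       {}
func register(i Instance)         {}
func launchReplacement() Instance { return Instance{ID: "replacement", LastSeen: time.Now()} }

func main() {
	fleet := []Instance{{ID: "web-1", LastSeen: time.Now().Add(-10 * time.Minute)}}
	log.Printf("fleet after reconcile: %+v", reconcile(fleet))
}
```

In production this logic lives in the platform and discovery layers rather than a standalone loop, but the shape is the same: detect, deregister, replace.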

When we moved to no-downtime releases a few years ago, this approach became central to our upgrade strategy. We now roll out new cloud images and updated packages via our orchestration layer, which handles the registration and removal, in the right order, of the member instances that make up the overall “fleet” of services.
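The no-downtime rollout is essentially an ordered version of that same loop: bring up a member on the new image, wait for it to report healthy, swap it into service, and only then retire its predecessor. A minimal sketch of the ordering, again with hypothetical stand-in calls rather than real APIs:

```go
package main

import "log"

// Instance stands in for a fleet member; see the previous sketch.
type Instance struct{ ID string }

// rollFleet replaces members one at a time so the service never drops below
// capacity during an upgrade. Every function it calls is a hypothetical
// stand-in for our orchestration layer.
func rollFleet(fleet []Instance) {
	for i, old := range fleet {
		repl := launchFromNewImage() // boots the new cloud image + packages
		waitHealthy(repl)            // block until it reports monitoring metrics
		register(repl)               // discovery brings it into service
		deregister(old)              // then drain the old member...
		terminate(old)               // ...and dispose of it
		fleet[i] = repl
		log.Printf("replaced %s with %s", old.ID, repl.ID)
	}
}

func launchFromNewImage() Instance { return Instance{ID: "new"} }
func waitHealthy(i Instance)       {}
func register(i Instance)          {}
func deregister(i Instance)        {}
func terminate(i Instance)         {}

func main() {
	rollFleet([]Instance{{ID: "web-1"}, {ID: "web-2"}})
}
```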

If you want to learn more about our release process, watch our on-demand webinars on how we configure our infrastructure stacks.

Repeatability

Repeatability builds on the Disposability approach: you don’t want your services updating to new versions that break other aspects of your stack. Keeping services updated and tested in an ongoing and (hopefully) painless way is a must.

For our new infrastructure stacks, which integrate Kubernetes as the scheduler, repeatability is critical. In one example, a recent change to Kubernetes added an option that must be passed when workers join the cluster using token discovery: a hash of the certificate authority from the master. This is a reasonable extra security measure, but between Kubernetes 1.8.x and 1.9.x the option moved from a suggestion to a breaking requirement. If your clusters are built around the “kubelet” systemd unit pointed at the “stable” branch, then one day your cluster comes up 1.8.x and works, and the next it comes up as a master running 1.9.x with no workers. Oops! For our RightScale product, we control when and how we move across even minor versions of software, so that no development team can run into a problem like this.
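For reference, the option in question is kubeadm join’s --discovery-token-ca-cert-hash, whose value is the SHA-256 digest of the cluster CA certificate’s Subject Public Key Info. Here is a small Go sketch that computes it; the CA path assumes a stock kubeadm layout, so adjust for your install:

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/hex"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Default CA location on a kubeadm-built master.
	data, err := os.ReadFile("/etc/kubernetes/pki/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		log.Fatal("no PEM data found in ca.crt")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	// The hash kubeadm expects is the SHA-256 of the certificate's
	// Subject Public Key Info, hex-encoded with a "sha256:" prefix.
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	fmt.Printf("sha256:%s\n", hex.EncodeToString(sum[:]))
}
```

Pin an explicit Kubernetes version, pass this value on kubeadm join, and the 1.8-to-1.9 surprise above goes away.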

2. Right tools for the job

New methods of development require new tools to enable them. We are doubling down on our use of containers to become a full-fledged Kubernetes + microservices architecture shop (almost entirely in Golang).

The charter of our team includes developing the tooling and lifecycle management for these Kubernetes clusters, which, if you’ve worked with Kubernetes at all, you know is an incredibly active and fast-moving codebase. One day’s “stable” API feature is the next day’s deprecated API, and keeping our build toolchains building consistently is no small feat. As part of this effort, we are challenged to revisit fundamental assumptions about what tooling we use. We continually evaluate and blend “best of breed” toolsets in combinations that best meet the needs of our new structure.

For example, one of Steel Team’s projects is a HashiCorp Terraform Provider for RightScale, which allows us to use RightScale where it makes sense but bring in community-supported providers for things RightScale does not provide, such as direct interaction with DNS providers (route53, dnsmadeeasy, etc.). This declarative, development-centric tool enables DevOps by making it possible to configure infrastructure in a consistent way from one source, while still leveraging our active management platform for orchestration, policy, and governance concerns.
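Terraform providers are themselves Go plugins, which makes this project a natural fit for our Golang focus. As a rough sketch of the shape of such a provider, using HashiCorp’s terraform-plugin-sdk (the rightscale_example resource and its schema below are hypothetical illustrations, not the real provider’s):

```go
package main

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	"github.com/hashicorp/terraform-plugin-sdk/v2/plugin"
)

func main() {
	plugin.Serve(&plugin.ServeOpts{
		ProviderFunc: func() *schema.Provider {
			return &schema.Provider{
				ResourcesMap: map[string]*schema.Resource{
					// Hypothetical resource for illustration only; see the
					// real provider repo for the actual resources and schema.
					"rightscale_example": {
						Schema: map[string]*schema.Schema{
							"name": {Type: schema.TypeString, Required: true},
						},
						Create: func(d *schema.ResourceData, m interface{}) error {
							// A real provider would call the RightScale API
							// here and record the resulting resource ID.
							d.SetId(d.Get("name").(string))
							return nil
						},
						Read:   func(d *schema.ResourceData, m interface{}) error { return nil },
						Delete: func(d *schema.ResourceData, m interface{}) error { return nil },
					},
				},
			}
		},
	})
}
```

In a Terraform configuration, resources from a provider like this sit side by side with route53 or dnsmadeeasy resources from community providers, which is exactly the mix-and-match described above.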

3. Examples, not ownership

At RightScale, the development teams ultimately own the services they run as well as the tooling used to deploy and maintain those services. Steel Team does not own the specific invocations of infrastructure or deployments. If we did, a given development team would be externally dependent on us, and Steel Team could become a blocker. If we are in the critical path at any level, then we have failed our mission of avoiding choke points and single points of failure in all aspects of the development lifecycle.

For example, Steel Team recently finished a long-running development cycle on Kubernetes Federation, producing recommendations and tooling to build into our current cluster offerings to enhance our overall HA and DR strategies. Our “product” was to deliver comprehensive documentation and runbook operations; to set up example and proof-of-concept deployments in our sandboxes; and to open pull requests directly against development teams’ production repositories to facilitate the changes.

We set up pairing meetings with each of the development teams to address concerns and tune the implementations. To date, one of the development teams has fully adopted the Steel Team tooling into its workflow while another has elected to consider it at a future date. The ability of individual teams to choose the tools they use is a key aspect of our new DevOps approach.

Steel Team generates artifacts and example environments that other development teams build into their environments with our guidance, but on their own timelines and at their own direction. In a very real sense, Steel Team “builds” tools that our internal “customers” bake into their own repos and can use, discard, or update as desired; Steel Team will not make mandates.

4. KISS (Keep It Simple, Stupid)

As with any codebase, the more complex infrastructure as code gets, the more likely someone is to make a mistake when modifying it. When dealing with infrastructure, mistakes can be subtle and compound quickly, and that’s Probably Not Good.

A shared, centralized codebase with deltas kept locally is certainly elegant and possible, but determining inheritance and handling shared responsibility and testing of changes between different installs become non-issues when each team has its very own independent copy of the infrastructure codebase. So we keep it simple by duplicating infrastructure codebases across development teams. This does not follow the spirit of the DRY (Don’t Repeat Yourself) principle, but we have determined the tradeoffs to be worth it at this point in time.

Work for RightScale

Wanna work on Steel Team or at RightScale in general? Join the party!