Moving Beyond DevOps

Steven Kim

We often refer to the developer platform domain as devops. There are people who hold the title devops engineer, belong to a devops team or organization, and work on devops systems and projects. And even for those of us who have renamed ourselves something else (SRE, platform engineer), I think it’s worth some introspection on whether that change actually goes beyond the rename. My goal here is to get us to recognize that we need to continue evolving the devops trek we started some 15 years ago, and to propose some characteristics I believe are important signals of whether we have done so sufficiently.

The context I’d like to anchor this introspection on is our new home in the cloud. That context is relevant to many of us these days, and it brings together two things that are prone to work against each other: a goal of increasing developer productivity and a meaningful increase in complexity.

Developer productivity is one of two reasons I’ve consistently heard for moving to the cloud, the other being elasticity/cost. Developer productivity translates to things like product velocity, quality of service, and developer satisfaction (recruiting, retention). Organizations that value or specifically leverage technology as part of their win condition are housing themselves in the cloud in search of accelerated product release cycles, services that are resilient and scale effectively, and cloud capabilities (observability, regionality) that enable product teams to run a higher quality of service.

Increased complexity with the cloud means new and more moving parts, which potentially mean new and more failure modes. It’s critical to orient ourselves effectively to build on, operate, and understand our new home. Wringing success out of these technical advancements is not only about the difficult hurdle of adopting the technology; it’s arguably more a matter of evolved organizational behavior built atop new lines of focus, accountability, and excellence. Together these amount to a new set of outcomes for each team and individual.

The intersection of the two can bring about secondary and deleterious effects. Asking developers to learn, architect for, and target new cloud primitives (e.g. Kubernetes, API gateways, logging, profiling, and monitoring techniques) takes a lot of time away from product development. If, with increased complexity, the quality of service starts to suffer (defects, outages), developers inevitably spend time fighting fires instead of improving the product. Similarly, the growing technical footprint takes time away from ops being able to understand the application better, ultimately hurting operational excellence. There is no way around the explosion of knowledge surface we are asking our teams to take on.

In these ways the devops model can struggle under diverging pressures. I believe that the better collaboration we honed through devops cannot, on its own, absorb the breadth and depth of technology being added, and so fails to take advantage of the elevated field that the teams are supposed to benefit from. In effect, we can falter on both fronts: developer productivity actually drops while we get kicked in the teeth by complexity.

A platform-oriented approach to developer productivity and to leveraging the benefits of the cloud means isolating the two interests so that separate, respective owners can focus on building excellence in their areas. It means successfully dividing the aforementioned complexity surface area and bringing the pieces back together through well-defined contracts. Developers focus the vast majority of their efforts on application concerns: the code in support of that application’s function (including frameworks and other processes directly in support of the application), configuration related to the application’s behavior, and the correct behavior of the two at runtime. Platform owners focus the vast majority of their efforts on building and operating the optimal cloud strategy in support of the company’s overall tech strategy. And out of necessity, due to the expansive surface area of technology required to accomplish the two, these are two separate teams.

[Figure: Qarik model chart]

The important part here is that the two are separate; developers do not concern themselves with cloud implementation concerns (controversial: including whether the runtime is VMs or dockerized atop Kubernetes), and platform owners do not concern themselves with the development or operations of the application.

Here are some examples of the two not being separate. Your platform engineers get first-line paged when the application starts to fail its response latency SLOs. Your developers determine and wrangle the ingress configuration for how traffic from end users is routed to the application runtime. Here’s a more difficult pair: when your application release fails because the target Kubernetes cluster ran out of memory resources, your developers look through the details of why this happened; when it fails because the container registry URI was incorrect in the manifest, your platform engineers get sent the logs of the failed pull.

In each of these examples, you’re doing it wrong because you’re pulling engineers into a situation they are not well equipped to address, and where they don’t own the SLOs (or any objectives/goals) for these problems. You’re spreading the problems that will occur in a way that blurs lines of ownership and accountability, which is the opposite of focus and excellence.
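One way to make those lines of ownership concrete is to write them down where the paging logic can use them. The following is only a minimal sketch in Python: the failure categories are taken from the examples above, but their names, the team names, and the `route_page` function are all hypothetical and do not correspond to any real paging system.

```python
from enum import Enum


class Owner(Enum):
    APPLICATION = "application-team"  # hypothetical team name
    PLATFORM = "platform-team"        # hypothetical team name


# Hypothetical failure categories based on the examples in the text.
# The mapping records who owns the objective, not who noticed the failure.
OWNERSHIP = {
    "latency_slo_breach": Owner.APPLICATION,            # application behavior at runtime
    "bad_registry_uri_in_manifest": Owner.APPLICATION,  # application configuration
    "ingress_routing": Owner.PLATFORM,                  # cloud implementation concern
    "cluster_out_of_memory": Owner.PLATFORM,            # platform capacity
}


def route_page(failure_kind: str) -> Owner:
    """Return the single team that should be paged first for a failure."""
    try:
        return OWNERSHIP[failure_kind]
    except KeyError:
        # An unmapped failure is itself a gap in ownership; surface it loudly.
        raise ValueError(f"no owner defined for {failure_kind!r}") from None
```

The design choice worth noting is that every failure kind maps to exactly one owner, and an unmapped kind raises an error rather than falling back to paging everyone.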

And stepping back to before episodic problems even occur, if you want reliability and quality of service for your application, the best people to drive that are the same people who built the application. They understand, better than anyone and certainly better than anyone who runs the underlying platform, the particular ways that the application can and tends to fail. They should be the only ones on the primary on-call rotation, and they should be supported and enabled with the tools to support their application without ever having to think about what might be happening in the underlying infrastructure.

Furthermore, your platform team should be treating the platform in the same manner as an application. They are developing and operating the platform. They are the only ones paged when the platform fails any of its contracts (the SLAs they agree to and own), because they know, better than anyone, the particular ways that the parts of the platform can and tend to fail.
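To illustrate what treating those contracts as something the platform team owns and is paged on might look like, here is a rough Python sketch. The `PlatformSLA` type, the objective names, and the target values are assumptions made up for illustration; they are not drawn from any particular platform or monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformSLA:
    name: str      # hypothetical objective name, e.g. "deploy_success_ratio"
    target: float  # minimum acceptable ratio, e.g. 0.99


def breached(slas, measured):
    """Return the names of SLAs whose measured value is below target.

    A missing measurement is treated as a breach, on the assumption that
    the platform team also owns the ability to observe its own contracts.
    """
    return [s.name for s in slas if measured.get(s.name, 0.0) < s.target]


# Hypothetical contracts and measurements for one evaluation window.
slas = [
    PlatformSLA("deploy_success_ratio", 0.99),
    PlatformSLA("cluster_api_availability", 0.999),
]
measured = {"deploy_success_ratio": 0.995, "cluster_api_availability": 0.998}

# Only the platform team would be paged for the names in this list.
to_page_platform_team = breached(slas, measured)
```

With these example numbers, only the availability objective falls below its target, so it is the only entry that would page the platform team.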

During my years on Google Drive, the engineer’s contract with the platform (named google3) was primarily the source repo (named Piper). The google3 platform provided, out of the box, the full spectrum of tools that our team needed to own our Drive services: Piper and the code review process, testing platforms, release management, release candidate and cherry-picking mechanisms, logging, monitoring and alerting, and various levels of runtime profiling to understand how our application was faring. Our team spent zero time building out these capabilities. Our runtime platform was Borg (analogous to Kubernetes), but very few of us understood how Borg worked and generally we never needed to. The platform notified the developers only of things that concerned them and that they could or needed to fix. During all my years I can’t recall a single time that I spoke directly to anyone on the teams who built all of this wonderful tooling.

There’s much more we can discuss. We can look at change cadences across the different areas (e.g. applications versus platform, code versus configuration) to ensure they are not disrupting one another. There is an ongoing evaluation of how far up the application stack the “platform” should go (i.e. being prescriptive about application frameworks, storage mechanisms, etc.) to strike the right balance of flexibility and supportability. How big should your platform team be, as a headcount ratio of the application developers supported? With the increased adoption of shared services, how do we deal with the real problem of scale (both technically and organizationally)?

So never mind what we call the team, function, or job (though ultimately I believe naming is critical to get alignment); let’s look at how we’re able to behave. Do lines of responsibility and accountability align with knowledge areas that are tenable for those teams to take on? Do those lines work with and take advantage of opportune areas of technical abstraction that isolate cadences, motivations, and expertise? Are we codifying and automating the predictable parts of how teams on either side of those lines should work together? To the extent that these are true, we will have successfully adopted and absorbed the incredible technological gains in the cloud, and can return much of our focus to where we started with devops: for the humans on those teams to effectively devote energy to communication, collaboration, and empathy.