Approaching Continuous Delivery with an SRE mentality

Steven Kim

I have always categorized Continuous Integration (CI) and Continuous Delivery (CD) differently.

Common technologies, automation for example, can drive innovation in both areas, but the two necessarily solve different problems and target distinct groups: CI is geared predominantly toward supplying engineering teams with early feedback on code quality, while CD is typically more aligned with end objectives, such as reliability and service quality.

Unless an organization is cognizant of these differences, it is unlikely to achieve the yield it expects from its CI or CD projects.

A primary goal of CD is reliability.

In fact, I would argue that CD projects should be relabelled as Site Reliability Engineering (SRE) projects. SRE has become a movement, so it is worth recalling what it means: a team or individual dedicated to making a product, service, or application reliable. Essentially, they sit on a reliability number or SLA and react when it drifts.
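As a rough illustration of what sitting on a reliability number and reacting to drift can look like, here is a minimal sketch in Python. It assumes reliability is measured as the fraction of successful requests over some window and compared against a target; the names `SloStatus` and `check_slo` and the default target of 99.9% are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    availability: float      # observed fraction of successful requests
    budget_remaining: float  # fraction of the error budget still unspent
    drifting: bool           # True when availability has fallen below target


def check_slo(total_requests: int, failed_requests: int,
              target: float = 0.999) -> SloStatus:
    """Compare observed availability with a target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    availability = 1 - failed_requests / total_requests
    # Failures the target tolerates over this window (the "error budget").
    allowed_failures = (1 - target) * total_requests
    budget_remaining = 1 - failed_requests / allowed_failures
    return SloStatus(availability, budget_remaining, availability < target)


if __name__ == "__main__":
    status = check_slo(total_requests=100_000, failed_requests=250)
    if status.drifting:
        print("Reliability has drifted below target; review the release process.")
```

The point of the sketch is the feedback loop rather than the arithmetic: the team that owns the number is the one that reacts when `drifting` becomes true.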

Engineering excellence in CD is not predominantly about technology. Deployments should be boring, and the CD platform should handle reliability so engineers can focus on product velocity and agility. Organizations often fail to realize that CD technology is a minor component while cultural changes should make up the lion’s share of focus.

This is illustrated when automating the release process, a common technology-orientated goal for CD. We all know that an organization should see faster releases by doing this. But what does this achieve if it isn’t aligned with your business goals, such as product agility and reliability? It supplies your teams with a faster and more efficient way to make the same old mistakes. Charts that climb steeply because the same failures are now being reached at speed are unlikely to be metrics anyone wants to present at their next management meeting.

CD is about AppDev teams owning their destiny end-to-end and instigating cultural change.

For example, a CD platform needs to allow for that ownership by transferring SLAs away from the centralized team, whether that’s the DevOps, Release or Change Management team, to the AppDev team or their dedicated SRE team. If there is a reliability issue, that team is empowered and held accountable, and it decides how to update its release process to prevent the issue from happening again.

A centralized team that automates a release process does not know all the ways to optimize reliability for a particular application and will find it challenging to scale. The reality is that the needs of an AppDev team in the ‘last mile’ are varied. Unfortunately, as we have found, this leads to organizations failing to meet reliability and agility targets across their services.

From my own experience, a successful approach revolves around a detailed knowledge of the software and application-specific issues. For example, Google Drive’s dedicated SRE team would advise the software engineering team on releases and use its insights in areas such as how the app uses memory, network requirements, and upstream and downstream dependencies to improve reliability. A centralized team wouldn’t know at what traffic level a specific resource is likely to come under stress, but that information is vital when releasing on a global scale.

Applying an SRE mentality does not mean you have to immediately recruit a dedicated SRE.

In a small organization, there isn’t anything stopping you from empowering someone on the AppDev team who is interested in reliability and has in-depth knowledge of the software to adopt an SRE mentality. From my experience, these individuals tend to be senior members of a team who have a broader understanding of the ecosystem and are more aligned with top-level business objectives and the team's health.

As the team scales, you can formalize the SRE role and, over time, build a dedicated SRE team for each application to achieve release reliability for your business.