When an enterprise organization makes the choice to implement DevOps, they’ve completed the first of many steps in a long journey. While the first inclination of many engineering and IT leaders will be to start choosing tools and deploying software stacks, they should instead focus on the most critical element of DevOps: people. Without the right culture and continuous personnel development, most DevOps initiatives are doomed from the start.
Part 1 of this series provided a high-level look at getting started with DevOps, emphasizing the need for a two-pronged approach: engaging an outside partner to help quickly iterate on the technical foundation, while taking the necessary steps to build a long-term foundation. This second part goes in depth on how enterprise organizations can set up their DevOps teams for success, focusing on the classic expression: “People > Process > Tools”.
Getting Started
For most enterprises, DevOps will represent a significant, foundational shift in how IT, operations, and software development are handled. When thinking about how to start such a major initiative with existing teams, the key is to start small. Rapid changes are jarring, and leaving little time for adaptation will result in unproductive and unhappy engineers.
What does “small” look like? The key is right there in the title: “DevOps”. Facilitating small conversations between the development and operations/IT teams is a crucial first step. Remember that the primary goal of DevOps is to facilitate faster, more efficient software deployments by breaking down the wall between software developers and operations staff. In legacy enterprise teams, where most of the interaction between these two groups might happen in tickets, fostering collaboration will take time and effort. Starting with more socially focused events, like happy hours or offsite events, can help to break the ice. For distributed/virtual teams, the recent COVID pandemic has given rise to “virtual” event providers, so remote staff aren’t left out of opportunities for social interaction.
Going beyond the social and into the technical, the next step is to adopt one to two practices that allow the team to collaborate on a shared issue while not demanding long-term resource investment. A shared root-cause analysis meeting between development and operations on the next outage post-mortem is a great way to put the two teams together on a shared problem space, while potentially providing fertile ground for future project ideas.
Ideally, post-mortem engagements help highlight shared problem spaces between operations, development, and the product groups. Perhaps there is a particular system or feature that’s causing customers to experience excessive latency, or a back-end generates a lot of error messages. The development teams may lack visibility into the operational performance of their code during runtime, while operations lack the insight into the “nuts and bolts” of the underlying code. Product management will be able to deliver specific insights on customer needs and their perception of overall experience and usability. Together, each group can augment the other’s strengths in evaluating the issue, and ultimately deliver a joint solution.
The next step after a successful collaboration is to start a serious dialogue with leadership teams. Ideally, a project was chosen that has customer-facing exposure: a win here acts as currency across technical and non-technical stakeholders. Being able to demonstrate value in this way will get the buy-in and commitment needed for a long-term DevOps initiative.
People
Once leadership buy-in is established, the next step is to start laying the foundation for a successful transition to DevOps culture. However, this section isn’t going to cover the latest CI/CD platform or container technology. As alluded to in the introductory paragraph: People are the foundation for success in DevOps. Shared objectives are critical for getting teams on the same page.
While having (and hiring) the right people is crucial, they can’t be effective without the right organizational alignment and shared goal structure in place. Bridging the gap between traditionally separate teams that shared few, if any, KPIs and objectives can seem like a daunting task. While a post-mortem can be a good kick-start for a quick win, the longer-term goal is to “shift operations left”. Shifting left means that operations teams are involved with development and product teams far earlier in the development life-cycle of an application or service. In legacy enterprise organizations, operations and infrastructure teams typically didn’t begin their work until a lengthy development phase was nearing its conclusion. This cold hand-off from development to operations typically resulted in delayed releases, as instrumenting a long list of new features and functionality without shared feedback loops generates ample operational toil. Ops teams that integrate early in the development life-cycle can catch small problems before they become production issues, and can help guide development teams around performance and monitoring best practices.
Another key to driving a DevOps culture is to implement shared on-call duties. On-call, typically relegated to an ops duty, should be something that both development and operations (or DevOps) teams participate in. If only one team is responding to alerts, sometimes in the middle of the night, it can foster a sense of resentment and an “us vs. them” mentality, particularly if repeat operational issues are not addressed quickly and languish in the backlog in favor of new feature work. Developers who have to respond to a page at 2 AM are far more likely to promptly resolve a nagging bug or performance problem! Ultimately, the primary goal is to foster a sense of shared ownership over both problems and their resolution. This kind of collaboration will not only improve morale and culture, but will also likely have a positive impact on ever-important metrics like Mean Time To Resolution (MTTR).
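To make the MTTR metric concrete, here is a minimal sketch of how it could be computed. It assumes MTTR is measured as the average time from detection to resolution; the incident timestamps and the function name `mean_time_to_resolution` are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 3, 30)),      # 90 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 45)),  # 45 minutes
    (datetime(2024, 1, 21, 9, 15), datetime(2024, 1, 21, 10, 0)),   # 45 minutes
]


def mean_time_to_resolution(records):
    """Return the average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this number before and after introducing shared on-call is one way a team might check whether the change is paying off, although the exact start and end points of an incident vary between organizations.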
To close the loop on post-mortems: it is critical that outages, with rare exception, are treated as process failures rather than personal ones. People, by and large, do not cause outages. Poor training; incomplete, confusing, or incorrect documentation; uncaught technical and performance issues; and misaligned expectations cause outages. Post-mortems are meant to be a collective reflection on how not only the technical, but also the cultural and procedural aspects of the organization can be improved. Post-mortems should identify any gaps in documentation or process, and action items should always be assigned to both technical and non-technical stakeholders. Follow-up meetings are essential to ensure there is progress and repeat outages are avoided.
Process
Another key aspect of success in a DevOps transition is having the right processes in place. The overarching theme of DevOps, practically its mantra, is “Continuous Improvement”. Whether it’s the actual code writing, on-call, documentation writing, post-mortems, or development strategy, every opportunity should be taken to make small, consistent improvements.
Organizations can look to operations teams to help lead the way on making small improvements. Ops teams are frequently subject to manual work and processes, often referred to as “toil”. An example might be patching a fleet of servers to handle a new version of the core software product. However, they’ve often developed a plethora of small scripts, automation, and tools to help make this type of work easier. Look to these teams to suggest and help integrate this automation further upstream in the development process. Better still, they may be able to reduce technical debt by identifying software or systems that are extraneous and can be removed, simplifying administration and reducing cost.
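As an illustration of the kind of small automation an ops team might already have, the sketch below patches a fleet in small batches and stops as soon as a batch contains a failure. The host names, the `patch_in_batches` helper, and the `fake_patch` callable are all hypothetical; a real script would call whatever patching tooling the team actually uses.

```python
def patch_in_batches(hosts, patch_host, batch_size=2):
    """Patch hosts a few at a time; stop after the first batch with a failure.

    `patch_host` is any callable that takes a host name and returns True on
    success. Hosts in later batches are left untouched once a failure is seen.
    """
    patched, failed = [], []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            if patch_host(host):
                patched.append(host)
            else:
                failed.append(host)
        if failed:
            break  # limit the blast radius before continuing the rollout
    return patched, failed


def fake_patch(host):
    """Stand-in for a real patch command; pretends web-03 fails."""
    return host != "web-03"


fleet = ["web-01", "web-02", "web-03", "web-04", "web-05"]
patched, failed = patch_in_batches(fleet, fake_patch)
print(patched)  # ['web-01', 'web-02', 'web-04']
print(failed)   # ['web-03']
```

Passing the patch step in as a callable keeps the batching logic testable on its own, which makes a script like this easier to share with development teams further upstream.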
Operations teams can also help plan and implement “game days”. Game days serve as an excellent simulation for testing how both the technology and staff of an organization will react during an outage. Simulated failures can help highlight potential blind spots in monitoring, surface previously unknown performance issues, and identify gaps in documentation and process. A common thread among high-performing DevOps teams is a focus on operational excellence, developed through exercises like game days.
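The technical half of a game day can be sketched in a few lines: fail each service in turn and record which failures never raised an alert. Everything here is a toy stand-in under stated assumptions; `FakeService`, `Monitor`, and `run_game_day` are hypothetical names, and a real exercise would inject failures into actual systems and observe the real monitoring stack.

```python
from dataclasses import dataclass, field


@dataclass
class FakeService:
    """Stand-in for a real service whose health can be toggled."""
    name: str
    healthy: bool = True


@dataclass
class Monitor:
    """Stand-in for a monitoring system that only watches listed services."""
    watched: set
    alerts: list = field(default_factory=list)

    def check(self, service):
        # An alert fires only if the service is both watched and unhealthy.
        if service.name in self.watched and not service.healthy:
            self.alerts.append(service.name)


def run_game_day(services, monitor):
    """Fail each service in turn; return the ones whose failure went unnoticed."""
    blind_spots = []
    for service in services:
        service.healthy = False           # inject the simulated failure
        alerts_before = len(monitor.alerts)
        monitor.check(service)
        if len(monitor.alerts) == alerts_before:
            blind_spots.append(service.name)
        service.healthy = True            # restore before the next scenario
    return blind_spots


services = [FakeService("checkout"), FakeService("search"), FakeService("billing")]
monitor = Monitor(watched={"checkout", "billing"})
blind_spots = run_game_day(services, monitor)
print(blind_spots)  # ['search']
```

In this toy run the unmonitored "search" service is the blind spot, which is exactly the kind of finding a game day is meant to turn into a follow-up action item.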
Shifting focus back to development, moving from legacy development patterns like Waterfall to something like Agile will provide a much more DevOps-friendly development cadence. Extending the theme of continuous improvement, Agile is built around the idea of small units of change that are easily isolated, tested, and evaluated. New features can be released individually in a shorter time frame, giving operations teams a better opportunity to contribute and provide actionable feedback for developers. Traditional Waterfall releases meant the operations team had to test, integrate, and deploy potentially hundreds or thousands of changes, leaving insufficient time to properly test and validate each one. With Agile, each change is a new opportunity for improvement.
People and Culture are the Foundation
The latest and greatest in tools and technology won’t take engineering teams far if they’re not built on a solid foundation of people and process. Together, people and process form the culture of DevOps.
Once an organization has the right people doing the right things, it’s finally time to optimize the tools and platform. In the next and final part of this series, we’ll look at the technology choices that can help conclude the launch of a successful DevOps culture.
Akava would love to help your organization adapt, evolve and innovate your DevOps initiatives. If you’re looking to discuss or implement any of these processes, reach out to [email protected] and reference this post.