Define the goal
As a cross-cutting concern, stability often conflicts with other objectives such as cost optimization and rapid delivery of new features. That makes it important to set concrete goals for stability. We use key performance indicators such as availability, error rate and MTTR, which ideally were already considered during product development. Continuously measuring outcomes and transparently sharing results is the basis for some of the feedback cycles that we establish between development and operations.
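To make the indicators named above concrete, here is a minimal sketch of how they could be computed from incident records and request counts. The `Incident` type and all function names are hypothetical, MTTR is read here as "mean time to recovery", and incidents are assumed not to overlap; real monitoring stacks will define these figures in their own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one outage (assumed not to overlap with others)."""
    start: datetime  # when the outage began
    end: datetime    # when service was restored


def total_downtime(incidents) -> timedelta:
    """Sum of all incident durations."""
    return sum((i.end - i.start for i in incidents), timedelta())


def availability(incidents, period: timedelta) -> float:
    """Fraction of the observation period during which the service was up."""
    return 1 - total_downtime(incidents) / period


def mttr(incidents) -> timedelta:
    """Mean time to recovery: average incident duration."""
    if not incidents:
        return timedelta()
    return total_downtime(incidents) / len(incidents)


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


if __name__ == "__main__":
    # Illustrative data only: two outages (1 h and 3 h) in a 30-day period.
    incidents = [
        Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),
        Incident(datetime(2024, 1, 17, 22, 0), datetime(2024, 1, 18, 1, 0)),
    ]
    period = timedelta(days=30)
    print(f"availability: {availability(incidents, period):.4%}")
    print(f"MTTR:         {mttr(incidents)}")
    print(f"error rate:   {error_rate(5, 1000):.2%}")
```

Publishing figures like these on a regular schedule is one way to feed the feedback cycles between development and operations with shared, comparable numbers.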
According to Conway's Law, organization and technology cannot be completely separated from each other: our goal is to drive technical decoupling in the long run by decoupling teams.
The stability of digital products is a systemic problem: it can rarely be achieved by one-off, selective improvements. Therefore, we work simultaneously on the levels of technology, processes and culture in order not only to achieve a level of quality but to establish it sustainably.
We use risk models from Business Continuity Management to identify the biggest problem areas from a customer and company perspective and to quantify their impact. Existing models of service dependencies and CMDBs help to do this.
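As an illustration of what quantifying impact can look like, the sketch below ranks problem areas by expected annual loss (likelihood times impact per event). This is a common simplification and not necessarily the risk model used in any particular Business Continuity Management practice; the `Risk` type, its field names and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """Hypothetical risk entry; all values are estimates supplied by the team."""
    name: str
    events_per_year: float   # assumed expected number of occurrences per year
    impact_per_event: float  # assumed cost per occurrence, in any one currency

    @property
    def expected_annual_loss(self) -> float:
        # Simple expected-value model: frequency multiplied by impact.
        return self.events_per_year * self.impact_per_event


def rank_risks(risks):
    """Return risks ordered from highest to lowest expected annual loss."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


if __name__ == "__main__":
    # Illustrative figures only.
    risks = [
        Risk("checkout database outage", 0.5, 200_000),
        Risk("search index lag", 12, 2_000),
        Risk("payment gateway timeout", 4, 30_000),
    ]
    for risk in rank_risks(risks):
        print(f"{risk.name}: {risk.expected_annual_loss:,.0f} per year")
```

In practice the likelihood and impact estimates would come from incident history and from the service dependency models and CMDBs mentioned above, rather than being entered by hand.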
Complex systems don't like radical changes; all too often, such changes only replace known problems with unknown ones. We use the Deming cycle ("plan-do-check-act") to continuously drive stability improvements in a rapid succession of small steps. By observing everyday system behavior as well as simulating various conditions (load tests, failover tests, game days), we generate input for further optimizations.
Technology and organization need to be aligned in order to achieve stability and scalability in digital products. Our framework of practices is designed to provide the success factors needed to complete transformation projects and to move from crisis to stability.