A platform team's debt trap: how to conquer day 2 operations
Sep 27, 2024
It’s easy to get caught up in the excitement of shipping a platform service. We celebrate when a new service goes live as the reward after many months of planning, coding, and testing.
But what happens after we press deploy and everything settles? Welcome to Day 2.
Day 2 is the critical, yet often overlooked, phase of a platform capability’s life cycle. It is when a myriad of issues can arise out in the wild, as we face real users, real data, and real-world conditions. You’d better buckle up as we manage new releases, security patches, and a whole lot more. For platform teams, managing this operational and support burden effectively is critical to continually delivering value to your organization.
In this post, we dive into what ‘day 2’ operations are, why these challenges are crucial yet frequently underestimated, and how to approach them in practice. In the world of platform, the launch is only the beginning. How you navigate your support and operations defines your success.
What are Day 0, Day 1, and Day 2 operations?
Let’s define what we mean by ‘day 2’, compared with Day 0 and Day 1.
Day 0 - Planning and design: How you architect services, choose technologies, define requirements, and set milestones.
Day 1 - Initial deployment and setup: Includes performing infrastructure setup, deploying your service, and doing initial user onboarding.
Day 2 - Ongoing operations and maintenance: Covers implementing security patches, fixing bugs, integrating user feedback, and optimizing performance.
With these definitions in mind, let’s move on to why ‘day 2’ operations pose an existential threat to platform teams, and look at ways to mitigate these issues.
Your platform team is a strategic investment
A platform is a strategic investment for an organization. Your platform is expected to deliver value by enabling faster development, improving efficiencies, and providing a stable foundation. To do so, a platform team must scale its impact non-linearly. For example, if 1 platform engineer can serve 10 engineers, 2 platform engineers should serve 50. In short, each new member of a platform team should lead to outsized impact.
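One way to make the non-linear scaling idea concrete is to track a leverage ratio: developers served per platform engineer. The sketch below is illustrative only. The function names are hypothetical, and the figures are the ones from the example in the text.

```python
def leverage(developers_served: int, platform_engineers: int) -> float:
    """Developers served per platform engineer (a hypothetical metric)."""
    if platform_engineers <= 0:
        raise ValueError("platform_engineers must be positive")
    return developers_served / platform_engineers


def scales_non_linearly(before: tuple[int, int], after: tuple[int, int]) -> bool:
    """True when growing the team increased leverage per engineer.

    Each tuple is (developers_served, platform_engineers).
    """
    return leverage(*after) > leverage(*before)


# Figures from the example in the text: 1 engineer serves 10, 2 serve 50.
print(leverage(10, 1))  # 10.0
print(leverage(50, 2))  # 25.0
print(scales_non_linearly((10, 1), (50, 2)))  # True
```

If doubling the team only doubled the developers served (for instance from 10 to 20), leverage would stay flat at 10 and the check would return False.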
The main risk to scaling a platform team is poor or non-strategic bets on platform capabilities. More specifically, the problem is often not what is prioritized, but how the team invests in a capability and whether it assesses the service’s total cost of ownership. Getting this wrong turns into wasted time. Total cost of ownership includes ongoing code maintenance, operational burden, security patching, and incorporating user feedback. Each opportunity for platform investment must be weighed against the predicted cost and operational burden of rollout.
A failure to assess the total cost of ownership of new services erodes your return on investment (ROI). Platform teams are particularly at risk when engineers jump to implementing solutions without considering the ‘2nd order effects’: the indirect consequences, such as operational cost, that can overshadow a feature’s benefits.
For these reasons, day 2 issues pose a significant challenge to platform teams, because they undermine the team’s ability to prioritize. The constant need to address scalability, reliability, and performance issues consumes a significant share of the team’s resources, including a considerable support burden. This leads to a situation where urgent operational concerns take precedence over planned feature development or strategic improvements. As a result, platform teams find themselves in a ‘reactive mode’, struggling to balance immediate needs with long-term goals, or worse, they become effectively ‘operationally bankrupt’.
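To illustrate how operational load crowds out planned work, here is a minimal sketch that subtracts the weekly operations hours of each service from a team’s capacity. The service names, hour estimates, and the “bankrupt when nothing is left” threshold are all assumptions made up for this example, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    ops_hours_per_week: float  # patching, support, incidents (assumed estimate)


def strategic_hours(team_hours_per_week: float, services: list[Service]) -> float:
    """Hours left for planned work after day 2 operations are paid for."""
    ops = sum(s.ops_hours_per_week for s in services)
    return team_hours_per_week - ops


def is_operationally_bankrupt(team_hours_per_week: float, services: list[Service]) -> bool:
    """True when operations consume all available team capacity."""
    return strategic_hours(team_hours_per_week, services) <= 0


# Hypothetical team: 3 engineers at 40 hours per week.
services = [
    Service("ci-runners", 30),
    Service("internal-registry", 25),
    Service("dev-environments", 45),
]
print(strategic_hours(120, services))            # 20
print(is_operationally_bankrupt(120, services))  # False

# One more self-managed service tips the team into reactive mode.
services.append(Service("secrets-manager", 25))
print(is_operationally_bankrupt(120, services))  # True
```

Even this crude model shows the trap: each service looks affordable on its own, while the sum quietly consumes the capacity meant for strategic work.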
Strategic use of ‘vendor engineering’ and managed services
One significant point of leverage for platform teams is the strategic use of vendors and managed solutions to effectively ‘outsource’ parts of the platform’s operational burden. Charity Majors puts the need very well in her article on the future of ops jobs, advising that you:
evaluate vendors and their products effectively. Ask probing questions to gauge compatibility and fit. Determine areas of friction you can live with, and dealbreakers.
quantify the cost of your and your team’s time and labor. Be ruthless about shedding as much labor as possible in order to focus on your core business.
Learn to manage the true cost of ownership and advocate and educate internally for the right solution, particularly by managing up to execs and finance folks.
When assessing a new service, it’s important to consider what solutions already exist. This is often framed as a ‘build vs buy’ decision, but it is rarely that simple. Some purchased solutions also require build work, making them both ‘build’ and ‘buy’. Seemingly free or self-hosted open-source solutions can sidestep licensing or hosting costs, but can still incur the aforementioned ‘day 2’ costs. We must consider the full total cost of ownership, not just licensing or hosting costs.
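A rough total cost of ownership comparison can make this trade-off visible. The sketch below assumes a simple model: upfront build labor, plus yearly operations labor, plus yearly licensing or hosting fees. Every name and number here is hypothetical and chosen only to show the arithmetic; real estimates will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    build_hours: float           # one-off integration or implementation effort
    yearly_ops_hours: float      # patching, upgrades, support, incidents
    yearly_license_cost: float   # licensing and/or hosting fees


def total_cost_of_ownership(option: Option, hourly_rate: float, years: int) -> float:
    """Labor plus fees over the given horizon (simplified, no discounting)."""
    labor_hours = option.build_hours + option.yearly_ops_hours * years
    return labor_hours * hourly_rate + option.yearly_license_cost * years


# Hypothetical inputs for illustration only.
self_hosted = Option("self-hosted open source", 400, 600, 5_000)
managed = Option("managed service", 80, 50, 40_000)

for option in (self_hosted, managed):
    cost = total_cost_of_ownership(option, hourly_rate=100, years=3)
    print(f"{option.name}: {cost}")
# self-hosted open source: 235000
# managed service: 143000
```

With these made-up inputs, the option with the lower licensing bill costs more overall once day 2 labor is counted. Different inputs can reverse the result, which is why the estimate is worth doing per capability.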
Some vendors even offer hybrid self-hosted and managed solutions which separate self-hosting requirements from self-managing. For instance, Gitpod supports this hybrid operational model, providing customers the ability to self-host Gitpod whilst managing operational concerns with a shared operational model. We find that a growing number of platform teams are becoming aware of solutions with this type of model, as they understand the significant total cost of ownership that comes with trying to build and manage all of your platform architecture.
Platform Day 2 is where the ‘real work’ begins
The journey of a platform team doesn’t end with deployment. Day 2 operations represent the true test of a platform’s long-term value. By acknowledging and preparing for the challenges that arise after launch, platform teams can position themselves for sustainable success.
What can you practically do?
Assess the total cost of ownership for each new platform capability, considering not only implementation, but also the ongoing maintenance and support.
Strive for a platform team that can scale its impact non-linearly, serving an ever-growing number of developers efficiently.
Find the right balance between addressing immediate operational concerns and pursuing long-term strategic improvements.
Leverage managed services and use ‘vendor engineering’ strategically to offload operational burdens where it makes sense.
Regularly reassess your platform’s components and remain open to hybrid solutions that can balance control and operational efficiency.
A successful platform is not just about the technologies you choose or the features you build; it’s about creating a sustainable ecosystem that continuously delivers value to the organization. By embracing the realities of day 2 operations, platform teams can transform pitfalls into opportunities for growth and innovation. The true measure of platform success is not your initial deployment but how you adapt in the face of real-world challenges.
Day 2 is where the real work begins, and it’s where platform teams sink or swim.