Platform, Fires, and You: Navigating the Fine Line Between Operations and Development


The Old-School Operations Role: Backbone or Bottleneck?

In the early days of IT, the operations team was the unsung hero—the silent, and often siloed, force that kept everything running. They were responsible for the infrastructure: servers, databases, and networks that powered the business. They managed deployments, monitored systems, and ensured uptime. If it was working, no one noticed them. If it wasn't? Well, then the questions started: "What are you doing all day?" followed by frantic demands to "fix it NOW." Enter the world of firefighting.

But here's the thing: Firefighting is not something anyone wants to do. It's reactive, stressful, and usually a symptom of deeper systemic issues. So how did we get here, and more importantly, how can we get out?

DevOps, Platform Teams, and the Evolution of Operations

In the last decade, the emergence of DevOps fundamentally reshaped the relationship between development and operations. More than just a set of tools or processes, DevOps represents a cultural shift towards collaboration, transparency, and shared responsibility for the full software development lifecycle—from code creation through deployment to monitoring and observability.

A new role was created to encapsulate that shift. DevOps engineers straddle the line between software development and infrastructure. They're often developers with a deep interest in infrastructure, deployments, and application scalability, or operations experts who have become fluent in the needs of modern software.

This DevOps philosophy has evolved further into what we now call Platform Teams. A rebranding of the traditional operations role, platform teams are responsible for infrastructure, monitoring, and scaling with a developer-first, cloud-native mindset. In a world where the vast majority of infrastructure lives in the cloud (thanks to AWS, Azure, Google Cloud, and the like), platform teams ensure that the tools, frameworks, and environments are optimized to support developer productivity and effectiveness.

However, one thing remains unchanged: firefighting.

Firefighting: The All-Too-Common Reality

Firefighting, in the context of platform and DevOps teams, refers to the reactive, often chaotic response to infrastructure crises—whether that's a security breach, a failed deployment, a DDoS attack, or a service outage. In these moments, the team must jump into action and "put out the fire."

Unfortunately, firefighting too often becomes the default mode in many organizations, even though it's both draining and inefficient.

Why do we constantly find ourselves operating in this mode? 

Is it simply a natural byproduct of working with complex systems or growing organizations? 

Is there something we can do to prevent it?

The Root Causes of Firefighting

While some “fires” are inevitable, most can be traced back to underlying issues that, with the right strategies, could have been avoided or minimized in the first place.

Below are a few of the top culprits we've identified as common causes of “fires”:

The Quick-and-Dirty Fix

Sometimes, in the heat of a crisis, teams apply temporary "band-aid" solutions to keep things running. These fixes, however, rarely address the root cause, and often come back to haunt the team later—sometimes with a vengeance.

Suppose, for example, that a critical service fails because of an incorrect manual configuration of a Kubernetes cluster, such as setting the wrong resource limits for pods. Instead of diagnosing and fixing the underlying issue, the team might restart the affected pods or tweak the configuration by hand without addressing the systemic problem, which means the issue is likely to crop up again later.

Using Helm to automate Kubernetes configurations and enforce consistent deployment policies can prevent these manual missteps and ensure more reliable environments.
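As a rough sketch (the chart name and all values below are hypothetical), a chart's values.yaml can pin resource requests and limits so that every release gets the same reviewed configuration rather than an ad-hoc edit on the live cluster:

```yaml
# values.yaml for a hypothetical payment-service chart.
# Resource settings live in version control and go through review,
# instead of being adjusted by hand on the running cluster.
replicaCount: 3

image:
  repository: registry.example.com/payment-service
  tag: "1.4.2"

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

Changing a limit then becomes a pull request followed by a helm upgrade, not an untracked tweak made under pressure.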

Throwing Things Over the Fence

In organizations where development and platform teams work in silos, it's easy for teams to pass the buck. Developers build an application, then "throw it over the fence" to the platform team with the hope that it will simply work in production. This lack of collaboration breeds misalignment and misunderstanding, and ultimately sets more “fires” that need to be put out later in production—primarily by the platform team.

For example, in a traditional workflow the development team might deploy a new application version, only to realize that the platform team didn't know about critical configuration requirements or potential scaling issues in production. The result is downtime and another round of firefighting.

Using Docker for containerization and Kubernetes for orchestration can help bridge the gap by providing a standardized, consistent environment that both developers and platform teams can collaborate on and deploy to.
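As an illustration (the service name, image, and port are hypothetical), a single Kubernetes Deployment manifest can act as the shared contract: developers define how their container runs, and the platform team operates that same definition in every environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api          # hypothetical service
  labels:
    app: orders-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          # The same Docker image developers build and test locally
          # is what the platform team runs in production.
          image: registry.example.com/orders-api:2.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            # Makes health expectations explicit instead of tribal knowledge.
            httpGet:
              path: /healthz
              port: 8080
```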

Absence of Standardized Processes

Chaos often reigns when processes are unclear or non-existent. Without standardized operating procedures for deploying, monitoring, scaling, and handling incidents, platform teams are often left alone to react to problems as they arise, which again creates a constant state of firefighting.

For example, without clear deployment procedures, a Kubernetes cluster might be manually configured each time a new app is deployed, leading to inconsistencies and potential failures. 

Leveraging a tool like Helm for Kubernetes helps ensure repeatable, consistent application deployments and reduces the risk of failure due to human error.
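To make that concrete, here is a minimal sketch of a Helm template (a hypothetical templates/deployment.yaml) that renders the deployment from reviewed values instead of per-release manual edits:

```yaml
# templates/deployment.yaml -- rendered identically by every
# `helm install` / `helm upgrade`, so deployments stay repeatable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

Running helm upgrade --install with an environment-specific values file then produces the same manifest every time, for every environment.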

The Inability to Say No

In fast-paced environments, there's often pressure to deliver new features quickly—sometimes at the expense of the long-term stability of the system or service. Teams might agree to risky deployments or feature releases under tight deadlines, only to face the consequences later when things inevitably go wrong.

As an example, the product team pushes to release a new feature without adequate testing or performance monitoring. The Kubernetes deployment is then overwhelmed by traffic, causing resource exhaustion and, eventually, system crashes.

Having a robust Prometheus and Grafana monitoring setup in place would allow teams to detect resource spikes early and prevent outages before they occur.
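As a sketch (the threshold, duration, and severity label are assumptions, not recommendations), a Prometheus alerting rule can surface containers that are approaching their memory limits well before they start crashing, with Grafana or Alertmanager delivering the alert to the team:

```yaml
groups:
  - name: resource-pressure
    rules:
      - alert: ContainerMemoryNearLimit
        # Fires when a container has used more than 90% of its memory limit
        # for 10 minutes; containers with no limit set are filtered out.
        expr: |
          (container_memory_working_set_bytes{container!=""}
            / container_spec_memory_limit_bytes{container!=""}) > 0.9
          and on (namespace, pod, container)
          container_spec_memory_limit_bytes{container!=""} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is close to its memory limit"
```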

Constantly Shifting Priorities

When priorities are constantly changing, or there's a lack of clear direction, teams operate in a reactive state. Known issues get deferred and environment/code problems pile up, creating a breeding ground for firefighting.

An example might be a platform team tasked with rolling out a critical security patch across multiple Kubernetes nodes. Because priorities constantly shift to other tasks, the patch is delayed; in the meantime, the vulnerability is exploited, and the team scrambles to mitigate the damage.

Using Terraform to automate infrastructure provisioning could ensure security updates are consistently applied across multiple environments regardless of shifting priorities.
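One way to keep such work from slipping behind shifting priorities is to run Terraform from a pipeline instead of from someone's laptop. The GitHub Actions workflow below is a rough sketch (the repository layout, branch, and directory names are assumptions) that plans and applies the Terraform configuration whenever the infrastructure code changes:

```yaml
# .github/workflows/terraform-apply.yml (sketch; paths are hypothetical).
# Cloud credentials and remote state configuration are omitted for brevity.
name: apply-infrastructure

on:
  push:
    branches: [main]
    paths: ["infra/**"]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Plan
        working-directory: infra
        run: |
          terraform init -input=false
          terraform plan -input=false -out=tfplan

      - name: Apply
        working-directory: infra
        # Applies the reviewed plan so patched configurations reach every
        # environment without waiting for someone to find time for it.
        run: terraform apply -input=false tfplan
```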

Non-Existent or Minimal Testing

A major cause of firefighting is the absence of proper testing at various stages of the development cycle. When there is no automated testing, no quality checks, and no staging environment, bugs are often discovered only after deployment to production, and the urgent effort to resolve them creates instability and downtime.

If a new feature is deployed without sufficient testing, it may crash the service in production because of an unhandled edge case. The team then scrambles to fix the issue in real time, but the lack of pre-deployment validation has already caused unnecessary disruption.

In these cases, a lack of unit tests, integration tests, and end-to-end tests can make minor issues quickly snowball into major failures.
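A minimal CI gate helps here. The workflow below is a sketch (the make targets are placeholders for whatever unit, integration, and end-to-end test commands a project actually uses) that refuses to let a change proceed until each test layer has passed:

```yaml
# .github/workflows/ci.yml (sketch; replace the make targets with real test commands)
name: ci

on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-unit

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests             # runs only after unit tests pass
    steps:
      - uses: actions/checkout@v4
      - run: make test-integration

  e2e-tests:
    runs-on: ubuntu-latest
    needs: integration-tests      # the final gate before merge and deployment
    steps:
      - uses: actions/checkout@v4
      - run: make test-e2e
```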

How to Stop the Fires Before They Occur

While firefighting is inevitable to some extent, it doesn't have to be the default modus operandi. By shifting from a reactive to a proactive mindset, you can significantly reduce the frequency and severity of these incidents.

Here are some ways to minimize those incidents and “fires”:

Automate Everything (or as much as you can)

One of the most effective ways to reduce firefighting is through automation. From automated deployment pipelines to monitoring and scaling, automation removes much of the manual, error-prone work that tends to turn into bigger problems. Infrastructure-as-Code (IaC) practices are also crucial for ensuring consistent, reliable environments across the board.

Automating infrastructure provisioning with Terraform ensures consistency across environments, reducing human error and the need for last-minute debugging. Similarly, running your CI/CD pipeline through GitHub Actions ensures that every code change goes through the same automated, repeatable process—minimizing the risk of human error in deployments.

Prioritize Stability Over Speed

Instead of rushing to ship new features, focus on stability and predictability. Implement Continuous Integration and Continuous Delivery (CI/CD) to catch bugs and performance issues early in the development lifecycle. This reduces the chances of a small problem snowballing into a large, business-impacting issue.

With GitHub Actions, you can automate the entire CI/CD pipeline, ensuring that code is automatically tested and deployed in a consistent and repeatable manner. For deploying to Kubernetes clusters, you can use ArgoCD for continuous deployment, so that any changes made to your Git repository are automatically synchronized with your Kubernetes environment, reducing the chances of configuration drift and deployment errors.
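On the ArgoCD side, an Application resource (the repository URL, path, and namespaces below are hypothetical) declares what "in sync" means, so the cluster is continuously reconciled against Git instead of drifting:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/orders-api-deploy.git
    targetRevision: main
    path: helm/orders-api        # the chart (or plain manifests) to deploy
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true                # remove resources that were deleted from Git
      selfHeal: true             # revert manual changes made directly on the cluster
```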

Invest in Monitoring and Observability

You can't put out a fire if you don't know it's happening. Investing time in setting up appropriate monitoring, logging, and observability tools ensures you can catch issues before they escalate. Set up “intelligent” alerts for anomaly detection and establish clear escalation paths so your team can respond promptly when things go wrong.

For example, implementing Prometheus for monitoring your Kubernetes clusters gives you powerful insight into resource usage, pod health, and other key metrics essential to the health of your cluster. By pairing it with Grafana for visualization, you can easily track the performance of your clusters, identify bottlenecks, and set up alerts for critical thresholds. This allows your team to address issues before they escalate into downtime.

Foster Cross-Functional Collaboration

Development and platform teams should no longer work in silos. Encourage close collaboration between developers, platform, security, and other relevant stakeholders. By embedding operational awareness into the development process, you can identify and fix potential issues early, before they become major incidents.

Tools like Slack or Microsoft Teams can be used to facilitate continuous communication between development, platform, and security teams, ensuring cross-team alignment. Integrating these tools with monitoring platforms like Prometheus means that any critical alert (such as high CPU or memory consumption in Kubernetes) can automatically trigger notifications, improving response time and reducing the need for firefighting.
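As a sketch (the channel name and webhook URL are placeholders), an Alertmanager configuration can route every Prometheus alert into a Slack channel shared by the development, platform, and security teams:

```yaml
# alertmanager.yml (sketch; the webhook URL would normally come from a secret)
route:
  receiver: team-platform-slack
  group_by: ["alertname", "namespace"]

receivers:
  - name: team-platform-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/PLACEHOLDER
        channel: "#platform-alerts"
        send_resolved: true       # also notify when the alert clears
```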

Adopt a Culture of Continuous Improvement

Firefighting should never be treated as “business as usual”. Foster a culture of continuous improvement, where the team regularly reviews incidents, performs root cause analysis, and takes action to prevent recurrence. Post-mortem meetings and regular feedback loops help teams learn from failures and continuously refine their processes.

By shifting from reactive to proactive, automated, and collaborative practices, organizations can reduce firefighting and operational chaos while, at the same time, creating a more sustainable and efficient environment built on stable, mature, and reliable systems.
