Before GitOps, infrastructure management looked like this: someone SSH’d into a server, ran a few commands, and hoped nothing broke. If it did, they fixed it manually and moved on. The next person to touch that server had no idea what changed or why.
That approach works at startup scale. It fails catastrophically at enterprise scale.
GitOps emerged as a response to this problem. It applies the same discipline that engineers use for application code — version control, code review, automated testing — to infrastructure operations. The result is an operating model where infrastructure is declarative, versioned, and self-healing.
Here’s what engineering leaders need to understand about GitOps, and why machine sets are a natural fit for the pattern.
What GitOps Actually Is
GitOps is built on three principles:
Declarative configuration. The desired state of the system is described in files — not in scripts or manual procedures. Every server, every deployment, every network policy exists as a configuration file in a Git repository.
Version control as source of truth. Git is the authoritative record of what the system should look like. Any change to the system starts as a change to a file in Git. The commit history is the audit log. The diff between any two points in time shows exactly what changed.
Automated drift reconciliation. A software agent running in the environment continuously compares the actual state of the system against the desired state in Git. When they differ — because of a manual change, an automated scaling event, or a failure — the agent reconciles the difference automatically.
The result is a closed loop: humans write configuration, Git stores it, and automation enforces it. No SSH. No manual fixes. No configuration drift.
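The closed loop above can be sketched in a few lines. This is an illustrative model, not a real agent: `desired_state`, `actual_state`, and the in-place mutation stand in for a Git checkout, a cluster query, and an API call.

```python
# Minimal sketch of a GitOps reconciliation loop.
# desired_state() stands in for reading manifests from a Git checkout;
# the `cluster` dict stands in for the live environment.

def desired_state() -> dict:
    """What Git says the system should look like."""
    return {"web-replicas": 3, "log-level": "info"}

def reconcile(cluster: dict) -> list[str]:
    """Compare actual state to desired state and apply the difference."""
    changes = []
    desired = desired_state()
    actual = dict(cluster)  # snapshot of the live state
    for key, want in desired.items():
        if actual.get(key) != want:
            cluster[key] = want  # "apply" the change
            changes.append(f"{key}: {actual.get(key)} -> {want}")
    # Remove anything not declared in Git (drift from manual changes).
    for key in actual:
        if key not in desired:
            del cluster[key]
            changes.append(f"{key}: removed")
    return changes

cluster = {"web-replicas": 2, "debug-port": 9999}
print(reconcile(cluster))  # first pass corrects the drift
print(reconcile(cluster))  # second pass finds nothing to do
```

The second call returning no changes is the point: the loop is idempotent, so it can run continuously and converge on whatever Git declares.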
Why Engineering Leaders Should Care
GitOps solves three problems that become critical as organizations grow:
Auditability. Every change to infrastructure is recorded in Git. Who made the change, when, and what they changed are all visible in the commit history. For compliance-heavy industries — finance, healthcare, defense — this is transformative. Instead of retrospective audits that piece together what happened from logs and conversations, you have a complete, verifiable record.
Disaster recovery. If a cluster fails, recovery is a matter of applying the configuration from Git to a new environment. No hunting for configuration files. No reconstructing setup procedures from memory. The entire infrastructure is codified and recoverable from a single repository.
Team autonomy. Teams can manage their own infrastructure without needing access to production systems. They submit a pull request. The review process catches issues. The automation applies the change. This separates the concern of “what should the infrastructure look like” from “who has permission to touch production.”
Machine Sets as a Pattern
The original post referenced machine sets — a concept from the Kubernetes Cluster API project (also used by OpenShift's machine API) that describes a group of machines with identical configuration. A machine set defines the desired number of replicas, the machine type, and the configuration template. The cluster controller ensures the actual number of machines matches the desired number.
Machine sets and GitOps are a natural pairing. The machine set configuration lives in Git. Any change to the desired number of machines, the machine type, or the configuration template starts as a Git commit. The GitOps agent applies the change to the cluster. The cluster controller handles the rest.
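A minimal Cluster API MachineSet manifest might look like the following. Names, the namespace, and the AWS provider are placeholders; the field layout follows the `cluster.x-k8s.io/v1beta1` API:

```yaml
# Illustrative Cluster API MachineSet; names and provider are placeholders.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  name: worker-pool-a
  namespace: default
spec:
  clusterName: prod-cluster
  replicas: 3                      # the desired number of machines
  selector:
    matchLabels:
      pool: worker-pool-a
  template:
    metadata:
      labels:
        pool: worker-pool-a
    spec:
      clusterName: prod-cluster
      version: v1.29.0             # Kubernetes version for these machines
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: worker-pool-a-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate   # provider-specific machine template
        name: worker-pool-a-template
```

With this file in Git, scaling the pool is a one-line change to `replicas` in a pull request; the GitOps agent applies it and the controller converges the machine count.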
This pattern extends beyond machine sets to any infrastructure resource that can be described declaratively. Load balancers. Database instances. DNS records. TLS certificates. Service mesh configurations. If it can be expressed as a file, it can be managed through GitOps.
What works better: Start with a single resource type that is well-understood and low-risk — a Kubernetes namespace, a single machine set, or a monitoring configuration. Establish the GitOps workflow for that resource type first. Prove that the review process, the automation, and the reconciliation loop work before expanding to more critical resources. The workflow is the product. The resource type is just the first customer.
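As a concrete starting point, a single Argo CD Application can scope GitOps to one low-risk directory of the repository. The repo URL, paths, and names here are placeholders:

```yaml
# Illustrative Argo CD Application managing one low-risk namespace.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-namespace
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/infra.git   # placeholder repo
    targetRevision: main
    path: namespaces/team-a                      # only this directory is managed
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a
  syncPolicy:
    automated:
      prune: true        # remove resources deleted from Git
      selfHeal: true     # revert manual drift in the cluster
```

Widening the `path` later — or adding more Applications — is how the workflow expands once it has proven itself.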
The Human Side of GitOps
GitOps is a technical pattern, but its biggest impact is organizational. It changes who can do what and how changes flow through the system.
In a traditional model, infrastructure changes require direct access to production systems. That access is restricted to a small group of senior engineers, creating a bottleneck and a single point of failure. In a GitOps model, anyone can propose an infrastructure change through a pull request. The review process provides quality control. The automation provides safety.
This shift has implications for team structure. Platform teams shift from being operators to being enablers — they build the GitOps workflow, define the review criteria, and maintain the automation. Application teams gain the ability to manage their own infrastructure within the boundaries defined by the platform.
What works better: Design the GitOps workflow with the team’s skill level in mind. A team that is new to declarative configuration needs a simpler workflow with more guardrails than a team that has been doing it for years. Start with a template-based approach where teams fill in predefined configuration blocks. Add flexibility as their confidence grows.
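One way to implement the template-based approach: the platform team owns manifest generation, and application teams commit only a small, constrained values file. The schema below is hypothetical — the point is that teams fill in blanks rather than editing raw resources:

```yaml
# Hypothetical team-facing values file; platform tooling expands this
# into full manifests, so teams never touch raw cluster resources.
team: team-a
environment: staging
workloads:
  - name: api
    replicas: 2            # capped by platform policy
    machine_type: m5.large # chosen from an allowed list
  - name: worker
    replicas: 1
    machine_type: m5.large
```

The guardrails live in the tooling that expands this file; flexibility is added by loosening the schema, not by handing teams raw access.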
When GitOps Isn’t the Answer
GitOps is not a universal solution. It works best for infrastructure that is stable, predictable, and can be described declaratively. It works less well for:
Ephemeral debugging. When you need to SSH into a server to diagnose a live issue, the GitOps workflow is too slow. The reconciliation loop will also undo your temporary changes. GitOps environments need escape hatches for troubleshooting — but those escape hatches should be logged and rare.
Stateful infrastructure. Databases, message queues, and stateful sets require careful handling. GitOps can manage the configuration around them, but the state itself lives outside Git. The reconciliation loop must be designed to preserve state while enforcing configuration.
Greenfield experimentation. When you’re exploring a new technology or architecture, the overhead of the GitOps workflow can slow down learning. Use a separate, less constrained environment for experimentation and promote proven configurations into the GitOps workflow.
What I’ve Learned
Five things that have shaped how I think about GitOps:
GitOps is an operating model, not a tool. ArgoCD and Flux are tools that implement GitOps. The real value is the discipline of declarative configuration, version-controlled history, and automated reconciliation. The tool matters less than the practice.
The audit trail is the killer feature. Every compliance conversation I’ve been in becomes simpler when the answer to “what changed?” is “here’s the Git log.” GitOps turns compliance from a retrospective exercise into a continuous one. That alone justifies the investment for regulated industries.
Start with low-risk resources and expand. Don’t try to GitOps your entire infrastructure at once. Pick one resource type that is manageable, prove the workflow, and expand. The confidence comes from experience, not planning.
Invest in the review process. GitOps shifts quality control from “who has access” to “what gets reviewed.” A good review process catches configuration errors before they reach production. A bad review process becomes a bottleneck. Invest in automated validation, clear review criteria, and fast feedback loops.
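Automated validation can be as simple as a script in CI that rejects a bad configuration before a human ever reviews it. A minimal sketch, assuming configs are parsed into dicts; the policy limits and field names are illustrative:

```python
# Minimal pre-review validation sketch: reject obviously bad
# configuration before it reaches a human reviewer.
MAX_REPLICAS = 10
ALLOWED_MACHINE_TYPES = {"m5.large", "m5.xlarge"}  # illustrative policy

def validate(config: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the config passes."""
    errors = []
    replicas = config.get("replicas")
    if not isinstance(replicas, int) or not 0 < replicas <= MAX_REPLICAS:
        errors.append(f"replicas must be 1..{MAX_REPLICAS}, got {replicas!r}")
    if config.get("machine_type") not in ALLOWED_MACHINE_TYPES:
        errors.append(f"machine_type {config.get('machine_type')!r} not allowed")
    return errors

good = {"replicas": 3, "machine_type": "m5.large"}
bad = {"replicas": 50, "machine_type": "z9.mega"}
print(validate(good))  # passes: no violations
print(validate(bad))   # rejected with two violations
```

A check like this runs in seconds on every pull request, which keeps the feedback loop fast and lets human reviewers focus on intent rather than typos.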
Plan for the escape hatch. There will be emergencies where you need to bypass the GitOps workflow. Design for them explicitly — limited-duration access, full logging, mandatory post-incident reconciliation. The escape hatch should be uncomfortable enough that it’s used only when necessary, but functional enough that it works when it’s needed.