Infrastructure / reliability, aka "SRE"

Site Reliability Engineer

Software engineering pointed at keeping things up — and paid at the very top of the range.

Entry: $110k
Mid: $165k
Senior: $230k+
Demand: High

SRE applies engineering to operations: you write code and build systems that keep large services reliable, scalable, and observable, and you own the incident response when they aren't. It's intense and high-responsibility, but it's also among the best-paid engineering roles anywhere, and many students never realize it's a distinct, learnable path rather than 'ops'.

The myth

It's just sysadmin work with a trendy name.

The reality

SRE is a coding discipline: you automate away toil, build reliability tooling, define SLOs, and engineer systems to fail gracefully. The best SREs are strong software engineers first.
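To make "fail gracefully" concrete, here is a minimal Python sketch of one common pattern: serve the last known good value when a dependency is down instead of returning an error. The class name `FallbackCache` and its `fetch` method are hypothetical, invented for this illustration; a production version would also need expiry, metrics, and narrower exception handling.

```python
from typing import Callable, Dict


class FallbackCache:
    """Hypothetical helper: remembers the last successful result per key."""

    def __init__(self) -> None:
        self._last_good: Dict[str, str] = {}

    def fetch(self, key: str, primary: Callable[[str], str], default: str) -> str:
        """Call the primary source; on failure, degrade to stale data or a default."""
        try:
            value = primary(key)
        except Exception:
            # Degraded mode: a stale answer (or a safe default) beats an outage.
            return self._last_good.get(key, default)
        # Success: remember the value so a later failure can fall back to it.
        self._last_good[key] = value
        return value
```

The design choice is the point: the caller always gets an answer, and the failure shows up in monitoring rather than in front of the user.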

cat ./what_you_actually_do.md

  • Build automation and tooling that removes manual operational toil at scale.
  • Define SLOs and error budgets, then engineer systems to meet them.
  • Own observability — metrics, logging, tracing — so problems are visible before they're outages.
  • Lead incident response and run blameless postmortems that actually prevent repeats.
  • Engineer for scale, resilience, and graceful failure from the start.
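The SLO and error-budget item above comes down to simple arithmetic, sketched here in Python. The function names and the 30-day window are assumptions chosen for illustration, not a standard API: an availability SLO of 99.9% leaves 0.1% of the window (or of the requests) as budget you are allowed to spend on failures.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests in the window, so the budget is undefined")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

When the remaining fraction gets close to zero, the usual response is to slow down risky changes and spend engineering time on reliability instead.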

cat ./why_underrated.md

Students hear 'reliability' or 'ops' and picture thankless pager duty, not one of the highest-paid engineering tracks in the field. But SRE is a genuine software discipline with a clear body of knowledge, and because reliability directly protects revenue, companies pay a premium for people who can do it well. The on-call responsibility filters a lot of people out, which keeps supply tight and compensation high for those who lean in. It's hiding in plain sight next to the SWE track everyone fixates on.

grep -i 'good fit' ./who.md

  • Engineers who stay calm when things are on fire and like fixing root causes.
  • Systems thinkers drawn to infrastructure, scale, and automation.
  • People who'd rather prevent ten outages than ship one feature.

cat ./pay.md

SRE comp matches or beats product engineering at most large companies, because the cost of downtime is enormous and good SREs are scarce. Senior SREs at infra-heavy or big-tech companies clear $250k total comp. The trade-off is on-call: the responsibility (and sometimes the 3am page) is real.

./break_in.sh

  1. Go deep on Linux, networking, and one cloud

    The fundamentals matter more here than anywhere. Run real services and break them on purpose to learn how they fail.

  2. Learn to code, not just script

    Go or Python, plus IaC (Terraform) and Kubernetes. An SRE who can't build tooling stays junior.

  3. Read the Google SRE book

    It's free online and is the shared vocabulary of the entire discipline — SLOs, error budgets, toil, the lot.

  4. Start adjacent and move in

    Backend, DevOps, or platform roles are natural on-ramps; volunteer for reliability and on-call work to make the jump.

tail -f ./a_day.log

  • 09:00 Review overnight alerts; one near-miss becomes a small automation to prevent the real thing.
  • 11:00 Write Terraform and a Go tool to remove a recurring manual operational step.
  • 14:00 Tune SLOs and dashboards for a service that's been quietly burning its error budget.
  • 16:00 Run a blameless postmortem and turn the findings into concrete action items.
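A service "quietly burning its error budget", as in the 14:00 entry, is usually measured as a burn rate: the observed error rate divided by the error rate the SLO allows. The Python sketch below uses hypothetical function names and assumes a 30-day SLO window; a burn rate of 1 spends the budget exactly by the end of the window, and anything higher runs out early.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_rate / (1.0 - slo_target)


def days_until_exhausted(rate: float, window_days: int = 30, budget_left: float = 1.0) -> float:
    """Days until the remaining budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return budget_left * window_days / rate


if __name__ == "__main__":
    # 0.4% of requests failing against a 99.9% SLO is a burn rate of about 4.
    rate = burn_rate(0.004, 0.999)
    print(round(rate, 2))
    # At that pace, a full 30-day budget lasts about 7.5 days.
    print(round(days_until_exhausted(rate), 1))
```

A 0.4% error rate rarely trips a naive "is it down?" alert, which is why a slow burn like this tends to show up on a dashboard review rather than as a page.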

ls ./toolbelt

  • Go / Python
  • Kubernetes
  • Terraform
  • Prometheus / Grafana
  • A major cloud
  • Linux internals