On-Call at 3 AM Is the Best Senior Engineering Training Program
In my 47 years of engineering, I have seen every kind of training program. Boot camps. Pair programming. Mentorship. Documentation. Conferences. Certifications. All garbage.
You know what actually teaches engineers? Getting paged at 3 AM when production is on fire, the CEO is texting, and the error message is Error: -1.
Welcome to my training methodology. I call it Chaos-Based Mentorship via Sleep Deprivation. It is free, extremely effective, and has a 100% attrition rate for anyone with marketable skills. The survivors are your most senior engineers.
Why Traditional Training Fails
Traditional training assumes you should learn things before you encounter them in production. This is philosophically weak. Real engineers learn by doing. And the best âdoingâ happens when:
- It is 3 AM on a Saturday before a long weekend
- You are the only engineer awake (or the only one answering Slack)
- The error message is
NullPointerExceptionin a file calledUtils.java(line 1) - The last deployment was 9 months ago
- The engineer who built the affected service left in 2021 â on a Friday
This is what we call a high-information learning environment.
âWhy is Wally the only one who knows how the billing system works?â âBecause he was the only one on-call every weekend in 2017.â âWas he compensated for that?â âHe received a âTeam Playerâ award at the all-hands.â â Dilbert, Q4 Performance Review
The Ideal On-Call Configuration
A good on-call rotation, according to the industry, should have:
- Clear escalation paths
- Runbooks for common incidents
- A rotation of at least 6 people
- Reasonable alerting thresholds
- Compensation for nights and weekends
A great on-call rotation, according to me, should have:
- One person (you)
- No runbooks (keeps each incident fresh and exciting)
- No escalation path (figure it out)
- Alerts for everything, including
INFO-level log lines - A PagerDuty account where every single alert is P1
# Perfect alerting configuration
alerts:
- name: "CPU above 0.1%"
severity: CRITICAL
notify: [on-call-engineer, cto, entire-company-slack-channel]
- name: "Any log line containing 'warn'"
severity: CRITICAL
notify: [on-call-engineer]
sound: airhorn
repeat_every: 3_minutes
- name: "Memory usage above 12%"
severity: CRITICAL
notify: [on-call-engineer]
- name: "Heartbeat missing for 30 seconds"
description: >
Service is fine. Monitoring agent crashed. This is known.
Has been 'known' since March 2023. Do not fix.
severity: CRITICAL
- name: "90% of recent alerts were false positives"
severity: INFO
notify: [] # This one notifies nobody. It is purely for morale.
Alert Fatigue Is Just Weakness
Some engineers complain about âalert fatigueâ â the phenomenon where receiving hundreds of meaningless alerts trains you to ignore all of them, including real ones. These engineers lack resilience.
Alert fatigue is simply your brain saying âI have not been sufficiently challenged.â The solution is more alerts, not fewer. Add more dashboards. More monitors. More Slack integrations. Pin the PagerDuty app directly to your phoneâs home screen. Sleep with the phone on the pillow.
| Bad On-Call Setup | Correct On-Call Setup |
|---|---|
| Alerts only on actionable issues | Alerts on everything, including disk usage at 3% |
| Rotation of 8 people | Rotation of 1 person (you, forever) |
| Mean time to acknowledge: 5 min | Mean time to acknowledge: instantaneous (you donât sleep) |
| Runbooks available and current | Runbooks? That would spoil the mystery |
| Escalation policy exists | Escalation = asking Stack Overflow at 4 AM |
| Post-mortem after every incident | âWe survived, didnât we? Moving on.â |
| Compensation for nights/weekends | âThis builds characterâ |
| On-call ends after a week | On-call is a lifestyle |
What You Learn at 3 AM That You Canât Learn Anywhere Else
Here is the curriculum of my training program:
- Which parts of the system actually matter â the ones that break on Saturday night
- SSH muscle memory â
sudo systemctl restart <service-name>becomes reflexive - That your logs are useless â DEBUG level for everything, zero contextual information
- That monitoring dashboards are aspirational â they show what should be happening
- That your runbooks are wrong â they reference a server decommissioned in 2020
- That your colleagues have very firm sleep schedules â valuable data point
- Exactly why the engineer who built this left â you understand it now, deeply
- That âit works on my machineâ is not a runbook â this lesson costs exactly one 3 AM incident
As XKCD #705 âDevotion to Dutyâ illustrates: sometimes the heroic thing is just keeping the servers running through sheer stubbornness. The comic does not mention anything about needing adequate rest to make good decisions. I choose to interpret this as endorsement.
The Universal Runbook
Here is the only runbook template you will ever need. Print it out. Laminate it. Keep it on your desk for when the Wi-Fi goes down:
# Runbook: Everything
## When Something Breaks
1. Check if it is actually broken (maybe it's the monitoring that's broken)
2. Restart it
3. If that doesn't work, restart the thing that depends on it
4. If still broken, restart everything in the general vicinity
5. If production is down, restart production
6. If you don't know how to restart production:
a. This is your learning opportunity
b. Try `sudo reboot` on the server named 'gollum'
c. Do not tell anyone you did this
7. Check if there was a recent deployment
8. There was. Roll back.
9. Rollback failed? Redeploy the same version with more confidence
10. Add a TODO comment to investigate the root cause later
11. "Later" is when the next incident happens and you find the TODO
12. Close the incident as: "Resolved - monitoring"
13. Mark all related alerts as "acknowledged"
14. Go back to sleep (optional)
On-Call Compensation
The question of whether engineers should receive additional compensation for on-call work is complex. My answer: stop being so transactional about your career.
Here is what on-call actually gives you:
- Irreplaceable tribal knowledge â you now know things that exist nowhere in any documentation
- Job security â you are the only one who knows how the billing reconciliation script works
- Character â undefined, but definitely built
- A great interview story â âtell me about a challenging incidentâ is covered forever
- Respect â from approximately nobody, but thatâs fine, respect is overrated
- LinkedIn bullet point â âManaged critical on-call rotation for high-availability distributed systemsâ
âCatbert, the engineers are burning out from continuous on-call rotation.â âPerfect. Burnout increases retention â theyâre too tired to update their rĂ©sumĂ©s.â âShould we hire more engineers to share the load?â âThat would reduce the burnout. Try giving them a pizza party instead.â â Catbert, Evil HR Director, every Q3
The Real Truth (Which I Will Bury At The End)
Here is what I wonât tell my junior engineers: on-call should be well-managed. Alerts should be rare and always actionable. Runbooks should be accurate and maintained. The rotation should be genuinely fair. Engineers should be compensated for their time. Rest is not optional â tired engineers make expensive mistakes. Alert fatigue is a documented safety problem, not a character flaw.
The best on-call experiences are boring. Nothing happens. You go back to sleep.
But boring doesnât produce burnt-out senior engineers who have memorized every server name on the network and can restart the payment processor from memory in the dark.
So clearly boring is the wrong goal.
The author has been âinformally on-callâ since 2003. His official on-call rotation ended in 2019. He still wakes up at 3 AM instinctively to check his phone. His phone shows no alerts. He checks anyway. This is called âexperience.â