Incident Response for Solo Builders: Your Personal Playbook
When production breaks at 2am and you are the only engineer, panic is your enemy. Build a three-layer incident response system — stop the bleeding, fix root cause, prevent recurrence — before you need it.
It is 11pm. Your phone buzzes. A Discord alert from your monitoring system: Foresight just placed five trades with inflated conviction scores. The P&L is dropping. Your trading bot is bleeding money, and you are the only engineer.
This is not a hypothetical. This happened. And the difference between a crisis and a story you tell later is whether you have a playbook before you need it.
The Three-Layer Model
Every incident response follows the same structure, whether you are at Google with 10,000 engineers or in your apartment at midnight with a laptop.
The layers are not optional. The order is not flexible. You do not skip Layer 1 to get to Layer 2. You do not jump to Layer 3 during an active incident. The discipline is the system.
Layer 1: Stop the Bleeding
Layer 1 has one job: halt the damage. Not fix the problem. Not understand the problem. Stop it from getting worse.
This layer is mechanical. It requires zero creativity, zero analysis, and zero deep thinking. That is by design. When your adrenaline is spiking and your brain is screaming "fix it fix it fix it," you need a response that works without your higher cognitive functions.
For Foresight, Layer 1 is a single config change: TRADING_ENABLED=false. The bot stops placing new trades. Existing positions are untouched — they will resolve on their own. Total time: seconds.
For a web application, Layer 1 might be: revert to the last known good deployment. Or flip a feature flag. Or reroute traffic to the maintenance page.
The point is: you decided this before the incident. You practiced it. You know exactly which file to edit, which command to run, which flag to flip. There is no thinking during Layer 1.
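To make the "no thinking" property concrete, here is a minimal sketch of what a kill-switch guard can look like at the top of a trading loop. The flag name matches the one above, but the loop structure and function names are hypothetical, not Foresight's actual code:

```python
import os

def trading_enabled() -> bool:
    """Layer 1 kill switch: one environment variable, no code deploy needed."""
    # Anything other than an explicit "true" means the bot stands down.
    return os.environ.get("TRADING_ENABLED", "false").lower() == "true"

def run_cycle(place_trade) -> bool:
    """One iteration of a hypothetical trading loop.

    Returns True if a trade was attempted, False if the kill switch held it back.
    """
    if not trading_enabled():
        return False  # stop the bleeding: no new trades, existing positions untouched
    place_trade()
    return True
```

Flipping the flag is then a one-line operation in the environment or a `.env` file, which is exactly the mechanical, zero-judgment action Layer 1 requires.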
Kill Switches: Your Emergency Toolkit
Every significant feature in your system needs a way to disable it without deploying code. If the only way to turn something off is to push a code change, you do not have a kill switch — you have a prayer.
I use three kill switch types across my systems:
Environment variables for process-level control. TRADING_ENABLED, SIGNAL_ENGINE_ACTIVE, CONTENT_PIPELINE_ENABLED. Simple and unambiguous: flip the value, restart the process, and the change takes effect in seconds.
JSON config files for feature-level control. Foresight reads config.json on every cycle. I can change a value, and the next iteration picks it up. No restart needed.
Database flags for state-level control. The InDecision framework stores feature flags in the database. Any service can check them. Changes propagate across the entire system.
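The three types can compose into a single check. The sketch below is an illustration, not the InDecision framework's actual API: the precedence order (environment variable first, then JSON config, then database flag), the file names, and the table schema are all assumptions.

```python
import json
import os
import sqlite3

def feature_enabled(name: str, config_path: str = "config.json",
                    db_path: str = "flags.db") -> bool:
    """Check a feature flag at three levels, most specific first.

    Unknown flags fail closed: missing everywhere means disabled.
    """
    # 1. Process-level: environment variable, e.g. TRADING_ENABLED=false
    env = os.environ.get(name.upper())
    if env is not None:
        return env.lower() == "true"

    # 2. Feature-level: JSON config re-read on every call, so no restart needed
    if os.path.exists(config_path):
        with open(config_path) as f:
            cfg = json.load(f)
        if name in cfg:
            return bool(cfg[name])

    # 3. State-level: a database flag that any service can check
    if os.path.exists(db_path):
        con = sqlite3.connect(db_path)
        row = con.execute(
            "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
        ).fetchone()
        con.close()
        if row is not None:
            return bool(row[0])

    return False  # fail closed: an unknown flag is an off flag
```

Failing closed is the design choice worth copying: during an incident, a typo in a flag name should silence a feature, never enable one.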
Layer 2: Fix Root Cause
Layer 1 bought you time. The bleeding has stopped. Now you can think.
Layer 2 is structured debugging applied under pressure. You do not guess. You form hypotheses, gather evidence, and make targeted changes. This is where the investigation stack from the previous lesson pays off.
The rules for Layer 2:
Targeted changes only. Do not refactor while fixing. Do not "improve" code you happen to be looking at. Fix the bug. Only the bug.
Test before deploy. Your fix needs a regression test before it ships. If you are too panicked to write tests, you are too panicked to deploy code.
Validate after deploy. After the fix is live, monitor the system. Watch the first few operations. Confirm the behavior matches your expectations. Do not deploy and walk away.
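The regression test demanded by the "test before deploy" rule can be tiny; the point is to pin down the exact failure before the fix ships. A hypothetical version for a conviction-score bug, with the formula, inputs, and baseline value invented for illustration:

```python
def conviction(signal_strength: float, confidence: float) -> float:
    """Hypothetical conviction formula: scores must stay within [0, 1]."""
    return max(0.0, min(1.0, signal_strength * confidence))

def test_conviction_never_exceeds_one():
    # Regression test for the incident: inflated inputs must not inflate the score.
    assert conviction(1.7, 1.2) <= 1.0

def test_conviction_matches_known_good_baseline():
    # Pin a known-good value so a future refactor cannot silently drift.
    assert conviction(0.5, 0.8) == 0.4
```

A test like this takes minutes to write even under pressure, and it converts "I think the fix works" into "the exact failure mode is now impossible to reintroduce silently."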
The Real Timeline
Theory is clean. Reality is messy. Here is how the Foresight conviction bug actually played out.
Notice the rhythm. Layer 1 in 5 minutes. Layer 2 in 3 hours. Layer 3 over 2 weeks. The system was safe in 5 minutes, fixed in 3 hours, and hardened over weeks.
That is not accidental. That is the design. Urgency decreases as you move through the layers. Layer 1 is a sprint. Layer 2 is a focused session. Layer 3 is a project.
Data Hygiene During Incidents
Here is what most incident guides do not teach you: corrupted data is the silent second incident.
When Foresight placed trades with inflated conviction, those trades created data. Trade records, P&L calculations, performance metrics — all based on a broken formula. If I fixed the code but did not flag the corrupted data, every downstream system that consumed that data would produce wrong results.
During Layer 1, I flagged every trade placed in the affected window. During Layer 2, I recalculated the correct conviction scores for those trades and annotated them in the database. During Layer 3, I built monitoring that detects conviction drift automatically.
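As a sketch of the flagging step, assuming a SQLite trades table (the schema, column names, and note text are hypothetical), quarantining the affected window can be one targeted update:

```python
import sqlite3

def quarantine_window(db_path: str, start_ts: str, end_ts: str) -> int:
    """Flag every trade placed in the affected window so downstream
    consumers can exclude or correct them. Returns rows flagged."""
    con = sqlite3.connect(db_path)
    with con:  # the connection context manager commits on success
        cur = con.execute(
            """UPDATE trades
               SET suspect = 1,
                   incident_note = 'conviction formula bug, see post-mortem'
               WHERE placed_at BETWEEN ? AND ?""",
            (start_ts, end_ts),
        )
    con.close()
    return cur.rowcount
```

A second pass during Layer 2 can then write a corrected score into a separate column, so the broken value is preserved as evidence for the post-mortem instead of being overwritten.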
Data hygiene is not optional. It is part of the incident response.
Emotional Management
I will be direct about something most engineering content ignores: production incidents are emotionally intense.
Your system is broken. Real money, real users, or real reputation is at risk. Your brain floods with cortisol. Your thinking narrows. You want to DO SOMETHING — anything — to make it stop.
This is exactly when you make the worst decisions. Panic leads to:
- Deploying untested fixes
- Making broad changes instead of targeted ones
- Forgetting to check if the fix actually worked
- Breaking something else while "fixing" the original problem
Layer 1 is designed specifically for this state. It is mechanical. Flip a switch. Run a command. No analysis, no creativity, no judgment required. It works when your brain is running at 50% because it was designed to work at 50%.
Layer 3: Prevent Recurrence
The incident is over. The system is stable. The code is fixed. Now the real work begins.
Layer 3 asks a different question: not "what happened?" but "what structural change prevents this class of failure from ever happening again?"
For Foresight, Layer 3 was a multi-week project:
Corrective mode: The bot now auto-detects conviction drift — if the conviction score deviates from expected ranges, it pauses trading and alerts. The bug cannot silently affect trades anymore.
Formula consistency monitoring: A new monitoring check compares formula outputs against known-good baselines. Drift triggers an alert before any trade is placed.
Refactor safety net: A new CI check verifies that all variable consumers are updated when a variable is renamed. The class of bug that caused the incident — stale references after a rename — is now caught in CI.
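A minimal version of the formula-consistency check looks like this. The formula, the baseline pairs, and the tolerance are invented for illustration; the pattern is what matters: evaluate the formula on fixed inputs and compare against outputs recorded while the formula was trusted, before any trade is placed.

```python
# Known-good (inputs, expected output) pairs, recorded while the formula
# was trusted. These specific numbers are illustrative, not real baselines.
BASELINES = [((0.5, 0.8), 0.4), ((1.0, 1.0), 1.0), ((0.2, 0.5), 0.1)]
TOLERANCE = 1e-9

def conviction(signal_strength: float, confidence: float) -> float:
    """Hypothetical conviction formula under monitoring."""
    return max(0.0, min(1.0, signal_strength * confidence))

def formula_drifted() -> bool:
    """True if the formula no longer reproduces its known-good baselines."""
    return any(
        abs(conviction(*inputs) - expected) > TOLERANCE
        for inputs, expected in BASELINES
    )
```

Run at the top of each cycle, a True result pauses trading and fires an alert, which is exactly the corrective-mode behavior described above: the bug class can still occur, but it can no longer trade.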
The Post-Mortem
Every incident gets a post-mortem. No exceptions. Even if the incident was small. Even if the fix was obvious.
The post-mortem answers five questions:
- What happened? Timeline, symptoms, impact.
- Why did it happen? Root cause analysis (5 Whys from the previous lesson).
- What did we do? Layer 1, 2, and 3 actions.
- What will we change? Concrete, actionable items with owners and deadlines.
- What did we learn? Patterns to watch for, mindset shifts, process gaps.
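The five questions translate directly into a reusable template. A minimal sketch of what an entry can look like (section wording mine, mapped one-to-one onto the questions above):

```markdown
## Post-Mortem: <incident name> (<date>)

### What happened
Timeline, symptoms, impact. Include timestamps for detection, Layer 1, and fix.

### Why it happened
Root cause analysis (5 Whys). Stop at a cause you can actually change.

### What we did
Layer 1 (stop the bleeding), Layer 2 (root-cause fix), Layer 3 (prevention).

### What will change
Concrete action items, each with an owner and a deadline.

### What we learned
Patterns to watch for, mindset shifts, process gaps.
```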
The post-mortem goes into lessons.md in the project root. That file is loaded at the start of every session. The lesson compounds across sessions — what you learned from this incident shapes how you handle the next one. This is the compound learning loop from Architect of War: encode lessons into operational documents, not memories.
Building Your Playbook
You build an incident response playbook the same way you build any emergency system: before you need it.
Open your main project right now. Answer these questions:
- What is the most critical feature?
- How do you turn it off in under 60 seconds?
- Where are the logs?
- What does "healthy" look like in the monitoring?
- Who needs to be notified?
If you cannot answer all five instantly, your playbook has gaps. Fill them now, while you are calm and unhurried.
Lesson 159 Drill
Build your personal incident response playbook for your most critical project.
- Inventory your kill switches. List every feature that matters and how you would disable each one. If any feature lacks a kill switch, build one.
- Write your Layer 1 checklist. A numbered list of mechanical steps: what to check, what to disable, what to monitor. No thinking required.
- Define your Layer 2 process. How do you enter structured debugging mode? What tools do you use? Where are the logs?
- Draft a post-mortem template. Five sections: What happened, Why, What we did, What changes, What we learned. Store it in your project root.
- Run a fire drill. Simulate an incident. Set a timer. Execute your Layer 1 checklist. How long did it take? What was missing?
The goal is not a perfect document. The goal is a document that exists. You will iterate on it after every real incident. But the first version needs to be written while your system is healthy, your mind is clear, and nobody is losing money.