# Post-Mortem: API Outage - Rate Limiter Misconfiguration

*This is an example post-mortem. Delete this file or use it as a reference when creating real post-mortems.*
## Metadata
| Field | Value |
|---|---|
| Incident Date | 2024-01-15 |
| Severity | P2 |
| Authors | Alice, Bob |
| Status | Final |
| Incident Log | 2024-01-15-example-api-outage |
## Summary
On January 15, 2024, at 14:32 UTC, our API began rejecting most requests with HTTP 429 (Too Many Requests) errors. The incident lasted approximately 2 hours and 13 minutes.
The root cause was a typo in a rate limiter configuration change deployed at 14:15 UTC: the threshold was set to 10 requests per minute instead of the intended 1000, causing legitimate user requests to be rejected.
The incident was detected via automated monitoring and resolved by rolling back the configuration change. No funds were at risk, but approximately 3,000 users experienced failed transactions during the outage window.
## Impact
### Users
- Users affected: ~3,000
- Duration: 2h 13m
- Services unavailable: API (full), Frontend (degraded)
### Financial
- Funds at risk: None
- Actual losses: None
- Estimated costs: ~2 engineering hours for response
### Reputation
- Public visibility: Low (brief Discord posts)
- Media coverage: None
- Community response: Minor frustration, resolved quickly
## Timeline
| Time (UTC) | Event |
|---|---|
| 14:15 | Config change deployed |
| 14:32 | Incident began (errors started) |
| 14:32 | Detected via DataDog alert |
| 14:38 | Response started |
| 14:42 | Incident Leader assigned |
| 15:05 | Root cause identified |
| 15:25 | Rollback complete |
| 16:45 | Incident resolved |
See the Incident Log for the full timeline.
## Root Cause
### Primary Cause
Human error during configuration change. The rate limit threshold was typed as 10 instead of 1000.
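The bad change looked something like the fragment below. The file layout and key names are illustrative assumptions, not a reproduction of our actual config:

```yaml
# Hypothetical rate-limiter config -- key names are illustrative, not the real file.
rate_limiter:
  # Intended value: 1000 requests/minute. The deployed change dropped two zeros.
  requests_per_minute: 10   # should have been 1000
```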
### Contributing Factors
- No validation in config deployment process for obviously wrong values
- Config change was not tested in staging environment
- No gradual rollout for config changes
- Alert threshold (5% error rate) took 17 minutes to trigger
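A sanity check of the kind the action items call for could be sketched as follows. The key names and "sane range" bounds are assumptions for illustration, not our actual deployment tooling:

```python
# Sketch of pre-deploy validation for rate-limiter config values.
# SANE_BOUNDS and the key names are illustrative assumptions.

SANE_BOUNDS = {
    # key: (min, max) values considered plausible in production
    "requests_per_minute": (100, 100_000),
}

def validate_config(config: dict) -> list[str]:
    """Return human-readable errors; an empty list means the config passes."""
    errors = []
    for key, (lo, hi) in SANE_BOUNDS.items():
        value = config.get(key)
        if value is None:
            errors.append(f"{key}: missing")
        elif not lo <= value <= hi:
            errors.append(f"{key}: {value} outside sane range [{lo}, {hi}]")
    return errors

# The typo'd value (10) is rejected; the intended value (1000) passes.
print(validate_config({"requests_per_minute": 10}))
print(validate_config({"requests_per_minute": 1000}))  # []
```

A check like this would have blocked the deploy at the first contributing factor above, independent of peer review or staging.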
### 5 Whys
| Question | Answer |
|---|---|
| Why did the API reject requests? | Rate limiter threshold was too low |
| Why was the threshold wrong? | Typo in configuration file |
| Why wasn't the typo caught? | No review process for config changes |
| Why is there no review process? | Config changes considered "low risk" |
| Why are they considered low risk? | Never had an incident before (survivorship bias) |
## What Went Well
- Monitoring detected the issue within 17 minutes
- Team mobilized quickly once alert fired
- Root cause identified within 30 minutes
- Rollback was straightforward
- Communication to users was timely
## What Went Wrong
- Config change deployed without testing
- No peer review for config changes
- Alert took 17 minutes to fire (threshold too high)
- No validation for obviously wrong values (10 vs 1000)
## Where We Got Lucky
- This happened during business hours when the team was available
- The fix (rollback) was simple; if the previous config had also been broken, recovery would have been much harder
- No financial impact because this was the API layer, not smart contracts
## Action Items
| Action | Owner | Deadline | Status |
|---|---|---|---|
| Add peer review requirement for config changes | Dave | 2024-01-22 | Done |
| Add config validation for rate limit thresholds | Dave | 2024-01-29 | Done |
| Lower alert threshold to 1% error rate | Bob | 2024-01-19 | Done |
| Add staging environment testing for config changes | Dave | 2024-02-15 | In Progress |
| Document config change process | Alice | 2024-01-31 | Not Started |
## Lessons for Runbooks
- Existing runbook sufficient: Third-Party-Outage (config rollback section applicable)
- No new runbook needed
## Detection
| Aspect | Details |
|---|---|
| How detected | Monitoring alert (DataDog) |
| Time to detection | 17 minutes |
| Could we detect faster? | Yes - lower alert threshold to 1% |
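The "detect faster" claim can be illustrated with a toy simulation. The error-rate ramp below is invented (chosen so the 5% threshold crosses at minute 17, mirroring this incident); real traffic was not this smooth:

```python
# Sketch: why a lower alert threshold fires sooner on a ramping error rate.
# The ramp profile is a hypothetical illustration, not real incident data.

def minutes_to_alert(error_rates, threshold):
    """Minute index of the first sample at or above the threshold, else None."""
    for minute, rate in enumerate(error_rates):
        if rate >= threshold:
            return minute
    return None

# Hypothetical per-minute error rate after the bad deploy at minute 0,
# climbing ~0.3 percentage points per minute.
ramp = [0.003 * m for m in range(25)]

print(minutes_to_alert(ramp, 0.05))  # 5% threshold -> 17 (minutes)
print(minutes_to_alert(ramp, 0.01))  # 1% threshold -> 4 (minutes)
```

Under this assumed ramp, lowering the threshold from 5% to 1% cuts detection from 17 minutes to 4, which is the rationale behind the completed alerting action item.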
## Links
- Incident Log: 2024-01-15-example-api-outage
- Config PR (bad): [link]
- Rollback PR: [link]
- Monitoring dashboard: [link]
## Meeting Notes
**Attendees:** Alice, Bob, Carol, Dave
**Discussion points:**

- Agreed config changes need the same rigor as code changes
- Discussed whether to require staging for all config changes (decided yes)
- Dave to implement validation this week
See the Incident-Response-Policy for the full post-mortem process.