5 Warning Signs Your API is About to Go Down
Your API doesn't just "suddenly" go down. There are always warning signs—latency spikes, error rate increases, memory leaks creeping up. The problem? Most teams ignore these signals until it's too late.
By the time you're scrambling to fix a production outage, you've already lost revenue, users, and trust. Smart teams catch problems before they become disasters.
Here are the 5 warning signs that your API is about to fail—and what to do about each one.
Warning Sign #1: Gradual Latency Increases
What It Looks Like
Normal:
- P50 latency: 50ms
- P95 latency: 200ms
- P99 latency: 500ms
Warning:
- P50 latency: 150ms (+200%)
- P95 latency: 1,000ms (+400%)
- P99 latency: 5,000ms (+900%)
The pattern: Latency creeps up over days/weeks, not suddenly.
Why It Happens
Common causes:
- Database query inefficiency - Unoptimized queries slow down as data grows
- Memory leaks - Application gradually consumes more RAM
- Connection pool exhaustion - Running out of database connections
- Disk I/O saturation - Writes/reads hitting limits
- Third-party API slowdowns - Dependencies getting slower
The Real Danger
Latency compounds:
- User makes request → 2s response
- Frontend times out → retries request
- Now 2 slow requests competing for resources
- Latency gets worse → more retries
- Death spiral begins
Example:
9:00 AM: 200ms average response time
10:00 AM: 500ms (users starting to notice)
11:00 AM: 2s (timeouts begin)
11:30 AM: 10s (cascade failure)
12:00 PM: Total outage
How to Catch It
Monitor P95 and P99 latency, not just average:
// Bad: Only tracking average
const avgLatency = totalTime / requestCount;
// Good: Track percentiles (sort numerically once, then index into the sorted array)
const sorted = [...latencies].sort((a, b) => a - b);
const p95 = sorted[Math.floor(sorted.length * 0.95)];
const p99 = sorted[Math.floor(sorted.length * 0.99)];
Alert thresholds:
- P50 increases by 50% → Warning
- P95 increases by 100% → Critical
- P99 increases by 200% → Emergency
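A minimal sketch of wiring those thresholds to alerts, assuming you keep a baseline P95 from a healthy period and an alertTeam helper like the one used later in this post (the 1.5x warning factor is an assumption; the 2x critical factor matches the "+100%" threshold above):
// Compare the current P95 against a stored healthy baseline
function checkLatency(latencies, baselineP95) {
  const sorted = [...latencies].sort((a, b) => a - b);
  const p95 = sorted[Math.floor(sorted.length * 0.95)];
  if (p95 > baselineP95 * 2) {
    alertTeam('CRITICAL: P95 at ' + p95 + 'ms (baseline ' + baselineP95 + 'ms)');
  } else if (p95 > baselineP95 * 1.5) {
    alertTeam('WARNING: P95 at ' + p95 + 'ms (baseline ' + baselineP95 + 'ms)');
  }
}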
How to Fix It
Immediate actions:
- Check slow query logs - Find database bottlenecks
- Review recent deployments - New code causing issues?
- Scale horizontally - Add more servers (temporary fix)
- Enable caching - Reduce database load
Long-term fixes:
- Add database indexes - Speed up common queries
- Implement query caching - Redis/Memcached
- Optimize N+1 queries - Use eager loading (see the sketch after this list)
- Profile memory usage - Fix leaks
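To make the N+1 fix concrete, here is a minimal sketch using a hypothetical db.query helper: the first version fires one query per user, the second fetches the same data in a single JOIN.
// N+1: one query for the users, then one more query per user (gets slower as data grows)
const users = await db.query('SELECT id, name FROM users');
for (const user of users) {
  user.orders = await db.query('SELECT * FROM orders WHERE user_id = ?', [user.id]);
}
// Eager loading: one query with a JOIN (or your ORM's include/preload option)
const usersWithOrders = await db.query(
  'SELECT u.id, u.name, o.id AS order_id, o.total FROM users u LEFT JOIN orders o ON o.user_id = u.id'
);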
Warning Sign #2: Increasing Error Rates
What It Looks Like
Healthy API:
- 0.01% error rate (99.99% success)
- Errors are random, not clustered
Warning:
- 0.1% error rate (+10x)
- 1% error rate (+100x)
- Errors clustered by time/endpoint
Types of Errors to Watch
5xx Errors (Server-side):
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
Why they matter: Your fault, not the user's. System is struggling.
4xx Errors (Client-side):
- 400 Bad Request
- 401 Unauthorized
- 429 Too Many Requests
- Usually user error, but...
Red flag: If 429 (rate limit) errors spike, you're hitting capacity limits.
Error Patterns That Predict Outages
Pattern 1: Time-based clustering
Errors spike at the same time each day
→ Likely cron job or scheduled task overloading system
Pattern 2: Endpoint-specific errors
/api/payments: 10% error rate
/api/users: 0.01% error rate
→ Payments endpoint about to fail completely
Pattern 3: Cascading failures
Dependency API slows down → Your API times out → Retries →
More timeouts → Rate limits hit → Total failure
How to Catch It
Set up error rate alerts:
// Calculate error rate
const errorRate = (errorCount / totalRequests) * 100;
// Alert thresholds (check the most severe level first so only one alert fires)
if (errorRate > 1) {
  alertTeam('CRITICAL: Error rate at ' + errorRate + '%');
} else if (errorRate > 0.1) {
  alertTeam('WARNING: Error rate at ' + errorRate + '%');
}
Monitor by endpoint:
- Don't just track global error rate
- Some endpoints fail first (canaries)
- Payment/auth endpoints = highest priority
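A rough sketch of per-endpoint tracking with an in-memory counter (swap in your metrics library; alertTeam is the same assumed helper as above):
const endpointStats = {};
function recordResult(endpoint, isError) {
  const stats = endpointStats[endpoint] || (endpointStats[endpoint] = { total: 0, errors: 0 });
  stats.total += 1;
  if (isError) stats.errors += 1;
  const errorRate = (stats.errors / stats.total) * 100;
  // Only alert once there is enough traffic for the rate to be meaningful
  if (stats.total > 100 && errorRate > 1) {
    alertTeam('CRITICAL: ' + endpoint + ' error rate at ' + errorRate.toFixed(2) + '%');
  }
}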
How to Fix It
Immediate:
- Check error logs - What's failing?
- Review monitoring dashboards - CPU/memory/disk usage
- Increase timeouts temporarily - Stop cascade failures
- Enable circuit breakers - Fail fast instead of retrying
Long-term:
- Add retry logic with backoff - Don't hammer failing services
- Implement rate limiting - Protect yourself from traffic spikes
- Set up health checks - Auto-restart unhealthy instances
- Load test regularly - Find breaking points before users do
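As an example of the retry-with-backoff item above, here is a minimal sketch; callDependency stands in for whatever client call you are wrapping, and the delays are illustrative:
// Exponential backoff with jitter, so retries spread out instead of piling up
async function withRetry(callDependency, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callDependency();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delayMs = 2 ** attempt * 100 + Math.random() * 200;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}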
Warning Sign #3: Resource Exhaustion (CPU/Memory/Disk)
What It Looks Like
Healthy system:
- CPU: 30-50% average
- Memory: 60% used, stable
- Disk: 50% used, slowly growing
Warning:
- CPU: 80%+ sustained
- Memory: 90%+ (or climbing steadily)
- Disk: 95%+ (or growing 1% daily)
The Resource Death Spiral
How it starts:
- Memory leak causes gradual RAM increase
- System starts swapping to disk
- Disk I/O spikes → Everything slows down
- Slow responses → More concurrent requests
- More requests → More memory usage
- Loop repeats until crash
Timeline:
Week 1: Memory at 70% (normal)
Week 2: Memory at 80% (slight concern)
Week 3: Memory at 90% (warning)
Week 4: Memory at 95% (critical)
Week 5: Out of Memory → Crash
CPU Saturation
Pattern:
Normal: CPU spikes to 60% during traffic, drops to 20% at night
Warning: CPU stays at 70%+ even during low traffic
Critical: CPU pegged at 100%, requests queuing
Common causes:
- Inefficient algorithms (O(n²) instead of O(n))
- Heavy regex operations
- Unoptimized JSON parsing
- Infinite loops (bugs)
Disk Exhaustion
Sneaky problem: Takes weeks to manifest.
Pattern:
Monday: 60% disk used
Friday: 65% disk used (+5% in one week)
4 weeks later: 85% disk used
8 weeks later: 100% → System crashes
Common causes:
- Log files growing unchecked
- Database not pruning old data
- Temp files not cleaned up
- Uploaded files with no retention policy
How to Catch It
Monitor resource trends, not just current state:
# Bad: Only current usage
df -h
# Good: Record usage with a timestamp (e.g. from a daily cron) so you can see the trend
echo "$(date +%F) $(df -h / | awk 'NR==2 {print $5}')" >> /var/log/disk-usage.log
Alert thresholds:
- CPU: 70% sustained for 15+ min
- Memory: 80% and climbing
- Disk: 80% used or growing >5%/week
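For the CPU threshold, a small Node.js sketch that checks the 15-minute load average against core count (the 0.7 factor mirrors the 70% threshold above; alertTeam is the same assumed helper as earlier):
const os = require('os');
setInterval(() => {
  const [, , load15] = os.loadavg(); // 15-minute load average
  const cores = os.cpus().length;
  if (load15 > cores * 0.7) {
    alertTeam('WARNING: 15-min load ' + load15.toFixed(2) + ' on ' + cores + ' cores');
  }
}, 60 * 1000);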
How to Fix It
CPU:
- Profile application (find hot paths)
- Optimize slow functions
- Scale horizontally (add servers)
- Enable response caching
Memory:
- Find leaks: Use profilers (Node.js: node --inspect, Python: memory_profiler)
- Restart strategy: Auto-restart when memory hits 80% (see the sketch after this list)
- Garbage collection tuning: Adjust GC settings
- Scale up: More RAM (short-term fix)
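A minimal Node.js sketch of the restart-strategy idea: watch heap usage and alert before the hard limit, leaving the actual restart to your process manager (the 1.5 GB budget is an assumption):
const MAX_HEAP_BYTES = 1.5 * 1024 * 1024 * 1024; // assumed heap budget
setInterval(() => {
  const { heapUsed } = process.memoryUsage();
  const pct = (heapUsed / MAX_HEAP_BYTES) * 100;
  if (pct > 80) {
    alertTeam('WARNING: heap at ' + pct.toFixed(0) + '% of budget');
  }
}, 60 * 1000);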
Disk:
- Log rotation: Implement daily rotation, keep 7 days
- Database cleanup: Archive old data
- Temp file cleanup: Cron job to delete old temp files
- Monitoring: Alert at 80%, investigate at 70%
Warning Sign #4: Third-Party Dependency Degradation
What It Looks Like
Your API is fine, but:
- Database queries take 2x longer
- Payment API (Stripe) response time increases
- Email service (SendGrid) starts timing out
- Cloud provider (AWS) has elevated latency
The trap: You don't control the dependency, but you suffer the consequences.
How Dependencies Fail
Pattern 1: Slow degradation
Stripe API:
Week 1: 100ms average
Week 2: 150ms average
Week 3: 300ms average
Week 4: 500ms average → Your checkout breaks
Pattern 2: Intermittent failures
SendGrid:
Monday: Works fine
Tuesday: 5% of emails fail
Wednesday: Works fine
Thursday: 10% fail
Friday: Total outage
Pattern 3: Cascade failure
OpenAI API slows down →
Your AI features timeout →
Users retry →
More requests to OpenAI →
OpenAI rate limits you →
Total failure
Real Example: The Stripe Effect
March 2019: Stripe had a 4-hour outage.
Impact:
- Thousands of SaaS companies couldn't process payments
- Estimated $150M+ in lost revenue
- Companies with fallbacks (PayPal) survived
- Companies relying solely on Stripe crashed
The lesson: Your uptime depends on your weakest dependency.
How to Catch It
Monitor dependencies like you monitor your own API:
Use API Status Check:
- Tracks 100+ critical APIs (Stripe, OpenAI, AWS, etc.)
- Real-time alerts when dependencies degrade
- Historical uptime data
Track dependency latency:
const start = Date.now();
const response = await stripe.charge(...);
const latency = Date.now() - start;
// Alert if dependency is slow
if (latency > 1000) {
  alertTeam('Stripe API slow: ' + latency + 'ms');
}
Monitor error rates from dependencies:
try {
  await sendgrid.send(email);
} catch (error) {
  trackDependencyError('sendgrid', error);
}
How to Fix It
Before outages:
Multi-provider setup:
- Payments: Stripe + PayPal + Square
- Email: SendGrid + Resend + AWS SES
- AI: OpenAI + Anthropic + Google
Circuit breakers:
// Stop calling a failing API
if (failureRate > 0.5) { useBackup(); }
Caching:
// Serve cached data during outages
const cached = await cache.get(key);
if (cached) return cached;
Graceful degradation:
- Payment fails → Show manual payment option
- AI fails → Show cached responses
- Email fails → Queue for retry
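Putting the multi-provider and circuit-breaker ideas together, a rough failover sketch; primary.send and backup.send stand in for your real provider clients, and the failure threshold is illustrative:
let consecutiveFailures = 0;
const FAILURE_THRESHOLD = 5;
async function sendWithFallback(primary, backup, payload) {
  if (consecutiveFailures < FAILURE_THRESHOLD) {
    try {
      const result = await primary.send(payload);
      consecutiveFailures = 0; // a success closes the breaker
      return result;
    } catch (err) {
      consecutiveFailures += 1;
      trackDependencyError('primary', err);
    }
  }
  // Primary is failing (or the breaker is open): fall back to the backup provider
  return backup.send(payload);
}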
During outages:
- Switch to backup provider (auto or manual)
- Communicate with users
- Queue failed requests for retry
- Monitor backup provider capacity
Warning Sign #5: Traffic Pattern Anomalies
What It Looks Like
Normal traffic:
- Predictable daily/weekly patterns
- Gradual growth over time
- Spikes during launches/sales
Warning:
- Sudden 10x traffic increase
- Traffic from unusual geographies
- High error rates + high traffic
- Bots/scrapers hammering endpoints
Types of Anomalies
Anomaly 1: Unexpected traffic spike
Normal: 1,000 requests/minute
Suddenly: 10,000 requests/minute
Causes:
- Product Hunt launch (good)
- Reddit/HN post (good)
- DDoS attack (bad)
- Infinite loop in client code (bad)
Anomaly 2: Geographic anomaly
Normal: 80% US, 15% Europe, 5% other
Suddenly: 60% China + unusual access patterns
Likely: Scraping/data harvesting
Anomaly 3: Single-user hammering
Normal: 10 requests/minute per user
One user: 1,000 requests/minute
Likely: Broken client retry logic or malicious actor
The Real Danger: Bots
Bot traffic can kill your API:
- Scrapers: Harvest all your data
- DDoS: Overwhelm your servers
- Credential stuffing: Try stolen passwords
- API abuse: Exploit free tier
Example:
Startup offers free tier: 100 API calls/day
Bot creates 1,000 accounts
= 100,000 free calls/day
Your costs: $1,000/day (AWS bills)
Your revenue: $0
How to Catch It
Monitor traffic patterns:
// Track requests per user (reset this map on a fixed interval, e.g. every minute)
const userRequests = {};
function trackRequest(userId) {
  userRequests[userId] = (userRequests[userId] || 0) + 1;
  // Alert if user is hammering the API
  if (userRequests[userId] > 100) {
    alertTeam('User ' + userId + ' made 100+ requests');
  }
}
Geographic monitoring:
// Track origin countries
const countryRequests = {};
function trackCountry(countryCode) {
  countryRequests[countryCode] =
    (countryRequests[countryCode] || 0) + 1;
}
Bot detection:
- Missing User-Agent headers
- Unusual access patterns (perfect timing = bot)
- High request rate from single IP
- Requests to non-existent endpoints (scanning)
How to Fix It
Immediate:
Rate limiting:
// Limit requests per user/IP
if (requestCount > 100) {
  return 429; // Too Many Requests
}
Block abusive IPs:
const blockedIPs = ['1.2.3.4', '5.6.7.8'];
if (blockedIPs.includes(userIP)) {
  return 403; // Forbidden
}
CAPTCHA for suspicious traffic:
- Cloudflare Turnstile
- Google reCAPTCHA
- hCaptcha
Long-term:
- API keys: Require authentication
- Usage limits: Hard caps per tier
- CDN: Cloudflare/Fastly absorb DDoS
- WAF (Web Application Firewall): Block malicious patterns
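A sketch of the API-key and usage-limit ideas above as Express-style middleware; the header name, window, and cap are assumptions to adapt to your own tiers:
const WINDOW_MS = 60 * 1000;
const MAX_PER_WINDOW = 100;
const counters = new Map();
function rateLimit(req, res, next) {
  const apiKey = req.get('x-api-key') || req.ip; // fall back to IP for anonymous traffic
  const now = Date.now();
  const entry = counters.get(apiKey) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count += 1;
  counters.set(apiKey, entry);
  if (entry.count > MAX_PER_WINDOW) {
    return res.status(429).send('Too Many Requests');
  }
  next();
}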
How to Implement Early Warning System
Step 1: Set Up Comprehensive Monitoring
Must-track metrics:
- Latency: P50, P95, P99
- Error rates: Overall + per endpoint
- Resource usage: CPU, memory, disk
- Dependency health: Third-party API status
- Traffic patterns: Requests/min, user patterns
Tools:
- API Status Check - Monitor dependencies
- Datadog/New Relic - Full stack monitoring
- Prometheus + Grafana - Self-hosted monitoring
Step 2: Define Alert Thresholds
Latency alerts:
- P95 > 500ms: Warning
- P95 > 1,000ms: Critical
- P99 > 2,000ms: Emergency
Error rate alerts:
- 0.1%: Warning
- 1%: Critical
- 5%: Emergency
Resource alerts:
- CPU > 70%: Warning
- Memory > 80%: Critical
- Disk > 80%: Warning
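One way to keep these thresholds consistent across your alerting code and your documentation is to define them once as data; a minimal sketch:
const ALERT_THRESHOLDS = {
  latency: { p95WarningMs: 500, p95CriticalMs: 1000, p99EmergencyMs: 2000 },
  errorRate: { warningPct: 0.1, criticalPct: 1, emergencyPct: 5 },
  resources: { cpuWarningPct: 70, memoryCriticalPct: 80, diskWarningPct: 80 },
};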
Step 3: Automate Response
Auto-scaling:
# Example: Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Circuit breakers:
// Stop calling failing dependencies
if (errorRate > 0.5) { // i.e. more than 50% of recent calls failing
  circuitBreaker.open();
  useFallback();
}
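A slightly fuller version of that breaker, with an open period and an automatic trial call after a cooldown; the threshold and reset time are illustrative:
function createCircuitBreaker(call, { failureThreshold = 5, resetMs = 30 * 1000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return async function guardedCall(...args) {
    const isOpen = failures >= failureThreshold;
    if (isOpen && Date.now() - openedAt < resetMs) {
      throw new Error('Circuit open: skipping call, use fallback');
    }
    try {
      const result = await call(...args); // after resetMs this acts as a half-open trial call
      failures = 0;
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= failureThreshold) openedAt = Date.now();
      throw err;
    }
  };
}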
Auto-restart:
# Restart if memory usage is above 90%
mem_pct=$(free | awk '/Mem/ {printf "%.0f", $3/$2 * 100}')
if [ "$mem_pct" -gt 90 ]; then
  systemctl restart myapp
fi
Step 4: Build a Response Playbook
For each warning sign, document:
- What to check
- How to diagnose
- Immediate fixes
- Long-term solutions
Example playbook:
## Warning: Latency Spike
### Immediate Checks
1. Database slow query log
2. Recent deployments
3. Third-party API status
### Diagnosis
- Run EXPLAIN on slow queries
- Check for N+1 queries
- Review new code changes
### Immediate Fixes
- Scale horizontally (add servers)
- Enable caching
- Rollback bad deployment
### Long-term Solutions
- Add database indexes
- Optimize queries
- Implement query caching
Key Takeaways
5 warning signs your API is about to fail:
- Gradual latency increases - Monitor P95/P99, not just average
- Increasing error rates - 0.1% → 1% = death spiral starting
- Resource exhaustion - CPU/memory/disk trends predict crashes
- Dependency degradation - Your uptime depends on weakest link
- Traffic anomalies - Bots/DDoS can kill you before you notice
How to protect yourself:
- ✅ Monitor proactively (don't wait for users to complain)
- ✅ Set up alerts for all 5 warning signs
- ✅ Build fallbacks before you need them
- ✅ Document response playbooks
- ✅ Test your monitoring (simulate failures)
Remember: Outages don't happen suddenly. They give you warnings. The question is: Are you listening?
Need to monitor API dependencies? Track 100+ critical APIs with API Status Check - Get alerts before your dependencies take you down.