5 Warning Signs Your API is About to Go Down

Your API doesn't just "suddenly" go down. There are always warning signs—latency spikes, error rate increases, memory leaks creeping up. The problem? Most teams ignore these signals until it's too late.

By the time you're scrambling to fix a production outage, you've already lost revenue, users, and trust. Smart teams catch problems before they become disasters.

Here are the 5 warning signs that your API is about to fail—and what to do about each one.

Warning Sign #1: Gradual Latency Increases

What It Looks Like

Normal:

  • P50 latency: 50ms
  • P95 latency: 200ms
  • P99 latency: 500ms

Warning:

  • P50 latency: 150ms (+200%)
  • P95 latency: 1,000ms (+400%)
  • P99 latency: 5,000ms (+900%)

The pattern: Latency creeps up over days/weeks, not suddenly.

Why It Happens

Common causes:

  1. Database query inefficiency - Unoptimized queries slow down as data grows
  2. Memory leaks - Application gradually consumes more RAM
  3. Connection pool exhaustion - Running out of database connections
  4. Disk I/O saturation - Writes/reads hitting limits
  5. Third-party API slowdowns - Dependencies getting slower

The Real Danger

Latency compounds:

  • User makes request → 2s response
  • Frontend times out → retries request
  • Now 2 slow requests competing for resources
  • Latency gets worse → more retries
  • Death spiral begins

Example:

9:00 AM: 200ms average response time
10:00 AM: 500ms (users starting to notice)
11:00 AM: 2s (timeouts begin)
11:30 AM: 10s (cascade failure)
12:00 PM: Total outage

How to Catch It

Monitor P95 and P99 latency, not just average:

// Bad: Only tracking the average hides tail latency
const avgLatency = totalTime / requestCount;

// Good: Track percentiles (sort numerically, once)
const sorted = [...latencies].sort((a, b) => a - b);
const p95 = sorted[Math.floor(sorted.length * 0.95)];
const p99 = sorted[Math.floor(sorted.length * 0.99)];

Alert thresholds:

  • P50 increases by 50% → Warning
  • P95 increases by 100% → Critical
  • P99 increases by 200% → Emergency
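
To act on these thresholds automatically, compare current percentiles against a baseline captured during normal operation. A minimal sketch, assuming you collect latency samples in milliseconds and already have an alertTeam helper; the baseline numbers are illustrative:

// Compute a percentile from a numeric array of latency samples
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * p)];
}

const baseline = { p50: 50, p95: 200, p99: 500 }; // ms, captured during a healthy week

function checkLatency(samples) {
  const p50 = percentile(samples, 0.50);
  const p95 = percentile(samples, 0.95);
  const p99 = percentile(samples, 0.99);

  // Check the most severe condition first
  if (p99 > baseline.p99 * 3) {
    alertTeam('EMERGENCY: P99 latency at ' + p99 + 'ms');
  } else if (p95 > baseline.p95 * 2) {
    alertTeam('CRITICAL: P95 latency at ' + p95 + 'ms');
  } else if (p50 > baseline.p50 * 1.5) {
    alertTeam('WARNING: P50 latency at ' + p50 + 'ms');
  }
}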

How to Fix It

Immediate actions:

  1. Check slow query logs - Find database bottlenecks
  2. Review recent deployments - New code causing issues?
  3. Scale horizontally - Add more servers (temporary fix)
  4. Enable caching - Reduce database load

Long-term fixes:

  1. Add database indexes - Speed up common queries
  2. Implement query caching - Redis/Memcached (sketch after this list)
  3. Optimize N+1 queries - Use eager loading
  4. Profile memory usage - Fix leaks
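
For the query-caching step flagged above, a cache-aside pattern covers most cases: check the cache, fall back to the database, then populate the cache. A minimal sketch using the node-redis client; getUserFromDb is a hypothetical database helper and the 60-second TTL is illustrative:

const { createClient } = require('redis');
const cache = createClient(); // assumes Redis on localhost:6379

// Cache-aside: check Redis first, fall back to the database, then repopulate
async function getUser(id) {
  if (!cache.isOpen) await cache.connect();

  const key = 'user:' + id;
  const cached = await cache.get(key);
  if (cached) return JSON.parse(cached);

  const user = await getUserFromDb(id); // hypothetical DB helper
  await cache.set(key, JSON.stringify(user), { EX: 60 }); // expire after 60 seconds
  return user;
}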

Warning Sign #2: Increasing Error Rates

What It Looks Like

Healthy API:

  • 0.01% error rate (99.99% success)
  • Errors are random, not clustered

Warning:

  • 0.1% error rate (+10x)
  • 1% error rate (+100x)
  • Errors clustered by time/endpoint

Types of Errors to Watch

5xx Errors (Server-side):

  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout

Why they matter: Your fault, not the user's. System is struggling.

4xx Errors (Client-side):

  • 400 Bad Request
  • 401 Unauthorized
  • 429 Too Many Requests
  • Usually user error, but...

Red flag: If 429 (rate limit) errors spike, you're hitting capacity limits.

Error Patterns That Predict Outages

Pattern 1: Time-based clustering

Errors spike at the same time each day
→ Likely cron job or scheduled task overloading system

Pattern 2: Endpoint-specific errors

/api/payments: 10% error rate
/api/users: 0.01% error rate
→ Payments endpoint about to fail completely

Pattern 3: Cascading failures

Dependency API slows down → Your API times out → Retries →
More timeouts → Rate limits hit → Total failure

How to Catch It

Set up error rate alerts:

// Calculate error rate as a percentage
const errorRate = (errorCount / totalRequests) * 100;

// Alert thresholds (check the most severe first so only one alert fires)
if (errorRate > 1) {
  alertTeam('CRITICAL: Error rate at ' + errorRate + '%');
} else if (errorRate > 0.1) {
  alertTeam('WARNING: Error rate at ' + errorRate + '%');
}

Monitor by endpoint:

  • Don't just track global error rate
  • Some endpoints fail first (canaries)
  • Payment/auth endpoints = highest priority
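
A minimal way to do this is to keep separate counters per endpoint and compute each endpoint's error rate on the fly. A sketch; the 100-request minimum and 1% threshold are illustrative:

// Per-endpoint counters: totals and errors tracked separately
const stats = {};

function recordRequest(endpoint, statusCode) {
  const s = stats[endpoint] || (stats[endpoint] = { total: 0, errors: 0 });
  s.total += 1;
  if (statusCode >= 500) s.errors += 1;

  const errorRate = (s.errors / s.total) * 100;
  // Wait for a minimum sample size before alerting
  if (s.total >= 100 && errorRate > 1) {
    alertTeam('CRITICAL: ' + endpoint + ' error rate at ' + errorRate.toFixed(2) + '%');
  }
}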

How to Fix It

Immediate:

  1. Check error logs - What's failing?
  2. Review monitoring dashboards - CPU/memory/disk usage
  3. Increase timeouts temporarily - Stop cascade failures
  4. Enable circuit breakers - Fail fast instead of retrying

Long-term:

  1. Add retry logic with backoff - Don't hammer failing services (sketch after this list)
  2. Implement rate limiting - Protect yourself from traffic spikes
  3. Set up health checks - Auto-restart unhealthy instances
  4. Load test regularly - Find breaking points before users do
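
For the retry-with-backoff item flagged above, the key is to wait longer after each failure and cap the number of attempts. A minimal sketch:

// Retry with exponential backoff: wait 500ms, 1s, 2s... and give up after maxRetries retries
async function withRetry(fn, maxRetries = 3, baseDelayMs = 500) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const user = await withRetry(() => fetchUser(id)); // fetchUser is hypothetical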

Warning Sign #3: Resource Exhaustion (CPU/Memory/Disk)

What It Looks Like

Healthy system:

  • CPU: 30-50% average
  • Memory: 60% used, stable
  • Disk: 50% used, slowly growing

Warning:

  • CPU: 80%+ sustained
  • Memory: 90%+ (or climbing steadily)
  • Disk: 95%+ (or growing 1% daily)

The Resource Death Spiral

How it starts:

  1. Memory leak causes gradual RAM increase
  2. System starts swapping to disk
  3. Disk I/O spikes → Everything slows down
  4. Slow responses → More concurrent requests
  5. More requests → More memory usage
  6. Loop repeats until crash

Timeline:

Week 1: Memory at 70% (normal)
Week 2: Memory at 80% (slight concern)
Week 3: Memory at 90% (warning)
Week 4: Memory at 95% (critical)
Week 5: Out of Memory → Crash

CPU Saturation

Pattern:

Normal: CPU spikes to 60% during traffic, drops to 20% at night
Warning: CPU stays at 70%+ even during low traffic
Critical: CPU pegged at 100%, requests queuing

Common causes:

  • Inefficient algorithms (O(n²) instead of O(n))
  • Heavy regex operations
  • Unoptimized JSON parsing
  • Infinite loops (bugs)

Disk Exhaustion

Sneaky problem: Takes weeks to manifest.

Pattern:

Monday: 60% disk used
Friday: 65% disk used (+5% in one week)
4 weeks later: 85% disk used
8 weeks later: 100% → System crashes

Common causes:

  • Log files growing unchecked
  • Database not pruning old data
  • Temp files not cleaned up
  • Uploaded files with no retention policy

How to Catch It

Monitor resource trends, not just current state:

# Bad: Only a snapshot of current usage
df -h

# Good: Log usage daily (e.g. from cron) so you can see the growth trend
echo "$(date +%F) $(df -h / | awk 'NR==2 {print $5}')" >> /var/log/disk-usage.log

Alert thresholds:

  • CPU: 70% sustained for 15+ min
  • Memory: 80% and climbing
  • Disk: 80% used or growing >5%/week

How to Fix It

CPU:

  1. Profile application (find hot paths)
  2. Optimize slow functions
  3. Scale horizontally (add servers)
  4. Enable response caching

Memory:

  1. Find leaks: Use profilers (Node.js: node --inspect, Python: memory_profiler)
  2. Restart strategy: Auto-restart when memory hits 80% (Node sketch after this list)
  3. Garbage collection tuning: Adjust GC settings
  4. Scale up: More RAM (short-term fix)
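
For the restart strategy flagged above, the process can watch its own memory and exit when it crosses the threshold, letting a process manager such as systemd or PM2 restart it with a clean heap. A sketch; the 80% threshold and 30-second interval are illustrative:

const os = require('os');

// Check memory every 30s; exit so the process manager restarts us with a clean heap
setInterval(() => {
  const usedPercent = (process.memoryUsage().rss / os.totalmem()) * 100;
  if (usedPercent > 80) {
    console.error('Memory at ' + usedPercent.toFixed(1) + '%, exiting for restart');
    process.exit(1);
  }
}, 30000);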

Disk:

  1. Log rotation: Implement daily rotation, keep 7 days
  2. Database cleanup: Archive old data
  3. Temp file cleanup: Cron job to delete old temp files (sketch after this list)
  4. Monitoring: Alert at 80%, investigate at 70%
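
For the temp-file cleanup flagged above, a small Node script run from cron can delete anything older than a retention window. A sketch, assuming temp files live in a hypothetical /tmp/myapp directory:

const fs = require('fs');
const path = require('path');

const TMP_DIR = '/tmp/myapp';               // hypothetical temp directory
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // keep 7 days

// Delete temp files older than MAX_AGE_MS
for (const name of fs.readdirSync(TMP_DIR)) {
  const filePath = path.join(TMP_DIR, name);
  const ageMs = Date.now() - fs.statSync(filePath).mtimeMs;
  if (ageMs > MAX_AGE_MS) fs.unlinkSync(filePath);
}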

Warning Sign #4: Third-Party Dependency Degradation

What It Looks Like

Your API is fine, but:

  • Database queries take 2x longer
  • Payment API (Stripe) response time increases
  • Email service (SendGrid) starts timing out
  • Cloud provider (AWS) has elevated latency

The trap: You don't control the dependency, but you suffer the consequences.

How Dependencies Fail

Pattern 1: Slow degradation

Stripe API:
Week 1: 100ms average
Week 2: 150ms average
Week 3: 300ms average
Week 4: 500ms average → Your checkout breaks

Pattern 2: Intermittent failures

SendGrid:
Monday: Works fine
Tuesday: 5% of emails fail
Wednesday: Works fine
Thursday: 10% fail
Friday: Total outage

Pattern 3: Cascade failure

OpenAI API slows down →
Your AI features timeout →
Users retry →
More requests to OpenAI →
OpenAI rate limits you →
Total failure

Real Example: The Stripe Effect

March 2019: Stripe had a 4-hour outage.

Impact:

  • Thousands of SaaS companies couldn't process payments
  • Estimated $150M+ in lost revenue
  • Companies with fallbacks (PayPal) survived
  • Companies relying solely on Stripe crashed

The lesson: Your uptime depends on your weakest dependency.

How to Catch It

Monitor dependencies like you monitor your own API:

Use API Status Check:

  • Tracks 100+ critical APIs (Stripe, OpenAI, AWS, etc.)
  • Real-time alerts when dependencies degrade
  • Historical uptime data

Track dependency latency:

const start = Date.now();
const response = await stripe.charges.create(...);
const latency = Date.now() - start;

// Alert if dependency is slow
if (latency > 1000) {
  alertTeam('Stripe API slow: ' + latency + 'ms');
}

Monitor error rates from dependencies:

try {
  await sendgrid.send(email);
} catch (error) {
  trackDependencyError('sendgrid', error);
}

How to Fix It

Before outages:

  1. Multi-provider setup:

    • Payments: Stripe + PayPal + Square
    • Email: SendGrid + Resend + AWS SES
    • AI: OpenAI + Anthropic + Google
  2. Circuit breakers:

    // Stop calling a failing API
    if (failureRate > 0.5) {
      useBackup();
    }
    
  3. Caching:

    // Serve cached data during outages
    const cached = await cache.get(key);
    if (cached) return cached;
    
  4. Graceful degradation:

    • Payment fails → Show manual payment option
    • AI fails → Show cached responses
    • Email fails → Queue for retry

During outages:

  1. Switch to backup provider (auto or manual; failover sketch after this list)
  2. Communicate with users
  3. Queue failed requests for retry
  4. Monitor backup provider capacity
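
Switching to a backup provider (step 1 above) can be as simple as trying providers in priority order and queueing the work if all of them fail. A sketch with hypothetical sendWithSendGrid, sendWithResend, and retryQueue helpers, reusing the trackDependencyError function from earlier:

// Try providers in priority order; queue the message for retry if all of them fail
async function sendEmail(message) {
  const providers = [sendWithSendGrid, sendWithResend]; // hypothetical provider wrappers
  for (const send of providers) {
    try {
      return await send(message);
    } catch (error) {
      trackDependencyError(send.name, error);
    }
  }
  await retryQueue.push(message); // hypothetical queue for later delivery
  throw new Error('All email providers failed');
}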

Warning Sign #5: Traffic Pattern Anomalies

What It Looks Like

Normal traffic:

  • Predictable daily/weekly patterns
  • Gradual growth over time
  • Spikes during launches/sales

Warning:

  • Sudden 10x traffic increase
  • Traffic from unusual geographies
  • High error rates + high traffic
  • Bots/scrapers hammering endpoints

Types of Anomalies

Anomaly 1: Unexpected traffic spike

Normal: 1,000 requests/minute
Suddenly: 10,000 requests/minute

Causes:
- Product Hunt launch (good)
- Reddit/HN post (good)
- DDoS attack (bad)
- Infinite loop in client code (bad)

Anomaly 2: Geographic anomaly

Normal: 80% US, 15% Europe, 5% other
Suddenly: 60% China + unusual access patterns

Likely: Scraping/data harvesting

Anomaly 3: Single-user hammering

Normal: 10 requests/minute per user
One user: 1,000 requests/minute

Likely: Broken client retry logic or malicious actor

The Real Danger: Bots

Bot traffic can kill your API:

  • Scrapers: Harvest all your data
  • DDoS: Overwhelm your servers
  • Credential stuffing: Try stolen passwords
  • API abuse: Exploit free tier

Example:

Startup offers free tier: 100 API calls/day
Bot creates 1,000 accounts
= 100,000 free calls/day
Your costs: $1,000/day (AWS bills)
Your revenue: $0

How to Catch It

Monitor traffic patterns:

// Track requests per user (reset these counters every minute)
const userRequests = {};

function trackRequest(userId) {
  userRequests[userId] = (userRequests[userId] || 0) + 1;

  // Alert once when a user crosses the threshold
  if (userRequests[userId] === 100) {
    alertTeam('User ' + userId + ' made 100+ requests this minute');
  }
}

Geographic monitoring:

// Track origin countries
const countryRequests = {};

function trackCountry(countryCode) {
  countryRequests[countryCode] = 
    (countryRequests[countryCode] || 0) + 1;
}

Bot detection:

  • Missing User-Agent headers
  • Unusual access patterns (perfect timing = bot)
  • High request rate from single IP
  • Requests to non-existent endpoints (scanning)
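
These signals can be combined into a simple score before a request reaches your handlers. An Express-style sketch; the thresholds and the knownEndpoints set are illustrative:

// Flag requests that look automated: no User-Agent, hammering from one IP,
// or probing endpoints that do not exist
function botScore(req, requestsPerMinuteByIp, knownEndpoints) {
  let score = 0;
  if (!req.headers['user-agent']) score += 1;
  if ((requestsPerMinuteByIp[req.ip] || 0) > 300) score += 1;
  if (!knownEndpoints.has(req.path)) score += 1;
  return score; // 2+ = likely bot: send to CAPTCHA or block
}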

How to Fix It

Immediate:

  1. Rate limiting:

    // Limit requests per user/IP (Express-style handler sketch)
    if (requestCount > 100) {
      return res.status(429).send('Too Many Requests');
    }
    
  2. Block abusive IPs:

    // Block known abusive IPs (Express-style handler sketch)
    const blockedIPs = new Set(['1.2.3.4', '5.6.7.8']);
    if (blockedIPs.has(userIP)) {
      return res.status(403).send('Forbidden');
    }
    
  3. CAPTCHA for suspicious traffic:

    • Cloudflare Turnstile
    • Google reCAPTCHA
    • hCaptcha

Long-term:

  1. API keys: Require authentication (quota sketch after this list)
  2. Usage limits: Hard caps per tier
  3. CDN: Cloudflare/Fastly absorb DDoS
  4. WAF (Web Application Firewall): Block malicious patterns
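
API keys and usage limits (items 1 and 2 above) work together: every request carries a key, and every key has a hard cap for its tier. A minimal in-memory sketch; production setups would keep the counters in Redis or a database, and the limits are illustrative:

const TIER_LIMITS = { free: 100, pro: 10000 }; // calls per day

const usage = {}; // apiKey -> calls made today (reset by a daily job)

function checkQuota(apiKey, tier) {
  usage[apiKey] = (usage[apiKey] || 0) + 1;
  if (usage[apiKey] > TIER_LIMITS[tier]) {
    return { allowed: false, status: 429 }; // hard cap reached
  }
  return { allowed: true };
}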

How to Implement an Early Warning System

Step 1: Set Up Comprehensive Monitoring

Must-track metrics:

  1. Latency: P50, P95, P99
  2. Error rates: Overall + per endpoint
  3. Resource usage: CPU, memory, disk
  4. Dependency health: Third-party API status
  5. Traffic patterns: Requests/min, user patterns

Tools:

  • API Status Check - Monitor dependencies
  • Datadog/New Relic - Full stack monitoring
  • Prometheus + Grafana - Self-hosted monitoring
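
If you go the Prometheus + Grafana route, the prom-client package exposes these metrics from Node, and Prometheus derives P95/P99 from the histogram buckets. A minimal sketch; how /metrics is served depends on your web framework:

const client = require('prom-client');

// Histogram of request duration; Prometheus computes percentiles from the buckets
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

// Call this around each request handler
function observeRequest(route, status, durationSeconds) {
  httpDuration.labels(route, String(status)).observe(durationSeconds);
}

// Expose client.register.metrics() on a /metrics endpoint for Prometheus to scrape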

Step 2: Define Alert Thresholds

Latency alerts:

  • P95 > 500ms: Warning
  • P95 > 1,000ms: Critical
  • P99 > 2,000ms: Emergency

Error rate alerts:

  • 0.1%: Warning
  • 1%: Critical
  • 5%: Emergency

Resource alerts:

  • CPU > 70%: Warning
  • Memory > 80%: Critical
  • Disk > 80%: Warning

Step 3: Automate Response

Auto-scaling:

# Example: Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Circuit breakers:

// Stop calling failing dependencies once errors exceed 50%
if (errorRate > 0.5) {
  circuitBreaker.open();
  useFallback();
}

Auto-restart:

# Restart if memory usage exceeds 90%
MEM_USED=$(free | awk '/^Mem/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEM_USED" -gt 90 ]; then
  systemctl restart myapp
fi

Step 4: Build a Response Playbook

For each warning sign, document:

  1. What to check
  2. How to diagnose
  3. Immediate fixes
  4. Long-term solutions

Example playbook:

## Warning: Latency Spike

### Immediate Checks
1. Database slow query log
2. Recent deployments
3. Third-party API status

### Diagnosis
- Run EXPLAIN on slow queries
- Check for N+1 queries
- Review new code changes

### Immediate Fixes
- Scale horizontally (add servers)
- Enable caching
- Rollback bad deployment

### Long-term Solutions
- Add database indexes
- Optimize queries
- Implement query caching

Key Takeaways

5 warning signs your API is about to fail:

  1. Gradual latency increases - Monitor P95/P99, not just average
  2. Increasing error rates - 0.1% → 1% = death spiral starting
  3. Resource exhaustion - CPU/memory/disk trends predict crashes
  4. Dependency degradation - Your uptime depends on weakest link
  5. Traffic anomalies - Bots/DDoS can kill you before you notice

How to protect yourself:

  • ✅ Monitor proactively (don't wait for users to complain)
  • ✅ Set up alerts for all 5 warning signs
  • ✅ Build fallbacks before you need them
  • ✅ Document response playbooks
  • ✅ Test your monitoring (simulate failures)

Remember: Outages don't happen suddenly. They give you warnings. The question is: Are you listening?


Need to monitor API dependencies? Track 100+ critical APIs with API Status Check - Get alerts before your dependencies take you down.
