Is Databricks Down? Complete Status Check Guide + Quick Fixes
Databricks workspace won't load?
Clusters stuck starting?
Jobs failing with connection errors?
Before panicking, verify if Databricks is actually down, or if it's a problem with your workspace, clusters, or network. Here's your complete guide to checking Databricks status and fixing common issues fast.
Quick Check: Is Databricks Actually Down?
Don't assume it's Databricks. 70% of "Databricks down" reports are actually workspace configuration issues, cluster startup failures, cloud provider problems, or networking misconfigurations.
1. Check Official Sources
Databricks Status Page:
status.databricks.com
What to look for:
- β "All Systems Operational" = Databricks is fine
- β οΈ "Partial Service Disruption" = Some services affected
- π΄ "Service Disruption" = Databricks is down
Real-time updates:
- Control Plane status (workspace access, authentication)
- Data Plane status (clusters, jobs, notebooks)
- Regional outages (AWS, Azure, GCP)
- API availability
- SQL Warehouses status
- Unity Catalog status
Twitter/X Search:
Search "Databricks down" on Twitter
Why it works:
- Users report outages instantly
- See if others in your region affected
- Databricks team responds here
Pro tip: If 100+ tweets in the last hour mention "Databricks down," it's probably actually down.
2. Check Service-Specific Status
Databricks has multiple services that can fail independently:
| Service | What It Does | Status Check |
|---|---|---|
| Control Plane | Workspace UI, authentication, API | status.databricks.com |
| Data Plane | Clusters, jobs, notebooks, compute | Check status page under "Data Plane" |
| SQL Warehouses | SQL endpoints, queries, dashboards | Check status page under "SQL" |
| Unity Catalog | Data governance, metadata | Check status page under "Unity Catalog" |
| Delta Lake | Table reads/writes, transactions | Check status page under "Delta" |
| MLflow | Model tracking, registry | Check status page under "MLflow" |
| Jobs/Workflows | Scheduled jobs, orchestration | Check status page under "Jobs" |
Your service might be down while Databricks globally is up.
How to check which service is affected:
- Visit status.databricks.com
- Look for specific service status
- Check your cloud provider region (AWS us-east-1, Azure East US, etc.)
- Check "Incident History" for recent issues
- Subscribe to status updates (email/SMS)
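If you want to watch the status page from a script instead of a browser, here is a minimal sketch. It assumes the page exposes a Statuspage-style /api/v2/status.json endpoint, which may not match the actual implementation of status.databricks.com; verify against the page itself before relying on it.

```python
# Hypothetical status poll -- the JSON endpoint below is an assumption,
# not a documented Databricks API. Adjust to whatever the status page exposes.
import requests

def check_databricks_status(url="https://status.databricks.com/api/v2/status.json"):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        # Statuspage convention: "none" = operational, "minor"/"major"/"critical" = degraded
        return resp.json().get("status", {}).get("indicator", "unknown")
    except requests.RequestException as exc:
        return f"unreachable: {exc}"

print(check_databricks_status())
```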
3. Check Cloud Provider Status
Databricks runs on cloud providers, so their outages affect Databricks.
| Cloud Provider | Status Page | What to Check |
|---|---|---|
| AWS | health.aws.amazon.com | EC2, S3, IAM in your region |
| Azure | status.azure.com | Virtual Machines, Storage, Active Directory |
| GCP | status.cloud.google.com | Compute Engine, Cloud Storage |
Decision tree:
Cloud provider down + Databricks status OK → Cloud provider issue
Cloud provider OK + Databricks status down → Databricks issue
Both OK + Your workspace down → Workspace configuration issue
Specific region down → Regional cloud outage
4. Test Different Access Methods
If workspace UI fails but REST API works, it's likely a browser/network issue.
| Access Method | Test Method |
|---|---|
| Workspace UI | Try loading your workspace URL |
| REST API | Test API endpoint with curl |
| CLI | Run databricks workspace list |
| JDBC/ODBC | Try SQL Warehouse connection |
Quick API test:
# Test Databricks REST API
curl -H "Authorization: Bearer <your-token>" \
https://<workspace-url>/api/2.0/clusters/list
If API works but UI doesn't:
- Clear browser cache
- Try incognito/private mode
- Try different browser
- Check browser console for errors (F12)
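The same API test from Python, if curl isn't handy (a minimal sketch; the placeholders are yours to fill in):

```python
# Calls the Clusters API list endpoint with a personal access token.
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
print(resp.status_code)                      # 200 = API reachable and token accepted
print(resp.json() if resp.ok else resp.text)
```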
Common Databricks Error Messages (And What They Mean)
Error: "Unable to Reach Workspace"
What it means: Can't connect to Databricks workspace.
Causes:
- Network connectivity issues
- DNS resolution failure
- Workspace suspended/deleted
- VPN/proxy interference
- Browser cache corruption
Quick fixes:
- Check if databricks.com loads in browser
- Verify workspace URL is correct
- Check workspace status in cloud provider console
- Disable VPN temporarily
- Clear browser cache and cookies
- Try different browser or incognito mode
Error: "RESOURCE_DOES_NOT_EXIST"
What it means: Cluster, job, or resource not found.
Causes:
- Cluster terminated
- Job deleted
- Incorrect cluster ID
- Workspace permissions changed
- Resource moved to different workspace
Quick fixes:
- Verify resource ID is correct
- Check if cluster was auto-terminated
- Start a new cluster if needed
- Check workspace permissions
- Verify you're in the correct workspace
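To confirm whether a cluster ID still exists (and what state it's in), you can list clusters from code. A minimal sketch assuming the databricks-sdk package is installed and DATABRICKS_HOST/DATABRICKS_TOKEN are set in the environment:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()                 # picks up host/token from the environment
target_id = "<cluster-id>"            # the ID from the RESOURCE_DOES_NOT_EXIST error

for cluster in w.clusters.list():
    if cluster.cluster_id == target_id:
        print(f"Found '{cluster.cluster_name}' in state {cluster.state}")
        break
else:
    print("Cluster ID not found in this workspace -- check the ID and the workspace.")
```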
Error: "Cluster Failed to Start: Cloud Provider Error"
What it means: Can't provision cloud resources for cluster.
Causes:
- Cloud provider capacity limits (no available VMs)
- Insufficient cloud account quotas
- Service limits exceeded
- Regional outage
- IAM/permissions issues
- Invalid instance type
Quick fixes:
- Check cloud provider quotas:
- AWS: EC2 vCPU limits
- Azure: VM core limits
- GCP: Compute Engine quotas
- Try different instance type:
- Use smaller instance size
- Switch to different instance family
- Try different availability zone:
- Edit cluster config → Availability → Change zone
- Request quota increase:
- AWS: Service Quotas console
- Azure: Subscription β Usage + quotas
- GCP: IAM & Admin β Quotas
- Retry in a few minutes:
- Transient capacity issues often resolve quickly
Error: "Authentication Failed" / "Invalid Access Token"
What it means: Can't authenticate to Databricks.
Causes:
- Token expired
- Token revoked
- Wrong token for workspace
- SSO/SAML issues
- Permissions changed
Quick fixes:
- Generate new personal access token:
- Workspace → Settings → User Settings → Access Tokens
- Generate New Token
- Copy and save securely
- Check token permissions:
- Token must have appropriate scopes
- Check workspace admin didn't revoke access
- Re-authenticate CLI:
databricks auth login --host <workspace-url>
- Check SSO status:
- Try logging in via browser first
- SSO provider might be down
- Check with IT if corporate SSO
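A quick way to tell a dead token from a dead workspace is to call the SCIM "Me" endpoint, which simply returns the identity behind the token. A hedged sketch with placeholder values:

```python
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
if resp.status_code == 200:
    print("Token is valid for:", resp.json().get("userName"))
elif resp.status_code in (401, 403):
    print("Token expired, revoked, or for the wrong workspace -- generate a new one.")
else:
    print("Unexpected response:", resp.status_code, resp.text)
```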
Error: "Notebook Execution Failed: Cluster Terminated"
What it means: Cluster stopped while notebook was running.
Causes:
- Auto-termination triggered (idle timeout)
- Cluster crashed (OOM, driver failure)
- Cloud provider spot instance preempted
- Manual termination
- Cost limits exceeded
Quick fixes:
- Check cluster event log:
- Compute → Click cluster → Event Log tab
- Look for termination reason
- Restart cluster:
- Click "Start" on terminated cluster
- Or create new cluster
- Adjust auto-termination:
- Edit cluster → Auto Termination
- Set longer timeout (60-120 minutes)
- Use on-demand instances:
- Edit cluster → AWS/Azure/GCP settings
- Disable Spot/Preemptible instances
- Increase cluster resources:
- OOM errors? Add more memory/nodes
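You can also pull the termination reason programmatically instead of clicking through the Event Log tab. A sketch using the databricks-sdk package (the cluster ID is a placeholder; auth comes from the environment):

```python
import itertools
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
cluster_id = "<cluster-id>"

# Print the 20 most recent lifecycle events; termination events carry a reason in details.
for event in itertools.islice(w.clusters.events(cluster_id=cluster_id), 20):
    print(event.timestamp, event.type, event.details)
```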
Error: "Job Run Failed: Cannot Create Run"
What it means: Job scheduler can't start new job run.
Causes:
- Cluster pool exhausted
- Job concurrency limits
- Cluster policy restrictions
- Permissions issues
- Cluster configuration errors
Quick fixes:
- Check job run history:
- Workflows → Your Job → Run History
- Look for error details
- Check cluster availability:
- If using cluster pool, check pool capacity
- Try running manually first
- Check job concurrency:
- Edit job → Advanced → Max Concurrent Runs
- Increase if needed
- Verify cluster config:
- Job cluster configuration valid?
- Instance types available?
- Check permissions:
- User has "Can Manage Run" permission?
Error: "Delta Table Transaction Conflict"
What it means: Concurrent writes to same Delta table failed.
Causes:
- Multiple jobs writing simultaneously
- Optimistic concurrency conflict
- Incomplete transactions
- Table locked
Quick fixes:
- Retry transaction:
- Delta handles most conflicts automatically
- Retry usually succeeds
- Check concurrent jobs:
- Multiple jobs writing to same table?
- Add job dependencies or locks
- Run OPTIMIZE:
OPTIMIZE delta.`/path/to/table`
- Check table history:
DESCRIBE HISTORY delta.`/path/to/table`
- Increase retry settings:
spark.conf.set("spark.databricks.delta.retryWriteConflict.enabled", "true")
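If a pipeline hits conflicts regularly, a small retry wrapper around the write is often enough. This is a sketch only (my_df and the table path are placeholders); tune the retry count and backoff to your workload:

```python
import time

def write_with_retry(df, path, retries=3, backoff_seconds=30):
    for attempt in range(1, retries + 1):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except Exception as exc:               # e.g. a ConcurrentAppendException
            if "Concurrent" not in str(exc) or attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)   # linear backoff between attempts

write_with_retry(my_df, "/path/to/table")   # my_df is your existing DataFrame
```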
Error: "SQL Warehouse Connection Failed"
What it means: Can't connect to SQL Warehouse endpoint.
Causes:
- Warehouse stopped
- Warehouse starting up
- Network connectivity
- Authentication failure
- Warehouse configuration error
Quick fixes:
- Check warehouse status:
- SQL Warehouses → Your Warehouse → Status
- Start if stopped
- Wait for startup:
- Warehouses take 1-3 minutes to start
- Check status indicator
- Test connection string:
# Test warehouse endpoint reachability
curl https://<workspace>.cloud.databricks.com/sql/1.0/warehouses/<warehouse-id>
- Check network access:
- IP Access Lists blocking you?
- VPN required for workspace?
- Verify credentials:
- Token valid and not expired?
- User has warehouse access?
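For a programmatic connection test, the databricks-sql-connector package gives you the same check a BI tool would make. A sketch; copy the hostname and HTTP path from the warehouse's Connection Details tab:

```python
from databricks import sql   # pip install databricks-sql-connector

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<your-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())   # [(1,)] means the warehouse is reachable
```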
Quick Fixes: Databricks Not Working?
Fix #1: Restart Cluster (The Classic)
Why it works: Clears connection cache, restarts Spark driver, resets configurations.
How to do it right:
For interactive clusters:
- Compute → Select your cluster
- Click "Restart" (not "Terminate")
- Wait 2-5 minutes for startup
- Check cluster event log if restart fails
For job clusters:
- Workflows → Select job
- "Run Now" creates new cluster automatically
- Or edit job → Cluster → Change configuration
- Save and run
Pro tip: Use "Restart" not "Terminate" to keep cluster config and libraries installed.
Fix #2: Check Cloud Provider Quotas
Databricks needs cloud resources, and quotas limit what you can provision.
Common quota issues:
AWS:
- vCPU limits: Default 5 vCPUs per instance type
- Spot instance limits: Lower than on-demand
- EBS volume limits: Storage quotas
Check AWS quotas:
- AWS Console → Service Quotas
- Search "EC2"
- Look for "Running On-Demand instances"
- Request increase if needed
Azure:
- VM core limits: Total vCPUs per region
- Spot VM limits: Separate quota
- Storage account limits: IOPS/throughput
Check Azure quotas:
- Azure Portal → Subscriptions
- Usage + quotas
- Search "Compute"
- Request increase if needed
GCP:
- Compute Engine quotas: CPUs, GPUs, IP addresses
- Preemptible VM quotas: Separate from regular VMs
- Persistent disk quotas: Storage limits
Check GCP quotas:
- GCP Console → IAM & Admin → Quotas
- Filter by "Compute Engine"
- Request increase if needed
Pro tip: Request quota increases before launching large clusters. Approval can take 24-48 hours.
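On AWS you can audit the relevant EC2 quotas from code rather than clicking through the console. A boto3 sketch (the region and the substring filter are assumptions; adjust both for your account):

```python
import boto3

client = boto3.client("service-quotas", region_name="us-east-1")
paginator = client.get_paginator("list_service_quotas")

# Print every EC2 quota whose name mentions On-Demand instances.
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "On-Demand" in quota["QuotaName"]:
            print(f"{quota['QuotaName']}: {quota['Value']}")
```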
Fix #3: Clear Browser Cache and Cookies
Workspace UI issues often caused by stale cache.
Chrome:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "All time"
- Check: Cookies, Cached images and files
- Click "Clear data"
- Reload workspace
Firefox:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "Everything"
- Check: Cookies, Cache
- Click "Clear Now"
- Reload workspace
Safari:
- Safari → Preferences → Privacy
- Click "Manage Website Data"
- Remove databricks.com entries
- Reload workspace
Quick test: Try incognito/private mode first; if it works, the cache is the issue.
Fix #4: Check Cluster Logs
Cluster logs show what went wrong.
View cluster logs:
- Compute → Select cluster
- Click "Event Log" tab (for cluster lifecycle events)
- Click "Spark UI" → Executors (for Spark errors)
- Click "Driver Logs" (for detailed driver errors)
Common log messages:
"Driver not responding":
- Driver crashed (OOM, error)
- Network connectivity lost
- Fix: Increase driver memory, check network
"Executor lost":
- Executor node failed
- Cloud provider reclaimed spot instance
- Fix: Use on-demand instances, add retry logic
"Failed to bind to port":
- Port conflict (rare)
- Fix: Restart cluster, try different cluster
"Cannot connect to S3/ADLS/GCS":
- Storage credentials expired/invalid
- Fix: Update workspace storage credentials
Fix #5: Verify Network Configuration
Network issues prevent cluster communication.
Check VPC/VNet configuration:
AWS:
- VPC must allow outbound internet (for cluster communication)
- Security groups must allow internal cluster traffic
- Subnet must have NAT gateway or internet gateway
- Check: Databricks workspace → Settings → Network
Azure:
- VNet must allow outbound internet
- NSG rules must allow cluster communication
- Subnet delegation required for Databricks
- Check: Azure Portal → Virtual Networks
GCP:
- VPC must allow outbound internet
- Firewall rules must allow cluster traffic
- Subnet must have Private Google Access enabled
- Check: GCP Console → VPC Networks
Quick test:
# From cluster notebook, test outbound connectivity
%sh
curl -I https://pypi.org
If curl fails:
- Network configuration issue
- Check firewall/security groups
- Verify NAT gateway/internet gateway configured
Fix #6: Update Libraries and Dependencies
Outdated or conflicting libraries cause failures.
Check installed libraries:
- Compute → Select cluster
- Click "Libraries" tab
- Look for red "Failed" status
Common library issues:
"Library installation failed":
- PyPI/Maven package not found
- Network connectivity to package repository
- Conflicting dependencies
Fix:
- Remove failing library
- Restart cluster
- Install compatible version
- Check library logs for details
Best practices:
- Pin library versions (pandas==1.5.3, not just pandas)
- Test libraries on a test cluster first
- Use init scripts for complex setups
- Avoid conflicting libraries (e.g., TensorFlow + PyTorch issues)
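You can also check library status without opening the Libraries tab, via the Libraries REST API. A sketch with placeholder values:

```python
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": "<cluster-id>"},
    timeout=30,
)
for lib in resp.json().get("library_statuses", []):
    # Status values include PENDING, INSTALLING, INSTALLED, FAILED
    print(lib["library"], lib["status"], lib.get("messages", []))
```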
Fix #7: Check Workspace Storage Credentials
Databricks needs credentials to access your cloud storage.
AWS S3:
- IAM role attached to cluster
- Instance profile configured
- S3 bucket policy allows Databricks role
Check credentials:
# Test S3 access from notebook
dbutils.fs.ls("s3://your-bucket/")
If access denied:
- Workspace Admin → Settings → AWS Credentials
- Verify instance profile ARN correct
- Check S3 bucket policy
- Test with aws s3 ls from the cluster
Azure ADLS:
- Service principal credentials
- OAuth tokens
- Managed identity
Check credentials:
# Test ADLS access
dbutils.fs.ls("abfss://container@storage.dfs.core.windows.net/")
If access denied:
- Workspace Settings → Azure ADLS Gen2
- Verify service principal credentials
- Check storage account IAM roles
GCP GCS:
- Service account keys
- Workload identity
Check credentials:
# Test GCS access
dbutils.fs.ls("gs://your-bucket/")
If access denied:
- Workspace Settings → GCP Credentials
- Verify service account has Storage Object Admin role
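A quick way to check all three storage flavors at once is a small loop in a notebook (dbutils is only defined inside Databricks notebooks; the paths below are placeholders):

```python
paths = [
    "s3://your-bucket/",
    "abfss://container@storage.dfs.core.windows.net/",
    "gs://your-bucket/",
]

for path in paths:
    try:
        dbutils.fs.ls(path)
        print(f"OK      {path}")
    except Exception as exc:
        print(f"FAILED  {path}: {exc}")
```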
Fix #8: Adjust Cluster Configuration
Wrong cluster config causes failures.
Common configuration issues:
1. Instance type not available:
- Try different instance type
- Check cloud provider availability
- Use instance pool for guaranteed capacity
2. Insufficient resources:
- Increase driver memory (Edit → Driver → Memory)
- Add more worker nodes
- Use larger instance types
3. Auto-scaling issues:
- Set min/max workers appropriately
- Don't set min = max (disables autoscaling)
- Allow 2-3x headroom for scaling
4. Spark configuration:
- Check advanced options for custom Spark configs
- Common settings:
spark.sql.shuffle.partitions 200
spark.executor.memory 4g
spark.driver.memory 8g
Test configuration:
- Create new cluster with default config
- If works, issue was custom config
- Add custom configs one by one to isolate problem
Databricks Workspace Not Loading?
Issue: "Workspace URL Not Responding"
Troubleshoot:
1. Check workspace status:
- Log into cloud provider console (AWS/Azure/GCP)
- Find Databricks workspace resource
- Check if workspace running/healthy
- Check for cost/budget alerts (workspace suspended?)
2. Check DNS resolution:
# Test DNS lookup
nslookup <workspace-url>.cloud.databricks.com
If DNS fails:
- DNS server issue
- Try Google DNS (8.8.8.8)
- Try different network
3. Check browser:
- Try incognito mode
- Try different browser
- Clear cache and cookies (see Fix #3)
- Check browser console for errors (F12)
4. Check network:
- Disable VPN temporarily
- Try mobile hotspot (bypass corporate network)
- Check firewall rules
- Try from different location
Issue: "403 Forbidden" or "Access Denied"
Troubleshoot:
1. Check workspace permissions:
- Workspace admin may have removed your access
- Contact workspace admin
- Check email for access revocation notice
2. Check IP Access Lists:
- Workspace → Settings → IP Access Lists
- Your IP might be blocked
- VPN might change your IP to blocked range
3. Check SSO/SAML:
- Corporate SSO might be down
- Re-authenticate via SSO portal
- Contact IT if persistent
4. Check user status:
- User account might be disabled
- Check with workspace admin
Issue: "Slow Workspace Performance"
Causes:
- Too many notebooks/jobs open
- Large result sets loading
- Browser memory exhaustion
- Network latency
Fixes:
1. Close unused notebooks:
- File → Close other notebooks
- Detach from clusters when not in use
2. Limit result display:
# Don't display huge DataFrames
# Instead of:
display(df)
# Use:
display(df.limit(100))
3. Clear output:
- Cell menu → Clear All Outputs
- Reduces page memory
4. Use dedicated browser:
- Use separate browser profile for Databricks
- Avoid 50+ tabs in same browser
Databricks Clusters Not Starting?
Issue: "Cluster Stuck on 'Pending'"
Troubleshoot:
1. Check cloud provider capacity:
- No available VMs in region/zone
- Try different instance type
- Try different availability zone
- Use on-demand instead of spot
2. Check cluster event log:
- Compute → Cluster → Event Log
- Look for error messages
- Common: "Cannot launch instances", "Insufficient capacity"
3. Check quotas:
- See Fix #2 (Check Cloud Provider Quotas)
- Request quota increase if needed
4. Wait and retry:
- Capacity issues often transient
- Wait 10-15 minutes
- Terminate and restart cluster
Issue: "Cluster Starts Then Immediately Terminates"
Troubleshoot:
1. Check init scripts:
- Init script failure causes cluster termination
- Edit cluster → Init Scripts → Remove temporarily
- Test if cluster starts without init scripts
- Fix init script errors
2. Check cluster policy:
- Policy restrictions preventing cluster launch?
- Contact workspace admin
- Try cluster without policy
3. Check driver logs:
- Compute → Cluster → Driver Logs
- Look for startup errors
- Common: Library conflicts, configuration errors
4. Check instance profile/service principal:
- Invalid credentials cause startup failure
- Test credentials separately
- Update workspace credentials if needed
Issue: "Cluster Running But Notebooks Won't Execute"
Troubleshoot:
1. Detach and reattach notebook:
- Notebook → Cluster dropdown → Detach
- Wait 10 seconds
- Reattach to cluster
2. Check cluster status:
- Green = Running
- Gray = Stopped
- Orange = Starting/Restarting
- Red = Failed
3. Check notebook language:
- Notebook language must match cluster
- SQL notebooks need SQL-compatible cluster
- Python notebooks work on all clusters
4. Test with simple command:
# Test if cluster responding
print("Hello from cluster!")
If timeout:
- Cluster may be overloaded
- Check Spark UI → Executors → Active tasks
- Restart cluster if needed
Databricks Jobs Not Running?
Issue: "Job Stuck in 'Pending' State"
Troubleshoot:
1. Check job queue:
- Workflows → Job runs
- Look for many pending runs
- Max concurrent runs limit reached?
2. Check cluster availability:
- If using cluster pool, pool might be empty
- If using existing cluster, cluster might be stopped
- Try "Run Now" manually to test
3. Check permissions:
- User must have "Can Manage Run" permission
- Check job → Permissions tab
- Contact job owner if needed
4. Check job schedule:
- Edit job → Schedule
- Verify schedule is enabled
- Check if manual pause enabled
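Triggering the run from code separates scheduler problems from job problems. A sketch using the databricks-sdk package; the job ID is a placeholder, and .result() blocks until the run finishes:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
run = w.jobs.run_now(job_id=123456789).result()
print(run.run_id, run.state.result_state)
```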
Issue: "Job Runs But Fails Immediately"
Troubleshoot:
1. Check job run output:
- Workflows → Job → Latest run → View run
- Click failed task
- Check error message and stack trace
2. Check notebook/script:
- Syntax errors
- Missing parameters
- Broken dependencies
- Test manually in notebook first
3. Check job parameters:
- Edit job → Parameters
- Verify parameter values correct
- Especially file paths, credentials
4. Check cluster logs:
- Click failed run → Cluster Logs
- Look for startup or execution errors
Issue: "Job Runs Slower Than Expected"
Causes:
- Undersized cluster
- Data skew
- Inefficient queries
- Cold start (cluster creation time)
Fixes:
1. Use existing cluster:
- Edit job → Cluster → Use existing cluster
- Avoid cold start time
- But: cluster must be running when job triggers
2. Use cluster pools:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Edit job → Cluster → Pool
3. Optimize job:
- Check Spark UI for bottlenecks
- Reduce data shuffles
- Add partitioning
- Cache intermediate results
4. Scale up cluster:
- Increase worker nodes
- Use larger instance types
- Enable autoscaling
Databricks SQL Warehouses Issues?
Issue: "SQL Warehouse Won't Start"
Troubleshoot:
1. Check warehouse size:
- Larger warehouses take longer (1-3 minutes)
- Wait patiently
- Check status indicator
2. Check cloud quotas:
- Same quota issues as clusters
- See Fix #2 (Check Cloud Provider Quotas)
3. Check permissions:
- User must have "Can Use" permission
- SQL Warehouses → Warehouse → Permissions
4. Check workspace status:
- Control plane down affects warehouse startup
- Check status.databricks.com
Issue: "SQL Query Timeout"
Troubleshoot:
1. Check query complexity:
- Large joins, aggregations take time
- Break into smaller queries
- Add filters to reduce data scanned
2. Increase warehouse size:
- Edit warehouse → Cluster size
- Larger = more query slots, faster execution
- 2X-Large for heavy workloads
3. Check query queue:
- SQL Warehouses → Query History
- Too many concurrent queries?
- Increase warehouse cluster size or concurrency
4. Optimize query:
-- Add filters to reduce data
SELECT * FROM large_table
WHERE date >= '2026-01-01' -- Partition filter
LIMIT 1000
-- Use materialized views for common queries
CREATE MATERIALIZED VIEW <view_name> AS ...
Issue: "SQL Dashboard Not Loading"
Troubleshoot:
1. Check warehouse status:
- Dashboard queries need running warehouse
- Start warehouse if stopped
- Auto-stop might have terminated it
2. Check query refresh:
- Dashboard → Refresh settings
- Manual refresh vs auto-refresh
- Long-running queries block dashboard
3. Check data permissions:
- Unity Catalog permissions required
- User must have SELECT on tables
- Check with data owner
4. Check widget queries:
- Dashboard → Edit → Check each widget
- Individual widget query might be failing
- Fix or disable problematic widgets
Unity Catalog Issues?
Issue: "Cannot Access Table: Unity Catalog Error"
Troubleshoot:
1. Check catalog permissions:
-- Check grants on catalog
SHOW GRANTS ON CATALOG your_catalog;
-- Check grants on schema
SHOW GRANTS ON SCHEMA your_catalog.your_schema;
-- Check grants on table
SHOW GRANTS ON TABLE your_catalog.your_schema.your_table;
2. Request access:
- Contact data owner
- Use "Request Access" button in Catalog Explorer
- Workspace admin can grant permissions
3. Check catalog exists:
-- List available catalogs
SHOW CATALOGS;
-- List schemas in catalog
SHOW SCHEMAS IN your_catalog;
4. Check table path:
- Unity Catalog uses three-level namespace
- Format: catalog.schema.table
- Check for typos
Issue: "Metastore Connection Failed"
Troubleshoot:
1. Check workspace metastore assignment:
- Workspace Settings → Unity Catalog
- Verify metastore assigned to workspace
- Contact workspace admin if not assigned
2. Check network connectivity:
- Metastore in different region?
- Network rules blocking connection?
- Check VPC/VNet peering if using private connectivity
3. Check metastore status:
- Account Console → Metastores
- Check if metastore healthy
- Look for error messages
Delta Lake Issues?
Issue: "Delta Table Not Found"
Troubleshoot:
1. Check table path:
# Verify path exists
dbutils.fs.ls("dbfs:/path/to/delta/table")
# Or for Unity Catalog
spark.sql("DESCRIBE TABLE your_catalog.your_schema.your_table")
2. Check table registration:
-- List tables in schema
SHOW TABLES IN your_schema;
-- Register external Delta table
CREATE TABLE your_table
USING DELTA
LOCATION '/path/to/delta/table';
3. Check permissions:
- Read permissions on storage location
- Unity Catalog permissions if using UC
- Check with workspace admin
Issue: "Delta Transaction Failed"
Troubleshoot:
1. Retry operation:
- Delta handles most conflicts automatically
- Simply retry the operation
2. Check concurrent writes:
- Multiple jobs writing same table?
- Use merge operations instead of inserts
- Add transaction isolation
3. Run table maintenance:
-- Optimize table
OPTIMIZE your_table;
-- Vacuum old files (default 7 day retention)
VACUUM your_table RETAIN 168 HOURS;
-- Check table history
DESCRIBE HISTORY your_table;
Regional Outages: Is It Just Me?
Databricks deploys across multiple cloud regions:
| Cloud Provider | Common Regions |
|---|---|
| AWS | us-east-1, us-west-2, eu-west-1, ap-southeast-1 |
| Azure | East US, West Europe, Southeast Asia, UK South |
| GCP | us-central1, europe-west1, asia-southeast1 |
How to check for regional issues:
1. Check DownDetector:
downdetector.com/status/databricks
Shows:
- Real-time outage reports
- Heatmap of affected regions
- Spike in reports = likely real outage
2. Check cloud provider status:
- AWS outage might affect only us-east-1
- Azure issue might affect only one region
- GCP regional issues isolated
3. Check Databricks status by region:
- status.databricks.com
- Filter by cloud provider and region
- Subscribe to your specific region alerts
4. Test from different region:
- If available, try workspace in different region
- Isolates if issue is regional vs global
When Databricks Actually Goes Down
What Happens
Recent major outages:
- October 2023: 4-hour AWS us-east-1 control plane outage
- July 2023: 2-hour authentication service disruption (all clouds)
- March 2023: 6-hour Azure East US regional outage
Typical causes:
- Cloud provider outages (AWS/Azure/GCP failures)
- Control plane authentication issues
- Network connectivity problems
- Database backend failures
- Deployment issues (rare)
How Databricks Responds
Communication channels:
- status.databricks.com - Primary source
- @databricks on Twitter/X
- Email alerts (if subscribed to status page)
- In-app notifications (if workspace accessible)
Timeline:
- 0-15 min: Users report issues on Twitter/DownDetector
- 15-30 min: Databricks acknowledges on status page
- 30-120 min: Updates posted every 30 min
- Resolution: Usually 1-4 hours for major outages
What to Do During Outages
1. Check if data plane still works:
- Running clusters may continue working
- Jobs may complete even if UI is down
- Check via CLI:
databricks clusters list
2. Use backup compute:
- AWS EMR for Spark workloads
- Azure HDInsight or Synapse
- GCP Dataproc
- Run critical jobs elsewhere temporarily
3. Monitor status page:
- status.databricks.com
- Subscribe to SMS/email updates
- Check estimated time to resolution
4. Document impact:
- Note affected jobs/workflows
- Capture error messages
- Will help with root cause analysis later
5. Prepare for recovery:
- Have restart procedures ready
- Check data consistency after outage
- Re-run failed jobs when service restored
Databricks Down Checklist
Follow these steps in order:
Step 1: Verify it's actually down
- Check Databricks Status
- Check API Status Check
- Check cloud provider status (AWS/Azure/GCP)
- Search Twitter: "Databricks down"
- Test REST API: curl the workspace API endpoint
- Try different browser/incognito mode
Step 2: Quick fixes (if Databricks is up)
- Restart cluster
- Clear browser cache and cookies
- Check cluster event logs
- Verify network connectivity
- Check cloud provider quotas
- Update workspace credentials
Step 3: Cluster troubleshooting
- Check cluster configuration (instance type, size)
- Verify cloud provider capacity available
- Check init scripts (disable temporarily)
- Review driver logs for errors
- Test with default cluster config
- Check cluster policy restrictions
Step 4: Network troubleshooting
- Test outbound connectivity from cluster
- Verify VPC/VNet configuration
- Check security group / NSG rules
- Verify NAT gateway / internet gateway
- Check storage credentials (S3/ADLS/GCS)
- Test with different network/VPN
Step 5: Job/workflow troubleshooting
- Check job run history and error messages
- Test notebook manually first
- Verify job parameters correct
- Check job permissions
- Review cluster logs for failed run
- Test with simpler job configuration
Step 6: Nuclear option
- Create new cluster with default config
- Re-import notebook from revision history
- Contact Databricks support: databricks.com/support
- Open ticket with cloud provider if quota/capacity issue
Prevent Future Issues
1. Set Up Proactive Monitoring
Monitor Databricks status:
- Subscribe to status.databricks.com (email/SMS)
- Use API Status Check for automated monitoring
- Set up Slack/Discord/email alerts for outages
Monitor your workloads:
# Add health checks to critical notebooks
try:
    # Your data pipeline code
    df = spark.read.table("my_table")
except Exception as e:
    # Alert on failure
    dbutils.notebook.exit(f"FAILED: {str(e)}")
Monitor cluster health:
- Set up alerts for cluster failures
- Monitor job success rates
- Track cluster startup times (increasing = potential issues)
2. Use Cluster Pools
Why cluster pools help:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Guaranteed capacity
- Consistent environment
Create cluster pool:
- Compute → Pools → Create Pool
- Set min/max idle instances
- Choose instance type
- Use pool for interactive clusters and jobs
Pro tip: Size pool based on peak demand. Keep 2-3 idle instances ready.
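Pools can be created from code as well, which makes it easier to keep dev and prod consistent. A sketch using the databricks-sdk package; the pool name, node type, and sizes are placeholders to adapt:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
pool = w.instance_pools.create(
    instance_pool_name="etl-pool",
    node_type_id="i3.xlarge",    # pick a node type available in your region/cloud
    min_idle_instances=2,        # warm instances kept ready for fast cluster startup
    max_capacity=20,
)
print("Created pool:", pool.instance_pool_id)
```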
3. Build Redundancy
For critical pipelines:
Multi-region strategy:
- Deploy workspaces in multiple regions
- Failover to backup region during outage
- Use cross-region storage replication
Retry logic:
from retry import retry
@retry(tries=3, delay=60)
def run_critical_job():
    # Your job code
    spark.sql("INSERT INTO target_table SELECT * FROM source_table")
Backup compute:
- Keep alternate compute ready (EMR, HDInsight, Dataproc)
- Document failover procedures
- Test failover quarterly
4. Optimize Cluster Configuration
Right-size clusters:
- Start small, scale up as needed
- Use autoscaling for variable workloads
- Don't over-provision (wastes cost)
Best practices:
- Use cluster pools for fast startup
- Set appropriate auto-termination (30-60 min idle)
- Use spot/preemptible for non-critical workloads
- Pin library versions for consistency
Test configurations:
- Test new configs on dev cluster first
- Gradually roll out to production
- Monitor performance metrics
5. Implement Job Orchestration Best Practices
Job dependencies:
- Use Databricks Workflows for orchestration
- Set proper task dependencies
- Add retry policies (3 retries with exponential backoff)
Job monitoring:
# Send notifications on job completion
import json

dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "rows_processed": row_count,
    "duration_seconds": duration
}))
Failure handling:
- Set up alerts for job failures
- Use dead letter queues for failed records
- Log detailed error messages
6. Maintain Cloud Provider Health
Monitor quotas:
- Set up alerts for quota usage (80% threshold)
- Request quota increases proactively
- Keep buffer for burst capacity
Track cloud provider status:
- Subscribe to AWS/Azure/GCP status pages
- Monitor your specific regions
- Note cloud provider maintenance windows
Resource management:
- Clean up unused clusters/pools
- Delete old job runs (retention policy)
- Archive unused notebooks/data
7. Keep Credentials Updated
Regular credential rotation:
- Rotate personal access tokens quarterly
- Update service principal credentials before expiration
- Test credentials after rotation
Credential management:
- Use Databricks Secrets for sensitive data
- Avoid hardcoding credentials
- Use service principals for automation
# Use Databricks Secrets
secret = dbutils.secrets.get(scope="my_scope", key="api_key")
8. Document Your Setup
Critical documentation:
- Cluster configurations (save as JSON)
- Job configurations and dependencies
- Network architecture (VPC/VNet setup)
- Credential management procedures
- Incident response runbooks
Keep updated:
- Review docs quarterly
- Update after any config changes
- Share with team members
Key Takeaways
Before assuming Databricks is down:
- ✅ Check Databricks Status
- ✅ Check cloud provider status (AWS/Azure/GCP)
- ✅ Test REST API with curl
- ✅ Search Twitter for "Databricks down"
- ✅ Check cluster event logs and driver logs
Common fixes:
- Restart cluster (fixes 50% of issues)
- Check cloud provider quotas (capacity limits)
- Clear browser cache and cookies
- Verify network configuration (VPC/VNet)
- Update storage credentials
- Adjust cluster configuration (instance type, size)
Cluster issues:
- Check event logs for startup failures
- Verify cloud provider capacity available
- Use on-demand instances instead of spot
- Remove init scripts temporarily to test
- Check cluster policy restrictions
Job/workflow issues:
- Test notebooks manually first
- Check job run history for error details
- Verify job parameters and permissions
- Review cluster logs for failed runs
- Add retry logic and monitoring
SQL Warehouse issues:
- Wait for startup (1-3 minutes)
- Check warehouse permissions
- Increase warehouse size for heavy workloads
- Optimize slow queries
Unity Catalog issues:
- Check three-level namespace (catalog.schema.table)
- Verify permissions with SHOW GRANTS
- Request access from data owner
If Databricks is actually down:
- Monitor status.databricks.com
- Running clusters may continue working
- Use backup compute for critical jobs
- Usually resolved within 1-4 hours
Prevent future issues:
- Use cluster pools for fast, reliable startup
- Build retry logic into critical jobs
- Monitor proactively with alerts
- Keep cloud quotas sized appropriately
- Document configurations and procedures
- Test failover scenarios regularly
Remember: Most "Databricks down" issues are actually cluster configuration, cloud provider quotas, network setup, or permissions problems. Try the fixes in this guide before assuming Databricks is down.
Need real-time Databricks status monitoring? Track Databricks uptime with API Status Check - Get instant alerts when Databricks goes down.
Related Resources
- Is Databricks Down Right Now? – Live status check
- Databricks Outage History – Past incidents and timeline
- Databricks vs Snowflake Uptime – Which platform is more reliable?
- API Outage Response Plan – How to handle downtime like a pro
Monitor Your APIs
Check the real-time status of 100+ popular APIs used by developers.
View API Status →