Is Databricks Down? Complete Status Check Guide + Quick Fixes
Databricks workspace won't load?
Clusters stuck starting?
Jobs failing with connection errors?
Before panicking, verify if Databricks is actually down, or if it's a problem with your workspace, clusters, or network. Here's your complete guide to checking Databricks status and fixing common issues fast.
Quick Check: Is Databricks Actually Down?
Don't assume it's Databricks. 70% of "Databricks down" reports are actually workspace configuration issues, cluster startup failures, cloud provider problems, or networking misconfigurations.
1. Check Official Sources
Databricks Status Page:
status.databricks.com
What to look for:
- β "All Systems Operational" = Databricks is fine
- β οΈ "Partial Service Disruption" = Some services affected
- π΄ "Service Disruption" = Databricks is down
Real-time updates:
- Control Plane status (workspace access, authentication)
- Data Plane status (clusters, jobs, notebooks)
- Regional outages (AWS, Azure, GCP)
- API availability
- SQL Warehouses status
- Unity Catalog status
Twitter/X Search:
Search "Databricks down" on Twitter
Why it works:
- Users report outages instantly
- See if others in your region affected
- Databricks team responds here
Pro tip: If 100+ tweets in the last hour mention "Databricks down," it's probably actually down.
2. Check Service-Specific Status
Databricks has multiple services that can fail independently:
| Service | What It Does | Status Check |
|---|---|---|
| Control Plane | Workspace UI, authentication, API | status.databricks.com |
| Data Plane | Clusters, jobs, notebooks, compute | Check status page under "Data Plane" |
| SQL Warehouses | SQL endpoints, queries, dashboards | Check status page under "SQL" |
| Unity Catalog | Data governance, metadata | Check status page under "Unity Catalog" |
| Delta Lake | Table reads/writes, transactions | Check status page under "Delta" |
| MLflow | Model tracking, registry | Check status page under "MLflow" |
| Jobs/Workflows | Scheduled jobs, orchestration | Check status page under "Jobs" |
Your service might be down while Databricks globally is up.
How to check which service is affected:
- Visit status.databricks.com
- Look for specific service status
- Check your cloud provider region (AWS us-east-1, Azure East US, etc.)
- Check "Incident History" for recent issues
- Subscribe to status updates (email/SMS)
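If you want to watch the status page from a script instead of a browser, here is a minimal sketch. It assumes the page exposes a Statuspage-style /api/v2/status.json endpoint, which may not match the actual implementation of status.databricks.com; verify against the page itself before relying on it.

```python
# Hypothetical status poll -- the JSON endpoint below is an assumption,
# not a documented Databricks API. Adjust to whatever the status page exposes.
import requests

def check_databricks_status(url="https://status.databricks.com/api/v2/status.json"):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        # Statuspage convention: "none" = operational, "minor"/"major"/"critical" = degraded
        return resp.json().get("status", {}).get("indicator", "unknown")
    except requests.RequestException as exc:
        return f"unreachable: {exc}"

print(check_databricks_status())
```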
3. Check Cloud Provider Status
Databricks runs on cloud providers, so their outages affect Databricks.
| Cloud Provider | Status Page | What to Check |
|---|---|---|
| AWS | health.aws.amazon.com | EC2, S3, IAM in your region |
| Azure | status.azure.com | Virtual Machines, Storage, Active Directory |
| GCP | status.cloud.google.com | Compute Engine, Cloud Storage |
Decision tree:
Cloud provider down + Databricks status OK → Cloud provider issue
Cloud provider OK + Databricks status down → Databricks issue
Both OK + Your workspace down → Workspace configuration issue
Specific region down → Regional cloud outage
4. Test Different Access Methods
If workspace UI fails but REST API works, it's likely a browser/network issue.
| Access Method | Test Method |
|---|---|
| Workspace UI | Try loading your workspace URL |
| REST API | Test API endpoint with curl |
| CLI | Run databricks workspace list |
| JDBC/ODBC | Try SQL Warehouse connection |
Quick API test:
# Test Databricks REST API
curl -H "Authorization: Bearer <your-token>" \
https://<workspace-url>/api/2.0/clusters/list
If API works but UI doesn't:
- Clear browser cache
- Try incognito/private mode
- Try different browser
- Check browser console for errors (F12)
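The same API test from Python, if curl isn't handy (a minimal sketch; the placeholders are yours to fill in):

```python
# Calls the Clusters API list endpoint with a personal access token.
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
print(resp.status_code)                      # 200 = API reachable and token accepted
print(resp.json() if resp.ok else resp.text)
```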
Common Databricks Error Messages (And What They Mean)
Error: "Unable to Reach Workspace"
What it means: Can't connect to Databricks workspace.
Causes:
- Network connectivity issues
- DNS resolution failure
- Workspace suspended/deleted
- VPN/proxy interference
- Browser cache corruption
Quick fixes:
- Check if databricks.com loads in browser
- Verify workspace URL is correct
- Check workspace status in cloud provider console
- Disable VPN temporarily
- Clear browser cache and cookies
- Try different browser or incognito mode
Error: "RESOURCE_DOES_NOT_EXIST"
What it means: Cluster, job, or resource not found.
Causes:
- Cluster terminated
- Job deleted
- Incorrect cluster ID
- Workspace permissions changed
- Resource moved to different workspace
Quick fixes:
- Verify resource ID is correct
- Check if cluster was auto-terminated
- Start a new cluster if needed
- Check workspace permissions
- Verify you're in the correct workspace
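To confirm whether a cluster ID still exists (and what state it's in), you can list clusters from code. A minimal sketch assuming the databricks-sdk package is installed and DATABRICKS_HOST/DATABRICKS_TOKEN are set in the environment:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()                 # picks up host/token from the environment
target_id = "<cluster-id>"            # the ID from the RESOURCE_DOES_NOT_EXIST error

for cluster in w.clusters.list():
    if cluster.cluster_id == target_id:
        print(f"Found '{cluster.cluster_name}' in state {cluster.state}")
        break
else:
    print("Cluster ID not found in this workspace -- check the ID and the workspace.")
```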
Error: "Cluster Failed to Start: Cloud Provider Error"
What it means: Can't provision cloud resources for cluster.
Causes:
- Cloud provider capacity limits (no available VMs)
- Insufficient cloud account quotas
- Service limits exceeded
- Regional outage
- IAM/permissions issues
- Invalid instance type
Quick fixes:
- Check cloud provider quotas:
- AWS: EC2 vCPU limits
- Azure: VM core limits
- GCP: Compute Engine quotas
- Try different instance type:
- Use smaller instance size
- Switch to different instance family
- Try different availability zone:
- Edit cluster config → Availability → Change zone
- Request quota increase:
- AWS: Service Quotas console
- Azure: Subscription β Usage + quotas
- GCP: IAM & Admin β Quotas
- Retry in a few minutes:
- Transient capacity issues often resolve quickly
Error: "Authentication Failed" / "Invalid Access Token"
What it means: Can't authenticate to Databricks.
Causes:
- Token expired
- Token revoked
- Wrong token for workspace
- SSO/SAML issues
- Permissions changed
Quick fixes:
- Generate new personal access token:
- Workspace → Settings → User Settings → Access Tokens
- Generate New Token
- Copy and save securely
- Check token permissions:
- Token must have appropriate scopes
- Check workspace admin didn't revoke access
- Re-authenticate CLI:
databricks auth login --host <workspace-url>
- Check SSO status:
- Try logging in via browser first
- SSO provider might be down
- Check with IT if corporate SSO
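A quick way to tell a dead token from a dead workspace is to call the SCIM "Me" endpoint, which simply returns the identity behind the token. A hedged sketch with placeholder values:

```python
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
if resp.status_code == 200:
    print("Token is valid for:", resp.json().get("userName"))
elif resp.status_code in (401, 403):
    print("Token expired, revoked, or for the wrong workspace -- generate a new one.")
else:
    print("Unexpected response:", resp.status_code, resp.text)
```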
Error: "Notebook Execution Failed: Cluster Terminated"
What it means: Cluster stopped while notebook was running.
Causes:
- Auto-termination triggered (idle timeout)
- Cluster crashed (OOM, driver failure)
- Cloud provider spot instance preempted
- Manual termination
- Cost limits exceeded
Quick fixes:
- Check cluster event log:
- Compute → Click cluster → Event Log tab
- Look for termination reason
- Restart cluster:
- Click "Start" on terminated cluster
- Or create new cluster
- Adjust auto-termination:
- Edit cluster → Auto Termination
- Set longer timeout (60-120 minutes)
- Use on-demand instances:
- Edit cluster → AWS/Azure/GCP settings
- Disable Spot/Preemptible instances
- Increase cluster resources:
- OOM errors? Add more memory/nodes
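You can also pull the termination reason programmatically instead of clicking through the Event Log tab. A sketch using the databricks-sdk package (the cluster ID is a placeholder; auth comes from the environment):

```python
import itertools
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
cluster_id = "<cluster-id>"

# Print the 20 most recent lifecycle events; termination events carry a reason in details.
for event in itertools.islice(w.clusters.events(cluster_id=cluster_id), 20):
    print(event.timestamp, event.type, event.details)
```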
Error: "Job Run Failed: Cannot Create Run"
What it means: Job scheduler can't start new job run.
Causes:
- Cluster pool exhausted
- Job concurrency limits
- Cluster policy restrictions
- Permissions issues
- Cluster configuration errors
Quick fixes:
- Check job run history:
- Workflows → Your Job → Run History
- Look for error details
- Check cluster availability:
- If using cluster pool, check pool capacity
- Try running manually first
- Check job concurrency:
- Edit job → Advanced → Max Concurrent Runs
- Increase if needed
- Verify cluster config:
- Job cluster configuration valid?
- Instance types available?
- Check permissions:
- User has "Can Manage Run" permission?
Error: "Delta Table Transaction Conflict"
What it means: Concurrent writes to same Delta table failed.
Causes:
- Multiple jobs writing simultaneously
- Optimistic concurrency conflict
- Incomplete transactions
- Table locked
Quick fixes:
- Retry transaction:
- Delta handles most conflicts automatically
- Retry usually succeeds
- Check concurrent jobs:
- Multiple jobs writing to same table?
- Add job dependencies or locks
- Run OPTIMIZE:
OPTIMIZE delta.`/path/to/table`
- Check table history:
DESCRIBE HISTORY delta.`/path/to/table`
- Increase retry settings:
spark.conf.set("spark.databricks.delta.retryWriteConflict.enabled", "true")
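If a pipeline hits conflicts regularly, a small retry wrapper around the write is often enough. This is a sketch only (my_df and the table path are placeholders); tune the retry count and backoff to your workload:

```python
import time

def write_with_retry(df, path, retries=3, backoff_seconds=30):
    for attempt in range(1, retries + 1):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except Exception as exc:               # e.g. a ConcurrentAppendException
            if "Concurrent" not in str(exc) or attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)   # linear backoff between attempts

write_with_retry(my_df, "/path/to/table")   # my_df is your existing DataFrame
```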
Error: "SQL Warehouse Connection Failed"
What it means: Can't connect to SQL Warehouse endpoint.
Causes:
- Warehouse stopped
- Warehouse starting up
- Network connectivity
- Authentication failure
- Warehouse configuration error
Quick fixes:
- Check warehouse status:
- SQL Warehouses → Your Warehouse → Status
- Start if stopped
- Wait for startup:
- Warehouses take 1-3 minutes to start
- Check status indicator
- Test connection string:
# Test warehouse endpoint reachability
curl https://<workspace>.cloud.databricks.com/sql/1.0/warehouses/<warehouse-id>
- Check network access:
- IP Access Lists blocking you?
- VPN required for workspace?
- Verify credentials:
- Token valid and not expired?
- User has warehouse access?
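For a programmatic connection test, the databricks-sql-connector package gives you the same check a BI tool would make. A sketch; copy the hostname and HTTP path from the warehouse's Connection Details tab:

```python
from databricks import sql   # pip install databricks-sql-connector

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<your-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())   # [(1,)] means the warehouse is reachable
```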
Quick Fixes: Databricks Not Working?
Fix #1: Restart Cluster (The Classic)
Why it works: Clears connection cache, restarts Spark driver, resets configurations.
How to do it right:
For interactive clusters:
- Compute → Select your cluster
- Click "Restart" (not "Terminate")
- Wait 2-5 minutes for startup
- Check cluster event log if restart fails
For job clusters:
- Workflows → Select job
- "Run Now" creates new cluster automatically
- Or edit job → Cluster → Change configuration
- Save and run
Pro tip: Use "Restart" not "Terminate" to keep cluster config and libraries installed.
Fix #2: Check Cloud Provider Quotas
Databricks needs cloud resources, and quotas limit what you can provision.
Common quota issues:
AWS:
- vCPU limits: Default 5 vCPUs per instance type
- Spot instance limits: Lower than on-demand
- EBS volume limits: Storage quotas
Check AWS quotas:
- AWS Console → Service Quotas
- Search "EC2"
- Look for "Running On-Demand instances"
- Request increase if needed
Azure:
- VM core limits: Total vCPUs per region
- Spot VM limits: Separate quota
- Storage account limits: IOPS/throughput
Check Azure quotas:
- Azure Portal → Subscriptions
- Usage + quotas
- Search "Compute"
- Request increase if needed
GCP:
- Compute Engine quotas: CPUs, GPUs, IP addresses
- Preemptible VM quotas: Separate from regular VMs
- Persistent disk quotas: Storage limits
Check GCP quotas:
- GCP Console → IAM & Admin → Quotas
- Filter by "Compute Engine"
- Request increase if needed
Pro tip: Request quota increases before launching large clusters. Approval can take 24-48 hours.
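On AWS you can audit the relevant EC2 quotas from code rather than clicking through the console. A boto3 sketch (the region and the substring filter are assumptions; adjust both for your account):

```python
import boto3

client = boto3.client("service-quotas", region_name="us-east-1")
paginator = client.get_paginator("list_service_quotas")

# Print every EC2 quota whose name mentions On-Demand instances.
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "On-Demand" in quota["QuotaName"]:
            print(f"{quota['QuotaName']}: {quota['Value']}")
```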
Fix #3: Clear Browser Cache and Cookies
Workspace UI issues often caused by stale cache.
Chrome:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "All time"
- Check: Cookies, Cached images and files
- Click "Clear data"
- Reload workspace
Firefox:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "Everything"
- Check: Cookies, Cache
- Click "Clear Now"
- Reload workspace
Safari:
- Safari → Preferences → Privacy
- Click "Manage Website Data"
- Remove databricks.com entries
- Reload workspace
Quick test: Try incognito/private mode first; if it works, the cache is the issue.
Fix #4: Check Cluster Logs
Cluster logs show what went wrong.
View cluster logs:
- Compute → Select cluster
- Click "Event Log" tab (for cluster lifecycle events)
- Click "Spark UI" → Executors (for Spark errors)
- Click "Driver Logs" (for detailed driver errors)
Common log messages:
"Driver not responding":
- Driver crashed (OOM, error)
- Network connectivity lost
- Fix: Increase driver memory, check network
"Executor lost":
- Executor node failed
- Cloud provider reclaimed spot instance
- Fix: Use on-demand instances, add retry logic
"Failed to bind to port":
- Port conflict (rare)
- Fix: Restart cluster, try different cluster
"Cannot connect to S3/ADLS/GCS":
- Storage credentials expired/invalid
- Fix: Update workspace storage credentials
Fix #5: Verify Network Configuration
Network issues prevent cluster communication.
Check VPC/VNet configuration:
AWS:
- VPC must allow outbound internet (for cluster communication)
- Security groups must allow internal cluster traffic
- Subnet must have NAT gateway or internet gateway
- Check: Databricks workspace → Settings → Network
Azure:
- VNet must allow outbound internet
- NSG rules must allow cluster communication
- Subnet delegation required for Databricks
- Check: Azure Portal → Virtual Networks
GCP:
- VPC must allow outbound internet
- Firewall rules must allow cluster traffic
- Subnet must have Private Google Access enabled
- Check: GCP Console → VPC Networks
Quick test:
# From cluster notebook, test outbound connectivity
%sh
curl -I https://pypi.org
If curl fails:
- Network configuration issue
- Check firewall/security groups
- Verify NAT gateway/internet gateway configured
Fix #6: Update Libraries and Dependencies
Outdated or conflicting libraries cause failures.
Check installed libraries:
- Compute → Select cluster
- Click "Libraries" tab
- Look for red "Failed" status
Common library issues:
"Library installation failed":
- PyPI/Maven package not found
- Network connectivity to package repository
- Conflicting dependencies
Fix:
- Remove failing library
- Restart cluster
- Install compatible version
- Check library logs for details
Best practices:
- Pin library versions (pandas==1.5.3, not just pandas)
- Test libraries on a test cluster first
- Use init scripts for complex setups
- Avoid conflicting libraries (e.g., TensorFlow + PyTorch issues)
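You can also check library status without opening the Libraries tab, via the Libraries REST API. A sketch with placeholder values:

```python
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = "<your-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": "<cluster-id>"},
    timeout=30,
)
for lib in resp.json().get("library_statuses", []):
    # Status values include PENDING, INSTALLING, INSTALLED, FAILED
    print(lib["library"], lib["status"], lib.get("messages", []))
```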
Fix #7: Check Workspace Storage Credentials
Databricks needs credentials to access your cloud storage.
AWS S3:
- IAM role attached to cluster
- Instance profile configured
- S3 bucket policy allows Databricks role
Check credentials:
# Test S3 access from notebook
dbutils.fs.ls("s3://your-bucket/")
If access denied:
- Workspace Admin → Settings → AWS Credentials
- Verify instance profile ARN correct
- Check S3 bucket policy
- Test with aws s3 ls from the cluster
Azure ADLS:
- Service principal credentials
- OAuth tokens
- Managed identity
Check credentials:
# Test ADLS access
dbutils.fs.ls("abfss://container@storage.dfs.core.windows.net/")
If access denied:
- Workspace Settings → Azure ADLS Gen2
- Verify service principal credentials
- Check storage account IAM roles
GCP GCS:
- Service account keys
- Workload identity
Check credentials:
# Test GCS access
dbutils.fs.ls("gs://your-bucket/")
If access denied:
- Workspace Settings → GCP Credentials
- Verify service account has Storage Object Admin role
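A quick way to check all three storage flavors at once is a small loop in a notebook (dbutils is only defined inside Databricks notebooks; the paths below are placeholders):

```python
paths = [
    "s3://your-bucket/",
    "abfss://container@storage.dfs.core.windows.net/",
    "gs://your-bucket/",
]

for path in paths:
    try:
        dbutils.fs.ls(path)
        print(f"OK      {path}")
    except Exception as exc:
        print(f"FAILED  {path}: {exc}")
```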
Fix #8: Adjust Cluster Configuration
Wrong cluster config causes failures.
Common configuration issues:
1. Instance type not available:
- Try different instance type
- Check cloud provider availability
- Use instance pool for guaranteed capacity
2. Insufficient resources:
- Increase driver memory (Edit → Driver → Memory)
- Add more worker nodes
- Use larger instance types
3. Auto-scaling issues:
- Set min/max workers appropriately
- Don't set min = max (disables autoscaling)
- Allow 2-3x headroom for scaling
4. Spark configuration:
- Check advanced options for custom Spark configs
- Common settings:
spark.sql.shuffle.partitions 200
spark.executor.memory 4g
spark.driver.memory 8g
Test configuration:
- Create new cluster with default config
- If works, issue was custom config
- Add custom configs one by one to isolate problem
Databricks Workspace Not Loading?
Issue: "Workspace URL Not Responding"
Troubleshoot:
1. Check workspace status:
- Log into cloud provider console (AWS/Azure/GCP)
- Find Databricks workspace resource
- Check if workspace running/healthy
- Check for cost/budget alerts (workspace suspended?)
2. Check DNS resolution:
# Test DNS lookup
nslookup <workspace-url>.cloud.databricks.com
If DNS fails:
- DNS server issue
- Try Google DNS (8.8.8.8)
- Try different network
3. Check browser:
- Try incognito mode
- Try different browser
- Clear cache and cookies (see Fix #3)
- Check browser console for errors (F12)
4. Check network:
- Disable VPN temporarily
- Try mobile hotspot (bypass corporate network)
- Check firewall rules
- Try from different location
Issue: "403 Forbidden" or "Access Denied"
Troubleshoot:
1. Check workspace permissions:
- Workspace admin may have removed your access
- Contact workspace admin
- Check email for access revocation notice
2. Check IP Access Lists:
- Workspace → Settings → IP Access Lists
- Your IP might be blocked
- VPN might change your IP to blocked range
3. Check SSO/SAML:
- Corporate SSO might be down
- Re-authenticate via SSO portal
- Contact IT if persistent
4. Check user status:
- User account might be disabled
- Check with workspace admin
Issue: "Slow Workspace Performance"
Causes:
- Too many notebooks/jobs open
- Large result sets loading
- Browser memory exhaustion
- Network latency
Fixes:
1. Close unused notebooks:
- File → Close other notebooks
- Detach from clusters when not in use
2. Limit result display:
# Don't display huge DataFrames
# Instead of:
display(df)
# Use:
display(df.limit(100))
3. Clear output:
- Cell menu → Clear All Outputs
- Reduces page memory
4. Use dedicated browser:
- Use separate browser profile for Databricks
- Avoid 50+ tabs in same browser
Databricks Clusters Not Starting?
Issue: "Cluster Stuck on 'Pending'"
Troubleshoot:
1. Check cloud provider capacity:
- No available VMs in region/zone
- Try different instance type
- Try different availability zone
- Use on-demand instead of spot
2. Check cluster event log:
- Compute → Cluster → Event Log
- Look for error messages
- Common: "Cannot launch instances", "Insufficient capacity"
3. Check quotas:
- See Fix #2 (Check Cloud Provider Quotas)
- Request quota increase if needed
4. Wait and retry:
- Capacity issues often transient
- Wait 10-15 minutes
- Terminate and restart cluster
Issue: "Cluster Starts Then Immediately Terminates"
Troubleshoot:
1. Check init scripts:
- Init script failure causes cluster termination
- Edit cluster → Init Scripts → Remove temporarily
- Test if cluster starts without init scripts
- Fix init script errors
2. Check cluster policy:
- Policy restrictions preventing cluster launch?
- Contact workspace admin
- Try cluster without policy
3. Check driver logs:
- Compute → Cluster → Driver Logs
- Look for startup errors
- Common: Library conflicts, configuration errors
4. Check instance profile/service principal:
- Invalid credentials cause startup failure
- Test credentials separately
- Update workspace credentials if needed
Issue: "Cluster Running But Notebooks Won't Execute"
Troubleshoot:
1. Detach and reattach notebook:
- Notebook → Cluster dropdown → Detach
- Wait 10 seconds
- Reattach to cluster
2. Check cluster status:
- Green = Running
- Gray = Stopped
- Orange = Starting/Restarting
- Red = Failed
3. Check notebook language:
- Notebook language must match cluster
- SQL notebooks need SQL-compatible cluster
- Python notebooks work on all clusters
4. Test with simple command:
# Test if cluster responding
print("Hello from cluster!")
If timeout:
- Cluster may be overloaded
- Check Spark UI → Executors → Active tasks
- Restart cluster if needed
Databricks Jobs Not Running?
Issue: "Job Stuck in 'Pending' State"
Troubleshoot:
1. Check job queue:
- Workflows → Job runs
- Look for many pending runs
- Max concurrent runs limit reached?
2. Check cluster availability:
- If using cluster pool, pool might be empty
- If using existing cluster, cluster might be stopped
- Try "Run Now" manually to test
3. Check permissions:
- User must have "Can Manage Run" permission
- Check job → Permissions tab
- Contact job owner if needed
4. Check job schedule:
- Edit job → Schedule
- Verify schedule is enabled
- Check if manual pause enabled
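Triggering the run from code separates scheduler problems from job problems. A sketch using the databricks-sdk package; the job ID is a placeholder, and .result() blocks until the run finishes:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
run = w.jobs.run_now(job_id=123456789).result()
print(run.run_id, run.state.result_state)
```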
Issue: "Job Runs But Fails Immediately"
Troubleshoot:
1. Check job run output:
- Workflows → Job → Latest run → View run
- Click failed task
- Check error message and stack trace
2. Check notebook/script:
- Syntax errors
- Missing parameters
- Broken dependencies
- Test manually in notebook first
3. Check job parameters:
- Edit job → Parameters
- Verify parameter values correct
- Especially file paths, credentials
4. Check cluster logs:
- Click failed run → Cluster Logs
- Look for startup or execution errors
Issue: "Job Runs Slower Than Expected"
Causes:
- Undersized cluster
- Data skew
- Inefficient queries
- Cold start (cluster creation time)
Fixes:
1. Use existing cluster:
- Edit job → Cluster → Use existing cluster
- Avoid cold start time
- But: cluster must be running when job triggers
2. Use cluster pools:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Edit job → Cluster → Pool
3. Optimize job:
- Check Spark UI for bottlenecks
- Reduce data shuffles
- Add partitioning
- Cache intermediate results
4. Scale up cluster:
- Increase worker nodes
- Use larger instance types
- Enable autoscaling
Databricks SQL Warehouses Issues?
Issue: "SQL Warehouse Won't Start"
Troubleshoot:
1. Check warehouse size:
- Larger warehouses take longer (1-3 minutes)
- Wait patiently
- Check status indicator
2. Check cloud quotas:
- Same quota issues as clusters
- See Fix #2 (Check Cloud Provider Quotas)
3. Check permissions:
- User must have "Can Use" permission
- SQL Warehouses → Warehouse → Permissions
4. Check workspace status:
- Control plane down affects warehouse startup
- Check status.databricks.com
Issue: "SQL Query Timeout"
Troubleshoot:
1. Check query complexity:
- Large joins, aggregations take time
- Break into smaller queries
- Add filters to reduce data scanned
2. Increase warehouse size:
- Edit warehouse → Cluster size
- Larger = more query slots, faster execution
- 2X-Large for heavy workloads
3. Check query queue:
- SQL Warehouses → Query History
- Too many concurrent queries?
- Increase warehouse cluster size or concurrency
4. Optimize query:
-- Add filters to reduce data
SELECT * FROM large_table
WHERE date >= '2026-01-01' -- Partition filter
LIMIT 1000
-- Use materialized views for common queries
CREATE MATERIALIZED VIEW <view_name> AS ...
Issue: "SQL Dashboard Not Loading"
Troubleshoot:
1. Check warehouse status:
- Dashboard queries need running warehouse
- Start warehouse if stopped
- Auto-stop might have terminated it
2. Check query refresh:
- Dashboard → Refresh settings
- Manual refresh vs auto-refresh
- Long-running queries block dashboard
3. Check data permissions:
- Unity Catalog permissions required
- User must have SELECT on tables
- Check with data owner
4. Check widget queries:
- Dashboard → Edit → Check each widget
- Individual widget query might be failing
- Fix or disable problematic widgets
Unity Catalog Issues?
Issue: "Cannot Access Table: Unity Catalog Error"
Troubleshoot:
1. Check catalog permissions:
-- Check grants on catalog
SHOW GRANTS ON CATALOG your_catalog;
-- Check grants on schema
SHOW GRANTS ON SCHEMA your_catalog.your_schema;
-- Check grants on table
SHOW GRANTS ON TABLE your_catalog.your_schema.your_table;
2. Request access:
- Contact data owner
- Use "Request Access" button in Catalog Explorer
- Workspace admin can grant permissions
3. Check catalog exists:
-- List available catalogs
SHOW CATALOGS;
-- List schemas in catalog
SHOW SCHEMAS IN your_catalog;
4. Check table path:
- Unity Catalog uses three-level namespace
- Format: catalog.schema.table
- Check for typos
Issue: "Metastore Connection Failed"
Troubleshoot:
1. Check workspace metastore assignment:
- Workspace Settings → Unity Catalog
- Verify metastore assigned to workspace
- Contact workspace admin if not assigned
2. Check network connectivity:
- Metastore in different region?
- Network rules blocking connection?
- Check VPC/VNet peering if using private connectivity
3. Check metastore status:
- Account Console → Metastores
- Check if metastore healthy
- Look for error messages
Delta Lake Issues?
Issue: "Delta Table Not Found"
Troubleshoot:
1. Check table path:
# Verify path exists
dbutils.fs.ls("dbfs:/path/to/delta/table")
# Or for Unity Catalog
spark.sql("DESCRIBE TABLE your_catalog.your_schema.your_table")
2. Check table registration:
-- List tables in schema
SHOW TABLES IN your_schema;
-- Register external Delta table
CREATE TABLE your_table
USING DELTA
LOCATION '/path/to/delta/table';
3. Check permissions:
- Read permissions on storage location
- Unity Catalog permissions if using UC
- Check with workspace admin
Issue: "Delta Transaction Failed"
Troubleshoot:
1. Retry operation:
- Delta handles most conflicts automatically
- Simply retry the operation
2. Check concurrent writes:
- Multiple jobs writing same table?
- Use merge operations instead of inserts
- Add transaction isolation
3. Run table maintenance:
-- Optimize table
OPTIMIZE your_table;
-- Vacuum old files (default 7 day retention)
VACUUM your_table RETAIN 168 HOURS;
-- Check table history
DESCRIBE HISTORY your_table;
Regional Outages: Is It Just Me?
Databricks deploys across multiple cloud regions:
| Cloud Provider | Common Regions |
|---|---|
| AWS | us-east-1, us-west-2, eu-west-1, ap-southeast-1 |
| Azure | East US, West Europe, Southeast Asia, UK South |
| GCP | us-central1, europe-west1, asia-southeast1 |
How to check for regional issues:
1. Check DownDetector:
downdetector.com/status/databricks
Shows:
- Real-time outage reports
- Heatmap of affected regions
- Spike in reports = likely real outage
2. Check cloud provider status:
- AWS outage might affect only us-east-1
- Azure issue might affect only one region
- GCP regional issues isolated
3. Check Databricks status by region:
- status.databricks.com
- Filter by cloud provider and region
- Subscribe to your specific region alerts
4. Test from different region:
- If available, try workspace in different region
- Isolates if issue is regional vs global
When Databricks Actually Goes Down
What Happens
Recent major outages:
- October 2023: 4-hour AWS us-east-1 control plane outage
- July 2023: 2-hour authentication service disruption (all clouds)
- March 2023: 6-hour Azure East US regional outage
Typical causes:
- Cloud provider outages (AWS/Azure/GCP failures)
- Control plane authentication issues
- Network connectivity problems
- Database backend failures
- Deployment issues (rare)
How Databricks Responds
Communication channels:
- status.databricks.com - Primary source
- @databricks on Twitter/X
- Email alerts (if subscribed to status page)
- In-app notifications (if workspace accessible)
Timeline:
- 0-15 min: Users report issues on Twitter/DownDetector
- 15-30 min: Databricks acknowledges on status page
- 30-120 min: Updates posted every 30 min
- Resolution: Usually 1-4 hours for major outages
What to Do During Outages
1. Check if data plane still works:
- Running clusters may continue working
- Jobs may complete even if UI is down
- Check via CLI:
databricks clusters list
2. Use backup compute:
- AWS EMR for Spark workloads
- Azure HDInsight or Synapse
- GCP Dataproc
- Run critical jobs elsewhere temporarily
3. Monitor status page:
- status.databricks.com
- Subscribe to SMS/email updates
- Check estimated time to resolution
4. Document impact:
- Note affected jobs/workflows
- Capture error messages
- Will help with root cause analysis later
5. Prepare for recovery:
- Have restart procedures ready
- Check data consistency after outage
- Re-run failed jobs when service restored
Databricks Down Checklist
Follow these steps in order:
Step 1: Verify it's actually down
- Check Databricks Status
- Check API Status Check
- Check cloud provider status (AWS/Azure/GCP)
- Search Twitter: "Databricks down"
- Test REST API: curl the workspace API endpoint
- Try different browser/incognito mode
Step 2: Quick fixes (if Databricks is up)
- Restart cluster
- Clear browser cache and cookies
- Check cluster event logs
- Verify network connectivity
- Check cloud provider quotas
- Update workspace credentials
Step 3: Cluster troubleshooting
- Check cluster configuration (instance type, size)
- Verify cloud provider capacity available
- Check init scripts (disable temporarily)
- Review driver logs for errors
- Test with default cluster config
- Check cluster policy restrictions
Step 4: Network troubleshooting
- Test outbound connectivity from cluster
- Verify VPC/VNet configuration
- Check security group / NSG rules
- Verify NAT gateway / internet gateway
- Check storage credentials (S3/ADLS/GCS)
- Test with different network/VPN
Step 5: Job/workflow troubleshooting
- Check job run history and error messages
- Test notebook manually first
- Verify job parameters correct
- Check job permissions
- Review cluster logs for failed run
- Test with simpler job configuration
Step 6: Nuclear option
- Create new cluster with default config
- Re-import notebook from revision history
- Contact Databricks support: databricks.com/support
- Open ticket with cloud provider if quota/capacity issue
Prevent Future Issues
1. Set Up Proactive Monitoring
Monitor Databricks status:
- Subscribe to status.databricks.com (email/SMS)
- Use API Status Check for automated monitoring
- Set up Slack/Discord/email alerts for outages
Monitor your workloads:
# Add health checks to critical notebooks
try:
    # Your data pipeline code
    df = spark.read.table("my_table")
except Exception as e:
    # Alert on failure
    dbutils.notebook.exit(f"FAILED: {str(e)}")
Monitor cluster health:
- Set up alerts for cluster failures
- Monitor job success rates
- Track cluster startup times (increasing = potential issues)
2. Use Cluster Pools
Why cluster pools help:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Guaranteed capacity
- Consistent environment
Create cluster pool:
- Compute → Pools → Create Pool
- Set min/max idle instances
- Choose instance type
- Use pool for interactive clusters and jobs
Pro tip: Size pool based on peak demand. Keep 2-3 idle instances ready.
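Pools can be created from code as well, which makes it easier to keep dev and prod consistent. A sketch using the databricks-sdk package; the pool name, node type, and sizes are placeholders to adapt:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
pool = w.instance_pools.create(
    instance_pool_name="etl-pool",
    node_type_id="i3.xlarge",    # pick a node type available in your region/cloud
    min_idle_instances=2,        # warm instances kept ready for fast cluster startup
    max_capacity=20,
)
print("Created pool:", pool.instance_pool_id)
```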
3. Build Redundancy
For critical pipelines:
Multi-region strategy:
- Deploy workspaces in multiple regions
- Failover to backup region during outage
- Use cross-region storage replication
Retry logic:
from retry import retry
@retry(tries=3, delay=60)
def run_critical_job():
    # Your job code
    spark.sql("INSERT INTO target_table SELECT * FROM source_table")
Backup compute:
- Keep alternate compute ready (EMR, HDInsight, Dataproc)
- Document failover procedures
- Test failover quarterly
4. Optimize Cluster Configuration
Right-size clusters:
- Start small, scale up as needed
- Use autoscaling for variable workloads
- Don't over-provision (wastes cost)
Best practices:
- Use cluster pools for fast startup
- Set appropriate auto-termination (30-60 min idle)
- Use spot/preemptible for non-critical workloads
- Pin library versions for consistency
Test configurations:
- Test new configs on dev cluster first
- Gradually roll out to production
- Monitor performance metrics
5. Implement Job Orchestration Best Practices
Job dependencies:
- Use Databricks Workflows for orchestration
- Set proper task dependencies
- Add retry policies (3 retries with exponential backoff)
Job monitoring:
# Send notifications on job completion
import json

dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "rows_processed": row_count,
    "duration_seconds": duration
}))
Failure handling:
- Set up alerts for job failures
- Use dead letter queues for failed records
- Log detailed error messages
6. Maintain Cloud Provider Health
Monitor quotas:
- Set up alerts for quota usage (80% threshold)
- Request quota increases proactively
- Keep buffer for burst capacity
Track cloud provider status:
- Subscribe to AWS/Azure/GCP status pages
- Monitor your specific regions
- Note cloud provider maintenance windows
Resource management:
- Clean up unused clusters/pools
- Delete old job runs (retention policy)
- Archive unused notebooks/data
7. Keep Credentials Updated
Regular credential rotation:
- Rotate personal access tokens quarterly
- Update service principal credentials before expiration
- Test credentials after rotation
Credential management:
- Use Databricks Secrets for sensitive data
- Avoid hardcoding credentials
- Use service principals for automation
# Use Databricks Secrets
secret = dbutils.secrets.get(scope="my_scope", key="api_key")
8. Document Your Setup
Critical documentation:
- Cluster configurations (save as JSON)
- Job configurations and dependencies
- Network architecture (VPC/VNet setup)
- Credential management procedures
- Incident response runbooks
Keep updated:
- Review docs quarterly
- Update after any config changes
- Share with team members
Key Takeaways
Before assuming Databricks is down:
- ✅ Check Databricks Status
- ✅ Check cloud provider status (AWS/Azure/GCP)
- ✅ Test REST API with curl
- ✅ Search Twitter for "Databricks down"
- ✅ Check cluster event logs and driver logs
Common fixes:
- Restart cluster (fixes 50% of issues)
- Check cloud provider quotas (capacity limits)
- Clear browser cache and cookies
- Verify network configuration (VPC/VNet)
- Update storage credentials
- Adjust cluster configuration (instance type, size)
Cluster issues:
- Check event logs for startup failures
- Verify cloud provider capacity available
- Use on-demand instances instead of spot
- Remove init scripts temporarily to test
- Check cluster policy restrictions
Job/workflow issues:
- Test notebooks manually first
- Check job run history for error details
- Verify job parameters and permissions
- Review cluster logs for failed runs
- Add retry logic and monitoring
SQL Warehouse issues:
- Wait for startup (1-3 minutes)
- Check warehouse permissions
- Increase warehouse size for heavy workloads
- Optimize slow queries
Unity Catalog issues:
- Check three-level namespace (catalog.schema.table)
- Verify permissions with SHOW GRANTS
- Request access from data owner
If Databricks is actually down:
- Monitor status.databricks.com
- Running clusters may continue working
- Use backup compute for critical jobs
- Usually resolved within 1-4 hours
Prevent future issues:
- Use cluster pools for fast, reliable startup
- Build retry logic into critical jobs
- Monitor proactively with alerts
- Keep cloud quotas sized appropriately
- Document configurations and procedures
- Test failover scenarios regularly
Remember: Most "Databricks down" issues are actually cluster configuration, cloud provider quotas, network setup, or permissions problems. Try the fixes in this guide before assuming Databricks is down.
Need real-time Databricks status monitoring? Track Databricks uptime with API Status Check - Get instant alerts when Databricks goes down.
Related Resources
- Is Databricks Down Right Now? – Live status check
- Databricks Outage History – Past incidents and timeline
- Databricks vs Snowflake Uptime – Which platform is more reliable?
- API Outage Response Plan – How to handle downtime like a pro
Monitor Your APIs
Check the real-time status of 100+ popular APIs used by developers.
View API Status →