Grafana Cloud Outage History
Past incidents and downtime events
Complete history of Grafana Cloud outages, incidents, and service disruptions. Showing the 50 most recent incidents.
February 2026 (1 incident)
IRM Pages Not Accessible
4 updates
This incident has been resolved.
A fix was implemented, and we are seeing recovery throughout the rollout. We will continue to monitor results.
The issue has been identified and we are implementing a fix.
As of 16:40 UTC, we are currently investigating an issue where IRM pages are not accessible. Users may experience errors or be unable to load IRM-related pages during this time. Our team is actively working to identify the root cause and restore full functionality as quickly as possible. We will provide updates as more information becomes available.
January 2026 (20 incidents)
Some Dashboards in Prod-Us-Central-3 unable to load
3 updates
This incident has been resolved.
A fix has been implemented, and we are monitoring the results.
We are currently investigating an issue impacting dashboards for users in the prod-us-central-3 region. This is preventing impacted dashboards from loading as expected. It is also impacting a very small subset of users in the prod-us-central-0 region. We will provide more details regarding the scope as they become available.
Grafana OnCall and IRM Loading Issues
3 updates
We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.
As of 22:55 UTC, we have observed marked improvement with the incident impacting IRM and OnCall. We are still investigating and will continue to monitor and provide updates.
We are currently investigating an issue impacting some customers when accessing Grafana OnCall and IRM. Impacted customers may experience long load times or even time-outs when attempting to access these components. We'll provide more information as it becomes available.
Grafana Cloud instances unavailable
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Some users are finding their Grafana Cloud instances unavailable.
Increased write error rate for logs in prod-us-west-0
1 update
We were experiencing an increased write error rate for logs in prod-us-west-0 from 6:55 to 7:15 UTC. We have since observed continued stability and are marking this as resolved.
Upgrade from Free → Pro failing for users
3 updates
Engineering has released a fix and as of 00:13 UTC, customers should no longer experience issues upgrading from Free to Pro subscriptions. At this time, we are considering this issue resolved. No further updates.
Engineering has identified the issue and is currently exploring remediation options. At this time, users remain unable to upgrade from Free to Pro subscriptions. We will continue to provide updates as more information is shared.
As of 20:05 UTC, our engineering team became aware of an issue related to subscription plan upgrades. Users experiencing this issue will not be able to upgrade from a Free plan to a Pro subscription. Engineering is actively engaged and assessing the issue. We will provide updates accordingly.
Investigating Issues with Email Delivery
3 updates
This incident has been resolved.
We are noticing significant improvement, and things are stabilizing as expected. Our engineering teams will continue to monitor progress.
We are currently investigating an issue impacting email delivery for some services, including Alert Notifications.
Synthetic monitoring secrets - proxy URL changes
2 updates
The incident is resolved. We are in contact with customers affected by this change.
During the secrets migration in https://status.grafana.com/incidents/47d1q4sphrmj, secrets proxy URLs for some customers were updated in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2. This was an unexpected breaking change affecting a subset of customers, specifically customers who are using secrets on private probes behind a firewall. We are investigating.
If your private probes are impacted, we ask you to update firewall rules for the secrets proxy to allow outbound connections to the updated hosts:
gsm-proxy-prod-eu-west-2.grafana.net -> gsm-proxy-prod-eu-west-4.grafana.net
gsm-proxy-prod-us-central-0.grafana.net -> gsm-proxy-prod-us-central-4.grafana.net
gsm-proxy-prod-us-east-0.grafana.net -> gsm-proxy-prod-us-east-2.grafana.net
Note that this URL change affects only a small subset of customers; the majority of customers will not need to update firewall rules. For affected customers, private probes will show an error like the following in probe logs:
Error during test execution: failed to get secret: Get "https://gsm-proxy-prod-us-east-2.grafana.net/api/v1/secrets/.../decrypt": Forbidden
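For affected customers, one quick way to confirm that updated firewall rules are in place is to test outbound reachability to the new proxy hosts from the network where the private probes run. The sketch below is illustrative only, not an official Grafana tool: the host names come from the mapping above, while the port (443, inferred from the https proxy URLs) and timeout are assumptions.

    # Minimal sketch: check outbound reachability to the updated secrets proxy hosts.
    # Host names are taken from the incident update above; port 443 and the 5 s
    # timeout are assumptions for illustration.
    import socket

    NEW_PROXY_HOSTS = [
        "gsm-proxy-prod-eu-west-4.grafana.net",
        "gsm-proxy-prod-us-central-4.grafana.net",
        "gsm-proxy-prod-us-east-2.grafana.net",
    ]

    for host in NEW_PROXY_HOSTS:
        try:
            # A successful TCP connection on 443 suggests the firewall allows
            # outbound traffic to the new host.
            with socket.create_connection((host, 443), timeout=5):
                print(f"OK   {host}:443 reachable")
        except OSError as err:
            print(f"FAIL {host}:443 not reachable ({err})")

Run this from the same network segment as the private probes; a FAIL line indicates the firewall rule for that host still needs to be updated.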
Hosted Traces elevated write latency in prod-us-central-0 region.
3 updates
We consider this incident resolved, as latency has not been elevated since the fix was applied. The issue was caused by a latency spike in a downstream dependency, which increased backpressure on the Hosted Traces ingestion path, degraded gateway performance, and resulted in elevated write latency. After the affected gateway services were cleared, the degraded state went away and normal operation was restored.
The issue was identified and a fix was applied. After the fix, latency returned to regular and expected values. We're currently monitoring the component's health before resolving the incident.
We're currently investigating an issue with elevated write latency in the Hosted Traces prod-us-central-0 region, which has been experiencing sustained high write latency since 7:20 UTC. Only a small subset of requests is impacted.
Incident: Metrics Querying Unavailable in EU (Resolved)
1 update
Impact: Between 14:30 and 14:38 UTC, some customers in prod-eu-west-2 may have experienced issues querying metrics. During this time, read requests to the metrics backend were unavailable, resulting in failed or incomplete query responses. The root cause of the issue was identified and addressed.
Resolution: The affected components were restored, and service was fully available by 14:38 UTC. We have taken additional steps to prevent this type of disruption from occurring in the future.
Next Steps: We are reviewing monitoring and safeguards around this failure mode to further improve reliability.
Degraded Writes in AWS us-east-2
9 updates
This incident has been resolved.
The issue hasn't been seen for a reasonable amount of time and did not recur when it was expected to. We're still closely monitoring system behaviour and will update this incident accordingly.
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
We are continuing to monitor for any further issues.
The impact on this has been mitigated at this time and we are currently monitoring.
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Degraded Writes in AWS us-east-2
2 updates
This incident has been resolved.
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Partial Mimir Write Outage
6 updates
This incident has been resolved. Read and write 5xx errors and increased latency were experienced during two periods: 23:56:15 to 00:32:45 UTC and 00:55:30 to 01:36:15 UTC.
Customers should no longer experience issues. We will continue to monitor and provide updates.
We are continuing to investigate this issue.
Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data. We continue to investigate and will update when we have more to share.
As of 00:28 UTC, we have observed improvement with the partial write outage. Customers should no longer experience issues with metrics ingestion. We will continue to monitor and provide updates.
As of 23:57 UTC, our engineers became aware of an issue with prod-us-west-0 resulting in a partial write outage. Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data. We continue to investigate and will update when we have more to share.
Connectivity issues for Azure PrivateLink endpoints.
2 updates
The scope of this incident was smaller than originally anticipated. As of 16:27 UTC our engineering team merged a fix for those affected, and we are considering this resolved.
We're experiencing connectivity loss for Azure PrivateLink endpoints in all available Azure regions. The issue affects users trying to ingest Alloy data or use PDC over Azure PrivateLink. Our team is actively investigating to identify the root cause.
PDC Agent Connectivity Issues in prod-eu-west-3
4 updates
We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.
Engineering has released a fix and as of 17:01 UTC, customers should no longer experience connectivity issues. We will continue to monitor for recurrence and provide updates accordingly.
Engineering has identified the issue and will be deploying a fix shortly. At this time, users will continue to experience disruptions for queries routed via PDC. We will continue to provide updates as more information is shared.
We are investigating an issue in prod-eu-west-3 where PDC agents are failing to maintain/re-establish connectivity. Some agents are struggling to reconnect, which may cause disruptions or degraded performance for customer queries routed over PDC. We’ll share updates as we learn more.
Tempo write degradation in prod-eu-west-3 - tempo-prod-08
4 updates
Engineering has released a fix and we continue to observe a period of recovery. As of 15:12 UTC we are considering this resolved.
There was a full degradation of write service between 9:13 and 9:35 UTC. The cell is operational but there is still degradation in the write path. Our Engineering team is actively working on this.
We are continuing to investigate this issue.
We have been alerted to an issue with Tempo write degradation in prod-eu-west-3 - tempo-prod-08. The cell is operational but there is degradation in the write path. Write requests are taking longer than normal. This started at 7:00 UTC. Our Engineering team is actively investigating this.
Write Degradation in Grafana Cloud Logs (prod-us-east-3)
1 update
Between 20:23 UTC and 20:53 UTC, Grafana Cloud Logs in prod-us-east-3 experienced a write degradation, which may have resulted in delayed or failed log ingestion for some customers. The issue has been fully resolved, and the cell is currently operating normally. We are continuing to investigate the root cause and will provide additional details if relevant.
Partial Write Outage in prod-us-central-0
1 update
There was a ~15 minute partial write outage for some customers in prod-us-central-0. The time frame for this outage was 15:43-15:57 UTC.
High Latency and Errors in Prod-Us-Central-7
3 updates
This incident has been resolved.
We are seeing some recovery in affected products. We are continuing to monitor the progress.
We are currently investigating an issue causing degraded Mimir and Tempo read performance in the prod-us-central-7 region.
Cloudflare Error 1016
1 update
From 20:32 to 20:37 UTC, a DNS record misconfiguration resulted in temporary Cloudflare 1016 DNS errors on many Grafana Cloud stacks. The misconfiguration was mitigated within 5 minutes, and we are working with Cloudflare to better understand why the particular misconfiguration resulted in this outage.
K6 test-runs cannot be started and the overall navigation experience is degraded
4 updates
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
December 2025 (13 incidents)
PDC Queries in Prod-Us-East-3 Are Failing
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified the issue, and are working on a fix now.
We are currently investigating an issue impacting PDC Queries in the Prod-Us-East-3 region. This issue is causing queries to fail, and affected customers are experiencing "no such host" errors when attempting to connect a data source via PDC.
New K6 Tests Intermittently Failing
3 updates
This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating an issue that is causing intermittent failures when starting new k6 tests from the k6 app in the cloud. Tests start normally from the CLI.
Logs write path degradation on GCP Belgium - prod-eu-west-0
1 update
Today, the 17th, from 10:00 to 10:38 UTC we experienced logs write path degradation. Customers may have experienced 5xx errors during ingestion on that cluster. Service is fully restored.
IRM access issues on instances in GCP Belgium (prod-eu-west-0)
1 update
Due to an incident today in the IRM (OnCall) product, access to this application was degraded from 10:39 until 11:16 UTC. Customers may have found the app inoperable or inaccessible. The service is again fully restored.
k6 Cloud service is down
2 updates
This incident has been resolved.
The main k6 Cloud service is down due to database issues, and it is not possible to access test runs or start new ones.
Some Trace QL Queries Failing
2 updates
This incident has been resolved.
TraceQL queries with "= nil" in Explore Traces and parts of Drilldown Traces are failing with 400 Bad Request errors. The issue has been identified, and a fix is currently being rolled out.
Mimir Partial Write Outage
5 updates
We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.
Synthetic Monitoring has now also recovered. Customers should no longer experience alert rules failing to evaluate. We continue to monitor for recurrence and will provide updates accordingly.
Engineering has released a fix and as of 22:25 UTC, customers should no longer experience ingestion issues. We will continue to monitor for recurrence and provide updates accordingly.
While investigating this issue, we also became aware that Synthetic Monitoring is affected. Some customers may have alert rules failing to evaluate.
As of 21:30 UTC, we are experiencing a partial ingestion outage in Grafana Mimir. This is affecting the write path, where some ingestion requests are failing or timing out. Our engineering team is actively investigating and working to identify the root cause.
Elevated Metric Push Failures and Latency
7 updates
We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.
We are observing a trend in improvement after implementing a fix. We will continue to monitor and update accordingly. During our investigation, we also became aware that some alerts associated with Synthetic Monitoring checks have been failing to evaluate correctly.
Our engineering team has identified a potential root cause, and a fix is being implemented.
Our engineering team has engaged with our Cloud Service Provider and are working together to continue to investigate this issue.
We are continuing to investigate this issue.
We have also identified that trace ingestion may also be affected. Some customers may experience elevated latency and intermittent errors when sending traces. Investigation is ongoing.
We have detected an issue causing some customers to experience failed metric pushes as well as increased latency when sending metrics. The issue was first observed at 18:30 UTC. Our engineering team is actively investigating the root cause and working to restore normal operation as quickly as possible. We will provide further updates as more information becomes available. Thank you for your patience while we work to resolve this issue.
Elevated Log Push Failures and Latency on prod-eu-west-0 cluster
1 update
Users experienced failed log pushes as well as increased latency when sending logs to the Loki service hosted on the prod-eu-west-0 cluster between 18:30 and ~23:00 UTC. Our engineering team engaged our Cloud Service Provider, and a fix was implemented that mitigated the issue.
Metrics read issue affecting cortex-prod-13 on prod-us-east-0
3 updates
The incident has been resolved.
The read path was restored at 08:23 UTC and queries are fully functioning again. The read path outage lasted from 08:04 to 08:23 UTC.
At 08:04 UTC we detected a read path outage (queries) on cortex-prod-13. We are currently investigating this issue. The ingestion path (writes) is not affected.
Logs query degradation on AWS Germany (prod-eu-west-2)
3 updates
The issue has been resolved.
The query service is operational again, and log reads should be available on the cluster. Our engineers are monitoring the health of the service to ensure full recovery.
Since around 12:30 UTC today, the 9th, we have been experiencing problems on the Loki read path of the eu-west-2 cluster. This makes it difficult for customers on this cluster to query logs, and can also impact alerts and other services based on these logs. Our engineers are actively working to restore the service.
Hosted Grafana is currently being impacted as a result of the Cloudflare outage
2 updates
This incident has been resolved.
We are currently experiencing disruptions to Hosted Grafana services due to a widespread Cloudflare outage impacting connectivity across multiple regions. Our team is actively monitoring the situation and will provide updates as Cloudflare works to restore normal operation.
Loki prod-ap-northeast-0-loki-prod-030 writes degradation
1 update
The Loki prod-ap-northeast-0-loki-prod-030 cell experienced write degradation between 8:11 and 8:58 UTC. The engineering team mitigated the issue and the cell is now stable.
November 2025 (16 incidents)
Alerts failing with Prometheus
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue around degraded services in prod-us-central-0. The expected behavioral impact is that the queries may take longer than usual to respond.
Synthetic Monitoring is down in prod-us-central-7
2 updates
This incident has been resolved.
Users cannot interact with the SM API for any DB-related action, such as:
CRUD checks
CRUD probes
Longer Than Expected Load Times on Grafana Cloud
3 updates
We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.
A fix has been implemented, and we are seeing latency down across clusters. We are continuing to monitor progress.
We are currently investigating reports of long load times on Grafana Cloud. We will update as more information becomes available.
Some Loki Writes in Prod-Gb-South-0 Failed
2 updates
This incident has been resolved.
From approximately 14:10-14:25 UTC, writes to Loki failed for a subset of customers in the gb-south-0 region. Most of these errors have already recovered, and our team continues to monitor the recovery.
Slow user queries exceed threshold
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are seeing some intermittent query failures.
We are currently investigating this issue.
Elevated Read & Write Latency for Some Cells in Prod-Us-East-0
4 updates
This incident has been resolved.
Things have recovered, and we are monitoring to ensure stability.
The previous post mentioned that this was occurring in some cells in prod-us-central-0. This is incorrect and is occurring in some cells in prod-us-east-0.
At approximately 16:40 UTC we noticed an issue causing increased read & write latency in some prod-us-central-0 cells. At this stage we are noticing some recovery and will continue to monitor.
Intermittent issues when starting k6 cloud test runs
1 update
We are experiencing issues starting cloud test runs. This is primarily affecting browser test runs and test runs using static IPs.
Missing Billing Metrics for Loki
1 update
An incident impacted Loki billing metrics from 05:30 to 06:10 UTC across all clusters. This is now resolved; however, users may notice some billing metrics missing from the billing dashboard for this time period. There is no impact on log querying or ingestion.
Hosted Grafana is currently being impacted as a result of the Cloudflare outage
2 updates
This incident has been resolved.
We are currently experiencing disruption to Hosted Grafana services due to a widespread Cloudflare outage impacting connectivity across multiple regions. Our team is actively monitoring the situation and will provide updates as Cloudflare works to restore normal operation.
Cortex - read/write path disruption
4 updates
This incident has been resolved.
We’re continuing to work on this issue, and are actively investigating the remaining details. Our team is making progress, and we’ll share another update as soon as we have more information to provide.
We are continuing to work on a fix for this issue.
We are currently observing a read/write path disruption in prod-us-central.cortex-prod-04.
Elevated Mimir Read/Write Errors
7 updates
We continue to observe a sustained period of recovery. As of 00:08 UTC, we are considering this issue resolved. No further updates.
A fix has been implemented and we are monitoring the results.
Our teams have been alerted that Synthetic Monitoring will also be affected by this outage. Users may see gaps in their Synthetic Monitoring metrics as well as missed alerts as a result of this. We continue to investigate and will provide further updates as they become available.
The investigation has revealed that metric ingestion is also affected, including log-generated recording rules. We are continuing to investigate the root cause and will provide further updates as more information becomes available.
The investigation has revealed that metric ingestion is also affected, including log-generated recording rules. We are continuing to investigate the root cause and will provide further updates as more information becomes available.
Span metrics have also been identified as affected. We are continuing to investigate the root cause and will share further updates as more information becomes available.
We are investigating elevated Mimir read errors beginning at approximately 21:57 UTC. The errors are technically retriable, but most retries are unlikely to succeed at this time. This may result in failed or delayed query responses for some users. Engineering is actively investigating the root cause and working to restore normal read performance. We will provide further updates as we learn more.
Metrics Write Outage in Multiple Cells
6 updates
As of 23:07 UTC we are considering this incident resolved. Mitigation efforts have restored normal write performance, and error rates have returned to expected levels. We have confirmed stability across the affected areas and continue to monitor, but no further impact is expected. If you continue to experience any issues, please reach out to support.
We are seeing improvement on the metrics side, with write performance recovering. We continue to investigate the remaining impact to Synthetic Monitoring and are working to determine the underlying cause. Monitoring will continue as recovery progresses, and we’ll provide further updates as we learn more.
Our teams have been alerted that Synthetic Monitoring will also be affected by this outage. Users may see gaps in their Synthetic Monitoring metrics as well as missed alerts as a result of this. We continue to investigate and will provide further updates as they become available.
We’ve re-evaluated the situation and this issue is still ongoing. Although we initially observed signs of recovery, write errors continue to occur in the affected cells. Mitigation work is still in progress, and we’re treating the incident as identified again while we work toward a sustained resolution. We’ll provide further updates as we confirm stabilization.
Mitigation has been applied and Mimir write performance is beginning to recover in the affected cells. prod-us-central-0.cortex-prod-10 appears to have recovered as of 19:52 UTC, and prod-us-central-5.cortex-dedicated-06 is showing signs of recovery as of 20:00 UTC. We are continuing to monitor both cells closely to ensure the mitigation is effective and that the systems remain stable.
We are investigating a partial write outage affecting multiple metrics cells, beginning around 19:30 UTC. Some customers may see intermittent write failures or delays, but most requests should succeed after retries and recent metrics may appear late as a result. Querying previously ingested data remains unaffected. Engineering is continuing to investigate and will provide further updates as more information becomes available.
Hyderabad Probe Issues
1 update
We experienced degraded service with the Hyderabad probe today starting around 13:20 UTC, which was resolved as of 17:30 UTC.
PDC-Prod-eu-west-2 cluster degraded performance
3 updates
This incident has been resolved.
Engineering has released a fix and as of 12:15 UTC, customers should no longer experience performance degradation on the PDC service. We will continue to monitor for recurrence and provide updates accordingly.
We are currently facing performance degradation on the PDC service hosted on the prod-eu-west-2 cluster. Our engineering team is working on fixing the issue; we apologize for any inconvenience.
Loki Prod 012 read-path-unstable
3 updates
Resolved since 03:02 UTC.
We started seeing instability in Alerting and Recording rules for this cell at 2:30 UTC. They began recovering at around 3:00 UTC, but we're still watching.
A fix has been implemented and we are monitoring the results.
Degraded Browser Check Performance
1 update
Spanning from November 10th, 18:00 UTC to November 11th, 22:00 UTC, Synthetic Monitoring experienced degraded browser check performance due to a faulty release that has been rolled back. This impacted all regions, specifically the probes. The API itself experienced no issues.