Grafana Cloud Outage History

Past incidents and downtime events

Complete history of Grafana Cloud outages, incidents, and service disruptions. Showing the 50 most recent incidents.

February 2026 (1 incident)

critical · resolved · Feb 3, 04:52 PM — Resolved Feb 3, 05:56 PM

IRM Pages Not Accessible

4 updates
resolvedFeb 3, 05:56 PM

This incident has been resolved.

monitoringFeb 3, 05:38 PM

A fix has been implemented, and we are seeing recovery as the rollout progresses. We will continue to monitor results.

identifiedFeb 3, 05:29 PM

The issue has been identified and we are implementing a fix.

investigatingFeb 3, 04:52 PM

As of 16:40 UTC, we are currently investigating an issue where IRM pages are not accessible. Users may experience errors or be unable to load IRM-related pages during this time. Our team is actively working to identify the root cause and restore full functionality as quickly as possible. We will provide updates as more information becomes available.

January 2026 (20 incidents)

major · resolved · Jan 28, 05:27 PM — Resolved Jan 28, 08:25 PM

Some Dashboards in Prod-Us-Central-3 unable to load

3 updates
resolvedJan 28, 08:25 PM

This incident has been resolved.

monitoringJan 28, 06:24 PM

A fix has been implemented, and we are monitoring the results.

investigatingJan 28, 05:27 PM

We are currently investigating an issue impacting dashboards for users in the prod-us-central-3 region, which is preventing affected dashboards from loading as expected. A very small subset of users in the prod-us-central-0 region is also impacted. We will provide more details regarding the scope as they become available.

minor · resolved · Jan 27, 08:37 PM — Resolved Jan 28, 12:22 AM

Grafana OnCall and IRM Loading Issues

3 updates
resolvedJan 28, 12:22 AM

We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoringJan 27, 10:56 PM

As of 22:55 UTC, we have observed marked improvement with the incident impacting IRM and OnCall. We are still investigating and will continue to monitor and provide updates.

investigatingJan 27, 08:37 PM

We are currently investigating an issue impacting some customers when accessing Grafana OnCall and IRM. Impacted customers may experience long load times or even timeouts when attempting to access these components. We'll provide more information as it becomes available.

major · resolved · Jan 27, 10:17 AM — Resolved Jan 27, 11:14 AM

Grafana Cloud instances unavailable

3 updates
resolvedJan 27, 11:14 AM

This incident has been resolved.

monitoringJan 27, 10:33 AM

A fix has been implemented and we are monitoring the results.

investigatingJan 27, 10:17 AM

Some users are finding their Grafana Cloud instances unavailable.

none · resolved · Jan 27, 07:49 AM — Resolved Jan 27, 07:49 AM

Increased write error rate for logs in prod-us-west-0

1 update
resolvedJan 27, 07:49 AM

We were experiencing an increased write error rate for logs in prod-us-west-0 from 6:55 to 7:15 UTC. We have since observed continued stability and are marking this as resolved.

major · resolved · Jan 26, 08:53 PM — Resolved Jan 27, 12:13 AM

Upgrade from Free → Pro failing for users

3 updates
resolvedJan 27, 12:13 AM

Engineering has released a fix and as of 00:13 UTC, customers should no longer experience issues upgrading from Free to Pro subscriptions. At this time, we are considering this issue resolved. No further updates.

identifiedJan 26, 09:52 PM

Engineering has identified the issue and is currently exploring remediation options. At this time, users will continue to experience the inability to upgrade from Free to Pro subscriptions. We will continue to provide updates as more information is shared.

investigatingJan 26, 08:53 PM

As of 20:05 UTC, our engineering team became aware of an issue related to subscription plan upgrades. Users experiencing this issue will not be able to upgrade from a Free plan to a Pro subscription. Engineering is actively engaged and assessing the issue. We will provide updates accordingly.

none · resolved · Jan 23, 03:37 PM — Resolved Jan 23, 06:44 PM

Investigating Issues with Email Delivery

3 updates
resolvedJan 23, 06:44 PM

This incident has been resolved.

monitoringJan 23, 04:55 PM

We are noticing significant improvement, and things are stabilizing as expected. Our engineering teams will continue to monitor progress.

investigatingJan 23, 03:37 PM

We are currently investigating an issue impacting email delivery for some services, including Alert Notifications.

none · resolved · Jan 21, 09:16 PM — Resolved Jan 22, 10:29 PM

Synthetic monitoring secrets - proxy URL changes

2 updates
resolvedJan 22, 10:29 PM

The incident is resolved. We are in contact with customers affected by this change.

identifiedJan 21, 09:16 PM

During the secrets migration in https://status.grafana.com/incidents/47d1q4sphrmj, secrets proxy URLs for some customers were updated in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2. This was an unexpected breaking change affecting a subset of customers, specifically those who are using secrets on private probes behind a firewall. We are investigating. If your private probes are impacted, please update your firewall rules for the secrets proxy to allow outbound connections to the updated hosts:

gsm-proxy-prod-eu-west-2.grafana.net -> gsm-proxy-prod-eu-west-4.grafana.net
gsm-proxy-prod-us-central-0.grafana.net -> gsm-proxy-prod-us-central-4.grafana.net
gsm-proxy-prod-us-east-0.grafana.net -> gsm-proxy-prod-us-east-2.grafana.net

Note that this URL change affects only a small subset of customers; the majority of customers will not need to update firewall rules. Affected customers will see an error like the following in private probe logs:

Error during test execution: failed to get secret: Get "https://gsm-proxy-prod-us-east-2.grafana.net/api/v1/secrets/.../decrypt": Forbidden
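
For customers updating firewall rules, one way to confirm that the new secrets proxy hosts are reachable from the network where private probes run is to attempt an outbound TLS connection to each updated host. The following is a minimal, illustrative sketch (not an official tool): it only checks DNS resolution and TLS connectivity on port 443 for the hostnames listed above.

```python
# Minimal reachability check for the updated secrets proxy hosts.
# Illustrative sketch only: verifies DNS resolution and a TLS handshake
# from the network where private probes run; it does not call any
# authenticated Synthetic Monitoring API.
import socket
import ssl

NEW_HOSTS = [
    "gsm-proxy-prod-eu-west-4.grafana.net",
    "gsm-proxy-prod-us-central-4.grafana.net",
    "gsm-proxy-prod-us-east-2.grafana.net",
]

def reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if the host resolves and a TLS handshake completes."""
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except OSError as exc:  # covers DNS failures, timeouts, and TLS errors
        print(f"{host}: NOT reachable ({exc})")
        return False

if __name__ == "__main__":
    for host in NEW_HOSTS:
        if reachable(host):
            print(f"{host}: reachable")
```

If a host is not reachable from the probe network, the firewall egress rules most likely still need to be updated.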

minor · resolved · Jan 21, 01:24 PM — Resolved Jan 21, 03:21 PM

Hosted Traces elevated write latency in prod-us-central-0 region.

3 updates
resolvedJan 21, 03:21 PM

We consider this incident resolved, as latency has not been elevated since the fix was applied. The issue was caused by a latency spike in a downstream dependency, which increased backpressure on the Hosted Traces ingestion path, degraded gateway performance, and resulted in elevated write latency. After the affected gateway services were cleared, the degraded state went away and normal operation was restored.

monitoringJan 21, 01:35 PM

The issue was identified and a fix was applied. After applying the fix, latency went down to a regular and expected value. We're currently monitoring the component's health before resolving the incident.

investigatingJan 21, 01:24 PM

We're currently investigating elevated write latency for Hosted Traces in the prod-us-central-0 region, which has experienced sustained high write latency since 7:20 AM UTC. Only a small subset of requests is impacted.

none · resolved · Jan 19, 02:30 PM — Resolved Jan 19, 02:30 PM

Incident: Metrics Querying Unavailable in EU (Resolved)

1 update
resolvedJan 19, 03:30 PM

Impact: Between 14:30 and 14:38 UTC, some customers in prod-eu-west-2 may have experienced issues querying metrics. During this time, read requests to the metrics backend were unavailable, resulting in failed or incomplete query responses. The root cause of the issue was identified and addressed.

Resolution: The affected components were restored, and service was fully available by 14:38 UTC. We have taken additional steps to prevent this type of disruption from occurring in the future.

Next Steps: We are reviewing monitoring and safeguards around this failure mode to further improve reliability.

minor · resolved · Jan 17, 11:28 AM — Resolved Jan 19, 01:21 AM

Degraded Writes in AWS us-east-2

9 updates
resolvedJan 19, 01:21 AM

This incident has been resolved.

monitoringJan 18, 09:24 AM

The issue has not been seen for a reasonable amount of time and did not recur when it was expected to. We're still closely monitoring system behaviour and will update this incident accordingly.

investigatingJan 18, 02:20 AM

We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.

identifiedJan 18, 02:14 AM

We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.

monitoringJan 17, 08:16 PM

We are continuing to monitor for any further issues.

monitoringJan 17, 08:00 PM

The impact on this has been mitigated at this time and we are currently monitoring.

investigatingJan 17, 06:08 PM

We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.

investigatingJan 17, 04:47 PM

We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.

investigatingJan 17, 11:28 AM

We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.

none · resolved · Jan 17, 07:51 AM — Resolved Jan 17, 09:04 AM

Degraded Writes in AWS us-east-2

2 updates
resolvedJan 17, 09:04 AM

This incident has been resolved.

investigatingJan 17, 07:51 AM

We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.

major · resolved · Jan 16, 12:28 AM — Resolved Jan 16, 04:00 AM

Partial Mimir Write Outage

6 updates
resolvedJan 16, 04:00 AM

This incident has been resolved. Read and write 5xx errors and increased latency were experienced in two periods: 23:56:15 to 00:32:45 UTC, and 00:55:30 to 01:36:15 UTC.

monitoringJan 16, 02:27 AM

Customers should no longer experience issues. We will continue to monitor and provide updates.

investigatingJan 16, 01:35 AM

We are continuing to investigate this issue.

investigatingJan 16, 01:33 AM

Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data. We continue to investigate and will update when we have more to share.

monitoringJan 16, 12:59 AM

As of 00:28 UTC, we have observed improvement with the partial write outage. Customers should no longer experience issues with metrics ingestion. We will continue to monitor and provide updates.

investigatingJan 16, 12:28 AM

As of 23:57 UTC, our engineers became aware of an issue with prod-us-west-0 resulting in a partial write outage. Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data. We continue to investigate and will update when we have more to share.
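
Because the updates above note that retries may eventually succeed, write clients that retry on 5xx responses with backoff generally limit data loss during partial outages like this one. The snippet below is a generic, hedged sketch of that pattern; the endpoint URL and API key are hypothetical placeholders, and real remote-write clients (for example Prometheus or Alloy) already implement retry behaviour internally.

```python
# Hedged sketch: retrying a metrics push on 5xx responses with exponential backoff.
# PUSH_URL and API_KEY are placeholders, not actual Grafana Cloud values.
import time
import requests

PUSH_URL = "https://metrics-endpoint.example.invalid/api/prom/push"  # placeholder
API_KEY = "REDACTED"  # placeholder

def push_with_retries(payload: bytes, max_attempts: int = 5) -> bool:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                PUSH_URL,
                data=payload,
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=10,
            )
            if resp.status_code < 500:
                # 2xx succeeded; 4xx will not improve with retries.
                return resp.ok
            print(f"attempt {attempt}: got {resp.status_code}, retrying in {delay:.0f}s")
        except requests.RequestException as exc:
            print(f"attempt {attempt}: request failed ({exc}), retrying in {delay:.0f}s")
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # exponential backoff, capped at 30s
    return False
```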

major · resolved · Jan 14, 02:30 PM — Resolved Jan 14, 08:17 PM

Connectivity issues for Azure PrivateLink endpoints.

2 updates
resolvedJan 14, 08:17 PM

The scope of this incident was smaller than originally anticipated. As of 16:27 UTC our engineering team merged a fix for those affected and we are considering this as resolved.

investigatingJan 14, 02:30 PM

We're experiencing an issue with connectivity loss for Azure PrivateLink endpoints in all available Azure regions. The issue affects users trying to ingest Alloy data or use PDC over Azure PrivateLink. Our team is actively investigating to identify the root cause.

major · resolved · Jan 12, 03:44 PM — Resolved Jan 12, 06:21 PM

PDC Agent Connectivity Issues in prod-eu-west-3

4 updates
resolvedJan 12, 06:21 PM

We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoringJan 12, 05:01 PM

Engineering has released a fix and as of 17:01 UTC, customers should no longer experience connectivity issues. We will continue to monitor for recurrence and provide updates accordingly.

identifiedJan 12, 04:50 PM

Engineering has identified the issue and will be deploying a fix shortly. At this time, users will continue to experience disruptions for queries routed via PDC. We will continue to provide updates as more information is shared.

investigatingJan 12, 03:44 PM

We are investigating an issue in prod-eu-west-3 where PDC agents are failing to maintain/re-establish connectivity. Some agents are struggling to reconnect, which may cause disruptions or degraded performance for customer queries routed over PDC. We’ll share updates as we learn more.

minor · resolved · Jan 12, 09:03 AM — Resolved Jan 12, 03:26 PM

Tempo write degradation in prod-eu-west-3 - tempo-prod-08

4 updates
resolvedJan 12, 03:26 PM

Engineering has released a fix and we continue to observe a period of recovery. As of 15:12 UTC we are considering this resolved.

investigatingJan 12, 11:41 AM

There was a full degradation of the write service between 9:13 and 9:35 UTC. The cell is operational but there is still degradation in the write path. Our Engineering team is actively working on this.

investigatingJan 12, 09:09 AM

We are continuing to investigate this issue.

investigatingJan 12, 09:03 AM

We have been alerted to an issue with Tempo write degradation in prod-eu-west-3 - tempo-prod-08. The cell is operational but there is degradation in the write path, and write requests are taking longer than normal. This started at 7:00 UTC. Our Engineering team is actively investigating this.

none · resolved · Jan 9, 08:30 PM — Resolved Jan 9, 08:30 PM

Write Degradation in Grafana Cloud Logs (prod-us-east-3)

1 update
resolvedJan 9, 11:08 PM

Between 20:23 UTC and 20:53 UTC, Grafana Cloud Logs in prod-us-east-3 experienced a write degradation, which may have resulted in delayed or failed log ingestion for some customers. The issue has been fully resolved, and the cell is currently operating normally. We are continuing to investigate the root cause and will provide additional details if relevant.

none · resolved · Jan 7, 05:41 PM — Resolved Jan 7, 05:41 PM

Partial Write Outage in prod-us-central-0

1 update
resolvedJan 7, 05:41 PM

There was a ~15 minute partial write outage for some customers in prod-us-central-0. The time frame for this outage was 15:43-15:57 UTC.

major · resolved · Jan 6, 05:41 PM — Resolved Jan 6, 08:26 PM

High Latency and Errors in Prod-Us-Central-7

3 updates
resolvedJan 6, 08:26 PM

This incident has been resolved.

monitoringJan 6, 05:50 PM

We are seeing some recovery in affected products. We are continuing to monitor the progress.

investigatingJan 6, 05:41 PM

We are currently investigating an issue causing degraded Mimir and Tempo read performance in the prod-us-central-7 region.

none · resolved · Jan 6, 03:09 PM — Resolved Jan 6, 03:09 PM

Cloudflare Error 1016

1 update
resolvedJan 6, 03:09 PM

From 20:32 to 20:37 UTC, a DNS record misconfiguration resulted in temporary Cloudflare 1016 DNS errors on many Grafana Cloud stacks. The misconfiguration was mitigated within 5 minutes, and we are working with Cloudflare to better understand why the particular misconfiguration resulted in this outage.

minor · resolved · Jan 2, 10:44 AM — Resolved Jan 2, 01:38 PM

K6 test-runs cannot be started and the overall navigation experience is degraded

4 updates
resolvedJan 2, 01:38 PM

This incident has been resolved.

monitoringJan 2, 11:53 AM

We are continuing to monitor for any further issues.

monitoringJan 2, 11:51 AM

A fix has been implemented and we are monitoring the results.

investigatingJan 2, 10:44 AM

We are currently investigating this issue.

December 2025 (13 incidents)

minor · resolved · Dec 23, 08:31 PM — Resolved Dec 23, 09:59 PM

PDC Queries in Prod-Us-East-3 Are Failing

4 updates
resolvedDec 23, 09:59 PM

This incident has been resolved.

monitoringDec 23, 09:09 PM

A fix has been implemented and we are monitoring the results.

identifiedDec 23, 08:34 PM

We have identified the issue, and are working on a fix now.

investigatingDec 23, 08:31 PM

We are currently investigating an issue impacting PDC Queries in the Prod-Us-East-3 region. This issue is causing queries to fail, and affected customers are experiencing "no such host" errors when attempting to connect a data source via PDC.
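
When troubleshooting "no such host" errors like the ones described above, a useful first step is to confirm whether the hostname the agent or data source is trying to reach actually resolves from the affected network. The sketch below is generic and uses a placeholder hostname; substitute the host reported in your own error message.

```python
# Generic DNS resolution check for "no such host" style errors.
# HOSTNAME is a placeholder; use the host from your own error message.
import socket

HOSTNAME = "pdc-gateway.example.invalid"  # placeholder

try:
    addresses = {info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443)}
    print(f"{HOSTNAME} resolves to: {', '.join(sorted(addresses))}")
except socket.gaierror as exc:
    # gaierror is the resolver failure behind "no such host" messages.
    print(f"{HOSTNAME} does not resolve: {exc}")
```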

major · resolved · Dec 19, 04:10 PM — Resolved Dec 19, 04:50 PM

New K6 Tests Intermittently Failing

3 updates
resolvedDec 19, 04:50 PM

This incident has been resolved.

investigatingDec 19, 04:50 PM

We are continuing to investigate this issue.

investigatingDec 19, 04:10 PM

We are currently investigating an issue that is causing intermittent failures when starting new k6 tests from the k6 app in the cloud. Tests started from the CLI are working normally.

none · resolved · Dec 17, 01:02 PM — Resolved Dec 17, 01:02 PM

Logs write path degradation on GCP Belgium - prod-eu-west-0

1 update
resolvedDec 17, 01:02 PM

Today, December 17th, from 10:00 UTC to 10:38 UTC we experienced logs write path degradation. Customers may have experienced 5xx errors during ingestion on that cluster. Service is fully restored.

none · resolved · Dec 16, 12:51 PM — Resolved Dec 16, 12:51 PM

IRM access issues on instances in GCP Belgium (prod-eu-west-0)

1 update
resolvedDec 16, 12:51 PM

Due to an incident today in the IRM (OnCall) product, access to this application was degraded from 10:39 UTC until 11:16 UTC. Customers may have found the app inoperable or inaccessible. The service is now fully restored.

none · resolved · Dec 16, 10:37 AM — Resolved Dec 16, 10:42 AM

k6 Cloud service is down

2 updates
resolvedDec 16, 10:42 AM

This incident has been resolved.

investigatingDec 16, 10:37 AM

The main k6 Cloud service is down due to database issues, and it is not possible to access test runs or start new ones.

major · resolved · Dec 12, 02:17 PM — Resolved Dec 12, 02:38 PM

Some Trace QL Queries Failing

2 updates
resolvedDec 12, 02:38 PM

This incident has been resolved.

identifiedDec 12, 02:17 PM

TraceQL queries with "= nil" in Explore Traces and part of Drilldown Traces are failing with 400 Bad Request errors. The issue has been identified, and a fix is currently being rolled out.

major · resolved · Dec 11, 09:43 PM — Resolved Dec 11, 11:17 PM

Mimir Partial Write Outage

5 updates
resolvedDec 11, 11:17 PM

We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoringDec 11, 11:02 PM

Synthetic Monitoring has now also recovered. Customers should no longer experience alert rules failing to evaluate. We continue to monitor for recurrence and will provide updates accordingly.

monitoringDec 11, 10:33 PM

Engineering has released a fix and as of 22:25 UTC, customers should no longer experience ingestion issues. We will continue to monitor for recurrence and provide updates accordingly.

investigatingDec 11, 10:23 PM

While investigating this issue, we also became aware that Synthetic Monitoring is affected. Some customers may have alert rules failing to evaluate.

investigatingDec 11, 09:43 PM

As of 21:30 UTC, we are experiencing a partial ingestion outage in Grafana Mimir. This is affecting the write path, where some ingestion requests are failing or timing out. Our engineering team is actively investigating and working to identify the root cause.

major · resolved · Dec 10, 07:29 PM — Resolved Dec 10, 11:05 PM

Elevated Metric Push Failures and Latency

7 updates
resolvedDec 10, 11:05 PM

We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoringDec 10, 10:22 PM

We are observing a trend of improvement after implementing a fix. We will continue to monitor and update accordingly. During our investigation, we also became aware that some alerts associated with Synthetic Monitoring checks have been failing to evaluate correctly.

identifiedDec 10, 09:38 PM

Our engineering team has identified a potential root cause, and a fix is being implemented.

investigatingDec 10, 08:48 PM

Our engineering team has engaged our Cloud Service Provider, and together we are continuing to investigate this issue.

investigatingDec 10, 07:49 PM

We are continuing to investigate this issue.

investigatingDec 10, 07:49 PM

We have identified that trace ingestion may also be affected. Some customers may experience elevated latency and intermittent errors when sending traces. Investigation is ongoing.

investigatingDec 10, 07:29 PM

We have detected an issue causing some customers to experience failed metric pushes as well as increased latency when sending metrics. The issue was first observed at 18:30 UTC. Our engineering team is actively investigating the root cause and working to restore normal operation as quickly as possible. We will provide further updates as more information becomes available. Thank you for your patience while we work to resolve this issue.

major · resolved · Dec 10, 07:30 PM — Resolved Dec 10, 07:30 PM

Elevated Log Push Failures and Latency on prod-eu-west-0 cluster

1 update
resolvedDec 11, 02:35 PM

Users experienced failed log pushes as well as increased latency when sending logs to the Loki service hosted on the prod-eu-west-0 cluster between 18:30 UTC and ~23:00 UTC. Our engineering team engaged our Cloud Service Provider, and a fix was implemented that mitigated the issue.

critical · resolved · Dec 10, 08:23 AM — Resolved Dec 10, 09:06 AM

Metrics read issue affecting cortex-prod-13 on prod-us-east-0

3 updates
resolvedDec 10, 09:06 AM

The incident has been resolved.

monitoringDec 10, 08:30 AM

The read path was restored at 08:23 UTC and queries are fully functioning again. The read path outage lasted from 08:04 to 08:23 UTC.

investigatingDec 10, 08:23 AM

At 08:04 UTC we detected a read path outage (queries) on cortex-prod-13. We are currently investigating this issue. The ingestion path (writes) is not affected.

major · resolved · Dec 9, 12:59 PM — Resolved Dec 9, 01:10 PM

Logs query degradation on AWS Germany (prod-eu-west-2)

3 updates
resolvedDec 9, 01:10 PM

The issue has been resolved.

monitoringDec 9, 01:06 PM

The query service is operational again, and log reads should be available on the cluster. Our engineers are monitoring the health status of the service to ensure full recovery.

identifiedDec 9, 12:59 PM

Since around 12:30 UTC today, December 9th, we have been experiencing problems on the Loki read path of the eu-west-2 cluster. This means customers on this cluster may have difficulty querying logs, and alerts and other services based on these logs may also be impacted. Our engineers are actively working to restore the service.

none · resolved · Dec 5, 09:25 AM — Resolved Dec 5, 09:44 AM

Hosted Grafana is currently being impacted as a result of the Cloudflare outage

2 updates
resolvedDec 5, 09:44 AM

This incident has been resolved.

investigatingDec 5, 09:25 AM

We are currently experiencing disruptions to Hosted Grafana services due to a widespread Cloudflare outage impacting connectivity across multiple regions. Our team is actively monitoring the situation and will provide updates as Cloudflare works to restore normal operation.

minor · resolved · Dec 1, 08:00 AM — Resolved Dec 1, 08:00 AM

Loki prod-ap-northeast-0-loki-prod-030 writes degradation

1 update
resolvedDec 1, 09:30 AM

The Loki prod-ap-northeast-0-loki-prod-030 cell had write degradation between 8:11 and 8:58 AM UTC. The engineering team mitigated the situation and the cell is now stable.

November 2025 (16 incidents)

minor · resolved · Nov 27, 04:49 PM — Resolved Nov 27, 06:27 PM

Alerts failing with Prometheus

3 updates
resolvedNov 27, 06:27 PM

This incident has been resolved.

monitoringNov 27, 06:02 PM

A fix has been implemented and we are monitoring the results.

investigatingNov 27, 04:49 PM

We are currently investigating an issue around degraded services in prod-us-central-0. The expected impact is that queries may take longer than usual to respond.

critical · resolved · Nov 24, 11:19 AM — Resolved Nov 24, 11:49 AM

Synthetic Monitoring is down in prod-us-central-7

2 updates
resolvedNov 24, 11:49 AM

This incident has been resolved.

investigatingNov 24, 11:19 AM

Users cannot interact with the SM API for any DB-related action, such as CRUD operations on checks and probes.

major · resolved · Nov 21, 03:47 PM — Resolved Nov 22, 05:06 PM

Longer Than Expected Load Times on Grafana Cloud

3 updates
resolvedNov 22, 05:06 PM

We continue to observe a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoringNov 21, 05:04 PM

A fix has been implemented, and we are seeing latency come down across clusters. We are continuing to monitor progress.

investigatingNov 21, 03:47 PM

We are currently investigating reports of long load times on Grafana Cloud. We will update as more information becomes available.

major · resolved · Nov 21, 02:32 PM — Resolved Nov 21, 03:38 PM

Some Loki Writes in Prod-Gb-South-0 Failed

2 updates
resolvedNov 21, 03:38 PM

This incident has been resolved.

monitoringNov 21, 02:32 PM

From approximately 14:10-14:25 UTC, writes to Loki failed for a subset of customers in the gb-south-0 region. Most of these errors have already recovered, and our team continues to monitor the recovery.

minor · resolved · Nov 21, 12:09 PM — Resolved Nov 21, 02:52 PM

Slow user queries exceed threshold

4 updates
resolvedNov 21, 02:52 PM

This incident has been resolved.

monitoringNov 21, 01:08 PM

A fix has been implemented and we are monitoring the results.

investigatingNov 21, 12:10 PM

We are seeing some intermittent query failures.

investigatingNov 21, 12:09 PM

We are currently investigating this issue.

major · resolved · Nov 20, 05:12 PM — Resolved Nov 20, 06:47 PM

Elevated Read & Write Latency for Some Cells in Prod-Us-East-0

4 updates
resolvedNov 20, 06:47 PM

This incident has been resolved.

monitoringNov 20, 05:47 PM

Things have recovered, and we are monitoring to ensure stability.

identifiedNov 20, 05:23 PM

The previous post mentioned that this was occurring in some cells in prod-us-central-0. This is incorrect; the issue is occurring in some cells in prod-us-east-0.

identifiedNov 20, 05:12 PM

At approximately 16:40 UTC we noticed an issue causing increased read & write latency in some prod-us-central-0 cells. At this stage we are noticing some recovery and will continue to monitor.

none · resolved · Nov 19, 09:00 PM — Resolved Nov 19, 09:00 PM

Intermittent issues when starting k6 cloud test runs

1 update
resolvedNov 20, 04:35 PM

We are experiencing issues starting cloud test runs. This is primarily affecting browser test runs and test runs using static IPs.

none · resolved · Nov 19, 06:20 AM — Resolved Nov 19, 06:20 AM

Missing Billing Metrics for Loki

1 update
resolvedNov 19, 06:20 AM

An incident impacted Loki billing metrics from 05:30 to 06:10 UTC across all clusters. This is now resolved; however, users may notice some billing metrics missing from the billing dashboard for this time period. There is no impact on log querying or ingestion.

none · resolved · Nov 18, 11:59 AM — Resolved Nov 18, 05:34 PM

Hosted Grafana is currently being impacted as a result of the Cloudflare outage

2 updates
resolvedNov 18, 05:34 PM

This incident has been resolved.

investigatingNov 18, 11:59 AM

We are currently experiencing disruption to Hosted Grafana services due to a widespread Cloudflare outage impacting connectivity across multiple regions. Our team is actively monitoring the situation and will provide updates as Cloudflare works to restore normal operation.

major · resolved · Nov 17, 02:06 PM — Resolved Nov 18, 07:26 AM

Cortex - read/write path disruption

4 updates
resolvedNov 18, 07:26 AM

This incident has been resolved.

identifiedNov 17, 10:49 PM

We’re continuing to work on this issue, and are actively investigating the remaining details. Our team is making progress, and we’ll share another update as soon as we have more information to provide.

identifiedNov 17, 02:49 PM

We are continuing to work on a fix for this issue.

identifiedNov 17, 02:06 PM

We are currently observing a read/write path disruption in prod-us-central.cortex-prod-04.

major · resolved · Nov 17, 10:22 PM — Resolved Nov 18, 12:10 AM

Elevated Mimir Read/Write Errors

7 updates
resolvedNov 18, 12:10 AM

We continue to observe a sustained period of recovery. As of 00:08 UTC, we are considering this issue resolved. No further updates.

monitoringNov 17, 11:27 PM

A fix has been implemented and we are monitoring the results.

investigatingNov 17, 10:55 PM

Our teams have been alerted that Synthetic Monitoring will also be affected by this outage. Users may see gaps in their Synthetic Monitoring metrics as well as missed alerts as a result of this. We continue to investigate and will provide further updates as they become available.

investigatingNov 17, 10:47 PM

The investigation has revealed that metric ingestion is also affected, including log-generated recording rules. We are continuing to investigate the root cause and will provide further updates as more information becomes available.

investigatingNov 17, 10:47 PM

The investigation has revealed that metric ingestion is also affected, including log-generated recording rules. We are continuing to investigate the root cause and will provide further updates as more information becomes available.

investigatingNov 17, 10:33 PM

Span metrics have also been identified as affected. We are continuing to investigate the root cause and will share further updates as more information becomes available.

investigatingNov 17, 10:22 PM

We are investigating elevated Mimir read errors beginning at approximately 21:57 UTC. The errors are technically retriable, but most retries are unlikely to succeed at this time. This may result in failed or delayed query responses for some users. Engineering is actively investigating the root cause and working to restore normal read performance. We will provide further updates as we learn more.

major · resolved · Nov 17, 08:02 PM — Resolved Nov 17, 11:08 PM

Metrics Write Outage in Multiple Cells

6 updates
resolvedNov 17, 11:08 PM

As of 23:07 UTC we are considering this incident as resolved. Mitigation efforts have restored normal write performance, and error rates have returned to expected levels. We have confirmed stability across the affected areas and continue to monitor, but no further impact is expected. If you continue to experience any issues, please reach out to support.

monitoringNov 17, 09:22 PM

We are seeing improvement on the metrics side, with write performance recovering. We continue to investigate the remaining impact to Synthetic Monitoring and are working to determine the underlying cause. Monitoring will continue as recovery progresses, and we’ll provide further updates as we learn more.

identifiedNov 17, 08:50 PM

Our teams have been alerted that Synthetic Monitoring will also be affected by this outage. Users may see gaps in their Synthetic Monitoring metrics as well as missed alerts as a result of this. We continue to investigate and will provide further updates as they become available.

identifiedNov 17, 08:34 PM

We’ve re-evaluated the situation and this issue is still ongoing. Although we initially observed signs of recovery, write errors continue to occur in the affected cells. Mitigation work is still in progress, and we’re treating the incident as identified again while we work toward a sustained resolution. We’ll provide further updates as we confirm stabilization.

monitoringNov 17, 08:11 PM

Mitigation has been applied and Mimir write performance is beginning to recover in the affected cells. prod-us-central-0.cortex-prod-10 appears to have recovered as of 19:52 UTC, and prod-us-central-5.cortex-dedicated-06 is showing signs of recovery as of 20:00 UTC. We are continuing to monitor both cells closely to ensure the mitigation is effective and that the systems remain stable.

investigatingNov 17, 08:02 PM

We are investigating a partial write outage affecting multiple metrics cells, beginning around 19:30 UTC. Some customers may see intermittent write failures or delays, but most requests should succeed after retries and recent metrics may appear late as a result. Querying previously ingested data remains unaffected. Engineering is continuing to investigate and will provide further updates as more information becomes available.

minor · resolved · Nov 17, 07:01 PM — Resolved Nov 17, 07:01 PM

Hyderabad Probe Issues

1 update
resolvedNov 17, 07:01 PM

We experienced degraded service with the Hyderabad probe today, beginning around 13:20 UTC; this was resolved as of 17:30 UTC.

none · resolved · Nov 17, 11:56 AM — Resolved Nov 17, 12:23 PM

PDC-Prod-eu-west-2 cluster degraded performance

3 updates
resolvedNov 17, 12:23 PM

This incident has been resolved.

monitoringNov 17, 12:22 PM

Engineering has released a fix and as of 12:15 UTC, customers should no longer experience performance degradation on the PDC service. We will continue to monitor for recurrence and provide updates accordingly.

investigatingNov 17, 11:56 AM

We are currently experiencing performance degradation on the PDC service hosted on the prod-eu-west-2 cluster. Our engineering team is working on fixing the issue; we apologize for any inconvenience.

minor · resolved · Nov 17, 04:12 AM — Resolved Nov 17, 04:55 AM

Loki Prod 012 read-path-unstable

3 updates
resolvedNov 17, 04:55 AM

Resolved since 03:02 UTC.

monitoringNov 17, 04:13 AM

We started seeing instability in alerting and recording rules for this cell beginning at 2:30 AM UTC. They began recovering at around 3:00 AM UTC, but we're still watching.

monitoringNov 17, 04:12 AM

A fix has been implemented and we are monitoring the results.

minor · resolved · Nov 12, 04:50 PM — Resolved Nov 12, 04:50 PM

Degraded Browser Check Performance

1 update
resolvedNov 12, 04:50 PM

From November 10th, 18:00 UTC to November 11th, 22:00 UTC, Synthetic Monitoring experienced degraded browser check performance due to a faulty release that has since been rolled back. This impacted all regions, specifically the probes. The API itself experienced no issues.