
An errant database query in a system designed to manage bot traffic took significant sections of the internet offline for several hours.

In a blog post published Tuesday night, Cloudflare co-founder and CEO Matthew Prince explains what led to the company’s “worst outage since 2019,” attributing the problem to a failure in the Bot Management system that controls which automated crawlers can access websites served through its CDN.
Cloudflare said last year that around 20% of the internet passes through its infrastructure, which is designed to spread the load and keep sites online through traffic surges and DDoS attacks. Today’s outage, however, hit a long list of services, knocking out everything from X to ChatGPT to the popular outage-tracking site Downdetector for several hours, echoing recent incidents tied to problems at Microsoft Azure and Amazon Web Services.
Cloudflare’s bot management is designed to tackle problems like scrapers extracting data to train generative AI. Recently, it unveiled a system that uses generative AI to create the “AI Labyrinth, a new approach that leverages AI-generated content to slow down, confuse, and exhaust the resources of AI Crawlers and other bots that ignore ‘no crawl’ requests.”
However, the company says today’s issues stemmed from a change to a database’s permissions system, not from generative AI, DNS problems, or what Cloudflare initially suspected might be a cyberattack or malicious activity such as a “hyper-scale DDoS attack.”
Prince writes that the machine learning model behind Bot Management, which generates bot scores for requests crossing its network, relies on a frequently updated configuration file that helps it identify automated requests. However, “A change in our core ClickHouse query behavior that produces this file led to a substantial number of duplicate ‘feature’ rows.”
The blog post goes into more detail about what happened next, but in short, the query change caused its ClickHouse database to return duplicate entries, and the bloated configuration file quickly blew past a preset memory limit, bringing down “the core proxy system responsible for traffic processing for our clients, for any traffic reliant on the bots module.”
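To make that failure mode concrete, here is a minimal sketch, not Cloudflare’s actual code, of how a generated configuration file full of duplicate rows can trip a hard-coded limit in the component that loads it; the feature names, the limit, and the loader behavior are all illustrative assumptions:

```python
# Hypothetical sketch (not Cloudflare's code): a metadata query that suddenly
# returns every row twice silently doubles a generated config file, pushing it
# past a hard-coded limit in the service that consumes it.

MAX_FEATURES = 200  # illustrative preset limit, not Cloudflare's real value

def build_feature_file(rows):
    """rows: (feature_name, feature_type) tuples returned by the database query."""
    # No deduplication: if the query now sees the same table in two databases,
    # every feature appears twice and the file roughly doubles in size.
    return [{"name": name, "type": ftype} for name, ftype in rows]

def load_feature_file(features):
    # The consumer allocates space for a fixed number of features and treats
    # anything larger as a fatal error.
    if len(features) > MAX_FEATURES:
        raise RuntimeError(f"{len(features)} features exceeds limit of {MAX_FEATURES}")
    return {f["name"]: f["type"] for f in features}

# Normal run: 150 distinct features fits under the limit.
normal_rows = [(f"feature_{i}", "f64") for i in range(150)]
load_feature_file(build_feature_file(normal_rows))

# After the permissions change, the same query returns each row twice.
try:
    load_feature_file(build_feature_file(normal_rows * 2))
except RuntimeError as err:
    print(f"loader failed: {err}")  # analogous to the bots module falling over
```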
As a result, customers whose rules used Cloudflare’s bot scores to block specific bots saw false positives that cut off legitimate traffic, while Cloudflare users whose policies didn’t rely on the generated bot score stayed online.
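As a rough illustration, assuming a rule of the form “block anything with a bot score below a threshold” and a broken module that reports a bot-like score for every request, the split between the two groups of customers looks something like this; the threshold and scores are made up for the example:

```python
# Hypothetical rule evaluation, not Cloudflare's engine: customers who keyed
# rules off the bot score blocked everyone once the score became meaningless,
# while customers without such rules were untouched.
BLOCK_THRESHOLD = 30  # illustrative: scores below this are treated as bots

def evaluate(bot_score, rule_uses_bot_score):
    if rule_uses_bot_score and bot_score < BLOCK_THRESHOLD:
        return "blocked"
    return "allowed"

# Healthy module: a real browser gets a high score and passes.
print(evaluate(bot_score=95, rule_uses_bot_score=True))   # allowed

# Failing module: every request gets a bot-like score, so the same rule
# now blocks legitimate visitors (a false positive).
print(evaluate(bot_score=0, rule_uses_bot_score=True))    # blocked

# A customer whose rules ignore the score keeps serving traffic either way.
print(evaluate(bot_score=0, rule_uses_bot_score=False))   # allowed
```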
For now, it has outlined four specific steps to prevent similar incidents in the future, even if the increasing centralization of web services may make outages like this unavoidable:
- Strengthening the ingestion of Cloudflare-generated configuration files similar to how we would for user-generated inputs (a rough sketch of what this could look like follows the list)
- Allowing more global kill switches for functions
- Preventing core dumps or other error reports from overwhelming system resources
- Assessing failure modes for error conditions throughout all core proxy modules
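On the first of those points, here is a minimal sketch of what “hardening ingestion” of an internally generated configuration file might involve; the file format, limits, and fallback behavior are assumptions for illustration, not Cloudflare’s implementation:

```python
# Hypothetical hardening of config ingestion: validate an internally generated
# file as if it were untrusted user input, and fall back to the last known-good
# version instead of letting a bad file take down the service.
import json

MAX_FILE_BYTES = 1_000_000  # refuse configs that are suspiciously large
MAX_FEATURES = 200          # same budget the consumer enforces

def validate_feature_config(raw: bytes) -> dict:
    if len(raw) > MAX_FILE_BYTES:
        raise ValueError("config exceeds size budget")
    features = json.loads(raw)
    names = [f["name"] for f in features]
    if len(names) != len(set(names)):
        raise ValueError("duplicate feature names")
    if len(names) > MAX_FEATURES:
        raise ValueError("too many features")
    return {f["name"]: f["type"] for f in features}

def apply_config(raw: bytes, current: dict) -> dict:
    """Return the new config if it validates, otherwise keep the current one."""
    try:
        return validate_feature_config(raw)
    except (ValueError, KeyError, TypeError):
        return current
```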