Cloudflare Apologises For 'the Pain We Caused The Internet' And...

A teeny, tiny bug turned into a major internet outage.

Just about everybody who was online yesterday will have noticed something was wrong with ye olde internet. A major issue at network service provider Cloudflare brought down everything from X and ChatGPT to order screens at McDonald's restaurants. Now Cloudflare has posted a full explanation of the outage, and it turns out the problem was entirely internal and self-inflicted.

Cloudflare CEO Matthew Prince's mea culpa was remarkably unambiguous. Up front and literally in bold, he kicks off with the following statement:

"The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind."

He follows that up with an entirely caveat-free apology. "We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today."

So, what, exactly, happened? Cloudflare did initially suspect foul play, most likely a massive DDoS or distributed denial of service attack. However, further investigation revealed that, "it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a 'feature file' used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network."

Unfortunately, Cloudflare's bot management software had a hard-coded file size limit, which that newly doubled feature file exceeded. Boom, that software failed.
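
To make that failure mode concrete, here is a minimal Python sketch of a loader with a fixed limit, plus one way a reload could fall back to the last good file instead of failing outright. This is not Cloudflare's code: the names `load_features` and `reload_features`, the limit of 200 and the choice to count entries rather than bytes are all assumptions made purely for illustration.

```python
MAX_FEATURES = 200  # hypothetical hard-coded limit, not Cloudflare's real value


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the loader allows."""


def load_features(lines, max_features=MAX_FEATURES):
    """Parse feature names, refusing files that exceed the fixed limit."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


def reload_features(current, new_lines, max_features=MAX_FEATURES):
    """Return the new feature list, or keep the current one if the new file is rejected."""
    try:
        return load_features(new_lines, max_features)
    except FeatureFileTooLarge:
        # Assumed mitigation: keep serving with the last known good file.
        return current


good = [f"feature_{i}" for i in range(150)]
doubled = good + good  # duplicate entries double the file's size
active = reload_features(good, doubled)
```

In this sketch, calling `load_features(doubled)` directly raises an error, which is roughly the "boom" described above, while `reload_features` quietly keeps the previous list in place.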

The problem began at 11:20 UTC and Cloudflare says it had correctly identified the issue, stopped the propagation of the larger-than-expected feature file and got core traffic "largely flowing as normal" by 14:30. By 17:06, "all systems at Cloudflare were functioning as normal."

Cloudflare says the event was the company's worst outage since 2019. Several mitigations are being put in place to prevent a repeat, including more global kill switches for features and eliminating the ability for core dumps or other error reports to overwhelm system resources.

Source: PC Gamer