Mistah Kurtz-he dead
A penny for the Old Guy
On July 19, 2024, Crowdstrike, a well-known cybersecurity provider in both the private and public sectors, sent industries from air travel to healthcare into various states of meltdown with an improperly validated update to its Falcon agent.
This update threw millions of Windows devices into boot loops, sending companies large and small into utter technical chaos for days. Delta Air Lines CEO Ed Bastian alleges Crowdstrike’s botched update cost Delta $500 million, a claim that Crowdstrike CEO George Kurtz has dismissed, pinning the blame on Delta for its refusal to accept technical assistance from Crowdstrike.
Delta’s legal action against Crowdstrike is one in a sea of lawsuits materializing against the company for the massive losses the defective update caused. Overall monetary losses to Crowdstrike’s Fortune 500 clients are estimated to be around $5 billion.
To understand why this buggy update was so devastating, we must understand the way Crowdstrike’s Falcon EDR (endpoint detection and response) platform works. The Falcon EDR agent requires a kernel-mode driver which grants it low-level access to the Windows operating system.
This kernel driver allows the Falcon agent to continuously monitor Windows user space and kernel space for malicious executables, attachments, and other potential cybersecurity threats. This model provides a substantial degree of protection to the device, but the use of a kernel-mode driver is a double-edged sword: if the driver fails to initiate the Crowdstrike agent correctly, the operating system can crash and subsequently fail to boot.
Microsoft’s response was to issue a technical incident response memo about the Crowdstrike failure, discouraging security vendors from using kernel-mode drivers. Crowdstrike’s kernel-mode driver, while signed and blessed by Microsoft, relies on frequent updates from Crowdstrike which are not individually signed or audited by Microsoft, or by any other third party.
Thus, while the driver itself was deemed safe, the Falcon agent failed to safely handle a bad configuration file from Crowdstrike and read memory it shouldn’t have accessed. Because that code runs in kernel mode, the fault could not be contained to a single program: it brought down the kernel and the operating system along with it.
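The failure mode described above, a parser trusting a file's layout and reading past the data it was actually given, can be illustrated with a small sketch. This is not Crowdstrike's code or file format; the comma-separated record layout, the `EXPECTED_FIELDS` count, and the function names are all assumptions made up for illustration. The only point is that a parser should check how many fields it received before anything indexes into them, and reject the file rather than carry on.

```python
# Illustrative only: a hypothetical content-file parser, not Crowdstrike's format.
EXPECTED_FIELDS = 21  # assumed field count per record, arbitrary for this sketch


class ContentValidationError(Exception):
    """Raised when a content record does not match the expected layout."""


def parse_record(line: str) -> list[str]:
    """Split one record and verify its field count before returning it."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != EXPECTED_FIELDS:
        raise ContentValidationError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def load_content(lines: list[str]) -> list[list[str]]:
    """Parse every record; a single bad record rejects the whole file."""
    return [parse_record(line) for line in lines]
```

In a memory-safe language a short record would raise an error on its own; in kernel-mode C or C++ the same mistake can become an invalid memory read, which is why the explicit check matters there.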
Kernel-space code runs close to the hardware (or ‘near the metal’), which is advantageous for cybersecurity applications that need low-level operating system access.
This low-level access, however, has to be weighed against the potentially devastating outcomes of a bad update or untested configuration change. David Weston, Microsoft’s Vice President of Enterprise and OS Security, outlined an approach that gives a security application the kernel-space access it needs while reducing the risk to the kernel in the event of a botched update:
"For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement, limiting exposure to availability issues," he explained. "The remainder of the key product functionality includes managing updates, parsing content, and other operations that can occur isolated within user mode where recoverability is possible."
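The split Weston describes can be sketched in miniature. The example below is an illustration under stated assumptions, not Microsoft's or any vendor's design: a throwaway Python child process stands in for "user mode", the update format is assumed to be JSON with a `rules` list, and the names `parse_in_isolation` and `apply_update` are hypothetical. If the parser crashes or rejects the content, only the child process dies and the caller keeps its last known-good configuration.

```python
import json
import subprocess
import sys

# Child program: parse JSON from stdin, sanity-check it, echo it back.
# Any crash here kills only the child process, never the caller.
_CHILD = (
    "import json, sys\n"
    "data = json.load(sys.stdin)\n"
    "if not isinstance(data, dict) or not isinstance(data.get('rules'), list):\n"
    "    sys.exit(1)\n"
    "json.dump(data, sys.stdout)\n"
)


def parse_in_isolation(raw: str, timeout: float = 10.0):
    """Return the parsed config, or None if the isolated parser failed."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=raw, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung parser is treated like a crashed one
    if proc.returncode != 0:
        return None
    return json.loads(proc.stdout)


def apply_update(current: dict, raw_update: str) -> dict:
    """Adopt the update only if it parsed cleanly; otherwise keep what works."""
    parsed = parse_in_isolation(raw_update)
    return parsed if parsed is not None else current
```

Calling `apply_update(current, raw)` with malformed input simply returns `current` unchanged, which is the "recoverability" the quote refers to.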
Crowdstrike places the blame for the failed update on its content validation pipeline. What remains unclear is how many standard industry practices Crowdstrike has actually adopted, such as sandboxing for update and change testing.
Worryingly, this kind of failure suggests a lack of industry-standard CI/CD (continuous integration/continuous deployment) practices, such as testing updates in isolation and rolling them out in stages, which most likely could have prevented the global outage caused by the bungled update.
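One practice commonly associated with mature deployment pipelines is a staged (or canary) rollout: ship to a small ring of machines first, check their health, and only then widen the audience. How Crowdstrike handled content updates internally is not established here; the sketch below is generic, and the ring sizes, the `deploy` and `is_healthy` callbacks, and the failure threshold are all assumptions.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    ring_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy ring by ring and stop widening once a ring looks unhealthy.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    done = 0  # number of hosts already covered by earlier rings
    for fraction in ring_fractions:
        # Cumulative host count this ring should reach (at least one host).
        target = max(1, int(len(hosts) * fraction))
        ring = hosts[done:target]
        for host in ring:
            deploy(host)
            updated.append(host)
        done = max(done, target)
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the bad update
    return updated
```

With 200 hosts and an update that breaks every machine it touches, this sketch stops after the first 1% ring instead of reaching the whole fleet.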
In the wake of Crowdstrike’s global IT wreckage, Microsoft announced its intention "to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability." Microsoft’s guidance involves:
Providing guidance and best practices for updating and rolling out cybersecurity product patches
Reducing the need for Windows kernel space access in order to obtain critical security information
Implementing enhanced isolation and anti-tampering capabilities in Windows, utilizing tech like VBS Enclaves
Implementing Zero Trust approaches such as High Integrity Attestation, providing a method for determining the security status of a computer by monitoring its native security features (e.g., Microsoft Defender for Endpoint)
In spite of the initial backlash against Crowdstrike for its sloppy patching pipeline, most Crowdstrike customers report they plan to remain customers rather than migrate to competing cybersecurity platforms. Whether this speaks to the inherent complexity of switching cybersecurity providers is an open question; in the tech space, vendor lock-in is an ongoing problem that hasn’t eased much in recent decades.
Some tech analysts have even suggested that now is the time to buy into Crowdstrike’s platform, as Crowdstrike is incentivized to beef up its CI/CD practices and focus on delivering a stable product, in light of its very public failure to follow development and testing best practices.
No matter where Crowdstrike or its users land, the incident exposes very real flaws in modern endpoint detection and response platforms. Crowdstrike isn’t the only company whose application relies on kernel-level access to the host operating system. This practice is widespread throughout the industry, with Microsoft bearing its own share of culpability in all but forcing security vendors to play in the dangerous kernel sandbox in order to develop security apps that do what they say on the tin.
Microsoft’s response to the Crowdstrike fracas is a naked attempt to make Microsoft look good while not-so-subtly throwing security vendors under the bus for doing what they had to do to make their products function as intended. In Microsoft Land, throwing corporate partners, lucrative resellers, and end-users under the bus when push comes to shove is nothing new.
In a world where technology is becoming increasingly enshittified, prices increasingly stratospheric, and technical sanity increasingly hard to find, the Crowdstrike fiasco underscores a more central need: to end over-reliance on single vendors, and on single points of failure in general. A company like Crowdstrike or Microsoft will promise you the world (and sell it to you, at prices to match), but when things fall apart they won’t be there to save your business.
This is the importance of local IT: your business is only as resilient as the IT processes you have in place. Microsoft will sell you Exchange Online, but they won’t test your backups for you. Crowdstrike will sell you Falcon, but they aren’t going to perform phishing simulations or work with your employees to understand modern security threats. These companies are selling a solution, and it’s up to you to implement that solution in a way that makes sense for your use case while following best industry practices.
Get in touch with Geeks for Business today to learn more about how managed IT can keep your business running, even when the Crowdstrikes of the world drop the ball.