Breaking Down the CrowdStrike BSOD Global Incident

Original Post (Monday, July 22, 2024):

On Friday, July 19, 2024, Microsoft Windows devices worldwide began crashing and became unusable as they displayed the infamous Blue Screen of Death (BSOD). Early news reports all mentioned Microsoft, but many headlines left out the culprit: CrowdStrike, a leading security software vendor. The incident affected thousands of businesses across industries, disrupting operations and exposing weaknesses in incident response plans. Here’s a detailed look at what happened, the timeline of events, and the lessons learned.

What Happened?

CrowdStrike released a flawed and inadequately tested update to its Falcon endpoint protection software. Almost immediately, reports began to come in of Windows devices crashing and displaying the BSOD. The issue was traced back to the update itself, a faulty configuration file for the Falcon sensor, which triggered critical system errors on Windows and rendered devices inoperable.

Timeline of Events

  • July 19, 2024, 4:09 AM UTC: CrowdStrike releases a faulty configuration update for its Falcon sensor software.
  • July 19, 2024, 5:27 AM UTC: CrowdStrike reverts the faulty update.
  • July 19, 2024, 7:15 AM UTC: Google identifies the CrowdStrike update as the source of the problem.
  • July 19, 2024, 9:45 AM UTC: CrowdStrike CEO George Kurtz confirms the issue and announces that a fix has been deployed, along with a “workaround” for devices already impacted. Unfortunately, the workaround requires deleting the faulty file by hand, which means IT professionals must take manual action on thousands of devices.
  • July 20, 2024: Microsoft acknowledges the issue and starts working on a recovery tool to address the BSOD problem.
  • July 22, 2024, 10:00 AM ET: Microsoft releases a recovery tool aimed at fixing the affected devices.

Impact

The CrowdStrike BSOD issue had a far-reaching impact, affecting sectors including transportation, healthcare, finance, retail, and manufacturing. Microsoft estimates that roughly 8.5 million Windows devices were affected, disrupting business operations and causing significant financial losses.

Industries Affected

  • Airlines: On the first day of the issue, over 5,000 flights were canceled, with thousands more delayed. As the issue continued to plague airlines throughout the busy summer weekend, more flights were canceled or delayed.
  • Healthcare: Patient care systems and medical devices were disrupted, leading to delays and rescheduling of appointments.
  • Finance: Trading platforms and financial systems experienced outages, affecting transactions and customer service.
  • Retail: Point-of-sale systems went offline, causing a loss in sales and customer dissatisfaction.
  • Manufacturing: Production lines were halted, affecting supply chains and delivery schedules.

Lessons Learned

The CrowdStrike BSOD incident is a clear reminder that even trusted security solutions can cause significant disruptions. Here are some key takeaways:

  1. Thorough Testing: Security updates should undergo rigorous testing across different system configurations to identify potential conflicts. Ideally, updates are rolled out in a phased approach to ensure any potential issues only impact a small subset of systems.
  2. Incident Response Plans: Organizations need robust incident response plans that can quickly address and mitigate the impact of such issues. Having a plan is not enough: it should be reviewed and tested at least annually through tabletop exercises.
  3. Communication: During a crisis, transparent and timely communication from vendors is crucial to manage customer expectations and provide solutions. Most of the IT industry communicated well. Most organizations chose not to point fingers at CrowdStrike and instead struck a tone of acknowledging that this type of event, unfortunately, can happen. More can be done to prevent it, but for now, we all need to work together to resolve the current issue.
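
As a rough illustration of the phased approach described in the first takeaway, the sketch below rolls an update out ring by ring and stops as soon as one ring's failure rate crosses a threshold. All names here (phased_rollout, rings, deploy, max_failure_rate) are hypothetical and do not describe CrowdStrike's actual deployment pipeline; deploy stands in for whatever mechanism pushes the update and reports whether the host stayed healthy afterward.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    updated: List[str]             # hosts that took the update and stayed healthy
    halted_at_ring: Optional[int]  # index of the ring that stopped the rollout, if any


def phased_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> RolloutResult:
    """Deploy ring by ring; stop when a ring's failure rate exceeds the threshold."""
    updated: List[str] = []
    for index, ring in enumerate(rings):
        failures = 0
        for host in ring:
            # deploy() is a hypothetical callback: True means the host is healthy.
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        # Later rings are never touched once an earlier ring looks unhealthy.
        if ring and failures / len(ring) > max_failure_rate:
            return RolloutResult(updated, halted_at_ring=index)
    return RolloutResult(updated, halted_at_ring=None)
```

In this sketch the first ring would be a small canary group, so a bad update stops there instead of reaching every device at once. The threshold and ring sizes are illustrative and would need tuning for a real environment.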

What CrowdStrike Should Have Done Differently

While CrowdStrike acted quickly to address the issue, there are several areas where improvements could have been made:

  1. Pre-Deployment Testing: More extensive pre-deployment testing, especially on real Windows systems, could have caught the fault before the update was released.
  2. Faster Rollback: Implementing a faster rollback mechanism to revert to the previous stable version would have minimized downtime for affected users.
  3. Enhanced Communication: Providing real-time updates and clearer communication channels could have helped affected organizations respond more effectively.

This incident isn’t just about CrowdStrike; it underscores a broader issue that many companies could face. Ensuring robust incident response plans and maintaining open communication channels are essential practices for all organizations.

Fixes and Recovery

In response to the CrowdStrike BSOD issue, Microsoft, CrowdStrike, and experts in the community provided several fixes. Here are the key solutions:

  1. Microsoft Recovery Tool: Microsoft released a dedicated recovery tool to address the BSOD issue. The tool is available from the Microsoft Tech Community.
  2. CrowdStrike Workaround: CrowdStrike provided a temporary workaround that involved booting into Safe Mode and deleting the faulty file delivered by the update.
  3. Community Solutions: Various IT professionals shared their own fixes on platforms like Reddit, including the use of USB recovery tools and system restore points.
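
The core of the manual workaround, finding and deleting the faulty file, can be sketched in a few lines. This is an illustration of the logic only, not a recovery tool: Python is not normally available in Safe Mode or the Windows Recovery Environment, and the function name is hypothetical. The directory and file pattern below are assumptions based on CrowdStrike's public remediation guidance and should be checked against the vendor's advisory before anything is deleted.

```python
from pathlib import Path
from typing import List

# Assumed values from CrowdStrike's public guidance; verify before use.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(
    driver_dir: Path = DEFAULT_DRIVER_DIR,
    dry_run: bool = True,
) -> List[Path]:
    """Return the files matching the faulty pattern; delete them unless dry_run."""
    if not driver_dir.is_dir():
        # Nothing to do if the sensor directory is absent on this machine.
        return []
    matches = sorted(driver_dir.glob(FAULTY_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

The dry_run default is deliberate: calling the function with no arguments only lists what would be removed, which makes it safer to check the match before deleting anything.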

Conclusion

The CrowdStrike BSOD disaster serves as a critical learning moment for both vendors and organizations. Ensuring comprehensive testing, robust incident response plans and clear communication can help mitigate the impact of such incidents in the future. As updates continue to emerge, we will keep you informed with the latest information and best practices.

Stay tuned for more updates as we continue to monitor the situation and gather insights from industry experts.
