Falcon Content Update - Preliminary Post-Incident Review

Updated on 07/25/2024 at 7:00 p.m. UTC


This is CrowdStrike’s Preliminary Post-Incident Review (PIR). We will detail our full investigation in the forthcoming Root Cause Analysis, which will be released publicly. Throughout this PIR, we have used generalized terminology to describe the Falcon platform to improve readability. Terminology used in other documents may be more specific and technical.

What happened?

On Friday, July 19, 2024 at 04:09 UTC, as part of regular operations, CrowdStrike released a content configuration update for the Windows sensor to collect telemetry data on possible new threat techniques.

These updates are a regular part of the Falcon platform’s dynamic protection mechanisms. The problematic Rapid Response Content configuration update resulted in a Windows system crash.

Affected systems include Windows hosts running sensor version 7.11 and later that were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC and received the update. Mac and Linux hosts were not affected.
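The exposure criteria above can be expressed as a simple triage check. The following is a minimal sketch, not a CrowdStrike tool: the function names, the version-string format, and the treatment of the window boundaries are assumptions made for illustration. A host that passes this check was only possibly affected, since it must also have actually received the update.

```python
from datetime import datetime, timezone

# Exposure window stated in this report (UTC).
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
MIN_AFFECTED_VERSION = (7, 11)


def parse_version(text):
    """Return (major, minor) from a dotted version string (assumed format)."""
    return tuple(int(part) for part in text.split(".")[:2])


def possibly_affected(os_name, sensor_version, online_intervals):
    """True if the host meets the OS, version, and time-window criteria.

    online_intervals is a list of (start, end) timezone-aware datetimes
    during which the host was online (a hypothetical input format).
    """
    if os_name.lower() != "windows":
        return False  # Mac and Linux hosts were not affected
    if parse_version(sensor_version) < MIN_AFFECTED_VERSION:
        return False
    # Any overlap between an online interval and the exposure window.
    return any(start < WINDOW_END and end >= WINDOW_START
               for start, end in online_intervals)
```

For example, a Windows host on sensor 7.11 that was online from 04:00 to 04:30 UTC meets the criteria, while a host that first came online at 06:00 UTC does not.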

The content update issue was fixed on Friday, July 19, 2024 at 05:27 UTC. Systems that came online after this time, or that did not connect during this window, were not affected.

What went wrong and why?

CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content, which is shipped directly with our sensor, and Rapid Response Content, which is designed to respond to the evolving threat landscape at operational speed.

Friday’s issue involved a Rapid Response Content update with an undetected error.

Sensor Content

Sensor Content provides a broad range of capabilities to aid adversary response. It is always part of a sensor release and is not dynamically updated from the cloud. Sensor Content includes on-sensor AI and machine learning models, as well as code written specifically to provide long-term reusable functionality to CrowdStrike threat detection engineers.

These capabilities include Template Types, which have predefined fields that threat detection engineers can leverage in Rapid Response Content. Template Types are expressed in code. All Sensor Content, including Template Types, goes through a comprehensive quality assurance process, which includes automated testing, manual testing, validation, and deployment steps.

The sensor release process begins with automated testing, both before and after merging into our codebase. This includes unit testing, integration testing, performance testing, and stress testing. It culminates in a staged sensor deployment process that begins with dogfooding internally at CrowdStrike, followed by early adopters. The release is then made generally available to customers. Through sensor update policies, customers can then choose which parts of their fleet install the latest version of the sensor (“N”), one version older (“N-1”), or two versions older (“N-2”).

The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is delivered only with the release of an updated Falcon sensor. Customers have full control over sensor deployment, which includes Sensor Content and Template Types.

Rapid Response Content

Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine. It is a representation of fields and values, with associated filtering, and is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver.

Rapid Response Content is delivered as “Template Instances,” which are instantiations of a given Template Type. Each Template Instance maps to specific behaviors that the sensor should observe, detect, or prevent. Template Instances have a set of fields that can be configured to match the desired behavior.

In other words, Template Types represent a sensor capability that enables new telemetry and detection, and their runtime behavior is configured dynamically by the Template Instance (i.e., Rapid Response Content).
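The relationship between a Template Type (a capability shipped in the sensor) and a Template Instance (configuration delivered as Rapid Response Content) can be sketched as follows. This is an illustrative model only: the real content is stored in a proprietary binary format, and the class names, field names, and matching logic here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TemplateType:
    """A capability defined in sensor code, with a fixed set of fields."""
    name: str
    field_names: tuple


@dataclass
class TemplateInstance:
    """Configuration for one Template Type, delivered from the cloud."""
    template_type: TemplateType
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        # An instance may only configure fields its type declares.
        unknown = set(self.values) - set(self.template_type.field_names)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")

    def matches(self, event):
        """True if every configured field equals the event's value."""
        return all(event.get(k) == v for k, v in self.values.items())


# Hypothetical field names, chosen only for illustration.
ipc = TemplateType("IPC", ("pipe_name", "operation"))
rule = TemplateInstance(ipc, {"pipe_name": "example_pipe", "operation": "create"})
```

The point of the split is that the type changes only with a sensor release, while instances such as `rule` can be added or changed without touching sensor code.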

Rapid Response Content provides on-sensor visibility and detections without requiring any changes to sensor code. This capability is used by threat detection engineers to collect telemetry data, identify adversary behavior indicators, and perform detection and prevention. Rapid Response Content is a behavioral heuristic, separate from CrowdStrike’s on-sensor AI detection and prevention capabilities.

Rapid Response Content Testing and Deployment

Rapid Response Content is delivered as content configuration updates to the Falcon sensor. Three main systems are involved: the Content Configuration System, the Content Interpreter, and the Sensor Detection Engine.

The Content Configuration System is part of the Falcon cloud platform, while the Content Interpreter and Sensor Detection Engine are components of the Falcon sensor. The Content Configuration System is used to create Template Instances, which are validated and deployed to the sensor via a mechanism called Channel Files. The sensor stores and updates its content configuration data via Channel Files, which are written to the host’s disk.

The Content Interpreter on the sensor reads the Channel File and interprets the Rapid Response Content, allowing the Sensor Detection Engine to observe, detect, or prevent malicious activity, depending on the customer’s policy configuration. The Content Interpreter is designed to gracefully handle exceptions from potentially problematic content.

Newly released Template Types are stress tested across many aspects, such as resource usage, impact on system performance, and event volume. For each Template Type, a specific Template Instance is used to stress test the Template Type by matching against every possible value of the associated data fields to identify adverse system interactions.
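The idea of exercising a type against every possible value of its fields can be sketched as an exhaustive sweep over field combinations. This is a simplified illustration, not CrowdStrike’s test harness: it assumes small, finite value domains (real field domains are typically far too large to enumerate directly), and `stress_test`, `field_domains`, and `handler` are hypothetical names.

```python
from itertools import product


def stress_test(field_domains, handler):
    """Run handler on every combination of field values.

    field_domains maps a field name to an iterable of candidate values.
    Returns a list of (config, error) pairs for combinations that raised.
    """
    names = list(field_domains)
    failures = []
    for combo in product(*(field_domains[name] for name in names)):
        config = dict(zip(names, combo))
        try:
            handler(config)
        except Exception as exc:  # record the failure and keep sweeping
            failures.append((config, repr(exc)))
    return failures
```

A sweep like this only covers the combinations it is given, so a value that falls outside the enumerated domains would not be exercised.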

Template Instances are created and configured using the Content Configuration System, which includes the Content Validator, a component that performs validation checks on content before it is published.

Timeline of Events: Testing and Deployment of the InterProcessCommunication (IPC) Template Type

Sensor Content Release: On February 28, 2024, sensor 7.11 was released to customers, introducing a new IPC Template Type to detect novel attack techniques that abuse named pipes. This release followed all of the Sensor Content testing procedures described above in the Sensor Content section.

Template Type Stress Test: On March 5, 2024, a stress test of the IPC Template Type was run in our test environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use.

Template Instance Release via Channel File 291: On March 5, 2024, following the successful stress test, an IPC Template Instance was released to production as part of a content configuration update. Subsequently, three additional IPC Template Instances were deployed between April 8, 2024 and April 24, 2024. These Template Instances performed as expected in production.

What happened on July 19, 2024?

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 5, 2024), confidence in the checks performed in the Content Validator, and previous successful deployments of IPC Template Instances, these instances were deployed to production.

When received by the sensor and loaded into the Content Interpreter, the problematic content in Channel File 291 resulted in an out-of-bounds memory read that triggered an exception. This unexpected exception could not be handled gracefully, resulting in a Windows operating system crash (BSOD).
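To illustrate the class of failure described above, the sketch below shows an interpreter-style reader that checks bounds before every read and rejects bad content instead of reading past the end of a buffer. The binary layout (a 32-bit count followed by that many 32-bit values) is invented for this example, since the actual Channel File format is proprietary, and all names are hypothetical. In a memory-safe language such as Python an unchecked read raises an error rather than touching invalid memory, so the explicit check here stands in for what native code would need to do.

```python
import struct


class ContentError(Exception):
    """Raised when content fails a bounds check."""


def read_u32(buffer, offset):
    """Read a little-endian u32, refusing any read outside the buffer."""
    if offset < 0 or offset + 4 > len(buffer):
        raise ContentError(
            f"offset {offset} outside buffer of {len(buffer)} bytes")
    return struct.unpack_from("<I", buffer, offset)[0]


def load_entries(buffer):
    """Parse the hypothetical layout; return None for malformed content."""
    try:
        count = read_u32(buffer, 0)
        return [read_u32(buffer, 4 + 4 * i) for i in range(count)]
    except ContentError:
        return None  # reject the content rather than crash


good = struct.pack("<III", 2, 10, 20)  # declares 2 entries, contains 2
bad = struct.pack("<II", 3, 10)        # declares 3 entries, contains 1
```

Here `load_entries(good)` yields the two values, while `load_entries(bad)` returns `None` because the declared count would require reading beyond the data that is present.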

How can we prevent this from happening again?

Software Resilience and Testing

Improve Rapid Response Content testing by using test types such as:

Local developer testing
Content update and rollback testing
Stress testing, fuzzing, and fault injection
Stability testing
Content interface testing

Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in progress to prevent this type of problematic content from being deployed in the future.

Improve existing error handling in the Content Interpreter.

Rapid Response Content Deployment

Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
Improve monitoring of both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
Give customers greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
Provide content update details via release notes, which customers can subscribe to.
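A staggered deployment that starts with a canary and widens only while monitored hosts stay healthy could look roughly like the following. This is a minimal sketch under stated assumptions, not a description of CrowdStrike’s deployment system: the stage fractions, the `deploy` and `healthy` callbacks, and the halt-on-first-failure policy are all hypothetical.

```python
def staged_rollout(hosts, stage_fractions, deploy, healthy):
    """Deploy to growing portions of hosts, halting on any unhealthy host.

    stage_fractions is an increasing sequence such as (0.01, 0.1, 1.0);
    the first stage acts as the canary. Returns (deployed_hosts, completed).
    """
    deployed = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should have the update by now.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(deployed):target]
        for host in batch:
            deploy(host)
        deployed.extend(batch)
        # Feedback gate: check every host updated so far before widening.
        if not all(healthy(host) for host in deployed):
            return deployed, False
    return deployed, True
```

With ten hosts and stages of 10%, 50%, and 100%, a problem that surfaces in the second stage stops the rollout at five hosts instead of ten.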

Third-Party Validation

Conduct multiple independent third-party security code reviews.
Perform independent assessments of end-to-end quality processes from development to deployment.

In addition to this Preliminary Post-Incident Review, CrowdStrike is committed to publicly releasing the full Root Cause Analysis once the investigation is complete.