April 29, 2025
Pressing the emergency shutdown button when an incident gets out of hand. We should do all we can to make sure that last option is not needed.
Since leaving full-time employment in 2022 I am aware that I am fading from the ICS security scene. As a way to remain useful for a while longer I have tried to fill the gap by collaborating with the ISA 99 Committee. Most recently I co-chaired one of the newly created workgroups, dedicated to incident management. We began by drafting a White Paper on Incident Management, which was approved in February 2025 for internal use as an “ISA 99 Committee Work Product”. What follows is something like a review written by one of the document's authors. In writing it I am aware of the challenge of staying objective and avoiding bias about a work I am associated with.
The main purpose behind drafting this White Paper was to point to a new way forward in establishing a capability for managing incidents in industrial control system (ICS) or industrial automation and control system (IACS) environments. This came from a recognition by the group that something was missing in current approaches to incident management. We noted that despite the many solutions that have been proposed and implemented, incident management procedures continue to fail to detect, report, respond and recover before significant damage or disruption to operations, lives, property and the environment occurs. This is further complicated by the difficulty of protecting increasingly vulnerable and dynamic attack surfaces consisting of multiple systems and components.
The focus of our attention was on the protection of process control environments, as it was felt that too much attention was being devoted to the protection of data- and information-centric IT assets. The White Paper (WP) points out several examples of policy documents from organisations and governments on both sides of the Atlantic that, while attempting to address the cybersecurity concerns of process control environments, still fail to address the technologies used to monitor and control processes governed by the laws of physics and chemistry. It was also recognised that our scope would cover incidents generally. While cyber incidents get a lot of attention, it must be remembered that incidents also have non-malicious causes such as human error, failure to follow procedure or equipment failure. In other words, an incident must be managed whatever its cause, for the consequences are the same.
The group identified five frameworks of incident management which it believes are the key ingredients for establishing an incident management capability: Detection, Reporting, Response, Recovery and After-action review. A brief discussion of each framework follows.
Detection:
One of the most difficult tasks in incident management is the prompt detection of an incident, especially a malicious one that strives to avoid detection. In the reported IACS/OT incidents that we know about, a common feature is the length of time that elapses between the initial compromise and the actual incident. As cyber incidents in IACS/OT tend to manifest themselves as physical problems, operators and maintenance staff may attempt to address them many times before considering something more malicious. A good example is the Triton/Trisis attack on a safety instrumented system (SIS), which caused two unplanned shutdowns of a large petrochemical facility in the Middle East. Asset owners should determine what an indicator of compromise (IoC) would look like according to their risk situation.
Reporting:
Once an anomalous event in the system under consideration has been detected, reporting becomes possible. Asset owners/operators should have policies and procedures for reporting IACS security-related events in a timely manner. What exactly to report, and to whom, is a problem identified in the WP. It was also noted that a poor understanding of what constitutes an incident will make risk analysis and management difficult. For a comprehensive and timely response to take place, both the IT and IACS/OT sides of the house need established procedures for sharing information and reporting incidents. At a minimum, three roles or functions were identified as key to addressing the issue of reporting: a watch officer who detects the anomaly, an analyst who determines the character and extent of the compromise, and a decision maker who takes the next steps that will initiate the response.
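The hand-off between the three roles can be pictured as a small sketch. This is only an illustration of the idea, not something taken from the WP: the class, field and function names, the two severity levels and the decision rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    # Hypothetical two-level scale; a real scheme would be site-specific.
    LOW = 1
    HIGH = 2


@dataclass
class EventReport:
    """One anomalous event as it passes through the three roles."""
    description: str
    detected_by: str                          # watch officer
    assessment: Optional[str] = None          # filled in by the analyst
    severity: Optional[Severity] = None       # filled in by the analyst
    decision: Optional[str] = None            # filled in by the decision maker
    history: List[str] = field(default_factory=list)


def watch_officer_detect(description: str, officer: str) -> EventReport:
    """The watch officer notices the anomaly and opens a report."""
    report = EventReport(description=description, detected_by=officer)
    report.history.append("detected")
    return report


def analyst_assess(report: EventReport, assessment: str,
                   severity: Severity) -> EventReport:
    """The analyst records the character and extent of the compromise."""
    report.assessment = assessment
    report.severity = severity
    report.history.append("assessed")
    return report


def decision_maker_decide(report: EventReport) -> EventReport:
    """The decision maker chooses the next step (assumed rule: HIGH triggers response)."""
    if report.severity is None:
        raise ValueError("report must be assessed before a decision is made")
    if report.severity is Severity.HIGH:
        report.decision = "initiate response"
    else:
        report.decision = "continue monitoring"
    report.history.append("decided")
    return report
```

The point of the sketch is only that each role adds something specific to the same report, and that a decision cannot be taken on an event nobody has assessed.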
Response:
Responding to an identified incident can take many forms. At one end is the automated response of a safety system, programmed in advance to act when a monitored process condition exceeds an established set point. At the other is the implementation of a pre-arranged and tested incident response plan designed to deal with a known incident scenario, such as an explosion at a fuel depot connected to a pipeline.
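The simplest of these forms, an automated action when a monitored condition exceeds its set point, can be sketched in a few lines. This is a toy illustration only: real safety logic runs in dedicated safety controllers, and the instrument tag, limit and units below are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TripSetting:
    tag: str           # hypothetical instrument tag
    high_limit: float  # trip when a reading exceeds this value
    units: str


def evaluate_trip(setting: TripSetting, reading: float) -> str:
    """Return the pre-programmed action for a single reading."""
    # "Exceeds" is taken literally: a reading equal to the limit does not trip.
    if reading > setting.high_limit:
        return "TRIP"
    return "NORMAL"


def first_trip_index(setting: TripSetting,
                     readings: Iterable[float]) -> Optional[int]:
    """Return the position of the first reading that would trip, or None."""
    for index, reading in enumerate(readings):
        if evaluate_trip(setting, reading) == "TRIP":
            return index
    return None


if __name__ == "__main__":
    pressure = TripSetting(tag="PT-101", high_limit=12.0, units="bar")
    for value in [10.0, 11.5, 12.3]:
        print(pressure.tag, value, pressure.units, evaluate_trip(pressure, value))
```

Nothing in such logic knows or cares why the limit was exceeded, which is why the cause still has to be established by people during the response.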
Identifying the kind of threat is important in determining the response. Today there still seems to be a tendency to think of cybersecurity threats in the context of office IT environments, which misses the special concerns of IACS/OT environments. NIST, for example, has issued draft guidance for comment, Incident Response Recommendations and Considerations for Cybersecurity Risk Management (NIST SP 800-61r3 ipd), in which none of the examples of cyber incidents provided relates to IACS/OT.[1]
It was recognised that it is important to identify the true target of the attack. The response will likely be more successful if it is known whether the attack is limited to the IT side of the operation or whether an attempt is being made to target the actual operation itself, the OT/IACS side.
One other important consideration in incident response planning is the pressure to continue running operations before the actions of a threat actor have been contained. Asset owners need to be part of the decision-making process in deciding whether, and under what conditions, an operation should be suddenly shut down due to an incident that has not immediately resulted in a degraded operation. Operational staff together with asset owners need to establish the criteria and the means for operating safely under the conditions of an ongoing incident.
Recovery:
Recovery efforts aim, as soon as possible, to bring affected assets back to a ready state, to return to normal operations/service and to return to safety.
An identified issue for the asset owner in the heat of a live incident is whether to follow the advice of a vendor who may no longer be fully knowledgeable about the asset owner's operations as they have changed since the system was installed. A call for help to the system integrator or vendor may not result in helpful advice, since the system may have gained devices and functions that were not present when the initial factory acceptance test (FAT) and site acceptance test (SAT) were performed. This was noted in the recovery efforts after the Triton/Trisis cyber attack that caused plant shutdowns in 2017, where the vendor advised the owner to return the compromised plant systems to a secure reference architecture state, a state which by the time of the incident may no longer have fully reflected the actual state of the system.[2]
After-action review:
When incident recovery actions have been completed and normal operations have been restored, it is time to calmly review what happened and develop lessons learned that will inform a corrective action implementation plan. Reviewing what happened, and why an incident occurred despite the protective measures already present, can be tremendously valuable for exposing flaws in operating assumptions and for evaluating the effectiveness of security measures, safety systems and procedures and the level of system resilience to disruption of the process. Weaknesses or missing parts can be identified, and corrective actions taken that would reduce the likelihood of the incident occurring again or, if it does, reduce the consequences and duration of the event. The after-action review and its lessons learned should also be used to update any existing applicable incident response plan, testing program, future exercise scenarios or other documentation.
It is the group's recommendation that the next work after publishing this White Paper be the preparation of an ISA Technical Report (TR) presenting an in-depth discussion of each of the five phases of Incident Management (IM): Detection, Reporting, Response, Recovery and After-action review. The TR will provide actionable guidance for owners/operators of critical infrastructure on establishing and operating an improved and effective incident management capability.
One caveat that will be kept in mind if the work proceeds: the successful drafting of a TR on Incident Management will largely depend upon the active input of IACS/OT professionals with firsthand operational experience. This is key to coming up with an evidence-based document that is of practical use to the asset owner. Otherwise, it will be just another set of recommendations placed alongside many other similar IT-biased guides which fail to adequately address what is unique to IACS/OT. If anyone has these qualifications and is interested in contributing, please let me know (vyto2b@gmail.com) and I will put you in touch with the appropriate ISA 99 Incident Management Workgroup contacts.
[1] Alex Nelson et al., NIST SP 800-61r3 ipd, NIST, April 2024, p. 2. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r3.ipd.pdf
[2] Julius Gutmanis, Triton: A report from the trenches, S4 Events video, 18:52, 24:37. https://www.youtube.com/watch?v=XwSJ8hloGvY (Accessed 9 December 2024)