Risk!


Risk (https://en.wikipedia.org/wiki/Risk_%28game%29) is one of those games that seems to endure. It’s a simple game, which is probably part of its charm, and has changed and adapted over the years. There are even online versions – I checked.


The “classic” game was released in 1959, and different versions have been made over the years, including the 1975 edition I played. There are also endless variants, often (but not always) based on popular franchises. These include several Star Wars variants, a Doctor Who version, Plants vs Zombies, and Rick and Morty, but they are all the same at their core.

Game play is very simple, based on ordinary six-sided dice (up to three per side) and “armies” that are all exactly equivalent. This can lead to very interesting situations where a single defender stands against near-unimaginable odds, so long as that defender can keep matching or beating the attacker’s best die with a single die of their own...
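
For the curious, here’s a quick simulation of exactly that scenario. It’s a minimal sketch in Python, assuming the classic battle rules: the attacker rolls up to three dice, the lone defender rolls one, only the attacker’s highest die is compared, and ties go to the defender.

import random

def defender_survives_round(attacker_dice=3, trials=100_000):
    # Classic Risk battle round: compare the attacker's best die against
    # the lone defender's single die; ties favor the defender.
    wins = 0
    for _ in range(trials):
        attack = max(random.randint(1, 6) for _ in range(attacker_dice))
        defend = random.randint(1, 6)
        if defend >= attack:
            wins += 1
    return wins / trials

print(f"Lone defender survives a round about {defender_survives_round():.0%} of the time")

Run it, and the lone defender wins any single round roughly a third of the time (441/1296, to be exact), so a long holdout really is just a string of lucky rolls.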

Not really a war game, Risk is described as a “strategy board game of diplomacy, conflict and conquest”. From the name, it’s also easy to think of it as an introduction to certain elements of risk management.

Unfortunately, “risk” (https://en.wikipedia.org/wiki/Risk) is yet another one of those words that “everyone knows”, but whose meaning can be very hard to nail down.

Colloquially, risk is generally associated with the chance of something bad happening, but that’s really too vague to be useful. When you dig deeper, you immediately enter a maze of formal definitions, covering everything from business and financial risk to environmental and health risk.

I found it interesting to look at the US NIST (National Institute of Standards and Technology) glossary page for “risk” (https://csrc.nist.gov/glossary/term/risk), which includes over two dozen references and definitions of the term. To be fair, however, many of the definitions are similar to the first one listed:

“The level of impact on organizational operations (including mission, functions, image, or reputation), organizational assets, or individuals resulting from the operation of an information system given the potential impact of a threat and the likelihood of that threat occurring.”
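
Nearly all of those definitions boil down to some combination of likelihood and impact. As a purely illustrative sketch (the scales and labels below are invented for this post, not taken from NIST):

# Illustrative risk scoring: likelihood x impact on invented 1-5 scales.
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost_certain": 5}
IMPACT = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "severe": 5}

def risk_score(likelihood: str, impact: str) -> int:
    return LIKELIHOOD[likelihood] * IMPACT[impact]

print(risk_score("unlikely", "major"))        # 8  -- maybe patch on schedule
print(risk_score("almost_certain", "major"))  # 20 -- time to re-evaluate the plan

The specific numbers don’t matter; the point is that a change in either factor changes the score, which is exactly what will matter later in this story.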

Risk is an integral part of almost everything we do, and managing risk is at the core of many disciplines, yet people routinely take on risks without even thinking about it. Partly, this is because many people don’t understand the distinction between the colloquial and formal definitions of risk, and many technology people simply don’t think in terms of risk at all, particularly when the risk is associated with NOT doing something.


Let’s invent an example, using our friend Alice’s bookstore. Now, everyone “knows” that you should keep all software patched, but Alice is trying to run a business, so her IT team has to handle not just patching, but also support and maintenance for the website, the office, the stores, the point-of-sale machines, and so on.

Many organizations, including Alice’s, define patching schedules for software maintenance. In this case, the database software might have a monthly patch schedule, to balance the time/effort/cost of the work required to update (along with the risk of something going wrong, and the need to test), against the risk of having out-of-date software.

Bob is Alice’s IT director, and is responsible for (among other things) the patching schedule and the people who do the actual work. Bob is the one who explained the cost associated with patching and the risk of not being up to date, proposed the monthly patch schedule, and got Alice’s approval to implement it. In terms of risk management, Alice and Bob evaluated the cost and risk of patching and the cost and risk of NOT patching, and determined the best way to proceed. In semi-technical terms, Alice, as the responsible authority, evaluated the risks, reviewed and approved the risk management plan, and accepted the residual risk.


Everything has been working fine for quite a while, and the team has been selling a LOT of books about creating cat videos, but then Bob heard a story on the Cyberwire (https://www.til-technology.com/my-playlist) talking about a critical vulnerability in the software that they use to support the website.

Now, Bob’s team had just finished patching, and the next patch was scheduled for next month, so Bob just double-checked that the patch was available and ensured that it was included in next month’s patch list.

Problem solved, right?

<<record scratch>> Wrong!!

Wait, what?

This new vulnerability was a bad one, comparable to the “Drupalgeddon” vulnerabilities I have commented on previously (https://www.til-technology.com/post/infosec-bullshido-zero-day), and failure to patch promptly resulted in, well, bad things happening.

But what did Bob do “wrong”? He had an approved patching plan, and they did all the risk management, right?

Well, no. One of the key elements of a risk plan is that it is fluid, and changes as circumstances change. In this case, the risk plan was based on the estimated risks of patching vs not patching, but this new vulnerability was a very different situation that completely changed the risk calculation. In the new case, the cost and risk associated with patching had to be balanced against the KNOWN URGENT ISSUE, and the updated risk management plan should have been reviewed and approved by Alice. Some discussion might have been needed, to determine how to apply an emergency patch or other mitigation, or to accept the additional risk (with the understanding that bad things could happen).

I have seen several variations on this theme, usually in the context of prioritizing “business needs” against maintenance and such. The issue is not (except in rare cases) due to negligence or incompetence, or anything like that, but rather because most technology people don’t view everything in the context of risk management.

Technology people generally do the best they can to prioritize work, and this is inevitably influenced by the priorities of “the client” – whether that is an internal business contact, a project/product manager, or an actual customer. It is necessary, however, to ensure that risk management is embedded in management processes, and that people don’t view vulnerability remediation as a “maintenance” or “scheduling” issue, but rather as a risk management issue.

In our example, Bob was apparently thinking of this as a scheduling issue and tried to manage everything as efficiently as possible. Instead, Bob should have been evaluating the degree to which the new vulnerabilities affected the existing risk profile and whether there was a material change in the level of risk.

But how do we do this? There are a LOT of patches out there, a LOT of known security vulnerabilities, and an uncountable number of unknown or not-yet-known vulnerabilities.

Every organization struggles with asset management, software inventory, patching, and the various other components of this question, so any answer will depend on the organization. However, there are some relatively simple processes which can help manage this situation, and it should be noted that most organizations already have much of this in place, to varying degrees of maturity.

First, define how to tie vulnerabilities to systems and confirm who is responsible for the system in question. Depending on the organization and vulnerability, this could be a vendor, a support team, or a product owner. Whoever it is, this person or group must be held accountable for remediation of vulnerabilities.

Second, define a process for evaluating identified potential vulnerabilities, with standard Service Level Agreements (SLAs) for each severity level. This will help to ensure that issues can be triaged and prioritized appropriately.

Finally, define processes for unusual cases. How do you manage false-positive results? What about cases where a vulnerability exists, but mitigations are in place to render it difficult or impossible to exploit? What if a vendor patch is not yet available? How about cases where the responsible team is working on an even higher-priority issue than the vulnerability in question?

This last process can (and should) include an approval process. Technology people need to understand two things: that not remediating an issue within the standard SLA means accepting risk on behalf of the organization, and that (in most cases) they are NOT authorized to accept risk of that type, so any other plan requires approval.
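
To make that concrete, here’s a minimal sketch of what such a triage rule might look like. Everything here (the statuses, the names, the routing) is invented for illustration.

from dataclasses import dataclass

@dataclass
class Finding:
    vuln_id: str    # e.g. a CVE identifier
    severity: str   # "critical", "high", "low", ...
    status: str     # "open", "false_positive", "mitigated", "no_vendor_patch"

def triage(finding: Finding) -> str:
    # Route a finding: normal remediation, closure, or formal risk acceptance.
    if finding.status == "false_positive":
        return "close, with documented evidence"
    if finding.status in {"mitigated", "no_vendor_patch"}:
        # Living with the vulnerability means accepting risk on behalf of
        # the organization, so this path needs sign-off from someone
        # actually authorized to accept it.
        return "escalate for risk acceptance approval"
    return "remediate within the standard SLA"

print(triage(Finding("CVE-XXXX-YYYY", "high", "no_vendor_patch")))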


Getting back to Bob, let’s look at three scenarios through the lens of the process just defined. (All vulnerability levels and SLAs are invented.)

Critical Vulnerability – SLA = 72 hours

In this case, Bob would either have to schedule an emergency patch, or put together a plan for remediation and ask Alice to approve it, with a clear understanding of the risks involved.

High Vulnerability – SLA = 14 days

Depending on the organization, Bob may be authorized to defer the patch until the next scheduled release, so long as the risk and approval are documented. The important difference is that it is clear that this decision involves accepting risk (and accountability) for the organization, rather than being “just” a support or scheduling issue.

Low Vulnerability – SLA = 45 days

Adding the patch to the next scheduled release falls within the standard SLA, and would not trigger any additional process.
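
Pulling those three scenarios together, here’s a minimal sketch of the scheduling decision itself. The SLAs match the invented values above, and next_patch_window is a made-up stand-in for the next scheduled release date.

from datetime import date, timedelta

# Invented SLAs, matching the scenarios above.
SLA = {
    "critical": timedelta(hours=72),
    "high": timedelta(days=14),
    "low": timedelta(days=45),
}

def patch_plan(severity: str, disclosed: date, next_patch_window: date) -> str:
    # Does the next scheduled release land inside the SLA window?
    deadline = disclosed + SLA[severity]
    if next_patch_window <= deadline:
        return "patch in the next scheduled release (within SLA)"
    if severity == "critical":
        return "emergency patch, or a remediation plan approved by Alice"
    return "defer to the next release only with documented risk acceptance"

# Bob's situation (dates invented): disclosed today, next window in ~30 days.
today = date(2024, 6, 1)
window = today + timedelta(days=30)
for severity in ("critical", "high", "low"):
    print(severity, "->", patch_plan(severity, today, window))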

There’s still work involved, of course, but there is hope. As noted, most technology people do not think in terms of risk, but senior leaders in most organizations DO look at many things in terms of risk (or are at least familiar with the idea). By describing things in terms of risk, you will generally have a much easier time explaining, understanding, and supporting organizational priorities.


Cheers!
