Prevention, response, and investigation are three key approaches that, when used together, can assist IT in deciding whether to accept or mitigate the level of risk.
This same risk management principle relates to engineering economics, in assessing whether the cost of a preventative solution outweighs its benefits. Risk management seems black and white, but in practice cost becomes a factor.
Oftentimes, senior management does not fully share the cost of the risk -- as related to disruption or downtime -- with IT. This opens the potential for IT and other operational areas to over-engineer solutions, which in turn leads either to a high total cost of ownership (TCO) that outweighs the benefit or to a solution that fails to work as anticipated due to complexity or change.
Another key metric lies within carrier and provider SLAs. Do these levels of commitment complement or oppose each other? By that I mean, if the cloud provider states availability at 99% and the carrier is at 99.5%, then you have an imbalance -- and the potential to over- or under-engineer your solution. The point is, you need to know your exact SLAs with both carrier and cloud providers, at least when it comes to access/availability. In a 30-day month (720 hours), for example, the cloud provider with 99% availability can be down for 7.2 hours, while your carrier committing to 99.5% availability can be down for only 3.6 hours.
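To make the arithmetic concrete, here is a minimal Python sketch of the SLA comparison, reading the two commitments as 99% and 99.5% availability over a 30-day month. The function names are mine, for illustration only. The combined figure is an added assumption: it treats the carrier and the cloud provider as a chain in which both must be up, with outages that are independent of each other, which real-world outages may not be.

```python
HOURS_PER_30_DAY_MONTH = 30 * 24  # 720 hours


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_30_DAY_MONTH):
    """Hours of downtime an availability commitment permits over the period."""
    return (1 - availability_pct / 100) * period_hours


def combined_availability_pct(*availability_pcts):
    """Availability of services in series (all must be up at once).

    Assumes independent failures, so the availabilities simply multiply.
    """
    combined = 1.0
    for pct in availability_pcts:
        combined *= pct / 100
    return combined * 100


cloud_pct = 99.0    # cloud provider SLA from the example
carrier_pct = 99.5  # carrier SLA from the example

cloud_hours = allowed_downtime_hours(cloud_pct)
carrier_hours = allowed_downtime_hours(carrier_pct)
end_to_end_pct = combined_availability_pct(cloud_pct, carrier_pct)
end_to_end_hours = allowed_downtime_hours(end_to_end_pct)

print(f"Cloud provider: {cloud_hours:.2f} hours of allowed downtime")
print(f"Carrier:        {carrier_hours:.2f} hours of allowed downtime")
print(f"End to end:     {end_to_end_pct:.3f}% available, "
      f"{end_to_end_hours:.2f} hours of allowed downtime")
```

Under those assumptions, the end-to-end commitment is weaker than either SLA on its own, which is one reason to compare the two side by side before deciding how much redundancy to engineer in.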
In practice, we see over- and under-engineering daily, across disciplines and in all industries. Workstations have more than enough hard-drive capacity but lack RAM, or firewalls don't account for simultaneous connections, BYOD, or the Internet of Things. These types of risks are preventable and, when overlooked, can be effectively remediated once they are discovered during investigation into the root cause of a failure or disruption.
Case in Point
The response to an IT failure is either to accept the risk or to mitigate it, eliminating or minimizing the impact. I'll use power disruption as an example, since IT needs continuous and undisturbed power to keep services available.
While investigating why IP phones were failing intermittently at a county agency, I found that all IDF and MDF LAN switches and routers had redundant power supplies. Power supply #1 connected to the UPS source, and the UPS connected to an orange receptacle served by a standby generator. Power supply #2 connected directly to the house receptacle, bypassing the UPS.
The thinking was that if the UPS failed, then the secondary power supply would already be active and so there'd be no downtime during the repair or while awaiting replacement. However, I found that power supply #2 had no surge protection. Power disruptions and disturbances on the unprotected house power impacted the LAN switches. The agency implemented its power supply configuration for the right reason, but it did not account for the power disturbances and transients that occur daily in every installation on unprotected power.
Hopefully the IT response to this situation isn't to add a secondary UPS to every IDF and MDF. Instead, adding whole-panel surge protection devices (SPDs) is a more cost-effective solution with a lower TCO.
While IT staff often take for granted that they are doing the right things to prevent downtime, not every situation is the same, and thinking through the solution is important. In the example I used, the solution added another layer of risk -- a risk that came to fruition, causing disruption to the LAN switches. Instead of reducing risk, it increased vulnerability.
Thinking Big Picture
As a sidebar, consider the potential impacts -- good and bad -- of using a carrier's data centers for both transport and cloud services. What does the SLA look like for enterprises using one source, what are the risks, and what are the planned responses during an outage? What kind of preventive measures have you employed, and do they work? How do you know?
Follow Matt Brunk on Twitter!
@telecomworx