We've all heard about the mythical five nines of availability, the Holy Grail set decades ago for residential telephony.
Five nines availability (99.999%) translates to an average of about five minutes of unavailability (0.001%) per year. In other words, while some users may experience longer outages and others none at all, average service unavailability doesn't exceed five minutes per user per year.
While premises-based TDM PBX solutions didn't always achieve five nines, they generally came very close -- commonly achieving at least four nines through redundancy and reliability-focused manufacturing of line cards and phones for high mean time between failures (MTBF). Achieving that level of availability has proven a challenge with VoIP, given the underlying IP data network's complexity and lack of edge redundancy. Core routers, the open Internet, power, SIP trunking, and other factors can reduce VoIP availability to between three and four nines. For example, a non-redundant edge/desktop Ethernet switch with an MTBF of 10 years and a mean time to repair (MTTR) of four hours is unavailable an average of 24 minutes per year (240 minutes in 10 years). This means that one device in the entire network has, on its own, about five times the unavailability allowed by the target for that mythical five nines of availability.
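The arithmetic behind these figures is easy to check. The following is a minimal sketch, assuming a 365-day year and the common steady-state approximation of availability as MTBF / (MTBF + MTTR); the function names are illustrative, not taken from any vendor tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Average minutes of unavailability per year at a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, as a percentage, from MTBF and MTTR (assumed model)."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


# Downtime budget implied by five nines
five_nines_budget = downtime_minutes_per_year(99.999)

# The non-redundant edge switch from the example: 10-year MTBF, 4-hour MTTR
switch_availability = availability_from_mtbf(10 * 365 * 24, 4)
switch_downtime = downtime_minutes_per_year(switch_availability)

print(f"Five nines budget: {five_nines_budget:.2f} min/year")
print(f"Edge switch: {switch_availability:.4f}% available, "
      f"{switch_downtime:.1f} min/year down")
print(f"Switch downtime vs. budget: {switch_downtime / five_nines_budget:.1f}x")
```

The same helper converts any "count of nines" into a yearly downtime figure, which makes it simple to compare a provider's stated availability against observed outages.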
Redundant servers in distributed data centers can increase availability in the core (even to six nines or beyond). One of the key benefits of VoIP and centralized architectures is the ability to have multiple devices, with the use of redundant devices increasing overall availability. If one device fails and the core is operating, the user can use an alternative device (phone, PC, mobile, etc.). However, should the core fail, so too would all devices down the chain.
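To illustrate why redundancy helps in the core while a single non-redundant element still caps the end-to-end result, here is a rough sketch. It assumes component failures are independent, which real deployments only approximate, and the availability figures (two three-nines servers, one edge switch at 99.995%) are hypothetical values chosen for illustration.

```python
from math import prod


def parallel(*availabilities: float) -> float:
    """Redundant components: the group is up if at least one member is up."""
    return 1 - prod(1 - a for a in availabilities)


def series(*availabilities: float) -> float:
    """Chained components: the path is up only if every member is up."""
    return prod(availabilities)


# Hypothetical: two redundant core servers, each at three nines (99.9%)
core = parallel(0.999, 0.999)

# Hypothetical: that core reached through one non-redundant edge switch
path = series(core, 0.99995)

print(f"Redundant core: {core:.6%}")
print(f"Core plus edge switch: {path:.6%}")
```

Under these assumptions the redundant pair reaches roughly six nines, yet the full path is no better than its weakest non-redundant link.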
For cloud-delivered communications services, a range of events can affect service availability. These can occur in the data center or network; on servers; in virtualization or UCaaS software; or because of administrative issues. What happens when the core cloud service itself goes off-line? A core cloud service outage can impact many end users simultaneously. Without an operating core platform, alternatives to the IP endpoint like mobile integration also fail. Because most UCaaS and CCaaS solutions run over the open Internet as over-the-top (OTT) deployments, Internet issues can also impact availability from the end-user perspective.
If you're using or thinking about using cloud communications services, you need to know how often providers and end users experience outages and whether those events are limited or universal. While enterprises weigh factors such as cost, flexibility, continual upgrade, and self-service capabilities in considering cloud migrations, they tend to overlook availability. In the end, you need a clear understanding of cloud availability and the impact on your organization before deciding to make the move. It turns out there's an app for that.
Downdetector.com, a crowd-sourced site, tracks outages/availability of a wide range of Internet/cloud services. While Downdetector covers most of the usual cloud suspects, it also tracks UCaaS providers such as 8x8, Microsoft, Mitel, RingCentral, and others.
For example, I received an alert regarding a RingCentral outage on April 3. As can be seen in the screen capture below (ads redacted), users began reporting problems with the RingCentral service at 10:14 a.m. ET and continued reporting issues for a couple of hours. Based on comments on Downdetector, the outage appears to have impacted some users for up to five hours. However, based on the length of consistent problem reports, the outage appears to have had a major impact for just about one hour.
The online Downdetector graph shows the number of problem/outage reports received per 15-minute period. In looking at the chart and estimating the counts by 15-minute period, it seems that Downdetector received almost 1,500 separate reports of this outage.
I decided to dig a little deeper, so I looked in the Downdetector archive for additional RingCentral outages. I found a total of four outage events documented in 2018 -- on Jan. 22, March 2, March 15, and, as already mentioned, April 3. The graphic below shows the timing and duration of the reporting for each of those outage events (note that the vertical scale and total count of problem reports differ for each event). While Downdetector has no record of how many users/seats these outage events impacted, the number of reports and the duration indicate RingCentral customers would have found these to be significant interruptions.
After gathering the data, I talked with Curtis Peterson, RingCentral's SVP of Cloud Operations, for a detailed explanation of the four 2018 outage events reported on Downdetector. Here's a summary of his account of those outages:
- Jan. 22 -- Outage caused by a peering issue outside of RingCentral control. This issue led to a relatively low number of reports, and those may have been tied to a specific Internet service provider. Clearly the choice of access provider can be critical to cloud availability.
- March 2 -- Result of an East Coast storm that impacted several data centers in the Washington D.C. area, affecting a variety of Internet applications and sites. RingCentral appears to have been affected by Amazon/Equinix issues on that day. The choice of data center vendor can have a clear impact on services.
- March 15 -- Outage related to Internet peering issues that impacted not only RingCentral but many other cloud providers and users. In checking, I didn't find any other major cloud provider outages at the time, so this event may have been primarily with the RingCentral IP peers versus the open Internet. As with the choice of data center locations and providers, peering is another critical issue for cloud providers. Even though this event may not be attributable to RingCentral directly, it still impacted some RingCentral users over an eight-hour period. In total, there were fewer than 200 reports of that outage -- and, as Curtis pointed out, new technologies like software-defined WAN have the potential to mitigate this type of outage.
- April 3 -- Event due to a data center issue with West Coast servers as morning load ramped up; users moved to East Coast servers.
Most of the outage events didn't impact the entire user population, but rather 30% or less of it, Curtis told me. While no specific data confirms this, his conclusion seems logical considering the March 2 storm, the different reporting pattern visible for the March 15 issue, and the limited scope of CCaaS (with the number of cloud contact center seats typically 5% of overall business communications seats). The April 3 event, resulting from a core UCaaS issue and affecting a large number of West Coast users, had the largest impact.
Based on the charts, we can estimate the maximum impact of the outages as: one hour of average customer unavailability for Jan. 22, two hours each for March 2 and April 3, and three hours for March 15. Added up, the total is eight hours of average reported outage periods for RingCentral customers in roughly the first three months of 2018. This equates to 32 hours for the year, or, in our count of nines, between two and three nines (99.63% availability), if all of the outages impacted a specific customer. However, if we assume that the events on average impacted 30% of the user base, then the actual impact is 99.89% availability, or about three nines. If the outages on average only impacted 10% of the RingCentral user base, the availability increases to 99.96%, or higher than three nines.
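The extrapolation above can be reproduced in a few lines. This is only a sketch of the estimate: the per-event hours are the chart-based figures quoted in the text, the observation window is treated as one quarter of a year, and the share of users affected is an assumed parameter rather than measured data.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

# Estimated hours of major impact per event, as read from the charts
outage_hours = {"Jan. 22": 1, "March 2": 2, "March 15": 3, "April 3": 2}

observed_hours = sum(outage_hours.values())
annualized_hours = observed_hours * 4  # assumes the window is about one quarter


def availability_pct(annual_outage_hours: float, share_affected: float) -> float:
    """Average availability when each outage hits only a share of the user base."""
    return 100 * (1 - annual_outage_hours * share_affected / HOURS_PER_YEAR)


# Share of users affected: all, 30%, and 10% (assumed scenarios)
scenarios = {
    share: round(availability_pct(annualized_hours, share), 2)
    for share in (1.0, 0.3, 0.1)
}

for share, pct in scenarios.items():
    print(f"{share:.0%} of users affected: {pct:.2f}% availability")
```

Changing the share or the annualization factor shows how sensitive the "count of nines" is to assumptions that crowd-sourced reports alone can't confirm.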
While these are extrapolated estimates, if a premises PBX vendor touted availability of three to four nines (only one to nine hours of downtime per year), that vendor would have been challenged to sell many systems against competitors claiming five nines. On the other hand, as Curtis noted, on legacy systems we can't track issues as transparently as we can with cloud-delivered services. Regardless, customers considering a cloud migration need to understand availability of their chosen provider and the impact on their organization.
While the April 3 RingCentral event started this investigation, RingCentral isn't the only UCaaS provider facing availability challenges. Other than UCaaS software, the operational team, and some tools, all of the UCaaS vendors have the same options from which to choose in delivering their solutions (data centers, servers, networking, Internet peering, etc.). The result is that outages are actually fairly common across the real-time cloud solutions providers. Over the last 16 months, the number of reported outages has been fairly high across the xCaaS community. For example:
- 8x8 had one two-hour outage on March 2, and three outages in 2017
- Cisco WebEx has had two outages this year, and 20 outages in 2017
- Mitel had an outage on Jan. 24, but none reported in 2017
- Vonage has had five outages in 2018, and 15 in 2017 (however, to be fair, these include its consumer service, which operates separately from the business services)
- Microsoft Skype for Business had one outage in 2018, and one in late 2017, while Office 365 overall has had 11 reported outages in 2018 and 49 in 2017
Because Downdetector outage event reporting aggregates input from users across geographies and access carriers, and the number of people reporting incidents is relatively large, it generally should filter out local issues and show core service problems when a major outage occurs. The comments section also provides insight into the scope of the outages: "I have been down for 5 hours now." "Down in Colorado Springs, CO. And when I logged into their server status page, that's down, too." Clearly, administrators get frustrated when cloud services are offline without explanation, recourse, or notification. While most cloud vendors advertise that they'll inform customers of core outages, that doesn't seem to be all that common, based on users reporting outages to Downdetector across vendors. (This is a surprise? To whom? Fox in the hen house?)
Reliability or availability, or the lack thereof, should become a major factor in cloud communications purchasing decisions. If a perception emerges that cloud providers can't deliver reasonable UCaaS availability levels, it'll impact enterprise willingness to move to the cloud for UC services.
The Verizon "Can you hear me now?" campaign was a clear attempt to mitigate early availability and quality issues on the cellular network. While many users have become accustomed to the volatility of cellular telephony, the majority of enterprise endpoints aren't configured for knowledge workers or paired with mobile devices. Loss of service totally negates the value of these devices, be they desktop, conference room, or general office phones. A cloud-connected phone without a cloud service is a fantastic paperweight.
With UCaaS, CCaaS, and other cloud-based communications services, the impact of downtime and outages is both significant and immediate. It may be that the UCaaS space should be more availability-focused than other cloud service markets due to the immediacy of the service and the expectation of users based on 100 years of five-nines PSTN. Having access to tools like Downdetector enables consideration of past performance in contract awards. In many ways this is no different than evaluating mutual funds based on their past performance via Morningstar ratings. For organizations looking at cloud vendors or consultants analyzing proposals, including an analysis of availability should become a major part of future evaluations.
Cloud availability and outages could become a major issue in 2018, both generally for UCaaS and for specific vendors. With some providers experiencing an outage lasting an hour or two every month or even more frequently, how much time elapses before customers perceive it as an issue? We need to hear how providers are addressing this and why availability should be a consideration in the decision process.
Editor's note: This article has been updated since original publication.