If I point to a car and say “boat,” does the vehicle change magically to conform to the name? We all know how silly that sounds, but many sources seem to think that saying “cloud-native” is all that’s necessary to make something cloud-native. The cloud-native name isn’t going to redefine reality, and the Fastly outage demonstrates why we should care a great deal about that.
Fastly is a content delivery network (CDN) that provides content caching for websites and videos to improve user quality of experience. It had an outage triggered by a software bug and a perfectly valid user configuration change, which took down a lot of websites because sometimes-tiny (dare we say, micro!) elements of those sites’ pages were cached on Fastly. The issue got fixed pretty quickly, but many wondered how some relatively little company’s problem could have such widespread consequences, particularly since CDNs like Fastly are really a form of a cloud provider. Aren’t clouds supposed to be scalable and reliable?
The short answer is: it depends. If we take an application that’s designed to run on a server, and simply move it to an infrastructure-as-a-service (IaaS) cloud virtual machine, the failure of the cloud server would take the application down, just as the failure of its own server would. The cloud provider would then, depending on the contract terms, offer recovery options, but whether they’d be effective would depend on what state the users of the application and its databases were in when the failure occurred. A seamless failover seems unlikely.
Suppose, though, that the application had been designed to take advantage of the cloud’s scalability and resilience. Suppose that each component of the application was designed to be scaled or replaced as needed and that all database activities and user connections were designed for failure. This is what “cloud-native” design should be—designed not for the old data center world but the new world of cloud.
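To make “designed for failure” a little more concrete, here is a minimal Python sketch. It is not drawn from Fastly or from any particular cloud platform; it simply assumes a stateless component that runs as several interchangeable replicas. The names `ReplicaDown`, `call_with_failover`, `failed_replica`, and `healthy_replica` are hypothetical, invented for illustration.

```python
class ReplicaDown(Exception):
    """Hypothetical error raised when one replica cannot serve a request."""


def call_with_failover(replicas, request):
    """Try each interchangeable replica in turn and return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as err:
            # Remember the failure and move on to the next replica.
            last_error = err
    # Every replica failed (or none were supplied).
    raise RuntimeError("all replicas failed") from last_error


def failed_replica(request):
    """Stand-in for an instance that has gone down."""
    raise ReplicaDown("instance unreachable")


def healthy_replica(request):
    """Stand-in for an instance that is still serving traffic."""
    return f"handled {request}"


# The first replica is down, so the request is served by the second.
result = call_with_failover([failed_replica, healthy_replica], "GET /home")
```

The caller never notices the first failure, but only because this component holds no session state. The database activity and user connections mentioned above would need the same kind of treatment before a failover could be called seamless.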
If a little CDN company’s outage could cause a disruption that took news and government sites offline, what would happen to the Internet and our communications overall if a public cloud implementation of 5G’s radio access network (RAN) logic were to fail? Are all the implementations available today (or on their way) truly cloud-native? If not, are we unrealistically thinking that the resilience of cloud computing is protecting these critical assets?
We will never create reliable cloud services without making the “cloud-native” concept meaningful, and without actually building applications that exploit the cloud to its full potential. The fact that we’re washing virtually everything with the cloud-native term doesn’t make that easy; nobody holds vendors or service providers accountable, and few seem to care if the claims are true. That’s not the only problem though, because cloud-native technology and techniques aren’t universally understood or accepted, even by developers.
We hear all the time that a cloud-native application is broken into microservices, coupled with network connections. While this framework would be scalable and resilient, it could also generate significant latency when compared to an application whose features were collected in a single software component. You can’t just turn monoliths into microservices, host them in the cloud, then claim cloud-native operation—any more than the monoliths intact could support that claim. You have to design an application so that the benefits of microservice segmentation don’t compromise the quality of experience.
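The latency tradeoff is simple arithmetic, and a small Python sketch shows it. The numbers here are invented for illustration, not measurements: four features with assumed compute times, and an assumed 1.5 ms network round trip per service call. The sketch also assumes the services are called one after another rather than in parallel.

```python
def monolith_latency_ms(compute_ms):
    """Features called in-process: latency is just the summed compute time."""
    return sum(compute_ms)


def microservice_latency_ms(compute_ms, hop_ms):
    """Features called as a sequential chain of services:
    each call adds one network round trip on top of its compute time."""
    return sum(compute_ms) + hop_ms * len(compute_ms)


# Assumed per-feature compute times, in milliseconds (illustrative only).
feature_compute_ms = [2.0, 3.0, 5.0, 1.0]
# Assumed network round trip per service call, in milliseconds.
network_hop_ms = 1.5

monolith = monolith_latency_ms(feature_compute_ms)
microservices = microservice_latency_ms(feature_compute_ms, network_hop_ms)
added = microservices - monolith
```

Under these assumptions the monolith answers in 11 ms and the microservice chain in 17 ms. The added 6 ms grows with every service in the chain, which is why the way an application is segmented has to be a design decision rather than an afterthought.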
Half of all the “cloud-native” applications developed by enterprises who shared their experiences with me failed to meet their business case, either because the application was more costly than the justifications predicted, or because the performance was unsatisfactory. We could be justified in thinking that the cloud-native development skill levels of those writing CDN software or software to host 5G RAN could be higher, but we can’t be sure. Is it possible that any given cloud-native application has a 50/50 chance of delivering on its promise?
Another critical lesson we must learn is that we are increasingly dependent on advanced technology concepts built by multiple organizations, and integrated so that their co-dependencies are far from obvious even to many of the organizations who build our network and cloud experiences. Add in development techniques that don’t fully exploit the cloud, and why are we surprised when a seemingly small problem turns into an (apparently) major outage?
I’m not saying that everything on the Internet and the cloud is going to fall to pieces; even among enterprises, a quarter of the failed cloud-native applications are already being rewritten. We are learning the lessons of the cloud, but at a greater price than necessary. The cost and performance characteristics of a cloud-native implementation are predictable, and so are the impacts of failing to optimize applications for the cloud when the decision is made to host them there. It would be easier to figure out the right answer up front than to try wrong answers that increase costs and create embarrassing failures.
We all have a role to play in making this right. As consumers of technology material, we should demand specifics to back up claims of “cloud-native” technology. As editors, authors, analysts, and reporters, we should demand the same backup when the claims are presented to us for consideration. As suppliers of technology, we should accept that no amount of marketing benefit justifies hiding the truth, particularly when doing so could contribute to a long-term loss of credibility for the very things we’re trying to promote.