UC&C Reliability
For a UC&C system to be useful, it must be reliable. Why must this point be made? Isn't it obvious? Of course, the UC&C system hardware itself must be properly configured and functioning. However, the complexity of the infrastructure that is used by the UC&C system often makes it difficult to determine the cause of a problem or to make sure that there aren't problems lurking within the infrastructure. Let's look more closely at the network infrastructure and some problems that we've seen in real networks.
Network Infrastructure
A group other than the UC&C team often runs the network infrastructure. The domain knowledge is different, the terminology is different, and the set of configuration, operation, and troubleshooting tasks is different. Even if Infrastructure and UC&C are part of the same department, there is enough difference in the work performed that the two parts of the organization may not communicate well with one another.
In addition, the network infrastructure team is often so busy that low-level problems that can negatively affect UC&C go unhandled. These problems include packet loss (caused by duplex mismatch, congestion due to data bursts, defective cabling, or sources of RFI/EMI), lack of QoS, and untuned QoS. Finding all of them requires monitoring of the network infrastructure.
Infrastructure Redundancy
Monitoring infrastructure redundancy involves identifying redundant interfaces and devices that are down, which compromises the resilience of the network infrastructure. We have frequently encountered cases where a redundant part of a network is down, but because it is backed up, network connectivity is not affected. When the backup component then fails, an outage occurs. Post-mortem analysis often finds that the first failure occurred several months prior to the second failure but was not noticed.
The network management system should allow the network and UC&C teams to easily identify failures in a redundant infrastructure. Using device and interface tags allows easy grouping of network elements to identify critical parts of the infrastructure and generate alerts when a critical component is down. (See http://netcraftsmen.net/blogs/entry/device-and-interface-tagging.html.)
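As a rough illustration of the tagging idea, the Python sketch below groups interfaces by a redundancy tag and reports any group that has lost some, but not all, of its members. The `Interface` record, the tag names, and the device names are hypothetical; a real network management system would supply this data from its own inventory and polling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    device: str
    name: str
    tag: str        # hypothetical redundancy-group tag
    oper_up: bool   # operational status from the last poll


def degraded_groups(interfaces):
    """Return {tag: [down members]} for groups that are only partly up."""
    groups = defaultdict(list)
    for intf in interfaces:
        groups[intf.tag].append(intf)

    degraded = {}
    for tag, members in groups.items():
        down = [m for m in members if not m.oper_up]
        # Some members are down while others still carry traffic:
        # connectivity is intact, but the redundancy is gone.
        if down and len(down) < len(members):
            degraded[tag] = [f"{m.device}:{m.name}" for m in down]
    return degraded


# Made-up inventory for illustration only.
inventory = [
    Interface("core-sw1", "Te1/1", "dc-interconnect", True),
    Interface("core-sw2", "Te1/1", "dc-interconnect", False),
    Interface("vg-1", "Gi0/0", "voice-gateway-uplink", True),
    Interface("vg-1", "Gi0/1", "voice-gateway-uplink", True),
]

alerts = degraded_groups(inventory)
for tag, down in alerts.items():
    print(f"ALERT: redundancy lost in '{tag}': down = {', '.join(down)}")
```

The point of the check is that it fires on the first failure, while users still have connectivity, instead of months later when the backup fails too.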
War Story: Network Congestion Between Sites
A customer's CIO brought us in to look at reports of network performance problems. The network staff said that everything was working as designed. The network management system wasn't monitoring key interface error counters, including those on the 1-Gbps links between two local sites, each of which included a large data center.
Our analysis found significant network congestion on the links between the sites. The interfaces were dropping packets due to the congestion. The network staff had increased buffering to try to reduce the drops, but they increased it so much that it created a problem for TCP's retransmit algorithm. The configurations had been built by CCIEs, who should have known better. It is much better to drop excess traffic and have the network protocols recover than to buffer it in the network.
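To see why oversized buffers hurt, it helps to work out how long a full buffer takes to drain onto the link. The Python sketch below is only an illustration: the buffer sizes are hypothetical (the values used in this engagement are not given here), and the function name is ours.

```python
LINK_BPS = 1_000_000_000  # 1-Gbps inter-site link from the story


def queue_drain_time_ms(buffer_bytes, link_bps):
    """Time for a full buffer to drain onto the link, in milliseconds."""
    return buffer_bytes * 8 / link_bps * 1000


# Hypothetical buffer sizes, chosen only to show the trend.
for buffer_mb in (1, 10, 50):
    delay = queue_drain_time_ms(buffer_mb * 1_000_000, LINK_BPS)
    print(f"{buffer_mb:>2} MB buffer: up to {delay:.0f} ms added delay")
```

Every packet that waits behind a full buffer arrives late instead of being dropped. TCP then sees an inflated round-trip time and a delayed loss signal, which is the kind of interference with its retransmit behavior that we observed.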
War Story: Monitor All the Interfaces
In another consulting engagement, we found key server interfaces that were dropping many packets because of congestion. Our analysis determined that several high-volume servers were overrunning the switch interfaces during traffic bursts, negatively affecting the applications.
The customer was not aware of the problem. We asked the network management team about monitoring all data center interfaces so that they could report similar problems in the future. Unfortunately, they said that the network management system was too expensive to monitor all data center interfaces. What hadn't been considered was the expense of poor productivity when the network started dropping packets. To get an idea of the impact, take a look at some of the blogs that we've done at NetCraftsmen on the subject. (See http://netcraftsmen.net/blogs/entry/understanding-interface-errors-and-tcp-performance.html, http://netcraftsmen.net/blogs/entry/tcp-performance-and-the-mathis-equation.html, and http://netcraftsmen.net/blogs/entry/why-is-the-application-slow.html.)
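For a feel of the numbers, the Mathis equation mentioned in those posts bounds the throughput of a single TCP flow at roughly (MSS / RTT) × (C / √p), where p is the packet loss rate. The Python sketch below evaluates that bound. The MSS, round-trip time, and loss rates are assumed example values, and the constant C is taken as √(3/2), the value commonly quoted for the model's simplest case.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on one TCP flow's throughput (bits/second)."""
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


MSS_BYTES = 1460   # assumed: typical Ethernet MSS
RTT_S = 0.010      # assumed: 10 ms round-trip time

for loss in (0.0001, 0.001, 0.01):   # 0.01%, 0.1%, 1% packet loss
    mbps = mathis_throughput_bps(MSS_BYTES, RTT_S, loss) / 1e6
    print(f"loss {loss:.2%}: at most about {mbps:.1f} Mbps per flow")
```

Because loss appears under a square root, a hundredfold increase in loss cuts the bound by a factor of ten. Even a loss rate that looks small on an interface counter can therefore cap application throughput well below the link speed.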
The applications in this case were moving document images, which is not much different from many of the video and graphic image transfers that happen with today's UC&C applications. The point is that the network monitoring system needs to collect data from all interfaces in the enterprise.
In particular, monitor all active interfaces, especially those that connect to UC&C servers and MCUs, where traffic volumes can be high. Monitor the data paths between endpoints and these critical servers to find network problems that affect call setup and can cause call failures. Monitor the data paths between endpoints to identify problems that affect the quality of calls once they are established. Occasionally, a single problem will affect both call setup and call quality. Such a case is usually easy to recognize because the common component is the endpoint and its connectivity into the network.
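The core of this kind of monitoring is simple: poll each interface's packet and discard counters, take the difference between polls, and flag interfaces whose drop rate exceeds a threshold. The Python sketch below shows the arithmetic, including a 32-bit counter that wraps between polls. The sample data, field names, interface names, and 0.1% threshold are all hypothetical, and `out_pkts` is assumed to count every packet offered for transmission, including those later discarded.

```python
COUNTER32_MAX = 2**32
DROP_THRESHOLD_PERCENT = 0.1   # hypothetical alert threshold


def counter_delta(prev, curr, modulus=COUNTER32_MAX):
    """Difference between two counter samples, allowing for one wrap."""
    return (curr - prev) % modulus


def drop_percent(prev, curr):
    """Percentage of offered outbound packets discarded between two polls."""
    offered = counter_delta(prev["out_pkts"], curr["out_pkts"])
    dropped = counter_delta(prev["out_discards"], curr["out_discards"])
    return 100.0 * dropped / offered if offered else 0.0


# Two made-up polls of the same interfaces.
previous = {
    "srv-sw1:Gi1/0/5": {"out_pkts": 4_294_000_000, "out_discards": 1_200},
    "srv-sw1:Gi1/0/6": {"out_pkts": 10_000_000, "out_discards": 0},
}
current = {
    "srv-sw1:Gi1/0/5": {"out_pkts": 1_032_704, "out_discards": 21_200},  # wrapped
    "srv-sw1:Gi1/0/6": {"out_pkts": 15_000_000, "out_discards": 50},
}

flagged = [
    name for name in current
    if drop_percent(previous[name], current[name]) > DROP_THRESHOLD_PERCENT
]
print("Interfaces over threshold:", flagged)
```

Running the same check over every active interface, not just a hand-picked few, is what exposes problems like the overrun server ports described above.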
Is your network monitoring system too expensive? Take a look at Statseeker, a scalable and affordable network performance monitoring system.
War Story: Remote Site Entertainment Traffic
Another customer asked us to take a look at poor voice performance to a remote site. The site was connected via a T3 link (45 Mbps). There were reports of dropped calls and voice quality problems.
We used the Riverbed Application Performance Monitoring tool called Application Response eXpert (ARX) to monitor the traffic to/from the site. Its analysis determined that half of the traffic to the remote site was from three Internet sources: Pandora, Akamai, and Limelight. We determined that these sources were entertainment traffic, but the organization's Internet access policy prevented us from applying a firewall rule to block the traffic.
We applied QoS, but continued to see high numbers of drops in the high-priority queue. Both voice and the business application in use at the remote site used small packet sizes, and we determined that the 64-buffer queue was easily overrun. Increasing the number of buffers in the high-priority queue from 64 to 256 provided enough buffering to handle the voice and application traffic. QoS was configured to classify the entertainment traffic into the low-priority queue, where the configuration forced most of the congestion drops. The voice and business application performance returned to the desired level.
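A quick calculation shows why the larger queue was a reasonable trade here. The Python sketch below estimates the worst-case wait behind a full queue of small packets on the T3 link. It assumes each buffer holds one packet, that the queue drains at the full link rate, and that packets are 200 bytes (a G.711 voice packet carrying 20 ms of audio plus IP/UDP/RTP headers). These are illustrative assumptions, not measurements from the engagement.

```python
LINK_BPS = 45_000_000   # T3 rate quoted in the story
PACKET_BYTES = 200      # assumed small-packet size (see lead-in)


def max_queue_delay_ms(queue_packets, packet_bytes=PACKET_BYTES, link_bps=LINK_BPS):
    """Worst-case wait behind a full queue of equal-sized packets."""
    return queue_packets * packet_bytes * 8 / link_bps * 1000


delays = {depth: max_queue_delay_ms(depth) for depth in (64, 256)}
for depth, delay in delays.items():
    print(f"{depth:>3} buffers: up to {delay:.2f} ms of queueing delay")
```

Under these assumptions, even the 256-buffer queue adds less than 10 ms in the worst case, so it can absorb four times the burst without adding the kind of delay that harms voice.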
Summary
Without good visibility into the network's operation, it is impossible to accurately and quickly diagnose UC&C problems or to identify when the infrastructure redundancy has been compromised. Network monitoring tools provide this visibility only when they are consistently applied across the entire infrastructure. If total cost of ownership is a problem with your existing tools, then perhaps it is time to look for more scalable tools.