Building predictive service assurance based on machine learning
Connectivity issues in the backhaul network are the cause of many service problems and can result in a drop in customer satisfaction and loss of revenue. Discovering the source of those problems is a complex process. Once discovered, fixing the root cause of problems is essential to meeting service level objectives.
With backhaul performance issues expected to be a major obstacle to the successful rollout of 5G architecture and IoT, network operators need to put automated solutions in place now. New processes and tools are required to assure QoE in this new, dynamic environment. Operators need solutions that let them detect and isolate connectivity-related issues efficiently within a hybrid environment, where a 1:1 relationship between service topology and network topology no longer exists. “Always on” broadband connectivity means traditional approaches to troubleshooting and service assurance will not work in these new environments.
Detect connectivity issues via active testing
Active testing is the ideal way to detect issues within the backhaul network environment, for several reasons. First, since synthetic test traffic follows the same path as service traffic, tests accurately reflect real network performance. Second, because test and service traffic remain paired, test configurations do not need to change when testing a virtual or hybrid network. Finally, active testing scales: deployed as an active test VNF in virtual or hybrid networks, it gives operators 100% visibility of the services in their network even as more endpoints are added. This last benefit is why active testing will become essential to assuring the proper functionality of the newly created virtual services and functions of SDN/NFV networks.
Machine learning takes service assurance from reactive to proactive
Service assurance must move from a reactive approach, triggering action only after issues have occurred, to one that uses machine learning to learn the network’s behavior automatically and discover issues that could lead to future problems before they occur. Within this context, two main root cause scenarios exist, each with a different level of severity: sudden short-term outages and longer-term outages.
Short-term outages cause temporary service interruptions without warning; the network reacts automatically, service recovery is fast, and no intervention is required. Long-term outages, in contrast, stem from a non-recoverable failure in the network and require human intervention to restore service. They are more severe than short-term outages, but early warning signs, such as degradation in the performance of the mobile backhaul network, will be present.
For machine-learning systems to be successful, they must be able to do two things: predict an outage and determine the specific type (short or long term) in order to alert the appropriate personnel. Specifically, they must be able to perform the following tasks: identify an outage; find causal factors for degradation; classify the outage; and execute the right corrective actions.
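As a concrete illustration of these four tasks, the sketch below strings them together as a small pipeline. It is a minimal illustration only: the KPI names (frame_loss, jitter_ms) and the fixed thresholds are assumptions standing in for a trained model, not part of any particular product or protocol.

```python
# Minimal sketch of the four assurance tasks: identify an outage, find causal
# factors, classify it, and execute the right corrective action.
# KPI names and thresholds are illustrative assumptions, not a trained model.
from dataclasses import dataclass, field
from typing import Dict, List

LOSS_THRESHOLD = 0.01    # assumed: 1% frame loss marks degradation
JITTER_THRESHOLD = 5.0   # assumed: 5 ms jitter marks degradation

@dataclass
class Assessment:
    outage_expected: bool
    outage_type: str                       # "short-term", "long-term", or "none"
    causal_kpis: List[str] = field(default_factory=list)

def assess(kpis: Dict[str, float]) -> Assessment:
    """Tasks 1-3: identify a likely outage, find causal KPIs, classify it."""
    thresholds = {"frame_loss": LOSS_THRESHOLD, "jitter_ms": JITTER_THRESHOLD}
    causes = [name for name, limit in thresholds.items() if kpis.get(name, 0.0) > limit]
    if not causes:
        return Assessment(False, "none")
    # Degradation across several KPIs is taken here as a hint of a
    # non-recoverable (long-term) failure rather than a transient one.
    return Assessment(True, "long-term" if len(causes) > 1 else "short-term", causes)

def act(result: Assessment) -> None:
    """Task 4: route the prediction to the appropriate corrective action."""
    if not result.outage_expected:
        return
    if result.outage_type == "long-term":
        print("ALERT field operations, causal KPIs:", result.causal_kpis)
    else:
        print("Transient outage expected; keep monitoring:", result.causal_kpis)

act(assess({"frame_loss": 0.03, "jitter_ms": 7.2}))
```

In a real system the thresholding step would be replaced by a trained model such as those described in the next section.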
How to handle all that data—two approaches
Many backhaul assurance and monitoring systems use the Two-Way Active Measurement Protocol (TWAMP, RFC 5357) to test network connectivity and assess performance via key performance indicators such as frame loss and frame delay variation (jitter). With TWAMP active testing, test packets are sent between two endpoints, producing tens of KPIs per test point. Machine learning algorithms require large data sets to produce accurate results, so TWAMP’s large data sets are an ideal fit. But that also means infrastructures must be capable of handling terabytes of data in real time.
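To make the KPI side concrete, the snippet below computes two such indicators, frame loss and a simplified jitter measure (mean delay variation between consecutive packets), from per-packet round-trip delays. The input layout is an assumption for the example, and the simplified jitter formula is not the exact definition used by the RFCs.

```python
# Illustrative computation of two TWAMP-style KPIs from one test interval.
# The input format is an assumption for this example, not the TWAMP wire format,
# and the jitter measure is simplified (mean absolute delay variation).
from statistics import mean
from typing import List

def frame_loss_ratio(sent: int, received: int) -> float:
    """Fraction of test packets that were never answered."""
    return (sent - received) / sent if sent else 0.0

def jitter_ms(delays_ms: List[float]) -> float:
    """Mean absolute variation in delay between consecutive test packets."""
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return mean(diffs) if diffs else 0.0

# One test interval: 1000 packets sent, 997 answered; a few sample delays in ms.
print(frame_loss_ratio(sent=1000, received=997))   # -> 0.003
print(jitter_ms([12.1, 12.4, 11.9, 15.3, 12.2]))   # mean delay variation in ms
```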
So how do machine learning systems handle and utilize all that data? There are two commonly used approaches.
- Unsupervised anomaly detection: the system is fed unlabeled data and the algorithm learns what is anomalous within that data mass, discovering relationships in the data that would otherwise be undetectable. (Sketches of both approaches follow this list.)
Benefits:
- Since no data labeling is needed, setup is easier and less data preparation is required
- No separate data set needed for training
- Especially useful when there are several types of source data available
- Supervised failure prediction: data is labeled “normal” or “abnormal” so the algorithm learns to distinguish automatically between the two and to flag the issues contributing to abnormal network behavior.
Benefits:
- Learns to predict outages in large data sets with good accuracy and high success rates
- Analysis of causal factors when investigating predicted incidents can provide an additional means for the automatic classification of incidents.
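The two approaches can be sketched in a few lines of scikit-learn. Both snippets below use synthetic KPI data (frame loss, jitter, delay) purely for illustration; the choice of IsolationForest and RandomForestClassifier is one reasonable option, not a statement of how any particular assurance product works.

```python
# Unsupervised anomaly detection: learn what "normal" KPI vectors look like
# from unlabeled data and flag intervals that do not fit that pattern.
# The KPI columns and distributions are synthetic, for illustration only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Rows = test intervals, columns = KPIs (frame loss, jitter in ms, delay in ms).
normal_kpis = rng.normal(loc=[0.001, 2.0, 12.0], scale=[0.0005, 0.5, 1.0], size=(5000, 3))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_kpis)

degraded_interval = np.array([[0.030, 8.0, 25.0]])
print(detector.predict(degraded_interval))   # -1 = anomaly, 1 = normal
```

A supervised variant trains on KPI vectors labeled from past incident records; the resulting feature importances also give a first hint at causal factors. This sketch continues from the synthetic data above:

```python
# Supervised failure prediction: train on KPI vectors labeled "normal" (0) or
# "abnormal" (1), then use the model to flag intervals likely to precede outages.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

abnormal_kpis = rng.normal(loc=[0.020, 6.0, 20.0], scale=[0.005, 1.0, 2.0], size=(200, 3))
X = np.vstack([normal_kpis, abnormal_kpis])
y = np.array([0] * len(normal_kpis) + [1] * len(abnormal_kpis))

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print(clf.score(X_test, y_test))        # accuracy on held-out intervals
print(clf.feature_importances_)         # which KPIs drive the prediction
```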
Supervised learning algorithms with proprietary labeling are better suited to assurance and forecasting than unsupervised algorithms. However, they require extra effort to label the data, and additional data sources may be needed to identify abnormal network behavior.
Conclusion
In the evolution toward 5G and C-RAN infrastructure, automated service assurance solutions aided by machine learning are more than helpful in preventing wireless backhaul network issues; they are essential. For more details and specific examples of machine learning and active testing, read our white paper: Leveraging machine learning to eliminate backhaul bottlenecks in 5G networks.