Automatic Network Anomaly Quantification
SLAC contact: Dr. Les, Connie, Adnan Iqbal
High Energy Physics data analysis is increasingly dependent on high-speed networks to share data across worldwide collaborations. To get data and results when they are needed, it is necessary to be aware of network conditions. Network performance monitoring is therefore important for all networks, and particularly for the high-speed networks that support High Energy Physics projects.
Network performance monitoring can be pursued in many directions, but we limit ourselves to two problems of specific interest to the High Energy Physics community:
1- How do we know that we are not getting the required performance?
2- If we know that we are not getting the required performance, what is the reason for the drop?
One way to answer these questions is to monitor the performance of each application individually. Unfortunately, users are not computer analysts and do not have time to do this. The aim is instead to do it automatically, without involving the user (a physicist).
To achieve this goal, the first requirements are:
a) Automated tools that provide network performance parameters
b) Installation, configuration and result gathering at the required sites
Fortunately, these two steps have already been done. We have tools such as Pathchirp, Pinger and Thrulay, which measure different network parameters using different techniques. Currently we have a number of sites where these tools are installed, configured and working properly.
These tools give us data, which we then analyze to establish baseline expectations for performance and to find points in time where performance drops significantly (a network performance event). After finding a possible event, we have to determine its most probable cause. Currently we use an automated system to detect possible events and then manually analyze the data associated with each event to find its root cause. Causes include route changes and congestion in the network or at an end host.
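The idea of a baseline and a significant drop can be illustrated with a minimal sketch. This is not the detection system referred to above; it assumes a simple rolling-median baseline, and the function name `detect_drops` and the parameters `window`, `drop_fraction` and `min_run` are hypothetical, chosen only for illustration.

```python
from statistics import median


def detect_drops(samples, window=20, drop_fraction=0.3, min_run=3):
    """Return the start indices of sustained drops below a rolling baseline.

    samples: measurements in time order (for example throughput in Mbit/s).
    window: number of recent normal samples that form the baseline history.
    drop_fraction: relative drop below the baseline treated as significant.
    min_run: consecutive low samples required before an event is reported.
    All parameter values are illustrative assumptions, not tuned settings.
    """
    events = []
    history = list(samples[:window])
    run_start = None
    run_len = 0
    for i in range(window, len(samples)):
        baseline = median(history)
        if samples[i] < baseline * (1.0 - drop_fraction):
            if run_len == 0:
                run_start = i
            run_len += 1
            # Report each sustained drop once, when it reaches min_run samples.
            if run_len == min_run:
                events.append(run_start)
        else:
            run_len = 0
            # Only normal samples update the history, so a drop in progress
            # does not pull the baseline down.
            history.append(samples[i])
            history.pop(0)
    return events
```

Requiring several consecutive low samples is one way to avoid raising an event for a single noisy measurement; a real deployment would need thresholds chosen from the measured data.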
1- At present we receive an alert for a possible event and analyze it manually to find its cause. This analysis needs to be automated (final goal).
2- To achieve the final goal, we need a canonical data set on which we can test our automated system.
3- We also need to identify possible anomalies, their causes and methods of quantifying them.
4- Analysis of results and deployment.
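As an illustration of how a canonical data set could be used to test an automated system, the sketch below compares detected event positions with hand-labeled ones. It assumes events are represented as sample indices and that a detection counts as correct if it falls within a small tolerance of a labeled event; the function name `score_detections` and the `tolerance` parameter are hypothetical.

```python
def score_detections(detected, labeled, tolerance=2):
    """Match detected event indices to labeled ones within +/- tolerance.

    Each labeled event can be matched at most once. Matching is greedy in
    time order, which is a simplifying assumption of this sketch.
    """
    unmatched = sorted(labeled)
    true_pos = 0
    for d in sorted(detected):
        match = next((t for t in unmatched if abs(t - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_pos += 1
    return {
        "true_pos": true_pos,
        "false_pos": len(detected) - true_pos,  # alerts with no labeled event
        "false_neg": len(unmatched),            # labeled events that were missed
        "precision": true_pos / len(detected) if detected else 0.0,
        "recall": true_pos / len(labeled) if labeled else 0.0,
    }


# Example with made-up indices: two of three detections match labeled events.
result = score_detections([30, 75, 120], [31, 120, 200])
```

Counting false alerts and missed events separately helps show whether a detector is too sensitive or not sensitive enough on the labeled data.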