Business service health monitoring and event correlation to minimize MTTR
The client is a diversified global insurer that provides insurance products and services to individuals, families, and businesses. The client was looking for an IT operations (ITOps) platform with intelligent algorithms and wanted to reduce its incident levels through data-driven analytics.
Business Challenge
The client's mean time to repair (MTTR) was increasing. It lacked visibility into the health of its business services and their interdependencies, and its reliance on manual thresholds and rule-based alerts made event noise difficult to reduce. Inefficient event correlation further increased both MTTR and mean time to detect (MTTD) and left the client with little situational awareness across business services.
Approach
Conducted workshops with business and IT stakeholders to understand the challenges
Prototyped a solution and presented the approach to the client
Collected sample operations data from various monitoring solutions and developed correlation policies to deduplicate events
Integrated Splunk ITSI with BMC to collect and analyse event and incident data
Utilized out-of-the-box algorithms in Splunk ITSI for incident prediction
Developed custom algorithms for incident and event correlation
Created dashboards in Splunk to monitor business services health and performance
Demonstrated an incident prediction model based on historical events and trends
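The event deduplication step in the approach above can be illustrated with a minimal sketch. It is not the client's correlation policy and does not use any Splunk ITSI API. The `Event` fields, the `deduplicate` function, the grouping key (host, check, severity), and the five-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    # Hypothetical event shape; real monitoring tools expose different fields.
    timestamp: datetime
    host: str
    check: str
    severity: str
    count: int = 1  # number of raw events folded into this one


def deduplicate(events, window=timedelta(minutes=5)):
    """Collapse repeats of the same (host, check, severity) into one event.

    A repeat is folded into the earlier event when it arrives within
    `window` of the most recent occurrence with the same key.
    """
    merged = []
    last_seen = {}  # key -> (index into merged, timestamp of latest occurrence)
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.check, ev.severity)
        if key in last_seen and ev.timestamp - last_seen[key][1] <= window:
            idx = last_seen[key][0]
            merged[idx].count += 1
            last_seen[key] = (idx, ev.timestamp)
        else:
            merged.append(Event(ev.timestamp, ev.host, ev.check, ev.severity))
            last_seen[key] = (len(merged) - 1, ev.timestamp)
    return merged


# Usage: three repeats within the window collapse into one event.
base = datetime(2024, 1, 1, 0, 0)
raw = [
    Event(base, "app01", "cpu", "critical"),
    Event(base + timedelta(minutes=1), "app01", "cpu", "critical"),
    Event(base + timedelta(minutes=1), "db01", "disk", "warning"),
    Event(base + timedelta(minutes=2), "app01", "cpu", "critical"),
    Event(base + timedelta(minutes=20), "app01", "cpu", "critical"),
]
deduped = deduplicate(raw)
```

Keying on a small set of fields keeps the sketch easy to reason about; a production policy would typically also consider topology or service membership when deciding which events belong together.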
Transformational Effects
Effective RCA (Root Cause Analysis) using consolidated ITOM data in one place
Anomaly detection and proactive incident remediation
Consolidated view of critical KPIs, metrics, and business services health
Insights into historical trends, patterns, and problem areas
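To illustrate how anomaly detection differs from the manual thresholds described in the business challenge, the following is a minimal sketch of a rolling z-score check. It is a generic statistical technique, not the algorithm used by Splunk ITSI or the custom algorithms built for the client. The function name `rolling_anomalies`, the window size, the threshold, and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def rolling_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the recent baseline.

    Each point is compared with the mean and sample standard deviation of
    the `window` points before it, so the alert level adapts to the metric
    instead of relying on a fixed, hand-set threshold.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat baseline: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Usage: hypothetical response times in milliseconds with one spike.
response_ms = [100, 102, 98, 101, 99, 100, 250, 101]
flagged = rolling_anomalies(response_ms)
```

In this sample only the spike at index 6 is flagged. The value that follows it is not, because the spike itself has widened the baseline; this is one reason practical systems often use more robust statistics than a plain mean and standard deviation.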