TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

2026-04-24

Summary

This paper introduces TingIS, a system that detects problems in large cloud-native services in near real time by analyzing the incident reports that affected customers submit.

What's the problem?

When a large online service has a problem, users send in many reports. These reports are noisy, arrive at a high rate, and describe the same issue in different ways depending on which business line is affected. That makes it hard to work out what is *actually* going wrong, and even a few minutes of downtime can cause large financial losses and damage user trust.

What's the solution?

TingIS combines efficient indexing with Large Language Models (LLMs) to make sense of these user reports. It first indexes incoming reports, then uses an LLM to decide whether reports describe the same event and should be merged, even when the wording differs. It also filters out noise using domain knowledge, statistical patterns, and behavioral signals, and routes each event to the business line that is affected. The system is deployed in production, where it handles a peak of over 2,000 messages per minute.

Why does it matter?

TingIS matters because it shortens the time between a problem appearing and the company learning about it. In production it reaches a P90 alert latency of 3.5 minutes and discovers 95% of high-priority incidents, including risks that conventional monitoring misses. Faster detection means less downtime, lower financial losses, and better user trust.

Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
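The abstract describes event linking as two stages: a cheap index narrows down candidate events, and a more expensive model decides whether to merge. The sketch below illustrates that general pattern only and is not the authors' implementation. Every name in it (`EventLinker`, `should_merge`, `jaccard_judge`, `max_candidates`) is hypothetical, the inverted token index is an assumed stand-in for whatever indexing TingIS uses, and the Jaccard-overlap judge is a placeholder for the LLM merge decision.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Event:
    event_id: int
    messages: List[str] = field(default_factory=list)
    tokens: Set[str] = field(default_factory=set)


class EventLinker:
    """Two-stage linking: index lookup for candidates, then a merge judge.

    `should_merge` is a hypothetical hook; in a real system it could wrap an
    LLM call. Here it is any callable (message, event) -> bool.
    """

    def __init__(self, should_merge: Callable[[str, Event], bool],
                 max_candidates: int = 3) -> None:
        self.should_merge = should_merge
        self.max_candidates = max_candidates
        self.events: List[Event] = []
        self.index: Dict[str, Set[int]] = {}  # token -> ids of events using it

    def candidates(self, tokens: Set[str]) -> List[Event]:
        """Stage 1: rank existing events by number of shared tokens."""
        counts: Dict[int, int] = {}
        for token in tokens:
            for event_id in self.index.get(token, ()):
                counts[event_id] = counts.get(event_id, 0) + 1
        ranked = sorted(counts, key=lambda eid: (-counts[eid], eid))
        return [self.events[eid] for eid in ranked[: self.max_candidates]]

    def add(self, message: str) -> int:
        """Link a message to an existing event or open a new one."""
        tokens = tokenize(message)
        for event in self.candidates(tokens):
            # Stage 2: the (expensive) judge only sees a few candidates.
            if self.should_merge(message, event):
                target = event
                break
        else:
            target = Event(event_id=len(self.events))
            self.events.append(target)
        target.messages.append(message)
        target.tokens |= tokens
        for token in tokens:
            self.index.setdefault(token, set()).add(target.event_id)
        return target.event_id


def jaccard_judge(message: str, event: Event, threshold: float = 0.3) -> bool:
    """Placeholder for the LLM decision: merge on sufficient token overlap."""
    tokens = tokenize(message)
    union = tokens | event.tokens
    return bool(union) and len(tokens & event.tokens) / len(union) >= threshold
```

For example, with `linker = EventLinker(jaccard_judge)`, adding "payment failed with error code 502 on checkout" and then "checkout payment failed error 502" places both messages in the same event, while "cannot log in to my account" opens a new one. The design point is that the judge runs on at most `max_candidates` events per message, which is what keeps an expensive merge decision affordable at high throughput.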
