arXiv:2605.15581v1 Announce Type: new
Abstract: LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair.
We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
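
The abstract does not give implementation details, so the following is only a minimal toy sketch of what counterfactual stage localization with patch-and-replay could look like over the four stages EP, HS, AS, and DR. Everything here is an assumption for illustration: the function names (`replay`, `localize_decisive_stage`), the toy stage functions, the candidate patches, and the `audit` check are hypothetical and are not taken from STAR or from LangGraph.

```python
from typing import Callable, Dict, Optional

# Stage names as listed in the abstract; the pipeline order is EP -> HS -> AS -> DR.
STAGES = ["EP", "HS", "AS", "DR"]

# Hypothetical toy type: a stage maps the previous stage's output to its own output.
StageFn = Callable[[object], object]


def replay(stage_fns: Dict[str, StageFn], start: str, value: object) -> object:
    """Re-run every stage after `start`, treating `value` as the output of `start`."""
    idx = STAGES.index(start)
    for name in STAGES[idx + 1:]:
        value = stage_fns[name](value)
    return value


def localize_decisive_stage(
    stage_fns: Dict[str, StageFn],
    patches: Dict[str, object],
    audit: Callable[[object], bool],
) -> Optional[str]:
    """Return the earliest stage whose patched output makes the replayed result pass `audit`."""
    for name in STAGES:
        if name not in patches:
            continue
        if audit(replay(stage_fns, name, patches[name])):
            return name
    return None


# --- Toy workflow with a deliberate bug in the hypothesis stage (all hypothetical) ---
def collect_evidence(_: object) -> dict:  # EP
    return {"service": "cart", "symptom": "cpu_saturation"}


def buggy_hypotheses(evidence: dict) -> list:  # HS: ignores the observed symptom
    return [(evidence["service"], "network_delay")]


def analyze(hypotheses: list) -> tuple:  # AS: pick the top-ranked hypothesis
    return hypotheses[0]


def report(choice: tuple) -> str:  # DR
    service, fault = choice
    return f"{service}:{fault}"


stage_fns = {"EP": collect_evidence, "HS": buggy_hypotheses, "AS": analyze, "DR": report}

# Candidate replacement outputs for individual stages (assumed to come from some auditor).
patches = {
    "EP": {"service": "cart", "symptom": "cpu_saturation"},
    "HS": [("cart", "cpu_saturation")],
}


def audit(final: object) -> bool:
    """Stand-in acceptance check; a real system would not have the ground truth at hand."""
    return final == "cart:cpu_saturation"


decisive = localize_decisive_stage(stage_fns, patches, audit)
```

In this toy setup, patching EP does not change the outcome because the faulty HS stage still runs during replay, whereas patching the HS output and replaying AS and DR yields a passing report, so HS is flagged as the decisive stage.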
