This commit is contained in:
Space-Banane
2026-04-02 19:47:53 +02:00
parent 9ed4e240c2
commit bdddf602be
30 changed files with 783 additions and 17 deletions

View File

@@ -0,0 +1,25 @@
# Self-Healing Service Monitor
Detect service degradation early and execute predefined recovery actions before escalating to humans.
## Problem
Teams get paged for recoverable incidents because monitoring is alert-only and lacks trusted automated remediation.
## Core capabilities
- Combine health checks, latency, and error-rate signals into failure states.
- Execute tiered recovery actions: restart, clear cache, failover, scale up.
- Validate recovery with post-action checks before resolving alerts.
- Escalate with rich context only if auto-healing fails.
## MVP scope
- Integrate with Prometheus/Grafana alerts.
- Define safe action catalog per service.
- Maintain cooldown windows to prevent action loops.
## Success criteria
- Lower pager volume for transient failures.
- Faster service recovery for known incident classes.
## Stretch ideas
- Adaptive run selection based on incident fingerprint similarity.
- Automatic rollback if a healing action worsens key metrics.