This commit is contained in:
Space-Banane
2026-04-02 19:47:53 +02:00
parent 9ed4e240c2
commit bdddf602be
30 changed files with 783 additions and 17 deletions

View File

@@ -0,0 +1,25 @@
# Cost Anomaly Detector
Monitor cloud spending in near real time, detect abnormal spikes, and trigger alerts or safe remediation workflows before monthly bills explode.
## Problem
Cloud spend surprises are usually discovered days later in billing dashboards, after the costly behavior has already continued.
## Core capabilities
- Ingest billing and usage metrics from cloud providers.
- Learn baseline cost patterns by service, account, and environment.
- Detect anomalies using thresholds plus statistical deviation.
- Trigger response playbooks: alert only, scale-down, or shutdown non-critical resources.
## MVP scope
- AWS Cost Explorer integration first, then Azure/GCP.
- Daily anomaly scoring with high-priority instant alerts.
- Human approval required for any destructive remediation.
## Success criteria
- Faster detection of abnormal cost events.
- Measurable reduction in unplanned monthly overspend.
## Stretch ideas
- Forecast end-of-month bill scenarios with confidence ranges.
- Recommend reserved instances or savings plans automatically.

View File

@@ -0,0 +1,25 @@
# Runbook Executor
Store operational runbooks as structured YAML or Markdown and execute them step-by-step with human approval gates for risky actions.
## Problem
Incident procedures often live in docs that are hard to follow under pressure, leading to skipped checks and inconsistent responses.
## Core capabilities
- Parse runbooks into explicit steps, prerequisites, and rollback actions.
- Execute commands with dry-run previews and approval prompts.
- Record execution logs, decisions, and timestamps for postmortems.
- Pause and hand off to a human operator at defined control points.
## MVP scope
- YAML schema for step type, command, timeout, and approval level.
- CLI and chat-driven execution interface.
- Immutable audit log output per run.
## Success criteria
- Reduced incident recovery time.
- Higher runbook adherence and better post-incident traceability.
## Stretch ideas
- Simulated practice mode for fire drills.
- Auto-generate runbook health reports showing stale or untested steps.

View File

@@ -0,0 +1,25 @@
# Self-Healing Service Monitor
Detect service degradation early and execute predefined recovery actions before escalating to humans.
## Problem
Teams get paged for recoverable incidents because monitoring is alert-only and lacks trusted automated remediation.
## Core capabilities
- Combine health checks, latency, and error-rate signals into failure states.
- Execute tiered recovery actions: restart, clear cache, failover, scale up.
- Validate recovery with post-action checks before resolving alerts.
- Escalate with rich context only if auto-healing fails.
## MVP scope
- Integrate with Prometheus/Grafana alerts.
- Define safe action catalog per service.
- Maintain cooldown windows to prevent action loops.
## Success criteria
- Lower pager volume for transient failures.
- Faster service recovery for known incident classes.
## Stretch ideas
- Adaptive run selection based on incident fingerprint similarity.
- Automatic rollback if a healing action worsens key metrics.