titan-iac/knowledge/catalog/runbooks.json

[
  {
    "path": "runbooks/ci-gitea-jenkins.md",
    "title": "CI: Gitea \u2192 Jenkins pipeline",
    "tags": [
      "atlas",
      "ci",
      "gitea",
      "jenkins"
    ],
    "entrypoints": [
      "scm.bstein.dev",
      "ci.bstein.dev"
    ],
    "source_paths": [
      "services/gitea",
      "services/jenkins",
      "scripts/jenkins_cred_sync.sh",
      "scripts/gitea_cred_sync.sh"
    ],
    "body": "# CI: Gitea \u2192 Jenkins pipeline\n\n## What this is\nAtlas uses Gitea for source control and Jenkins for CI. Authentication is via Keycloak (SSO).\n\n## Where it is configured\n- Gitea manifests: `services/gitea/`\n- Jenkins manifests: `services/jenkins/`\n- Credential sync helpers: `scripts/gitea_cred_sync.sh`, `scripts/jenkins_cred_sync.sh`\n\n## What users do (typical flow)\n- Create a repo in Gitea.\n- Create/update a Jenkins job/pipeline that can fetch the repo.\n- Configure a webhook (or SCM polling) so pushes trigger builds.\n\n## Troubleshooting (common)\n- \u201cWebhook not firing\u201d: confirm ingress host, webhook URL, and Jenkins job is reachable.\n- \u201cAuth denied cloning\u201d: confirm Keycloak group membership and that Jenkins has a valid token/credential configured."
  },
  {
    "path": "runbooks/comms-verify.md",
    "title": "Othrys verification checklist",
    "tags": [
      "comms",
      "matrix",
      "element",
      "livekit"
    ],
    "entrypoints": [
      "https://live.bstein.dev",
      "https://matrix.live.bstein.dev"
    ],
    "source_paths": [],
    "body": "1) Guest join:\n- Open a private window and visit:\n  `https://live.bstein.dev/#/room/#othrys:live.bstein.dev?action=join`\n- Confirm the guest join flow works and the displayname becomes `<word>-<word>`.\n\n2) Keycloak login:\n- Log in from `https://live.bstein.dev` and confirm MAS -> Keycloak -> Element redirect.\n\n3) Video rooms:\n- Start an Element Call room and confirm audio/video with a second account.\n- Check that guests can read public rooms but cannot start calls.\n\n4) Well-known:\n- `https://live.bstein.dev/.well-known/matrix/client` returns JSON.\n- `https://matrix.live.bstein.dev/.well-known/matrix/client` returns JSON.\n\n5) TURN reachability:\n- Confirm `turn.live.bstein.dev:3478` and `turns:5349` are reachable from WAN."
  },
  {
    "path": "runbooks/kb-authoring.md",
    "title": "KB authoring: what to write (and what not to)",
    "tags": [
      "atlas",
      "kb",
      "runbooks"
    ],
    "entrypoints": [],
    "source_paths": [
      "knowledge/runbooks",
      "scripts/knowledge_render_atlas.py"
    ],
    "body": "# KB authoring: what to write (and what not to)\n\n## The goal\nGive Atlas assistants enough grounded, Atlas-specific context to answer \u201chow do I\u2026?\u201d questions without guessing.\n\n## What to capture (high value)\n- User workflows: \u201cclick here, set X, expected result\u201d\n- Operator workflows: \u201cedit these files, reconcile this kustomization, verify with these commands\u201d\n- Wiring: \u201cthis host routes to this service; this service depends on Postgres/Vault/etc\u201d\n- Failure modes: exact error messages + the 2\u20135 checks that usually resolve them\n- Permissions: Keycloak groups/roles and what they unlock\n\n## What to avoid (low value / fluff)\n- Generic Kubernetes explanations (link to upstream docs instead)\n- Copy-pasting large manifests (prefer file paths + small snippets)\n- Anything that will drift quickly (render it from GitOps instead)\n- Any secret values (reference Secret/Vault locations by name only)\n\n## Document pattern (recommended)\nEach runbook should answer:\n- \u201cWhat is this?\u201d\n- \u201cWhat do users do?\u201d\n- \u201cWhat do operators change (where in Git)?\u201d\n- \u201cHow do we verify it works?\u201d\n- \u201cWhat breaks and how to debug it?\u201d"
  },
  {
    "path": "runbooks/observability.md",
    "title": "Observability: Grafana + VictoriaMetrics (how to query safely)",
    "tags": [
      "atlas",
      "monitoring",
      "grafana",
      "victoriametrics"
    ],
    "entrypoints": [
      "metrics.bstein.dev",
      "alerts.bstein.dev"
    ],
    "source_paths": [
      "services/monitoring"
    ],
    "body": "# Observability: Grafana + VictoriaMetrics (how to query safely)\n\n## Where it is configured\n- `services/monitoring/helmrelease.yaml` (Grafana + Alertmanager + VM values)\n- `services/monitoring/grafana-dashboard-*.yaml` (dashboards and their PromQL)\n\n## Using metrics as a \u201ctool\u201d for Atlas assistants\nThe safest pattern is: map a small set of intents \u2192 fixed PromQL queries, then summarize results.\n\nExamples (intents)\n- \u201cIs the cluster healthy?\u201d \u2192 node readiness + pod restart rate\n- \u201cWhy is Element Call failing?\u201d \u2192 LiveKit/coturn pod restarts + synapse errors + ingress 5xx\n- \u201cIs Jenkins slow?\u201d \u2192 pod CPU/memory + HTTP latency metrics (if exported)\n\n## Why dashboards are not the KB\nDashboards are great references, but the assistant should query VictoriaMetrics directly for live answers and keep the\nKB focused on wiring, runbooks, and stable conventions."
  },
  {
    "path": "runbooks/template.md",
    "title": "<short title>",
    "tags": [
      "atlas",
      "<service>",
      "<topic>"
    ],
    "entrypoints": [
      "<hostnames if relevant>"
    ],
    "source_paths": [
      "services/<svc>",
      "clusters/atlas/<...>"
    ],
    "body": "# <Short title>\n\n## What this is\n\n## For users (how to)\n\n## For operators (where configured)\n\n## Troubleshooting (symptoms \u2192 checks)"
  }
]