Monitoring & Alerting

This section covers the monitoring infrastructure for the Hello World DAO LLC platform: canister cycle monitoring + auto top-up via GitHub Actions, and application error tracking via GlitchTip.

Overview

The monitoring stack is intentionally lightweight — there is no Prometheus, Alertmanager, Grafana, or PagerDuty. Two systems carry the load:

Canister metrics + cycle top-ups: A GitHub Actions cron (monitor-metrics.yml in ops-infra) runs every 6 hours, calls check-cycles.sh against the 12 backend canisters and 6 frontend asset canisters, and automatically tops up any canister below the threshold.
Application error tracking: GlitchTip (self-hosted Sentry-compatible service at glitchtip.founderyos.dev) captures runtime errors from every suite. Source maps are uploaded by CI on every release.

Earlier drafts of this page mentioned Prometheus rules, Alertmanager, Grafana dashboards, Slack/PagerDuty routing — none of those are deployed. If you see references to them in older docs, they are aspirational, not current.

Quick Links

Resource	Description
GlitchTip	Application errors (single project, per-suite tags)
IC Dashboard	Internet Computer canister + subnet status
GitHub Actions — ops-infra	`monitor-metrics.yml` cron + manual triggers
Canister Cycle Monitoring	Canonical canister inventory + cycle budgets

Monitoring Architecture

┌──────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  Canisters   │───▶│  GitHub Actions cron │───▶│  Workflow summary    │
│  (IC mainnet)│    │  monitor-metrics.yml │    │  + auto top-up       │
└──────────────┘    │  (every 6h)          │    └─────────────────────┘
                    └─────────────────────┘             │
                                                        ▼
                                                 Email on failure
                                                 (devops@helloworlddao.com)

┌──────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  6 Suites    │───▶│  Sentry SDK          │───▶│  GlitchTip           │
│  (browser)   │    │  (per-suite DSN tag) │    │  glitchtip.          │
└──────────────┘    └─────────────────────┘    │  founderyos.dev/4    │
                                               └─────────────────────┘
                                                        │
                                                        ▼
                                                 Email on new issue

Canister Metrics — GitHub Actions Cron

Workflow

ops-infra/.github/workflows/monitor-metrics.yml

Schedule: every 6 hours (0 */6 * * *, i.e. 00:00, 06:00, 12:00, and 18:00 UTC; GitHub Actions evaluates cron schedules in UTC)
Trigger (manual): gh workflow run monitor-metrics.yml --repo Hello-World-Co-Op/ops-infra
Identity: github-ci dfx identity (PEM in DFX_IDENTITY_PEM secret)
Action: runs check-cycles.sh against the canister fleet and automatically tops up canisters below the threshold (default threshold: 100B cycles; default top-up amount: 0.05 ICP)

Thresholds

Threshold	Cycles	Action
Critical	< 100B (0.1 TC)	Auto top-up + workflow exits non-zero
Warning	< 500B (0.5 TC)	Logged in workflow summary; manual review
Healthy	> 1 TC	No action
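
The threshold table above maps onto a small classification function. The sketch below is illustrative only: `classifyCycles` and the constant names are hypothetical and not taken from check-cycles.sh. Treating every balance at or above 500B as healthy is also an assumption, because the table does not say how balances between 500B and 1 TC are handled.

```typescript
// Hypothetical sketch of the threshold table; not the logic in check-cycles.sh.
type CycleStatus = "critical" | "warning" | "healthy";

// Plain numbers are safe here: balances stay far below
// Number.MAX_SAFE_INTEGER (about 9,007 TC).
const CRITICAL_BELOW = 100e9; // 100B cycles (0.1 TC)
const WARNING_BELOW = 500e9; // 500B cycles (0.5 TC)

function classifyCycles(balance: number): CycleStatus {
  if (balance < CRITICAL_BELOW) return "critical"; // auto top-up, non-zero exit
  if (balance < WARNING_BELOW) return "warning"; // logged for manual review
  // Assumption: 500B and above is treated as healthy.
  return "healthy";
}
```

For example, a canister holding 250B cycles classifies as `"warning"`, so it would be listed in the summary but not topped up.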

Reading workflow output

Each run produces a markdown summary (visible on the workflow run page) listing per-canister balances and any top-ups performed. If the workflow fails, an email goes to devops@helloworlddao.com.

For deep-dive procedures see:

Canister Cycle Monitoring — canonical inventory + standalone check script
Automated Cycles Top-Up System — full GHA cron walkthrough at operations/automated-cycles-topup.md in the repo (excluded from rendered site)
Cycles Top-Up runbook — what to do when an alert fires

Application Errors — GlitchTip

Project

Endpoint: https://glitchtip.founderyos.dev
Project ID: 4 (single project, all suites tagged)
DSN format: https://<key>@glitchtip.founderyos.dev/4 (UUID dashes stripped — Sentry SDK rejects dashes)
Auth token (CI source-map upload): SENTRY_AUTH_TOKEN GH secret in each suite repo

What gets sent

Suite	Tag	Source-map upload
`dao-suite`	`suite=dao`	On release in CI
`dao-admin-suite`	`suite=dao-admin`	On release in CI
`governance-suite`	`suite=governance`	On release in CI
`marketing-suite`	`suite=marketing`	On release in CI
`otter-camp-suite`	`suite=otter-camp`	On release in CI
`think-tank-suite`	`suite=think-tank`	On release in CI

Notification

GlitchTip emails devops@helloworlddao.com on the first occurrence of a new issue and again on regression. There is no Slack/PagerDuty hook.

Common pitfalls

DSN with dashes: normalizeDsn() MUST strip dashes from the UUID key. The Sentry SDK silently rejects dashed UUIDs.
Celery worker outages: GlitchTip uses Celery; if events stop arriving, check the worker on the FOS cluster (Graydon owns this, but symptoms surface here).
Source maps missing: verify that the suite's release CI uploaded them, and check that the release-please PR was actually merged and released.

Alert Thresholds (canister + UX)

Alert	Source	Threshold	Severity	Response Time
Low cycles	`monitor-metrics.yml`	< 500B cycles	Warning	< 1 hour
Critical cycles	`monitor-metrics.yml`	< 100B cycles	Critical	< 15 minutes
New JS error	GlitchTip	first occurrence	Info	Triage same day
Error spike	GlitchTip	> 10x baseline	Warning	Triage same day
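
The "> 10x baseline" rule can be read as a simple comparison against a recent average. The sketch below is one possible interpretation, not a GlitchTip feature: `isErrorSpike` is a hypothetical helper, the baseline is assumed to be the mean of earlier per-window error counts, and the floor of 1 (so a quiet baseline does not flag every single error) is also an assumption.

```typescript
// Hypothetical helper: flags a spike when the current window exceeds
// `factor` times the mean of previous windows.
function isErrorSpike(
  currentCount: number,
  baselineCounts: number[],
  factor: number = 10,
): boolean {
  if (baselineCounts.length === 0) return false; // no history, no verdict
  const total = baselineCounts.reduce((sum, n) => sum + n, 0);
  // Assumption: floor the baseline at 1 error per window.
  const baseline = Math.max(total / baselineCounts.length, 1);
  return currentCount > factor * baseline;
}
```

With previous windows of 4, 5, and 3 errors the baseline is 4, so 45 errors in the current window counts as a spike and 40 does not.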

Runbooks

For operational procedures, see:

Topic	Runbook
Low cycles	Cycles Top-Up Procedure
High errors	High Error Rate Triage
Canister down	Canister Unresponsive Recovery
Failed deploy	Deployment Failure Recovery
Database issues	Database Connectivity

Setup Guide

1. GitHub Secrets

Add these to the ops-infra repository (Settings → Secrets and variables → Actions):

Secret	Purpose
`DFX_IDENTITY_PEM`	dfx identity for canister status checks + top-ups
`IC_PRINCIPAL`	Identity principal (info only)

Each suite repo also needs:

Secret	Purpose
`SENTRY_AUTH_TOKEN`	GlitchTip auth token (scopes: `project:releases`, `org:read`) for CI source-map upload
`VITE_SENTRY_DSN` (env var, not secret)	Per-suite DSN; public, exposed in `.env.staging` / `.env.production`

2. Wire suite Sentry SDK

// In each suite's main.ts
import * as Sentry from '@sentry/react';

Sentry.init({
  dsn: normalizeDsn(import.meta.env.VITE_SENTRY_DSN),  // strips dashes from UUID
  environment: import.meta.env.MODE,
  release: import.meta.env.VITE_RELEASE_VERSION,
  initialScope: { tags: { suite: 'dao-admin' } },     // per-suite tag
});
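
The snippet above calls normalizeDsn() but this page does not show it. A minimal sketch of what such a helper could look like follows; the signature and the pass-through behaviour for unparseable input are assumptions, so check the suite's actual implementation before relying on it.

```typescript
// Sketch only: strips dashes from the key portion of a DSN of the form
// https://<key>@host/<projectId>. Anything that does not match is returned
// unchanged (apart from trimming), and an unset DSN stays undefined.
function normalizeDsn(dsn: string | undefined): string | undefined {
  if (!dsn) return undefined;
  return dsn.trim().replace(
    /^(https?:\/\/)([^@/]+)@/,
    (_match: string, scheme: string, key: string) =>
      `${scheme}${key.replace(/-/g, "")}@`,
  );
}
```

Only the key before the `@` is touched, so dashes in the host name (for example a hyphenated subdomain) survive.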

3. Confirm cron + alerts

After enabling monitor-metrics.yml, manually trigger one run and confirm that the summary lists all canisters (12 backend + 6 frontend).
Trigger a synthetic Sentry error from a deployed suite and confirm it appears in GlitchTip within ~30 seconds.
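
One way to make the synthetic error easy to find in GlitchTip is to give it a unique, searchable message. The helper below is a hypothetical sketch: `sendSyntheticError` and the marker format are not part of any suite, and the capture function is passed in so the helper can be exercised without the SDK. In a suite you would pass `Sentry.captureException`, which returns the event ID as a string.

```typescript
// Hypothetical helper for a one-off monitoring check.
type CaptureFn = (error: Error) => string;

function sendSyntheticError(
  capture: CaptureFn,
  suite: string,
  now: Date = new Date(),
): { marker: string; eventId: string } {
  // The timestamp makes each check distinguishable in the issue list.
  const marker = `synthetic-monitoring-check ${suite} ${now.toISOString()}`;
  const eventId = capture(new Error(marker));
  return { marker, eventId };
}

// In a deployed suite (assumes Sentry.init has already run):
//   const { marker } = sendSyntheticError(Sentry.captureException, "dao-admin");
//   then search GlitchTip for the marker text.
```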

Troubleshooting

Cycle workflow fails

Check the workflow run page for the dfx command that failed.
Common causes: DFX_IDENTITY_PEM rotated, ICP wallet empty, IC subnet outage (check IC dashboard).
Re-run from the Actions tab once the underlying issue is fixed.

GlitchTip stops receiving events

Open https://glitchtip.founderyos.dev — verify the UI loads.
Check the FOS k8s cluster's GlitchTip Celery worker (kubectl -n hello-world get pods -l app=glitchtip-worker).
Verify the suite is sending: open browser devtools, trigger an error, and check the Network tab for a POST to /api/<id>/envelope/ (older SDK versions post to /api/<id>/store/).
If POSTs are rejected with 400, the DSN UUID likely has dashes — verify normalizeDsn() is applied.
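
When checking the Network tab, it helps to know which URL to look for. Sentry SDKs derive the ingest URL from the DSN; the sketch below reproduces that mapping for the simple `https://<key>@host/<projectId>` form used on this page. `envelopeUrlFromDsn` is a hypothetical helper, and it deliberately ignores DSNs with an extra path prefix before the project ID.

```typescript
// Sketch: derive the URL a current Sentry SDK posts events to.
// Returns null when the DSN does not have the expected shape.
function envelopeUrlFromDsn(dsn: string): string | null {
  const match = /^(https?):\/\/([^@/]+)@([^/]+)\/(\d+)\/?$/.exec(dsn.trim());
  if (!match) return null;
  const protocol = match[1];
  const host = match[3]; // match[2] is the public key, not needed here
  const projectId = match[4];
  return `${protocol}://${host}/api/${projectId}/envelope/`;
}
```

If no request to the derived URL shows up after triggering an error, the SDK is probably not initialised or the DSN is being rejected before anything is sent.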

"Canister not in monitoring"

Open ops-infra/scripts/check-cycles.sh.
Confirm the canister ID is in the CANISTERS=() array.
If missing, add it (one entry per active canister) and PR the change.
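
For a fleet of 18 canisters, comparing the array against the canonical inventory by eye is error-prone. The sketch below shows one way to automate the comparison. Both functions are hypothetical, the parser assumes a conventional bash array with whitespace-separated, optionally quoted entries, and it will stop early if a comment inside the array contains a closing parenthesis. The canister IDs in the usage comment are placeholders.

```typescript
// Sketch: extract entries from a bash `CANISTERS=( ... )` array.
function parseCanistersArray(script: string): string[] {
  const match = /CANISTERS=\(([\s\S]*?)\)/.exec(script);
  if (!match) return [];
  return (match[1] || "")
    .split("\n")
    .map((line) => line.replace(/#.*$/, "")) // drop trailing comments
    .join(" ")
    .split(/\s+/)
    .map((token) => token.replace(/^["']|["']$/g, "")) // strip quotes
    .filter((token) => token.length > 0);
}

// Sketch: list active canister IDs that the script does not monitor.
function findUnmonitored(active: string[], monitored: string[]): string[] {
  return active.filter((id) => monitored.indexOf(id) === -1);
}

// Usage (placeholder IDs):
//   const monitored = parseCanistersArray(scriptText);
//   findUnmonitored(["aaaaa-bbbbb-cai", "ccccc-ddddd-cai"], monitored);
```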

Related Documentation

Incident Response Runbook
Canister Cycle Monitoring
Automated Cycles Top-Up System (operations/automated-cycles-topup.md — excluded from rendered site, view in repo)
CI/CD Pipeline
ops-infra repository
