Monitoring & Alerting
This section covers the monitoring infrastructure for the Hello World DAO LLC platform: canister cycle monitoring + auto top-up via GitHub Actions, and application error tracking via GlitchTip.
Overview
The monitoring stack is intentionally lightweight — there is no Prometheus, Alertmanager, Grafana, or PagerDuty. Two systems carry the load:
- Canister metrics + cycle top-ups: A GitHub Actions cron (`monitor-metrics.yml` in `ops-infra`) runs every 6 hours, calls `check-cycles.sh` against the 12 backend canisters + 6 frontend asset canisters, and auto-tops-up any canister below threshold.
- Application error tracking: GlitchTip (self-hosted Sentry-compatible service at `glitchtip.founderyos.dev`) captures runtime errors from every suite. Source maps are uploaded by CI on every release.
Earlier drafts of this page mentioned Prometheus rules, Alertmanager, Grafana dashboards, and Slack/PagerDuty routing. None of those are deployed. If you see references to them in older docs, treat them as aspirational, not current.
Quick Links
| Resource | Description |
|---|---|
| GlitchTip | Application errors (single project, per-suite tags) |
| IC Dashboard | Internet Computer canister + subnet status |
| GitHub Actions — ops-infra | monitor-metrics.yml cron + manual triggers |
| Canister Cycle Monitoring | Canonical canister inventory + cycle budgets |
Monitoring Architecture
┌──────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  Canisters   │───▶│ GitHub Actions cron │───▶│  Workflow summary   │
│ (IC mainnet) │    │ monitor-metrics.yml │    │  + auto top-up      │
└──────────────┘    │ (every 6h)          │    └─────────────────────┘
                    └─────────────────────┘               │
                                                          ▼
                                                  Email on failure
                                             (devops@helloworlddao.com)

┌──────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  6 Suites    │───▶│ Sentry SDK          │───▶│ GlitchTip           │
│  (browser)   │    │ (per-suite DSN tag) │    │ glitchtip.          │
└──────────────┘    └─────────────────────┘    │ founderyos.dev/4    │
                                               └─────────────────────┘
                                                          │
                                                          ▼
                                                 Email on new issue

Canister Metrics: GitHub Actions Cron
Workflow
ops-infra/.github/workflows/monitor-metrics.yml
- Schedule: every 6 hours (`0 */6 * * *`)
- Trigger (manual): `gh workflow run monitor-metrics.yml --repo Hello-World-Co-Op/ops-infra`
- Identity: `github-ci` dfx identity (PEM in `DFX_IDENTITY_PEM` secret)
- Action: runs `check-cycles.sh` against the canister fleet, auto-tops-up canisters below threshold (default: 100B cycles, 0.05 ICP top-up amount)
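The schedule `0 */6 * * *` fires at minute 0 of every sixth hour, and GitHub Actions evaluates cron expressions in UTC, so runs are due at 00:00, 06:00, 12:00, and 18:00 UTC. As a minimal sketch (the helper name `nextScheduledRun` is hypothetical and not part of `ops-infra`), the next due time after a given moment could be computed like this:

```typescript
// Hypothetical helper: next due time for the cron "0 */6 * * *" (UTC),
// strictly after `now`.
function nextScheduledRun(now: Date): Date {
  const next = new Date(now.getTime());
  // Drop minutes, seconds, and milliseconds so we sit on an hour boundary.
  next.setUTCMinutes(0, 0, 0);
  const hour = next.getUTCHours();
  // Jump to the next multiple of 6; an hour value of 24 rolls over to 00:00 the next day.
  next.setUTCHours(hour - (hour % 6) + 6);
  return next;
}

console.log(nextScheduledRun(new Date()).toISOString());
```

GitHub may start scheduled workflows late under load, so the result is better read as the earliest expected start than as an exact time.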
Thresholds
| Threshold | Cycles | Action |
|---|---|---|
| Critical | < 100B (0.1 TC) | Auto top-up + workflow exits non-zero |
| Warning | < 500B (0.5 TC) | Logged in workflow summary; manual review |
| Healthy | > 1 TC | No action |
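The table can be read as a simple classification of a canister's cycle balance. The sketch below is illustrative only: `classifyBalance` and `CycleStatus` are hypothetical names, not code from `check-cycles.sh`, and because the table does not say how balances between 500B and 1 TC are treated, the sketch assumes they need no action.

```typescript
type CycleStatus = "critical" | "warning" | "healthy";

const CRITICAL_BELOW = 100e9; // 100B cycles (0.1 TC)
const WARNING_BELOW = 500e9; // 500B cycles (0.5 TC)

// Hypothetical classifier mirroring the thresholds table.
// Plain numbers are exact here only up to about 9e15 cycles.
function classifyBalance(cycles: number): CycleStatus {
  if (cycles < CRITICAL_BELOW) return "critical"; // auto top-up + non-zero exit
  if (cycles < WARNING_BELOW) return "warning"; // logged for manual review
  // Assumption: the 500B to 1 TC band, undefined in the table, needs no action.
  return "healthy";
}

console.log(classifyBalance(250e9));
```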
Reading workflow output
Each run produces a markdown summary (visible on the workflow run page) listing per-canister balances and any top-ups performed. If the workflow fails, an email goes to devops@helloworlddao.com.
For deep-dive procedures see:
- Canister Cycle Monitoring: canonical inventory + standalone check script
- Automated Cycles Top-Up System: full GHA cron walkthrough at `operations/automated-cycles-topup.md` in the repo (excluded from rendered site)
- Cycles Top-Up runbook: what to do when an alert fires
Application Errors — GlitchTip
Project
- Endpoint: `https://glitchtip.founderyos.dev`
- Project ID: `4` (single project, all suites tagged)
- DSN format: `https://<key>@glitchtip.founderyos.dev/4` (UUID dashes stripped; the Sentry SDK rejects dashes)
- Auth token (CI source-map upload): `SENTRY_AUTH_TOKEN` GH secret in each suite repo
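A DSN in this format packs three values into one URL: the public key, the GlitchTip host, and the project ID. The following sketch splits them apart, which can help when checking an environment file by hand. `parseDsn` and `DsnParts` are hypothetical names used for illustration, and the key in the example is a placeholder, not a real credential.

```typescript
interface DsnParts {
  publicKey: string;
  host: string;
  projectId: string;
}

// Hypothetical parser for DSNs shaped like https://<key>@<host>/<project-id>.
function parseDsn(dsn: string): DsnParts | null {
  const match = /^https:\/\/([^@\/]+)@([^\/]+)\/(\d+)$/.exec(dsn);
  if (!match) return null; // not in the expected shape
  const [, publicKey, host, projectId] = match;
  return { publicKey, host, projectId };
}

const example = parseDsn(
  "https://0123456789abcdef0123456789abcdef@glitchtip.founderyos.dev/4"
);
// A dash in the key would mean the DSN still needs normalizing.
console.log(example?.projectId, example?.publicKey.includes("-"));
```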
What gets sent
| Suite | Tag | Source-map upload |
|---|---|---|
| `dao-suite` | `suite=dao` | On release in CI |
| `dao-admin-suite` | `suite=dao-admin` | On release in CI |
| `governance-suite` | `suite=governance` | On release in CI |
| `marketing-suite` | `suite=marketing` | On release in CI |
| `otter-camp-suite` | `suite=otter-camp` | On release in CI |
| `think-tank-suite` | `suite=think-tank` | On release in CI |
Notification
GlitchTip emails devops@helloworlddao.com on first occurrence of a new issue + on regression. There is no Slack/PagerDuty hook.
Common pitfalls
- DSN with dashes: `normalizeDsn()` MUST strip dashes from the UUID key. The Sentry SDK silently rejects dashed UUIDs.
- Celery worker outages: GlitchTip uses Celery; if events stop arriving, check the worker on the FOS cluster (Graydon owns this, but symptoms surface here).
- Source maps missing: verify the suite's release CI uploaded them; check that the `release-please` PR was actually released.
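The first pitfall is easier to reason about with the helper in view. This page does not show how `normalizeDsn()` is implemented in the suites, so the following is only a sketch of what such a helper might look like, assuming the key sits between the scheme and the `@` as in the DSN format above. The example key is a placeholder.

```typescript
// Hypothetical sketch of normalizeDsn(); the suites' real helper may differ.
function normalizeDsn(dsn: string | undefined): string | undefined {
  if (!dsn) return undefined; // no DSN configured: nothing to normalize
  const match = /^(https?:\/\/)([^@]+)@(.+)$/.exec(dsn);
  if (!match) return dsn; // not DSN-shaped: return unchanged
  const [, scheme, key, rest] = match;
  // Strip dashes from the key only; host and project ID are left untouched.
  return `${scheme}${key.replace(/-/g, "")}@${rest}`;
}

console.log(
  normalizeDsn("https://01234567-89ab-cdef-0123-456789abcdef@glitchtip.founderyos.dev/4")
);
```

Limiting the replacement to the key avoids mangling a hostname that happens to contain a dash.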
Alert Thresholds (canister + UX)
| Alert | Source | Threshold | Severity | Response Time |
|---|---|---|---|---|
| Low cycles | monitor-metrics.yml | < 1T cycles | Warning | < 1 hour |
| Critical cycles | monitor-metrics.yml | < 500B cycles | Critical | < 15 minutes |
| New JS error | GlitchTip | first occurrence | Info | Triage same day |
| Error spike | GlitchTip | > 10x baseline | Warning | Triage same day |
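The "Error spike" row compares the current error rate to a baseline. A minimal sketch of that comparison follows. `isErrorSpike`, the choice of events per hour as the unit, and the handling of a zero baseline are all assumptions, since this page does not define how the baseline is measured.

```typescript
// Hypothetical check for the "> 10x baseline" rule.
function isErrorSpike(
  currentPerHour: number,
  baselinePerHour: number,
  factor: number = 10
): boolean {
  if (baselinePerHour <= 0) {
    // Assumption: with no baseline, errors are left to the "New JS error" alert.
    return false;
  }
  return currentPerHour > factor * baselinePerHour;
}

console.log(isErrorSpike(250, 12));
```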
Runbooks
For operational procedures, see:
| Topic | Runbook |
|---|---|
| Low cycles | Cycles Top-Up Procedure |
| High errors | High Error Rate Triage |
| Canister down | Canister Unresponsive Recovery |
| Failed deploy | Deployment Failure Recovery |
| Database issues | Database Connectivity |
Setup Guide
1. GitHub Secrets
Add these to the ops-infra repository (Settings → Secrets and variables → Actions):
| Secret | Purpose |
|---|---|
| `DFX_IDENTITY_PEM` | dfx identity for canister status checks + top-ups |
| `IC_PRINCIPAL` | Identity principal (info only) |
Each suite repo also needs:
| Secret | Purpose |
|---|---|
| `SENTRY_AUTH_TOKEN` | GlitchTip auth token (scopes: `project:releases`, `org:read`) for CI source-map upload |
| `SENTRY_DSN` (env var, not secret) | Per-suite DSN; public, exposed in `.env.staging` / `.env.production` |
2. Wire suite Sentry SDK
// In each suite's main.ts
import * as Sentry from '@sentry/react';
// normalizeDsn is the suite's own helper; import it from wherever the suite defines it.

Sentry.init({
  dsn: normalizeDsn(import.meta.env.VITE_SENTRY_DSN), // strips dashes from UUID
  environment: import.meta.env.MODE,
  release: import.meta.env.VITE_RELEASE_VERSION,
  initialScope: { tags: { suite: 'dao-admin' } }, // per-suite tag; use this suite's own value
});

3. Confirm cron + alerts
- After enabling `monitor-metrics.yml`, manually trigger one run and confirm the summary lists all canisters.
- Trigger a synthetic Sentry error from a deployed suite and confirm it appears in GlitchTip within ~30 seconds.
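One way to produce the synthetic error is to capture a clearly labelled exception from a temporary debug hook. The sketch below is an assumption about how that might look: `sendSyntheticError` is a hypothetical helper, and the capture function is passed in so the snippet does not depend on any SDK setup. In a suite you would pass `Sentry.captureException`, which returns the event ID.

```typescript
// Any function that reports an error and returns an event ID.
type Capture = (error: unknown) => string;

// Hypothetical helper: sends one labelled, timestamped test error.
function sendSyntheticError(
  capture: Capture,
  suite: string,
  now: Date = new Date()
): string {
  // The label and timestamp make the event easy to find and safe to resolve afterwards.
  const error = new Error(
    `[synthetic] monitoring check for ${suite} at ${now.toISOString()}`
  );
  return capture(error);
}

// Stand-in capture function for trying the helper outside a suite.
const eventId = sendSyntheticError(() => "local-test-id", "dao-admin");
console.log(eventId);
```

Searching GlitchTip for the `[synthetic]` prefix or the returned event ID should then confirm the event arrived with the expected `suite` tag.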
Troubleshooting
Cycle workflow fails
- Check the workflow run page for the dfx command that failed.
- Common causes: `DFX_IDENTITY_PEM` rotated, ICP wallet empty, IC subnet outage (check IC dashboard).
- Re-run from the Actions tab once the underlying issue is fixed.
GlitchTip stops receiving events
- Open https://glitchtip.founderyos.dev and verify the UI loads.
- Check the FOS k8s cluster's GlitchTip Celery worker (`kubectl -n hello-world get pods -l app=glitchtip-worker`).
- Verify the suite is sending: open browser devtools, trigger an error, and check the Network tab for a `/api/<id>/store/` POST.
- If POSTs are rejected with 400, the DSN UUID likely has dashes; verify `normalizeDsn()` is applied.
"Canister not in monitoring"
- Open `ops-infra/scripts/check-cycles.sh`.
- Confirm the canister ID is in the `CANISTERS=()` array.
- If missing, add it (one entry per active canister) and PR the change.
Related Documentation
- Incident Response Runbook
- Canister Cycle Monitoring
- Automated Cycles Top-Up System (`operations/automated-cycles-topup.md`, excluded from rendered site, view in repo)
- CI/CD Pipeline
- ops-infra repository