Operations (Runbook)
How to configure, run, verify, and operate the bot. For why the operational constraints exist, follow the ADR links.
Configuration (environment)
All config is loaded by pydantic-settings from the environment / .env
(src/abitly_bot/config.py). Template: .env.example.
| Variable | Default | Purpose |
|---|---|---|
| BOT_TOKEN | — (required) | Telegram bot token (SecretStr). |
| ADMIN_IDS | "" | Comma-separated Telegram user ids allowed to run admin commands. |
| ACCOUNT_LINK_URL | "" | Base URL for the website→bot account-link button. |
| DB_HOST / DB_PORT | — / 5432 | PostgreSQL host/port. |
| DB_USERNAME / DB_PASSWORD | — (required) | DB credentials (SecretStr). |
| DB_NAME | — (required) | Database name. |
| DB_SCHEMA | abitly | Shared schema; set as search_path per connection. |
| DB_SSL | true | Keep TLS verification ON (ADR 0007). |
| DB_CA_FILE | null | Path to the managed-DB CA cert, if the provider uses its own CA. |
| REDIS_HOST / REDIS_PORT | localhost / 6379 | Redis (shared with backend). |
| REDIS_PASSWORD / REDIS_DB | null / 0 | Redis auth / logical DB. |
| DEFAULT_TELEGRAM_MESSAGE_MAX_RETRY | 3 | 429 retry attempts before dropping. |
| DEFAULT_TELEGRAM_MESSAGE_RETRY_DELAY_MS | 1000 | Base retry delay. |
| SEND_MAX_CONCURRENCY | 5 | Max concurrent outbound sends. |
| SEND_RATE_PER_SECOND | 30 | Global outbound rate cap (aiolimiter). |
| PORT | 3000 | aiohttp /healthcheck port. |
| TZ | Europe/Kyiv | APScheduler timezone for the 07:00 job. |
| LOG_LEVEL | INFO | Log level. |
Secrets: BOT_TOKEN, DB_PASSWORD, REDIS_PASSWORD are SecretStr and must come
from the platform’s secret store, never committed. .env and .secrets/ are gitignored.
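
As a rough illustration of the ADMIN_IDS format in the table above, the sketch below parses a comma-separated value into a set of integer ids using only the standard library. The function name parse_admin_ids is hypothetical; the real project loads settings through pydantic-settings in src/abitly_bot/config.py, and its actual validation may differ.

```python
def parse_admin_ids(raw: str) -> frozenset[int]:
    """Parse a comma-separated list of Telegram user ids.

    Hypothetical helper for illustration only. Blank entries are ignored,
    so the documented default (an empty string) yields an empty set.
    """
    return frozenset(int(part) for part in raw.split(",") if part.strip())
```

With this sketch, an empty ADMIN_IDS means nobody is treated as an admin, and a non-numeric entry raises ValueError at load time rather than being silently skipped.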
Run (dev)

    uv venv && source .venv/bin/activate   # or python -m venv .venv
    uv pip install -e ".[dev]"             # or pip install -e ".[dev]"
    cp .env.example .env                   # fill BOT_TOKEN (enough for a basic /start)
    python -m abitly_bot                   # starts polling + healthcheck on $PORT

Smoke check: the bot answers /start; GET http://localhost:$PORT/healthcheck → OK.
Quality gate (offline — no DB/Telegram needed)

    ruff check .   # lint (E,F,I,UP,B,ASYNC)
    mypy src       # strict
    pytest         # unit tests; integration tests auto-skip without DB_* env

This is the gate that must stay green for every change.
Boot & health behaviour
- The healthcheck server starts first so the platform’s probe passes during boot.
- The bot then does a fail-fast DB check (SELECT 1); if the DB is unreachable it disposes resources and exits non-zero (SystemExit(1)), so the platform restarts it rather than serving a broken bot. See Runtime Flows.
- Liveness endpoint: GET /healthcheck → 200 OK on $PORT. Note this proves the process is up, not that the DB is reachable (that is the fail-fast’s job at boot).
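
The boot order described above can be sketched roughly as follows. This is not the project's code: boot and its three callables (start_healthcheck, check_db, dispose) are hypothetical stand-ins for the real wiring, and only the ordering and the non-zero exit are taken from the runbook.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def boot(
    start_healthcheck: Callable[[], Awaitable[None]],
    check_db: Callable[[], Awaitable[None]],
    dispose: Callable[[], Awaitable[None]],
) -> None:
    """Start the liveness endpoint, then fail fast if the DB is unreachable.

    All three callables are hypothetical placeholders.
    """
    # 1. Healthcheck first, so the platform probe passes while booting.
    await start_healthcheck()
    try:
        # 2. Fail-fast DB check (the real bot runs SELECT 1 at this point).
        await check_db()
    except Exception as exc:
        # 3. Release resources, then exit non-zero so the platform restarts us.
        await dispose()
        raise SystemExit(1) from exc
```

Because the liveness endpoint only reports that the process is up, the exit code is what tells the platform that the DB check failed.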
Database TLS & CA
TLS verification is always on (ADR 0007). For a managed DB that presents its own CA:
- Obtain the provider’s CA certificate (download the official cert; do not disable verification).
- Point DB_CA_FILE at it.
- Ensure the host’s egress IP is on the provider’s trusted-sources allowlist, or TCP to the DB port will silently time out (packets dropped).
See docs/MIGRATION_STATUS.md (Blocker 1) for the concrete
provider steps used during bring-up.
Running integration tests
These need live (read-only) DB access and run under the integration marker:

    export DB_CA_FILE=/abs/path/to/ca.pem   # if the DB uses a private CA
    # plus DB_HOST/DB_PORT/DB_USERNAME/DB_PASSWORD/DB_NAME in the env
    pytest tests/integration -v -m integration

test_schema_reflection.py is the schema-drift safety net (ADR 0006); it tolerates the two
pending filter tables but surfaces any other mismatch.
Scheduled job
- What: daily open-day reminders for events 1 or 3 days away.
- When: 07:00 in TZ (default Europe/Kyiv).
- Semantics: coalesce=True (a restart-missed run fires once on resume), misfire_grace_time=3600. Jobstore is in-memory, so job state is not shared with the backend or across instances. Code: infra/scheduler.py.
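
To make the "1 or 3 days away" rule concrete, here is a small standard-library sketch of the selection step only. The function name and the events mapping are hypothetical; the real job lives in infra/scheduler.py and queries the database, which is not shown here.

```python
from datetime import date, timedelta

# Offsets taken from the runbook: remind 1 and 3 days before the event.
REMINDER_OFFSETS_DAYS = (1, 3)


def events_due_for_reminder(today: date, events: dict[str, date]) -> list[str]:
    """Return the names of events exactly 1 or 3 days after `today`.

    `events` maps a hypothetical event name to its date.
    """
    due_dates = {today + timedelta(days=n) for n in REMINDER_OFFSETS_DAYS}
    return sorted(name for name, day in events.items() if day in due_dates)
```

In this sketch an event two days away gets no reminder, and an event happening today is ignored as well.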
Flood / 429 handling
Outbound fan-outs go through MessageSender: bounded concurrency
(SEND_MAX_CONCURRENCY), a global rate cap (SEND_RATE_PER_SECOND), and 429 retry via
TelegramRetryAfter (sleep retry_after, re-queue up to …_MAX_RETRY). Blocked users
(TelegramForbiddenError) and unexpected errors are logged and dropped per message
without aborting the batch. Tune via the SEND_* / …_RETRY* env vars. Details:
ADR 0005.
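
The behaviour above can be approximated with the standard library alone. The sketch below is a simplification under stated assumptions, not the MessageSender implementation: RetryAfter is a stand-in for the 429 error, send_all and its parameters are hypothetical names, retries happen in place rather than through a re-queue, and the global rate cap (SEND_RATE_PER_SECOND) is left out.

```python
import asyncio
from collections.abc import Awaitable, Callable


class RetryAfter(Exception):
    """Stand-in for a 429 error that carries a retry_after delay (seconds)."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"retry after {retry_after}s")
        self.retry_after = retry_after


async def send_all(
    chat_ids: list[int],
    send: Callable[[int], Awaitable[None]],
    max_concurrency: int = 5,
    max_retry: int = 3,
) -> list[int]:
    """Send to every chat; return the ids that were delivered."""
    semaphore = asyncio.Semaphore(max_concurrency)  # bounded concurrency
    delivered: list[int] = []

    async def send_one(chat_id: int) -> None:
        async with semaphore:
            # One initial attempt plus up to max_retry retries.
            for _attempt in range(max_retry + 1):
                try:
                    await send(chat_id)
                except RetryAfter as exc:
                    # 429: wait as instructed, then try again.
                    await asyncio.sleep(exc.retry_after)
                except Exception:
                    # Blocked user or unexpected error: drop this message only.
                    return
                else:
                    delivered.append(chat_id)
                    return
            # Retries exhausted: the message is dropped.

    await asyncio.gather(*(send_one(chat_id) for chat_id in chat_ids))
    return delivered
```

Each message fails or succeeds on its own, so one blocked user never aborts the rest of the batch.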
Scaling constraints
Run exactly one instance. Two independent constraints enforce this:
- Long-polling: multiple pollers double-deliver updates (ADR 0001).
- In-process state: the MessageSender queue/limiter and APScheduler jobstore are per-process; a second instance would not share the rate budget and would double-fire the daily job (ADR 0005).

FSM state is shared (Redis), but that alone is not enough to make the bot multi-instance-safe. Going multi-instance would require: webhook ingress (or a single leader poller), a shared/distributed rate limiter, and a single scheduler owner.
Graceful shutdown
On KeyboardInterrupt / SystemExit, the finally block in main() stops the scheduler
(wait=False) and closes the bot session, Redis, DB engine, and healthcheck runner.
Code: src/abitly_bot/__main__.py:83-89.
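
The shutdown pattern can be sketched as a try/finally around the main coroutine. The names run_with_cleanup, run, and closers are hypothetical, and suppressing errors from individual closers is an assumption of this sketch rather than documented behaviour of __main__.py.

```python
import asyncio
import contextlib
from collections.abc import Awaitable, Callable


async def run_with_cleanup(
    run: Callable[[], Awaitable[None]],
    closers: list[Callable[[], Awaitable[None]]],
) -> None:
    """Run the bot and always release resources afterwards.

    `run` and `closers` are hypothetical stand-ins for polling and for the
    scheduler / bot session / Redis / DB engine / healthcheck shutdown calls.
    """
    try:
        await run()
    finally:
        # Runs whether run() returns or raises (including SystemExit).
        for close in closers:
            # Assumption: one failing closer should not block the others.
            with contextlib.suppress(Exception):
                await close()
```

The original exception still propagates after cleanup, so the process exit code is preserved.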
Cutover checklist
- Allowlist the egress IP + provision DB_CA_FILE; run the integration tests green.
- Land the backend prerequisites (mint endpoint + the two filter tables; verify M2M names). See Data Model.
- Live smoke against a staging token: /start, paste an offer URL, /myoffers, /statistics.
- Point the production token at this app (Ф7 wiring is ready).

Living status: docs/MIGRATION_STATUS.md.