
[Docs] Document Realtime scaling limits

Append known capacity limits and failure modes for the single-container
Supabase Realtime deployment, with concrete triggers for when to cluster
or add pgbouncer. Out of MVP scope — reference for future capacity work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User, 1 month ago
parent
commit 8444d3d
1 changed file with 31 additions and 0 deletions

+ 31 - 0
research/TECHFILE.md

@@ -772,3 +772,34 @@ The updated scope correctly addresses all three Critical findings and all seven
 | Critical      | 2     | GoTrue anonymous auth env var; Edge Functions not in default self-hosted stack                                                                                                                                                                   |
 | Recommended   | 8     | argon2 Docker native build; iron-session v8 API; persistQueryClient adapter packages; serwist sw.ts authoring; t3-env over raw zod + SUPABASE_URL naming fix; tini confirmation; Caddy/Docker internal URL networking; Vitest/Playwright scoping |
 | No New Action | —     | All first-review findings confirmed adopted in updated scope                                                                                                                                                                                     |
+
+---
+
+## Realtime Scaling Limits (added 2026-05-08)
+
+Self-hosted Supabase Realtime is adequate for the MVP and for the low thousands of concurrent users with the current single-container config. This section documents the known limits so future capacity work has a baseline.
+
+**Architecture today:** one `supabase-realtime` container (BEAM/Elixir), one logical replication slot from Postgres, the `postgres_cdc_rls` extension evaluating RLS per subscriber per change, and a single shared Postgres for Realtime + PostgREST + Auth.
+
+**Comfortable limits (single-node):**
+
+- ~10–30k concurrent WebSocket connections per BEAM node (RAM-bound).
+- Hundreds of writes/sec on watched `public.movies` rows.
+- `REPLICA IDENTITY FULL` on `movies` is cheap because rows are ~1KB; would be expensive on wide/large tables.
+
+**Failure modes at scale:**
+
+1. **Single realtime container = single fan-out CPU.** Hot groups (e.g., 100+ users in one list, all subscribed) trigger one RLS policy evaluation per subscriber on every UPDATE, so cost grows linearly with group size. The result is CPU saturation, not a crash. Mitigation: cluster Realtime via libcluster (BEAM distributed), which needs DNS-based discovery and `DNS_CLUSTER_QUERY` env wired into compose.
+2. **Single logical replication slot.** A stuck or slow consumer forces Postgres to retain WAL, which can fill the disk. Mitigation: monitor how far `pg_replication_slots.confirmed_flush_lsn` lags behind the current WAL position; alert before WAL fills the volume.
+3. **Shared Postgres connection pool.** Realtime + PostgREST + Auth + cron all hit the same DB. At ~1000+ concurrent users, add **pgbouncer** in transaction-pooling mode in front of Postgres; raise `max_connections` only as a stopgap.
+4. **postgres_cdc_rls per-subscriber RLS evaluation.** Current `movies` SELECT policy is cheap (one membership check). If policies grow more complex (joins, multi-table subqueries), evaluation cost compounds with subscriber count.
+5. **Tenant table is a single point of config.** `_realtime.tenants` holds encrypted DB credentials with `DB_ENC_KEY=supabaserealtime`. Rotating that key requires re-encrypting the tenant row.
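
Review note on failure mode 2: the slot-lag check could be sketched roughly as below. This is a minimal illustration, not the project's monitoring code. It assumes the two LSN strings were already read from Postgres (for example `pg_current_wal_lsn()` and `pg_replication_slots.confirmed_flush_lsn`). The function names and the `max_fraction` alert threshold are hypothetical; the original text does not specify a threshold.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the slot's consumer has not yet confirmed (never negative)."""
    return max(0, lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn))


def should_alert(lag_bytes: int, volume_free_bytes: int, max_fraction: float = 0.25) -> bool:
    """Alert once retained WAL exceeds a fraction of the free space on the volume.

    max_fraction is an assumed placeholder, not a value from the source document.
    """
    return lag_bytes > volume_free_bytes * max_fraction
```

Inside the database, `pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)` yields the same byte difference directly; the Python version is only useful when the LSNs arrive as strings in an external monitor.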
+
+**Capacity triggers — when to act:**
+
+- Realtime container CPU sustained >70% → cluster.
+- WS connect failures or `phx_close` storms → check tenant config + connection pool.
+- WAL volume growth >10%/day with no corresponding DB write growth → check replication slot lag.
+- p95 update-broadcast latency >500ms → fan-out bottleneck.
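
Review note on the triggers above: they could be encoded as a small rule check, sketched here under stated assumptions. The numeric thresholds (70% CPU, 10%/day WAL growth, 500 ms p95) come from the list; the `RealtimeMetrics` fields, the boolean flag for connection failures, and the `write_growth_tolerance_pct` used to interpret "no corresponding DB write growth" are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RealtimeMetrics:
    """Hypothetical snapshot of the signals named in the capacity triggers."""
    cpu_sustained_pct: float            # sustained Realtime container CPU, in percent
    ws_failures_or_phx_close_storm: bool  # WS connect failures or phx_close storms seen
    wal_growth_pct_per_day: float       # WAL volume growth, percent per day
    db_write_growth_pct_per_day: float  # DB write growth, percent per day
    p95_broadcast_latency_ms: float     # p95 update-broadcast latency, milliseconds


def capacity_actions(m: RealtimeMetrics, write_growth_tolerance_pct: float = 1.0) -> List[str]:
    """Return the follow-up actions whose trigger condition is met.

    write_growth_tolerance_pct is an assumed cutoff for treating DB write
    growth as "not corresponding" to the observed WAL growth.
    """
    actions: List[str] = []
    if m.cpu_sustained_pct > 70:
        actions.append("cluster Realtime")
    if m.ws_failures_or_phx_close_storm:
        actions.append("check tenant config and connection pool")
    if m.wal_growth_pct_per_day > 10 and m.db_write_growth_pct_per_day <= write_growth_tolerance_pct:
        actions.append("check replication slot lag")
    if m.p95_broadcast_latency_ms > 500:
        actions.append("investigate fan-out bottleneck")
    return actions
```

A check like this would presumably run against whatever metrics source the deployment ends up using; the sketch deliberately leaves metric collection out.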
+
+**Out of scope for MVP.** Flag in PROJECT_SCOPE.md Phase 9 (capacity) or Phase 10 (launch-readiness) when traffic projections justify the work.