Docs 500 Error Fix

Problems

1. OIDC authorization endpoint uses unresolvable hostname

The OIDC front-channel endpoints (authorization_endpoint, end_session_endpoint, issuer) used id.kubernetes.local, a hostname the browser cannot resolve. The browser was redirected to https://id.kubernetes.local/... and could not reach Keycloak.

2. Network policy blocks Keycloak backchannel on port 80

After fixing #1, the callback returned a 500 error. The docs backend’s Kubernetes NetworkPolicy blocked egress on port 80. The OIDC backchannel endpoints (token_endpoint, userinfo_endpoint, jwks_uri) were configured as HTTP (http://keycloak-keycloak/...) which uses port 80. When the mozilla-django-oidc callback handler tried to exchange the auth code for tokens via POST to the token endpoint, the connection was silently dropped by the network policy, causing a ConnectionError → 500.

The network policy's egress allow-list did not include port 80.
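
The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the actual deployment: it derives the TCP port each backchannel URL needs and compares it with an egress allow-list. The helper names and the allow-list {443} are hypothetical (the real policy's ports are not listed in this document), and the URL paths are left elided as in the text.

```python
from urllib.parse import urlsplit

# Default ports implied by the URL scheme when no explicit port is given.
DEFAULT_PORTS = {"http": 80, "https": 443}


def egress_port(url: str) -> int:
    """Return the TCP port a client connects to for this URL."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS[parts.scheme]


def blocked_endpoints(endpoints: dict, allowed_ports: set) -> dict:
    """Map each endpoint whose port is not in the allow-list to that port."""
    return {
        name: egress_port(url)
        for name, url in endpoints.items()
        if egress_port(url) not in allowed_ports
    }


# Backchannel endpoints as described in the text; paths are elided there too.
backchannel = {
    "token_endpoint": "http://keycloak-keycloak/...",
    "userinfo_endpoint": "http://keycloak-keycloak/...",
    "jwks_uri": "http://keycloak-keycloak/...",
}

# Hypothetical allow-list, used only for illustration.
allowed = {443}

if __name__ == "__main__":
    for name, port in blocked_endpoints(backchannel, allowed).items():
        print(f"{name}: egress on port {port} is not allowed")
```

Under these assumptions all three endpoints are reported as blocked, and none are once port 80 is added to the allow-list.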

Fixes

Fix 1: Use publicly resolvable hostname for OIDC front-channel endpoints

File: helmfile/environments/demo/mijnbureau.yaml.gotmpl

Changed id.kubernetes.local to id.127.0.0.1.sslip.io (sslip.io resolves this name to 127.0.0.1) for the browser-facing endpoints:

issuer: "https://id.127.0.0.1.sslip.io/realms/mijnbureau"
authorization_endpoint: "https://id.127.0.0.1.sslip.io/realms/mijnbureau/protocol/openid-connect/auth"
end_session_endpoint: "https://id.127.0.0.1.sslip.io/realms/mijnbureau/protocol/openid-connect/logout"

The backchannel endpoints (token_endpoint, userinfo_endpoint, jwks_uri, introspection_endpoint) remain as internal http://keycloak-keycloak/... since those are called from within the cluster.
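
To illustrate why the new hostname is reachable from the browser, here is a small sketch of how an sslip.io-style name maps to the IP address embedded in it. It handles only the dotted form used in this document and is an approximation of the service's behavior, not its implementation; the function name embedded_ip is hypothetical.

```python
import ipaddress
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches a dotted IPv4 address directly in front of ".sslip.io".
_DOTTED = re.compile(r"(?:^|\.)(\d{1,3}(?:\.\d{1,3}){3})\.sslip\.io$")


def embedded_ip(hostname: str) -> Optional[str]:
    """Return the IPv4 address embedded in an sslip.io hostname, or None."""
    match = _DOTTED.search(hostname.lower())
    if not match:
        return None
    try:
        return str(ipaddress.IPv4Address(match.group(1)))
    except ipaddress.AddressValueError:
        return None  # looked like an address but is not a valid one


# Front-channel endpoints taken from the configuration above.
front_channel = {
    "issuer": "https://id.127.0.0.1.sslip.io/realms/mijnbureau",
    "authorization_endpoint": "https://id.127.0.0.1.sslip.io/realms/mijnbureau/protocol/openid-connect/auth",
    "end_session_endpoint": "https://id.127.0.0.1.sslip.io/realms/mijnbureau/protocol/openid-connect/logout",
}

if __name__ == "__main__":
    for name, url in front_channel.items():
        print(name, "->", embedded_ip(urlsplit(url).hostname))
```

The old hostname id.kubernetes.local carries no embedded address, so the same function returns None for it.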

Fix 2: Allow HTTP egress to Keycloak in network policy

File: helmfile/apps/docs/values.yaml.gotmpl

Added a network policy egress rule allowing HTTP (port 80) specifically to Keycloak pods:

# Allow http egress for keycloak backchannel (token, userinfo, jwks)
- ports:
    - port: 80
  to:
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: keycloak

Additional Issues Found & Fixed

3. Database credential mismatch (docs backend CrashLoopBackOff)

The docs PostgreSQL pod’s docs user password in the database didn’t match the Kubernetes secret after helm upgrades. The helmfile sync triggered a PostgreSQL chart upgrade that rotated the secret values, but the actual database retained the old password. This caused authentication failures (password authentication failed for user "docs").

Fix: Reset database user passwords to match the current secret values:

ALTER USER postgres WITH PASSWORD '<current-postgres-password>';
ALTER USER docs WITH PASSWORD '<current-docs-password>';

4. Liveness probe kills backend before workers initialize

Four uvicorn workers (WEB_CONCURRENCY=4) initialize simultaneously under a 500m CPU limit and take ~20-60s to start. The liveness probe (initialDelaySeconds: 10, failureThreshold: 3) killed the container after about 30s, before the workers could finish initializing. The pod therefore never became healthy and stayed in CrashLoopBackOff.

Fix: Enabled a startupProbe in helmfile/apps/docs/values.yaml.gotmpl with a 60s initial delay and a failure threshold of 6 (a startup window of roughly 120s):

startupProbe:
  enabled: true
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
  successThreshold: 1
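
The probe arithmetic can be made explicit with a short sketch. It assumes the first probe fires at initialDelaySeconds and every probe fails, and it ignores probe timeouts and kubelet scheduling jitter. The liveness periodSeconds of 10 is an assumption (the Kubernetes default), since the text does not state it; the function names are hypothetical.

```python
def last_failure_at(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Seconds after container start at which the final allowed probe failure occurs."""
    return initial_delay + (failure_threshold - 1) * period


def upper_bound(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Common rule-of-thumb estimate: delay plus one full period per allowed failure."""
    return initial_delay + failure_threshold * period


# Old liveness probe: initialDelaySeconds 10, failureThreshold 3.
LIVENESS_PERIOD = 10  # assumption: Kubernetes default, not stated in the text
liveness_kill = last_failure_at(10, LIVENESS_PERIOD, 3)

# New startup probe: initialDelaySeconds 60, periodSeconds 10, failureThreshold 6.
startup_last = last_failure_at(60, 10, 6)
startup_bound = upper_bound(60, 10, 6)

WORST_CASE_WORKER_START = 60  # upper end of the ~20-60s range quoted above

if __name__ == "__main__":
    print("liveness restarts at about", liveness_kill, "s")
    print("startup window ends between", startup_last, "and", startup_bound, "s")
    print("old probe tolerates worst case:", liveness_kill >= WORST_CASE_WORKER_START)
    print("new probe tolerates worst case:", startup_last >= WORST_CASE_WORKER_START)
```

Under these assumptions the old liveness probe gives up at about 30s, which is shorter than the worst-case worker start, while the startup probe allows roughly 110 to 120s. Kubernetes holds off liveness checks until the startup probe has succeeded, which is why enabling it resolves the restart loop.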

5. Celery worker OOMKilled (pre-existing)

The docs celery worker was OOMKilled (exit 137) under its 384Mi memory limit because Celery prefork autoscaling (min=3, max=9) can run up to nine worker processes.

Fix: Increased celery worker memory limit to 720Mi in helmfile/environments/demo/mijnbureau.yaml.gotmpl.
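
As a rough back-of-the-envelope check, the memory limit can be divided across the processes the autoscaler may start. This sketch assumes memory is split evenly between the pool's child processes plus one parent process, which is a simplification; actual per-process usage was not measured in this document, and the function name is hypothetical.

```python
def per_process_budget_mib(limit_mib: float, max_children: int) -> float:
    """Even share of the container memory limit per process.

    Counts max_children pool processes plus one parent process that
    supervises the prefork pool.
    """
    return limit_mib / (max_children + 1)


MAX_CHILDREN = 9  # autoscale max from the text

old_budget = per_process_budget_mib(384, MAX_CHILDREN)
new_budget = per_process_budget_mib(720, MAX_CHILDREN)

if __name__ == "__main__":
    print(f"384Mi limit: {old_budget:.1f} MiB per process")
    print(f"720Mi limit: {new_budget:.1f} MiB per process")
```

At full scale-out the old limit leaves under 40 MiB per process, while the new one leaves about 72 MiB.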

6. Database password rotated by helmfile sync

Each helmfile sync upgrades the PostgreSQL chart, which can regenerate the docs-cluster-rw secret with a new password. The database retains the previous password, causing authentication failures.

Workaround: After each helmfile sync, run ALTER USER to sync the database with the current secret. A proper fix would pin the PostgreSQL password in the chart values.
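
The workaround can be scripted. The sketch below only builds the SQL statement from a base64-encoded secret value (Kubernetes stores secret data base64-encoded); fetching the value from docs-cluster-rw and running the statement are left out, and the key name inside that secret is not given in this document. The helper names are hypothetical, and the quoting assumes PostgreSQL's default standard_conforming_strings setting.

```python
import base64


def decode_secret_value(b64_value: str) -> str:
    """Decode one base64-encoded value from a Kubernetes secret's data map."""
    return base64.b64decode(b64_value).decode("utf-8")


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def alter_user_sql(user: str, b64_password: str) -> str:
    """Build the ALTER USER statement that resets the password to the secret's value."""
    password = decode_secret_value(b64_password)
    return f"ALTER USER {quote_ident(user)} WITH PASSWORD {quote_literal(password)};"


if __name__ == "__main__":
    # Made-up example value, not a real credential.
    example = base64.b64encode(b"example-password").decode("ascii")
    print(alter_user_sql("docs", example))
```

Because the statement contains the plaintext password, it is better piped directly into psql than written to a log or shell history.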