Golden Rules (TL;DR)

  1. Pin everything:
    • SQLAlchemy==1.4.54 (don’t float to 2.x).
    • numpy<2 (Chroma transitive deps can break on 2.0).
    • ghcr.io/chroma-core/chroma:<fixed-tag> (e.g., 0.4.24).
  2. No boolean evaluation of SQLAlchemy clauses:
    • Never write if stmt:, A and B, or A or B on clause objects; boolean evaluation of a clause is undefined and can silently drop conditions.
    • Test if stmt is not None: and combine conditions with sa.and_(...) / sa.or_(...) (see the sketch after this list).
  3. Make /chroma/upsert dry-run part of every deployment smoke test.
  4. Preflight on startup: log SA/NumPy versions, check Chroma heartbeat, and (optionally) verify DB schema columns exist.
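
A minimal sketch of rule 2, assuming a SQLAlchemy Core table handle t with state and collection columns (the names are illustrative):

import sqlalchemy as sa

t = sa.table("portal_chroma_doc", sa.column("state"), sa.column("collection"))

def queued_condition(collections=None):
    # Python's "and"/"or" call bool() on the left-hand clause; depending on the
    # operator this either raises TypeError or silently short-circuits and
    # drops a condition. Combine clauses with sa.and_() instead.
    cond = t.c.state == "queued"
    if collections is not None:  # a plain Python value, tested with "is not None"
        cond = sa.and_(cond, t.c.collection.in_(list(collections)))
    return cond

The condition then composes as usual, e.g. sa.select(t).where(queued_condition(["news"])).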

Baseline Configuration

portal-api dependencies

requirements.txt

fastapi==0.110.0
uvicorn==0.29.0
SQLAlchemy==1.4.54
psycopg2-binary==2.9.9
# ...other deps...

constraints.txt

SQLAlchemy==1.4.54
numpy<2

Install with:

pip install -r requirements.txt -c constraints.txt

Chroma (Kubernetes)

  • StatefulSet with PVC (PERSIST_DIRECTORY=/chroma)
  • Image tag pinned, e.g. ghcr.io/chroma-core/chroma:0.4.24
  • Probes against /api/v1/heartbeat

Example k8s/chroma.yaml (Service + StatefulSet) is included in the CI/CD section below.


Code-level Guidance

1) Safe query composition

Replace fragile ORM / boolean-evaluated expressions with SQLAlchemy Core:

import sqlalchemy as sa

def list_queued(self, *, collections=None, limit=1000):
    t = sa.table(
        "portal_chroma_doc",
        sa.column("id"), sa.column("doc_id"),
        sa.column("entity"), sa.column("natural_key"), sa.column("lang"),
        sa.column("collection"), sa.column("doc_text"),
        sa.column("meta"), sa.column("state"),
    )
    stmt = (
        sa.select(
            t.c.id, t.c.doc_id, t.c.entity, t.c.natural_key, t.c.lang,
            t.c.collection, t.c.doc_text, t.c.meta.label("metadata")
        )
        .where(t.c.state == sa.literal("queued"))
        .order_by(t.c.id.asc())
        .limit(int(limit))
    )
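    # "collections" is a plain Python list (or None), not a clause, so this bool test is safe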
    if collections:
        stmt = stmt.where(t.c.collection.in_(list(collections)))

    rows = self.s.execute(stmt).mappings().all()
    # map rows → DTOs (same as now)

If you must support multiple DB schemas (e.g., some environments lack a model column), either add a migration to unify the schema, or detect the available columns (via information_schema or SQLAlchemy's inspector) and build the SELECT dynamically (we used the latter to stabilize during recovery).
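
A minimal sketch of that fallback, assuming a Session bound to the target engine; it uses SQLAlchemy's runtime inspector rather than a hand-written information_schema query, and the extra column names are illustrative:

import sqlalchemy as sa

def existing_columns(session, table_name="portal_chroma_doc"):
    # Ask the database catalog which columns actually exist.
    insp = sa.inspect(session.get_bind())
    return {col["name"] for col in insp.get_columns(table_name)}

def build_select(session, wanted=("id", "doc_id", "collection", "doc_text", "meta", "model")):
    present = existing_columns(session)
    cols = [sa.column(name) for name in wanted if name in present]
    t = sa.table("portal_chroma_doc", *cols)
    # Select only the columns this environment has ("id" is assumed to always
    # exist); callers must tolerate missing keys in the result mappings.
    return sa.select(*t.columns).order_by(t.c.id.asc())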

2) Startup preflight (recommended)

At app start, log:

  • sqlalchemy.__version__ and its import path (sqlalchemy.__file__),
  • numpy.__version__,
  • the result of a quick GET to CHROMA_URL/api/v1/heartbeat.

Fail /startupz if any critical preflight fails so that Kubernetes blocks the rollout.
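
A minimal sketch, assuming a FastAPI app, the CHROMA_URL env var from the manifests below, and the /startupz route named above; the heartbeat call uses only the standard library to avoid adding a dependency:

import logging
import os
from urllib.request import urlopen

import numpy
import sqlalchemy
from fastapi import FastAPI, Response

log = logging.getLogger("preflight")
app = FastAPI()
preflight_ok = False

@app.on_event("startup")
def preflight() -> None:
    global preflight_ok
    log.info("SQLAlchemy %s (%s)", sqlalchemy.__version__, sqlalchemy.__file__)
    log.info("NumPy %s", numpy.__version__)
    chroma_url = os.environ.get("CHROMA_URL", "http://chroma:8000")
    try:
        with urlopen(f"{chroma_url}/api/v1/heartbeat", timeout=5) as resp:
            preflight_ok = resp.status == 200
    except Exception:
        log.exception("Chroma heartbeat failed; /startupz will return 503")

@app.get("/startupz")
def startupz() -> Response:
    # Startup probe endpoint: keep failing until every critical preflight passes.
    return Response(status_code=200 if preflight_ok else 503)

Point a Kubernetes startupProbe (or the readinessProbe) of the portal-api Deployment at /startupz so a failed preflight actually blocks the rollout.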


K8s Manifests (minimal, production-safe defaults)

Chroma

k8s/chroma.yaml

apiVersion: v1
kind: Service
metadata: { name: chroma, namespace: portal-dev, labels: { app: chroma } }
spec:
  type: ClusterIP
  selector: { app: chroma }
  ports: [{ name: http, port: 8000, targetPort: 8000 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: chroma, namespace: portal-dev }
spec:
  serviceName: chroma
  replicas: 1
  selector: { matchLabels: { app: chroma } }
  template:
    metadata: { labels: { app: chroma } }
    spec:
      containers:
        - name: chroma
          image: ghcr.io/chroma-core/chroma:0.4.24
          imagePullPolicy: IfNotPresent
          env:
            - { name: PERSIST_DIRECTORY, value: /chroma }
            - { name: CHROMA_SERVER_HOST, value: "0.0.0.0" }
            - { name: CHROMA_SERVER_HTTP_PORT, value: "8000" }
          ports: [{ name: http, containerPort: 8000 }]
          readinessProbe:
            httpGet: { path: /api/v1/heartbeat, port: http }
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 12
          livenessProbe:
            httpGet: { path: /api/v1/heartbeat, port: http }
            initialDelaySeconds: 30
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 6
          volumeMounts: [{ name: data, mountPath: /chroma }]
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        storageClassName: standard
        accessModes: [ "ReadWriteOnce" ]
        resources: { requests: { storage: 20Gi } }

portal-api

k8s/portal-api.yaml

apiVersion: v1
kind: Service
metadata: { name: portal-api, namespace: portal-dev, labels: { app: portal-api } }
spec:
  type: ClusterIP
  selector: { app: portal-api }
  ports: [{ name: http, port: 80, targetPort: 8000 }]
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: portal-api, namespace: portal-dev }
spec:
  replicas: 2
  selector: { matchLabels: { app: portal-api } }
  template:
    metadata: { labels: { app: portal-api } }
    spec:
      serviceAccountName: portal-api
      initContainers:
        - name: wait-chroma
          image: curlimages/curl:8.10.1
          command: ["/bin/sh","-lc"]
          args:
            - |
              for i in $(seq 1 60); do
                if curl -fsS http://chroma:8000/api/v1/heartbeat >/dev/null; then
                  echo "chroma ok"; exit 0
                fi
                echo "waiting chroma.."; sleep 2
              done
              echo "chroma timeout"; exit 1
      containers:
        - name: api
          image: portal-api:dev   # <- pin your registry tag
          imagePullPolicy: IfNotPresent
          env:
            - { name: CHROMA_URL, value: http://chroma:8000 }
          ports: [{ name: http, containerPort: 8000 }]
          readinessProbe:
            httpGet: { path: /healthz, port: http }
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet: { path: /healthz, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3

CI/CD: Minikube Smoke on GitHub Actions

.github/workflows/smoke.yml

name: smoke-on-minikube
on:
  push: { branches: [ main ] }
  pull_request: {}

jobs:
  smoke:
    runs-on: ubuntu-22.04
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - uses: azure/setup-kubectl@v4
        with: { version: "v1.30.5" }
      - uses: azure/setup-helm@v4
        with: { version: "v3.14.4" }

      - name: Start minikube
        uses: medyagh/setup-minikube@v0.0.18
        with:
          minikube-version: "v1.33.1"
          kubernetes-version: "v1.30.0"
          driver: docker

      - name: Create namespace
        run: kubectl create ns portal-dev || true

      - name: Deploy Chroma
        run: |
          kubectl -n portal-dev apply -f k8s/chroma.yaml
          kubectl -n portal-dev rollout status sts/chroma --timeout=180s
          kubectl -n portal-dev run curltest --rm -i --restart=Never \
            --image=curlimages/curl:8.10.1 -- \
            curl -fsS http://chroma:8000/api/v1/heartbeat

      - name: Deploy portal-api
        run: |
          kubectl -n portal-dev apply -f k8s/portal-api.yaml
          kubectl -n portal-dev rollout status deploy/portal-api --timeout=240s

      - name: Port-forward & smoke
        run: |
          set -euo pipefail
          kubectl -n portal-dev port-forward svc/portal-api 18080:80 >/tmp/pf.log 2>&1 &
          PF_PID=$!
          sleep 5
          trap "kill $PF_PID || true" EXIT
          curl -fsS http://127.0.0.1:18080/healthz
          kubectl -n portal-dev run c2 --rm -i --restart=Never \
            --image=curlimages/curl:8.10.1 -- \
            curl -fsS http://chroma:8000/api/v1/heartbeat
          curl -fsS -X POST http://127.0.0.1:18080/chroma/upsert \
            -H 'Content-Type: application/json' \
            -d '{"limit":10,"dry_run":true}'

      - name: Dump diagnostics on failure
        if: failure()
        run: |
          kubectl -n portal-dev get all -o wide
          kubectl -n portal-dev describe sts/chroma || true
          kubectl -n portal-dev describe deploy/portal-api || true
          kubectl -n portal-dev logs -l app=portal-api --tail=200 || true

Optional (for local or CI):
hack/smoke.sh

#!/usr/bin/env bash
set -euo pipefail
NS=${NS:-portal-dev}
kubectl -n "$NS" run curltest --rm --restart=Never \
  --image=curlimages/curl:8.10.1 -- \
  curl -fsS http://chroma:8000/api/v1/heartbeat
kubectl -n "$NS" wait --for=condition=Available deploy/portal-api --timeout=180s
kubectl -n "$NS" port-forward svc/portal-api 18080:80 >/tmp/pf.log 2>&1 &
PF_PID=$!; sleep 5; trap "kill $PF_PID || true" EXIT
curl -fsS http://127.0.0.1:18080/healthz
curl -fsS -X POST http://127.0.0.1:18080/chroma/upsert \
  -H 'Content-Type: application/json' -d '{"limit":10,"dry_run":true}'
echo "OK"

Rollback & Cleanup

  • If you ever need to revert the hot-fix path (not required once the code fix is in place):
    • remove any extra volumes/env vars that were added for runtime patching,
    • keep the dependency pins and the probes.

Final note to the team

ChromaDB is powerful but unforgiving when dependencies drift or SQLAlchemy idioms get ambiguous. Please stick to the pinned versions, favor SQLAlchemy Core for data-access utilities, and let CI block rollouts unless both the Chroma heartbeat and the /chroma/upsert dry-run pass. This keeps future image changes from turning into production firefights.

