- Pin everything: SQLAlchemy==1.4.54 (don't float to 2.x), numpy<2 (Chroma transitive deps can break on 2.0), and the image ghcr.io/chroma-core/chroma:<fixed-tag> (e.g., 0.4.24).
- No boolean evaluation of SQLAlchemy clauses (see the short sketch after this list):
  - ❌ if stmt: / A and B / A or B
  - ✅ if stmt is not None: and use sa.and_(...), sa.or_(...).
- Make the /chroma/upsert dry-run part of every deployment smoke test.
- Preflight on startup: log SA/NumPy versions, check the Chroma heartbeat, and (optionally) verify that the expected DB schema columns exist.
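To make the "no boolean evaluation" rule concrete, here is a minimal, self-contained sketch (the table and column names are illustrative, not taken from portal-api) of what goes wrong and the pattern to use instead:
import sqlalchemy as sa

# Illustrative clause objects; any SQLAlchemy ColumnElement behaves the same way.
t = sa.table("portal_chroma_doc", sa.column("state"), sa.column("collection"))
cond_a = t.c.state == "queued"
cond_b = t.c.collection == "docs"

try:
    if cond_a:                      # ❌ boolean evaluation of a clause
        pass
except TypeError as exc:
    print(exc)                      # "Boolean value of this clause is not defined"

# ✅ compose with sa.and_/sa.or_ and only ever test clause variables against None
stmt = sa.select(t).where(sa.and_(cond_a, cond_b))
optional_filter = None              # e.g. built only when a query param is present
if optional_filter is not None:
    stmt = stmt.where(optional_filter)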
Baseline Configuration
portal-api dependencies
requirements.txt
fastapi==0.110.0
uvicorn==0.29.0
SQLAlchemy==1.4.54
psycopg2-binary==2.9.9
# ...other deps...
constraints.txt
SQLAlchemy==1.4.54
numpy<2
Install with:
pip install -r requirements.txt -c constraints.txt
Chroma (Kubernetes)
- StatefulSet with PVC (PERSIST_DIRECTORY=/chroma)
- Image tag pinned, e.g. ghcr.io/chroma-core/chroma:0.4.24
- Probes against /api/v1/heartbeat
Example k8s/chroma.yaml (Service + StatefulSet) is included in the K8s Manifests section below.
Code-level Guidance
1) Safe query composition
Replace fragile ORM / boolean-evaluated expressions with SQLAlchemy Core:
import sqlalchemy as sa
def list_queued(self, *, collections=None, limit=1000):
t = sa.table(
"portal_chroma_doc",
sa.column("id"), sa.column("doc_id"),
sa.column("entity"), sa.column("natural_key"), sa.column("lang"),
sa.column("collection"), sa.column("doc_text"),
sa.column("meta"), sa.column("state"),
)
stmt = (
sa.select(
t.c.id, t.c.doc_id, t.c.entity, t.c.natural_key, t.c.lang,
t.c.collection, t.c.doc_text, t.c.meta.label("metadata")
)
.where(t.c.state == sa.literal("queued"))
.order_by(t.c.id.asc())
.limit(int(limit))
)
if collections:
stmt = stmt.where(t.c.collection.in_(list(collections)))
rows = self.s.execute(stmt).mappings().all()
# map rows → DTOs (same as now)
If you must support multiple DB schemas (e.g., some envs lack a model column), either add a migration to unify the schema or detect columns via information_schema and build a dynamic SELECT (we used this to stabilize during recovery).
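A minimal sketch of that detection approach, assuming PostgreSQL (information_schema) and the portal_chroma_doc table from the example above; the helper names are illustrative, not the exact recovery code:
import sqlalchemy as sa

def existing_columns(session, table_name, schema="public"):
    # Ask information_schema which columns this environment actually has.
    q = sa.text(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = :schema AND table_name = :table"
    )
    return {row[0] for row in session.execute(q, {"schema": schema, "table": table_name})}

def build_select(session, wanted, table_name="portal_chroma_doc"):
    # Drop requested columns that are missing here (e.g. "model"), then build the SELECT.
    present = existing_columns(session, table_name)
    cols = [c for c in wanted if c in present]
    t = sa.table(table_name, *[sa.column(c) for c in cols])
    return sa.select(*[t.c[c] for c in cols])

# Usage: tolerates environments where the optional "model" column does not exist.
# stmt = build_select(session, ["id", "doc_id", "collection", "model", "state"])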
2) Startup preflight (recommended)
At app start, log:
- SQLAlchemy.__version__ and its import path,
- numpy.__version__,
- a quick GET to CHROMA_URL/api/v1/heartbeat.
Fail /startupz if critical preflights fail so Kubernetes blocks rollout.
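A minimal sketch of such a preflight, assuming a FastAPI app and the CHROMA_URL environment variable used elsewhere in this document; the /startupz route, logger name, and use of the standard library for the heartbeat call are illustrative choices, not the exact production code:
import logging
import os
import urllib.request

import numpy
import sqlalchemy
from fastapi import FastAPI, Response

log = logging.getLogger("preflight")
app = FastAPI()
CHROMA_URL = os.environ.get("CHROMA_URL", "http://chroma:8000")
_preflight_ok = False

def chroma_heartbeat_ok():
    try:
        with urllib.request.urlopen(f"{CHROMA_URL}/api/v1/heartbeat", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

@app.on_event("startup")
def preflight():
    global _preflight_ok
    # Log the versions and import paths we pinned, so drift shows up in startup logs.
    log.info("SQLAlchemy %s from %s", sqlalchemy.__version__, sqlalchemy.__file__)
    log.info("NumPy %s", numpy.__version__)
    _preflight_ok = chroma_heartbeat_ok()
    log.info("Chroma heartbeat ok: %s", _preflight_ok)

@app.get("/startupz")
def startupz():
    # A Kubernetes startup/readiness probe can target this; non-200 blocks the rollout.
    return Response(status_code=200 if _preflight_ok else 503)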
K8s Manifests (minimal, production-safe defaults)
Chroma
k8s/chroma.yaml
apiVersion: v1
kind: Service
metadata: { name: chroma, namespace: portal-dev, labels: { app: chroma } }
spec:
type: ClusterIP
selector: { app: chroma }
ports: [{ name: http, port: 8000, targetPort: 8000 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: chroma, namespace: portal-dev }
spec:
serviceName: chroma
replicas: 1
selector: { matchLabels: { app: chroma } }
template:
metadata: { labels: { app: chroma } }
spec:
containers:
- name: chroma
image: ghcr.io/chroma-core/chroma:0.4.24
imagePullPolicy: IfNotPresent
env:
- { name: PERSIST_DIRECTORY, value: /chroma }
- { name: CHROMA_SERVER_HOST, value: "0.0.0.0" }
- { name: CHROMA_SERVER_HTTP_PORT, value: "8000" }
ports: [{ name: http, containerPort: 8000 }]
readinessProbe:
httpGet: { path: /api/v1/heartbeat, port: http }
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 12
livenessProbe:
httpGet: { path: /api/v1/heartbeat, port: http }
initialDelaySeconds: 30
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 6
volumeMounts: [{ name: data, mountPath: /chroma }]
volumeClaimTemplates:
- metadata: { name: data }
spec:
storageClassName: standard
accessModes: [ "ReadWriteOnce" ]
resources: { requests: { storage: 20Gi } }
portal-api
k8s/portal-api.yaml
apiVersion: v1
kind: Service
metadata: { name: portal-api, namespace: portal-dev, labels: { app: portal-api } }
spec:
type: ClusterIP
selector: { app: portal-api }
ports: [{ name: http, port: 80, targetPort: 8000 }]
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: portal-api, namespace: portal-dev }
spec:
replicas: 2
selector: { matchLabels: { app: portal-api } }
template:
metadata: { labels: { app: portal-api } }
spec:
serviceAccountName: portal-api
initContainers:
- name: wait-chroma
image: curlimages/curl:8.10.1
command: ["/bin/sh","-lc"]
args:
- |
for i in $(seq 1 60); do
if curl -fsS http://chroma:8000/api/v1/heartbeat >/dev/null; then
echo "chroma ok"; exit 0
fi
echo "waiting chroma.."; sleep 2
done
echo "chroma timeout"; exit 1
containers:
- name: api
image: portal-api:dev # <- pin your registry tag
imagePullPolicy: IfNotPresent
env:
- { name: CHROMA_URL, value: http://chroma:8000 }
ports: [{ name: http, containerPort: 8000 }]
readinessProbe:
httpGet: { path: /healthz, port: http }
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 3
livenessProbe:
httpGet: { path: /healthz, port: http }
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 3
CI/CD: Minikube Smoke on GitHub Actions
.github/workflows/smoke.yml
name: smoke-on-minikube
on:
push: { branches: [ main ] }
pull_request: {}
jobs:
smoke:
runs-on: ubuntu-22.04
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- uses: azure/setup-kubectl@v4
with: { version: "v1.30.5" }
- uses: azure/setup-helm@v4
with: { version: "v3.14.4" }
- name: Start minikube
uses: medyagh/setup-minikube@v0.0.18
with:
minikube-version: "v1.33.1"
kubernetes-version: "v1.30.0"
driver: docker
- name: Create namespace
run: kubectl create ns portal-dev || true
- name: Deploy Chroma
run: |
kubectl -n portal-dev apply -f k8s/chroma.yaml
kubectl -n portal-dev rollout status sts/chroma --timeout=180s
kubectl -n portal-dev run curltest --rm -i --restart=Never \
--image=curlimages/curl:8.10.1 -- \
curl -fsS http://chroma:8000/api/v1/heartbeat
- name: Deploy portal-api
run: |
kubectl -n portal-dev apply -f k8s/portal-api.yaml
kubectl -n portal-dev rollout status deploy/portal-api --timeout=240s
- name: Port-forward & smoke
run: |
set -euo pipefail
kubectl -n portal-dev port-forward svc/portal-api 18080:80 >/tmp/pf.log 2>&1 &
PF_PID=$!
sleep 5
trap "kill $PF_PID || true" EXIT
curl -fsS http://127.0.0.1:18080/healthz
kubectl -n portal-dev run c2 --rm -i --restart=Never \
--image=curlimages/curl:8.10.1 -- \
curl -fsS http://chroma:8000/api/v1/heartbeat
curl -fsS -X POST http://127.0.0.1:18080/chroma/upsert \
-H 'Content-Type: application/json' \
-d '{"limit":10,"dry_run":true}'
- name: Dump diagnostics on failure
if: failure()
run: |
kubectl -n portal-dev get all -o wide
kubectl -n portal-dev describe sts/chroma || true
kubectl -n portal-dev describe deploy/portal-api || true
kubectl -n portal-dev logs -l app=portal-api --tail=200 || true
Optional (for local or CI): hack/smoke.sh
#!/usr/bin/env bash
set -euo pipefail
NS=${NS:-portal-dev}
kubectl -n "$NS" run curltest --rm -i --restart=Never \
--image=curlimages/curl:8.10.1 -- \
curl -fsS http://chroma:8000/api/v1/heartbeat
kubectl -n "$NS" wait --for=condition=Available deploy/portal-api --timeout=180s
kubectl -n "$NS" port-forward svc/portal-api 18080:80 >/tmp/pf.log 2>&1 &
PF_PID=$!; sleep 5; trap "kill $PF_PID || true" EXIT
curl -fsS http://127.0.0.1:18080/healthz
curl -fsS -X POST http://127.0.0.1:18080/chroma/upsert \
-H 'Content-Type: application/json' -d '{"limit":10,"dry_run":true}'
echo "OK"
Rollback & Cleanup
- If you ever need to revert the hot-fix path (not required if the code is fixed):
  - remove the extra volumes/envs you added for runtime patching,
  - keep the dependency pins and probes.
Final note to the team
ChromaDB is powerful but unforgiving with drifting dependencies and ambiguous SQLAlchemy idioms. Please stick to the pinned versions, favor SQLAlchemy Core for data-access utilities, and let CI block rollouts unless heartbeat and /chroma/upsert (dry-run) both pass. This keeps future image changes from turning into production firefights.