The jump from demo to production is mostly operations: observability, fallbacks, and clear user expectations.
Observability
Log structured fields (without secrets):
- Request ID, model name, latency, token counts.
- Retrieval hits: top chunk IDs and scores.
- Whether the user saw a fallback (cached answer, "could not find," human handoff).
Trace end-to-end to debug "bad answer" reports quickly.
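As a minimal sketch, one structured JSON log line per request could carry these fields. The field names and the example model name are illustrative, not a standard schema:

```python
import json
import logging
import uuid

logger = logging.getLogger("llm_requests")
logging.basicConfig(level=logging.INFO)

def log_llm_request(model, latency_ms, prompt_tokens, completion_tokens,
                    retrieval_hits, fallback=None, request_id=None):
    """Emit one structured JSON log line per LLM request (no secrets)."""
    record = {
        "request_id": request_id or str(uuid.uuid4()),
        "model": model,
        "latency_ms": latency_ms,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Top chunk IDs and similarity scores from the retriever.
        "retrieval_hits": retrieval_hits,
        # None, or e.g. "cached", "not_found", "human_handoff".
        "fallback": fallback,
    }
    logger.info(json.dumps(record))
    return record

# Example: a request answered from the index with no fallback.
rec = log_llm_request(
    model="example-model-v1",
    latency_ms=820,
    prompt_tokens=412,
    completion_tokens=96,
    retrieval_hits=[{"chunk_id": "doc-17#3", "score": 0.82}],
)
```

Emitting the whole record as one JSON line keeps it greppable and lets a log pipeline join on `request_id` for end-to-end traces.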
Fallbacks
When retrieval confidence is low or generation times out:
- Return a safe message and suggest reformulating the query.
- Offer human support or alternative navigation.
- Never loop silently on errors.
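The decision can be sketched as a single function, assuming a hypothetical retrieval confidence score and a timeout flag from the generation step; the threshold value is illustrative and should be tuned on your own data:

```python
CONFIDENCE_THRESHOLD = 0.35  # illustrative cutoff, not a universal value

def answer_or_fallback(retrieval_score, generation, timed_out):
    """Return (message, fallback_kind). fallback_kind is None on success.

    Fails fast with an explicit message instead of silently retrying.
    """
    if timed_out:
        return ("Sorry, that took too long. Please try again, "
                "or contact support.", "timeout")
    if retrieval_score < CONFIDENCE_THRESHOLD:
        return ("I could not find a confident answer. Try rephrasing "
                "your question, or browse the help center.", "low_confidence")
    return (generation, None)

# Low-confidence retrieval: surface a safe message, not the weak generation.
answer, kind = answer_or_fallback(0.12, "(possibly irrelevant text)", timed_out=False)
```

Returning the fallback kind alongside the message is what lets the logging layer record whether the user saw a degraded response.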
Rate limits and abuse
Bots hammer public endpoints. Use per-user and global limits, CAPTCHA or auth where appropriate, and anomaly detection on token burn.
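A common way to combine per-user and global limits is a pair of token buckets; this is a sketch with made-up capacities, not production-tuned numbers:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Token-bucket limiter: holds up to `capacity` tokens, refilled at `refill_per_sec`."""
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per user plus a single global bucket; both must admit the request.
global_bucket = TokenBucket(capacity=1000, refill_per_sec=50)
user_buckets = defaultdict(lambda: TokenBucket(capacity=20, refill_per_sec=1))

def admit(user_id, token_cost=1.0):
    # Check the per-user bucket first so one abuser cannot drain the global pool.
    return user_buckets[user_id].allow(token_cost) and global_bucket.allow(token_cost)
```

Charging `cost` in tokens rather than requests is what makes this useful against token burn: an expensive prompt consumes more of the bucket than a cheap one.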
Versioning prompts and indexes
Tag prompts with versions in your config; store which version answered each query. Re-ingest documents on a schedule and track index build IDs so you can roll back.
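A minimal provenance record, assuming hypothetical version tags and a `store` mapping (in practice a database column on the query log), could look like this:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AnswerProvenance:
    """Everything needed to reproduce, or roll back, a given answer."""
    prompt_version: str   # tag from your prompt config, e.g. "support-qa-v3"
    model: str
    index_build_id: str   # set by the scheduled re-ingestion job

# Prompts live in config, keyed by version tag (contents illustrative).
PROMPTS = {
    "support-qa-v3": "You are a support assistant. Answer only from the context.",
}

def record_provenance(query_id, prompt_version, model, index_build_id, store):
    """Store which prompt/model/index combination answered each query."""
    store[query_id] = asdict(AnswerProvenance(prompt_version, model, index_build_id))
    return store[query_id]

store = {}
record_provenance("q-001", "support-qa-v3", "example-model-v1",
                  "index-2024-06-01", store)
```

With the `index_build_id` stored per answer, rolling back a bad ingestion is a matter of pointing the retriever at the previous build and knowing exactly which answers the bad one produced.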
Key takeaways
- Treat LLM features like any critical path: metrics, tracing, and SLOs.
- Plan explicit degraded modes: partial success beats confident nonsense.
- Version everything you can: prompts, models, and retrieval snapshots.