AI for Developers·Lesson 4 of 5

Calling LLM APIs from Applications

Putting an LLM behind your own API turns reliability into an engineering problem: timeouts, idempotency, and abuse prevention matter as much as the model name.

The request shape (conceptually)

Most chat-style APIs accept an array of messages with roles:

  • system — stable instructions (tone, format, policies).
  • user — the human or upstream service input.
  • assistant — prior model turns in multi-step chat.
  • tool — results from function/tool calls (when supported).

Keep system prompts small and version them like code. Put volatile detail in user messages or retrieved context blocks.
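As a sketch, a request body for a chat-style endpoint might look like the following. The field names (model, max_tokens, messages) follow common provider conventions, and the model name is a placeholder:

```python
# Sketch of a chat-style request body; the exact schema varies by provider.
request_body = {
    "model": "example-model",   # hypothetical model identifier
    "max_tokens": 512,          # cap output length for unattended jobs
    "messages": [
        # Stable, versioned instructions live in the system role.
        {"role": "system", "content": "You are a support assistant. Reply in JSON."},
        # Volatile detail goes in the user turn or a retrieved context block.
        {"role": "user", "content": "Order #123 arrived damaged. What are my options?"},
    ],
}
```

Keeping the system message terse and stable makes it easy to diff and cache, while the user message carries per-request detail.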

Streaming

Streaming sends tokens to the client as they are generated. It improves perceived latency in UIs, but it is trickier to handle on the server: you may need to aggregate the chunks into a full response for logging or retries.

For HTTP APIs, prefer server-sent events or chunked responses your client library already supports — avoid inventing ad-hoc protocols.
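One way to handle the server-side aggregation problem is to tee the stream: forward each chunk to the client as it arrives, and hand the assembled text to a logging callback once the stream closes. This is a minimal sketch; the function and parameter names are illustrative:

```python
from typing import Callable, Iterable, Iterator

def tee_stream(chunks: Iterable[str], on_complete: Callable[[str], None]) -> Iterator[str]:
    """Yield each streamed chunk unchanged while collecting the full text.

    When the upstream generator is exhausted, invoke on_complete with the
    assembled completion so it can be logged or stored for retry bookkeeping.
    """
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk          # forward to the client immediately
    on_complete("".join(parts))  # full text available only after stream ends
```

A design note: the callback fires only on normal exhaustion, so a client disconnect mid-stream skips it; wrap the loop in try/finally if you must log partial output too.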

Errors, retries, and backoff

Networks and providers fail. Standard patterns:

  • Retry idempotent reads on 429/5xx with exponential backoff and jitter.
  • Do not blindly retry requests that might have charged you or committed side effects unless you have deduplication keys.
  • Set hard timeouts so a hung connection cannot hold workers forever.
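The retry pattern above can be sketched as follows. Here `send` stands in for any zero-argument callable that performs the request and returns a status code with a body; its shape, and the set of retryable statuses, are assumptions for illustration:

```python
import random
import time

# Transient statuses worth retrying; adjust to your provider's documentation.
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retries(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry a request on transient failures with exponential backoff and
    full jitter. `send` must be safe to repeat (idempotent, or deduplicated
    via an idempotency key); otherwise do not use this wrapper."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random amount in [0, base * 2^attempt).
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return status, body
```

Injecting `sleep` keeps the wrapper testable; in production, pair it with a hard per-request timeout on the underlying HTTP call.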

Cost and limits

Pricing usually tracks input + output tokens. Mitigations:

  • Trim unnecessary history and documents before sending.
  • Cache stable system instructions where the platform allows.
  • Cap max_tokens for unattended jobs.
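Trimming history can be as simple as keeping the system message plus the most recent turns that fit a token budget. The sketch below uses a rough 4-characters-per-token heuristic as its default; in practice you would count with your provider's tokenizer:

```python
def trim_history(messages, budget, count_tokens=lambda m: len(m["content"]) // 4):
    """Keep system messages plus as many of the newest turns as fit `budget`.

    `count_tokens` defaults to a crude chars/4 heuristic (an assumption);
    swap in a real tokenizer for accurate budgeting.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(count_tokens(m) for m in system)
    for msg in reversed(rest):           # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break                        # oldest turns fall off first
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```

Dropping from the oldest end preserves the turns the model most needs for coherence, while the system message survives regardless of budget.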

Security basics

Never expose raw API keys in browsers. Call providers from your server or a trusted edge function. Rate-limit per user and scrub secrets from prompts before logging.
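Scrubbing before logging can be done with a small pass over the text. The patterns below are illustrative assumptions, not an exhaustive list; extend them for the credential formats your stack actually handles:

```python
import re

# Illustrative patterns for common credential shapes; these are assumptions,
# not a complete inventory. Add patterns for each provider you integrate.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),           # "sk-"-prefixed API keys
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._-]+"),  # bearer tokens in headers
]

def scrub(text: str) -> str:
    """Replace likely secrets with a placeholder before the text is logged."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Run this on prompts and headers at the logging boundary, not at capture time, so the live request is untouched while the persisted record is safe.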

Key takeaways

  • Structure messages so policies are stable and context is explicit.
  • Stream for UX, but plan for partial failures and retries.
  • Treat the integration like any paid external API: timeouts, quotas, and key hygiene.