The Real Context Workflow
Think in terms of turn reduction, not knowledge modeling. Your input is a big array of turns. Your output is a smaller context window for the next model call.
What The Problem Actually Looks Like
In production, a thread is rarely just user and assistant messages. It also includes tool results, CRM data, system events, previous summaries, internal guidance, and repeated restatements of the same issue.
```json
{
  "turns": [
    "User: We still cannot log in after yesterday's Okta cutover.",
    "Agent: Pulling account metadata and auth logs.",
    "Tool crm_lookup: account=acme_corp tier=enterprise billing=current renewal=2026-07-01",
    "Tool auth_audit: 14 failed SAML assertions since 09:12 UTC; issuer mismatch detected.",
    "Internal note: Customer is not delinquent. Keep ticket in support queue.",
    "Previous ticket: promised service credit if outage exceeds 4 hours.",
    "Slack escalation: INC-4821 open; workaround is manual issuer override.",
    "User: Three teams lost admin access after yesterday's metadata change.",
    "Tool statuspage: degraded identity service in us-east-1.",
    "Agent: Need a concise handoff context before the next model call."
  ]
}
```

That array is still far smaller than what many teams send in reality. The important point is that it already contains overlap, stale details, and the same issue stated in different ways.
Step 1: First Reduction With POST /context
Use POST /context for the first compact pass: send raw turns and get back a normalized set of facts, each scored by salience.
```shell
curl -X POST http://localhost:9300/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "turns": [
      "User: We still cannot log in after yesterday'\''s Okta cutover.",
      "Tool crm_lookup: account=acme_corp tier=enterprise billing=current",
      "Tool auth_audit: issuer mismatch detected after IdP migration.",
      "Slack escalation: workaround is manual issuer override."
    ],
    "maxFacts": 12
  }'
```

The response is the reduced window:

```json
{
  "facts": [
    { "predicate": "customer_tier", "args": ["acme_corp", "enterprise"], "salience": 0.98 },
    { "predicate": "billing_status", "args": ["acme_corp", "current"], "salience": 0.80 },
    { "predicate": "current_issue", "args": ["acme_corp", "saml_issuer_mismatch"], "salience": 0.97 },
    { "predicate": "workaround", "args": ["acme_corp", "manual_issuer_override"], "salience": 0.88 }
  ],
  "factsReturned": 4,
  "contradictions": 0,
  "newFactsExtracted": 4
}
```

Step 2: Ask The Next Question With POST /context/optimize
The first pass is broad. The next pass should be question-specific. Once you know what the next model call is trying to do, use /context/optimize to narrow the window further.
```shell
curl -X POST http://localhost:9300/context/optimize \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "sessionId": "support-thread-4821",
    "maxFacts": 10,
    "goals": [
      {"predicate": "next_best_action", "args": ["acme_corp"]},
      {"predicate": "service_credit_applicable", "args": ["acme_corp"]}
    ]
  }'
```

This is where the context window becomes operational instead of merely descriptive. The model stops seeing the whole incident and starts seeing the subset needed to answer the next action question.
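To build intuition for what the goals field does, here is a rough local approximation of goal-directed narrowing. This is an illustration only, not the service's actual ranking logic; the fact shape matches the /context response above, and `narrow_to_goals` is a hypothetical helper, not part of the API.

```python
def narrow_to_goals(facts, goals, max_facts=10):
    """Sketch of goal-directed narrowing: keep facts whose predicate or
    arguments overlap the stated goals, then rank survivors by salience.
    Illustrative approximation only."""
    goal_predicates = {g["predicate"] for g in goals}
    goal_args = {arg for g in goals for arg in g["args"]}
    relevant = [
        f for f in facts
        if f["predicate"] in goal_predicates or goal_args & set(f["args"])
    ]
    # Highest-salience facts win the limited slots in the window.
    relevant.sort(key=lambda f: f["salience"], reverse=True)
    return relevant[:max_facts]
```

On a thread this small, every fact mentions acme_corp and survives the overlap check; the filtering earns its keep as a thread accumulates facts about other accounts and side issues.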
Step 3: Stop Re-Sending The Same Context With POST /context/diff
This is the step most teams miss. If the conversation continues, do not send the whole optimized window again. Keep the same sessionId and ask for the delta.
```shell
curl -X POST http://localhost:9300/context/diff \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "sessionId": "support-thread-4821",
    "maxFacts": 10
  }'
```

The response contains only the delta:

```json
{
  "previousWindowId": "ctx-01",
  "currentWindowId": "ctx-02",
  "added": [
    { "predicate": "temporary_access_restored", "args": ["acme_corp", "true"], "salience": 0.92 }
  ],
  "removed": [],
  "unchanged": 9,
  "fullRefreshRecommended": false
}
```

That is the production benefit: later model calls pay for the change, not for the entire thread history again.
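One way to consume that response is to keep a local cache of the current window and fold each diff into it. A minimal sketch, assuming the fact shapes shown above; `apply_context_diff` is a hypothetical helper name, not part of the API:

```python
def apply_context_diff(cached_window, diff):
    """Fold a /context/diff response into a locally cached window.
    Facts are keyed by (predicate, args): `removed` entries are dropped,
    `added` entries appended. Returns None when the service recommends
    a full refresh, signaling the caller to refetch instead."""
    if diff.get("fullRefreshRecommended"):
        return None  # caller should do a fresh /context/optimize pass
    removed = {(f["predicate"], tuple(f["args"])) for f in diff.get("removed", [])}
    kept = [
        f for f in cached_window
        if (f["predicate"], tuple(f["args"])) not in removed
    ]
    return kept + diff.get("added", [])
```

Keying on (predicate, args) rather than object identity means the cache stays correct even when the service re-serializes the same fact with a different salience score.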
Step 4: End The Thread Cleanly
When the thread is finished, clear the diff snapshot:
```shell
curl -X POST http://localhost:9300/context/session/clear \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"sessionId":"support-thread-4821"}'
```

Formatting For The Model
Most teams take the returned entries and flatten them into a short system or tool message. The model does not need the original transcript if the compact context already captures the operational state.
```python
import requests
from openai import OpenAI

client = OpenAI()

# Goal-directed second pass; this assumes the optimize response exposes
# the reduced window under an "entries" key.
ctx = requests.post(
    "http://localhost:9300/context/optimize",
    headers={"X-Tenant-ID": "default"},
    json={
        "sessionId": "support-thread-4821",
        "maxFacts": 10,
        "goals": [{"predicate": "next_best_action", "args": ["acme_corp"]}],
    },
).json()

# Flatten each fact into a one-line predicate(args) bullet.
context_lines = [
    f"- {entry['predicate']}({', '.join(entry['args'])})"
    for entry in ctx["entries"]
]

# The model sees only the reduced context, never the raw transcript.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use only this reduced context:\n" + "\n".join(context_lines)},
        {"role": "user", "content": "Write the next support reply."},
    ],
)
```

Where The Other Surfaces Fit
- API: best when your app already owns the turn array and prompt assembly.
- Python SDK: best when you want `optimize_context()`, `diff_context()`, and session cleanup in app code.
- TypeScript SDK: best when you want `contextWindow()`, `optimizeContext()`, `diffContext()`, and `clearContextSession()` directly in app code.
- MCP: best when your agent runtime already uses tool calling. Use MCP `context` for salience retrieval and pair it with the HTTP context endpoints when you need goal-specific windows.
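Whichever surface you use, one habit carries over from Step 4: clear the session when the thread ends, even on error paths. A minimal Python sketch using a context manager; `context_session` is a hypothetical wrapper, and the `post` argument is injectable so the cleanup logic can be exercised without a live server:

```python
from contextlib import contextmanager


@contextmanager
def context_session(session_id, base_url="http://localhost:9300",
                    tenant_id="default", post=None):
    """Yield the session id, then clear the diff snapshot on exit,
    even if the work inside the block raises."""
    if post is None:  # default to requests.post, imported lazily
        import requests
        post = requests.post
    try:
        yield session_id
    finally:
        post(
            f"{base_url}/context/session/clear",
            headers={"Content-Type": "application/json",
                     "X-Tenant-ID": tenant_id},
            json={"sessionId": session_id},
        )
```

Usage follows the workflow above: open the block with the thread's session id, run the optimize and diff calls inside it, and the clear request fires automatically when the block exits.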
If You Want The Backend Details
Predicates, rules, scopes, salience scoring, truth maintenance, and memory lifecycle still exist. They just belong in the backend explanation, not at the front of the product story.
If that is what you need next, go to How It Works on the Backend.