The four layers
| Layer | Purpose | Backend rule |
|---|---|---|
| L1 | Per-user rate limiting | Blocks when requests exceed the tenant’s per-user per-minute limit |
| L2 | Low-quality input filter | Blocks when the quality score is below 0.35 |
| L3 | Semantic duplicate detection | Blocks when semantic similarity is above 0.92 |
| L4 | Budget governance | Blocks when quota is exhausted and overage policy is `block` |
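The rules in the table can be read as a chain of checks. The sketch below is only an illustration of that reading, not the backend's actual code: the `WriteContext` fields and the `blocking_layer` function are hypothetical names, and evaluating the layers in L1 to L4 order is an assumption. Only the two thresholds (0.35 and 0.92) and the `block` overage policy come from the table.

```python
from dataclasses import dataclass
from typing import Optional

QUALITY_THRESHOLD = 0.35     # from the table: L2 blocks below this score
SIMILARITY_THRESHOLD = 0.92  # from the table: L3 blocks above this similarity


@dataclass
class WriteContext:
    # Hypothetical field names, chosen for illustration only.
    requests_this_minute: int
    per_user_per_minute_limit: int
    quality_score: float
    max_semantic_similarity: float
    quota_exhausted: bool
    overage_policy: str  # "block" is the only value named in the text


def blocking_layer(ctx: WriteContext) -> Optional[str]:
    """Return the first layer that would block the write, or None.

    The L1 -> L4 evaluation order is an assumption.
    """
    if ctx.requests_this_minute > ctx.per_user_per_minute_limit:
        return "L1"
    if ctx.quality_score < QUALITY_THRESHOLD:
        return "L2"
    if ctx.max_semantic_similarity > SIMILARITY_THRESHOLD:
        return "L3"
    if ctx.quota_exhausted and ctx.overage_policy == "block":
        return "L4"
    return None
```

Note that the comparisons are strict, following the wording "below" and "above": a score of exactly 0.35 or a similarity of exactly 0.92 passes in this sketch.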
What gets blocked
| Layer | Example input | Typical reason |
|---|---|---|
| L1 | Same user spams many writes in one minute | `rate_limit_exceeded` |
| L2 | "ok", "hi", "??" | `low_quality` |
| L3 | Re-sending the same preference statement repeatedly | `duplicate_query` |
| L4 | Tenant is out of monthly calls or tokens on block mode | `budget_exhausted` |
How quality is scored
The current quality score is based on:
- number of messages
- average message length
- lexical diversity
- whether the conversation contains a question signal
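To get a feel for these signals before sending a write, you can approximate them client-side. The sketch below is a rough approximation under stated assumptions, not the backend's scoring function: `quality_signals` is a hypothetical helper, lexical diversity is approximated as unique words divided by total words, and the question signal is approximated as the presence of a `?`. How the backend weights and combines the signals into a single score is not documented here, so the sketch deliberately stops at the raw signals.

```python
import re
from typing import Dict, List


def quality_signals(messages: List[str]) -> Dict[str, float]:
    """Compute rough client-side versions of the four listed signals.

    Assumptions: lexical diversity is a type-token ratio, and a
    question signal is any message containing "?".
    """
    words = [w.lower() for m in messages for w in re.findall(r"\w+", m)]
    count = len(messages)
    return {
        "message_count": float(count),
        "avg_message_length": (
            sum(len(m) for m in messages) / count if count else 0.0
        ),
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "has_question": 1.0 if any("?" in m for m in messages) else 0.0,
    }
```

For a single-word input such as "ok", the message count is 1 and the average length is 2 characters, which is consistent with that kind of input being flagged as low quality.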
How to reduce your block rate
- Send meaningful user facts, preferences, goals, or procedures instead of filler text.
- Avoid writing the same memory-worthy statement repeatedly.
- Batch coherent conversational turns together instead of sending single-word fragments.
- Respect the per-user write rate limit.
- Monitor `blocked_reason` and `budget_remaining_pct` on `add()` responses.
Operational note
A blocked write still returns HTTP 200, but the status in the response identifies the blocking layer, such as L2 or L4. Inspect the response body, not just the HTTP status code.
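Because the HTTP status alone does not reveal a block, client code needs to look inside the body. The following is a minimal sketch of that check, assuming the body is a JSON object already parsed into a dict. The `blocked_reason` and `budget_remaining_pct` fields are the ones named above; the `"status"` key and the `summarize_add_response` helper are hypothetical, so adjust them to the actual response shape.

```python
from typing import Any, Dict


def summarize_add_response(http_status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Classify a write response by its body, not just its HTTP status.

    Assumes a parsed JSON body; the "status" key name is hypothetical.
    """
    blocked_reason = body.get("blocked_reason")
    blocked = bool(blocked_reason)
    return {
        # HTTP 200 is necessary but not sufficient for a stored write.
        "stored": http_status == 200 and not blocked,
        "blocked": blocked,
        "blocking_layer": body.get("status") if blocked else None,
        "blocked_reason": blocked_reason,
        "budget_remaining_pct": body.get("budget_remaining_pct"),
    }
```

A caller might log `blocked_reason` for every blocked write and alert when `budget_remaining_pct` drops below a threshold of its own choosing.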