AI Agent Rate Limiting: Implementing Token Budgets and Usage Quotas
Build rate limiting systems that prevent runaway costs, ensure fair usage, and handle provider limits gracefully - covering token budgets, user quotas, and adaptive throttling.

TL;DR
- Rate limiting for AI is primarily about cost control. A single runaway agent can burn through your monthly budget in hours.
- Implement limits at multiple levels: per-request, per-user, per-org, and global.
- Use token sliding windows, not just request counts. 10 small requests ≠ 1 massive request.
- Pre-flight estimation catches expensive requests before they execute.
It's 3am. An agent enters an infinite loop, calling GPT-4 with maximum context every 2 seconds. By morning, you've burned £8,000 in API costs - your entire monthly budget gone in 6 hours. No alerts fired because you only monitored request counts, not token usage.
Rate limiting for AI agents isn't just about fairness or DDoS protection - it's existential cost control. LLM API costs scale with usage, and usage can explode without warning when agents behave unexpectedly.
This guide covers building multi-layer rate limiting that prevents cost disasters while maintaining service quality for legitimate use.
Key takeaways
- Token-based limits matter more than request-based limits. One 100K token request costs 100x more than one 1K token request.
- Implement limits at request, user, org, and global levels. Each catches different failure modes.
- Pre-flight estimation lets you reject expensive requests before incurring costs.
- Graceful degradation (smaller models, shorter outputs) is better than hard rejections.
Why rate limiting matters
AI cost structure differs fundamentally from traditional SaaS. A database query costs fractions of a penny. An LLM call can cost pounds.
The cost explosion problem
| Scenario | Traditional API | LLM API |
|---|---|---|
| Single request | £0.0001 | £0.01-£0.50 |
| Runaway loop (1000 req/min) | £6/hour | £600-£30,000/hour |
| Single bad actor | Annoying | Bankrupting |
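The runaway-loop row is plain multiplication, and it can help to make that explicit when sizing your own limits. A minimal sketch, using the per-request figures from the table; the function name is illustrative:

```typescript
// Hourly cost (in pounds) of a loop firing `requestsPerMinute` requests,
// each costing `costPerRequest`.
function hourlyLoopCost(costPerRequest: number, requestsPerMinute: number): number {
  return costPerRequest * requestsPerMinute * 60;
}

// Per-request figures taken from the table above.
const traditional = hourlyLoopCost(0.0001, 1000);
const llmLow = hourlyLoopCost(0.01, 1000);
const llmHigh = hourlyLoopCost(0.5, 1000);

console.log(traditional.toFixed(2)); // 6.00
console.log(llmLow.toFixed(2), llmHigh.toFixed(2)); // 600.00 30000.00
```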
Real incident patterns
Pattern 1: Infinite retry loops
Agent hits an error, retries with the same massive context, fails again. Each retry costs £0.15. At roughly 100 retries per minute, a loop that runs for 4 hours before detection makes 24,000 calls = £3,600 burned.
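One way to bound this pattern is to give every retry loop its own attempt and spend cap. The sketch below is an assumption about how such a guard might look, not code from a specific library; `RetryGuard` and `RetryBudget` are hypothetical names and the caps are illustrative.

```typescript
// Hypothetical guard: stops a retry loop once either the attempt count
// or the cumulative spend crosses a cap.
interface RetryBudget {
  maxAttempts: number;
  maxCost: number; // pounds
}

class RetryGuard {
  private attempts = 0;
  private spent = 0;
  private readonly budget: RetryBudget;

  constructor(budget: RetryBudget) {
    this.budget = budget;
  }

  // Call before each attempt with the estimated cost of that attempt.
  canAttempt(estimatedCost: number): boolean {
    return (
      this.attempts < this.budget.maxAttempts &&
      this.spent + estimatedCost <= this.budget.maxCost
    );
  }

  // Call after each attempt, successful or not, with what it cost.
  record(actualCost: number): void {
    this.attempts += 1;
    this.spent += actualCost;
  }
}

// Simulate the failing agent from Pattern 1: every retry costs £0.15.
const guard = new RetryGuard({ maxAttempts: 5, maxCost: 1.0 });
let retries = 0;
while (guard.canAttempt(0.15)) {
  guard.record(0.15);
  retries += 1;
}
console.log(retries); // 5
```

With these caps the loop stops after 5 attempts (£0.75) instead of running until someone notices.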
Pattern 2: Context accumulation
Conversation grows without trimming. By turn 50, each message includes 80K tokens of history. User asks 20 questions that afternoon = £60 for one session.
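Trimming history to a token budget before each call is the usual countermeasure. The sketch below keeps system messages plus the newest turns that fit. It is a minimal illustration: `trimHistory` and `approxTokens` are hypothetical helpers, and the 4-characters-per-token ratio is a rough assumption that should be replaced by your provider's tokenizer.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough stand-in for a real tokenizer: about 4 characters per token (assumption).
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps system messages plus the most recent turns that fit in `maxTokens`.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');

  let used = system.reduce((sum, m) => sum + approxTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest so recent turns are kept first.
  for (const msg of [...rest].reverse()) {
    const cost = approxTokens(msg.content);
    if (used + cost > maxTokens) break;
    kept.unshift(msg);
    used += cost;
  }
  return [...system, ...kept];
}

const history: ChatMessage[] = [
  { role: 'system', content: 'You are helpful.' }, // 4 tokens
  { role: 'user', content: 'a'.repeat(400) },      // 100 tokens
  { role: 'assistant', content: 'b'.repeat(400) }, // 100 tokens
  { role: 'user', content: 'c'.repeat(40) },       // 10 tokens
];

const trimmed = trimHistory(history, 120);
console.log(trimmed.length); // 3 (system message plus the two newest turns)
```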
Pattern 3: Abuse by bad actors
Free tier user discovers they can trigger expensive operations. Scripts 1000 requests with complex prompts = £500 in usage you can't recover.
"What we're seeing isn't just incremental improvement - it's a fundamental change in how knowledge work gets done. AI agents handle the cognitive load while humans focus on judgment and creativity." - Marcus Chen, Chief AI Officer at McKinsey Digital
Multi-level rate limiting
Effective rate limiting operates at multiple levels, each catching different failure modes.
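Before looking at each level, it may help to see how the levels compose: run the checks in order and report the first one that rejects. This is a minimal sketch with hypothetical names (`LevelCheck`, `checkAllLevels`) and illustrative numbers, not the configuration used in the sections that follow.

```typescript
type LimitLevel = 'request' | 'user' | 'org' | 'global';

interface LevelCheck {
  level: LimitLevel;
  // Returns true when the request fits within this level's limit.
  allows: (estimatedTokens: number) => boolean;
}

interface LimitDecision {
  allowed: boolean;
  blockedBy?: LimitLevel;
}

// Runs checks in order and reports the first level that rejects.
function checkAllLevels(checks: LevelCheck[], estimatedTokens: number): LimitDecision {
  for (const check of checks) {
    if (!check.allows(estimatedTokens)) {
      return { allowed: false, blockedBy: check.level };
    }
  }
  return { allowed: true };
}

// Illustrative remaining-budget numbers, not recommendations.
const checks: LevelCheck[] = [
  { level: 'request', allows: t => t <= 32_000 },
  { level: 'user', allows: t => t <= 8_000 },
  { level: 'org', allows: t => t <= 500_000 },
  { level: 'global', allows: t => t <= 2_000_000 },
];

console.log(checkAllLevels(checks, 5_000));  // { allowed: true }
console.log(checkAllLevels(checks, 20_000)); // { allowed: false, blockedBy: 'user' }
```

Reporting which level blocked the request is what lets you show the user a relevant message (wait, upgrade, or contact an admin).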
Level 1: Per-request limits
Prevent single requests from being unreasonably expensive.
interface RequestLimits {
maxInputTokens: number;
maxOutputTokens: number;
maxToolCalls: number;
maxExecutionTime: number;
}
const defaultRequestLimits: RequestLimits = {
maxInputTokens: 32000,
maxOutputTokens: 4000,
maxToolCalls: 10,
maxExecutionTime: 60000 // 60 seconds
};
Level 2: Per-user limits
Prevent individual users from excessive usage.
interface UserLimits {
tokensPerMinute: number;
tokensPerHour: number;
tokensPerDay: number;
requestsPerMinute: number;
costPerDay: number;
}
const userLimitsByTier = {
free: {
tokensPerMinute: 10000,
tokensPerHour: 100000,
tokensPerDay: 500000,
requestsPerMinute: 10,
costPerDay: 0.50
},
pro: {
tokensPerMinute: 50000,
tokensPerHour: 500000,
tokensPerDay: 5000000,
requestsPerMinute: 60,
costPerDay: 10
},
enterprise: {
tokensPerMinute: 200000,
tokensPerHour: 2000000,
tokensPerDay: 20000000,
requestsPerMinute: 200,
costPerDay: 100
}
};
Level 3: Per-organisation limits
Prevent one org from consuming disproportionate resources.
interface OrgLimits {
tokensPerDay: number;
costPerMonth: number;
concurrentRequests: number;
}
// Based on subscription tier
const orgLimits = {
starter: {
tokensPerDay: 1000000,
costPerMonth: 50,
concurrentRequests: 5
},
growth: {
tokensPerDay: 10000000,
costPerMonth: 500,
concurrentRequests: 20
},
enterprise: {
tokensPerDay: 100000000,
costPerMonth: 5000,
concurrentRequests: 100
}
};
Level 4: Global limits
Protect your overall budget regardless of individual limits.
interface GlobalLimits {
totalCostPerHour: number;
totalCostPerDay: number;
totalTokensPerMinute: number;
emergencyShutoffCost: number;
}
const globalLimits: GlobalLimits = {
totalCostPerHour: 500,
totalCostPerDay: 3000,
totalTokensPerMinute: 5000000,
emergencyShutoffCost: 10000 // Auto-pause all if exceeded
};
Implementation guide
Token budget manager
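The class that follows keeps usage in a Redis sorted set, and its `getResetTime` returns a fixed value. To show the sliding-window idea in isolation, here is an in-memory sketch that also computes a precise reset time. `InMemorySlidingWindow` is a hypothetical name, and timestamps are passed in explicitly so the behaviour is easy to follow.

```typescript
interface UsageEntry {
  timestamp: number; // ms since epoch
  tokens: number;
}

class InMemorySlidingWindow {
  private entries: UsageEntry[] = [];

  record(tokens: number, now: number): void {
    this.entries.push({ timestamp: now, tokens });
  }

  // Sum of tokens recorded within the last `windowMs` milliseconds.
  usage(windowMs: number, now: number): number {
    const cutoff = now - windowMs;
    return this.entries
      .filter(e => e.timestamp > cutoff)
      .reduce((sum, e) => sum + e.tokens, 0);
  }

  // Seconds until the oldest in-window entry ages out and frees capacity.
  // Returns 0 when the window is empty.
  resetInSeconds(windowMs: number, now: number): number {
    const cutoff = now - windowMs;
    const inWindow = this.entries.filter(e => e.timestamp > cutoff);
    if (inWindow.length === 0) return 0;
    const oldest = Math.min(...inWindow.map(e => e.timestamp));
    return Math.ceil((oldest + windowMs - now) / 1000);
  }
}

const win = new InMemorySlidingWindow();
win.record(4000, 0);
win.record(3000, 30_000);

console.log(win.usage(60_000, 45_000));          // 7000
console.log(win.usage(60_000, 70_000));          // 3000 (first entry has aged out)
console.log(win.resetInSeconds(60_000, 45_000)); // 15
```

An in-memory window only works for a single process; with several instances the counts need a shared store, which is why the version below uses Redis.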
interface UsageRecord {
userId: string;
orgId: string;
tokens: number;
cost: number;
timestamp: Date;
}
class TokenBudgetManager {
private redis: Redis;
// Sliding window token counting
async checkUserLimit(
userId: string,
estimatedTokens: number,
tier: string
): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
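// Cast because tier arrives as a plain string; fail fast on an unknown tier
const limits = userLimitsByTier[tier as keyof typeof userLimitsByTier];
if (!limits) {
throw new Error(`Unknown tier: ${tier}`);
}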
// Check multiple windows
const [perMinute, perHour, perDay] = await Promise.all([
this.getWindowUsage(userId, 60000),
this.getWindowUsage(userId, 3600000),
this.getWindowUsage(userId, 86400000)
]);
// Check against limits
if (perMinute + estimatedTokens > limits.tokensPerMinute) {
return {
allowed: false,
remaining: limits.tokensPerMinute - perMinute,
resetIn: this.getResetTime(userId, 60000)
};
}
if (perHour + estimatedTokens > limits.tokensPerHour) {
return {
allowed: false,
remaining: limits.tokensPerHour - perHour,
resetIn: this.getResetTime(userId, 3600000)
};
}
if (perDay + estimatedTokens > limits.tokensPerDay) {
return {
allowed: false,
remaining: limits.tokensPerDay - perDay,
resetIn: this.getResetTime(userId, 86400000)
};
}
return {
allowed: true,
remaining: Math.min(
limits.tokensPerMinute - perMinute,
limits.tokensPerHour - perHour,
limits.tokensPerDay - perDay
),
resetIn: 0
};
}
async recordUsage(record: UsageRecord): Promise<void> {
const key = `usage:${record.userId}`;
const timestamp = record.timestamp.getTime();
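// Add to sorted set with timestamp as score. Sorted-set members must be unique,
// so include a random id; otherwise two identical records in the same
// millisecond would collapse into one entry and under-count usage.
await this.redis.zadd(key, timestamp, JSON.stringify({
id: Math.random().toString(36).slice(2),
tokens: record.tokens,
cost: record.cost,
timestamp
}));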
// Trim old entries (keep last 24 hours)
const cutoff = Date.now() - 86400000;
await this.redis.zremrangebyscore(key, 0, cutoff);
}
private async getWindowUsage(userId: string, windowMs: number): Promise<number> {
const key = `usage:${userId}`;
const cutoff = Date.now() - windowMs;
const entries = await this.redis.zrangebyscore(key, cutoff, '+inf');
return entries.reduce((sum, entry) => {
const data = JSON.parse(entry);
return sum + data.tokens;
}, 0);
}
private getResetTime(userId: string, windowMs: number): number {
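// Placeholder: returns the full window length in seconds, which is the worst case.
// A precise version would look up the oldest entry still inside the window
// and return the time until that entry expires.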
return Math.ceil(windowMs / 1000);
}
}
Pre-flight cost estimation
Estimate costs before making expensive calls:
interface CostEstimate {
inputTokens: number;
estimatedOutputTokens: number;
totalTokens: number;
estimatedCost: number;
wouldExceedLimit: boolean;
limitType?: string;
}
class CostEstimator {
private tokenizer: Tokenizer;
// Shared budget manager, injected rather than created per call so every
// estimate checks against the same Redis-backed usage data
private budgetManager: TokenBudgetManager;
async estimate(
messages: Message[],
model: string,
context: { userId: string; orgId: string; tier: string }
): Promise<CostEstimate> {
// Count input tokens
const inputTokens = this.countTokens(messages);
// Estimate output. The 0.5 ratio is a starting heuristic; replace it with
// your own historical average per model or per feature
const avgOutputRatio = 0.5;
const estimatedOutputTokens = Math.min(
Math.ceil(inputTokens * avgOutputRatio),
defaultRequestLimits.maxOutputTokens // Cap at max output
);
const totalTokens = inputTokens + estimatedOutputTokens;
// Calculate cost (MODEL_PRICING holds prices per 1K tokens)
const pricing = MODEL_PRICING[model];
if (!pricing) {
throw new Error(`No pricing configured for model: ${model}`);
}
const estimatedCost =
(inputTokens / 1000) * pricing.input +
(estimatedOutputTokens / 1000) * pricing.output;
// Check against limits
const userCheck = await this.budgetManager.checkUserLimit(
context.userId,
totalTokens,
context.tier
);
// checkOrgLimit is not shown here; it is assumed to compare the estimated
// cost against the org's costPerMonth in the same way as the user check
const orgCheck = await this.checkOrgLimit(context.orgId, estimatedCost);
const wouldExceedLimit = !userCheck.allowed || !orgCheck.allowed;
const limitType = !userCheck.allowed ? 'user' : !orgCheck.allowed ? 'org' : undefined;
return {
inputTokens,
estimatedOutputTokens,
totalTokens,
estimatedCost,
wouldExceedLimit,
limitType
};
}
private countTokens(messages: Message[]): number {
return messages.reduce((sum, msg) => {
return sum + this.tokenizer.encode(msg.content).length;
}, 0);
}
}
Rate limiter middleware
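The middleware class that follows calls `budgetManager.reserve(...)`, which `TokenBudgetManager` as shown earlier does not define. One possible shape for a reservation step is sketched here in memory. `ReservationLedger` and its methods are hypothetical, window expiry is left out for brevity, and a multi-instance deployment would need the same logic as an atomic operation in a shared store.

```typescript
// Hypothetical in-memory reservation ledger. Reserved tokens count against
// the budget until the request settles with its actual usage.
class ReservationLedger {
  private reserved = new Map<string, number>();
  private used = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Holds `tokens` for the user if used + reserved stays within the limit.
  reserve(userId: string, tokens: number): boolean {
    const held = this.reserved.get(userId) ?? 0;
    const committed = (this.used.get(userId) ?? 0) + held;
    if (committed + tokens > this.limit) return false;
    this.reserved.set(userId, held + tokens);
    return true;
  }

  // Releases the hold and records what the request actually consumed.
  settle(userId: string, reservedTokens: number, actualTokens: number): void {
    const held = this.reserved.get(userId) ?? 0;
    this.reserved.set(userId, Math.max(0, held - reservedTokens));
    this.used.set(userId, (this.used.get(userId) ?? 0) + actualTokens);
  }

  remaining(userId: string): number {
    return this.limit - (this.used.get(userId) ?? 0) - (this.reserved.get(userId) ?? 0);
  }
}

const ledger = new ReservationLedger(10_000);
console.log(ledger.reserve('u1', 6_000)); // true
console.log(ledger.reserve('u1', 6_000)); // false: would exceed while the first is in flight
ledger.settle('u1', 6_000, 4_500);
console.log(ledger.remaining('u1'));      // 5500
```

Reserving the estimate up front and settling with the actual figure afterwards is what stops two concurrent requests from both passing a check that only one of them can afford.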
class RateLimitMiddleware {
private budgetManager: TokenBudgetManager;
private estimator: CostEstimator;
async process(
request: AgentRequest,
context: ExecutionContext
): Promise<void> {
// Step 1: Estimate cost
const estimate = await this.estimator.estimate(
request.messages,
request.model,
context
);
// Step 2: Check if would exceed limits
if (estimate.wouldExceedLimit) {
throw new RateLimitError({
type: estimate.limitType,
estimatedTokens: estimate.totalTokens,
estimatedCost: estimate.estimatedCost,
message: this.getLimitMessage(estimate.limitType, context.tier)
});
}
// Step 3: Check request-level limits
if (estimate.inputTokens > defaultRequestLimits.maxInputTokens) {
throw new RequestTooLargeError({
inputTokens: estimate.inputTokens,
maxAllowed: defaultRequestLimits.maxInputTokens
});
}
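// Step 4: Reserve budget (to prevent race conditions between concurrent requests).
// Note: reserve() is not part of TokenBudgetManager as shown above and needs adding.
await this.budgetManager.reserve(context.userId, estimate.totalTokens);
// Step 5: Execute (in calling code)
// ...
// Step 6: Record actual usage (after execution)
// await this.budgetManager.recordUsage({ userId, orgId, tokens, cost, timestamp });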
}
private getLimitMessage(limitType: string | undefined, tier: string): string {
const messages: Record<string, string> = {
user: `You've reached your ${tier} tier usage limit. Please wait or upgrade your plan.`,
org: `Your organisation has reached its usage limit. Please contact your admin.`,
global: `Our service is experiencing high demand. Please try again shortly.`
};
// Fall back to a generic message when the limit type is missing or unknown
return (limitType && messages[limitType]) || 'Usage limit reached.';
}
}
Adaptive throttling
When approaching provider rate limits, slow down proactively:
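// Small promise-based delay helper used by throttle() and the request queue
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
class AdaptiveThrottler {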
private currentDelay = 0;
private recentErrors: number[] = [];
private maxDelay = 30000;
async throttle(): Promise<void> {
if (this.currentDelay > 0) {
await sleep(this.currentDelay);
}
}
recordSuccess(): void {
// Decrease delay on success
this.currentDelay = Math.max(0, this.currentDelay - 100);
// Clean old errors
const cutoff = Date.now() - 60000;
this.recentErrors = this.recentErrors.filter(t => t > cutoff);
}
recordRateLimit(retryAfter?: number): void {
this.recentErrors.push(Date.now());
if (retryAfter) {
this.currentDelay = retryAfter * 1000;
} else {
// Exponential backoff based on error frequency
const errorCount = this.recentErrors.length;
this.currentDelay = Math.min(
this.maxDelay,
Math.pow(2, errorCount) * 100
);
}
}
getStatus(): { delay: number; recentErrors: number } {
return {
delay: this.currentDelay,
recentErrors: this.recentErrors.length
};
}
}
Graceful degradation
When limits are approached, degrade gracefully instead of hard failing.
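A simple way to decide which of the strategies described next to apply is to key the choice on how much of the budget is already used. The sketch below is an assumption about how that selection might look; `pickMode` is a hypothetical helper and the thresholds are illustrative, not values from this guide.

```typescript
type DegradationMode = 'full' | 'smaller-model' | 'short-output' | 'queue';

// Thresholds are illustrative assumptions; tune them to your own traffic.
function pickMode(usedTokens: number, limitTokens: number): DegradationMode {
  const utilisation = usedTokens / limitTokens;
  if (utilisation < 0.7) return 'full';
  if (utilisation < 0.85) return 'smaller-model';
  if (utilisation < 1.0) return 'short-output';
  return 'queue';
}

console.log(pickMode(500, 1000));  // full
console.log(pickMode(800, 1000));  // smaller-model
console.log(pickMode(900, 1000));  // short-output
console.log(pickMode(1000, 1000)); // queue
```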
Strategy 1: Model downgrade
async function executeWithDegradation(
request: AgentRequest,
context: ExecutionContext
): Promise<AgentResponse> {
const modelChain = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];
for (const model of modelChain) {
const estimate = await estimator.estimate(request.messages, model, context);
if (!estimate.wouldExceedLimit) {
// Use this model
return execute({ ...request, model });
}
}
// All models would exceed - show degraded response
return {
content: "I'm currently limited in how I can help. Please try a simpler question or wait a few minutes.",
degraded: true
};
}
Strategy 2: Output limiting
async function executeWithOutputLimit(
request: AgentRequest,
remainingBudget: number
): Promise<AgentResponse> {
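// Calculate safe output tokens
const inputTokens = countTokens(request.messages);
const availableForOutput = remainingBudget - inputTokens;
// The 100-token floor keeps responses useful, but it can overshoot the budget
// when fewer than 100 tokens remain. The cap stops a large remaining budget
// from requesting more output than the per-request maximum allows.
const safeOutputTokens = Math.min(
defaultRequestLimits.maxOutputTokens,
Math.max(100, availableForOutput)
);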
return execute({
...request,
maxTokens: safeOutputTokens
});
}
Strategy 3: Queue and batch
class RequestQueue {
private queue: QueuedRequest[] = [];
private processing = false;
async enqueue(request: AgentRequest): Promise<AgentResponse> {
return new Promise((resolve, reject) => {
this.queue.push({ request, resolve, reject });
this.processQueue();
});
}
private async processQueue(): Promise<void> {
if (this.processing) return;
this.processing = true;
try {
while (this.queue.length > 0) {
// Check if we have budget
const budgetAvailable = await this.checkBudget();
if (!budgetAvailable) {
// Wait before trying again
await sleep(5000);
continue;
}
const item = this.queue.shift();
if (!item) break;
try {
const result = await execute(item.request);
item.resolve(result);
} catch (error) {
item.reject(error);
}
}
} finally {
// Always clear the flag so a thrown budget check cannot stall the queue
this.processing = false;
}
}
}
Cost alerts and monitoring
Alert configuration
interface AlertConfig {
userId?: string;
orgId?: string;
thresholdPercent: number; // Alert at X% of limit
channel: 'email' | 'slack' | 'webhook';
destination: string;
}
const defaultAlerts: AlertConfig[] = [
// User approaching limit
{ thresholdPercent: 80, channel: 'email', destination: '{{user.email}}' },
// Org approaching limit
{ orgId: '*', thresholdPercent: 90, channel: 'slack', destination: '#billing-alerts' },
// Global emergency
{ thresholdPercent: 95, channel: 'webhook', destination: 'https://api.internal/emergency' }
];
async function checkAndAlert(usage: UsageStats): Promise<void> {
for (const alert of defaultAlerts) {
const limit = getLimit(alert);
const percent = (usage.current / limit) * 100;
if (percent >= alert.thresholdPercent) {
await sendAlert(alert, {
currentUsage: usage.current,
limit,
percent: percent.toFixed(1)
});
}
}
}
Emergency shutoff
class EmergencyShutoff {
private active = false;
async check(globalUsage: number): Promise<void> {
if (globalUsage >= globalLimits.emergencyShutoffCost) {
this.activate();
}
}
activate(): void {
this.active = true;
// Notify all channels
sendAlert('emergency', {
message: 'Emergency shutoff activated - all AI requests paused',
timestamp: new Date()
});
// Log for investigation
console.error('EMERGENCY SHUTOFF ACTIVATED');
}
deactivate(): void {
this.active = false;
console.log('Emergency shutoff deactivated');
}
isActive(): boolean {
return this.active;
}
}
FAQs
Should I limit by tokens or by cost?
Both. Tokens for immediate throttling, cost for billing alignment. A token limit prevents large requests; a cost limit accounts for model pricing differences.
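To see why a token limit alone is not enough, compare the same token count on two models. The prices below are placeholders for illustration only, not real price-list values, and `requestCost` is a hypothetical helper.

```typescript
interface ModelPrice {
  input: number;  // pounds per 1K input tokens
  output: number; // pounds per 1K output tokens
}

// Placeholder prices for illustration; check your provider's current price list.
const EXAMPLE_PRICING: Record<string, ModelPrice> = {
  'large-model': { input: 0.004, output: 0.012 },
  'small-model': { input: 0.0002, output: 0.0006 },
};

function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = EXAMPLE_PRICING[model];
  if (!price) throw new Error(`No pricing configured for ${model}`);
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// Same 11,000 tokens, very different spend.
const large = requestCost('large-model', 10_000, 1_000);
const small = requestCost('small-model', 10_000, 1_000);
console.log(large.toFixed(4)); // 0.0520
console.log(small.toFixed(4)); // 0.0026
```

Under these placeholder prices an identical token budget costs 20 times more on the larger model, which is the gap a cost limit is there to catch.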
How do I handle legitimate high-usage users?
Offer higher tier plans with increased limits. Monitor usage patterns to identify power users for outreach. Consider custom enterprise plans.
What about provider rate limits?
Track your OpenAI/Anthropic rate limit headers and adjust accordingly. Implement adaptive throttling that backs off when approaching provider limits.
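As a rough sketch of header tracking: the function below reads remaining-token headers from a response and decides whether to back off. The header names follow OpenAI's documented `x-ratelimit-*` convention plus the standard `retry-after` header. Other providers use different names, so treat them as assumptions to verify against your provider's documentation. Only numeric `retry-after` values are handled, and `readProviderBudget` and `shouldBackOff` are hypothetical helpers.

```typescript
interface ProviderBudget {
  limitTokens?: number;
  remainingTokens?: number;
  retryAfterSeconds?: number;
}

function parseNumber(value: string | undefined): number | undefined {
  if (value === undefined || value.trim() === '') return undefined;
  const n = Number(value);
  return Number.isFinite(n) ? n : undefined;
}

// `headers` is a plain map of lower-cased header names to values.
function readProviderBudget(headers: Record<string, string | undefined>): ProviderBudget {
  return {
    limitTokens: parseNumber(headers['x-ratelimit-limit-tokens']),
    remainingTokens: parseNumber(headers['x-ratelimit-remaining-tokens']),
    retryAfterSeconds: parseNumber(headers['retry-after']),
  };
}

// Back off when the provider asked us to, or when less than `threshold`
// of the provider's token budget remains.
function shouldBackOff(budget: ProviderBudget, threshold = 0.1): boolean {
  if (budget.retryAfterSeconds !== undefined) return true;
  if (budget.limitTokens === undefined || budget.remainingTokens === undefined) return false;
  return budget.remainingTokens / budget.limitTokens < threshold;
}

const nearlyExhausted = readProviderBudget({
  'x-ratelimit-limit-tokens': '100000',
  'x-ratelimit-remaining-tokens': '5000',
});
console.log(shouldBackOff(nearlyExhausted)); // true
```

The result can feed the `AdaptiveThrottler` shown earlier, for example by calling `recordRateLimit()` when `shouldBackOff` returns true.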
How granular should limits be?
Start with user and org level. Add per-agent or per-feature limits if you notice specific areas driving costs. Too granular becomes hard to manage.
What's a fair free tier limit?
Enough for meaningful trial use without enabling abuse. We use £0.50/day (roughly 50 GPT-4o requests or 500 GPT-4o-mini). Adjust based on your cost tolerance.
Summary and next steps
Rate limiting for AI agents is fundamentally about cost control. Multi-level limits (request, user, org, global) catch different failure modes. Pre-flight estimation prevents expensive mistakes. Graceful degradation maintains service quality.
Implementation checklist:
- Implement token counting and cost estimation
- Build sliding window usage tracking
- Add per-user and per-org limits based on tiers
- Create pre-flight checks before expensive calls
- Implement graceful degradation strategies
- Set up cost alerts and emergency shutoff
Quick wins:
- Add basic request-level limits (max tokens, max tools)
- Track usage per user even before enforcing limits
- Set up daily cost alerts
Internal links:
- /blog/prompt-caching-cost-optimisation-ai-agents
- /blog/llm-cost-optimization-ai-agents
- /blog/ai-agent-retry-strategies-exponential-backoff