Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi
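
As a rough sketch, here's how the production values above map onto the vMCP pod template (the field layout mirrors the resource example later in this guide; the limits are illustrative headroom, not a documented recommendation):

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m     # production baseline from the list above
              memory: 512Mi
            limits:
              cpu: '1'      # illustrative headroom; tune to your workload
              memory: 1Gi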

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20MB memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

| Backend Count | Use Case                            | Notes                                 |
| ------------- | ----------------------------------- | ------------------------------------- |
| 1-5           | Small teams, focused toolsets       | Minimal resource overhead             |
| 5-15          | Medium teams, diverse tools         | Recommended range for most use cases  |
| 15-30         | Large teams, comprehensive toolsets | Increase health check interval        |
| 30+           | Enterprise-scale deployments        | Consider multiple vMCP instances      |

Performance characteristics

Backend discovery

  • Timing: Happens once per client session
  • Duration: Typically completes in 1-3 seconds for 10 backends
  • Timeout: 15 seconds (returns HTTP 504 on timeout)
  • Parallelism: Backends queried concurrently for capabilities
Health checks

  • Interval: Every 30 seconds by default (configurable)
  • Impact: Minimal overhead on backend servers
  • Timeout: 10 seconds by default (configurable via healthCheckTimeout)
  • Configuration: See Configure health checks
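
The health check settings above live under the operational failure-handling config. A minimal sketch, assuming healthCheckTimeout sits in the same failureHandling block as healthCheckInterval (only healthCheckInterval and unhealthyThreshold appear in the scaling example later in this guide):

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s  # default; raise for large backend counts
        healthCheckTimeout: 10s   # default; placement alongside interval is assumed
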
Tool routing

  • Overhead: Single-digit millisecond latency for routing and conflict resolution
  • Caching: Routing table cached per session for consistent behavior
  • Lookup: O(1) hash table lookup for tool/resource/prompt routing
Composite workflows

  • Parallelism: Up to 10 parallel step executions (hard-coded)
  • Execution model: DAG-based with dependency resolution
  • Bottleneck: Limited by the slowest backend response time at each level of the DAG
  • Memory: Step results cached in memory during workflow execution
Token caching

  • Reduction: 90%+ reduction in authentication overhead for repeated requests
  • Duration: Tokens cached until expiration
  • Scope: Per-client, per-backend token cache
  • Impact: Significantly improves response times for authenticated backends

Horizontal scaling

vMCP holds only per-session, in-memory state and supports horizontal scaling:

Scaling characteristics

  • Independence: Each vMCP instance operates independently
  • Session state: vMCP uses session IDs internally to cache routing tables and maintain consistency within a session
  • State: No shared state between instances
  • Method: Scale by increasing replicas in the Deployment

Example scaling configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec

Load balancing

When using multiple replicas, clients must be routed to the same vMCP instance for the duration of their session (to maintain routing table consistency). Configure session affinity based on your deployment:

  • Kubernetes Service: Use sessionAffinity: ClientIP for basic client-to-pod stickiness
    • Note: This is IP-based and may not work well behind proxies or with changing client IPs
  • Ingress Controller: Configure cookie-based sticky sessions (recommended)
    • nginx: Use nginx.ingress.kubernetes.io/affinity: cookie
    • Other controllers: Consult your Ingress controller documentation
  • Gateway API: Use appropriate session affinity configuration based on your Gateway implementation
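
For the basic Kubernetes Service option above, a sketch (the name, selector, and ports are placeholders for your actual vMCP Service):

apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp        # placeholder selector
  ports:
    - port: 80
      targetPort: 8080       # placeholder port
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # stickiness window (Kubernetes default)
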
Recommended approach

For production deployments behind an Ingress, use cookie-based sticky sessions rather than ClientIP affinity. This works reliably even when traffic comes through proxies or load balancers.
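
For example, with the nginx Ingress controller, cookie affinity looks roughly like this (the host, service name, port, and cookie name are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmcp-my-vmcp
  annotations:
    nginx.ingress.kubernetes.io/affinity: cookie
    nginx.ingress.kubernetes.io/session-cookie-name: vmcp-session
spec:
  ingressClassName: nginx
  rules:
    - host: vmcp.example.com         # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp   # placeholder Service name
                port:
                  number: 80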

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
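
If you want Kubernetes to add replicas automatically, a HorizontalPodAutoscaler is one option. A sketch targeting the Deployment from the earlier example (the 70% target echoes the CPU threshold above; tune the replica bounds to your environment):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # matches the >70% sustained CPU signal above

Note that scaling down removes pods, which breaks affinity for sessions pinned to them, so pair autoscaling with the session affinity guidance above.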

Scale configuration

Adjust operational settings when scaling:

Configuration for large backend counts (15+)

spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s

        # Increase thresholds for better stability
        unhealthyThreshold: 5

Configuration for high request volumes

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Performance optimization

Reduce backend discovery time

  1. Use inline mode for static backend configurations (eliminates Kubernetes API queries)
  2. Minimize backend count by grouping related tools in fewer servers
  3. Ensure fast backend responses to initialize requests
Reduce authentication overhead

  1. Enable token caching (enabled by default)
  2. Use unauthenticated mode for internal/trusted backends
  3. Configure appropriate token expiration in your OIDC provider
Optimize composite workflows

  1. Minimize dependencies between steps to maximize parallelism
  2. Use step-level error handling with onError.action: continue so a single step failure doesn't block the entire workflow
  3. Set appropriate timeouts for slow backends using the timeout field on individual steps, as shown in the sketch below
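
A hypothetical step snippet showing both settings together (only the timeout field and onError.action: continue come from this guide; the surrounding step schema, including the ids, tool references, and dependency field, is assumed for illustration):

steps:
  - id: fetch-data             # hypothetical step
    tool: backend-a.query      # hypothetical tool reference
    timeout: 30s               # per-step timeout for a slow backend
    onError:
      action: continue         # a failure here won't block later steps
  - id: summarize
    tool: backend-b.summarize  # hypothetical tool reference
    dependsOn:
      - fetch-data             # hypothetical dependency field
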
Monitor performance

Use the vMCP telemetry integration to monitor:

  • Backend request latency and error rates
  • Workflow execution times and failure patterns
  • Health check success/failure rates

See Telemetry and metrics for configuration details.