Performance and sizing
This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.
Resource requirements
Baseline resources
Minimal deployment (development/testing):
- CPU: 100m (0.1 cores)
- Memory: 128Mi
Production deployment (recommended):
- CPU: 500m (0.5 cores)
- Memory: 512Mi
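As a sketch, the production baseline maps onto the vMCP pod template like this, mirroring the `podTemplateSpec` structure used in the scaling examples later in this guide (the limits shown are illustrative, not a documented recommendation):

```yaml
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              # Production baseline from above
              cpu: 500m
              memory: 512Mi
            limits:
              # Illustrative limits; tune for your workload
              cpu: '1'
              memory: 1Gi
```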
Scaling factors
Resource needs increase based on:
- Number of backends: Each backend adds minimal overhead (~10-20MB memory)
- Request volume: Higher traffic requires more CPU for request processing
- Composite tool complexity: Workflows with many parallel steps consume more memory
- Token caching: Authentication token cache grows with unique client count
Backend scale recommendations
vMCP performs well across different scales:
| Backend Count | Use Case | Notes |
|---|---|---|
| 1-5 | Small teams, focused toolsets | Minimal resource overhead |
| 5-15 | Medium teams, diverse tools | Recommended range for most use cases |
| 15-30 | Large teams, comprehensive toolsets | Increase health check interval |
| 30+ | Enterprise-scale deployments | Consider multiple vMCP instances |
Performance characteristics
Backend discovery
- Timing: Happens once per client session
- Duration: Typically completes in 1-3 seconds for 10 backends
- Timeout: 15 seconds (returns HTTP 504 on timeout)
- Parallelism: Backends queried concurrently for capabilities
Health checks
- Interval: Every 30 seconds by default (configurable)
- Impact: Minimal overhead on backend servers
- Timeout: 10 seconds by default (configurable via `healthCheckTimeout`)
- Configuration: See Configure health checks
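For reference, here is a hedged sketch of what the defaults above look like when set explicitly. This assumes `healthCheckTimeout` sits alongside `healthCheckInterval` under `failureHandling`, as the large-backend-count example later in this guide suggests:

```yaml
spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s # default; check each backend every 30 seconds
        healthCheckTimeout: 10s # default; fail a check that exceeds 10 seconds
```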
Tool routing
- Overhead: Single-digit millisecond latency for routing and conflict resolution
- Caching: Routing table cached per session for consistent behavior
- Lookup: O(1) hash table lookup for tool/resource/prompt routing
Composite workflows
- Parallelism: Up to 10 parallel step executions (hard-coded)
- Execution model: DAG-based with dependency resolution
- Bottleneck: Limited by slowest backend response time in each level
- Memory: Step results cached in memory during workflow execution
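To make the execution model concrete, here is a hypothetical workflow sketch; the step and dependency field names are illustrative, not the exact vMCP schema. Two independent steps form the first DAG level and run in parallel, while the third step depends on both, so its level starts only after the slower of the two responses:

```yaml
# Hypothetical sketch -- field names are illustrative
steps:
  - id: fetch-orders # level 1: no dependencies,
    tool: orders.list # runs in parallel with fetch-users
  - id: fetch-users # level 1
    tool: users.list
  - id: merge-report # level 2: waits for both level-1 steps,
    tool: report.build # so it starts after the slower response
    dependsOn: [fetch-orders, fetch-users]
```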
Token caching
- Reduction: 90%+ lower authentication overhead for repeated requests
- Duration: Tokens cached until expiration
- Scope: Per-client, per-backend token cache
- Impact: Significantly improves response times for authenticated backends
Horizontal scaling
vMCP keeps no shared state between instances, so it supports horizontal scaling:
Scaling characteristics
- Independence: Each vMCP instance operates independently
- Session state: vMCP uses session IDs internally to cache routing tables and maintain consistency within a session
- State: No shared state between instances
- Method: Scale by increasing replicas in the Deployment
Example scaling configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec
```
Load balancing
When using multiple replicas, clients must be routed to the same vMCP instance for the duration of their session (to maintain routing table consistency). Configure session affinity based on your deployment:
- Kubernetes Service: Use `sessionAffinity: ClientIP` for basic client-to-pod stickiness
  - Note: This is IP-based and may not work well behind proxies or with changing client IPs
- Ingress Controller: Configure cookie-based sticky sessions (recommended; see the example below)
  - nginx: Use `nginx.ingress.kubernetes.io/affinity: cookie`
  - Other controllers: Consult your Ingress controller's documentation
- Gateway API: Use the session affinity configuration appropriate to your Gateway implementation
For production deployments behind an Ingress, use cookie-based sticky
sessions rather than ClientIP affinity. This works reliably even when
traffic comes through proxies or load balancers.
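For example, a minimal nginx Ingress sketch with cookie-based affinity. The annotations are standard nginx Ingress controller annotations; the host, service name, and port are placeholders for your deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmcp-ingress
  annotations:
    # Cookie-based sticky sessions (recommended over ClientIP)
    nginx.ingress.kubernetes.io/affinity: cookie
    nginx.ingress.kubernetes.io/session-cookie-name: vmcp-session
spec:
  rules:
    - host: vmcp.example.com # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp # placeholder service name
                port:
                  number: 8080 # placeholder port
```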
When to scale
Scale up (increase resources)
Increase CPU and memory when you observe:
- High CPU usage (>70% sustained) during normal operations
- Memory pressure or OOM (out-of-memory) kills
- Slow response times (>1 second) for simple tool calls
- Health check timeouts or frequent backend unavailability
Scale out (increase replicas)
Add more vMCP instances when:
- CPU usage remains high despite increasing resources
- You need higher availability and fault tolerance
- Request volume exceeds capacity of a single instance
- You want to distribute load across multiple availability zones
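If you prefer to manage replica count automatically, a standard HorizontalPodAutoscaler targeting CPU utilization is one option. This is a sketch: the Deployment name matches the earlier scaling example, and the replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2 # illustrative floor for availability
  maxReplicas: 6 # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # matches the >70% sustained-CPU signal above
```

Remember that with multiple replicas, the session affinity guidance in the load balancing section still applies.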
Scale configuration
Adjust operational settings when scaling:
Configuration for large backend counts (15+)
```yaml
spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s
        # Increase thresholds for better stability
        unhealthyThreshold: 5
```
Configuration for high request volumes
```yaml
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi
```
Performance optimization
Reduce backend discovery time
- Use inline mode for static backend configurations (eliminates Kubernetes API queries)
- Minimize backend count by grouping related tools in fewer servers
- Ensure backends respond quickly to `initialize` requests
Reduce authentication overhead
- Enable token caching (enabled by default)
- Use unauthenticated mode for internal/trusted backends
- Configure appropriate token expiration in your OIDC provider
Optimize composite workflows
- Minimize dependencies between steps to maximize parallelism
- Use step-level error handling with `onError.action: continue` to prevent a single step failure from blocking the entire workflow
- Set appropriate timeouts for slow backends using the `timeout` field on individual steps (see the sketch below)
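Putting those tips together, a hypothetical step list might look like this. Only `onError.action` and `timeout` are named in this guide; the surrounding field names are illustrative:

```yaml
# Hypothetical sketch -- surrounding field names are illustrative
steps:
  - id: enrich-optional
    tool: metadata.lookup
    timeout: 10s # bound a slow backend so it can't stall the workflow
    onError:
      action: continue # a failure here doesn't block downstream steps
  - id: main-call
    tool: primary.run
    timeout: 60s # longer budget for the critical step
```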
Monitor performance
Use the vMCP telemetry integration to monitor:
- Backend request latency and error rates
- Workflow execution times and failure patterns
- Health check success/failure rates
See Telemetry and metrics for configuration details.