Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi
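
As a rough sketch, here's how the production values above map onto the vMCP pod template (the field layout mirrors the resource example later in this guide; the limits are illustrative headroom, not a documented recommendation):

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m     # production baseline from the list above
              memory: 512Mi
            limits:
              cpu: '1'      # illustrative headroom; tune to your workload
              memory: 1Gi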

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20MB memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

| Backend Count | Use Case                            | Notes                                 |
| ------------- | ----------------------------------- | ------------------------------------- |
| 1-5           | Small teams, focused toolsets       | Minimal resource overhead             |
| 5-15          | Medium teams, diverse tools         | Recommended range for most use cases  |
| 15-30         | Large teams, comprehensive toolsets | Increase health check interval        |
| 30+           | Enterprise-scale deployments        | Consider multiple vMCP instances      |

Performance characteristics

Backend discovery

  • Timing: Happens once per client session
  • Duration: Typically completes in 1-3 seconds for 10 backends
  • Timeout: 15 seconds (returns HTTP 504 on timeout)
  • Parallelism: Backends queried concurrently for capabilities
Health checks

  • Interval: Every 30 seconds by default (configurable)
  • Impact: Minimal overhead on backend servers
  • Timeout: 10 seconds by default (configurable via healthCheckTimeout)
  • Configuration: See Configure health checks
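
The health check settings above live under the operational failure-handling config. A minimal sketch, assuming healthCheckTimeout sits in the same failureHandling block as healthCheckInterval (only healthCheckInterval and unhealthyThreshold appear in the scaling example later in this guide):

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s  # default; raise for large backend counts
        healthCheckTimeout: 10s   # default; placement alongside interval is assumed
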
Tool routing

  • Overhead: Single-digit millisecond latency for routing and conflict resolution
  • Caching: Routing table cached per session for consistent behavior
  • Lookup: O(1) hash table lookup for tool/resource/prompt routing
Composite workflows

  • Parallelism: Up to 10 parallel step executions (hard-coded)
  • Execution model: DAG-based with dependency resolution
  • Bottleneck: Limited by the slowest backend response time at each level of the DAG
  • Memory: Step results cached in memory during workflow execution
Token caching

  • Reduction: 90%+ reduction in authentication overhead for repeated requests
  • Duration: Tokens cached until expiration
  • Scope: Per-client, per-backend token cache
  • Impact: Significantly improves response times for authenticated backends

Horizontal scaling

vMCP holds only per-session, in-memory state and supports horizontal scaling:

Scaling characteristics

  • Independence: Each vMCP instance operates independently
  • Session state: vMCP uses session IDs internally to cache routing tables and maintain consistency within a session
  • State: No shared state between instances
  • Method: Scale by increasing replicas in the Deployment

Example scaling configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec

Load balancing

When using multiple replicas, clients must be routed to the same vMCP instance for the duration of their session (to maintain routing table consistency). Configure session affinity based on your deployment:

  • Kubernetes Service: Use sessionAffinity: ClientIP for basic client-to-pod stickiness
    • Note: This is IP-based and may not work well behind proxies or with changing client IPs
  • Ingress Controller: Configure cookie-based sticky sessions (recommended)
    • nginx: Use nginx.ingress.kubernetes.io/affinity: cookie
    • Other controllers: Consult your Ingress controller documentation
  • Gateway API: Use appropriate session affinity configuration based on your Gateway implementation
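
For the basic Kubernetes Service option above, a sketch (the name, selector, and ports are placeholders for your actual vMCP Service):

apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp        # placeholder selector
  ports:
    - port: 80
      targetPort: 8080       # placeholder port
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # stickiness window (Kubernetes default)
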
Recommended approach

For production deployments behind an Ingress, use cookie-based sticky sessions rather than ClientIP affinity. This works reliably even when traffic comes through proxies or load balancers.
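
For example, with the nginx Ingress controller, cookie affinity looks roughly like this (the host, service name, port, and cookie name are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmcp-my-vmcp
  annotations:
    nginx.ingress.kubernetes.io/affinity: cookie
    nginx.ingress.kubernetes.io/session-cookie-name: vmcp-session
spec:
  ingressClassName: nginx
  rules:
    - host: vmcp.example.com         # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp   # placeholder Service name
                port:
                  number: 80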

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
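
If you want Kubernetes to add replicas automatically, a HorizontalPodAutoscaler is one option. A sketch targeting the Deployment from the earlier example (the 70% target echoes the CPU threshold above; tune the replica bounds to your environment):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # matches the >70% sustained CPU signal above

Note that scaling down removes pods, which breaks affinity for sessions pinned to them, so pair autoscaling with the session affinity guidance above.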

Scale configuration

Adjust operational settings when scaling:

Configuration for large backend counts (15+)

spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s

        # Increase thresholds for better stability
        unhealthyThreshold: 5

Configuration for high request volumes

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Performance optimization

Reduce backend discovery time

  1. Use inline mode for static backend configurations (eliminates Kubernetes API queries)
  2. Minimize backend count by grouping related tools in fewer servers
  3. Ensure fast backend responses to initialize requests
Reduce authentication overhead

  1. Enable token caching (enabled by default)
  2. Use unauthenticated mode for internal/trusted backends
  3. Configure appropriate token expiration in your OIDC provider
Optimize composite workflows

  1. Minimize dependencies between steps to maximize parallelism
  2. Use step-level error handling with onError.action: continue so a single step failure doesn't block the entire workflow
  3. Set appropriate timeouts for slow backends using the timeout field on individual steps, as shown in the sketch below
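
A hypothetical step snippet showing both settings together (only the timeout field and onError.action: continue come from this guide; the surrounding step schema, including the ids, tool references, and dependency field, is assumed for illustration):

steps:
  - id: fetch-data             # hypothetical step
    tool: backend-a.query      # hypothetical tool reference
    timeout: 30s               # per-step timeout for a slow backend
    onError:
      action: continue         # a failure here won't block later steps
  - id: summarize
    tool: backend-b.summarize  # hypothetical tool reference
    dependsOn:
      - fetch-data             # hypothetical dependency field
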
Monitor performance

Use the vMCP telemetry integration to monitor:

  • Backend request latency and error rates
  • Workflow execution times and failure patterns
  • Health check success/failure rates

See Telemetry and metrics for configuration details.