Chaos Engineering

Inject failures, test resilience, and get a resilience score for your infrastructure.

Overview

Test how your system handles failure before your users find out. RaidFrame's chaos engineering tools let you inject controlled failures into your services and measure the impact.

Inject Latency

Add artificial latency to a service:

rf chaos latency api --ms 500 --duration 5m
⚡ Chaos experiment started
  Service: api
  Type: latency injection (+500ms)
  Duration: 5 minutes
  Expires: 2026-03-16T14:28:00Z

  Stop early: rf chaos stop api

During the experiment, monitor your other services to see how they handle the degraded dependency.
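A latency experiment is most informative when callers enforce their own deadlines rather than inherit the slowdown. As an illustrative, app-side sketch (not RaidFrame code; `slow_dependency` is a hypothetical stand-in for a call to the slowed service), a deadline-guarded call in Python:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def call_with_deadline(fn, timeout_s, fallback):
    """Give up after timeout_s and serve the fallback instead of
    inheriting the injected latency."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # don't block on the slow call

def slow_dependency():
    time.sleep(0.7)  # stands in for a call to api with +500ms injected
    return "fresh"

# The 200ms budget trips before the slowed dependency responds.
print(call_with_deadline(slow_dependency, 0.2, "cached"))  # -> cached
```

If dependent services have no such budget, the injected 500ms will propagate through every caller, which is exactly what this experiment is designed to reveal.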

Kill Instances

Terminate a random instance to test auto-recovery:

rf chaos kill api --instance random
⚡ Killed instance i-abc123 (api)
  Remaining: 3/4 instances healthy
  Auto-scaling: new instance launching...

  Recovery time: 8 seconds (new instance healthy)

Periodic Instance Kills

rf chaos kill api --instance random --interval 10m --duration 1h

Kills a random instance every 10 minutes for 1 hour. Tests your auto-scaling, health checks, and load balancing under sustained failure.

Network Failures

Packet Loss

rf chaos network api --packet-loss 10% --duration 5m

Drops 10% of packets to and from the api service for 5 minutes.

Connection Timeout

Simulate a database going unresponsive:

rf chaos network pg-main --blackhole --duration 2m

All connections to the database will hang for 2 minutes — no response and no connection error, so clients that wait indefinitely will stall. Tests your app's timeout handling and circuit breakers.
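A circuit breaker is what keeps a blackholed dependency from stalling every request: after a few failures it stops calling the dependency and fails fast. A minimal sketch in Python (illustrative app code, not part of RaidFrame; in a real blackhole the wrapped call also needs its own timeout so the first failures are detected at all):

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures;
    while open, return the fallback immediately instead of waiting
    on a dead dependency."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback            # circuit open: fail fast
            self.opened_at = None          # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0                  # success closes the circuit
        return result

breaker = CircuitBreaker(threshold=2)

def dead_db():
    raise ConnectionError("blackholed")

print(breaker.call(dead_db, "stale"))  # -> stale (1st failure)
print(breaker.call(dead_db, "stale"))  # -> stale (2nd failure; circuit opens)
print(breaker.call(dead_db, "stale"))  # -> stale (fast path; dead_db not called)
```

During the experiment, a service with a breaker degrades (serving fallbacks) while one without it exhausts its connection pool waiting.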

DNS Failure

rf chaos dns api --fail-rate 20% --duration 5m

20% of DNS lookups from the api service will fail. Tests retry logic and fallback behavior.
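With a 20% fail rate, a single retry already cuts the effective failure rate to 4%, and two retries to 0.8%. A sketch of retry-with-backoff around a flaky lookup (illustrative Python, hypothetical resolver function, not RaidFrame code):

```python
import random
import time

def resolve_with_retry(lookup, attempts=3, base_delay=0.05):
    """Retry a flaky lookup with exponential backoff; re-raise only
    after every attempt has failed."""
    for attempt in range(attempts):
        try:
            return lookup()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated resolver failing ~20% of the time, like --fail-rate 20%.
def flaky_lookup():
    if random.random() < 0.2:
        raise OSError("DNS lookup failed")
    return "10.0.0.7"

print(resolve_with_retry(flaky_lookup))
```

Fallback behavior (e.g. a cached last-known-good address) can live in the caller's `except OSError` once retries are exhausted.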

Resource Exhaustion

CPU Stress

rf chaos cpu api --percent 90 --duration 5m

Drives CPU usage on the api service to 90% for 5 minutes.

Memory Pressure

rf chaos memory api --fill 80% --duration 5m

Fills 80% of the api service's memory for 5 minutes.

Disk Full

rf chaos disk api --fill 95% --duration 5m

Fills the api service's disk to 95% for 5 minutes.
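A disk-full experiment is only survivable if the service sheds non-essential writes before the disk is exhausted. One common pattern, sketched in Python (illustrative, not RaidFrame code; the 10% threshold is an assumed policy):

```python
import shutil

def has_disk_headroom(path="/", min_free_fraction=0.10):
    """Return False when free space drops below the threshold, so
    optional writes can be skipped instead of crashing the service."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

def write_cache_entry(path, data):
    """Best-effort cache write: shed the work under disk pressure."""
    if not has_disk_headroom():
        return False               # degraded, but still serving requests
    with open(path, "wb") as f:
        f.write(data)
    return True
```

Under `--fill 95%`, a service with this guard keeps answering requests (without caching); one without it typically starts throwing write errors on every request.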

Chaos Experiments

Define repeatable experiments:

chaos:
  experiments:
    database-failure:
      description: "Simulate database outage"
      steps:
        - type: network_blackhole
          target: pg-main
          duration: 2m
        - type: wait
          duration: 30s
        - type: assert
          condition: "api.error_rate < 100%"
          message: "API should degrade gracefully, not crash"

    instance-resilience:
      description: "Kill instances under load"
      steps:
        - type: kill_instance
          target: api
          count: 1
          interval: 5m
          duration: 30m
        - type: assert
          condition: "api.p99 < 500ms"
          message: "P99 should remain under 500ms during recovery"

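The `assert` conditions above compare an observed metric against a threshold. A tiny evaluator for that shape of expression might look like this (a hypothetical sketch for illustration — RaidFrame's actual parser is not documented here):

```python
import operator
import re

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def check_condition(condition, metrics):
    """Evaluate a condition like 'api.error_rate < 100%' against a
    dict of observed metric values."""
    m = re.match(r"\s*([\w.]+)\s*(<=|>=|<|>)\s*([\d.]+)\s*(%|ms)?\s*$", condition)
    if not m:
        raise ValueError(f"unparseable condition: {condition!r}")
    metric, op, value, _unit = m.groups()
    return OPS[op](metrics[metric], float(value))

metrics = {"api.error_rate": 42.0, "api.p99": 310.0}
print(check_condition("api.error_rate < 100%", metrics))  # -> True
print(check_condition("api.p99 < 500ms", metrics))        # -> True
```

Note that `< 100%` passes as long as the peak error rate stays below total failure — the experiment asserts graceful degradation, not zero errors.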
Run an experiment:

rf chaos run database-failure
⚡ Running experiment: database-failure
  Step 1: Network blackhole on pg-main (2m)
    api error_rate: 0% → 15% → 42%
  Step 2: Wait 30s (recovery period)
    api error_rate: 42% → 8% → 0.5%
  Step 3: Assert api.error_rate < 100%
    ✓ PASSED (peak: 42%, circuit breaker triggered correctly)

Experiment complete: PASSED

Resilience Score

Get an automated assessment of your infrastructure's resilience:

rf chaos resilience-report
RESILIENCE REPORT — my-saas
════════════════════════════

Score: 7/10

✓ Health checks configured on all web services
✓ Auto-scaling enabled (min > 1 on production web services)
✓ Database has automated backups
✓ Read replicas configured
✓ Multiple instances running (no single points of failure)
✓ Circuit breakers detected in application code
✓ Retry logic detected for database connections

⚠ No connection timeout configured for Redis (could hang on failure)
⚠ Worker service has min=1 (single point of failure)
✗ No multi-region deployment (single region failure = full outage)

Recommendations:
  1. Set Redis connection timeout: rf db config cache timeout 5000
  2. Scale worker to min=2: rf services scale worker --min 2
  3. Add a second region: rf regions add eu-west-1

Safety

All chaos experiments:

  • Require explicit confirmation before running on production
  • Have automatic duration limits (max 1 hour)
  • Can be stopped instantly with rf chaos stop
  • Are logged in the audit trail
  • Trigger alerts so your team knows an experiment is active

# Stop all active experiments
rf chaos stop --all

# View experiment history
rf chaos history