Inject failures, test resilience, and get a resilience score for your infrastructure.
Test how your system handles failure before your users find out. RaidFrame's chaos engineering tools let you inject controlled failures into your services and measure the impact.
Add artificial latency to a service:
rf chaos latency api --ms 500 --duration 5m
⚡ Chaos experiment started
Service: api
Type: latency injection (+500ms)
Duration: 5 minutes
Expires: 2026-03-16T14:28:00Z
Stop early: rf chaos stop api
During the experiment, monitor your other services to see how they handle the degraded dependency.
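Whether downstream callers tolerate the extra 500ms usually comes down to having an explicit deadline on the dependency call. A minimal sketch in Python of that pattern (the latencies, names, and fallback value are illustrative, not part of RaidFrame):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, deadline_s, fallback):
    """Run fn, but give up after deadline_s and serve a fallback instead."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return fallback  # degraded but fast, e.g. serve cached data

def dependency(latency_s):
    time.sleep(latency_s)  # stands in for the api call (+500ms under chaos)
    return "fresh"

print(call_with_deadline(lambda: dependency(0.05), 0.3, "cached"))  # fresh
print(call_with_deadline(lambda: dependency(0.55), 0.3, "cached"))  # cached
```

With a deadline in place, the latency experiment shows up in your metrics as a rise in fallback responses rather than a stalled request queue.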
Terminate a random instance to test auto-recovery:
rf chaos kill api --instance random
⚡ Killed instance i-abc123 (api)
Remaining: 3/4 instances healthy
Auto-scaling: new instance launching...
Recovery time: 8 seconds (new instance healthy)
rf chaos kill api --instance random --interval 10m --duration 1h
Kills a random instance every 10 minutes for 1 hour. Tests your auto-scaling, health checks, and load balancing under sustained failure.
rf chaos network api --packet-loss 10% --duration 5m
Simulate a database going unresponsive:
rf chaos network pg-main --blackhole --duration 2m
All connections to the database will hang (no response, no timeout) for 2 minutes. Tests your app's timeout handling and circuit breakers.
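A blackhole is exactly the failure a circuit breaker guards against: once the database stops answering, callers should fail fast instead of piling up hung connections. A minimal breaker sketch, assuming timeouts surface as exceptions (illustrative only, not RaidFrame's implementation):

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; probe again after `reset_s`."""

    def __init__(self, threshold=3, reset_s=30.0):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                return fallback            # open: don't touch the dead dependency
            self.opened_at = None          # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0                  # success closes the circuit
        return result

breaker = CircuitBreaker(threshold=2, reset_s=30.0)

def blackholed_query():
    raise TimeoutError("pg-main: connection timed out")  # what the experiment produces

print(breaker.call(blackholed_query, "stale-cache"))  # failure 1: fallback
print(breaker.call(blackholed_query, "stale-cache"))  # failure 2: circuit opens
print(breaker.call(blackholed_query, "stale-cache"))  # open: fallback without touching the DB
```

During the 2-minute blackhole you'd expect the breaker to open early and hold, which is the "circuit breaker triggered correctly" outcome the experiment runner reports below.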
Make DNS lookups fail intermittently:
rf chaos dns api --fail-rate 20% --duration 5m
20% of DNS lookups from the api service will fail. Tests retry logic and fallback behavior.
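Retry logic that survives a 20% failure rate usually means bounded attempts with exponential backoff and jitter. A hedged sketch of that pattern (the flaky resolver is a stand-in for injected DNS failures, not a RaidFrame API):

```python
import random
import time

def retry(fn, attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry fn on OSError with exponential backoff and jitter; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries

# Simulated resolver that fails the first two lookups, like injected DNS failures.
# (socket.gaierror, the real DNS failure, is a subclass of OSError.)
state = {"calls": 0}

def flaky_resolve():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise OSError("DNS lookup failed")
    return "203.0.113.7"  # documentation-range IP, purely illustrative

print(retry(flaky_resolve, base_delay=0.01))  # 203.0.113.7 after two retries
```

Bounded attempts matter here: unbounded retries against a 20% failure rate can amplify load on the resolver just when it is struggling.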
Exhaust CPU, memory, or disk to test behavior under resource pressure:
rf chaos cpu api --percent 90 --duration 5m
rf chaos memory api --fill 80% --duration 5m
rf chaos disk api --fill 95% --duration 5m
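Under CPU or memory pressure, queueing extra work only deepens the hole; shedding load keeps the service responsive for the requests it can still handle. A minimal concurrency-limiter sketch of that idea (illustrative only; the limit and responses are made up):

```python
import threading

class LoadShedder:
    """Cap in-flight work at `limit`; reject overflow immediately instead of queueing it."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def submit(self, fn, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback  # shed load: fail fast while the box is saturated
        try:
            return fn()
        finally:
            self._slots.release()

shedder = LoadShedder(limit=2)
print(shedder.submit(lambda: "handled", "503 shed"))  # handled
```

A resource-exhaustion experiment should show shed responses rising while latency for accepted requests stays flat; if latency climbs across the board instead, the service is queueing rather than shedding.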
Define repeatable experiments:
chaos:
  experiments:
    database-failure:
      description: "Simulate database outage"
      steps:
        - type: network_blackhole
          target: pg-main
          duration: 2m
        - type: wait
          duration: 30s
        - type: assert
          condition: "api.error_rate < 100%"
          message: "API should degrade gracefully, not crash"
    instance-resilience:
      description: "Kill instances under load"
      steps:
        - type: kill_instance
          target: api
          count: 1
          interval: 5m
          duration: 30m
        - type: assert
          condition: "api.p99 < 500ms"
          message: "P99 should remain under 500ms during recovery"
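The condition strings follow a simple `metric operator threshold` shape. How RaidFrame evaluates them isn't documented here, but a checker could look roughly like this (hypothetical sketch; the metric names come from the examples above):

```python
import re

# Parses conditions like "api.p99 < 500ms" or "api.error_rate < 100%".
CONDITION_RE = re.compile(
    r"^(?P<metric>[\w.]+)\s*(?P<op><=|>=|<|>)\s*(?P<value>[\d.]+)(?:ms|%)?$"
)

OPS = {
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
}

def check(condition, metrics):
    """Evaluate a condition string against observed metric values."""
    m = CONDITION_RE.match(condition)
    if m is None:
        raise ValueError(f"unparseable condition: {condition!r}")
    observed = metrics[m["metric"]]
    return OPS[m["op"]](observed, float(m["value"]))

print(check("api.p99 < 500ms", {"api.p99": 320}))              # True
print(check("api.error_rate < 100%", {"api.error_rate": 42}))  # True
```

Note that a comparison on a `%` metric compares raw numbers, so the unit suffix is cosmetic in this sketch.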
Run an experiment:
rf chaos run database-failure
⚡ Running experiment: database-failure
Step 1: Network blackhole on pg-main (2m)
api error_rate: 0% → 15% → 42%
Step 2: Wait 30s (recovery period)
api error_rate: 42% → 8% → 0.5%
Step 3: Assert api.error_rate < 100%
✓ PASSED (peak: 42%, circuit breaker triggered correctly)
Experiment complete: PASSED
Get an automated assessment of your infrastructure's resilience:
rf chaos resilience-report
RESILIENCE REPORT — my-saas
════════════════════════════
Score: 7/10
✓ Health checks configured on all web services
✓ Auto-scaling enabled (min > 1 on production)
✓ Database has automated backups
✓ Read replicas configured
✓ Multiple instances running (no single points of failure)
✓ Circuit breakers detected in application code
✓ Retry logic detected for database connections
⚠ No connection timeout configured for Redis (could hang on failure)
⚠ Worker service has min=1 (single point of failure)
✗ No multi-region deployment (single region failure = full outage)
Recommendations:
1. Set Redis connection timeout: rf db config cache timeout 5000
2. Scale worker to min=2: rf services scale worker --min 2
3. Add a second region: rf regions add eu-west-1
Manage experiments:

# Stop experiments on one service
rf chaos stop api

# Stop all active experiments
rf chaos stop --all
# View experiment history
rf chaos history