Cloud Application Monitoring: The Complete Observability Guide

Set up logs, metrics, traces, and alerts for your cloud application without Datadog or New Relic. A practical guide to production observability.

RaidFrame Team

February 28, 2026 · 5 min read

TL;DR — You need four things to monitor a production app: logs (what happened), metrics (how much), traces (where it's slow), and alerts (tell me when it breaks). Most teams pay $23/host/month for Datadog to get this. On RaidFrame, it's built in at no extra cost.

The four pillars of observability

1. Logs — what happened

Every console.log(), print(), and logger.info() call in your app is captured, timestamped, and searchable.

rf logs --service api --since 1h --level error

Good logging means structured output:

// Bad
console.log("User signed up");
 
// Good
console.log(JSON.stringify({
  level: "info",
  event: "user.signup",
  user_id: "u_123",
  plan: "pro",
  source: "google_ads",
}));

Structured logs are searchable and filterable by field. You can run rf logs --search "event=user.signup" instead of grepping through free text.
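To avoid repeating JSON.stringify at every call site, the pattern above can be wrapped in a small helper. The following is a minimal sketch; formatLog and log are hypothetical names for illustration and are not part of any RaidFrame SDK.

```typescript
// Hypothetical helper names; not part of any RaidFrame SDK.
type LogLevel = "debug" | "info" | "warn" | "error";
type LogFields = Record<string, string | number | boolean | null>;

// Build one JSON log line. The reserved keys (ts, level, event) are written
// after the caller's fields, so caller-supplied values cannot overwrite them.
function formatLog(
  level: LogLevel,
  event: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, event });
}

// Write the line to stdout, one JSON object per line.
function log(level: LogLevel, event: string, fields: LogFields = {}): void {
  console.log(formatLog(level, event, fields));
}

log("info", "user.signup", { user_id: "u_123", plan: "pro", source: "google_ads" });
```

Keeping the formatting in a pure function (formatLog) separate from the output call makes the log shape easy to unit test.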

2. Metrics — how much

Numbers over time. CPU usage, memory, request rate, error rate, latency percentiles.

rf metrics --service api --period 24h

CPU:     34% avg, 89% peak
Memory:  1.2 GB avg, 1.8 GB peak
Requests: 142/s avg
P50:     12ms
P95:     89ms
P99:     230ms
Errors:  0.3%
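The P50, P95, and P99 lines are latency percentiles. As a rough illustration of what those numbers mean, here is a sketch of the nearest-rank percentile method applied to made-up latency samples. How RaidFrame computes its percentiles is not stated in this guide, so treat this as one common definition rather than the platform's method.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1] as number;
}

// Hypothetical request latencies in milliseconds.
const latenciesMs = [12, 15, 11, 230, 89, 14, 13, 12, 16, 10];

console.log(percentile(latenciesMs, 50)); // 13
console.log(percentile(latenciesMs, 95)); // 230
console.log(percentile(latenciesMs, 99)); // 230
```

The median barely notices the two slow requests, while the tail percentiles expose them. That is why latency alerts usually target P95 or P99 rather than the average.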

The metrics that matter:

Metric	What it tells you	Alert when
Error rate	Are users seeing failures?	> 1% for 5 min
P99 latency	Are the slowest 1% of requests too slow?	> 1s for 5 min
CPU usage	Are you running out of compute?	> 85% for 3 min
Memory usage	Is a leak pushing memory toward the limit?	> 90% for 5 min
Queue depth	Is work piling up?	> 1000 for 10 min

3. Traces — where it's slow

A request hits your load balancer, goes to your API, queries the database, calls Redis, and returns. Where did the 820ms go?

Distributed tracing answers that:

GET /api/orders (820ms)
├── [api] parseRequest (2ms)
├── [api] authenticate (15ms)
├── [db] SELECT * FROM orders WHERE user_id = $1 (680ms) ⚠ SLOW
├── [cache] Redis GET order_count (3ms)
└── [api] serializeResponse (8ms)

The database query took 680ms of an 820ms request, about 83% of the total. That's where to optimize.
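The same reading of the trace can be done in code. This sketch uses the span durations from the example trace; the Span shape and the function names are assumptions for illustration, not a RaidFrame or OpenTelemetry API.

```typescript
// Hypothetical span shape for illustration.
interface Span {
  service: string;
  name: string;
  durationMs: number;
}

// Durations copied from the example trace for GET /api/orders.
const totalMs = 820;
const spans: Span[] = [
  { service: "api", name: "parseRequest", durationMs: 2 },
  { service: "api", name: "authenticate", durationMs: 15 },
  { service: "db", name: "SELECT * FROM orders WHERE user_id = $1", durationMs: 680 },
  { service: "cache", name: "Redis GET order_count", durationMs: 3 },
  { service: "api", name: "serializeResponse", durationMs: 8 },
];

// Return the span with the largest duration.
function slowestSpan(all: Span[]): Span {
  if (all.length === 0) throw new Error("no spans");
  return all.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst));
}

// Percentage of the request spent in one span, rounded to one decimal place.
function sharePercent(span: Span, requestMs: number): number {
  return Math.round((span.durationMs / requestMs) * 1000) / 10;
}

// Time not covered by any child span, assuming the spans ran sequentially.
function untracedMs(all: Span[], requestMs: number): number {
  return requestMs - all.reduce((sum, s) => sum + s.durationMs, 0);
}

const worst = slowestSpan(spans);
console.log(`${worst.service}: ${sharePercent(worst, totalMs)}% of the request`); // db: 82.9% of the request
console.log(`untraced: ${untracedMs(spans, totalMs)}ms`); // untraced: 112ms
```

The child spans add up to 708ms, so 112ms of the request is not attributed to any span. Gaps like this usually point at uninstrumented code, queueing, or network time.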

4. Alerts — tell me when it breaks

Don't watch dashboards. Set alerts and go build.

rf alerts create --name "API Errors" --metric error_rate --service api --threshold "> 1%" --window 5m --notify slack
rf alerts create --name "Slow Responses" --metric response_time_p99 --service api --threshold "> 1000ms" --window 5m --notify slack
rf alerts create --name "CPU High" --metric cpu_percent --service api --threshold "> 85%" --window 3m --notify email
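The --threshold and --window flags read like a rule of the form "fire when the metric stays above the threshold for the whole window." The exact semantics RaidFrame uses are not stated here, so the following is only a sketch of that interpretation, with hypothetical names and sample data, using the CPU rule (> 85% over 3 minutes).

```typescript
// Hypothetical sample shape for illustration.
interface Sample {
  timestampMs: number;
  value: number;
}

// Fire only if there is at least one sample in the trailing window and
// every sample in that window is above the threshold.
function shouldFire(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - windowMs;
  const inWindow = samples.filter(
    (s) => s.timestampMs > windowStart && s.timestampMs <= nowMs,
  );
  if (inWindow.length === 0) return false;
  return inWindow.every((s) => s.value > threshold);
}

const MINUTE = 60000;
// One CPU reading per minute, in percent (made-up values).
const sustained: Sample[] = [70, 88, 91, 87].map((value, i) => ({ timestampMs: i * MINUTE, value }));
const brief: Sample[] = [70, 88, 80, 87].map((value, i) => ({ timestampMs: i * MINUTE, value }));

console.log(shouldFire(sustained, 85, 3 * MINUTE, 3 * MINUTE)); // true
console.log(shouldFire(brief, 85, 3 * MINUTE, 3 * MINUTE)); // false
```

Requiring the condition to hold across the window is what keeps a single spike from paging anyone. A production evaluator would also check that the window contains enough samples before firing.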

What to monitor at each stage

Pre-launch (0 users)

Just logs and uptime:

rf alerts create --name "Site Down" --type uptime --url https://myapp.com --interval 30s --notify email

First users (1-1000)

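Add error rate and latency alerts. The error rate alert is shown here; a latency alert follows the same pattern with --metric response_time_p99: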

rf alerts create --name "Errors" --metric error_rate --service api --threshold "> 5%" --window 5m --notify slack

Growth (1K-10K users)

Add database monitoring, queue depth, and custom metrics:

rf metrics --service pg-main
rf alerts create --name "DB Slow" --metric db_query_time_ms --threshold "> 500" --window 5m --notify slack
rf alerts create --name "Queue Backlog" --metric queue_depth --service tasks --threshold "> 500" --window 10m --notify slack

Scale (10K+ users)

Add distributed tracing, anomaly detection, and cost alerts:

rf alerts create --name "Traffic Anomaly" --metric request_count --service api --type anomaly --notify pagerduty
rf budget set 500 --notify email --at 80%,100%
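The --type anomaly alert does not say how an anomaly is detected, and this guide does not document RaidFrame's method. As a point of reference, one simple baseline technique is a z-score test: flag a value when it sits more than a few standard deviations from the recent mean. The sketch below assumes made-up request counts and a hypothetical isAnomalous function.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of the samples.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Flag `current` when it is more than `zThreshold` standard deviations
// away from the mean of `history`.
function isAnomalous(history: number[], current: number, zThreshold: number = 3): boolean {
  if (history.length < 2) return false; // not enough data for a baseline
  const m = mean(history);
  const sd = stdDev(history);
  if (sd === 0) return current !== m; // flat baseline: any change is unusual
  return Math.abs(current - m) / sd > zThreshold;
}

// Hypothetical requests per second over the last eight intervals.
const recentRequestRate = [100, 104, 98, 101, 97, 102, 99, 103];

console.log(isAnomalous(recentRequestRate, 103)); // false
console.log(isAnomalous(recentRequestRate, 140)); // true
```

A z-score test ignores daily and weekly seasonality, so real anomaly detectors typically compare against the same time window on previous days rather than only the most recent samples.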

Custom metrics

Push your own business metrics:

import { Metrics } from "@raidframe/sdk";
 
const metrics = new Metrics();
 
// Track signups
metrics.increment("business.signups");
 
// Track revenue
metrics.gauge("business.mrr", 4500);
 
// Track checkout time
metrics.histogram("business.checkout_duration_ms", 340);

Custom metrics appear alongside system metrics and can trigger alerts and auto-scaling.
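The three calls above correspond to the three conventional metric types: a counter accumulates, a gauge keeps only the latest value, and a histogram records every observation so percentiles can be derived later. The in-memory class below is a sketch of those semantics only. It is a hypothetical stand-in for local tests and is not how the @raidframe/sdk Metrics class is implemented.

```typescript
// Hypothetical in-memory stand-in; not the @raidframe/sdk implementation.
class InMemoryMetrics {
  private counters = new Map<string, number>();
  private gauges = new Map<string, number>();
  private histograms = new Map<string, number[]>();

  // Counter: each call adds to a running total.
  increment(name: string, by: number = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  // Gauge: each call replaces the previous value.
  gauge(name: string, value: number): void {
    this.gauges.set(name, value);
  }

  // Histogram: each call appends one observation.
  histogram(name: string, value: number): void {
    const values = this.histograms.get(name) ?? [];
    values.push(value);
    this.histograms.set(name, values);
  }

  counterValue(name: string): number {
    return this.counters.get(name) ?? 0;
  }

  gaugeValue(name: string): number | undefined {
    return this.gauges.get(name);
  }

  observations(name: string): number[] {
    return [...(this.histograms.get(name) ?? [])];
  }
}

const metrics = new InMemoryMetrics();
metrics.increment("business.signups");
metrics.increment("business.signups");
metrics.gauge("business.mrr", 4500);
metrics.gauge("business.mrr", 4700);
metrics.histogram("business.checkout_duration_ms", 340);
metrics.histogram("business.checkout_duration_ms", 410);

console.log(metrics.counterValue("business.signups")); // 2
console.log(metrics.gaugeValue("business.mrr")); // 4700
console.log(metrics.observations("business.checkout_duration_ms")); // [ 340, 410 ]
```

As a rule of thumb: count events with increment, report current state with gauge, and record durations or sizes with histogram.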

The Datadog question

Datadog costs $23/host/month for infrastructure monitoring, $36/host/month for APM, plus $0.10 per GB of logs. For a small team running 4 hosts with APM:

Datadog: 4 × ($23 + $36) = $236/month just for monitoring, before any log ingestion charges.

RaidFrame: $0, since monitoring is built into the platform.
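To estimate the figure for your own fleet, the arithmetic can be written as a small function. The prices are the ones quoted in this article; the 200 GB log volume in the second example is a made-up input, and real Datadog invoices depend on plan, commitment, and usage.

```typescript
// Per-unit prices as quoted in this article (USD per month).
const INFRA_PER_HOST = 23;
const APM_PER_HOST = 36;
const LOGS_PER_GB = 0.1;

// Estimated monthly cost for `hosts` hosts with APM, plus `logGb` GB of logs.
function estimateMonthlyCost(hosts: number, logGb: number = 0): number {
  return hosts * (INFRA_PER_HOST + APM_PER_HOST) + logGb * LOGS_PER_GB;
}

console.log(estimateMonthlyCost(4)); // 236
console.log(estimateMonthlyCost(4, 200)); // 256 (hypothetical 200 GB of logs)
```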

If you need to export to Grafana for custom dashboards, RaidFrame exposes a Prometheus endpoint:

rf metrics prometheus-endpoint

Connect Grafana and build whatever dashboards you want.

FAQ

Do I need to install an agent?

No. Metrics, logs, and traces are collected automatically. No APM agent, no sidecar container, no instrumentation library required for basic monitoring.

How long are logs retained?

3 days on Starter, 30 days on Pro, 90 days on Enterprise. Export to S3-compatible storage for longer retention.

Can I use OpenTelemetry?

Yes. RaidFrame accepts OpenTelemetry traces via the OTLP endpoint. Use the OpenTelemetry SDK for richer trace data.

Does monitoring affect performance?

The impact is negligible. Log collection adds less than 1ms per request, metrics are sampled, and tracing overhead is under 2%.

Can I send alerts to PagerDuty?

Yes. Slack, email, PagerDuty, OpsGenie, and custom webhooks are all supported.
