Profiling & Monitoring
A fast NestJS service in development can still crumble under production load. Profiling tells you where the time goes — a blocked event loop, a leaking heap, a slow downstream call — while monitoring tells you whether the system is healthy right now. This page covers measuring event-loop lag and memory, exposing liveness and readiness probes with @nestjs/terminus, and emitting Prometheus metrics for latency, throughput, and errors.
Profiling the event loop and memory
Node.js runs your handlers on a single event loop. If a synchronous operation (JSON parsing, crypto, a tight loop) hogs that thread, every concurrent request stalls. The first signal of trouble is event-loop lag: the delay between when a timer should fire and when it actually does.
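To make lag concrete, here is a minimal sketch that schedules a timer, blocks the thread with a busy-wait, and reports how late the timer fired. The helper names `blockFor` and `measureTimerLag` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical helper: burn CPU synchronously for roughly `ms` milliseconds.
function blockFor(ms: number): void {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // busy-wait: nothing else can run on the event loop meanwhile
  }
}

// Schedule a timer for `delayMs`, then block the loop for `blockMs`.
// Resolves with the lag: how many milliseconds late the timer fired.
function measureTimerLag(delayMs: number, blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => {
      const elapsed = performance.now() - scheduledAt;
      resolve(elapsed - delayMs);
    }, delayMs);
    blockFor(blockMs); // the timer callback cannot run until this returns
  });
}

measureTimerLag(10, 100).then((lagMs) => {
  console.log(`timer fired ${lagMs.toFixed(0)}ms late`);
});
```

With a 10 ms timer and a 100 ms block, the reported lag should land somewhere near 90 ms: the timer was due long before the synchronous work released the thread.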
The built-in perf_hooks.monitorEventLoopDelay samples this with high precision and almost no overhead.
import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
import { monitorEventLoopDelay } from 'node:perf_hooks';

@Injectable()
export class LoopProfiler implements OnModuleInit {
  private readonly logger = new Logger(LoopProfiler.name);
  // Sample the loop delay every 20 ms.
  private readonly histogram = monitorEventLoopDelay({ resolution: 20 });

  onModuleInit(): void {
    this.histogram.enable();
    setInterval(() => {
      // Histogram values are in nanoseconds; convert to milliseconds.
      const p99Ms = this.histogram.percentile(99) / 1e6;
      const meanMs = this.histogram.mean / 1e6;
      const rss = process.memoryUsage().rss / 1024 / 1024;
      this.logger.log(
        `loop mean=${meanMs.toFixed(1)}ms p99=${p99Ms.toFixed(1)}ms rss=${rss.toFixed(0)}MB`,
      );
      this.histogram.reset(); // start a fresh window for the next report
    }, 10_000).unref(); // unref() so this timer never keeps the process alive
  }
}
Output:
[Nest] 4821 - LoopProfiler loop mean=0.4ms p99=1.2ms rss=128MB
[Nest] 4821 - LoopProfiler loop mean=18.7ms p99=210.5ms rss=141MB
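A p99 of 210 ms means that 1% of the delay samples in that window were at least 210 ms, so requests arriving during those stalls waited about a fifth of a second just to be picked up. To find the offending code, capture a CPU profile. The --cpu-prof flag writes a .cpuprofile file that opens in Chrome DevTools or VS Code; --prof writes a V8 tick log that --prof-process turns into a text summary:
node --cpu-prof dist/main.js # writes a .cpuprofile file when the process exits
node --prof dist/main.js # writes isolate-*.log, then:
node --prof-process isolate-*.log > processed.txt
# Or attach the inspector and record a CPU profile live from DevTools:
node --inspect dist/main.js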
Tip: Profile against production-like data and concurrency. A 10-row dev table will never reveal the N+1 query that melts the loop at 10,000 rows.
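The profiler above logs RSS, but a leak is usually easier to spot as a trend than as a single number. As a rough sketch, the hypothetical `HeapTrend` class below keeps a sliding window of heap readings and reports whether every reading grew over the previous one. The class name, the window size, and the growth threshold are assumptions for illustration, not part of Node.js or NestJS.

```typescript
// Hypothetical helper: tracks recent heap readings and flags sustained growth.
class HeapTrend {
  private readonly samples: number[] = [];

  constructor(
    private readonly readHeapBytes: () => number, // injected so it is easy to test
    private readonly windowSize = 6,
    private readonly minStepBytes = 1024 * 1024, // ignore growth below 1 MiB per sample
  ) {}

  // Take one reading and drop the oldest once the window is full.
  sample(): void {
    this.samples.push(this.readHeapBytes());
    if (this.samples.length > this.windowSize) {
      this.samples.shift();
    }
  }

  // True only when the window is full and every step grew by at least minStepBytes.
  isSteadilyGrowing(): boolean {
    if (this.samples.length < this.windowSize) {
      return false;
    }
    for (let i = 1; i < this.samples.length; i++) {
      if (this.samples[i] - this.samples[i - 1] < this.minStepBytes) {
        return false;
      }
    }
    return true;
  }
}
```

In a service, the reader could be `() => process.memoryUsage().heapUsed`, sampled from the same interval as the loop histogram. A window that keeps growing is only a hint; confirming a leak still means capturing heap snapshots (for example with `writeHeapSnapshot` from `node:v8`) and comparing them in Chrome DevTools.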
Health checks with @nestjs/terminus
Orchestrators like Kubernetes need an HTTP endpoint to decide if a pod is alive and ready for traffic. @nestjs/terminus provides composable health indicators that aggregate into a single status.
npm install @nestjs/terminus @nestjs/axios axios
import { Module } from '@nestjs/common';
import { TerminusModule } from '@nestjs/terminus';
import { HttpModule } from '@nestjs/axios';
import { HealthController } from './health.controller';

@Module({
  imports: [TerminusModule, HttpModule],
  controllers: [HealthController],
})
export class HealthModule {}
// health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  HttpHealthIndicator,
  MemoryHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly http: HttpHealthIndicator,
    private readonly db: TypeOrmHealthIndicator,
    private readonly memory: MemoryHealthIndicator,
  ) {}

  // Liveness: checks only the process itself (heap under 300 MiB).
  @Get('live')
  @HealthCheck()
  liveness() {
    return this.health.check([
      () => this.memory.checkHeap('heap', 300 * 1024 * 1024),
    ]);
  }

  // Readiness: checks the dependencies this service needs to serve traffic.
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', { timeout: 1500 }),
      () => this.http.pingCheck('payments', 'https://api.stripe.com/healthcheck'),
    ]);
  }
}
Output:
GET /health/ready → 200
{
"status": "ok",
"info": { "database": { "status": "up" }, "payments": { "status": "up" } },
"error": {},
"details": { "database": { "status": "up" }, "payments": { "status": "up" } }
}
Liveness should test only the process itself (use it for restart decisions); readiness checks dependencies (use it to gate traffic). If a downstream is down, Terminus returns 503 so the orchestrator stops routing requests to the pod.
| Probe | Endpoint | Tests | Failure action |
|---|---|---|---|
| Liveness | /health/live | heap, deadlock | Restart pod |
| Readiness | /health/ready | DB, cache, APIs | Remove from load balancer |
| Startup | /health/startup | slow boot tasks | Delay other probes |
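To make the aggregation concrete, the sketch below folds several named checks into a response with the same `status`, `info`, `error`, and `details` fields as the output shown earlier. It is a simplified model for illustration only, not the actual Terminus implementation; the `NamedCheck` and `HealthReport` types and the `aggregate` function are hypothetical.

```typescript
// Simplified model of a health response; NOT the actual Terminus implementation.
type IndicatorStatus = { status: 'up' | 'down'; message?: string };
type IndicatorMap = Record<string, IndicatorStatus>;

interface NamedCheck {
  name: string;
  // Resolves when the dependency is healthy, rejects when it is not.
  run: () => Promise<void>;
}

interface HealthReport {
  httpStatus: 200 | 503;
  body: {
    status: 'ok' | 'error';
    info: IndicatorMap; // checks that passed
    error: IndicatorMap; // checks that failed
    details: IndicatorMap; // both, merged
  };
}

async function aggregate(checks: NamedCheck[]): Promise<HealthReport> {
  const info: IndicatorMap = {};
  const error: IndicatorMap = {};

  // Run all checks concurrently; one failing check must not abort the others.
  await Promise.all(
    checks.map(async (check) => {
      try {
        await check.run();
        info[check.name] = { status: 'up' };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        error[check.name] = { status: 'down', message };
      }
    }),
  );

  // A single failing check turns the whole report into a 503.
  const healthy = Object.keys(error).length === 0;
  return {
    httpStatus: healthy ? 200 : 503,
    body: {
      status: healthy ? 'ok' : 'error',
      info,
      error,
      details: { ...info, ...error },
    },
  };
}
```

The point of the design is that the orchestrator only reads the HTTP status, while the JSON body tells a human which dependency caused the failure.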
Prometheus metrics
Health checks are binary; metrics are continuous. Exposing latency, throughput, and error counts lets you build dashboards and alerts. The prom-client library plus a small interceptor covers the RED method (Rate, Errors, Duration).
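As a rough illustration of what the three RED signals mean in terms of raw counters, the sketch below derives them from two successive readings of cumulative totals. The `Scrape` shape and the `redBetween` function are hypothetical names for this example. In practice Prometheus computes these values with query functions over scraped series, and this sketch ignores counter resets.

```typescript
// One reading of cumulative totals, as a scraper might see them.
interface Scrape {
  atSeconds: number; // when the reading was taken
  requests: number; // total requests so far
  errors: number; // total failed requests so far
  durationSumSeconds: number; // total time spent serving requests so far
}

interface RedSignals {
  ratePerSecond: number; // Rate
  errorRatio: number; // Errors, as a fraction of requests
  meanDurationSeconds: number; // Duration, as a mean over the window
}

// Derive RED signals for the window between two readings.
// Assumes counters only increase (no process restart in between).
function redBetween(prev: Scrape, curr: Scrape): RedSignals {
  const elapsed = curr.atSeconds - prev.atSeconds;
  if (elapsed <= 0) {
    throw new RangeError('scrapes must be in chronological order');
  }
  const requests = curr.requests - prev.requests;
  const errors = curr.errors - prev.errors;
  const durationSum = curr.durationSumSeconds - prev.durationSumSeconds;
  return {
    ratePerSecond: requests / elapsed,
    errorRatio: requests > 0 ? errors / requests : 0,
    meanDurationSeconds: requests > 0 ? durationSum / requests : 0,
  };
}
```

For example, 600 new requests, 6 new errors, and 90 seconds of accumulated latency over a 60-second window work out to 10 requests per second, a 1% error ratio, and a 150 ms mean duration.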
npm install prom-client
// metrics.service.ts
import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Registry, collectDefaultMetrics } from 'prom-client';

@Injectable()
export class MetricsService {
  // Dedicated registry; declared first because the metrics below register on it.
  readonly registry = new Registry();

  readonly httpDuration = new Histogram({
    name: 'http_request_duration_seconds',
    help: 'Request latency in seconds',
    labelNames: ['method', 'route', 'status'] as const,
    buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2.5, 5],
    registers: [this.registry],
  });

  readonly httpErrors = new Counter({
    name: 'http_requests_errors_total',
    help: 'Total failed requests',
    labelNames: ['method', 'route', 'status'] as const,
    registers: [this.registry],
  });

  constructor() {
    // Also collect the standard Node.js process metrics on the same registry.
    collectDefaultMetrics({ register: this.registry });
  }
}
import {
  CallHandler,
  ExecutionContext,
  HttpException,
  Injectable,
  NestInterceptor,
} from '@nestjs/common';
import { Observable, tap } from 'rxjs';
import { Request, Response } from 'express';
import { MetricsService } from './metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private readonly metrics: MetricsService) {}

  intercept(ctx: ExecutionContext, next: CallHandler): Observable<unknown> {
    const req = ctx.switchToHttp().getRequest<Request>();
    const res = ctx.switchToHttp().getResponse<Response>();
    // Label with the matched route pattern only; never fall back to the raw path.
    const route: string = req.route?.path ?? 'unmatched';
    const stop = this.metrics.httpDuration.startTimer({ method: req.method, route });

    return next.handle().pipe(
      tap({
        next: () => {
          stop({ status: String(res.statusCode) });
        },
        error: (err: unknown) => {
          // The exception filter has not run yet, so res.statusCode still holds
          // the success code. Derive the status from the exception instead;
          // Nest's default filter maps non-HTTP exceptions to 500.
          const status = String(err instanceof HttpException ? err.getStatus() : 500);
          stop({ status });
          this.metrics.httpErrors.inc({ method: req.method, route, status });
        },
      }),
    );
  }
}
Register the interceptor globally (for example, by providing it under the APP_INTERCEPTOR token from @nestjs/core), then expose the scrape endpoint:
import { Controller, Get, Header } from '@nestjs/common';
import { MetricsService } from './metrics.service';

@Controller('metrics')
export class MetricsController {
  constructor(private readonly metrics: MetricsService) {}

  @Get()
  // Prometheus text exposition format; same value as prom-client's registry.contentType.
  @Header('Content-Type', 'text/plain; version=0.0.4; charset=utf-8')
  scrape(): Promise<string> {
    return this.metrics.registry.metrics();
  }
}
Output:
# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",route="/users",status="200",le="0.05"} 482
http_request_duration_seconds_bucket{method="GET",route="/users",status="200",le="0.1"} 511
http_request_duration_seconds_count{method="GET",route="/users",status="200"} 512
http_requests_errors_total{method="POST",route="/orders",status="500"} 3
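The `le` buckets in this output are cumulative: each one counts every observation at or below its bound. As a sketch of why bucket boundaries matter for quantile alerts, the hypothetical `estimateQuantile` function below interpolates linearly inside the bucket that contains the target rank, which is similar in spirit to what PromQL's `histogram_quantile` does. The bucket counts in the example are made up for illustration.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number; // upper bound; use Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate the q-quantile (0 < q <= 1) by linear interpolation inside the
// bucket that contains the target rank. Buckets must be sorted by `le`,
// cumulative, and end with the +Inf bucket.
function estimateQuantile(buckets: Bucket[], q: number): number {
  if (buckets.length === 0) {
    throw new RangeError('need at least one bucket');
  }
  if (!(q > 0 && q <= 1)) {
    throw new RangeError('q must be in (0, 1]');
  }
  const total = buckets[buckets.length - 1].count;
  if (total === 0) {
    return NaN; // no observations yet
  }
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    const bucket = buckets[i];
    if (bucket.count >= rank) {
      const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
      const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
      if (bucket.le === Infinity) {
        return lowerBound; // cannot interpolate into +Inf; report the last finite bound
      }
      const fraction = (rank - lowerCount) / (bucket.count - lowerCount);
      return lowerBound + (bucket.le - lowerBound) * fraction;
    }
  }
  return NaN; // not reached when the last bucket holds the total
}

// Made-up counts: 50 requests <= 0.1s, 90 <= 0.5s, all 100 <= 1s.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(exampleBuckets, 0.95)); // approximately 0.75
```

The estimate can only be as precise as the bucket it falls into. Here the p95 lands in the 0.5 to 1 second bucket, so a 750 ms answer really means "somewhere between 500 ms and 1 s"; placing a boundary at the SLO threshold removes that ambiguity where it matters.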
Warning: Never put unbounded values (user IDs, raw URLs with params) in label values. Each unique combination creates a new time series and can blow up Prometheus memory; this is called a cardinality explosion. Always use the matched route pattern, not req.url.
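A small sketch makes the difference tangible. It counts how many distinct label combinations (and therefore time series) the same traffic would create when labeled by raw URL versus by route pattern. The `RequestSample` shape and the `countSeries` helper are hypothetical and only model the label sets; they do not touch prom-client.

```typescript
// Hypothetical record of one request, with and without the matched pattern.
interface RequestSample {
  method: string;
  url: string; // raw URL, unbounded
  routePattern?: string; // matched route pattern, bounded
}

// Count distinct label combinations, i.e. the number of time series created.
function countSeries(
  samples: RequestSample[],
  routeLabel: (sample: RequestSample) => string,
): number {
  const series = new Set<string>();
  for (const sample of samples) {
    series.add(`${sample.method} ${routeLabel(sample)}`);
  }
  return series.size;
}

// 1,000 requests for 1,000 different users, all matching one route pattern.
const samples: RequestSample[] = [];
for (let id = 1; id <= 1000; id++) {
  samples.push({
    method: 'GET',
    url: `/users/${id}?expand=orders`,
    routePattern: '/users/:id',
  });
}

const byRawUrl = countSeries(samples, (s) => s.url);
const byPattern = countSeries(samples, (s) => s.routePattern ?? 'unmatched');
console.log({ byRawUrl, byPattern }); // { byRawUrl: 1000, byPattern: 1 }
```

The series count for the raw URL grows with the number of users, while the pattern stays at one series per method. Histograms multiply the effect, since each series is stored once per bucket.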
Best Practices
- Keep liveness probes dependency-free so a flaky database never triggers a restart loop; gate traffic with readiness instead.
- Sample event-loop lag continuously in production — it is the earliest warning of a synchronous bottleneck.
- Use histogram buckets that match your latency SLOs so quantile alerts are meaningful.
- Bound metric label cardinality to route patterns and fixed status codes; never use raw URLs or IDs as label values.
- Protect the /metrics endpoint at the network layer or with a guard so it is not publicly scrapeable.
- Capture CPU and heap profiles under realistic load and data volume, not against trivial dev datasets.
- Tie alerts to the RED signals (rate, errors, duration) rather than raw CPU, so you page on user-visible symptoms.