Profiling & Monitoring

A fast NestJS service in development can still crumble under production load. Profiling tells you where the time goes — a blocked event loop, a leaking heap, a slow downstream call — while monitoring tells you whether the system is healthy right now. This page covers measuring event-loop lag and memory, exposing liveness and readiness probes with @nestjs/terminus, and emitting Prometheus metrics for latency, throughput, and errors.

Profiling the event loop and memory

Node.js runs your handlers on a single event loop. If a synchronous operation (JSON parsing, crypto, a tight loop) hogs that thread, every concurrent request stalls. The first signal of trouble is event-loop lag: the delay between when a timer should fire and when it actually does.

The built-in perf_hooks.monitorEventLoopDelay samples this with high precision and almost no overhead.

import { Injectable, OnModuleInit, Logger } from '@nestjs/common';
import { monitorEventLoopDelay } from 'node:perf_hooks';

@Injectable()
export class LoopProfiler implements OnModuleInit {
  private readonly logger = new Logger(LoopProfiler.name);
  private readonly histogram = monitorEventLoopDelay({ resolution: 20 });

  onModuleInit(): void {
    this.histogram.enable();
    setInterval(() => {
      const p99Ms = this.histogram.percentile(99) / 1e6;
      const meanMs = this.histogram.mean / 1e6;
      const rss = process.memoryUsage().rss / 1024 / 1024;
      this.logger.log(
        `loop mean=${meanMs.toFixed(1)}ms p99=${p99Ms.toFixed(1)}ms rss=${rss.toFixed(0)}MB`,
      );
      this.histogram.reset();
    }, 10_000).unref();
  }
}

Output:

[Nest] 4821  - LoopProfiler   loop mean=0.4ms p99=1.2ms rss=128MB
[Nest] 4821  - LoopProfiler   loop mean=18.7ms p99=210.5ms rss=141MB

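A p99 of 210 ms means that about 1% of the sampled loop delays exceeded a fifth of a second, so some requests waited that long just to be picked up. To find the offending code, capture a CPU profile. V8's tick profiler (--prof) writes an isolate log that you summarize on the command line, while the inspector (--inspect) lets you record a profile and view it as a flame chart in Chrome DevTools or VS Code: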

node --prof dist/main.js          # writes isolate-*.log, then:
node --prof-process isolate-*.log > processed.txt

# Or attach the inspector live and take a flamegraph:
node --inspect dist/main.js

Tip: Profile against production-like data and concurrency. A 10-row dev table will never reveal the N+1 query that melts the loop at 10,000 rows.
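
The profiler above logs RSS every ten seconds, but a single reading says little about a leak; the trend over time does. As a minimal sketch, assuming you keep a rolling window of heap samples (for example from process.memoryUsage().heapUsed), a least-squares slope separates steady growth from the normal garbage-collection sawtooth. HeapSample and heapGrowthPerMinute are hypothetical names for illustration, not part of Node.js or NestJS.

```typescript
// Hypothetical helper: estimates heap growth from timestamped samples.
// In a real service the samples would come from process.memoryUsage().heapUsed.
export interface HeapSample {
  atMs: number; // sample timestamp in milliseconds
  bytes: number; // heap size at that time
}

// Least-squares slope of bytes over time, converted to MB per minute.
export function heapGrowthPerMinute(samples: HeapSample[]): number {
  if (samples.length < 2) return 0;
  const n = samples.length;
  const meanT = samples.reduce((sum, s) => sum + s.atMs, 0) / n;
  const meanB = samples.reduce((sum, s) => sum + s.bytes, 0) / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    num += (s.atMs - meanT) * (s.bytes - meanB);
    den += (s.atMs - meanT) ** 2;
  }
  if (den === 0) return 0; // all samples share one timestamp
  const bytesPerMs = num / den;
  return (bytesPerMs * 60_000) / (1024 * 1024);
}
```

A slope that stays positive across many windows is a reason to capture a heap snapshot; a slope that hovers around zero usually is not. Any alert threshold is workload-specific and would need tuning.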

Health checks with @nestjs/terminus

Orchestrators like Kubernetes need an HTTP endpoint to decide if a pod is alive and ready for traffic. @nestjs/terminus provides composable health indicators that aggregate into a single status.

npm install @nestjs/terminus
# HttpHealthIndicator also relies on the HTTP client packages:
npm install @nestjs/axios axios

import { Module } from '@nestjs/common';
import { TerminusModule } from '@nestjs/terminus';
import { HttpModule } from '@nestjs/axios';
import { HealthController } from './health.controller';

@Module({
  imports: [TerminusModule, HttpModule],
  controllers: [HealthController],
})
export class HealthModule {}

import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  HttpHealthIndicator,
  MemoryHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly http: HttpHealthIndicator,
    private readonly db: TypeOrmHealthIndicator,
    private readonly memory: MemoryHealthIndicator,
  ) {}

  @Get('live')
  @HealthCheck()
  liveness() {
    return this.health.check([
      () => this.memory.checkHeap('heap', 300 * 1024 * 1024),
    ]);
  }

  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', { timeout: 1500 }),
      () => this.http.pingCheck('payments', 'https://api.stripe.com/healthcheck'),
    ]);
  }
}

Output:

GET /health/ready → 200
{
  "status": "ok",
  "info": { "database": { "status": "up" }, "payments": { "status": "up" } },
  "error": {},
  "details": { "database": { "status": "up" }, "payments": { "status": "up" } }
}

Liveness should test only the process itself (use it for restart decisions); readiness checks dependencies (use it to gate traffic). If a downstream is down, Terminus returns 503 so the orchestrator stops routing requests to the pod.
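
To make the aggregation rule concrete, here is a simplified model of how per-indicator results roll up into the response shape shown above. It is a sketch for illustration only, not the actual Terminus implementation; aggregateHealth, IndicatorResult, and HealthSummary are hypothetical names.

```typescript
type IndicatorStatus = 'up' | 'down';
type IndicatorResult = Record<string, { status: IndicatorStatus }>;

interface HealthSummary {
  httpStatus: 200 | 503;
  body: {
    status: 'ok' | 'error';
    info: IndicatorResult; // indicators that are up
    error: IndicatorResult; // indicators that are down
    details: IndicatorResult; // every indicator, up or down
  };
}

// One failing indicator is enough to mark the whole check as failed.
export function aggregateHealth(results: IndicatorResult[]): HealthSummary {
  const info: IndicatorResult = {};
  const error: IndicatorResult = {};
  for (const result of results) {
    for (const [name, value] of Object.entries(result)) {
      (value.status === 'up' ? info : error)[name] = value;
    }
  }
  const healthy = Object.keys(error).length === 0;
  return {
    httpStatus: healthy ? 200 : 503,
    body: {
      status: healthy ? 'ok' : 'error',
      info,
      error,
      details: { ...info, ...error },
    },
  };
}
```

The all-or-nothing rule is why the liveness list should stay short: every indicator added there is one more way for a healthy process to be restarted.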

Probe	Endpoint	Tests	Failure action
Liveness	`/health/live`	heap, deadlock	Restart pod
Readiness	`/health/ready`	DB, cache, APIs	Remove from load balancer
Startup	`/health/startup`	slow boot tasks	Delay other probes

Prometheus metrics

Health checks are binary; metrics are continuous. Exposing latency, throughput, and error counts lets you build dashboards and alerts. The prom-client library plus a small interceptor covers the RED method (Rate, Errors, Duration).

npm install prom-client

import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Registry, collectDefaultMetrics } from 'prom-client';

@Injectable()
export class MetricsService {
  readonly registry = new Registry();

  readonly httpDuration = new Histogram({
    name: 'http_request_duration_seconds',
    help: 'Request latency in seconds',
    labelNames: ['method', 'route', 'status'] as const,
    buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2.5, 5],
    registers: [this.registry],
  });

  readonly httpErrors = new Counter({
    name: 'http_requests_errors_total',
    help: 'Total failed requests',
    labelNames: ['method', 'route', 'status'] as const,
    registers: [this.registry],
  });

  constructor() {
    collectDefaultMetrics({ register: this.registry });
  }
}

import {
  CallHandler,
  ExecutionContext,
  HttpException,
  Injectable,
  NestInterceptor,
} from '@nestjs/common';
import { Observable, tap } from 'rxjs';
import { Request, Response } from 'express';
import { MetricsService } from './metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private readonly metrics: MetricsService) {}

  intercept(ctx: ExecutionContext, next: CallHandler): Observable<unknown> {
    const req = ctx.switchToHttp().getRequest<Request>();
    const res = ctx.switchToHttp().getResponse<Response>();
    // Use the matched route pattern; never fall back to the raw path,
    // which would create one time series per distinct URL.
    const route: string = req.route?.path ?? 'unknown';
    const stop = this.metrics.httpDuration.startTimer({ method: req.method, route });

    return next.handle().pipe(
      tap({
        next: () => {
          stop({ status: String(res.statusCode) });
        },
        error: (err: unknown) => {
          // Exception filters run after interceptors, so res.statusCode is
          // not final yet; derive the status from the exception instead.
          const status = String(err instanceof HttpException ? err.getStatus() : 500);
          stop({ status });
          this.metrics.httpErrors.inc({ method: req.method, route, status });
        },
      }),
    );
  }
}

Register the interceptor globally (an APP_INTERCEPTOR provider from @nestjs/core lets Nest inject MetricsService into it), then expose the scrape endpoint:

import { Controller, Get, Header } from '@nestjs/common';
import { MetricsService } from './metrics.service';

@Controller('metrics')
export class MetricsController {
  constructor(private readonly metrics: MetricsService) {}

  @Get()
  @Header('Content-Type', 'text/plain')
  scrape(): Promise<string> {
    return this.metrics.registry.metrics();
  }
}

Output:

# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",route="/users",status="200",le="0.05"} 482
http_request_duration_seconds_bucket{method="GET",route="/users",status="200",le="0.1"} 511
http_request_duration_seconds_count{method="GET",route="/users",status="200"} 512
http_requests_errors_total{method="POST",route="/orders",status="500"} 3
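
The _bucket lines are cumulative: each le bucket counts every observation less than or equal to its bound, which is why le="0.1" (511) includes the 482 requests already counted under le="0.05". A minimal sketch of that bookkeeping, independent of prom-client (cumulativeBuckets is a hypothetical helper, not a library function):

```typescript
// Counts, for each upper bound, how many observations are <= that bound.
// The final "+Inf" bucket always equals the total number of observations.
export function cumulativeBuckets(
  bounds: number[],
  observations: number[],
): Map<string, number> {
  const sorted = [...bounds].sort((a, b) => a - b);
  const counts = new Map<string, number>();
  for (const bound of sorted) {
    counts.set(String(bound), observations.filter((v) => v <= bound).length);
  }
  counts.set('+Inf', observations.length);
  return counts;
}
```

Because the counts are cumulative, a bound placed exactly at an SLO threshold (say 0.3 for "under 300 ms") gives the fraction of requests that met the target directly: that bucket's count divided by the +Inf count.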

Warning: Never put unbounded values (user IDs, raw URLs with params) in label values. Each unique combination creates a new time series and can blow up Prometheus memory — this is called a cardinality explosion. Always use the matched route pattern, not req.url.
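
The arithmetic behind that warning is simple: the number of time series for one metric is at most the product of the distinct values each label can take. The sketch below shows that product, plus a hypothetical normalizePath fallback for cases where no matched route pattern is available. Both function names and the two regular expressions are illustrative assumptions, not an exhaustive rule set.

```typescript
// Upper bound on series for one metric: product of per-label value counts.
export function maxSeries(labelValueCounts: Record<string, number>): number {
  return Object.values(labelValueCounts).reduce((acc, n) => acc * n, 1);
}

const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

// Drops the query string and fragment, then replaces ID-like segments
// with a fixed placeholder so the label stays bounded.
export function normalizePath(rawUrl: string): string {
  const path = rawUrl.replace(/[?#].*$/, '');
  return path
    .split('/')
    .map((segment) => (NUMERIC.test(segment) || UUID.test(segment) ? ':id' : segment))
    .join('/');
}
```

With 5 methods, 40 route patterns, and 10 status codes the bound is 2,000 series, which is manageable. Swap the route label for raw URLs containing user IDs and the middle factor grows with your user base.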

Best Practices

Keep liveness probes dependency-free so a flaky database never triggers a restart loop; gate traffic with readiness instead.
Sample event-loop lag continuously in production — it is the earliest warning of a synchronous bottleneck.
Use histogram buckets that match your latency SLOs so quantile alerts are meaningful.
Bound metric label cardinality to route patterns and fixed status codes; never use raw URLs or IDs as label values.
Protect the /metrics endpoint at the network layer or with a guard so it is not publicly scrapeable.
Capture CPU and heap profiles under realistic load and data volume, not against trivial dev datasets.
Tie alerts to the RED signals (rate, errors, duration) rather than raw CPU, so you page on user-visible symptoms.
