Debugging a memory leak in a Node.js server running on ECS

Written by Lenvin Gonsalves on March 29, 2026.

TL;DR: One of our Node.js-based microservices kept running out of memory. This post covers identifying the issue with our monitoring tools, attaching a memory profiler to a live ECS container, and applying the fix.

Introduction

At Makro Pro, we use NestJS microservices for the marketplace backend. On one of the services, we noticed a rapid increase in memory consumption. The issue was so bad that the container was frequently crashing, and other services could not communicate with it.

We threw money at the problem by increasing the container memory. After some time, we hit a wall — so we had to address the root cause.

Importance of instrumentation

Thanks to our monitoring setup, we identified the problem early. We use NewRelic, which surfaces errors across services. We found errors showing that other services were getting ECONNRESET when trying to reach the affected service:

NewRelic Error group summary showing ECONNRESET errors from other services unable to connect

We investigated the container logs and found the culprit — a hard crash with a JavaScript heap out of memory error:

Server logs showing FATAL ERROR: Ineffective mark-compacts near heap limit — JavaScript heap out of memory

On NewRelic, we could visualise the heap size over time. The chart told the whole story — memory climbed continuously over several days, reaching roughly 15 GB before the container crashed:

NewRelic memory usage chart for prod-marketplace-api showing heap growing from 0 to 15GB over 3 days before crashing

Now that we had confirmed the problem, it was time to find the root cause.

Connecting the debugger to ECS

To run a heap profiler against a live ECS task, we needed to:

  1. Run the NestJS server in debug mode inside the container
  2. Expose port 9229 on the ECS task definition
  3. Enable enable_execute_command on the container
  4. Port-forward 9229 from the container to localhost

1. Run the NestJS server in debug mode

Ensure the node process starts with the --inspect flag in the ECS bootstrap script. Be careful here: the inspector protocol allows arbitrary code execution in the target process, so bind it to 0.0.0.0 only if the service is not reachable from the internet. In our case, the service was only discoverable within the VPC:

package.json diff showing start:prod changed from "node dist/main" to "node --inspect=0.0.0.0:9229 dist/main"
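Reconstructed as a snippet, the change to the scripts section of package.json looks like this (dist/main is NestJS's default build entry point; adjust for your project):

```json
{
  "scripts": {
    "start:prod": "node --inspect=0.0.0.0:9229 dist/main"
  }
}
```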

2. Expose port 9229 on the ECS task

Under the ECS task definition, scroll to Port Mappings and add port 9229. Configuration varies based on your networking setup (bridge vs. awsvpc mode).
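As a sketch, the extra mapping in the task definition JSON might look like the following (awsvpc mode shown, where the host port mirrors the container port; values are illustrative):

```json
{
  "portMappings": [
    { "containerPort": 9229, "protocol": "tcp" }
  ]
}
```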

3. Enable enable_execute_command

Add the required IAM permissions to the ECS task role so AWS Session Manager can exec into the container:

IAM policy JSON granting ssmmessages:CreateControlChannel, CreateDataChannel, OpenControlChannel, OpenDataChannel
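Reconstructed, the policy looks like this — these are the ssmmessages permissions AWS documents for ECS Exec (scope Resource more tightly if your setup allows):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```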

4. Port-forward from the container

Since the service is only reachable within the VPC, it's not possible to connect a debugger directly to the container's port. Instead, create a local tunnel using ecs-exec-pf:

ecs-exec-pf -c ${cluster_name} -t ${task_id} -p 9229 -l 9229

Taking a heap snapshot

With the tunnel running, open Chrome and navigate to chrome://inspect. Click "Open dedicated DevTools for Node":

Chrome DevTools for Node.js showing the connection screen with "Add connection" button

If localhost:9229 doesn't appear automatically, add it manually:

Add connection dialog with localhost:9229 entered

Once connected, navigate to the Memory tab, select Heap Snapshot, and click Take Snapshot:

Chrome DevTools Memory tab showing Heap Snapshot profiling type selected with Take Snapshot button

Finding the leak

In the snapshot, sorted by retained size, ThrottlerStorageService immediately stood out — it was consuming 34% of the entire heap in QA (almost certainly higher in production):

Heap snapshot showing ThrottlerStorageService with 999,776 bytes retained — 34% of total heap

Digging into it, the ThrottlerStorageService (from the NestJS throttler used on our GraphQL APIs) maintains a timeoutIds array. For every incoming request, a new Timeout object gets added to the array:

Diagram: 2 phones sending Request 1 and Request 2 to a server, resulting in [Timeout_1, Timeout_2] in memory

The timeouts were never being cleaned up. With every request the array grew, and with thousands of requests per hour, the memory ballooned:

Diagram: N phones sending N requests, resulting in [Timeout_1, Timeout_2, ..., Timeout_N] filling memory
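The mechanism can be sketched in a few lines of plain Node.js. This is an illustrative simplification, not the throttler's actual source — LeakyStorage mirrors the pre-fix behaviour, and FixedStorage mirrors the shape of the v2.0.1 fix:

```javascript
// Simplified sketch of the leak: one Timeout handle per request,
// accumulated in an array.
class LeakyStorage {
  constructor() {
    this.timeoutIds = [];
  }
  addRecord(ttlMs) {
    // A Timeout is pushed for every incoming request...
    const id = setTimeout(() => {}, ttlMs);
    this.timeoutIds.push(id);
    // ...but nothing ever removes it, so the array grows forever.
  }
}

// Sketch of the fix: remove the id from the array when the timeout fires.
class FixedStorage {
  constructor() {
    this.timeoutIds = [];
  }
  addRecord(ttlMs) {
    const id = setTimeout(() => {
      this.timeoutIds.splice(this.timeoutIds.indexOf(id), 1);
    }, ttlMs);
    this.timeoutIds.push(id);
  }
}
```

With the leaky version, timeoutIds only ever grows; with the fixed version, it drains back to zero once the throttle windows expire.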

Applying the fix

The package maintainers had already identified and fixed this leak in v2.0.1, and the release notes described it exactly:

throttler v2.0.1 changelog: "fix memory leak for timeoutIds array — the timeoutIds array would not be trimmed and would grow until out of memory. Now ids are properly removed on timeout."

The fix was a one-line change in package.json:

package.json diff upgrading @nestjs/throttler from ^1.1.4 to ^2.0.1
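For reference, the dependency bump amounts to this diff (version numbers as above):

```diff
-    "@nestjs/throttler": "^1.1.4",
+    "@nestjs/throttler": "^2.0.1",
```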

After the fix

We monitored memory consumption over the following week. The heap size stabilised completely — no more unbounded growth, and the service no longer needed to be over-provisioned with gigabytes of RAM:

NewRelic memory usage chart after the fix showing stable memory between 200–300 MB over 7 days


Written during my time at OOZOU, where I was embedded with the Makro Pro engineering team building Thailand's #1 B2B wholesale e-commerce platform.