Your Go Service Is Leaking Memory and You Don’t Know It Yet — A Production pprof Survival Guide

How I Stopped Getting Paged at 3 AM by Learning to Profile Heap and Goroutine Leaks Before They Escalate


Introduction: The OOM Kill Nightmare at 3 AM

It was a Tuesday night. I was deep in sleep when my phone blew up. PagerDuty. Then Slack. Then a second PagerDuty alert.

Our Go API — handling roughly 8,000 requests per minute — had just been OOM-killed by the Linux kernel. The pod had been restarted automatically by Kubernetes, which masked the problem just long enough for it to happen again four hours later. And again. And again.

The application wasn’t crashing. It wasn’t throwing errors. It was just slowly, silently consuming more and more memory until the kernel had no choice but to end it. RSS was climbing ~50 MB per hour with no correlation to traffic spikes. Classic memory leak behavior.

This was the incident that forced me to truly learn Go’s pprof tooling, and I’ve never looked at memory management the same way since. In this post, I’ll show you how I set up pprof for safe production use, how to read heap profiles, how to identify goroutine leaks, and how to tell allocated memory from in-use memory, a critical difference that most guides skip over entirely.


How I Set Up the pprof Endpoint Securely

The first mistake I see developers make is either not exposing pprof at all, or worse — exposing it on the same port as their public API. Both extremes cause problems.

The right approach is to expose pprof on a separate internal port, bound only to localhost or an internal interface, and protect it at the network level.

Basic Setup (Internal-Only Port)

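package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Side-effect import registers /debug/pprof handlers on http.DefaultServeMux
)

func main() {
    // Serve pprof from its own goroutine, bound to loopback only
    go func() {
        log.Println("pprof listening on 127.0.0.1:6060")
        // A nil handler means http.DefaultServeMux, where the pprof routes live
        if err := http.ListenAndServe("127.0.0.1:6060", nil); err != nil {
            // Log instead of log.Fatalf so a failed debug listener cannot take down the service
            log.Printf("pprof server failed: %v", err)
        }
    }()

    // Your actual application server (placeholder for your own startup code)
    startApplicationServer()
}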

Never bind pprof to 0.0.0.0 in production. I’ve seen this in real production codebases, and it exposes heap dumps, goroutine stacks, and CPU profiles to anyone who can reach the pod. That’s an information disclosure vulnerability at minimum.

Kubernetes Setup — Accessing pprof Safely via Port-Forward

In Kubernetes, you never open a NodePort or LoadBalancer for pprof. Instead, use kubectl port-forward from your local machine:

# Forward pprof port to your localhost
kubectl port-forward pod/my-go-service-7d9f4b8c6-xk2pz 6060:6060 -n production

# In another terminal, capture a 30-second CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"   # quoted so the shell does not glob the ?

# Capture a heap snapshot immediately
go tool pprof http://localhost:6060/debug/pprof/heap

Environment-Gated Enablement

For services where you want pprof available but not always running, I use an environment variable gate:

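import (
    "log"
    "net/http"
    _ "net/http/pprof" // Still registers handlers on http.DefaultServeMux at import time
    "os"
)

func initPprof() {
    // The gate only controls whether the listener starts
    if os.Getenv("ENABLE_PPROF") != "true" {
        return
    }
    go func() {
        log.Println("pprof enabled on 127.0.0.1:6060")
        if err := http.ListenAndServe("127.0.0.1:6060", nil); err != nil {
            log.Printf("pprof server failed: %v", err)
        }
    }()
}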

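Set ENABLE_PPROF=true in your Kubernetes deployment manifest only when you need to diagnose, and remove it afterward. Simple and auditable. One caveat: changing the variable in the pod template triggers a rollout, so the restarted pods begin with fresh memory and a slow leak needs time to reappear before a profile is useful.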


Step-by-Step: Analyzing Heap Profiles and Identifying Orphan Goroutines

Step 1 — Capture a Baseline Heap Profile

The first rule of memory leak investigation: never look at just one snapshot. You need two profiles separated by time to see the delta.

# Snapshot 1 — capture baseline
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

Wait 15–30 minutes (or reproduce the suspected leak scenario), then:

# Snapshot 2 — capture after potential leak activity
go tool pprof -http=:8081 http://localhost:6060/debug/pprof/heap

Or better yet, use the base comparison flag:

# Download both profiles to disk first
curl -s http://localhost:6060/debug/pprof/heap -o heap1.prof
# ... wait for leak to grow ...
curl -s http://localhost:6060/debug/pprof/heap -o heap2.prof

# Diff them
go tool pprof -base heap1.prof heap2.prof

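Inside the pprof interactive shell, run:

(pprof) top20
(pprof) list YourSuspectedFunction
(pprof) web   # Opens the call graph as an SVG in your browser (requires Graphviz)

For a flamegraph, start the web UI instead (go tool pprof -http=:8080 -base heap1.prof heap2.prof) and choose Flame Graph from the View menu.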

Step 2 — Reading the Flamegraph

The flamegraph is your best friend. Each frame’s width is the memory attributed to that function and everything it calls; height is only call-stack depth. Wide frames in an inuse_space view mean a code path is allocating a lot and holding onto it. Here’s what to look for:

  • HTTP handler functions whose allocations stay live after the request finishes, which could indicate request context leaks
  • Unexpected growth in encoding/json or database/sql, often caused by unbounded caches or connection pools
  • runtime.malg growing steadily: this function allocates new goroutines and their stacks, so it is a classic goroutine leak signal

Step 3 — Checking for Goroutine Leaks

Goroutine leaks are the most common silent memory leak in Go services I’ve encountered. They don’t cause large individual allocations, but their cumulative stack memory and the resources they hold (channels, connections, timers) will steadily grow until something breaks.

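# Get a full goroutine dump (quote the URL so the shell does not treat ? as a glob)
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutines.txt

# Count current goroutines (quick health check): the first line reads "goroutine profile: total N"
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -5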

A healthy service should have a roughly stable goroutine count under steady traffic. If you’re seeing it climb linearly over time, you have a leak.

Here’s a real example of a goroutine leak I found in a webhook processing service:

// LEAKY — goroutine blocks forever if channel is never drained
func processWebhook(event Event) {
    ch := make(chan Result) // unbuffered
    go func() {
        ch <- process(event) // blocks if nobody reads
    }()
    // If this function returns early (timeout, error), the goroutine above leaks
}

// FIXED — use context cancellation and a buffered channel
func processWebhook(ctx context.Context, event Event) error {
    ch := make(chan Result, 1) // buffered — sender won't block
    go func() {
        select {
        case ch <- process(event):
        case <-ctx.Done():
        }
    }()

    select {
    case result := <-ch:
        return handleResult(result)
    case <-ctx.Done():
        return ctx.Err()
    }
}

The Expert Tip: Difference Between “Allocated” and “In-Use” Memory

This is the distinction that trips up even experienced Go developers, and it’s the one that makes reading pprof output genuinely confusing at first.

pprof heap profiles expose four key metrics:

Metric          What It Means
alloc_objects   Total number of objects allocated since program start
alloc_space     Total bytes allocated since program start
inuse_objects   Objects currently allocated and not yet GC’d
inuse_space     Bytes currently allocated and not yet GC’d

By default, go tool pprof shows inuse_space — what’s live in memory right now. This is what you care about for diagnosing actual memory growth.

But alloc_space is what you care about for finding where allocation pressure is coming from — even if that memory gets GC’d quickly, high allocation rates cause GC pressure and CPU overhead.

# Focus on live (in-use) memory — for diagnosing actual leaks
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap

# Focus on total allocations — for finding GC pressure hotspots
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap

The rookie mistake I made early in my career: looking at alloc_space and panicking about a function that showed huge allocation totals, only to realize the GC was collecting it perfectly fine. The in-use space was stable. The real leak was elsewhere.

The rule of thumb I use: if inuse_space grows consistently over time under steady load, you have a leak. If alloc_space is high but inuse_space is stable, you have a GC pressure problem — different issue, different fix.
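The same split is visible without pprof through runtime.MemStats, where TotalAlloc is cumulative (roughly the alloc_space view) and HeapAlloc is what is live (roughly the inuse_space view). The sketch below churns through short-lived buffers and reports both; the function name churn and the sink variable are my own illustration.

```go
package main

import "runtime"

// sink is package-level so the buffers below must be heap allocated.
var sink []byte

// churn allocates n buffers of size bytes, dropping each one as the next is
// created. It returns the cumulative bytes allocated during the loop and the
// change in live heap bytes once the garbage has been collected.
func churn(n, size int) (allocated uint64, liveDelta int64) {
    var before, after runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&before)

    for i := 0; i < n; i++ {
        sink = make([]byte, size) // the previous buffer becomes garbage
    }
    sink = nil

    runtime.GC()
    runtime.ReadMemStats(&after)

    allocated = after.TotalAlloc - before.TotalAlloc
    liveDelta = int64(after.HeapAlloc) - int64(before.HeapAlloc)
    return allocated, liveDelta
}
```

Calling churn(100, 1<<20) allocates at least 100 MiB in total while the live heap barely moves: high allocation pressure, no leak.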


Conclusion: Best Practices to Prevent Leaks at the Design Level

After that 3 AM incident, I introduced a set of structural practices into every Go service I architect. These have dramatically reduced the frequency of memory leak issues:

1. Always propagate context.Context and respect cancellation

Every goroutine you spawn must have an exit condition tied to a context. No naked goroutines without cancel channels.

2. Use goleak in your test suite

go.uber.org/goleak detects goroutine leaks in unit tests. It’s one line:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

This catches goroutine leaks at test time, not 3 AM.

3. Bound every cache, pool, and buffer

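Unbounded caches are memory leaks waiting to happen. Always set a maximum size, and ideally an eviction policy, on any in-memory data structure. Use sync.Pool to reuse short-lived objects, but don’t treat it as a cache: the runtime may drop pooled items at any GC.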
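As one way to put a hard bound on a cache, here is a small LRU sketch built on container/list. The type boundedCache and its methods are illustrative; a production service might prefer a maintained library.

```go
package main

import (
    "container/list"
    "sync"
)

type cacheEntry struct {
    key   string
    value []byte
}

// boundedCache is an LRU cache that never holds more than capacity entries.
type boundedCache struct {
    mu       sync.Mutex
    capacity int
    order    *list.List               // front = most recently used
    items    map[string]*list.Element // key -> element in order
}

func newBoundedCache(capacity int) *boundedCache {
    if capacity < 1 {
        capacity = 1
    }
    return &boundedCache{
        capacity: capacity,
        order:    list.New(),
        items:    make(map[string]*list.Element),
    }
}

// Get returns the value for key and marks it as recently used.
func (c *boundedCache) Get(key string) ([]byte, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    el, ok := c.items[key]
    if !ok {
        return nil, false
    }
    c.order.MoveToFront(el)
    return el.Value.(*cacheEntry).value, true
}

// Put stores value under key, evicting the least recently used entry when the
// cache would otherwise exceed its capacity.
func (c *boundedCache) Put(key string, value []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if el, ok := c.items[key]; ok {
        el.Value.(*cacheEntry).value = value
        c.order.MoveToFront(el)
        return
    }
    c.items[key] = c.order.PushFront(&cacheEntry{key: key, value: value})
    if c.order.Len() > c.capacity {
        oldest := c.order.Back()
        c.order.Remove(oldest)
        delete(c.items, oldest.Value.(*cacheEntry).key)
    }
}

// Len reports the number of entries currently held.
func (c *boundedCache) Len() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.order.Len()
}
```

The bound here is on entry count. If values vary widely in size, tracking total bytes instead gives a tighter memory ceiling.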

4. Close http.Response.Body — always

resp, err := http.Get(url)
if err != nil {
    return err
}
defer resp.Body.Close() // Defer only after the err check; it then runs on every later return path

A missing Body.Close() leaks a TCP connection and goroutines from the HTTP client’s internal transport.

5. Instrument with metrics from day one

Track runtime.MemStats.HeapInuse and goroutine count in your Prometheus metrics. A slowly rising graph over days is a leak. A stable graph is confidence.

import "runtime"

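func recordMemStats() {
    var m runtime.MemStats
    // ReadMemStats briefly stops the world, so call this from a periodic
    // collector (every few seconds), not on every request.
    runtime.ReadMemStats(&m)
    heapInuseGauge.Set(float64(m.HeapInuse))            // bytes in in-use heap spans
    goroutineGauge.Set(float64(runtime.NumGoroutine())) // live goroutines
}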

Memory leaks in Go are sneaky precisely because the language makes memory management feel easy. The GC handles most things — until it doesn’t. Building pprof profiling into your debugging workflow isn’t optional at scale; it’s table stakes.

The good news: once you’ve been through the process a few times, reading heap profiles becomes second nature. You’ll start recognizing patterns, and more importantly, you’ll start writing code that avoids them in the first place.

Ship carefully. Measure everything. And for the love of your on-call rotation — set up goroutine leak detection in your tests.