Architecture

Cutting API Latency 50% Without Rewriting the System

KPKarey Powell·April 28, 2026·2 min read

The brief was deceptively simple: the platform was slow at peak, and a rewrite was off the table. We had a live credit bureau serving real lenders, millions of consumer records, and a hard "do not break anything" constraint. Six weeks later, p95 latency was down 50% and the system absorbed three times the peak query volume without breaking a sweat. No rewrite. Here is how.

Measure before you touch anything

The first week, I wrote no code. I instrumented. You cannot optimize what you cannot see, and the team's intuitions about where the time went turned out to be wrong in two of the three hottest paths.

Profiling is humbling. The bottleneck is almost never where the senior engineer swears it is.

Once we had real percentile latency per endpoint and per downstream call, three culprits stood out:

A chatty service boundary that turned one logical lookup into seven sequential network hops.
An uncached reference dataset that changed maybe once a day but was fetched on every request.
A database query doing a sort the index couldn't satisfy.

Reshape the boundaries, don't rewrite the services

The seven-hop path existed because the services were drawn around database tables instead of business capabilities. We didn't rewrite them — we collapsed the read path into a single composed query behind one service, leaving the write paths untouched.

// before: N sequential calls, each a network round-trip
for _, id := range ids {
    record, _ := client.FetchRecord(ctx, id) // ouch
    results = append(results, record)
}

// after: one batched call, one round-trip
results, _ := client.FetchRecordsBatch(ctx, ids)

That single change took the endpoint from seven round-trips to one and accounted for most of the win.

Cache the boring, slow-changing things

The reference dataset was perfect cache material: large read volume, tiny write volume. We put it behind Redis with a short TTL and an explicit invalidation hook on the (rare) write. The rule I follow:

Cache data that is read far more than it is written.
Always have an invalidation story — a cache you can't invalidate is a bug with a countdown timer.
Set a TTL even when you invalidate, so a missed invalidation self-heals.

The index that wasn't

The last fix was a one-line migration: a composite index matching the query's filter-and-sort shape. The query planner stopped sorting in memory and the endpoint's p95 dropped by two-thirds on its own.

What made it durable

The patterns we established — composed read services, an explicit caching policy, and a "profile first" culture — outlived the engagement and became team standards. Speed isn't a one-time fix; it's a set of defaults that keep the next feature from quietly regressing what you just earned.

PerformanceMicroservicesCachingPostgreSQL

Karey Powell

Staff Engineer & AI Systems Architect. 14+ years building production fintech and AI systems across the Caribbean. Currently Lead Solutions Architect at MZ Holdings.