DISTRIBUTED SYSTEMS · LLMOps · AI INFRASTRUCTURE

Harshit Joshi

I architect systems that hold up under real load. Today I build next-generation conversational AI and LLM features at Amazon Alexa, serving millions of users. Before that, seven years deep in high-throughput pub/sub, recommendation, and discovery pipelines.

15k+: concurrent connections sustained
7+: years on Alexa-scale systems
M+: users served in production
3: featured deep-dive projects

// EXPERIENCE

Where the load was real.

2023 - Present · Amazon Alexa AI

Software Engineer, Conversational AI

Architecting next-generation conversational AI and LLM features for an assistant used by millions every day.

Design and ship LLM-backed experiences end to end: prompt orchestration, retrieval, and guardrails for production conversational agents.
Build the LLMOps backbone - evaluation harnesses, online/offline experiment loops, and observability for model behavior at scale.
Tune latency and cost on the inference path so generative features stay responsive under millions of requests.

2018 - 2023 · Amazon Alexa

Software Engineer, Notifications & Search Catalog

Built high-throughput pub/sub architectures, personalized recommendations, and discovery pipelines for the Alexa catalog.

Designed pub/sub notification systems that fan out reliably across millions of devices with strict delivery guarantees.
Built personalized recommendation and ranking pipelines that surfaced relevant capabilities to the right users.
Owned discovery and search-catalog indexing pipelines, keeping ingestion fresh and queries fast across a massive content corpus.

// FEATURED PROJECTS

Three I keep coming back to.

Dynamic Scaling for Multi-Node Hadoop

An auto-scaling VM performance optimizer for intensive distributed workloads.

Profiles job behavior across a Hadoop cluster and right-sizes the node pool on the fly - spinning capacity up under heavy MapReduce phases and reclaiming it as soon as the work drains, so intensive batch jobs finish faster without idle spend.

Multi-node: elastic cluster

Hadoop · Auto-scaling · Distributed · VM Optimization

E-Commerce High-Concurrency Engine

A custom Node.js + Redis server tuned to hold the line under brutal traffic.

A from-scratch server engine optimized with vertical partitioning and SQL query caching, with Redis fronting the hot path. Load-tested to comfortably sustain 15k+ concurrent connections while keeping tail latency in check.

15k+: concurrent connections

Node.js · Redis · SQL Caching · Vertical Partitioning
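
The "Redis fronting the hot path" idea has the general shape of a cache-aside read: check the cache, fall back to the database on a miss, then store the result with a time-to-live. A minimal sketch of that pattern, written in Python rather than Node.js for brevity. The plain dict standing in for Redis, the `run_query` callback, and the TTL value are all illustrative assumptions, not the project's actual code.

```python
import time
from typing import Any, Callable, Dict, Tuple


class QueryCache:
    """Cache-aside wrapper: serve from cache, fall back to the database."""

    def __init__(self, run_query: Callable[[str], Any],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run_query = run_query   # hypothetical database accessor
        self._ttl = ttl_seconds       # illustrative TTL, tune per query
        self._clock = clock           # injectable so expiry can be tested
        # A dict stands in for Redis: key -> (expiry time, cached result).
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, sql: str) -> Any:
        now = self._clock()
        entry = self._store.get(sql)
        if entry is not None and entry[0] > now:
            self.hits += 1            # fresh entry: skip the database
            return entry[1]
        self.misses += 1              # missing or expired: query and refill
        result = self._run_query(sql)
        self._store[sql] = (now + self._ttl, result)
        return result
```

The hit and miss counters are there because the ratio between them is usually the first thing worth watching when deciding whether a query belongs in the cache at all.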

GenAI Worksheets Research

Research into task-oriented conversational agents.

Explored structured, worksheet-style representations for task-oriented dialogue - giving conversational agents a scaffold to track goals and slots across a multi-turn task, so they complete real workflows instead of drifting.

Conversational AI · Task-oriented · Research · Agents

// SYSTEM DESIGN · INTERACTIVE

Push the controls, watch it react.

High-Throughput LLM Orchestration & Pub/Sub Pipeline

A live model of the kind of pipeline I build: requests fan out through a pub/sub bus to a pool of orchestrator workers, which route to a Redis cache, llama.cpp inference, or the database. Scale the workers, flip caching, or fire a load spike and watch latency, queue depth, and throughput respond.

Latency: 133 ms
Queue depth: 4 msgs
Throughput: 1.1k tok/s
Cache hit: 0%

Ingress: API gateway
EventBridge / SQS: pub/sub fan-out
Orchestrator: worker pool (workers: nominal)
Redis Cache: bypassed
llama.cpp
Database

Steady state. Each orchestrator worker pulls from SQS independently, so adding consumers scales throughput without lock contention or deadlocks.
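
The competing-consumers behavior described above can be sketched in a few lines. This is a toy model, not the pipeline itself: an in-process `queue.Queue` stands in for SQS, threads stand in for orchestrator workers, and `run_worker_pool` and `handle` are hypothetical names. Each worker pulls on its own and writes to its own output list, so the workers share nothing except the queue.

```python
import queue
import threading
from typing import Callable, List


def run_worker_pool(messages: List[str], num_workers: int,
                    handle: Callable[[str], str]) -> List[str]:
    """Drain `messages` with `num_workers` independent consumers."""
    bus = queue.Queue()               # stand-in for an SQS queue
    for msg in messages:
        bus.put(msg)

    # One output list per worker, so workers never write to shared state.
    per_worker: List[List[str]] = [[] for _ in range(num_workers)]

    def worker(idx: int) -> None:
        while True:
            try:
                msg = bus.get_nowait()    # each worker pulls on its own
            except queue.Empty:
                return                    # queue drained, worker exits
            per_worker[idx].append(handle(msg))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge after all workers finish; arrival order is not preserved.
    return [out for outputs in per_worker for out in outputs]
```

Every message is handled exactly once, but which worker handles it and in what order is not defined, which is the same contract a standard queue consumer has to live with.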

// SKILLS RADAR

The stack I reach for.

LANGUAGES

Java · C++

AWS

Bedrock · SageMaker · DynamoDB · SQS

AI / LLM

LLMOps · LangChain · Model Context Protocol (MCP)

DATA / INFRA

Redis · GraphQL

// TECHNICAL BLOG

Notes from the trenches.

DEEP DIVE · Local inference · ~8 min read

Model Optimization Parameters using llama.cpp

Running large language models on your own hardware comes down to a handful of dials. Get them right and a 7B model is snappy on a laptop; get them wrong and you are swapping to disk or watching tokens trickle out. Here is how I reason about the four that matter most.

Quantization: Q4_K_M vs Q8_0

Quantization trades numerical precision for memory and speed. The weights get packed into fewer bits, so the model shrinks and the arithmetic gets cheaper. The question is always how much quality you are willing to give up for that.

Property	Q4_K_M	Q8_0
Bits / weight	~4.8	~8.5
7B model size	~4.4 GB	~7.2 GB
Quality loss	small, rarely noticed	near-lossless
Speed	faster	slower, heavier
Best for	laptops, edge, RAM-bound	quality-critical, ample VRAM

Q4_K_M is the default I reach for: the K-quant mixed scheme keeps the sensitive layers at higher precision while squeezing the rest, so you get most of the size win without the blunt quality drop of a flat 4-bit. Q8_0 is for when output fidelity is non-negotiable and you have the memory to spare - evaluation runs, structured extraction, anything where a subtle drift in logits actually matters.

Rule of thumb: start at Q4_K_M. Only move up to Q5/Q8 if you can measure a quality regression on your own eval set, not because the higher number feels safer.
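
The size side of that trade is simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch, assuming a nominal 7e9 parameter count; `estimate_weights_gb` is a hypothetical helper, not part of llama.cpp. Real GGUF files land a little off this estimate because "7B" models do not have exactly 7 billion parameters and because mixed schemes keep some tensors at higher precision.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # assumption: a nominal "7B" parameter count
    # Plug in the bits-per-weight figure for whichever quant you compare.
    for label, bpw in [("16-bit baseline", 16.0), ("8.5 bits/weight", 8.5)]:
        print(f"{label}: ~{estimate_weights_gb(n_params, bpw):.1f} GB")
```

This only covers the weights. The KV cache and compute buffers come on top and grow with context length, so treat the result as a floor, not a budget.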

Thread allocation

The -t flag sets how many CPU threads handle token generation. More threads are not always faster: decode is largely memory-bandwidth bound, so past the number of physical cores the extra threads mostly contend for the same bandwidth and for shared execution units under hyperthreading, and throughput plateaus or drops.

# match physical cores, not logical threads
# -t  = generation threads (≈ physical cores)
# -tb = batch threads for prompt ingest
./llama-cli -m model.Q4_K_M.gguf \
  -t 8 \
  -tb 16

I start -t at the physical core count and let -tb (batch threads, used during prompt processing) run a little higher, since the prefill stage parallelizes better than autoregressive decode. On a machine that is also in interactive use, drop -t by one so the OS keeps a core and the UI does not stutter under load.
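
That heuristic is easy to script. A minimal sketch, assuming you pass in the physical core count yourself, since Python's standard library only reports logical CPUs. `pick_threads` and `thread_flags` are hypothetical helpers, and the halving fallback assumes 2-way SMT, which will be wrong on chips without hyperthreading or with mixed core types.

```python
import os
from typing import List, Optional, Tuple


def pick_threads(physical_cores: Optional[int] = None,
                 logical_cpus: Optional[int] = None,
                 reserve_for_os: int = 1) -> Tuple[int, int]:
    """Return (generation_threads, batch_threads) for -t and -tb."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    if physical_cores is None:
        # Assumption: 2-way SMT, so physical cores ~= logical CPUs / 2.
        physical_cores = max(1, logical_cpus // 2)
    gen = max(1, physical_cores - reserve_for_os)
    # Prefill parallelizes better, so batch threads may use logical CPUs.
    batch = max(gen, logical_cpus - reserve_for_os)
    return gen, batch


def thread_flags(gen: int, batch: int) -> List[str]:
    """Format the two counts as llama.cpp command-line arguments."""
    return ["-t", str(gen), "-tb", str(batch)]
```

With 8 physical cores, 16 logical CPUs, and `reserve_for_os=0` this gives the `-t 8 -tb 16` pair used in the command above. Treat the output as a starting point and benchmark from there, since the best value depends on memory bandwidth more than on core count.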

Temperature & top-p

These two shape the sampling distribution. Temperature flattens or sharpens the probabilities; top-p (nucleus sampling) clips the long tail by keeping only the smallest set of tokens whose cumulative probability reaches p.

Goal	Temperature	Top-p
Deterministic / code	0.0 - 0.2	0.9
Balanced answers	0.6 - 0.7	0.9
Creative writing	0.9 - 1.1	0.95

Tune one at a time. For anything that has to be correct - code, extraction, tool calls - I pull temperature toward zero; at that point sampling is nearly greedy and top-p has little left to trim, so I leave it at 0.9. For ideation I raise temperature first and only widen top-p if the output feels too safe.
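
To make the interaction concrete, here is a small pure-Python sketch of the two steps on a toy vocabulary. It applies temperature first and top-p second, which is one common ordering and an assumption here; real sampler chains, llama.cpp's included, may order or combine the steps differently. `temperature_top_p` and `sample` are hypothetical names.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


def temperature_top_p(logits: Dict[str, float], temperature: float,
                      top_p: float) -> List[Tuple[str, float]]:
    """Return renormalized (token, probability) pairs after both filters."""
    if temperature <= 0.0:
        # Treat zero temperature as greedy decoding: keep only the argmax.
        best = max(logits, key=lambda tok: logits[tok])
        return [(best, 1.0)]
    # Temperature: divide logits, then softmax (shifted for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    mass = sum(prob for _, prob in kept)
    return [(tok, prob / mass) for tok, prob in kept]


def sample(candidates: List[Tuple[str, float]],
           rng: Optional[random.Random] = None) -> str:
    """Draw one token from the filtered distribution."""
    rng = rng or random.Random()
    tokens = [tok for tok, _ in candidates]
    weights = [prob for _, prob in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Running it on logits like `{"a": 2.0, "b": 1.0, "c": 0.0}` shows why low temperature makes top-p nearly irrelevant: at 0.2 the top token alone already exceeds 0.9 of the mass, so the nucleus collapses to a single candidate.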

GPU offloading: Metal vs CUDA

The -ngl flag offloads N transformer layers to the GPU. Each layer you move off the CPU is a chunk of matrix math running on far more parallel hardware - the single biggest speedup available, as long as the layers fit in VRAM.

# offload all layers if they fit in VRAM
./llama-cli -m model.Q4_K_M.gguf -ngl 99

# Metal (Apple Silicon) builds: unified memory, no copy
# CUDA (NVIDIA) builds: watch VRAM headroom for the KV cache

Metal (Apple Silicon): unified memory means the GPU and CPU share one pool, so there is no PCIe copy tax. Offload aggressively; the ceiling is the share of system RAM the GPU is allowed to use, which is less than the total.
CUDA (NVIDIA): dedicated VRAM is faster per layer but finite. Size -ngl so weights and the KV cache fit - overshoot and allocation either fails outright or, where the driver falls back to system memory, spills to host memory over PCIe, which is slower than just keeping those layers on the CPU.

The tradeoff: offloading is pure win until you run out of VRAM, then it is a cliff. Profile actual memory use at your target context length before maxing -ngl.
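
A back-of-the-envelope budget helps before you start profiling. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per token) and then asks how many layers fit. It is deliberately simplified: weights are assumed to be spread evenly across layers, embeddings and output tensors are ignored, and the 1 GiB overhead for compute buffers is a guess. `kv_cache_bytes` and `max_gpu_layers` are hypothetical helpers, and the model dimensions in the usage note are assumptions for a 7B-class layout.

```python
GIB = 2 ** 30


def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def max_gpu_layers(vram_bytes: int, weights_bytes: int, n_layers: int,
                   n_ctx: int, n_kv_heads: int, head_dim: int,
                   overhead_bytes: int = GIB,
                   bytes_per_elem: int = 2) -> int:
    """Estimate the largest -ngl whose weights and KV cache fit in VRAM."""
    # Assumption: every layer holds an equal share of the weights.
    per_layer = (weights_bytes / n_layers
                 + kv_cache_bytes(1, n_ctx, n_kv_heads, head_dim,
                                  bytes_per_elem))
    usable = vram_bytes - overhead_bytes   # guessed compute-buffer headroom
    if usable <= 0:
        return 0
    return min(n_layers, int(usable // per_layer))
```

For a layout with 32 layers, 32 KV heads, and head dimension 128, a 4096-token f16 cache comes to 2 GiB, which is why a model that fits comfortably at short context can fall off the cliff at long context. Use the estimate to pick a first -ngl value, then confirm with real memory readings.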

Got a hard systems problem? Let's dig in.

hpj1992.poke.site