LLM Production Operations: The Three Skills Nobody Teaches You

#llm-operations #debugging #pytorch #profiling #agent-systems #streaming

Nobody talks about this part.

You can fine-tune a 70B model. You can write perfect ReAct prompts. You can wire up 12 tools into an agent that passes every demo.

None of that will save you when you deploy. The production failures will not be clever. They will be silent truncations in streams. They will be retry loops burning $200 overnight. They will be inference running at 12% of the GPU's theoretical throughput, with nobody able to explain why.

These are not research problems. These are operational problems. This article covers the three core skills every senior ML engineer running LLM systems needs to master.

Streaming is not a UX feature. It is a protocol.

Everyone implements streaming because it feels faster. Almost no one implements it correctly.

The most important fact about streaming is this: total end to end latency does not change. Only perceived latency changes. For an identical 4 second generation, a non streamed response shows nothing for 4000ms. A correctly implemented streamed response shows the first token at around 300ms. Users will report that the streamed version feels several times faster. They are not wrong. Perception is reality for interface performance.

Almost every implementation breaks in one of three predictable ways.

The three silent streaming failures

First: you bail out of the loop when text stops arriving. You will never get the stop reason.

In the major LLM streaming APIs, the stop reason is not sent with the last piece of text. It arrives in a separate final event or chunk after all text deltas have completed (Anthropic sends stop_reason in a closing message_delta event; OpenAI sets finish_reason on a final chunk with an empty delta). If you close the HTTP connection the moment you stop receiving text, you will never know if the generation completed normally, hit max tokens, hit a stop sequence, or paused to call a tool.

A max tokens cutoff looks identical to a normal completed response to the end user. You will ship this bug. You will not notice it for three months.
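Here is a minimal Python sketch of the loop shape. It assumes the stream has already been parsed into dicts. The event shapes ("text" and "final") are hypothetical placeholders, not any provider's wire format. The point is only that the loop ends when the stream ends, not when the text does.

```python
from typing import Iterable, Optional, Tuple


def consume_stream(events: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Drain the stream to its end and return (text, stop_reason).

    Hypothetical event shapes, for illustration only:
      {"type": "text", "text": "..."}           one text delta
      {"type": "final", "stop_reason": "..."}   the closing event
    """
    parts = []
    stop_reason = None
    for event in events:  # keep reading until the stream itself ends
        if event.get("type") == "text":
            parts.append(event["text"])
        elif event.get("type") == "final":
            stop_reason = event.get("stop_reason")
    # stop_reason is None if the connection ended before the final event.
    return "".join(parts), stop_reason


text, reason = consume_stream([
    {"type": "text", "text": "Hel"},
    {"type": "text", "text": "lo"},
    {"type": "final", "stop_reason": "max_tokens"},
])
if reason != "end_turn":
    print(f"response may be incomplete: stop_reason={reason!r}")
```

A stop reason of None is worth treating as its own failure. It means the stream died before the provider said why it stopped.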

Second: you do not buffer partial SSE messages.

SSE messages are separated by a blank line (\n\n in most provider streams). TCP does not care about your message boundaries. A single complete SSE message can arrive split across two separate reads. A lot of the streaming code you find on GitHub calls JSON.parse on every chunk received. This code works almost all of the time. Then it crashes, seemingly at random, usually under load.

The fix is one line. After splitting on the delimiter, the last element is always an incomplete fragment, possibly an empty one. Keep it as the buffer for the next chunk and parse only the elements before it. This is not clever. This is just how TCP works.
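A sketch of that buffering in Python, under a few stated assumptions: chunks are already decoded to text, messages are delimited by "\n\n", and every data line carries a JSON payload. SSEBuffer is an illustrative name, not a library class. Real streams may also use "\r\n" line endings, multi-line data fields, or non-JSON sentinel lines, none of which are handled here.

```python
import json


class SSEBuffer:
    """Accumulates raw text chunks and returns only complete SSE payloads."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> list:
        self._buffer += chunk
        # Everything before the last "\n\n" is complete. The final element
        # is a partial message (or ""), so it goes back into the buffer.
        *complete, self._buffer = self._buffer.split("\n\n")
        payloads = []
        for message in complete:
            for line in message.split("\n"):
                if line.startswith("data:"):
                    payloads.append(json.loads(line[len("data:"):].strip()))
        return payloads


buf = SSEBuffer()
first = buf.feed('data: {"a": 1}\n\ndata: {"b"')  # second message is cut in half
second = buf.feed(': 2}\n\n')                      # the rest arrives later
print(first, second)
```

The naive version parses 'data: {"b"' on the first call and throws. This version waits.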

Third: you forget the AbortController.

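When a user abandons a response mid stream, for example by closing a chat panel or changing routes in a single page app, nothing cancels the request unless you do. The HTTP connection stays open. The LLM provider will happily continue generating tokens and charging you for every one. The same leak happens when your backend proxies the stream and keeps reading from the provider after its own client has disconnected. Nobody reads this part of the fetch documentation. For a high traffic service this leak can account for a noticeable share of your total LLM bill.

Pass an AbortSignal to every fetch call. Call abort() when the user navigates away or the component unmounts. This is not optional.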

Agents do not crash. They misbehave.

Normal software fails fast. A normal Python script will throw an exception and stop running the moment something goes wrong.

Agents do not work this way. An agent will happily continue running. It will retry bad inputs. It will summarize stale results. It will produce a perfectly plausible looking wrong answer. It will burn through your budget and you will have no evidence that anything went wrong at all.

Print statements will not help you here. Logging to stdout will not help you here. Dashboards will not help you here. You need a black box.

The 71 line agent flight recorder

You do not need a $50/month observability SaaS to debug agents. You need an append only JSONL file.

Every event that happens during an agent run gets written as one JSON object on one line. No database. No fixed schema. Just append. This format survives process crashes. It survives OOM kills. It survives the container being terminated mid run. As long as each line is flushed when it is written, it will be there when you come back.

The flight recorder does four things: assigns a unique run id, measures duration for every tool call, redacts secrets before writing to disk, and never modifies existing lines.

Wrap every tool call in a context manager that records start, end, error, traceback and duration automatically. You will not remember to add this after the failure. Add it before you deploy.

The sanitization routine is not perfect. It is a seatbelt, not a vault. It will prevent the single most embarrassing operational failure: building a helpful debug trace that quietly logs every API key your system uses.
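What follows is a compact sketch of such a recorder, not a reference implementation. The class name, event names, field names and redaction patterns are all illustrative assumptions. The patterns in particular only catch a couple of common key shapes, so extend them for whatever credentials your system actually handles.

```python
import json
import re
import time
import traceback
import uuid
from contextlib import contextmanager

# Illustrative patterns only: values that look like API keys or bearer tokens.
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9_-]{8,}|Bearer\s+[A-Za-z0-9._-]+)")
SENSITIVE_KEYS = ("api_key", "secret", "password", "authorization")


def redact(value):
    """Recursively mask secret-looking strings and sensitive dict keys."""
    if isinstance(value, str):
        return SECRET_PATTERN.sub("[REDACTED]", value)
    if isinstance(value, dict):
        return {
            k: "[REDACTED]"
            if any(s in str(k).lower() for s in SENSITIVE_KEYS)
            else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    return value


class FlightRecorder:
    def __init__(self, path: str) -> None:
        self.path = path
        self.run_id = uuid.uuid4().hex  # one id per agent run

    def record(self, event: str, **fields) -> None:
        line = {"run_id": self.run_id, "ts": time.time(),
                "event": event, **redact(fields)}
        # Append mode, opened and closed per event: each line is handed to
        # the OS immediately and existing lines are never rewritten.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(line, default=str) + "\n")

    @contextmanager
    def tool_call(self, tool: str, **args):
        self.record("tool_start", tool=tool, args=args)
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.record("tool_error", tool=tool, error=repr(exc),
                        traceback=traceback.format_exc(),
                        duration_ms=(time.perf_counter() - start) * 1000)
            raise  # record the failure, never swallow it
        else:
            self.record("tool_end", tool=tool,
                        duration_ms=(time.perf_counter() - start) * 1000)
```

Usage is one line per tool call: `with recorder.tool_call("search", query=q): ...`. The error path re-raises on purpose. The recorder is there to observe the failure, not to hide it.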

Add the budget guard before you need it

Every runaway agent disaster story starts the same way. Someone wrote a retry loop. It looked harmless.

The guard is not accurate billing. It is a tripwire. It is the last line of defence that will stop the run and write an explicit event explaining exactly why it stopped.

Set conservative defaults. 3 turns maximum. $0.01 maximum budget per run. You can increase these later. You will never regret setting them too low. You will very much regret setting them too high.

When this guard triggers it will not be an annoyance. It will be the thing that saved you from a $700 overnight batch run.
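A tripwire like this fits in a few lines. The sketch below uses the defaults from above and assumes the caller supplies a rough cost estimate per turn. BudgetGuard, BudgetExceeded and the on_stop callback are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised to stop the run; the message says which limit tripped."""


class BudgetGuard:
    def __init__(self, max_turns: int = 3, max_cost_usd: float = 0.01,
                 on_stop: Optional[Callable[[dict], None]] = None) -> None:
        self.max_turns = max_turns
        self.max_cost_usd = max_cost_usd
        self.on_stop = on_stop  # e.g. a function that appends to your JSONL log
        self.turns = 0
        self.cost_usd = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        """Call once per model turn, before acting on the turn's output."""
        self.turns += 1
        self.cost_usd += estimated_cost_usd
        if self.turns > self.max_turns:
            self._stop(f"turn limit exceeded: {self.turns} > {self.max_turns}")
        if self.cost_usd > self.max_cost_usd:
            self._stop(f"cost limit exceeded: ${self.cost_usd:.4f} > "
                       f"${self.max_cost_usd:.4f}")

    def _stop(self, reason: str) -> None:
        event = {"event": "budget_stop", "reason": reason,
                 "turns": self.turns, "cost_usd": round(self.cost_usd, 6)}
        if self.on_stop is not None:
            self.on_stop(event)  # write the explicit "why" before stopping
        raise BudgetExceeded(reason)
```

The cost estimate does not need to be right. It needs to be roughly proportional to real spend, so that a runaway loop trips the wire long before the invoice does.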

Query failures like a database

This is the part that changes everything. You do not grep JSONL files. You query them with DuckDB.

DuckDB can read raw JSONL directly, no import, no server, no schema. You can run SQL over every event from every agent run that ever happened.

You can answer questions in 10 seconds that would take an hour with normal logs:

  • Which tool fails most often?
  • What is the 95th percentile latency for document search?
  • How many runs were stopped by the budget guard last week?
  • Did the same input get retried 12 times?

You are no longer guessing at failures. You are interrogating them.

You cannot optimize what you do not profile

Many LLM inference deployments leave a large share of GPU performance on the table, often more than half. Plenty of people suspect this. Almost no one measures it.

torch.profiler is not a tool for performance experts. It is the only way to know what is actually happening on your GPU.

Most engineers never open the trace. They run the benchmark, see tokens per second, and stop. That is equivalent to judging a car only by its top speed and never opening the hood.

Overhead bound vs compute bound

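For this discussion, think of a workload as sitting in one of two states. (There is a third regime, memory bandwidth bound, which is common for LLM decoding at small batch sizes. The profiler is how you tell them apart.)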

When you are overhead bound: the GPU finishes the work faster than the CPU can feed it new work. The GPU is idle most of the time. For small batch sizes this is the default state. You will see CPU time measured in milliseconds, GPU time measured in microseconds.

When you are compute bound: the GPU is saturated. Work is always waiting. This is the state you want to be in for batch processing.

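Almost everyone running small batch inference is operating overhead bound and does not know it. You will not find this out from GPU utilization metrics. Those metrics lie by omission. The utilization number from nvidia-smi reports the fraction of time during which at least one kernel was running, so a GPU fed a stream of tiny kernels can look busy while doing very little work. Only the profiler will show you the idle gaps.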
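A minimal torch.profiler sketch for seeing this yourself. The workload is a made-up loop of tiny matmuls, chosen because it is the classic overhead bound shape; substitute your own model call. It falls back to CPU-only profiling when no GPU is present, in which case there are no GPU columns to compare.

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(64, 64, device=device)


def many_small_ops(t: torch.Tensor) -> torch.Tensor:
    # Hundreds of tiny kernels: each one is cheap, the launches are not.
    for _ in range(200):
        t = torch.relu((t @ t) * 0.01)
    return t


# Warm up outside the profiled region so one-time setup is excluded.
for _ in range(3):
    many_small_ops(x)

with profile(activities=activities) as prof:
    many_small_ops(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping

table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

On a GPU, compare the CPU and CUDA time columns for the same op. If CPU time dwarfs GPU time, you are overhead bound. Calling prof.export_chrome_trace with a file path writes a trace you can open in a trace viewer to see the gaps between kernels directly.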

The dispatch chain

Every PyTorch operation travels down a long chain.

  1. You call torch.matmul in Python.
  2. It lands on aten::matmul dispatch.
  3. It runs cuBLAS heuristic planning.
  4. It submits the kernel to the CUDA queue.
  5. The kernel eventually runs on the GPU.

Each step adds overhead, typically on the order of a few microseconds. For small operations that overhead can add up to the overwhelming majority of the total runtime.

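This is why torch.compile can give such large speedups for inference. Most of the gain does not come from making individual kernels faster. It comes from fusing many small operations into fewer kernels and, with mode="reduce-overhead", replaying them through CUDA graphs, which removes most of the dispatch overhead between them.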

Warmup is not magic

The first run of an operation is often 10-100x slower than subsequent runs. This is not a bug.

On the first run PyTorch loads modules, allocates workspaces, runs cuBLAS kernel heuristics and caches the result. All of this work only happens once.

Never profile the first 3 runs of anything. Never benchmark cold runs. This is the single most common mistake people make when running performance tests. Everyone makes this mistake at least once.
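The rule is easy to bake into a helper so nobody has to remember it. This is a standard library sketch with illustrative names. The sync parameter is an assumption about how you would adapt it for GPU work: CUDA kernel launches are asynchronous, so pass torch.cuda.synchronize there or you will time the launch, not the work.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 20,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time fn() after discarding the first `warmup` calls."""
    for _ in range(warmup):  # cold runs: executed but never measured
        fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before timing starts
    timings_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "iters": iters,
        "min_ms": min(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


calls = []
result = benchmark(lambda: calls.append(1), warmup=3, iters=5)
print(result)
```

Report the median, not the mean. One slow outlier should not decide whether your optimization worked.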

Closing

None of these skills are glamorous. You will not get on the front page of Hacker News for correctly buffering SSE chunks. You will not give a conference talk for adding a budget guard to your agent loop. You will not get Twitter likes for reading a profiler trace.

These are the skills that keep systems running. These are the skills that stop you getting paged at 3am. These are the skills that separate demo code from production systems.

You can spend all your time chasing the newest model architecture. Or you can spend one afternoon mastering these three skills. One of these will make you someone that people actually call when something breaks.