Inference

Using a trained model to produce outputs on new inputs: after training is finished (also called deployment or forward pass in many setups).

Inference is run-time compute: given an input, run the forward pass of a trained model to get predictions, text, embeddings, or actions. Cost drivers are latency, throughput, memory, and sometimes per-token pricing for hosted LLMs.

It is conceptually separate from training, though some systems interleave online learning rarely in production safety stacks.