What is model serving?
Model serving is exposing a trained model behind an API so applications can send inputs and get predictions. A model sitting in a notebook helps nobody; serving turns it into a usable service with an endpoint, request handling, and a response format.
Why it matters
The whole point of a model is to make predictions in the real world, and serving is the step that delivers that value. It is also where ML meets backend engineering: latency, scaling, and reliability start to matter. Being able to ship a model as a service is what makes you useful beyond research.
What to learn
- Wrapping a model in a web framework like FastAPI
- Loading the model once at startup, not per request
- Input validation and output formatting
- Batch versus real-time inference
- Latency and throughput basics
- Health checks and readiness
- Versioning the served model
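Two of the items above, batch versus real-time inference and latency versus throughput, can be made concrete with a small standard-library sketch. Everything here is an assumption for illustration: ToyModel is a hypothetical stand-in whose every call pays a fixed overhead (simulated with a sleep), and the helper names are invented. It is not a benchmark of any real model.

```python
import time
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


class ToyModel:
    """Hypothetical stand-in for a real model: each call pays a fixed overhead."""

    def __init__(self, overhead_s: float = 0.002) -> None:
        self.overhead_s = overhead_s
        self.calls = 0  # number of times the model was invoked

    def predict_batch(self, rows: Sequence[Row]) -> List[float]:
        self.calls += 1
        time.sleep(self.overhead_s)  # simulated per-call cost
        return [float(sum(row)) for row in rows]


def serve_one_at_a_time(model: ToyModel, rows: Sequence[Row]) -> List[float]:
    """Real-time style: one model call per input, so each answer comes back quickly."""
    return [model.predict_batch([row])[0] for row in rows]


def serve_batched(model: ToyModel, rows: Sequence[Row], batch_size: int) -> List[float]:
    """Batch style: group inputs so the per-call overhead is shared across a batch."""
    out: List[float] = []
    for start in range(0, len(rows), batch_size):
        out.extend(model.predict_batch(rows[start:start + batch_size]))
    return out


def timed(fn: Callable[..., List[float]], *args) -> Tuple[List[float], float]:
    """Return the function's result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rows = [[1.0, 2.0]] * 100
    _, t_single = timed(serve_one_at_a_time, ToyModel(), rows)
    _, t_batch = timed(serve_batched, ToyModel(), rows, 25)
    print(f"one at a time: {len(rows) / t_single:.0f} rows/s")
    print(f"batched:       {len(rows) / t_batch:.0f} rows/s")
```

In this toy setup batching raises throughput because the overhead is paid once per batch, while a single caller may wait longer for its answer if it has to wait for a batch to fill. Which trade-off is acceptable depends on the use case.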
Common pitfall
The pitfall is loading the model from disk inside the request handler, so that every prediction pays the slow load cost. Load the model once when the service starts and reuse it across requests. Loading per request can turn a millisecond-scale prediction into a multi-second one and sharply cut throughput.
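The difference can be shown with a minimal standard-library sketch. The loader, the file name "model.bin", and the sleep that stands in for a slow disk read are all hypothetical; a real service would call its own framework's load function instead.

```python
import time
from typing import Callable, Dict, List

STATS: Dict[str, int] = {"loads": 0}  # counts how often the loader runs


def load_model(path: str) -> Callable[[List[float]], float]:
    """Hypothetical loader: pretends to read a model from disk (slow)."""
    STATS["loads"] += 1
    time.sleep(0.05)  # simulated load cost; real models often take far longer
    return lambda features: float(sum(features))


# Anti-pattern: the handler reloads the model on every request.
def predict_slow(features: List[float]) -> float:
    model = load_model("model.bin")  # hypothetical path
    return model(features)


# Fix: load once when the module (the service) starts, then reuse it.
MODEL = load_model("model.bin")


def predict_fast(features: List[float]) -> float:
    return MODEL(features)
```

Calling predict_fast repeatedly leaves the load counter unchanged, while each call to predict_slow increments it and pays the simulated load time again.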
Resources
Primary (free):
- FastAPI — Documentation · docs
- Hugging Face — Inference · docs
- Made With ML — Serving · course
Practice
Wrap a trained model in a FastAPI service: load it once at startup, expose a predict endpoint that validates input and returns formatted output, and add a health check. Containerize it with the Docker skills from the previous node. Done when the model loads once at startup and repeated requests return quickly because none of them triggers a reload.
Outcomes
- Serve a model behind a validated API endpoint.
- Load the model once at startup, not per request.
- Choose batch or real-time inference for the use case.
- Add health checks and version the served model.