What is scikit-learn?
scikit-learn is the standard Python library for classical machine learning. It
gives you a consistent interface (fit, predict, transform) across
dozens of algorithms, plus tools for preprocessing, evaluation, and pipelines.
Most machine learning outside deep learning is built with it.
Why it matters
scikit-learn turns the concepts from earlier nodes into working code with very little boilerplate. Its uniform API means switching from one model to another is a one-line change, so you can experiment fast. It is a daily tool and a common expectation for data and ML roles.
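As a minimal sketch of that uniform API, the loop below trains two different classifiers with exactly the same calls. The synthetic dataset, the two model choices, and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption) so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_held_out, y_train, y_held_out = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two unrelated algorithms, driven by identical fit/predict/score calls.
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)                # learn from the training data
    predictions = model.predict(X_held_out)    # predict labels for unseen rows
    scores[type(model).__name__] = model.score(X_held_out, y_held_out)

for name, accuracy in scores.items():
    print(f"{name}: accuracy {accuracy:.3f}")
```

Trying a third model only requires adding one entry to the models list; the rest of the code stays unchanged.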
What to learn
- The estimator API: fit, predict, transform
- Pipelines that chain preprocessing and a model
- Train/test split and cross-validation helpers
- Hyperparameter search with grid and random search
- The built-in metrics
- Saving and loading fitted models
- Reading the documentation to find the right tool
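Several of the items above can be sketched together in one short script: a pipeline that chains a scaler and a classifier, a cross-validated score, and a save/reload round trip. The dataset, the step names "scale" and "clf", and the file name are illustrative assumptions; joblib is used here as one common way to persist a fitted model, not the only one.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption) so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Chain preprocessing and a model; the pipeline behaves like one estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation: the scaler is refit inside each training fold,
# so no statistics leak from the validation fold.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.3f}")

# Fit once on all available training data, then save and reload.
pipeline.fit(X, y)
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")  # hypothetical path
joblib.dump(pipeline, model_path)
reloaded = joblib.load(model_path)

# The reloaded pipeline should give the same predictions as the original.
same_predictions = bool((reloaded.predict(X) == pipeline.predict(X)).all())
print(f"reloaded pipeline matches: {same_predictions}")
```

Saving the whole pipeline, rather than only the classifier, keeps the fitted preprocessing together with the model, so new data goes through the same transformations at prediction time.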
Common pitfall
The common mistake is tuning hyperparameters against the test set: trying settings until the test score goes up. That quietly overfits to the test set, and the reported score no longer predicts real performance. Tune on a validation set or with cross-validation, and touch the test set only once, at the very end.
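One way to follow that advice is sketched below, assuming a synthetic dataset and a small, arbitrary grid over the regularization strength C. The grid search sees only the training split and picks settings by cross-validation; the test split is scored exactly once, after tuning is finished.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption) so the example is self-contained.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Set the test split aside first; it is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "clf__C" addresses the C parameter of the step named "clf".
# The candidate values are illustrative, not recommendations.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored by 5-fold cross-validation on the training split.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(f"best params: {search.best_params_}")
print(f"best CV accuracy: {search.best_score_:.3f}")

# The single, final use of the test set.
test_accuracy = search.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

If the test score disappoints, going back to retune and then scoring the test set again reintroduces the pitfall. RandomizedSearchCV follows the same pattern when the grid is too large to search exhaustively.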
Resources
Primary (free):
- scikit-learn — User guide · docs
- scikit-learn — Getting started · docs
- scikit-learn — Tutorials · docs
Practice
Build an end-to-end scikit-learn pipeline: preprocessing plus a classifier, evaluated with cross-validation on the training data, then tuned with a grid search that scores candidates on a validation split or cross-validation folds, never on the test set. Save the final fitted pipeline to disk and reload it to predict. You are done when the test set has been used only once, at the end.
Outcomes
- Use the consistent fit/predict/transform API.
- Chain steps into a pipeline.
- Tune hyperparameters with cross-validation, not the test set.
- Save and reload a fitted model.