Intro to Machine Learning with Scikit-Learn

Machine learning can feel overwhelming at first, but Scikit-Learn makes it surprisingly approachable. In this post, I'll walk you through the end-to-end workflow I use for most classification and regression tasks.

The Core ML Workflow

Most ML projects follow the same high-level steps: collect data, preprocess it, choose a model, train it, and evaluate it. Scikit-Learn has tools for every stage after collection.

1. Load and Explore Your Data

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('dataset.csv')
print(df.head())          # first few rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing-value count per column

2. Preprocess

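Real-world data is messy. Putting your preprocessing steps in a Pipeline keeps them tidy and helps prevent data leakage, because every step learns its statistics only from the data you pass to .fit().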

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])
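To see what the pipeline above does with data, here is a minimal sketch. The tiny arrays are made up for illustration (they are not from any real dataset), and the names X_train_demo and X_test_demo are placeholders of mine. The point is that fit_transform learns the medians and scaling statistics from the training rows, and transform reuses them on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data, invented for this example, with missing values.
X_train_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[np.nan, 20.0]])

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# fit_transform learns medians, means, and standard deviations
# from the training rows only, then applies them.
X_train_scaled = numeric_pipeline.fit_transform(X_train_demo)

# transform reuses the training statistics; nothing is learned from test rows.
X_test_scaled = numeric_pipeline.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_test_scaled)
```

The missing test value is filled with the training median, not with anything computed from the test set, which is the behavior that keeps leakage out.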

3. Train and Evaluate

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

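# 'target' is a placeholder; replace it with the name of your label column.
X = df.drop(columns=['target'])
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)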

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
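Once a random forest like the one above is trained, its feature_importances_ attribute gives a quick look at which columns the model leans on. The sketch below uses synthetic data from make_classification as a stand-in, and the feature_0 ... feature_4 column names are invented for the example, so treat it as an illustration rather than a recipe for your dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with shuffle=False the informative
# features are the first two columns.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f'feature_{i}' for i in range(X_arr.shape[1])]  # invented names
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# One impurity-based score per column; the scores sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Calling importances.sort_values().plot.barh() would turn this into a plot, assuming matplotlib is installed.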

Tips I Learned the Hard Way

Always fit your scaler on training data only — never on the full dataset.
Cross-validation (for example, cv=5) usually gives a more reliable performance estimate than a single train/test split.
Feature importance plots are your best friend for debugging: they show which columns the model actually relies on.
Start simple (Logistic Regression, Decision Tree) before reaching for complex models.
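Several of these tips can be combined in one short sketch. It assumes synthetic data from make_classification, and names like baseline and cv_results are mine. Because the scaler sits inside a pipeline, cross_val_score refits it on the training portion of each fold, so it never sees the held-out rows. The simple baseline is scored next to a random forest so you can check whether the extra complexity pays off.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Simple baseline: the scaler is refit inside every training fold.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
forest = RandomForestClassifier(n_estimators=100, random_state=42)

cv_results = {}
for name, estimator in [('logistic regression', baseline), ('random forest', forest)]:
    # cv=5 yields five accuracy scores, one per held-out fold.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    cv_results[name] = scores
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

The spread across folds is as useful as the mean: a large standard deviation suggests a single split could have been misleading.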

Scikit-Learn's consistent API — .fit(), .predict(), .score() — means the knowledge you build with one model transfers directly to every other. That's a superpower.
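As a small illustration of that shared API, the loop below trains three different classifiers with identical calls. The data is synthetic and the choice of models is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

classifiers = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(),
]

accuracies = {}
for clf in classifiers:
    clf.fit(X_tr, y_tr)           # same training call for every estimator
    preds = clf.predict(X_te)     # same prediction call
    accuracies[type(clf).__name__] = clf.score(X_te, y_te)  # mean accuracy

print(accuracies)
```

Swapping in another classifier is a one-line change to the list, which is what makes quick model comparisons so cheap.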