
End-to-End Regression with scikit-learn (Numeric Features Only)

This chapter demonstrates a complete regression workflow using scikit-learn and the California Housing dataset. All features in this dataset are numeric, which allows us to focus on the core machine learning pipeline without introducing categorical preprocessing.

The goal is not to optimize performance aggressively, but to build a correct, leak-free, and reproducible workflow.


Load the Dataset

We load the California Housing dataset directly from scikit-learn. It contains 8 numeric features and one numeric target variable, MedHouseVal.

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.frame.copy()

target_col = "MedHouseVal"
df.head()
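
As an optional sanity check, we can confirm that every column really is numeric and report any missing values before moving on:

# Quick sanity check on the raw frame: all dtypes should be numeric,
# and the per-column missing-value counts should be zero.
print(df.dtypes)
print(df.isna().sum())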

Train / Test Split

We split the dataset into training and test sets before any preprocessing. The test set is treated as hands-off until the final evaluation to prevent data leakage.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42
)

train_df.shape, test_df.shape
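
For an explicit guarantee that the two sets are disjoint, a small assertion on the DataFrame indices makes the hands-off contract concrete; this optional check is not required by scikit-learn:

# The train and test indices must not overlap; this fails loudly if they do.
assert train_df.index.intersection(test_df.index).empty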

Separate Features and Target

We separate input features (X) from the target variable (y) for both training and test data.

X_train = train_df.drop(columns=[target_col])
y_train = train_df[target_col].copy()

X_test = test_df.drop(columns=[target_col])
y_test = test_df[target_col].copy()

Numeric Feature Preprocessing

All features in this dataset are numeric. We explicitly list the numeric columns and define a preprocessing pipeline that:

  • imputes missing values using the median
  • standardizes features using StandardScaler

# All columns are numeric, so every column goes through the same pipeline.
num_cols = list(X_train.columns)
num_cols

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
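
To preview what this preprocessing produces, we can fit it on the training features alone and inspect the output. This is a throwaway sketch: the full pipeline below refits everything from scratch, and fitting only on training data is what keeps the step leak-free.

# Fit the preprocessing on training data only, then inspect the result.
# After StandardScaler, each column should have mean ~0 and std ~1.
X_train_prepped = numeric_pipeline.fit_transform(X_train)
print(X_train_prepped.mean(axis=0).round(3))
print(X_train_prepped.std(axis=0).round(3))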

Build the Full Pipeline

We choose a RandomForestRegressor as the model and combine it with the preprocessing pipeline. Keeping preprocessing and modeling inside a single pipeline ensures consistency across training and evaluation.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline(steps=[
    ("prep", numeric_pipeline),
    ("model", model),
])

print(pipe)
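
A practical benefit of the single-pipeline structure is that every hyperparameter stays addressable through the step__parameter naming scheme used by tools such as GridSearchCV; a brief illustration:

# Nested parameters use "<step name>__<parameter>" keys,
# e.g. when tuning the forest size with grid search later on.
pipe.set_params(model__n_estimators=300)
print(pipe.get_params()["model__n_estimators"])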

Cross-Validation on the Training Set

Before touching the test set, we estimate performance using 5-fold cross-validation on the training data. We use RMSE (Root Mean Squared Error) as the evaluation metric.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipe,
    X_train,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=5
)

rmse = -scores
rmse.mean(), rmse.std()
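
If a rough read on overfitting is also wanted, cross_validate can report training-fold scores alongside the validation scores; this is an optional variant of the call above:

from sklearn.model_selection import cross_validate

# return_train_score=True adds per-fold training RMSE, so a large
# train/validation gap can flag overfitting before the test set is touched.
cv_results = cross_validate(
    pipe,
    X_train,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=5,
    return_train_score=True,
)
print("train RMSE:", -cv_results["train_score"].mean())
print("valid RMSE:", -cv_results["test_score"].mean())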

Final Evaluation on the Test Set

After cross-validation, we fit the pipeline on the full training set and evaluate once on the test set. This provides an unbiased estimate of generalization performance.

from sklearn.metrics import root_mean_squared_error

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_pred)

test_rmse

Interpretation

The final test RMSE is approximately 0.50.

Since the target variable is expressed in units of $100,000, this corresponds to a typical prediction error of roughly $50,000.
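
The conversion is plain unit arithmetic; assuming test_rmse from the previous step is still in scope:

# MedHouseVal is in units of $100,000, so scale the RMSE to dollars.
rmse_dollars = test_rmse * 100_000
print(f"Typical prediction error: ${rmse_dollars:,.0f}")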

This result is realistic for this dataset and confirms that the workflow is functioning correctly.


Summary

In this chapter we:

  • split the data before preprocessing to avoid leakage
  • built a numeric-only preprocessing pipeline
  • combined preprocessing and modeling using Pipeline
  • evaluated using cross-validation and a final test set

This structure serves as a clean foundation for future extensions, such as handling categorical features or experimenting with other models.
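
As a sketch of that categorical extension, the usual approach is a ColumnTransformer that routes each group of columns to its own sub-pipeline. The categorical column list below is purely hypothetical, since this dataset has none:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: this dataset has no categorical columns,
# so "cat_cols" is illustrative only.
cat_cols = []  # e.g. ["ocean_proximity"] in other housing datasets

preprocess = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])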


Reproducibility

  • Python ≥ 3.10
  • scikit-learn ≥ 1.6.1
  • Fixed random seed (random_state=42)
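
To verify that the runtime matches these requirements, printing the versions at the top of the notebook is a simple habit:

import sys
import sklearn

# Confirm the runtime matches the versions listed above.
print(sys.version)
print(sklearn.__version__)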

🔗 Full runnable notebook:

▶ Run this notebook on Google Colab