Inspecting the California Housing Dataset (Before Preprocessing)

Before any preprocessing or modeling, we should first understand the raw dataset: its structure, data types, missing values, and basic statistical properties. This step prevents silent bugs and informs correct preprocessing decisions later.


Load the Dataset

We download and load the California Housing dataset from Géron’s public repository. The dataset is cached locally to avoid repeated downloads.

from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        # Download the archive only once; later calls reuse the cached copy
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            # filter="data" rejects unsafe archive members (Python 3.12+,
            # backported to maintained older releases)
            housing_tarball.extractall(path="datasets", filter="data")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing_full = load_housing_data()
housing_full.shape, housing_full.head()

This gives us a first look at:

  • dataset size (rows × columns)
  • feature names
  • rough value ranges

Dataset Structure and Data Types

We inspect column types and missing-value counts.

housing_full.info()

This step answers:

  • which features are numeric vs categorical
  • whether any columns have missing values
  • whether dtypes are suitable for downstream pipelines
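Beyond reading the `info()` output by eye, the numeric/categorical split can be computed directly. A minimal sketch, using a tiny toy DataFrame (the values and the two-column schema here are illustrative, not the full dataset):

```python
import pandas as pd

# Toy frame mimicking part of the housing schema (illustrative values only)
df = pd.DataFrame({
    "median_income": [8.3, 7.2, None],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Split columns by dtype, matching what info() reports
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
```

Having these two lists up front makes it easy to route columns to separate numeric and categorical preprocessing steps later.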

Descriptive Statistics (Numeric Features)

We compute summary statistics for numeric columns.

housing_full.describe()

From this, we can quickly spot:

  • skewed distributions (mean vs median)
  • extreme min/max values (potential outliers)
  • scale differences across features
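The mean-versus-median comparison can be checked programmatically from the `describe()` output. A small sketch on deliberately skewed toy data (not real housing values):

```python
import pandas as pd

# Illustrative right-skewed data: one large value pulls the mean up
s = pd.Series([1, 2, 2, 3, 100], name="total_rooms")
stats = s.describe()

# A mean far above the median (the "50%" row) suggests right skew
is_right_skewed = stats["mean"] > stats["50%"]
```

The same comparison applied column-by-column to the real `describe()` table quickly flags candidates for a log transform.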

Categorical Feature Distribution

The dataset contains one categorical feature: ocean_proximity.

housing_full["ocean_proximity"].value_counts()

This reveals:

  • category balance
  • whether rare categories exist
  • whether one-hot encoding is appropriate
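If the categories look reasonable for one-hot encoding, a quick sketch of what that produces (using a short toy series with two of the real category labels):

```python
import pandas as pd

prox = pd.Series(["NEAR BAY", "INLAND", "INLAND"], name="ocean_proximity")

# One indicator column per category (boolean in recent pandas versions)
dummies = pd.get_dummies(prox)
```

In a real pipeline you would typically use scikit-learn's OneHotEncoder instead, since it remembers the category set between fit and transform; `get_dummies` is shown here only to visualize the encoding.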

Missing Value Analysis (Counts)

We check how many values are missing in each column.

housing_full.isna().sum().sort_values(ascending=False).head(20)

This helps decide:

  • which features require imputation
  • whether any columns might be dropped entirely
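For a column with a modest number of missing values, median imputation is the usual first choice. A minimal pandas sketch on toy data (in a full pipeline you would more likely use scikit-learn's SimpleImputer so the learned median carries over to new data):

```python
import pandas as pd

# Toy column with one missing value (illustrative numbers)
df = pd.DataFrame({"total_bedrooms": [2.0, None, 4.0]})

# Fill missing entries with the column median
median = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median)
```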

Missing Value Analysis (Ratios)

Absolute counts can be misleading, so we also inspect missing-value ratios.

housing_full.isna().mean().sort_values(ascending=False).head(20)

As a rule of thumb:

  • features with very high missing ratios deserve scrutiny
  • low ratios are usually safe for median or mode imputation
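That rule of thumb is easy to turn into a check. A sketch that flags columns whose missing ratio exceeds a threshold (the 0.5 cutoff and the toy columns are assumptions for illustration):

```python
import pandas as pd

# Toy frame: column "a" is 75% missing, "b" is 25% missing
df = pd.DataFrame({
    "a": [1, None, None, None],
    "b": [1, 2, 3, None],
})

ratios = df.isna().mean()
# Columns above the (illustrative) 0.5 cutoff deserve manual scrutiny
flagged = ratios[ratios > 0.5].index.tolist()
```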

Correlation with Target Variable

We compute correlations for numeric features and focus on the target variable.

num_df = housing_full.select_dtypes(include="number")
corr = num_df.corr()  # num_df is already numeric-only
corr["median_house_value"].sort_values(ascending=False)

This provides a quick signal check:

  • which features are strongly correlated with the target
  • which features may be redundant or weak predictors

Correlation is not causation, but it is a useful early filter.
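The "early filter" idea can be sketched as ranking features by absolute correlation with the target. Here on a toy frame, with `target` standing in for `median_house_value` (all names and numbers are illustrative):

```python
import pandas as pd

# Toy numeric frame: "income" tracks the target perfectly, "noise" does not
df = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0],
    "noise":  [5.0, 1.0, 4.0, 2.0],
    "target": [2.0, 4.0, 6.0, 8.0],
})

# Rank features by |correlation| with the target (sign-agnostic strength)
corr = df.corr()["target"].drop("target")
top_features = corr.abs().sort_values(ascending=False).index.tolist()
```

Ranking by absolute value matters because a strong negative correlation is just as informative as a strong positive one.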


Summary

In this section we:

  • inspected dataset shape and schema
  • identified numeric and categorical features
  • analyzed missing values
  • examined basic statistical properties
  • checked correlations with the target variable

With this understanding, we are now ready to design correct and informed preprocessing pipelines without guessing.

🔗 Full runnable notebook:

▶ Run this notebook on Google Colab