Inspecting the California Housing Dataset (Before Preprocessing)
Before any preprocessing or modeling, we should first understand the raw dataset: its structure, data types, missing values, and basic statistical properties. This step prevents silent bugs and informs correct preprocessing decisions later.
Load the Dataset
We download and load the California Housing dataset from Géron’s public repository. The dataset is cached locally to avoid repeated downloads.
```python
from pathlib import Path
import tarfile
import urllib.request

import pandas as pd


def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            # filter="data" rejects unsafe tar members (needs a recent Python)
            housing_tarball.extractall(path="datasets", filter="data")
    return pd.read_csv(Path("datasets/housing/housing.csv"))


housing_full = load_housing_data()
housing_full.shape, housing_full.head()
```
This gives us a first look at:
- dataset size (rows × columns)
- feature names
- rough value ranges
Dataset Structure and Data Types
We inspect column types and missing-value counts.
```python
housing_full.info()
```
This step answers:
- which features are numeric vs categorical
- whether any columns have missing values
- whether dtypes are suitable for downstream pipelines
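The numeric/categorical split that `info()` reveals can also be derived programmatically with `select_dtypes`. A minimal sketch on a small synthetic frame (the column names mimic the housing data, but the values are made up):

```python
import pandas as pd

# Tiny stand-in for housing_full (synthetic values, illustration only)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, None],      # numeric, with one missing value
    "median_income": [8.3252, 8.3014, 7.2574], # numeric, complete
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],  # categorical
})

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(num_cols)  # ['total_rooms', 'median_income']
print(cat_cols)  # ['ocean_proximity']
```

The same two lists, computed on `housing_full`, are what a preprocessing pipeline would feed to its numeric and categorical branches.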
Descriptive Statistics (Numeric Features)
We compute summary statistics for numeric columns.
```python
housing_full.describe()
```
From this, we can quickly spot:
- skewed distributions (mean vs median)
- extreme min/max values (potential outliers)
- scale differences across features
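The mean-vs-median comparison behind the first bullet can be checked directly: in a right-skewed distribution, a long upper tail pulls the mean above the median (the `50%` row of `describe()`). A sketch on a synthetic series, not the real data:

```python
import pandas as pd

# Synthetic right-skewed feature: mostly small values plus one large outlier
s = pd.Series([1, 2, 2, 3, 3, 3, 50])

print(s.mean() > s.median())  # True: the outlier pulls the mean upward
print(s.skew() > 0)           # True: positive (right) skew
```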
Categorical Feature Distribution
The dataset contains one categorical feature: ocean_proximity.
```python
housing_full["ocean_proximity"].value_counts()
```
This reveals:
- category balance
- whether rare categories exist
- whether one-hot encoding is appropriate
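For a quick sense of what one-hot encoding would produce, `pd.get_dummies` is a convenient sketch (a real pipeline would more likely use scikit-learn's `OneHotEncoder` so the encoding is fitted once and reused). Toy values here, not the actual category counts:

```python
import pandas as pd

# Toy stand-in for the ocean_proximity column
s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND"], name="ocean_proximity")

dummies = pd.get_dummies(s)          # one boolean column per category
print(dummies.columns.tolist())      # ['INLAND', 'ISLAND', 'NEAR BAY']
print(dummies.shape)                 # (4, 3): 4 rows, 3 categories
```

Rare categories (like `ISLAND` in the real data) each get their own sparse column, which is worth keeping in mind when judging whether one-hot encoding is appropriate.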
Missing Value Analysis (Counts)
We check how many values are missing in each column.
```python
housing_full.isna().sum().sort_values(ascending=False).head(20)
```
This helps decide:
- which features require imputation
- whether any columns might be dropped entirely
Missing Value Analysis (Ratios)
Absolute counts can be misleading, so we also inspect missing-value ratios.
```python
housing_full.isna().mean().sort_values(ascending=False).head(20)
```
As a rule of thumb:
- features with very high missing ratios deserve scrutiny
- low ratios are usually safe for median or mode imputation
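Median imputation itself is a one-liner; here is a sketch on a synthetic column (in the real dataset, `total_bedrooms` is the column with missing entries):

```python
import pandas as pd

# Synthetic column with one missing value (illustration only)
s = pd.Series([2.0, 4.0, None, 6.0])

# Fill missing entries with the median of the observed values
filled = s.fillna(s.median())
print(filled.isna().sum())  # 0: no missing values remain
print(filled[2])            # 4.0: the median of [2.0, 4.0, 6.0]
```

In a pipeline, the same idea is usually expressed with scikit-learn's `SimpleImputer(strategy="median")`, which remembers the training-set median for use on new data.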
Correlation with Target Variable
We compute correlations for numeric features and focus on the target variable.
```python
num_df = housing_full.select_dtypes(include="number")
corr = num_df.corr()  # numeric_only is redundant after select_dtypes
corr["median_house_value"].sort_values(ascending=False)
```
This provides a quick signal check:
- which features are strongly correlated with the target
- which features may be redundant or weak predictors
Correlation is not causation, but it is a useful early filter.
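As a toy illustration of this signal check (synthetic numbers, not the housing data): a near-linear predictor ranks at the top of the sorted correlation series, while an unrelated column ranks near zero.

```python
import pandas as pd

# Synthetic frame: "income" tracks "house_value", "noise" does not
df = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0],
    "house_value": [100.0, 210.0, 290.0, 400.0],
    "noise": [5.0, 1.0, 4.0, 2.0],
})

corr = df.corr()["house_value"].sort_values(ascending=False)
print(corr)  # income correlates strongly (~1), noise only weakly
```

Note that Pearson correlation only captures linear relationships, which is one more reason to treat this as an early filter rather than a verdict.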
Summary
In this section we:
- inspected dataset shape and schema
- identified numeric and categorical features
- analyzed missing values
- examined basic statistical properties
- checked correlations with the target variable
With this understanding, we are now ready to design correct and informed preprocessing pipelines without guessing.
🔗 Full runnable notebook: