6 changes: 6 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,6 @@
- bump: patch
  changes:
    fixed:
      - Fixed QRF to correctly encode categorical imputed variables that become predictors.
      - Added `not_numeric_categorical` parameter to control whether discrete numeric variables are treated as categorical.
      - Replaced Total Variation Distance with KL divergence for categorical distribution comparisons.
19 changes: 17 additions & 2 deletions docs/imputation-benchmarking/benchmarking-methods.ipynb
@@ -3301,16 +3301,31 @@
"\n",
"### Distribution similarity metrics\n",
"\n",
"The `compare_distributions()` function evaluates how well the imputed values preserve the distributional characteristics of the original data. It uses the Wasserstein distance (also known as Earth Mover's Distance) to quantify the difference between the distribution of imputed values and the true distribution.\n",
"The `compare_distributions()` function evaluates how well the imputed values preserve the distributional characteristics of the original data. It automatically selects the appropriate metric based on the variable type: Wasserstein distance for continuous numerical variables and Kullback-Leibler (KL) divergence for discrete categorical and boolean variables.\n",
"\n",
"The Wasserstein distance between two probability distributions $P$ and $Q$ is defined as:\n",
"#### Wasserstein distance for numerical variables\n",
"\n",
"For continuous numerical variables, the framework uses the Wasserstein distance (also known as Earth Mover's Distance) to quantify the difference between distributions. The Wasserstein distance between two probability distributions $P$ and $Q$ is defined as:\n",
"\n",
"$$W_p(P, Q) = \\left(\\inf_{\\gamma \\in \\Pi(P, Q)} \\int_{X \\times Y} d(x, y)^p d\\gamma(x, y)\\right)^{1/p}$$\n",
"\n",
"where $\\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively.\n",
"\n",
"The Wasserstein distance measures the minimum \"work\" required to transform one distribution into another, where work is defined as the amount of distribution mass moved times the distance it's moved. Lower values indicate better preservation of the original distribution's shape. In the SCF example, QRF shows the lowest Wasserstein distance (1.2e7), indicating it best preserves the distribution of net worth values, while QuantReg shows the highest distance (2.8e7), suggesting greater distributional distortion.\n",
"\n",
"#### Kullback-Leibler divergence for categorical and boolean variables\n",
"\n",
"For discrete distributions (categorical and boolean variables), the framework employs KL divergence, an information-theoretic measure that quantifies how one probability distribution diverges from a reference distribution. The KL divergence from distribution $Q$ to distribution $P$ is defined as:\n",
"\n",
"$$D_{KL}(P||Q) = \\sum_{x \\in \\mathcal{X}} P(x) \\log\\left(\\frac{P(x)}{Q(x)}\\right)$$\n",
"\n",
"where:\n",
"- $P$ is the reference distribution (original data)\n",
"- $Q$ is the approximation (imputed data)\n",
"- $\\mathcal{X}$ is the set of all possible categorical values\n",
"\n",
"In the context of imputation evaluation, KL divergence measures how much information is lost when using the imputed distribution $Q$ to approximate the true distribution $P$. Lower KL divergence values indicate better preservation of the original categorical distribution.\n",
"\n",
"## Predictor analysis and sensitivity evaluation\n",
"\n",
"Beyond comparing imputation methods, understanding the relationship between predictors and target variables, as well as the sensitivity of imputation quality to predictor selection, provides crucial insights for model optimization and feature engineering.\n",
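To make the two metrics concrete, here is a minimal sketch (illustrative only, not part of this changeset) that computes the Wasserstein distance on a continuous toy sample and the KL divergence on a categorical toy sample, mirroring the metric selection described in the notebook text above; all names and values are made up.

```python
import numpy as np
import pandas as pd
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance

# Continuous toy samples: Wasserstein distance compares the two empirical distributions.
donor_income = np.array([30_000, 45_000, 52_000, 61_000, 75_000], dtype=float)
imputed_income = np.array([32_000, 44_000, 50_000, 65_000, 80_000], dtype=float)
print(wasserstein_distance(donor_income, imputed_income))  # lower is better

# Categorical toy samples: KL divergence compares the category frequencies.
donor_region = pd.Series(["north", "north", "south", "east"])
imputed_region = pd.Series(["north", "south", "south", "east"])
categories = sorted(set(donor_region) | set(imputed_region))
p = np.array([(donor_region == c).mean() for c in categories])  # reference P (donor)
q = np.array([(imputed_region == c).mean() for c in categories])  # approximation Q (imputed)
q = np.maximum(q, 1e-10)  # floor Q to avoid division by zero, as in the PR's implementation
print(rel_entr(p, q).sum())  # 0 only when the two category distributions match exactly
```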
3 changes: 3 additions & 0 deletions microimpute/__init__.py
@@ -27,11 +27,14 @@

# Import comparison and metric utilities
from microimpute.comparisons.metrics import (
compare_distributions,
compare_metrics,
compute_loss,
get_metric_for_variable_type,
kl_divergence,
log_loss,
quantile_loss,
wasserstein_distance,
)

# Import validation utilities
2 changes: 2 additions & 0 deletions microimpute/comparisons/__init__.py
@@ -24,8 +24,10 @@
compare_metrics,
compute_loss,
get_metric_for_variable_type,
kl_divergence,
log_loss,
quantile_loss,
wasserstein_distance,
)

# Import validation utilities
68 changes: 42 additions & 26 deletions microimpute/comparisons/metrics.py
@@ -3,7 +3,7 @@
This module contains utilities for evaluating imputation quality using various metrics:
- Quantile loss for numerical variables
- Log loss for categorical variables
- Distributional similarity metrics (Wasserstein distance, Total Variation Distance)
- Distributional similarity metrics (Wasserstein distance, KL Divergence)
The module automatically detects which metric to use based on variable type.
"""

@@ -13,6 +13,7 @@
import numpy as np
import pandas as pd
from pydantic import validate_call
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance
from sklearn.metrics import log_loss as sklearn_log_loss

@@ -495,25 +496,35 @@ def compare_metrics(
raise RuntimeError(f"Failed to compare metrics: {str(e)}") from e


def total_variation_distance(
def kl_divergence(
donor_values: np.ndarray, receiver_values: np.ndarray
) -> float:
"""Calculate Total Variation Distance between two categorical distributions.
"""Calculate Kullback-Leibler (KL) Divergence between two categorical distributions.

Total Variation Distance (TVD) measures the maximum difference between
two probability distributions. For categorical variables, it is calculated as:
TVD = 0.5 * sum(|P(x) - Q(x)|) for all categories x
KL divergence measures the difference between two probability distributions.
For categorical variables, it is calculated as:
KL(P||Q) = sum(P(x) * log(P(x) / Q(x))) for all categories x

This implementation uses the donor distribution as P (reference) and
receiver distribution as Q (approximation), measuring how well the
receiver distribution approximates the donor distribution.

Args:
donor_values: Array of categorical values from donor data.
receiver_values: Array of categorical values from receiver data.
donor_values: Array of categorical values from donor data (reference distribution P).
receiver_values: Array of categorical values from receiver data (approximation Q).

Returns:
Total variation distance value between 0 and 1, where 0 indicates
identical distributions and 1 indicates completely disjoint distributions.
KL divergence value >= 0, where 0 indicates identical distributions
and larger values indicate greater divergence. Note: KL divergence is
unbounded and can be infinite if Q(x) = 0 for some x where P(x) > 0.

Raises:
ValueError: If inputs are empty or invalid.

Note:
- KL divergence is not symmetric: KL(P||Q) != KL(Q||P)
- To handle zero probabilities, receiver probabilities are floored at a small epsilon to avoid division by zero
- Uses scipy.special.rel_entr for numerical stability
"""
if len(donor_values) == 0 or len(receiver_values) == 0:
raise ValueError(
@@ -529,15 +540,22 @@ def total_variation_distance(
donor_counts = pd.Series(donor_values).value_counts(normalize=True)
receiver_counts = pd.Series(receiver_values).value_counts(normalize=True)

# Calculate TVD
tvd = 0.0
for category in all_categories:
p_donor = donor_counts.get(category, 0.0)
p_receiver = receiver_counts.get(category, 0.0)
tvd += abs(p_donor - p_receiver)
# Create probability arrays for all categories
p_donor = np.array([donor_counts.get(cat, 0.0) for cat in all_categories])
q_receiver = np.array(
[receiver_counts.get(cat, 0.0) for cat in all_categories]
)

# TVD is half the sum of absolute differences
return tvd / 2.0
# Floor receiver probabilities at a small epsilon to avoid division by zero
epsilon = 1e-10
q_receiver = np.maximum(q_receiver, epsilon)

# Calculate KL divergence using scipy.special.rel_entr
# rel_entr(p, q) computes p * log(p/q) element-wise
kl_values = rel_entr(p_donor, q_receiver)

# Sum over all categories to get total KL divergence
return np.sum(kl_values)


@validate_call(config=VALIDATE_CONFIG)
@@ -550,7 +568,7 @@ def compare_distributions(

Evaluates distributional similarity using appropriate metrics:
- Wasserstein Distance for numerical variables
- Total Variation Distance for categorical variables
- KL Divergence for categorical variables

Args:
donor_data: DataFrame containing original donor data.
@@ -575,7 +593,7 @@
>>> print(result)
Variable Metric Distance
0 income wasserstein_distance 66.666667
1 region total_variation_distance 0.166667
1 region kl_divergence 0.166667
"""
try:
log.info(
@@ -613,13 +631,11 @@

# Choose appropriate metric
if var_type in ["bool", "categorical", "numeric_categorical"]:
# Use Total Variation Distance for categorical
metric_name = "total_variation_distance"
distance = total_variation_distance(
donor_values, receiver_values
)
# Use KL Divergence for categorical
metric_name = "kl_divergence"
distance = kl_divergence(donor_values, receiver_values)
log.debug(
f"TVD for categorical variable '{var}': {distance:.6f}"
f"KL divergence for categorical variable '{var}': {distance:.6f}"
)
else:
# Use Wasserstein Distance for numerical
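Given the exports added to `microimpute/__init__.py` and `microimpute/comparisons/__init__.py` above, the new `kl_divergence` function can be called directly. A quick usage sketch follows; the toy arrays are illustrative only and are not taken from the package's tests.

```python
import numpy as np
from microimpute import kl_divergence

# Donor data is the reference distribution P; receiver (imputed) data is the approximation Q.
donor = np.array(["owner", "owner", "renter", "renter", "renter"])
receiver = np.array(["owner", "renter", "renter", "renter", "renter"])

divergence = kl_divergence(donor, receiver)
print(divergence)  # 0.0 only if the category frequencies match exactly; larger means more divergence
```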
35 changes: 30 additions & 5 deletions microimpute/models/imputer.py
@@ -104,15 +104,22 @@ def _validate_data(self, data: pd.DataFrame, columns: List[str]) -> None:
)

def identify_target_types(
self, data: pd.DataFrame, imputed_variables: List[str]
self,
data: pd.DataFrame,
imputed_variables: List[str],
not_numeric_categorical: Optional[List[str]] = None,
) -> None:
"""Identify and track variable types for imputation targets.

Args:
data: DataFrame containing the data.
imputed_variables: List of variables to be imputed.
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.
"""
detector = VariableTypeDetector()
not_numeric_categorical = not_numeric_categorical or []

for var in imputed_variables:
if var not in data.columns:
@@ -133,7 +140,10 @@ def identify_target_types(
continue

var_type, categories = detector.categorize_variable(
data[var], var, self.logger
data[var],
var,
self.logger,
force_numeric=(var in not_numeric_categorical),
)

if var_type == "bool":
@@ -163,6 +173,7 @@ def preprocess_data_types(
data: pd.DataFrame,
predictors: List[str],
imputed_variables: List[str],
not_numeric_categorical: Optional[List[str]] = None,
) -> Tuple[pd.DataFrame, List[str], List[str], Dict[str, Any]]:
"""Preprocess predictors only - convert categorical predictors to dummies.
Imputation targets remain in original form for classification.
@@ -171,6 +182,9 @@
data: DataFrame containing the data.
predictors: List of predictor column names.
imputed_variables: List of variables to impute (kept in original form).
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.

Returns:
Tuple of (processed_data, updated_predictors, imputed_variables, empty_dict)
@@ -182,7 +196,10 @@
processor = DummyVariableProcessor(self.logger)
processed_data, updated_predictors = (
processor.preprocess_predictors(
data, predictors, imputed_variables
data,
predictors,
imputed_variables,
not_numeric_categorical,
)
)

@@ -204,6 +221,7 @@ def fit(
imputed_variables: List[str],
weight_col: Optional[Union[str, np.ndarray, pd.Series]] = None,
skip_missing: bool = False,
not_numeric_categorical: Optional[List[str]] = None,
**kwargs: Any,
) -> Any: # Returns ImputerResults
"""Fit the model to the training data.
@@ -214,6 +232,9 @@
imputed_variables: List of column names to impute.
weight_col: Optional name of the column or column array/series containing sampling weights. When provided, `X_train` will be sampled with replacement using this column as selection probabilities before fitting the model.
skip_missing: If True, skip variables missing from training data with warning. If False, raise error for missing variables.
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.
**kwargs: Additional model-specific parameters.

Returns:
@@ -261,10 +282,14 @@
raise ValueError("Weights must be positive")

# Identify target types BEFORE preprocessing
self.identify_target_types(X_train, imputed_variables)
self.identify_target_types(
X_train, imputed_variables, not_numeric_categorical
)

X_train, predictors, imputed_variables, imputed_vars_dummy_info = (
self.preprocess_data_types(X_train, predictors, imputed_variables)
self.preprocess_data_types(
X_train, predictors, imputed_variables, not_numeric_categorical
)
)

if weights is not None:
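The new `not_numeric_categorical` argument threads from `fit()` through `identify_target_types()` and `preprocess_data_types()`. Below is a usage sketch; the `QRF` import path, constructor call, and column names are assumptions for illustration and are not taken from this diff.

```python
import pandas as pd

# Assumed import path for a concrete Imputer subclass; adjust to the actual package layout.
from microimpute.models import QRF

train = pd.DataFrame(
    {
        "age": [25, 40, 31, 58, 47],
        "region": ["north", "south", "south", "east", "north"],
        # Discrete numeric variable that should stay numeric instead of being
        # auto-detected as numeric_categorical.
        "num_children": [0, 2, 1, 3, 2],
    }
)

model = QRF()
fitted = model.fit(
    train,
    predictors=["age", "region"],
    imputed_variables=["num_children"],
    not_numeric_categorical=["num_children"],  # force numeric treatment of this target
)
```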