6 changes: 6 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,6 @@
- bump: patch
  changes:
    fixed:
      - Fixed QRF to correctly encode categorical imputed variables that become predictors.
      - Added `not_numeric_categorical` parameter to control whether discrete numeric variables are treated as categorical.
      - Replaced Total Variation Distance with KL divergence for categorical distribution comparisons.
19 changes: 17 additions & 2 deletions docs/imputation-benchmarking/benchmarking-methods.ipynb
@@ -3301,16 +3301,31 @@
"\n",
"### Distribution similarity metrics\n",
"\n",
"The `compare_distributions()` function evaluates how well the imputed values preserve the distributional characteristics of the original data. It uses the Wasserstein distance (also known as Earth Mover's Distance) to quantify the difference between the distribution of imputed values and the true distribution.\n",
"The `compare_distributions()` function evaluates how well the imputed values preserve the distributional characteristics of the original data. It automatically selects the appropriate metric based on the variable type: Wasserstein distance for continuous numerical variables and Kullback-Leibler (KL) divergence for discrete categorical and boolean variables.\n",
"\n",
"The Wasserstein distance between two probability distributions $P$ and $Q$ is defined as:\n",
"#### Wasserstein distance for numerical variables\n",
"\n",
"For continuous numerical variables, the framework uses the Wasserstein distance (also known as Earth Mover's Distance) to quantify the difference between distributions. The Wasserstein distance between two probability distributions $P$ and $Q$ is defined as:\n",
"\n",
"$$W_p(P, Q) = \\left(\\inf_{\\gamma \\in \\Pi(P, Q)} \\int_{X \\times Y} d(x, y)^p d\\gamma(x, y)\\right)^{1/p}$$\n",
"\n",
"where $\\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively.\n",
"\n",
"The Wasserstein distance measures the minimum \"work\" required to transform one distribution into another, where work is defined as the amount of distribution mass moved times the distance it's moved. Lower values indicate better preservation of the original distribution's shape. In the SCF example, QRF shows the lowest Wasserstein distance (1.2e7), indicating it best preserves the distribution of net worth values, while QuantReg shows the highest distance (2.8e7), suggesting greater distributional distortion.\n",
"\n",
"#### Kullback-Leibler divergence for categorical and boolean variables\n",
"\n",
"For discrete distributions (categorical and boolean variables), the framework employs KL divergence, an information-theoretic measure that quantifies how one probability distribution diverges from a reference distribution. The KL divergence from distribution $Q$ to distribution $P$ is defined as:\n",
"\n",
"$$D_{KL}(P||Q) = \\sum_{x \\in \\mathcal{X}} P(x) \\log\\left(\\frac{P(x)}{Q(x)}\\right)$$\n",
"\n",
"where:\n",
"- $P$ is the reference distribution (original data)\n",
"- $Q$ is the approximation (imputed data)\n",
"- $\\mathcal{X}$ is the set of all possible categorical values\n",
"\n",
"In the context of imputation evaluation, KL divergence measures how much information is lost when using the imputed distribution $Q$ to approximate the true distribution $P$. Lower KL divergence values indicate better preservation of the original categorical distribution.\n",
"\n",
"## Predictor analysis and sensitivity evaluation\n",
"\n",
"Beyond comparing imputation methods, understanding the relationship between predictors and target variables, as well as the sensitivity of imputation quality to predictor selection, provides crucial insights for model optimization and feature engineering.\n",
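To make the two metrics concrete, here is a minimal sketch (illustrative only, not part of this changeset) that computes the Wasserstein distance on a continuous toy sample and the KL divergence on a categorical toy sample, mirroring the metric selection described in the notebook text above; all names and values are made up.

```python
import numpy as np
import pandas as pd
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance

# Continuous toy samples: Wasserstein distance compares the two empirical distributions.
donor_income = np.array([30_000, 45_000, 52_000, 61_000, 75_000], dtype=float)
imputed_income = np.array([32_000, 44_000, 50_000, 65_000, 80_000], dtype=float)
print(wasserstein_distance(donor_income, imputed_income))  # lower is better

# Categorical toy samples: KL divergence compares the category frequencies.
donor_region = pd.Series(["north", "north", "south", "east"])
imputed_region = pd.Series(["north", "south", "south", "east"])
categories = sorted(set(donor_region) | set(imputed_region))
p = np.array([(donor_region == c).mean() for c in categories])  # reference P (donor)
q = np.array([(imputed_region == c).mean() for c in categories])  # approximation Q (imputed)
q = np.maximum(q, 1e-10)  # floor Q to avoid division by zero, as in the PR's implementation
print(rel_entr(p, q).sum())  # 0 only when the two category distributions match exactly
```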
3 changes: 3 additions & 0 deletions microimpute/__init__.py
@@ -27,11 +27,14 @@

# Import comparison and metric utilities
from microimpute.comparisons.metrics import (
compare_distributions,
compare_metrics,
compute_loss,
get_metric_for_variable_type,
kl_divergence,
log_loss,
quantile_loss,
wasserstein_distance,
)

# Import validation utilities
2 changes: 2 additions & 0 deletions microimpute/comparisons/__init__.py
@@ -24,8 +24,10 @@
compare_metrics,
compute_loss,
get_metric_for_variable_type,
kl_divergence,
log_loss,
quantile_loss,
wasserstein_distance,
)

# Import validation utilities
68 changes: 42 additions & 26 deletions microimpute/comparisons/metrics.py
@@ -3,7 +3,7 @@
This module contains utilities for evaluating imputation quality using various metrics:
- Quantile loss for numerical variables
- Log loss for categorical variables
- Distributional similarity metrics (Wasserstein distance, Total Variation Distance)
- Distributional similarity metrics (Wasserstein distance, KL Divergence)
The module automatically detects which metric to use based on variable type.
"""

@@ -13,6 +13,7 @@
import numpy as np
import pandas as pd
from pydantic import validate_call
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance
from sklearn.metrics import log_loss as sklearn_log_loss

@@ -495,25 +496,35 @@ def compare_metrics(
raise RuntimeError(f"Failed to compare metrics: {str(e)}") from e


def total_variation_distance(
def kl_divergence(
donor_values: np.ndarray, receiver_values: np.ndarray
) -> float:
"""Calculate Total Variation Distance between two categorical distributions.
"""Calculate Kullback-Leibler (KL) Divergence between two categorical distributions.

Total Variation Distance (TVD) measures the maximum difference between
two probability distributions. For categorical variables, it is calculated as:
TVD = 0.5 * sum(|P(x) - Q(x)|) for all categories x
KL divergence measures the difference between two probability distributions.
For categorical variables, it is calculated as:
KL(P||Q) = sum(P(x) * log(P(x) / Q(x))) for all categories x

This implementation uses the donor distribution as P (reference) and
receiver distribution as Q (approximation), measuring how well the
receiver distribution approximates the donor distribution.

Args:
donor_values: Array of categorical values from donor data.
receiver_values: Array of categorical values from receiver data.
donor_values: Array of categorical values from donor data (reference distribution P).
receiver_values: Array of categorical values from receiver data (approximation Q).

Returns:
Total variation distance value between 0 and 1, where 0 indicates
identical distributions and 1 indicates completely disjoint distributions.
KL divergence value >= 0, where 0 indicates identical distributions
and larger values indicate greater divergence. Note: KL divergence is
unbounded and can be infinite if Q(x) = 0 for some x where P(x) > 0.

Raises:
ValueError: If inputs are empty or invalid.

Note:
- KL divergence is not symmetric: KL(P||Q) != KL(Q||P)
- To handle zero probabilities, receiver probabilities are floored at a small epsilon to avoid division by zero
- Uses scipy.special.rel_entr for numerical stability
"""
if len(donor_values) == 0 or len(receiver_values) == 0:
raise ValueError(
@@ -529,15 +540,22 @@ def total_variation_distance(
donor_counts = pd.Series(donor_values).value_counts(normalize=True)
receiver_counts = pd.Series(receiver_values).value_counts(normalize=True)

# Calculate TVD
tvd = 0.0
for category in all_categories:
p_donor = donor_counts.get(category, 0.0)
p_receiver = receiver_counts.get(category, 0.0)
tvd += abs(p_donor - p_receiver)
# Create probability arrays for all categories
p_donor = np.array([donor_counts.get(cat, 0.0) for cat in all_categories])
q_receiver = np.array(
[receiver_counts.get(cat, 0.0) for cat in all_categories]
)

# TVD is half the sum of absolute differences
return tvd / 2.0
# Floor receiver probabilities at a small epsilon to avoid division by zero
epsilon = 1e-10
q_receiver = np.maximum(q_receiver, epsilon)

# Calculate KL divergence using scipy.special.rel_entr
# rel_entr(p, q) computes p * log(p/q) element-wise
kl_values = rel_entr(p_donor, q_receiver)

# Sum over all categories to get total KL divergence
return np.sum(kl_values)


@validate_call(config=VALIDATE_CONFIG)
@@ -550,7 +568,7 @@ def compare_distributions(

Evaluates distributional similarity using appropriate metrics:
- Wasserstein Distance for numerical variables
- Total Variation Distance for categorical variables
- KL Divergence for categorical variables

Args:
donor_data: DataFrame containing original donor data.
@@ -575,7 +593,7 @@
>>> print(result)
Variable Metric Distance
0 income wasserstein_distance 66.666667
1 region total_variation_distance 0.166667
1 region kl_divergence 0.166667
"""
try:
log.info(
@@ -613,13 +631,11 @@

# Choose appropriate metric
if var_type in ["bool", "categorical", "numeric_categorical"]:
# Use Total Variation Distance for categorical
metric_name = "total_variation_distance"
distance = total_variation_distance(
donor_values, receiver_values
)
# Use KL Divergence for categorical
metric_name = "kl_divergence"
distance = kl_divergence(donor_values, receiver_values)
log.debug(
f"TVD for categorical variable '{var}': {distance:.6f}"
f"KL divergence for categorical variable '{var}': {distance:.6f}"
)
else:
# Use Wasserstein Distance for numerical
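Given the exports added to `microimpute/__init__.py` and `microimpute/comparisons/__init__.py` above, the new `kl_divergence` function can be called directly. A quick usage sketch follows; the toy arrays are illustrative only and are not taken from the package's tests.

```python
import numpy as np
from microimpute import kl_divergence

# Donor data is the reference distribution P; receiver (imputed) data is the approximation Q.
donor = np.array(["owner", "owner", "renter", "renter", "renter"])
receiver = np.array(["owner", "renter", "renter", "renter", "renter"])

divergence = kl_divergence(donor, receiver)
print(divergence)  # 0.0 only if the category frequencies match exactly; larger means more divergence
```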
35 changes: 30 additions & 5 deletions microimpute/models/imputer.py
@@ -104,15 +104,22 @@ def _validate_data(self, data: pd.DataFrame, columns: List[str]) -> None:
)

def identify_target_types(
self, data: pd.DataFrame, imputed_variables: List[str]
self,
data: pd.DataFrame,
imputed_variables: List[str],
not_numeric_categorical: Optional[List[str]] = None,
) -> None:
"""Identify and track variable types for imputation targets.

Args:
data: DataFrame containing the data.
imputed_variables: List of variables to be imputed.
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.
"""
detector = VariableTypeDetector()
not_numeric_categorical = not_numeric_categorical or []

for var in imputed_variables:
if var not in data.columns:
@@ -133,7 +140,10 @@ def identify_target_types(
continue

var_type, categories = detector.categorize_variable(
data[var], var, self.logger
data[var],
var,
self.logger,
force_numeric=(var in not_numeric_categorical),
)

if var_type == "bool":
@@ -163,6 +173,7 @@ def preprocess_data_types(
data: pd.DataFrame,
predictors: List[str],
imputed_variables: List[str],
not_numeric_categorical: Optional[List[str]] = None,
) -> Tuple[pd.DataFrame, List[str], List[str], Dict[str, Any]]:
"""Preprocess predictors only - convert categorical predictors to dummies.
Imputation targets remain in original form for classification.
@@ -171,6 +182,9 @@
data: DataFrame containing the data.
predictors: List of predictor column names.
imputed_variables: List of variables to impute (kept in original form).
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.

Returns:
Tuple of (processed_data, updated_predictors, imputed_variables, empty_dict)
@@ -182,7 +196,10 @@
processor = DummyVariableProcessor(self.logger)
processed_data, updated_predictors = (
processor.preprocess_predictors(
data, predictors, imputed_variables
data,
predictors,
imputed_variables,
not_numeric_categorical,
)
)

@@ -204,6 +221,7 @@ def fit(
imputed_variables: List[str],
weight_col: Optional[Union[str, np.ndarray, pd.Series]] = None,
skip_missing: bool = False,
not_numeric_categorical: Optional[List[str]] = None,
**kwargs: Any,
) -> Any: # Returns ImputerResults
"""Fit the model to the training data.
@@ -214,6 +232,9 @@
imputed_variables: List of column names to impute.
weight_col: Optional name of the column or column array/series containing sampling weights. When provided, `X_train` will be sampled with replacement using this column as selection probabilities before fitting the model.
skip_missing: If True, skip variables missing from training data with warning. If False, raise error for missing variables.
not_numeric_categorical: Optional list of variable names that should
be treated as numeric even if they would normally be detected as
numeric_categorical.
**kwargs: Additional model-specific parameters.

Returns:
@@ -261,10 +282,14 @@
raise ValueError("Weights must be positive")

# Identify target types BEFORE preprocessing
self.identify_target_types(X_train, imputed_variables)
self.identify_target_types(
X_train, imputed_variables, not_numeric_categorical
)

X_train, predictors, imputed_variables, imputed_vars_dummy_info = (
self.preprocess_data_types(X_train, predictors, imputed_variables)
self.preprocess_data_types(
X_train, predictors, imputed_variables, not_numeric_categorical
)
)

if weights is not None:
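The new `not_numeric_categorical` argument threads from `fit()` through `identify_target_types()` and `preprocess_data_types()`. Below is a usage sketch; the `QRF` import path, constructor call, and column names are assumptions for illustration and are not taken from this diff.

```python
import pandas as pd

# Assumed import path for a concrete Imputer subclass; adjust to the actual package layout.
from microimpute.models import QRF

train = pd.DataFrame(
    {
        "age": [25, 40, 31, 58, 47],
        "region": ["north", "south", "south", "east", "north"],
        # Discrete numeric variable that should stay numeric instead of being
        # auto-detected as numeric_categorical.
        "num_children": [0, 2, 1, 3, 2],
    }
)

model = QRF()
fitted = model.fit(
    train,
    predictors=["age", "region"],
    imputed_variables=["num_children"],
    not_numeric_categorical=["num_children"],  # force numeric treatment of this target
)
```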