| 1 | +Title: TST/ENH: Raise TypeError in Series.searchsorted for incomparable object-dtype values |
| 2 | + |
| 3 | +Summary |
| 4 | +------- |
| 5 | +This small change makes `Series.searchsorted` raise a TypeError when the |
| 6 | +underlying values are a NumPy ndarray with dtype=object and the search value |
| 7 | +cannot be compared with the array's elements (for example, searching for a str |
| 8 | +in an array that starts with ints). This aligns the behavior of `searchsorted` |
| 9 | +with `sort_values` and avoids surprising cases where NumPy's `searchsorted` can |
| 10 | +return an index even though comparisons between the types would fail. |
| 11 | + |
| 12 | +Files changed |
| 13 | +------------- |
| 14 | +- pandas/core/base.py |
| 15 | + - Add a lightweight runtime comparability check for object-dtype ndarrays in |
| 16 | + `IndexOpsMixin.searchsorted`. If comparing the first non-NA array element |
| 17 | + with the search value raises TypeError, that TypeError is propagated. |
| 18 | + |
| 19 | +- pandas/tests/series/methods/test_searchsorted.py |
| 20 | + - Add `test_searchsorted_incomparable_object_raises` which asserts that |
| 21 | + `Series([1, 2, "1"]).searchsorted("1")` raises TypeError. |
| 22 | + |
| 23 | +Rationale |
| 24 | +--------- |
| 25 | +Pandas delegates `searchsorted` to NumPy for ndarray-backed data. NumPy's |
| 26 | +behavior on mixed-type object arrays can be surprising: it sometimes finds an |
| 27 | +insertion index even when Python comparisons between element types would raise |
| 28 | +TypeError (e.g. `1 < "1"`). Other pandas operations (like `sort_values`) raise |
| 29 | +in that situation, so this change makes `searchsorted` consistent with the |
| 30 | +rest of pandas. |
| 31 | + |
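The incomparability this section refers to can be reproduced in plain Python, without pandas or NumPy. The snippet below is only an illustration; the helper `raises_type_error` is a hypothetical name introduced here, not part of the change.

```python
def raises_type_error(func):
    """Return True if calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# Ordering comparisons between int and str are not defined in Python 3.
print(raises_type_error(lambda: 1 < "1"))              # True
# Sorting a mixed list needs those comparisons, so it fails as well.
print(raises_type_error(lambda: sorted([1, 2, "1"])))  # True
# Equality is still defined and simply evaluates to False.
print(1 == "1")                                        # False
```

Only ordering comparisons fail here; equality does not, which is why a check for this situation has to exercise `<` rather than `==`.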
| 32 | +Behavior and trade-offs |
| 33 | +----------------------- |
| 34 | +- The comparability check is deliberately lightweight: it attempts a single |
| 35 | + comparison between the first non-NA array element and the search value. |
| 36 | + If that comparison raises TypeError, the error is re-raised. |
| 37 | +- This heuristic catches the common case (mixed ints/strings) without scanning |
| 38 | + the whole array (which would be expensive). It may not detect all |
| 39 | + pathological mixed-type arrays (for example, if the first element is |
| 40 | + comparable but later ones are not). If we want a stricter rule we can |
| 41 | + instead sample more elements or check types across the array, at some |
| 42 | + performance cost. |
| 43 | + |
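The single-comparison heuristic and its blind spot can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code added to pandas/core/base.py: the names `_is_na`, `check_first_element_comparable` and `raises_type_error` are hypothetical, a list stands in for the object-dtype ndarray, and NA is assumed to mean None or a float NaN.

```python
import math


def _is_na(x):
    """Hypothetical NA test: None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def check_first_element_comparable(values, value):
    """Compare the first non-NA element with value; let TypeError propagate."""
    for elem in values:
        if _is_na(elem):
            continue
        _ = elem < value  # raises TypeError for e.g. int vs str
        return


def raises_type_error(values, value):
    """Return True if the sketch check raises TypeError."""
    try:
        check_first_element_comparable(values, value)
    except TypeError:
        return True
    return False


# Common case: the first element is an int, the search value is a str.
print(raises_type_error([1, 2, "1"], "1"))   # True
# Leading NA values are skipped before comparing.
print(raises_type_error([None, 1, 2], "1"))  # True
# Limitation: the first element is comparable with 0, so "1" goes unnoticed.
print(raises_type_error([1, 2, "1"], 0))     # False
```

The last call shows the trade-off described above: the check stops after one comparison, so a later incomparable element is not detected.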
| 44 | +Testing |
| 45 | +------- |
| 46 | +- New test added (see above). To run locally: |
| 47 | + |
| 48 | + # install in editable mode if importing from source |
| 49 | + python -m pip install -ve . |
| 50 | + |
| 51 | + # run the single test |
| 52 | + pytest -q pandas/tests/series/methods/test_searchsorted.py::test_searchsorted_incomparable_object_raises |
| 53 | + |
| 54 | +Compatibility |
| 55 | +------------- |
| 56 | +- Backwards compatible for numeric/datetime/etc. arrays: behavior unchanged. |
| 57 | +- For object-dtype arrays whose first non-NA element is not comparable with the |
| 58 | + search value, `searchsorted` now raises TypeError where previously NumPy might |
| 59 | + have silently returned an index. This is intentional, for consistency with sorting. |
| 60 | + |
| 61 | +Follow-ups |
| 62 | +---------- |
| 63 | +- If desired, we can strengthen the comparability check (sample multiple |
| 64 | + elements or inspect the set of Python types) and add tests for those |
| 65 | + conditions. |
| 66 | + |
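One possible shape for the stricter check mentioned above is to compare one representative element of each distinct Python type with the search value. The sketch below is an assumption about how that could look, not a proposed implementation: `_is_na`, `check_each_type_comparable` and `raises_type_error` are hypothetical names, and a list stands in for the object-dtype ndarray.

```python
import math


def _is_na(x):
    """Hypothetical NA test: None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def check_each_type_comparable(values, value):
    """Compare one representative element of each distinct type with value."""
    seen_types = set()
    for elem in values:
        if _is_na(elem) or type(elem) in seen_types:
            continue
        seen_types.add(type(elem))
        _ = elem < value  # TypeError propagates to the caller


def raises_type_error(values, value):
    """Return True if the stricter sketch check raises TypeError."""
    try:
        check_each_type_comparable(values, value)
    except TypeError:
        return True
    return False


# The trailing str is now detected even though the first element is an int.
print(raises_type_error([1, 2, "1"], 0))     # True
# int and float compare fine with an int search value; None is skipped.
print(raises_type_error([1, 2.5, None], 2))  # False
```

This variant walks the whole array, so it costs O(n) per call, which is the performance trade-off already noted for stricter rules.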
| 67 | +PR checklist |
| 68 | +------------ |
| 69 | +- [ ] Add release note if desired (small change to searchsorted semantics) |
| 70 | +- [ ] Add/adjust tests for stronger heuristics if implemented |