-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
Over in dask/dask#12178 (comment), we're discussing how dask should adapt to the new datetime / timedelta resolution inference.
The new behavior is value-dependent: you don't know what dtype the result will be until you run it on the values. This is a challenge for dask, which might process subsets of the data in parallel, but would like each partition of a column to have the same data type:
In [9]: values = ["1", "2", "1 day 2 hours"]
In [10]: pd.to_timedelta(s[:2])
Out[10]:
0 0 days 00:00:00.000000001
1 0 days 00:00:00.000000002
dtype: timedelta64[ns]
In [11]: pd.to_timedelta(s[2:])
Out[11]:
2 1 days 02:00:00
dtype: timedelta64[us]The first partition (s[:2]) are inferred to be timedelta64[ns] while the second partition (s[2:]) are inferred to be timedelta64[us].
Feature Description
Add a dtype or resolution parameter to to_datetime and to_timedelta:
pd.to_datetime(
arg: 'DatetimeScalarOrArrayConvertible | DictConvertible',
errors: 'DateTimeErrorChoices' = 'raise',
dayfirst: 'bool' = False,
yearfirst: 'bool' = False,
utc: 'bool' = False,
format: 'str | None' = None,
exact: 'bool | lib.NoDefault' = <no_default>,
unit: 'str | None' = None,
infer_datetime_format: 'lib.NoDefault | bool' = <no_default>,
origin: 'str' = 'unix',
cache: 'bool' = True,
resolution: 'Resolution | None' = None,
) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
"""
...
dtype: Dtype, optional
Controls the resolution of the result.
"""This would ideally be implemented as pd.to_datetime(...).as_unit(resolution).
Alternative Solutions
Dask could just go on its own here and add that resolution keyword. But I suspect other workloads might benefit from knowing exactly what range they'll get out.
Additional Context
There's some complexity here in how this proposed resolution keywords: in particular unit (how you interpret numeric values) and errors (what happens if you specify a value that is out of bounds for the resolution you provide?). I'd be curious to hear if those downsides outweigh any benefits.