Skip to content

ENH: Add unit argument to to_datetime and to_timedelta to avoid value-dependent parsing #63270

@TomAugspurger

Description

@TomAugspurger

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Over in dask/dask#12178 (comment), we're discussing how dask should adapt to the new datetime / timedelta resolution inference.

The new behavior is value-dependent: you don't know what dtype the result will be until you run it on the values. This is a challenge for dask, which might process subsets of the data in parallel, but would like each partition of a column to have the same data type:

In [9]: values = ["1", "2", "1 day 2 hours"]

In [10]: pd.to_timedelta(s[:2])
Out[10]: 
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
dtype: timedelta64[ns]

In [11]: pd.to_timedelta(s[2:])
Out[11]: 
2   1 days 02:00:00
dtype: timedelta64[us]

The first partition (s[:2]) are inferred to be timedelta64[ns] while the second partition (s[2:]) are inferred to be timedelta64[us].

Feature Description

Add a dtype or resolution parameter to to_datetime and to_timedelta:

pd.to_datetime(
    arg: 'DatetimeScalarOrArrayConvertible | DictConvertible',
    errors: 'DateTimeErrorChoices' = 'raise',
    dayfirst: 'bool' = False,
    yearfirst: 'bool' = False,
    utc: 'bool' = False,
    format: 'str | None' = None,
    exact: 'bool | lib.NoDefault' = <no_default>,
    unit: 'str | None' = None,
    infer_datetime_format: 'lib.NoDefault | bool' = <no_default>,
    origin: 'str' = 'unix',
    cache: 'bool' = True,
    resolution: 'Resolution | None' = None,
) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
"""
...
dtype: Dtype, optional
    Controls the resolution of the result.
"""

This would ideally be implemented as pd.to_datetime(...).as_unit(resolution).

Alternative Solutions

Dask could just go on its own here and add that resolution keyword. But I suspect other workloads might benefit from knowing exactly what range they'll get out.

Additional Context

There's some complexity here in how this proposed resolution keywords: in particular unit (how you interpret numeric values) and errors (what happens if you specify a value that is out of bounds for the resolution you provide?). I'd be curious to hear if those downsides outweigh any benefits.

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNeeds TriageIssue that has not been reviewed by a pandas team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions