API: to_datetime(ints, unit) give requested unit #63347
Changes from all commits: 73c582c, 0eda6d9, 916da40, f9fd868, a5d7e51, a7955da, b62cc4f
```diff
@@ -1312,7 +1312,11 @@ def _try_convert_to_date(self, data: Series) -> Series:
         date_units = (self.date_unit,) if self.date_unit else self._STAMP_UNITS
         for date_unit in date_units:
             try:
-                return to_datetime(new_data, errors="raise", unit=date_unit)
+                # Without this as_unit cast, we would fail to overflow
+                # and get much-too-large dates
+                return to_datetime(new_data, errors="raise", unit=date_unit).dt.as_unit(
+                    "ns"
+                )
             except (ValueError, OverflowError, TypeError):
                 continue
         return data
```

**Comment on lines +1315 to +1318**

**Member:** Something like this: (suggested change) …

**Member:** I am not directly understanding that comment.

**Member (Author):** This is inside a block that tries large units and, if they overflow, then tries smaller units. This PR makes the large units not overflow in cases where this piece of code expects them to. Without this edit, e.g. …

**Member (Author):** Is my previous comment clear? If so, any suggestions for how to adapt it into a clearer code comment?

**Member:** OK, that was the context I was missing. But then I still don't entirely get how this was working until now. The dates you show above, like '2000-01-03', fit in the range of all units, so how would the integer value for that date ever overflow? If I put a breakpoint specifically for the overflow on the line below and run the full test_pandas.py file, I don't get a catch.

**Member:** Fetched the branch to play a bit with the tests: I was misled by the OverflowError, because it is actually OutOfBoundsDatetime that is being raised when trying to cast to nanoseconds. So essentially this "infer unit" code assumes that the integer value came from a timestamp that originally had a nanosecond resolution (or at least one that should fit in a nanosecond resolution)? Which makes sense from the time we only supported ns. We could also check this with a manual bounds check instead of the cast (I don't know if we have an existing helper function for that). So we could keep the logic of inferring the date_unit, but then keep the returned data in that unit, instead of forcing it to nanoseconds.

**Member:** (Also, for the case where the user specifies the unit, so we don't have to infer, we actually don't need to force-cast to nanoseconds / check bounds, because that restriction is then not needed.)
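To see why the added `.dt.as_unit("ns")` matters, here is a rough standalone sketch of the unit-inference loop (not the actual pandas internals; the epoch values are made up for illustration). An integer that was serialized in nanoseconds must *fail* for the larger units so the loop can fall through to `"ns"`:

```python
import pandas as pd

secs = 946_857_600            # 2000-01-03 00:00:00 as an epoch in seconds
nanos = secs * 1_000_000_000  # the same instant as an epoch in nanoseconds

# Try the largest unit first and fall back to smaller ones, mirroring the
# inference in _try_convert_to_date.  The cast to "ns" makes a wrongly
# guessed large unit overflow (OutOfBoundsDatetime, a ValueError subclass)
# instead of silently producing a much-too-large date.
for unit in ("s", "ms", "us", "ns"):
    try:
        parsed = pd.to_datetime([nanos], unit=unit).as_unit("ns")
        break
    except (ValueError, OverflowError, TypeError):
        continue

print(parsed[0])  # 2000-01-03 00:00:00
```

Without the cast, a build where `to_datetime(ints, unit="s")` keeps second resolution would accept the nanosecond payload on the first iteration and return a date billions of years in the future, which is exactly the failure mode the code comment in the diff describes.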
```diff
@@ -955,7 +955,7 @@ def test_date_format_frame_raises(self, datetime_frame):
         ],
     )
     def test_date_format_series(self, date, date_unit, datetime_series):
-        ts = Series(Timestamp(date).as_unit("ns"), index=datetime_series.index)
+        ts = Series(Timestamp(date), index=datetime_series.index)
         ts.iloc[1] = pd.NaT
         ts.iloc[5] = pd.NaT
         if date_unit:
@@ -964,7 +964,7 @@ def test_date_format_series(self, date, date_unit, datetime_series):
             json = ts.to_json(date_format="iso")

         result = read_json(StringIO(json), typ="series")
-        expected = ts.copy()
+        expected = ts.copy().dt.as_unit("ns")
         tm.assert_series_equal(result, expected)

     def test_date_format_series_raises(self, datetime_series):
```

**Member:** Not for this PR, but this is another case where we currently return ns unit and could change to use us by default?

**Member (Author):** Sure, but I'm inclined to leave that to Will to decide.

**Member:** In general I think we should use the same default of microseconds whenever we infer / parse strings, and for IO formats that means whenever they don't store any type / unit information (in contrast to e.g. Parquet). We already do that for csv, html, excel, etc., so I don't think there is a reason not to do that for JSON. I opened #63442 to track that.
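The test change above turns on the fact that JSON stores no resolution metadata, so `read_json` has to pick a unit itself (historically nanoseconds) when parsing dates back. A small round-trip sketch, with an invented date and index, that compares instants rather than dtypes so it does not depend on which unit the reader picks:

```python
from io import StringIO
import pandas as pd

ts = pd.Series(pd.Timestamp("2001-05-17"), index=range(3))
js = ts.to_json(date_format="iso")
result = pd.read_json(StringIO(js), typ="series")

# Depending on the pandas version the reader may leave the ISO strings
# unconverted; normalize to datetimes before comparing.
if result.dtype == object:
    result = pd.to_datetime(result)

# Same instants, regardless of which resolution the reader chose
assert list(result.dt.as_unit("ns")) == list(ts.dt.as_unit("ns"))
```

This is why the test pins `expected` with `.dt.as_unit("ns")`: the serialized `ts` may carry a coarser unit, while the value read back is in the reader's default unit, and `tm.assert_series_equal` compares dtypes strictly.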
```diff
@@ -502,7 +502,7 @@ def test_groupby_resample_empty_sum_string(
         result = gbrs.sum(min_count=min_count)

         index = pd.MultiIndex(
-            levels=[[1, 2, 3], [pd.to_datetime("2000-01-01", unit="ns")]],
+            levels=[[1, 2, 3], [pd.to_datetime("2000-01-01", unit="ns").as_unit("ns")]],
             codes=[[0, 1, 2], [0, 0, 0]],
             names=["A", None],
         )
```

**Member:** Did this PR change that? (That this no longer returns nanoseconds.)

**Member (Author):** Yes, but I didn't realize it until I just checked. I thought this PR only affected integer cases. I also didn't think that on main the unit keyword would have any effect in this case. So there are at least two things I need to look into.

**Member (Author):** OK, I think I've figured this out. By passing …
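The reason the test pins the level back with `.as_unit("ns")` is that two datetime containers can hold the same instants at different resolutions, and strict test helpers compare dtypes as well as values. A small illustration, assuming pandas >= 2.0 where non-nanosecond units exist:

```python
import pandas as pd

# The same instant held at two different resolutions
idx_s = pd.DatetimeIndex([pd.Timestamp("2000-01-01")]).as_unit("s")
idx_ns = idx_s.as_unit("ns")

print(idx_s.dtype, idx_ns.dtype)  # datetime64[s] datetime64[ns]

# Equal as points in time, but dtype checks distinguish them, which is
# why strict tm.assert_* comparisons need both sides on the same unit.
assert (idx_s == idx_ns).all()
assert idx_s.dtype != idx_ns.dtype
```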