Skip to content

Conversation

@vladborovtsov
Copy link

@vladborovtsov vladborovtsov commented Sep 27, 2025

Rationale for this change

Add an optional default_column_type parameter to the CSV reading API (C++ and Python) to provide a fallback type when per-column types aren’t specified, improving schema consistency and complementing the existing column_types logic.

What changes are included in this PR?

Are these changes tested?

Yes. Existing and new tests are passing.

C++:

> [==========] Running 3 tests from 1 test suite.
> [----------] Global test environment set-up.
> [----------] 3 tests from ReaderTests
> [ RUN      ] ReaderTests.DefaultColumnTypePartialDefault
> [       OK ] ReaderTests.DefaultColumnTypePartialDefault (3 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsWithHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsWithHeader (0 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsNoHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsNoHeader (0 ms)
> [----------] 3 tests from ReaderTests (4 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 3 tests from 1 test suite ran. (4 ms total)
> [  PASSED  ] 3 tests.

All:

> [==========] 264 tests from 46 test suites ran. (452 ms total)
> [  PASSED  ] 264 tests.

pyarrow:
New tests are passing.

Are there any user-facing changes?

I believe this change is backward compatible. Parameter is optional and its default value doesn't change the existing behavior; All the existing rests are passing.

Maybe relevant: #22232

Relates to #47502

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

1 similar comment
@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@vladborovtsov
Copy link
Author

@github-actions crossbow submit preview-docs

@github-actions
Copy link

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/18062577036

@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@vladborovtsov vladborovtsov marked this pull request as ready for review September 27, 2025 19:26
@kou kou changed the title GH-47502: Introduce optional default_column_type parameter GH-47502: [C++] Introduce optional default_column_type parameter Oct 13, 2025
@github-actions
Copy link

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Copy link
Member

AlenkaF commented Oct 24, 2025

Thank you @vladborovtsov for the contribution.
I will add info about the proposed solution in the original issue (#22232) so I can see opinions from C++ devs on the proposed solution.

@AlenkaF AlenkaF changed the title GH-47502: [C++] Introduce optional default_column_type parameter GH-22232: [C++][Python] Introduce optional default_column_type parameter Oct 24, 2025
@github-actions
Copy link

⚠️ GitHub issue #22232 has been automatically assigned in GitHub to PR creator.

@vladborovtsov
Copy link
Author

Hi @AlenkaF
I'm happy to continue the labour and discussion to get that merged.
As for AI, it wasn't used much, although I tried :) With such huge codebase the generation quality is quite low.

@AlenkaF
Copy link
Member

AlenkaF commented Oct 24, 2025

Happy to see a response!
All good, it is totally OK to use gen AI wisely ;)

I will wait for an opinion from a C++ dev and in the meantime try to look at the Python part.

@vladborovtsov
Copy link
Author

Hi @AlenkaF
Any feedback?

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python part looks good to me. I only added a comment for the docstring examples.

@rok @raulcd do you have any opinions on the C++ approach to handle this use case?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 12, 2025
Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only read the c++ part. I think the way option is defined and checked is good.

A nit: I'd move c++ tests into arrow/cpp/src/arrow/csv/column_builder_test.cc and use available machinery there. See how CheckInferred is used here:

TEST_F(InferringColumnBuilderTest, MultipleChunkTimestamp) {
auto options = ConvertOptions::Defaults();
auto tg = TaskGroup::MakeSerial();
std::shared_ptr<ChunkedArray> expected;
ChunkedArrayFromVector<TimestampType>(timestamp(TimeUnit::SECOND),
{{false}, {true}, {true}},
{{0}, {0}, {1542129070}}, &expected);
CheckInferred(tg, {{""}, {"1970-01-01"}, {"2018-11-13 17:11:10"}}, options, expected);
options.timestamp_parsers.push_back(TimestampParser::MakeStrptime("%Y/%m/%d"));
tg = TaskGroup::MakeSerial();
ChunkedArrayFromVector<TimestampType>(timestamp(TimeUnit::SECOND),
{{false}, {true}, {true}},
{{0}, {0}, {1542067200}}, &expected);
CheckInferred(tg, {{""}, {"1970/01/01"}, {"2018/11/13"}}, options, expected);
}

Edit: column_builder_test.cc is checking behavior only on single columns, perhaps keep a couple multi-column tests in the current location to test for that.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 12, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 18, 2025
@vladborovtsov
Copy link
Author

There is not too much to add into column_builder_test.cc: the introduced default_column_type affects only reader. And reader using typed columns when default specified, for details see cpp/src/arrow/csv/reader.cc.

I’ve added a guard test to ensure that my changes don’t override the type inference logic. Not really sure if it it makes sense to have them there.

As for the reader tests, I’ve tried to keep them representative while minimizing the amount of lines with recent changes.

@vladborovtsov vladborovtsov requested a review from rok December 21, 2025 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants