Skip to content

Conversation

@sondracek
Copy link
Member

@sondracek sondracek commented May 15, 2023

It would be very useful to allow filename filtering in auto column extraction (AutoParser) as well. One of the issues is that there is no simple way how to tell AutoParser we are at the final level (filename itself). As you can see in the second test, when mixing "proper" partitioning (with =) with other structure it will obviously fail.

This PR is meant more for a discussion then something that should be immediatelly merged.

@sondracek sondracek requested a review from vtuma May 16, 2023 07:12
tmi
tmi previously approved these changes May 16, 2023
Copy link
Collaborator

@tmi tmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written across the comments -- let's just perhaps change the __init__ interface and the broken->mixed, but otherwise I like this, let's get it merged.

nowadays, I mostly use the IdentityReader with FileInPartition for my usecases, and I extract the filename when I need it by a different hack, file_url.rsplit('/', 1)[1]. With this merged, I could replace it by the partition_values['fname'], which does look cleaner :)

df3.to_json(partition2 / "a1.json", orient="records", lines=True)

parser = AutoParser.from_str("col=b/fname", parse_filenames_as="fname")
case2_result = read_partitioned_table(f"file://{tmp_path}/", Q_EQ("fname", "a1.json"), parser)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the last commit. Does it make sense to support this as well? It would allow anyone to combine filtering in the string (using from_str method) with filename queries. The only thing I do not like is the extended is_terminal_level method.

Copy link
Contributor

@vojtechtuma-imi vojtechtuma-imi May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good catch -- consistency across interfaces is important, so we should support the generation/from_str. It does not look well one has to supply the parse_filenames_as argument -- I think we should derive it from that str as well -- it is there, and without the =. But that may be a risky business (I think I may have realized some time ago it is not possible due to some corner case).

I don't like the is_terminal_level either -- as well as the whole parser interface, as evidenced by the number of type-ignores and now also kwargs there. I should be able to refactor it one day -- so all that matters now is that outward interface is ~ok. If you want to give a try to the from_str without the extra parse_filename_as but still supporting it, that would be cool. But I'm fine with the current redundant state just to be safe, and simplifying later (we would then make the param Optional, and if specified, prefer it, to preserve compatibility)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you implement it? When you have sth like parser = AutoParser.from_str("col=b/fname"), does fname stands for filename or does it mean folder with an implicit * value like in this example? I guess that's what you are reffering to as "a risky business". You would basically need to auto-parse something like parser = AutoParser.from_str("col=b/colX/fname"). We could do sth like global filename partition name (some smarter version of FILENAME_PARTITION = "fname") and if present in path or allowed, add it. Otherwise I do not see a simple way how to distinguish it (we would probably need to do it on the fly, but that would probably require some changes in the partition discovery).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, thats exactly one of the pitfalls. The global filename does not look good to me, instead, we could eg:

  • add a second param -- AutoParser.from_str("col1=b/col2", "my_fname") (and you could eg "my_fname=[f1.parquet, f2.parquet]"). We would do it also for FixedParser, and yell deprecated when the original behaviour would be relied upon
  • invalidate this example -- if ends with /, then its a directory, otherwise its a filename. Apply in both Auto and Fixed parsers. Backwards compatibility requires we make from_str_v2 instead

in either approach, the next major version would simplify code and remove the deprecated part / extraneous code. I'm leaning towards the first one, it is less hacky, less error-prone in some way (trailing / is a menace), albeit more verbose and more error-prone in another way (you do not realise you should have used it and still try to squeeze fname to the single string).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for a delay 😇

I am still unsure about the right solution. I would like to have a consistent interface for both parsers, but I still think it would need a major refactoring (and it's even more confusing as there is the DateRangeGenerator "parser" which shares the base column parser as well). I definitely do not want to go with a / (as you wrote, trailing / is a menace). It seems good to me to separate path description and filename (which is basically needed for AutoParser to correctly parse and query filenames as well). But the current solution is still not an ideal imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and ofc if you see any simpler way how to do it, let me know. I am always trying to do minimum changes and I am afraid there is no simple solution that would not require updating from_str and __init__ interaface as we need to support both, the classmethod from path and __call__ method that is called on parser instances (and we have to tell somehow that fname parsing is enabled).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept thinking about it, and I believe now that there is exactly one way out -- a genuine grammar. I've recently did a grammar to translate sql queries to pandas sql expressions in https://github.com/erikrose/parsimonious/tree/master and it turned out quite good. It will bring us here a few advantages -- in particular, separating-decoupling between users specifying what they want (from_str, generating partitions, ...) and how we interpret it (auto parses / fixed columns parser). As well as independent testability and better catching of bad inputs and providing expl messages

that will be a rather big refactoring, with incompatible interfaces. We'll essentially have to create ColumnParserV2(ABC), with a different from_str, and accept either union or a protocol inside the impl. That makes me fully ok with your

I am afraid there is no simple solution

comment, which I agree with. Let's merge this, in particular the tests -- those will be the most useful thing during refactoring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, was the grammer you did something proprietary or is it shared somewhere? It really sounds it could help us here a lot, but as you wrote it would be a big refactoring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is proprietary atm, it is full of business-specific tweaks and not general enough. And I may not even attempt to make it publishable -- I'll first need to check whether https://github.com/tobymao/sqlglot doesn't always do the same thing but better... Either way, I'd have to write a wholly new grammar here. Ideally, it would also cover types to some extent (and you would not have to distinguish whether comparisons should be numerical or string), etc etc...

@classmethod
def from_str(cls, path_description: str):
"""Example: col1/col2=v1/col3=[v4,v5,v6]/colFname"""
def from_str(cls, path_description: str, fname: str | None = None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fname - not an ideal parameter name (it can contain fname query as well), any suggestion?

stacklevel=3,
)
path_description, fname = path_description.rsplit("/", 1)
return super().from_str(path_description, fname)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep backward compatibility and have a consitent interface. But I am not sure about the future. Do we want to deprecated the original behaviour (it was required for FixedColumnsParser to contain fname as well)? I can imagine both would stay but it's better to have almost always only a single way how to do it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with single ways being preferred -- per my comment about grammars, I don't think we even need to single out this as deprecated here, since ideally this whole mechanism would get deprecated...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change this from FutureWarning to just log a warning, which is kind of similar but not the same (I think it's good to try to force users to prefer one single way over the other). If we decide to implement the grammer in the future, I would probably prefer to deprecate whole from_str method (with a warning that users should pin their versions to <=1.0), rather then introducing V2 versions, but that something we can discuss later.

Any suggestion for fname? Do you think it's ok to keep it? I would rather use something different, but didn't come up with a good name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future warning is good

deprecating from_str and introducing from_grammar or smth like that is totally fine with me, thats kinda what I meant by v2 (I meant v2 inside some classes and functions, not on the module level... that is a pain)

(for fname I responded in another thread)

"""Creates a parser by processing partitions from the given path.
Example of path description: "col1/col2=v1/col3=[v4,v5,v6]". If filename should be parsed as well,
provide either the name of the partition (like `fname="fname"`) or as query (`fname="myfile=f1.csv"`).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually got me confused a bit. In the first example, you'll end up with column fname in the result, holding the fname. In the second case, you'll end up with myfile column in the result, and it will only contain f1.csv values. Maybe expanding this in this sense would help understand it? I know that in the ColumnParser it feels weird to talk about "columns in the result", but its a docstring so we can perhaps afford to be a bit end-to-end

regarding the name of the variable -- maybe fname_col_name, or fname_as? The point is that both of them are confusing enough on a first read that it forces the reader to check the docstring, which is what we want. And on subsequent reads, it is "ah ok that was this confusing thing, I know". We could even name it "tom_bombadil" to reach the same effect, but fname_col_name or fname_as are perhaps more intuitive (and I want to use tom_bombadil in the grammar anyway)

Copy link
Collaborator

@tmi tmi Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm but the filter is not captured in that name, good point... so fname_query? That is legit even if you don't include the filter -- I essentially view this as select fname as fname_query.split(=)[0] where fname = fname_query.split(=)[1].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants