Add AutoParser fname extension example #35

sondracek · 2023-05-15T09:39:42Z

It would be very useful to allow filename filtering in auto column extraction (AutoParser) as well. One of the issues is that there is no simple way how to tell AutoParser we are at the final level (filename itself). As you can see in the second test, when mixing "proper" partitioning (with =) with other structure it will obviously fail.

This PR is meant more for a discussion then something that should be immediatelly merged.

src/fsql/column_parser.py

tests/test_auto_parser.py

tmi

as written across the comments -- let's just perhaps change the __init__ interface and the broken->mixed, but otherwise I like this, let's get it merged.

nowadays, I mostly use the IdentityReader with FileInPartition for my usecases, and I extract the filename when I need it by a different hack, file_url.rsplit('/', 1)[1]. With this merged, I could replace it by the partition_values['fname'], which does look cleaner :)

sondracek · 2023-05-16T15:44:23Z

tests/test_auto_parser.py

+    df3.to_json(partition2 / "a1.json", orient="records", lines=True)
+
+    parser = AutoParser.from_str("col=b/fname", parse_filenames_as="fname")
+    case2_result = read_partitioned_table(f"file://{tmp_path}/", Q_EQ("fname", "a1.json"), parser)


Check the last commit. Does it make sense to support this as well? It would allow anyone to combine filtering in the string (using from_str method) with filename queries. The only thing I do not like is the extended is_terminal_level method.

yeah, good catch -- consistency across interfaces is important, so we should support the generation/from_str. It does not look well one has to supply the parse_filenames_as argument -- I think we should derive it from that str as well -- it is there, and without the =. But that may be a risky business (I think I may have realized some time ago it is not possible due to some corner case).

I don't like the is_terminal_level either -- as well as the whole parser interface, as evidenced by the number of type-ignores and now also kwargs there. I should be able to refactor it one day -- so all that matters now is that outward interface is ~ok. If you want to give a try to the from_str without the extra parse_filename_as but still supporting it, that would be cool. But I'm fine with the current redundant state just to be safe, and simplifying later (we would then make the param Optional, and if specified, prefer it, to preserve compatibility)

How would you implement it? When you have sth like parser = AutoParser.from_str("col=b/fname"), does fname stands for filename or does it mean folder with an implicit * value like in this example? I guess that's what you are reffering to as "a risky business". You would basically need to auto-parse something like parser = AutoParser.from_str("col=b/colX/fname"). We could do sth like global filename partition name (some smarter version of FILENAME_PARTITION = "fname") and if present in path or allowed, add it. Otherwise I do not see a simple way how to distinguish it (we would probably need to do it on the fly, but that would probably require some changes in the partition discovery).

yeah, thats exactly one of the pitfalls. The global filename does not look good to me, instead, we could eg:

add a second param -- AutoParser.from_str("col1=b/col2", "my_fname") (and you could eg "my_fname=[f1.parquet, f2.parquet]"). We would do it also for FixedParser, and yell deprecated when the original behaviour would be relied upon

invalidate this example -- if ends with /, then its a directory, otherwise its a filename. Apply in both Auto and Fixed parsers. Backwards compatibility requires we make from_str_v2 instead

in either approach, the next major version would simplify code and remove the deprecated part / extraneous code. I'm leaning towards the first one, it is less hacky, less error-prone in some way (trailing / is a menace), albeit more verbose and more error-prone in another way (you do not realise you should have used it and still try to squeeze fname to the single string).

sorry for a delay 😇

I am still unsure about the right solution. I would like to have a consistent interface for both parsers, but I still think it would need a major refactoring (and it's even more confusing as there is the DateRangeGenerator "parser" which shares the base column parser as well). I definitely do not want to go with a / (as you wrote, trailing / is a menace). It seems good to me to separate path description and filename (which is basically needed for AutoParser to correctly parse and query filenames as well). But the current solution is still not an ideal imo.

and ofc if you see any simpler way how to do it, let me know. I am always trying to do minimum changes and I am afraid there is no simple solution that would not require updating from_str and __init__ interaface as we need to support both, the classmethod from path and __call__ method that is called on parser instances (and we have to tell somehow that fname parsing is enabled).

I kept thinking about it, and I believe now that there is exactly one way out -- a genuine grammar. I've recently did a grammar to translate sql queries to pandas sql expressions in https://github.com/erikrose/parsimonious/tree/master and it turned out quite good. It will bring us here a few advantages -- in particular, separating-decoupling between users specifying what they want (from_str, generating partitions, ...) and how we interpret it (auto parses / fixed columns parser). As well as independent testability and better catching of bad inputs and providing expl messages

that will be a rather big refactoring, with incompatible interfaces. We'll essentially have to create ColumnParserV2(ABC), with a different from_str, and accept either union or a protocol inside the impl. That makes me fully ok with your

I am afraid there is no simple solution

comment, which I agree with. Let's merge this, in particular the tests -- those will be the most useful thing during refactoring.

Interesting, was the grammer you did something proprietary or is it shared somewhere? It really sounds it could help us here a lot, but as you wrote it would be a big refactoring.

that is proprietary atm, it is full of business-specific tweaks and not general enough. And I may not even attempt to make it publishable -- I'll first need to check whether https://github.com/tobymao/sqlglot doesn't always do the same thing but better... Either way, I'd have to write a wholly new grammar here. Ideally, it would also cover types to some extent (and you would not have to distinguish whether comparisons should be numerical or string), etc etc...

sondracek · 2023-08-02T11:56:06Z

src/fsql/column_parser.py

    @classmethod
-    def from_str(cls, path_description: str):
-        """Example: col1/col2=v1/col3=[v4,v5,v6]/colFname"""
+    def from_str(cls, path_description: str, fname: str | None = None):


fname - not an ideal parameter name (it can contain fname query as well), any suggestion?

sondracek · 2023-08-02T12:01:33Z

src/fsql/column_parser.py

+                stacklevel=3,
+            )
+            path_description, fname = path_description.rsplit("/", 1)
+        return super().from_str(path_description, fname)


I would like to keep backward compatibility and have a consitent interface. But I am not sure about the future. Do we want to deprecated the original behaviour (it was required for FixedColumnsParser to contain fname as well)? I can imagine both would stay but it's better to have almost always only a single way how to do it.

I do agree with single ways being preferred -- per my comment about grammars, I don't think we even need to single out this as deprecated here, since ideally this whole mechanism would get deprecated...

I change this from FutureWarning to just log a warning, which is kind of similar but not the same (I think it's good to try to force users to prefer one single way over the other). If we decide to implement the grammer in the future, I would probably prefer to deprecate whole from_str method (with a warning that users should pin their versions to <=1.0), rather then introducing V2 versions, but that something we can discuss later.

Any suggestion for fname? Do you think it's ok to keep it? I would rather use something different, but didn't come up with a good name.

future warning is good

deprecating from_str and introducing from_grammar or smth like that is totally fine with me, thats kinda what I meant by v2 (I meant v2 inside some classes and functions, not on the module level... that is a pain)

(for fname I responded in another thread)

tmi · 2023-08-08T19:01:16Z

src/fsql/column_parser.py

+        """Creates a parser by processing partitions from the given path.
+
+        Example of path description: "col1/col2=v1/col3=[v4,v5,v6]". If filename should be parsed as well,
+        provide either the name of the partition (like `fname="fname"`) or as query (`fname="myfile=f1.csv"`).


this actually got me confused a bit. In the first example, you'll end up with column fname in the result, holding the fname. In the second case, you'll end up with myfile column in the result, and it will only contain f1.csv values. Maybe expanding this in this sense would help understand it? I know that in the ColumnParser it feels weird to talk about "columns in the result", but its a docstring so we can perhaps afford to be a bit end-to-end

regarding the name of the variable -- maybe fname_col_name, or fname_as? The point is that both of them are confusing enough on a first read that it forces the reader to check the docstring, which is what we want. And on subsequent reads, it is "ah ok that was this confusing thing, I know". We could even name it "tom_bombadil" to reach the same effect, but fname_col_name or fname_as are perhaps more intuitive (and I want to use tom_bombadil in the grammar anyway)

hm but the filter is not captured in that name, good point... so fname_query? That is legit even if you don't include the filter -- I essentially view this as select fname as fname_query.split(=)[0] where fname = fname_query.split(=)[1].

Add AutoParser fname extension example

928f7e1

sondracek requested a review from vtuma May 16, 2023 07:12

tmi reviewed May 16, 2023

View reviewed changes

src/fsql/column_parser.py Outdated Show resolved Hide resolved

tmi reviewed May 16, 2023

View reviewed changes

src/fsql/column_parser.py Show resolved Hide resolved

tmi reviewed May 16, 2023

View reviewed changes

tests/test_auto_parser.py Outdated Show resolved Hide resolved

tmi previously approved these changes May 16, 2023

View reviewed changes

sondracek added 2 commits May 16, 2023 17:02

Address comments

462a7a3

Extend AutoParser to support specific usecases

0ec42fd

sondracek dismissed tmi’s stale review via 0ec42fd May 16, 2023 15:27

sondracek commented May 16, 2023

View reviewed changes

Separate fname from path

6455762

sondracek commented Aug 2, 2023

View reviewed changes

Change deprecation to log warning

c3a21b4

tmi reviewed Aug 8, 2023

View reviewed changes

Add AutoParser fname extension example #35

Are you sure you want to change the base?

Add AutoParser fname extension example #35

Uh oh!

Conversation

sondracek commented May 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tmi left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vojtechtuma-imi May 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmi Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sondracek commented May 15, 2023 •

edited

Loading

tmi left a comment •

edited

Loading

vojtechtuma-imi May 17, 2023 •

edited

Loading

tmi Aug 8, 2023 •

edited

Loading