DataSeries modules
DataSeries provides many modules that you can reuse. They include: DSExpr, DStoTextModule, IndexSourceModule, MinMaxIndexModule, PrefetchBufferModule, RowAnalysisModule, SequenceModule, TypeIndexModule.
TODO: re-sort these in some useful order, verify list above is complete, add in the data-series-server modules
A generic base class for modules that read from files and return extents. A pipeline that uses this module receives every extent from every file, in file order. This module also tracks statistics, such as the number of bytes read.
A module that reads from one or more files and returns extents that match a particular type prefix. This module is used to apply an analysis across a number of files.
This module takes a minimum and a maximum value and returns the extents that overlap the specified range. As a preprocessing step, the dsextentindex utility reads all extents to be evaluated, calculates the minimum and maximum values for the specified columns, and stores them in a file that the MinMaxIndexModule reads. This module is used to analyze subsets of a larger dataset, for example one month out of a multi-year dataset.
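The range-overlap selection described above can be sketched as follows. This is only an illustration of the idea, not the DataSeries API: `ExtentIndexEntry` and `selectOverlapping` are hypothetical names, and the on-disk index format written by dsextentindex is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical index entry, one per extent, holding what a preprocessing
// pass might record. This is NOT the real DataSeries index layout.
struct ExtentIndexEntry {
    std::string filename;    // file that holds the extent
    std::int64_t offset;     // byte offset of the extent within that file
    std::int64_t min_value;  // smallest value of the indexed column
    std::int64_t max_value;  // largest value of the indexed column
};

// Return the entries whose closed range [min_value, max_value]
// overlaps the requested closed range [lo, hi].
std::vector<ExtentIndexEntry> selectOverlapping(
        const std::vector<ExtentIndexEntry> &index,
        std::int64_t lo, std::int64_t hi) {
    assert(lo <= hi);
    std::vector<ExtentIndexEntry> selected;
    for (const ExtentIndexEntry &entry : index) {
        // Two closed ranges overlap unless one ends before the other begins.
        if (entry.max_value >= lo && entry.min_value <= hi) {
            selected.push_back(entry);
        }
    }
    return selected;
}
```

Only the selected extents then need to be read and decompressed, which is why the index pays off when an analysis covers a small slice of a large dataset.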
A module that filters for extents matching a particular type. It is deprecated in favor of TypeIndexModule.
A module for generating new extents. This module tracks the size of the current extent and automatically flushes the extent when it exceeds a specified size. It can optionally parallelize compression for improved performance.
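The size-triggered flush can be pictured with a small stand-in. The class below is a hypothetical sketch, not the DataSeries output module: `BufferedRowWriter`, its methods, and the byte accounting are assumptions, and compression (parallel or otherwise) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an extent-generating module: it buffers rows
// and emits them as one "extent" once a size target is exceeded.
class BufferedRowWriter {
public:
    explicit BufferedRowWriter(std::size_t target_bytes)
        : target_bytes_(target_bytes) {
        assert(target_bytes > 0);
    }

    // Append one serialized row and flush if the buffer exceeds the target.
    void addRow(const std::string &row) {
        current_.push_back(row);
        current_bytes_ += row.size();
        if (current_bytes_ > target_bytes_) {
            flush();
        }
    }

    // Emit the buffered rows as one extent. A real writer would compress
    // the extent and write it to a file at this point.
    void flush() {
        if (current_.empty()) {
            return;
        }
        flushed_.push_back(std::move(current_));
        current_.clear();  // restore a known-empty state after the move
        current_bytes_ = 0;
    }

    const std::vector<std::vector<std::string>> &flushed() const {
        return flushed_;
    }

private:
    std::size_t target_bytes_;
    std::size_t current_bytes_ = 0;
    std::vector<std::string> current_;
    std::vector<std::vector<std::string>> flushed_;
};
```

The caller still needs one explicit `flush()` at the end so that a final, partially filled extent is not lost.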
A module that converts the input extents to text. The user can specify the output format so that the text matches the output of existing programs. An example use is converting DataSeries files to CSV.
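As a rough illustration of the CSV case, the sketch below turns already-extracted rows into CSV text. It assumes each row has been rendered as a vector of strings; `csvEscape` and `rowsToCsv` are hypothetical helpers and do not reflect how the text-conversion module is actually configured.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes (the usual CSV convention).
std::string csvEscape(const std::string &field) {
    if (field.find_first_of(",\"\n\r") == std::string::npos) {
        return field;
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += '"';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Render a header line followed by one line per row.
std::string rowsToCsv(const std::vector<std::string> &header,
                      const std::vector<std::vector<std::string>> &rows) {
    std::string out;
    auto appendLine = [&out](const std::vector<std::string> &fields) {
        for (std::size_t i = 0; i < fields.size(); ++i) {
            if (i > 0) {
                out += ',';
            }
            out += csvEscape(fields[i]);
        }
        out += '\n';
    };
    appendLine(header);
    for (const std::vector<std::string> &row : rows) {
        assert(row.size() == header.size());  // every row must match the header
        appendLine(row);
    }
    return out;
}
```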
A module that allows the decompression and analysis steps to proceed in parallel, increasing efficiency on multi-core and SMP systems. Source modules already overlap reading from disk with other work, but a PrefetchBufferModule can also be placed between analysis modules so that different analyses run in parallel. This module is mostly deprecated, except for parallelizing analyses, because the input modules now read in parallel.
A module that stores a sequence of modules in a pipeline. Analysis programs use it to select the list of modules dynamically and, on completion, to run the printResult functions of all the selected modules.
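The "select modules at run time, then print every result" pattern might look like the sketch below. All class names here are hypothetical, and the sketch simplifies the data flow: it hands each value to every stage in turn, whereas in a real pipeline each module pulls extents from the module before it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical analysis interface; not the DataSeries class hierarchy.
class Analysis {
public:
    virtual ~Analysis() = default;
    virtual void process(int value) = 0;
    virtual void printResult(std::ostream &out) const = 0;
};

class CountAnalysis : public Analysis {
public:
    void process(int) override { ++count_; }
    void printResult(std::ostream &out) const override {
        out << "count=" << count_ << "\n";
    }
private:
    std::size_t count_ = 0;
};

class SumAnalysis : public Analysis {
public:
    void process(int value) override { sum_ += value; }
    void printResult(std::ostream &out) const override {
        out << "sum=" << sum_ << "\n";
    }
private:
    long sum_ = 0;
};

// Holds the analyses chosen at run time and prints all results at the end.
class AnalysisSequence {
public:
    void add(std::unique_ptr<Analysis> analysis) {
        assert(analysis != nullptr);
        stages_.push_back(std::move(analysis));
    }
    void process(int value) {
        for (auto &stage : stages_) {
            stage->process(value);
        }
    }
    void printAllResults(std::ostream &out) const {
        for (const auto &stage : stages_) {
            stage->printResult(out);
        }
    }
    std::string resultsAsString() const {
        std::ostringstream out;
        printAllResults(out);
        return out.str();
    }
private:
    std::vector<std::unique_ptr<Analysis>> stages_;
};
```

A program would typically call `add` once per analysis requested on the command line, feed the data through, and finish with a single `printAllResults(std::cout)`.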
A module for building analyses that operate a row at a time. This module handles iterating over the rows in each extent and calling the preparation and finalization functions. Using it is slightly less efficient than duplicating the iteration code, because it makes a virtual function call for each row.
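The trade-off described above can be seen in a minimal sketch of the pattern. The names (`RowAtATimeAnalysis`, `prepare`, `processRow`, `finish`) and the simplified `Row` and `Extent` types are assumptions chosen for illustration; they are not the signatures of the DataSeries class.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical data model: a row is one integer,
// and an extent is a batch of rows.
using Row = std::int64_t;
using Extent = std::vector<Row>;

// The base class owns the iteration; subclasses only supply per-row logic.
class RowAtATimeAnalysis {
public:
    virtual ~RowAtATimeAnalysis() = default;

    void run(const std::vector<Extent> &extents) {
        assert(!finished_);  // each analysis object is run once
        prepare();
        for (const Extent &extent : extents) {
            for (const Row &row : extent) {
                processRow(row);  // one virtual call per row
            }
        }
        finish();
        finished_ = true;
    }

protected:
    virtual void prepare() {}
    virtual void processRow(const Row &row) = 0;
    virtual void finish() {}

private:
    bool finished_ = false;
};

// Example subclass: track the largest row value seen.
class MaxRowAnalysis : public RowAtATimeAnalysis {
public:
    bool hasMax() const { return seen_; }
    Row max() const { return max_; }

protected:
    void processRow(const Row &row) override {
        if (!seen_ || row > max_) {
            max_ = row;
            seen_ = true;
        }
    }

private:
    bool seen_ = false;
    Row max_ = 0;
};
```

The `processRow` call inside the inner loop is the per-row virtual dispatch that a hand-written loop would avoid, at the cost of repeating the iteration code in every analysis.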
For each row, this module calculates an expression over constants and the fields in the row. The expression results are used to calculate statistics such as means, quantiles, or sequences. A separate field is used to group the statistics. A DSStatGroupByModule can be thought of as a very simplified SQL select statement.
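To make the SQL analogy concrete, the sketch below computes the mean of a per-row expression for each group, roughly `SELECT op, AVG(bytes / seconds) ... GROUP BY op`. The row layout (`IoRow`), the `RunningMean` helper, and `meanByGroup` are hypothetical; the real module takes its expression and group field as configuration rather than as C++ callbacks, and it supports more statistics than the mean.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical row type for the example.
struct IoRow {
    std::string op;   // field used to group the statistics
    double bytes;
    double seconds;
};

// Accumulates a count and a sum so that the mean is available at the end.
struct RunningMean {
    std::size_t count = 0;
    double sum = 0.0;

    void add(double value) {
        ++count;
        sum += value;
    }
    double mean() const {
        assert(count > 0);
        return sum / static_cast<double>(count);
    }
};

// Evaluate expr on every row and accumulate the result under the row's group.
std::map<std::string, RunningMean> meanByGroup(
        const std::vector<IoRow> &rows,
        const std::function<std::string(const IoRow &)> &group_of,
        const std::function<double(const IoRow &)> &expr) {
    std::map<std::string, RunningMean> stats;
    for (const IoRow &row : rows) {
        stats[group_of(row)].add(expr(row));
    }
    return stats;
}
```

Calling it with `group_of` returning `row.op` and `expr` returning `row.bytes / row.seconds` yields one mean throughput per operation type.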