Defining types
An extent-type in DataSeries is similar to a SQL table create statement in that it defines the field-type of all the fields in a related group of records. Every extent-type has a name that is intended to be a general description of the record. We have found that using a naming convention that encodes type and hierarchical information, such as Trace::BlockIO::HPUX, Trace::NFS::common or Trace::NFS::read-write works well, providing information to the user on what is contained, and about extents that may be related (e.g., the Trace::NFS::* names above).
Extent types are described in XML, which allows us to specify attributes associated both with the entire type, and with each individual field. Some attributes are parsed and understood by the DataSeries library, while others are used by individual programs. Since the type is written in XML, users can define additional conventions for their own use.
The columns in an extent type are referred to as fields. The following types are defined:
| type | description |
|---|---|
| bool | Boolean -- true(1) or false(0) |
| byte | Byte -- unsigned: 0-255 |
| int32 | Signed 32 bit integer |
| int64 | Signed 64 bit integer |
| double | IEEE 64 bit floating point |
| variable32 | Sequence of up to 2^31 bytes |
| fixedwidth | A fixed-width sequence of bytes. |
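As a rough illustration of how a type definition might be read, the sketch below parses a small XML type with Python's standard library and checks each field against the types in the table above. The type name `Trace::Example::io`, the field names, and the use of a `field` element with `name` and `type` attributes are assumptions made for this example; only the `ExtentType` element name comes from this page. It is not the DataSeries library's own parser.

```python
import xml.etree.ElementTree as ET

# Field types listed in the table above.
KNOWN_TYPES = {"bool", "byte", "int32", "int64", "double", "variable32", "fixedwidth"}

# Hypothetical type definition; the <field> element and its name/type
# attributes are assumptions made for this sketch.
TYPE_XML = """
<ExtentType name="Trace::Example::io">
  <field type="int64" name="request_time" />
  <field type="int32" name="bytes" />
  <field type="bool" name="is_read" />
  <field type="variable32" name="filename" />
</ExtentType>
"""

def parse_extent_type(xml_text):
    """Return (type name, [(field name, field type), ...]) for a type definition."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "ExtentType":
        raise ValueError(f"expected ExtentType element, got {root.tag}")
    fields = []
    for field in root.findall("field"):
        field_type = field.get("type")
        if field_type not in KNOWN_TYPES:
            raise ValueError(f"unknown field type: {field_type}")
        fields.append((field.get("name"), field_type))
    return root.get("name"), fields

type_name, fields = parse_extent_type(TYPE_XML)
```

Rejecting unknown field types up front keeps a typo in the XML from surfacing later as a confusing read error.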
In addition to the supported types, there are a number of options that can be applied to the data types. Core options take the form opt_ or pack_. The former extend or change the values available to applications, and require applications to understand the option; the latter are transparent to applications, but enable higher compression than would otherwise be available. Options are applied either to an entire extent or to individual fields. One possible direction for future work would be automatically inferring the "best" packing options.
Some of the DataSeries properties apply to the entire extent type. In this case, they are specified as attributes on the ExtentType element.
| attribute | description |
|---|---|
| namespace | Used to make names in a type unique. Should be a domain name that the creator of the type controls. |
| version | Version number for the type in the form major.minor with the semantics that minor versions are only allowed to add new fields, whereas major versions can remove fields, rename fields or change field semantics. This means that analysis code that can process version 1.x will work on any version 1.y for y >= x, but may or may not work on version 2.0. |
| pack_null_compact | Controls whether all null values of nullable fields are removed before the records are run through compression. For records with many nullable values this can greatly increase the compression ratio at a cost of additional computation time. The technical report evaluates this option in the section on the Ellard Traces. |
| pack_pad_record | This option controls how the record is padded. Originally all records were padded to 8 bytes. For records with only 4 byte or smaller fields, this wastes some amount of space, and hence the option to pad to the maximum column size was added. The technical report evaluates this option in the section on the 1998 World Cup traces. |
| pack_field_ordering | This option controls how the fields are ordered within a record. It turns out some files compress better with different field orderings. The technical report evaluates this option in the section on the 1998 World Cup traces. |
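The version rule in the table above can be expressed as a small compatibility check. This is a minimal sketch of that rule only; the function names `parse_version` and `can_process` are hypothetical and are not part of the DataSeries library.

```python
def parse_version(version):
    """Split a 'major.minor' version string into a pair of integers."""
    major, minor = version.split(".")
    return int(major), int(minor)

def can_process(code_version, data_version):
    """Return True when analysis code written for code_version is
    guaranteed to work on data of data_version: same major version
    and an equal or newer minor version. False means "no guarantee",
    not "certain to fail"."""
    code_major, code_minor = parse_version(code_version)
    data_major, data_minor = parse_version(data_version)
    return data_major == code_major and data_minor >= code_minor
```

For example, `can_process("1.2", "1.5")` is true because minor versions only add fields, while `can_process("1.2", "2.0")` is false because a major version may remove or rename fields.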
In addition to the field type, a field may have some attributes. These attributes primarily affect the definition of the field (opt_) or the way the field is transformed before compression (pack_). Other attributes may be specified to control sub-properties. Known attributes include:
| attribute | description |
|---|---|
| opt_nullable | Indicates values in this column can be null or a value of the field type. This option is implemented by generating a hidden boolean column that determines if the value is null. Note: fields default to not-null. |
| opt_doublebase=base-value | Specifies a relative base for doubles. This is used to gain additional precision in the double without losing the absolute value. This option was thought to be useful for storing timestamps with nanosecond precision, but it turned out to be difficult to program. We later found that using int64 values with units of 2^-32 seconds worked better. Hence this option is deprecated. |
| pack_relative=field-name | Specifies that this field should be packed relative to another field. This delta encoding option is useful for compressing time stamps and other values which may be large but are usually close to the previous value in the same field or to the value of a different field in the same row. If field-name is the same as the current field's name, the previous row's value is subtracted from the current row's value before the data is compressed; otherwise the value of the other field in the same row is subtracted. This feature is only supported for int32, int64, and double fields. Double fields must also be packed with pack_scale to eliminate precision issues. |
| pack_unique | Specifies that each unique variable32 value should only be packed once within that extent. This option applies across all variable32 fields with pack_unique enabled. For fields with many repeated values this option can significantly increase the effective compression ratio because it entirely removes duplicate data within that extent (compression algorithms only remove it partially). |
| pack_scale=precision | This option specifies the precision to use for double values. The double is multiplied by 1/precision and rounded to an integer before being compressed, and the reverse transform is applied after the data is uncompressed. This option is useful because values that are stored as doubles sometimes accumulate pseudo-random bits in the low digits. These bits contain no useful information and reduce the achievable compression; scaling and rounding the double removes them. Note: if your values are not within 10% of an integer multiple of precision, specifying this option will generate warnings. These warnings are intended to help you avoid accidentally losing precision. You can turn the warnings off by specifying pack_scale_warn="no". |
| pack_scale_warn="yes\|no" | By default pack_scale will warn you if your doubles are too far off from being integer multiples of the precision. Some programmers are willing to lose precision and don't want to be warned about that loss. In that case, you can specify pack_scale_warn="no" to disable warnings. Alternatively, you could override the LintelLog appender and write one that drops those messages. |
| units=... | The units attribute is intended to describe the units for a particular measurement. Right now it is used with the Int64TimeField to handle different units for time. Currently supported units are "2^-32 seconds", "nanoseconds", and "microseconds". |
| epoch=... | The epoch attribute is used with the Int64TimeField to specify the epoch for the time. The currently supported value is "unix". |
| comment | Just a text comment for this field (stored only once in the type extent). While this attribute technically fits with all the "arbitrary" ones, it has become somewhat of a convention. |
| print_true="..." | Specify the string to print when a boolean field is true |
| print_false="..." | Specify the string to print when a boolean field is false |
| print_format="..." | Specify a printf or boost::format-style print specifier for byte, int32, int64, and double fields |
| print_divisor="integer" | Specify a divisor for int32 or int64 fields to apply before printing the integer |
| print_offset="{first, relativeto:field, int64}" | Specify that an int64 field should be printed relative to either the first value printed, another field in the same row, or a specific int64. |
| print_multiplier="double" | Specify a multiplier for double fields to be applied prior to printing |
| print_style="style" | Specify a style for printing variable32 fields. Current values include maybehex, csv, text |
| ... | You can add any other XML attributes provided they don't start with pack or opt; the library will simply ignore them. |
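To make the pack_relative and pack_scale transforms described above more concrete, the following sketch applies them to plain Python lists. It illustrates the arithmetic only and is not the library's implementation: the function names are hypothetical, and treating the value before the first row as 0 is an assumption of this example.

```python
def delta_encode(values):
    """Self-relative packing: subtract the previous row's value from each
    row's value. Assumes a starting 'previous' value of 0 for the first row."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Reverse of delta_encode: a running sum restores the original values."""
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out

def relative_to_field(values, base_values):
    """Packing relative to a different field: subtract that field's value
    in the same row."""
    return [v - b for v, b in zip(values, base_values)]

def scale_encode(values, precision):
    """pack_scale-style transform: scale by 1/precision, round to an integer."""
    return [round(v / precision) for v in values]

def scale_decode(scaled, precision):
    """Reverse transform applied after decompression."""
    return [s * precision for s in scaled]

timestamps = [1000000, 1000007, 1000012]
deltas = delta_encode(timestamps)            # small numbers after the first row
scaled = scale_encode([0.1 + 0.2, 1.5], 0.1) # low-order noise is rounded away
```

After delta encoding, most rows hold small integers, and after scaling, doubles become integers without noisy low bits. Both forms tend to compress better than the raw values.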