|
1 | 1 | # java-string-compressor |
2 | | - |
3 | 2 | Ultra-fast, zero-allocation string compression library. Up to 50% memory reduction. |
4 | 3 |
|
5 | | -- 4 bits -> 50% compression rate |
6 | | -- 5 bits -> 38% compression rate |
7 | | -- 6 bits -> 25% compression rate |
8 | | - |
9 | | -Fast! Tiny milliseconds to compress a 10 MB string. Check out the benchmarks. |
10 | | - |
| 4 | +Fast! It takes only a few milliseconds to compress a 10 MB string. Check out the benchmarks.<br/>
11 | 5 | Well tested! See the test directory for usage examples and edge cases. |
12 | 6 |
|
13 | | -### 4‑bit compressor (`FourBitAsciiCompressor`) |
14 | | - |
15 | | -Compression rate: 50% |
16 | | -Maximum of 16 different chars. Default charset: `0-9`, `;`, `#`, `-`, `+`, `.`, `,` |
17 | | - |
18 | 7 | ```java |
19 | | -byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string. |
| 8 | +String data = "Assume this is a 100 megabytes string..."; |
| 9 | + |
| 10 | +// 4‑bit compressor -> 50% compression rate |
| 11 | +// Max of 16 different chars. Default charset: `0-9`, `;`, `#`, `-`, `+`, `.`, `,` |
20 | 12 | byte[] c = new FourBitAsciiCompressor().compress(data); // c is 50 megabytes. |
| 13 | + |
| 14 | +// 5‑bit compressor -> 38% compression rate |
| 15 | +// Max of 32 different chars. Default charset: `A-Z`, space, `.`, `,`, `\`, `-`, `@` |
| 16 | +byte[] c5 = new FiveBitAsciiCompressor().compress(data); // c5 is 62 megabytes.
| 17 | +
| 18 | +// 6‑bit compressor -> 25% compression rate
| 19 | +// Max of 64 different chars. Default charset: `A-Z`, `0-9`, and many punctuation marks defined at SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET.
| 20 | +byte[] c6 = new SixBitAsciiCompressor().compress(data); // c6 is 75 megabytes.
21 | 21 | ``` |
22 | 22 |
|
23 | | -### 5‑bit compressor (`FiveBitAsciiCompressor`) |
| 23 | +Check our documentation below. |
24 | 24 |
|
25 | | -Compression rate: 38% |
26 | | -Maximum of 32 different chars. Default charset: `A-Z`, space, `.`, `,`, `\`, `-`, `@` |
| 25 | +## Downloads |
| 26 | +```xml |
| 27 | +<dependency> |
| 28 | + <groupId>io.github.dannemann</groupId> |
| 29 | + <artifactId>java-string-compressor</artifactId> |
| 30 | + <version>1.0.0</version> |
| 31 | +</dependency> |
| 32 | +``` |
| 33 | +```java |
| 34 | +implementation("io.github.dannemann:java-string-compressor:1.0.0") |
| 35 | +``` |
| 36 | +Or download the latest JAR from: https://github.com/Dannemann/java-string-compressor/releases
| 37 | + |
| 38 | +## Documentation |
| 39 | +This library exists to quickly compress a massive volume of strings.
| 40 | +It is very useful if you need massive amounts of data kept in memory for quick access or compacted for storage.
| 41 | +We achieve this by removing all unnecessary bits from a character. But how? |
| 42 | + |
| 43 | +An ASCII character is stored in one byte (8 bits), but standard ASCII uses only 7 of them: `0000000` to `1111111`.
| 44 | +This gives us 128 different slots to represent characters.
| 45 | +But often we do not need all those characters, only a small subset of them.
| 46 | +For example, if your data only has digits (0-9) and a few punctuation marks, 16 different characters can be enough to
| 47 | +represent them, and we only need 4 bits (`0000` to `1111`) to represent 16 characters.
| 48 | +But if your data only has letters (A-Z, like customer names), a set of 32 different characters is enough, which can be |
| 49 | +represented by 5 bits. |
| 50 | +And if you need both, 6 bits are enough. |
| 51 | +This way we can remove those unnecessary bits and store only the ones we need. |
| 52 | +And this is exactly what this library does.
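
The bit-trimming idea described above can be sketched in plain Java. This is only an illustration, not the library's implementation: the class `NibblePackingSketch`, its `CHARSET` constant, the high-nibble-first layout, and the caller-supplied length in `unpack` are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustration only: stores each character of a 16-character set in 4 bits. */
final class NibblePackingSketch {

    // Hypothetical 16-character set, mirroring the default 4-bit charset listed above.
    static final byte[] CHARSET = "0123456789;#-+.,".getBytes(StandardCharsets.US_ASCII);

    // Lookup table from ASCII byte to 4-bit code; -1 marks unsupported characters.
    static final byte[] CODES = new byte[128];

    static {
        Arrays.fill(CODES, (byte) -1);
        for (int i = 0; i < CHARSET.length; i++) {
            CODES[CHARSET[i]] = (byte) i;
        }
    }

    /** Packs two characters into every output byte (first one in the high nibble). */
    static byte[] pack(byte[] ascii) {
        byte[] out = new byte[(ascii.length + 1) / 2];
        for (int i = 0; i < ascii.length; i++) {
            int b = ascii[i];
            int code = (b >= 0) ? CODES[b] : -1;
            if (code < 0) {
                throw new IllegalArgumentException("Unsupported character at index " + i);
            }
            out[i / 2] |= (i % 2 == 0) ? code << 4 : code;
        }
        return out;
    }

    /** Restores the original bytes; the original length must be supplied by the caller. */
    static byte[] unpack(byte[] packed, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int code = (i % 2 == 0) ? b >>> 4 : b & 0x0F;
            out[i] = CHARSET[code];
        }
        return out;
    }
}
```

In this sketch a 15-character input occupies 8 bytes instead of 15, which is where the roughly 50% saving of a 4-bit encoding comes from.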
| 53 | + |
| 54 | +Another important feature is searching. This library not only supports compacting, but also binary searching on the |
| 55 | +compacted data itself without deflating it, which will be explained later in this documentation. |
| 56 | + |
| 57 | +To compress a string, you can easily use either `FourBitAsciiCompressor`, `FiveBitAsciiCompressor`, or `SixBitAsciiCompressor`. |
| 58 | + |
| 59 | +### Creating a compressor object |
| 60 | +```java |
| 61 | +AsciiCompressor compressor = new SixBitAsciiCompressor(); |
| 62 | +``` |
27 | 63 |
|
| 64 | +#### Defining your custom character set |
| 65 | +Each compressor has a set of default supported characters, which are defined in the fields `DEFAULT_4BIT_CHARSET`, `DEFAULT_5BIT_CHARSET`, and `DEFAULT_6BIT_CHARSET`.
| 66 | +If you need a custom character set, use the constructors with the `supportedCharset` parameter:
28 | 67 | ```java |
29 | | -byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string. |
30 | | -byte[] c = new FiveBitAsciiCompressor().compress(data); // c is 62 megabytes. |
| 68 | +byte[] myCustom4BitCharset = {'!', '"', '#', '$', '%', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '@'}; // Follows ASCII character ordering. |
| 69 | +AsciiCompressor compressor = new FourBitAsciiCompressor(myCustom4BitCharset); |
31 | 70 | ``` |
32 | 71 |
|
33 | | -### 6‑bit compressor (`SixBitAsciiCompressor`) |
| 72 | +#### Catching invalid characters (useful for testing and debugging)
| 73 | +It’s useful to validate the input and throw errors when invalid characters are found. |
| 74 | +You can enable character validation by using any constructor with the `throwException` parameter.
| 75 | +Validation isn't recommended for production because you will probably be loading many gigabytes of data, and
| 76 | +you don't want a single invalid character to halt the whole process.
| 77 | +It’s better to occasionally display an incorrect character than to abort the entire operation. |
| 78 | +```java |
| 79 | +public FiveBitAsciiCompressor(boolean throwException) |
| 80 | +``` |
34 | 81 |
|
35 | | -Compression rate: 25% |
36 | | -Maximum of 64 different chars. Default charset supports `A-Z`, `0-9`, and many punctuation marks which are defined at |
37 | | -`SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET`. |
| 82 | +#### Preserving source byte arrays (useful for testing and debugging)
| 83 | +Whenever possible, try to read straight bytes from your input source without creating `String` objects from them. |
| 84 | +This will keep your whole compressing process zero-allocation (like this library), which boosts performance and memory saving. |
| 85 | +But when dealing directly with `byte[]` instead of `String`, you will notice that the compressor overwrites the original
| 86 | +input byte array to minimize memory usage, making it unusable afterwards.
| 87 | +To avoid this behavior and compress a copy of the original, enable input preservation by using any constructor with the `preserveOriginal` parameter.
| 88 | +```java |
| 89 | +public SixBitAsciiCompressor(byte[] supportedCharset, boolean throwException, boolean preserveOriginal) |
| 90 | +``` |
38 | 91 |
|
| 92 | +### Compressing and decompressing |
| 93 | +Once the compressor is instantiated, the compress and decompress process is straightforward: |
39 | 94 | ```java |
40 | | -byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string. |
41 | | -byte[] c = new SixBitAsciiCompressor().compress(data); // c is 75 megabytes. |
| 95 | + byte[] compressed = compressor.compress(input); |
| 96 | + byte[] decompressed = compressor.decompress(compressed); |
| 97 | + String string = new String(decompressed, StandardCharsets.ISO_8859_1); |
| 98 | +// String string = AsciiCompressor.getString(decompressed); // Same as above. Recommended. |
42 | 99 | ``` |
| 100 | +We recommend using `AsciiCompressor.getString(byte[])` because the method can be updated whenever a more efficient way to encode a `String` is found.
43 | 101 |
|
44 | | -### Defining your custom character set |
| 102 | +**In case you can't work directly with byte arrays and need `String` objects for compression:** |
| 103 | +To extract ASCII bytes from a `String` in the most efficient way (for compression), call `AsciiCompressor.getBytes(String)`.
| 104 | +The overloaded version `compressor.compress(String)` already calls it automatically, so you can simply call that overload.
45 | 105 |
|
46 | | -Compressors have a set of default characters supported for compression. These are defined in constants |
47 | | -```DEFAULT_4BIT_CHARSET```, ```DEFAULT_5BIT_CHARSET```, and ```DEFAULT_6BIT_CHARSET```. You can define your own |
48 | | -character set by using any constructor with the ```supportedCharset``` parameter. |
| 106 | +### Where to store the compressed data |
49 | 107 |
|
50 | | -### Catching invalid characters |
| 108 | +In its purest form, a `String` is just a byte array (`byte[]`), and a compressed `String` is no different.
| 109 | +You can store it anywhere you would store a `byte[]`.
| 110 | +The most common approach is to store each compressed string ordered in memory using a `byte[][]` (for binary search) or
| 111 | +a B+Tree (coming in the next release).
| 112 | +The frequency of reads and writes, together with your business requirements, will determine the best storage medium and data structure to use.
51 | 113 |
|
52 | | -It’s useful to validate the input and throw errors when invalid characters are found. |
53 | | -You can enable character validation by using any constructor with the ```throwException``` parameter. |
54 | | -Validation isn’t recommended for production because you will probably be adding dozens of gigabytes to the memory, |
55 | | -and you don't want a single invalid character to halt the whole processes. |
56 | | -It’s better to occasionally display an incorrect character than to abort the entire operation. |
| 114 | +If the data is ordered before compression and stored in memory in a `byte[][]`, you can use the full power of binary search directly on the compressed data
| 115 | +through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
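
To give a feel for what searching an ordered `byte[][]` involves, here is a minimal sketch that uses only the JDK. It does not use, and makes no claim about, the API of the library's binary search classes. The class `CompressedSearchSketch` and its `indexOf` method are hypothetical, and the sketch assumes that the compressed entries are sorted by unsigned lexicographic byte order and that the key was compressed the same way as the stored entries.

```java
import java.util.Arrays;

/** Illustration only: exact-match lookup in a sorted array of byte arrays. */
final class CompressedSearchSketch {

    /**
     * Returns the index of key in sorted, or a negative value if it is absent.
     * Entries are compared as unsigned bytes, so 0x9A sorts after 0x12.
     */
    static int indexOf(byte[][] sorted, byte[] key) {
        return Arrays.binarySearch(sorted, key, (a, b) -> Arrays.compareUnsigned(a, b));
    }
}
```

The unsigned comparison matters because Java bytes are signed: a packed byte such as `0x9A` is negative as a `byte`, and a signed comparison would place it before `0x12`. `Arrays.compareUnsigned` requires Java 9 or later.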
57 | 116 |
|
58 | | -### Preserving the original input string |
| 117 | +### Binary search |
59 | 118 |
|
60 | | -By default, the compressor overwrites the original input byte array to minimize memory usage. |
61 | | -Very useful when dealing with big strings, avoiding duplicating them. |
62 | | -You can enable input preservation by using any constructor with the ```preserveOriginal``` parameter. |
63 | 119 |
|
64 | | -## Downloads |
65 | 120 |
|
66 | | -Add it to your Maven project: |
67 | | -```xml |
68 | | -<dependency> |
69 | | - <groupId>io.github.dannemann</groupId> |
70 | | - <artifactId>java-string-compressor</artifactId> |
71 | | - <version>1.0.0</version> |
72 | | -</dependency> |
| 121 | +```java |
| 122 | +byte[][] compactedMass = new byte[100000000][]; // Data for 100 million customers. |
| 123 | + |
| 124 | + |
| 125 | +byte[] compressed = compressor.compress(input); |
| 126 | +byte[] decompressed = compressor.decompress(compressed); |
| 127 | +String string = new String(decompressed, StandardCharsets.ISO_8859_1); |
73 | 128 | ``` |
74 | 129 |
|
75 | | -Gradle: |
| 130 | +### Bulk / Batch compression |
| 131 | + |
| 132 | +java-string-compressor provides both `BulkCompressor` and `ManagedBulkCompressor` specifically for this task.
| 133 | +They help you automate the process of adding each batch to the correct position in the destination array where the
| 134 | +compressed data will be stored. Both currently support `byte[][]` as the destination for the compressed data.
| 135 | + |
| 136 | +`BulkCompressor` is a "lower-level" utility where you must manage where each compacted string is added in
| 137 | +the target `byte[][]`. On the other hand, `ManagedBulkCompressor` encapsulates and automates this process, sparing you
| 138 | +from handling array positions and bounds. This is why we recommend `ManagedBulkCompressor` (which uses a `BulkCompressor` internally).
| 139 | + |
| 140 | +Both bulk compressors loop through the data in parallel by calling `IntStream.range().parallel()`. |
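
The parallel loop mentioned above can be pictured with the following JDK-only sketch. It is not the code of `BulkCompressor` or `ManagedBulkCompressor`: the class `ParallelBatchSketch`, the `compressBatch` method, and the `UnaryOperator<byte[]>` standing in for a compressor are all assumptions made for illustration.

```java
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

/** Illustration only: writes one compressed batch into a shared destination array. */
final class ParallelBatchSketch {

    /**
     * Compresses source[i] into destination[offset + i] for every i, in parallel.
     * Each index is written by exactly one task, so no locking is needed.
     */
    static void compressBatch(byte[][] source, byte[][] destination, int offset,
                              UnaryOperator<byte[]> compress) {
        if (offset < 0 || source.length > destination.length - offset) {
            throw new IndexOutOfBoundsException("Batch does not fit in the destination array");
        }
        IntStream.range(0, source.length)
                 .parallel()
                 .forEach(i -> destination[offset + i] = compress.apply(source[i]));
    }
}
```

Tracking `offset` by hand between batches is the kind of bookkeeping that a managed wrapper would take over.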
| 141 | + |
| 142 | +Let's take `compactedMass` from the previous example and show how we can populate it with data from all customers. |
| 143 | + |
76 | 144 | ```java |
77 | | -implementation("io.github.dannemann:java-string-compressor:1.0.0") |
| 145 | +byte[][] compactedMass = new byte[100000000][]; // Data for 100 million customers. |
| 146 | + |
| 147 | + |
| 148 | + |
| 149 | + |
| 150 | + |
| 151 | + |
| 152 | + |
| 153 | +byte[] compressed = compressor.compress(input); |
| 154 | +byte[] decompressed = compressor.decompress(compressed); |
| 155 | +String string = new String(decompressed, StandardCharsets.ISO_8859_1); |
78 | 156 | ``` |
79 | 157 |
|
80 | | -Or download the lastest release from: https://github.com/Dannemann/java-string-compressor/releases |
| 158 | + |
| 159 | +`BulkCompressor` is a "lower-level" utility where |
| 160 | + |
| 161 | + |
| 162 | + |
| 163 | + |
| 164 | +### Other |
| 165 | +Do not forget to check our JavaDocs with further information about each member. |
|