Commit e1d2b76

committed
README updates.
1 parent 83b5ab6 commit e1d2b76

1 file changed

+169
-49
lines changed

README.md

Lines changed: 169 additions & 49 deletions
Original file line number | Diff line number | Diff line change
@@ -1,83 +1,203 @@
11
# java-string-compressor
2-
32
Ultra-fast, zero-allocation string compression library. Up to 50% memory reduction.
43

5-
- 4 bits -> 50% compression rate
6-
- 5 bits -> 38% compression rate
7-
- 6 bits -> 25% compression rate
8-
9-
Fast! It takes only a few milliseconds to compress a 10 MB string. Check out the benchmarks.
10-
4+
Fast! It takes only a few milliseconds to compress a 10 MB string. Check out the benchmarks.<br/>
115
Well tested! See the test directory for usage examples and edge cases.
126

13-
### 4‑bit compressor (`FourBitAsciiCompressor`)
14-
15-
Compression rate: 50%
16-
Maximum of 16 different chars. Default charset: `0-9`, `;`, `#`, `-`, `+`, `.`, `,`
17-
187
```java
19-
byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string.
8+
String data = "Assume this is a 100 megabytes string...";
9+
10+
// 4‑bit compressor -> 50% compression rate
11+
// Max of 16 different chars. Default charset: `0-9`, `;`, `#`, `-`, `+`, `.`, `,`
2012
byte[] c = new FourBitAsciiCompressor().compress(data); // c is 50 megabytes.
13+
14+
// 5‑bit compressor -> 38% compression rate
15+
// Max of 32 different chars. Default charset: `A-Z`, space, `.`, `,`, `\`, `-`, `@`
16+
byte[] c5 = new FiveBitAsciiCompressor().compress(data); // c5 is 62 megabytes.
17+
18+
// 6‑bit compressor -> 25% compression rate
19+
// Max of 64 different chars. Default charset: `A-Z`, `0-9`, and many punctuation marks defined at SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET.
20+
byte[] c6 = new SixBitAsciiCompressor().compress(data); // c6 is 75 megabytes.
2121
```
2222

23-
### 5‑bit compressor (`FiveBitAsciiCompressor`)
23+
Check our documentation below.
2424

25-
Compression rate: 38%
26-
Maximum of 32 different chars. Default charset: `A-Z`, space, `.`, `,`, `\`, `-`, `@`
25+
## Downloads
26+
```xml
27+
<dependency>
28+
<groupId>io.github.dannemann</groupId>
29+
<artifactId>java-string-compressor</artifactId>
30+
<version>1.0.0</version>
31+
</dependency>
32+
```
33+
```kotlin
34+
implementation("io.github.dannemann:java-string-compressor:1.0.0")
35+
```
36+
Or download the latest JAR from: https://github.com/Dannemann/java-string-compressor/releases
37+
38+
## Documentation
39+
This library exists to quickly compress a massive volume of strings.
40+
It is very useful when you need to keep massive amounts of data in memory for quick access, or compacted for storage.
41+
We achieve this by removing all unnecessary bits from a character. But how?
42+
43+
An ASCII character is stored in 8 bits, but only 7 of them are used: `00000000` to `01111111`.
44+
This gives us 128 different slots to represent characters.
45+
But often we do not need all of those characters, only a small subset of them.
46+
For example, if your data only has numbers (0-9) and a few punctuation marks, 16 different characters can be enough to
47+
represent them, and we only need 4 bits (`0000` to `1111`) to represent 16 characters.
48+
But if your data only has letters (A-Z, like customer names), a set of 32 different characters is enough, which can be
49+
represented by 5 bits.
50+
And if you need both letters and numbers, 6 bits (64 different characters) are enough.
51+
This way we can remove those unnecessary bits and store only the ones we need.
52+
And this is exactly what this library does.
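
To make the idea concrete, here is a minimal sketch of 4-bit packing. It is not the library's implementation: the class `NibblePackingSketch`, its methods, and the choice to pass the original length to `unpack` are all assumptions made for illustration. It only shows how two characters from a 16-character set can share one byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of java-string-compressor.
final class NibblePackingSketch {

    // The 16 supported characters; a character's index is its 4-bit code.
    static final byte[] CHARSET = "0123456789;#-+.,".getBytes(StandardCharsets.US_ASCII);

    // Lookup table from ASCII value to 4-bit code (-1 marks unsupported characters).
    static final byte[] CODE_OF = new byte[128];

    static {
        Arrays.fill(CODE_OF, (byte) -1);
        for (int i = 0; i < CHARSET.length; i++) {
            CODE_OF[CHARSET[i]] = (byte) i;
        }
    }

    // Packs two characters into each output byte.
    static byte[] pack(byte[] ascii) {
        byte[] packed = new byte[(ascii.length + 1) / 2];
        for (int i = 0; i < ascii.length; i++) {
            int c = ascii[i];
            if (c < 0 || CODE_OF[c] < 0) {
                throw new IllegalArgumentException("Unsupported character at index " + i);
            }
            int code = CODE_OF[c];
            // Even positions fill the high nibble, odd positions the low nibble.
            packed[i / 2] |= (i % 2 == 0) ? code << 4 : code;
        }
        return packed;
    }

    // Restores the original bytes; the caller supplies the original length (an assumption of this sketch).
    static byte[] unpack(byte[] packed, int length) {
        byte[] ascii = new byte[length];
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int code = (i % 2 == 0) ? b >>> 4 : b & 0x0F;
            ascii[i] = CHARSET[code];
        }
        return ascii;
    }

    public static void main(String[] args) {
        byte[] input = "2024-01-31;+55.10".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = pack(input);
        System.out.println(input.length + " bytes packed into " + packed.length + " bytes");
        System.out.println(new String(unpack(packed, input.length), StandardCharsets.US_ASCII));
    }
}
```

The same principle extends to 5 and 6 bits, except that codes then straddle byte boundaries and need extra shifting.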
53+
54+
Another important feature is searching. This library not only supports compacting, but also binary searching on the
55+
compacted data itself without decompressing it, which is explained later in this documentation.
56+
57+
To compress a string, use `FourBitAsciiCompressor`, `FiveBitAsciiCompressor`, or `SixBitAsciiCompressor`.
58+
59+
### Creating a compressor object
60+
```java
61+
AsciiCompressor compressor = new SixBitAsciiCompressor();
62+
```
2763

64+
#### Defining your custom character set
65+
Each compressor has a set of default supported characters, defined in the fields `DEFAULT_4BIT_CHARSET`, `DEFAULT_5BIT_CHARSET`, and `DEFAULT_6BIT_CHARSET`.
66+
If you need a custom character set, use a constructor with the `supportedCharset` parameter:
2867
```java
29-
byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string.
30-
byte[] c = new FiveBitAsciiCompressor().compress(data); // c is 62 megabytes.
68+
byte[] myCustom4BitCharset = {'!', '"', '#', '$', '%', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '@'}; // Follows ASCII character ordering.
69+
AsciiCompressor compressor = new FourBitAsciiCompressor(myCustom4BitCharset);
3170
```
3271

33-
### 6‑bit compressor (`SixBitAsciiCompressor`)
72+
#### Catching invalid characters (useful for testing and debugging)
73+
It’s useful to validate the input and throw errors when invalid characters are found.
74+
You can enable character validation by using any constructor with the `throwException` parameter.
75+
Validation isn't recommended for production because you will probably be allocating many gigabytes of data, and
76+
you don't want a single invalid character to halt the whole process.
77+
It’s better to occasionally display an incorrect character than to abort the entire operation.
78+
```java
79+
public FiveBitAsciiCompressor(boolean throwException)
80+
```
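
The trade-off between strict and lenient handling can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class `CharsetValidationSketch` and its `validate` method are hypothetical, and what the real compressors do with an unsupported character in lenient mode is not shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of java-string-compressor.
final class CharsetValidationSketch {

    // Membership table indexed by ASCII value, built once from the supported characters.
    private final boolean[] supported = new boolean[128];
    private final boolean throwException;

    // Assumes supportedCharset contains only ASCII bytes (0 to 127).
    CharsetValidationSketch(byte[] supportedCharset, boolean throwException) {
        for (byte c : supportedCharset) {
            supported[c] = true;
        }
        this.throwException = throwException;
    }

    // Strict mode throws on the first unsupported byte; lenient mode counts them and keeps going.
    int validate(byte[] input) {
        int invalid = 0;
        for (int i = 0; i < input.length; i++) {
            int c = input[i];
            if (c < 0 || !supported[c]) {
                if (throwException) {
                    throw new IllegalArgumentException("Unsupported character at index " + i);
                }
                invalid++;
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        byte[] charset = "0123456789;#-+.,".getBytes(StandardCharsets.US_ASCII);
        byte[] input = "12a4".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Invalid characters: " + new CharsetValidationSketch(charset, false).validate(input));
    }
}
```

In a test suite the strict variant surfaces bad input immediately, while the lenient variant lets a long-running load finish.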
3481

35-
Compression rate: 25%
36-
Maximum of 64 different chars. Default charset supports `A-Z`, `0-9`, and many punctuation marks which are defined at
37-
`SixBitAsciiCompressor.DEFAULT_6BIT_CHARSET`.
82+
#### Preserving source byte arrays (useful for testing and debugging)
83+
Whenever possible, try to read straight bytes from your input source without creating `String` objects from them.
84+
This will keep your whole compressing process zero-allocation (like this library), which boosts performance and memory saving.
85+
But when dealing directly with `byte[]` instead of `String` objects, you will notice that the compressor overwrites the original
86+
input byte array to minimize memory usage, making it unusable.
87+
To avoid this behavior and compress a copy of the original, enable input preservation by using any constructor with the `preserveOriginal` parameter.
88+
```java
89+
public SixBitAsciiCompressor(byte[] supportedCharset, boolean throwException, boolean preserveOriginal)
90+
```
3891

92+
### Compressing and decompressing
93+
Once the compressor is instantiated, the compress and decompress process is straightforward:
3994
```java
40-
byte[] data = str.getBytes(US_ASCII); // Assume data is a 100 megabytes string.
41-
byte[] c = new SixBitAsciiCompressor().compress(data); // c is 75 megabytes.
95+
byte[] compressed = compressor.compress(input);
96+
byte[] decompressed = compressor.decompress(compressed);
97+
String string = new String(decompressed, StandardCharsets.ISO_8859_1);
98+
// String string = AsciiCompressor.getString(decompressed); // Same as above. Recommended.
4299
```
100+
We recommend using `AsciiCompressor.getString(byte[])` because the method can be updated whenever a more efficient way to create a `String` is found.
43101

44-
### Defining your custom character set
102+
**In case you can't work directly with byte arrays and need `String` objects for compression:**
103+
To extract ASCII bytes from a `String` in the most efficient way (for compression), call `AsciiCompressor.getBytes(String)`.
104+
The overloaded version `compressor.compress(String)` already calls it automatically, so you can simply call that overload.
45105

46-
Compressors have a set of default characters supported for compression. These are defined in constants
47-
```DEFAULT_4BIT_CHARSET```, ```DEFAULT_5BIT_CHARSET```, and ```DEFAULT_6BIT_CHARSET```. You can define your own
48-
character set by using any constructor with the ```supportedCharset``` parameter.
106+
### Where to store the compressed data
49107

50-
### Catching invalid characters
108+
In its purest form, a `String` is just a byte array (`byte[]`), and a compressed `String` is no different.
109+
You can store it anywhere you would store a `byte[]`.
110+
The most common approach is to keep the compressed strings sorted in memory using a `byte[][]` (for binary search) or
111+
a B+Tree (coming in the next release).
112+
The frequency of reads and writes, together with your business requirements, will determine the best storage medium and data structure to use.
51113

52-
It’s useful to validate the input and throw errors when invalid characters are found.
53-
You can enable character validation by using any constructor with the ```throwException``` parameter.
54-
Validation isn’t recommended for production because you will probably be adding dozens of gigabytes to the memory,
55-
and you don't want a single invalid character to halt the whole process.
56-
It’s better to occasionally display an incorrect character than to abort the entire operation.
114+
If the data is ordered before compression and stored in-memory in a `byte[][]`, you can use the full power of binary search directly on the compressed data
115+
through `FourBitBinarySearch`, `FiveBitBinarySearch`, and `SixBitBinarySearch`.
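
As a rough picture of what searching a sorted `byte[][]` involves, here is a generic binary search over byte arrays. It is a sketch, not the library's `FourBitBinarySearch`, `FiveBitBinarySearch`, or `SixBitBinarySearch`: the class `ByteArrayBinarySearchSketch` is hypothetical, and it assumes the entries are sorted in unsigned lexicographic byte order and that the key is encoded the same way as the entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of java-string-compressor.
final class ByteArrayBinarySearchSketch {

    // Returns the index of key, or -(insertionPoint + 1) if it is absent,
    // following the same convention as java.util.Arrays.binarySearch.
    static int indexOf(byte[][] sorted, byte[] key) {
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            // Unsigned lexicographic comparison (available since Java 9).
            int cmp = Arrays.compareUnsigned(sorted[mid], key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    public static void main(String[] args) {
        byte[][] customers = {
            "ALICE".getBytes(StandardCharsets.US_ASCII),
            "BOB".getBytes(StandardCharsets.US_ASCII),
            "CAROL".getBytes(StandardCharsets.US_ASCII)
        };
        System.out.println(indexOf(customers, "BOB".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Because the comparison works on raw bytes, no `String` is created and nothing is decompressed during the search.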
57116

58-
### Preserving the original input string
117+
### Binary search
59118

60-
By default, the compressor overwrites the original input byte array to minimize memory usage.
61-
Very useful when dealing with big strings, avoiding duplicating them.
62-
You can enable input preservation by using any constructor with the ```preserveOriginal``` parameter.
63119

64-
## Downloads
65120

66-
Add it to your Maven project:
67-
```xml
68-
<dependency>
69-
<groupId>io.github.dannemann</groupId>
70-
<artifactId>java-string-compressor</artifactId>
71-
<version>1.0.0</version>
72-
</dependency>
121+
```java
122+
byte[][] compactedMass = new byte[100000000][]; // Data for 100 million customers.
123+
124+
125+
byte[] compressed = compressor.compress(input);
126+
byte[] decompressed = compressor.decompress(compressed);
127+
String string = new String(decompressed, StandardCharsets.ISO_8859_1);
73128
```
74129

75-
Gradle:
130+
### Bulk / Batch compression
131+
132+
java-string-compressor provides both `BulkCompressor` and `ManagedBulkCompressor` specifically for this task.
133+
They help you automate the process of adding each batch to the correct position in the destination array where the
134+
compressed data will be stored. Both currently support `byte[][]` as the destination for the compressed data.
135+
136+
`BulkCompressor` is a "lower-level" utility where you manage the position at which each compacted string is added in
137+
the target `byte[][]`. On the other hand, `ManagedBulkCompressor` encapsulates and automates this process, so you do not
138+
have to handle array positions and bounds. This is why we recommend `ManagedBulkCompressor` (which uses a `BulkCompressor` internally).
139+
140+
Both bulk compressors loop through the data in parallel by calling `IntStream.range().parallel()`.
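
The general pattern of filling a destination `byte[][]` in parallel, batch by batch, can be sketched like this. It does not use or reproduce the API of `BulkCompressor` or `ManagedBulkCompressor`: the class `ParallelBatchSketch`, the method `addBatch`, and the `UnaryOperator<byte[]>` stand-in for a compressor are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

// Hypothetical illustration only; not part of java-string-compressor.
final class ParallelBatchSketch {

    // Compresses batch[i] into target[offset + i] in parallel and returns the next free offset.
    static int addBatch(byte[][] target, int offset, byte[][] batch, UnaryOperator<byte[]> compress) {
        if (offset < 0 || offset + batch.length > target.length) {
            throw new IndexOutOfBoundsException("Batch does not fit in the target array");
        }
        // Each index is written by exactly one task, so no synchronization is needed.
        IntStream.range(0, batch.length)
                 .parallel()
                 .forEach(i -> target[offset + i] = compress.apply(batch[i]));
        return offset + batch.length;
    }

    public static void main(String[] args) {
        byte[][] target = new byte[5][];
        // Stand-in "compressor" that just keeps the first byte; a real compressor would go here.
        UnaryOperator<byte[]> fake = b -> Arrays.copyOf(b, 1);

        int next = addBatch(target, 0, new byte[][] {{1, 2}, {3, 4}}, fake);
        next = addBatch(target, next, new byte[][] {{5, 6}, {7, 8}, {9, 10}}, fake);
        System.out.println("Next free position: " + next);
    }
}
```

Tracking the returned offset by hand corresponds to the "lower-level" style described above; a managed variant would keep that offset internally.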
141+
142+
Let's take `compactedMass` from the previous example and show how we can populate it with data from all customers.
143+
76144
```java
77-
implementation("io.github.dannemann:java-string-compressor:1.0.0")
145+
byte[][] compactedMass = new byte[100000000][]; // Data for 100 million customers.
146+
147+
148+
149+
150+
151+
152+
153+
byte[] compressed = compressor.compress(input);
154+
byte[] decompressed = compressor.decompress(compressed);
155+
String string = new String(decompressed, StandardCharsets.ISO_8859_1);
78156
```
79157

80-
Or download the latest release from: https://github.com/Dannemann/java-string-compressor/releases
158+
159+
`BulkCompressor` is a "lower-level" utility where
160+
161+
162+
163+
164+
### Other
165+
Do not forget to check our JavaDocs for further information about each member.
166+
167+
168+
169+
170+
171+
172+
173+
174+
175+
176+
177+
178+
<br>
179+
<br>
180+
<br>
181+
<br>
182+
<br>
183+
<br>
184+
<br>
185+
<br>
186+
<br>
187+
<br>
188+
<br>
189+
<br>
190+
<br>
191+
<br>
192+
<br>
193+
<br>
194+
195+
196+
197+
198+
199+
200+
81201

82202

83203
