
Commit 7ce5445

[Docs] Add architecture diagrams and refine documentation
- Introduced `gpullama3-architecture.svg` and `gpullama3-architecture-light.svg` diagrams.
- Improved README with a simplified model collection reference.
- Moved detailed GPU requirements and CLI options to `RUN_DEBUG.md` for clarity.
1 parent 5025c97 commit 7ce5445

File tree

5 files changed: +499 -187 lines changed


README.md

Lines changed: 4 additions & 187 deletions
@@ -264,82 +264,8 @@ Check models below.
 
 ## Download Model Files
 
-Download `FP16` quantized `Llama-3` .gguf files from:
-- https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
-- https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
-- https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
-
-Download `FP16` quantized `Mistral` .gguf files from:
-- https://huggingface.co/collections/beehive-lab/mistral-gpullama3java-684afabb206136d2e9cd47e0
-
-Download `FP16` quantized `Qwen3` .gguf files from:
-- https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-1.7B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-4B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-8B-GGUF
-
-Download `FP16` quantized `Qwen2.5` .gguf files from:
-- https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF
-- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
-
-Download `FP16` quantized `DeepSeek-R1-Distill-Qwen` .gguf files from:
-- https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
-
-Please be gentle with [huggingface.co](https://huggingface.co) servers:
-
-**Note** FP16 models are first-class citizens for the current version.
-```
-# Llama 3.2 (1B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
-
-# Llama 3.2 (3B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf
-
-# Llama 3 (8B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
-
-# Mistral (7B) - FP16
-wget https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.fp16.gguf
-
-# Qwen3 (0.6B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-f16.gguf
-
-# Qwen3 (1.7B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-1.7B-f16.gguf
-
-# Qwen3 (4B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-4B-f16.gguf
-
-# Qwen3 (8B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-8B-f16.gguf
-
-# Phi-3-mini-4k - FP16
-wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
-
-# Qwen2.5 (0.5B)
-wget https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/Qwen2.5-0.5B-Instruct-f16.gguf
-
-# Qwen2.5 (1.5B)
-wget https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-fp16.gguf
-
-# DeepSeek-R1-Distill-Qwen (1.5B)
-wget https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
-```
-
-**[Experimental]** you can download the Q8 and Q4 used in the original implementation of Llama3.java, but for now are going to be dequanted to FP16 for TornadoVM support:
-```
-# Llama 3.2 (1B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
-# Llama 3.2 (3B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
-# Llama 3 (8B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
-# Llama 3.2 (1B) - Q8_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
-# Llama 3.1 (8B) - Q8_0
-curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
-```
-
+We provide a collection of models tested by us on [Hugging Face](https://huggingface.co/beehive-lab/collections).
+However, any Llama3, Mistral, Qwen2, Qwen3, or Phi-3 model in `gguf` format can be used with **GPULlama3.java**.
 -----------
 
 ## Running `llama-tornado`
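The two lines added above replace the long download listing with a pointer to the collection of tested models. As a minimal sketch of the workflow they describe (the download URL and file name are copied from the listing removed above, and the flags from the CLI reference in `docs/RUN_DEBUG.md`), a first run could look like:

```bash
# Fetch one of the tested FP16 models (URL taken from the list that was removed above)
wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf

# Run it on the GPU with a short prompt
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Tell me a joke"
```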
@@ -413,120 +339,11 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
 ./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
 ```
 
-### GPU Memory Requirements by Model Size
-
-| Model Size  | Recommended GPU Memory |
-|-------------|------------------------|
-| 1B models   | 7GB (default)          |
-| 3-7B models | 15GB                   |
-| 8B models   | 20GB+                  |
-
-**Note**: If you still encounter memory issues, try:
-
-1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
-2. Closing other GPU-intensive applications in your system.
-
 -----------
 
-## Command Line Options
-
-Supported command-line options include:
-
-```bash
-cmd ➜ llama-tornado --help
-usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
-                     [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
-                     [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
-                     [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]
-
-GPU-accelerated LLaMA.java model runner using TornadoVM
-
-options:
-  -h, --help            show this help message and exit
-  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)
-
-LLaMA Configuration:
-  --prompt PROMPT       Input prompt for the model (default: None)
-  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
-                        System prompt for the model (default: None)
-  --temperature TEMPERATURE
-                        Sampling temperature (0.0 to 2.0) (default: 0.1)
-  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
-  --seed SEED           Random seed (default: current timestamp) (default: None)
-  -n MAX_TOKENS, --max-tokens MAX_TOKENS
-                        Maximum number of tokens to generate (default: 512)
-  --stream STREAM       Enable streaming output (default: True)
-  --echo ECHO           Echo the input prompt (default: False)
-  --suffix SUFFIX       Suffix for fill-in-the-middle request (Codestral) (default: None)
-
-Mode Selection:
-  -i, --interactive     Run in interactive/chat mode (default: False)
-  --instruct            Run in instruction mode (default) (default: True)
-
-Hardware Configuration:
-  --gpu                 Enable GPU acceleration (default: False)
-  --opencl              Use OpenCL backend (default) (default: None)
-  --ptx                 Use PTX/CUDA backend (default: None)
-  --gpu-memory GPU_MEMORY
-                        GPU memory allocation (default: 7GB)
-  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
-  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)
-
-Debug and Profiling:
-  --debug               Enable debug output (default: False)
-  --profiler            Enable TornadoVM profiler (default: False)
-  --profiler-dump-dir PROFILER_DUMP_DIR
-                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)
-
-TornadoVM Execution Verbose:
-  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
-  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
-  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
-  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
-  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)
-
-Command Display Options:
-  --show-command        Display the full Java command that will be executed (default: False)
-  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)
-
-Advanced Options:
-  --opencl-flags OPENCL_FLAGS
-                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
-  --max-wait-events MAX_WAIT_EVENTS
-                        Maximum wait events for TornadoVM event pool (default: 32000)
-  --verbose, -v         Verbose output (default: False)
-
-```
-
-## Debug & Profiling Options
-View TornadoVM's internal behavior:
-```bash
-# Print thread information during execution
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads
-
-# Show bytecode compilation details
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes
-
-# Display generated GPU kernel code
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel
-
-# Enable full debug output with all details
-./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump
-
-# Combine debug options
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
-```
-
-## Current Features & Roadmap
+## Miscellaneous
 
-- **Support for GGUF format models** with full FP16 and partial support for Q8_0 and Q4_0 quantization.
-- **Instruction-following and chat modes** for various use cases.
-- **Interactive CLI** with `--interactive` and `--instruct` modes.
-- **Flexible backend switching** - choose OpenCL or PTX at runtime (need to build TornadoVM with both enabled).
-- **Cross-platform compatibility**:
-  - ✅ NVIDIA GPUs (OpenCL & PTX)
-  - ✅ Intel GPUs (OpenCL)
-  - ✅ Apple GPUs (OpenCL)
346+
Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/RUN_DEBUB.md) for more run and debugging tips, also how to use the ./llama-tornado cli to run the model with different flags.
 
 Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/TORNADOVM_TRANSFORMER_OPTIMIZATIONS.md) to view a more detailed list of the transformer optimizations implemented in TornadoVM.
 
-1.22 MB (binary file not shown)

docs/RUN_DEBUG.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
## GPU Memory Requirements by Model Size

| Model Size  | Recommended GPU Memory |
|-------------|------------------------|
| 1B models   | 7GB (default)          |
| 3-7B models | 15GB                   |
| 8B models   | 20GB+                  |

**Note**: If you still encounter memory issues, try:

1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
2. Closing other GPU-intensive applications in your system.

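As a quick illustration of the table above, the allocation can be raised explicitly with `--gpu-memory` (the command below is the README example for an 8B model; the model file name is only a placeholder):

```bash
# 8B models need roughly 20GB of GPU memory, well above the 7GB default allocation
./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
```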
## Command Line Options

Supported command-line options include:

```bash
cmd ➜ llama-tornado --help
usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
                     [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
                     [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
                     [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]

GPU-accelerated LLaMA.java model runner using TornadoVM

options:
  -h, --help            show this help message and exit
  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)

LLaMA Configuration:
  --prompt PROMPT       Input prompt for the model (default: None)
  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
                        System prompt for the model (default: None)
  --temperature TEMPERATURE
                        Sampling temperature (0.0 to 2.0) (default: 0.1)
  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
  --seed SEED           Random seed (default: current timestamp) (default: None)
  -n MAX_TOKENS, --max-tokens MAX_TOKENS
                        Maximum number of tokens to generate (default: 512)
  --stream STREAM       Enable streaming output (default: True)
  --echo ECHO           Echo the input prompt (default: False)
  --suffix SUFFIX       Suffix for fill-in-the-middle request (Codestral) (default: None)

Mode Selection:
  -i, --interactive     Run in interactive/chat mode (default: False)
  --instruct            Run in instruction mode (default) (default: True)

Hardware Configuration:
  --gpu                 Enable GPU acceleration (default: False)
  --opencl              Use OpenCL backend (default) (default: None)
  --ptx                 Use PTX/CUDA backend (default: None)
  --gpu-memory GPU_MEMORY
                        GPU memory allocation (default: 7GB)
  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)

Debug and Profiling:
  --debug               Enable debug output (default: False)
  --profiler            Enable TornadoVM profiler (default: False)
  --profiler-dump-dir PROFILER_DUMP_DIR
                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)

TornadoVM Execution Verbose:
  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)

Command Display Options:
  --show-command        Display the full Java command that will be executed (default: False)
  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)

Advanced Options:
  --opencl-flags OPENCL_FLAGS
                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
  --max-wait-events MAX_WAIT_EVENTS
                        Maximum wait events for TornadoVM event pool (default: 32000)
  --verbose, -v         Verbose output (default: False)

```

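As a hedged sketch of how the options above compose (the model file name is a placeholder; every flag is taken from the help text), a typical single-shot run might look like:

```bash
# Sampling flags combined with GPU execution; --show-command prints the underlying
# Java invocation and --execute-after-show then runs it
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf \
    --prompt "Explain TornadoVM in one paragraph" \
    --temperature 0.1 --top-p 0.95 -n 256 --seed 42 \
    --show-command --execute-after-show
```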
## Debug & Profiling Options
View TornadoVM's internal behavior:
```bash
# Print thread information during execution
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads

# Show bytecode compilation details
./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes

# Display generated GPU kernel code
./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel

# Enable full debug output with all details
./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump

# Combine debug options
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
```
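The profiler switches from the CLI reference can be combined in the same way (a sketch; the dump path below is a placeholder):

```bash
# Capture TornadoVM profiler output into a JSON file of your choice
./llama-tornado --gpu --model model.gguf --prompt "..." --profiler --profiler-dump-dir /tmp/llama-prof.json
```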
