
Commit 7ce5445

[Docs] Add architecture diagrams and refine documentation
- Introduced `gpullama3-architecture.svg` and `gpullama3-architecture-light.svg` diagrams.
- Improved README with a simplified model collection reference.
- Moved detailed GPU requirements and CLI options to `RUN_DEBUG.md` for clarity.
1 parent 5025c97 commit 7ce5445

File tree

5 files changed: +499 -187 lines changed


README.md

Lines changed: 4 additions & 187 deletions
@@ -264,82 +264,8 @@ Check models below.
 
 ## Download Model Files
 
-Download `FP16` quantized `Llama-3` .gguf files from:
-- https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
-- https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
-- https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
-
-Download `FP16` quantized `Mistral` .gguf files from:
-- https://huggingface.co/collections/beehive-lab/mistral-gpullama3java-684afabb206136d2e9cd47e0
-
-Download `FP16` quantized `Qwen3` .gguf files from:
-- https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-1.7B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-4B-GGUF
-- https://huggingface.co/ggml-org/Qwen3-8B-GGUF
-
-Download `FP16` quantized `Qwen2.5` .gguf files from:
-- https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF
-- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
-
-Download `FP16` quantized `DeepSeek-R1-Distill-Qwen` .gguf files from:
-- https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
-
-Please be gentle with [huggingface.co](https://huggingface.co) servers:
-
-**Note** FP16 models are first-class citizens for the current version.
-```
-# Llama 3.2 (1B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
-
-# Llama 3.2 (3B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf
-
-# Llama 3 (8B) - FP16
-wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
-
-# Mistral (7B) - FP16
-wget https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.fp16.gguf
-
-# Qwen3 (0.6B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-f16.gguf
-
-# Qwen3 (1.7B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-1.7B-f16.gguf
-
-# Qwen3 (4B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-4B-f16.gguf
-
-# Qwen3 (8B) - FP16
-wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-8B-f16.gguf
-
-# Phi-3-mini-4k - FP16
-wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
-
-# Qwen2.5 (0.5B)
-wget https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/Qwen2.5-0.5B-Instruct-f16.gguf
-
-# Qwen2.5 (1.5B)
-wget https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-fp16.gguf
-
-# DeepSeek-R1-Distill-Qwen (1.5B)
-wget https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
-```
-
-**[Experimental]** you can download the Q8 and Q4 used in the original implementation of Llama3.java, but for now are going to be dequanted to FP16 for TornadoVM support:
-```
-# Llama 3.2 (1B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
-# Llama 3.2 (3B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
-# Llama 3 (8B) - Q4_0
-curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
-# Llama 3.2 (1B) - Q8_0
-curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
-# Llama 3.1 (8B) - Q8_0
-curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
-```
-
+We provide a collection of models tested by us on [Hugging Face](https://huggingface.co/beehive-lab/collections).
+However, any Llama3, Mistral, Qwen2, Qwen3, or Phi-3 model in `gguf` format can be used with **GPULlama3.java**.
 -----------
 
 ## Running `llama-tornado`
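The two lines added above replace the long download listing with a pointer to the collection of tested models. As a minimal sketch of the workflow they describe (the download URL and file name are copied from the listing removed above, and the flags from the CLI reference in `docs/RUN_DEBUG.md`), a first run could look like:

```bash
# Fetch one of the tested FP16 models (URL taken from the list that was removed above)
wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf

# Run it on the GPU with a short prompt
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Tell me a joke"
```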
@@ -413,120 +339,11 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
 ./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
 ```
 
-### GPU Memory Requirements by Model Size
-
-| Model Size  | Recommended GPU Memory |
-|-------------|------------------------|
-| 1B models   | 7GB (default)          |
-| 3-7B models | 15GB                   |
-| 8B models   | 20GB+                  |
-
-**Note**: If you still encounter memory issues, try:
-
-1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
-2. Closing other GPU-intensive applications in your system.
-
 -----------
 
-## Command Line Options
-
-Supported command-line options include:
-
-```bash
-cmd ➜ llama-tornado --help
-usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
-                     [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
-                     [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
-                     [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]
-
-GPU-accelerated LLaMA.java model runner using TornadoVM
-
-options:
-  -h, --help            show this help message and exit
-  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)
-
-LLaMA Configuration:
-  --prompt PROMPT       Input prompt for the model (default: None)
-  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
-                        System prompt for the model (default: None)
-  --temperature TEMPERATURE
-                        Sampling temperature (0.0 to 2.0) (default: 0.1)
-  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
-  --seed SEED           Random seed (default: current timestamp) (default: None)
-  -n MAX_TOKENS, --max-tokens MAX_TOKENS
-                        Maximum number of tokens to generate (default: 512)
-  --stream STREAM       Enable streaming output (default: True)
-  --echo ECHO           Echo the input prompt (default: False)
-  --suffix SUFFIX       Suffix for fill-in-the-middle request (Codestral) (default: None)
-
-Mode Selection:
-  -i, --interactive     Run in interactive/chat mode (default: False)
-  --instruct            Run in instruction mode (default) (default: True)
-
-Hardware Configuration:
-  --gpu                 Enable GPU acceleration (default: False)
-  --opencl              Use OpenCL backend (default) (default: None)
-  --ptx                 Use PTX/CUDA backend (default: None)
-  --gpu-memory GPU_MEMORY
-                        GPU memory allocation (default: 7GB)
-  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
-  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)
-
-Debug and Profiling:
-  --debug               Enable debug output (default: False)
-  --profiler            Enable TornadoVM profiler (default: False)
-  --profiler-dump-dir PROFILER_DUMP_DIR
-                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)
-
-TornadoVM Execution Verbose:
-  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
-  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
-  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
-  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
-  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)
-
-Command Display Options:
-  --show-command        Display the full Java command that will be executed (default: False)
-  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)
-
-Advanced Options:
-  --opencl-flags OPENCL_FLAGS
-                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
-  --max-wait-events MAX_WAIT_EVENTS
-                        Maximum wait events for TornadoVM event pool (default: 32000)
-  --verbose, -v         Verbose output (default: False)
-
-```
-
-## Debug & Profiling Options
-View TornadoVM's internal behavior:
-```bash
-# Print thread information during execution
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads
-
-# Show bytecode compilation details
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes
-
-# Display generated GPU kernel code
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel
-
-# Enable full debug output with all details
-./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump
-
-# Combine debug options
-./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
-```
-
-## Current Features & Roadmap
+## Miscellaneous
 
-- **Support for GGUF format models** with full FP16 and partial support for Q8_0 and Q4_0 quantization.
-- **Instruction-following and chat modes** for various use cases.
-- **Interactive CLI** with `--interactive` and `--instruct` modes.
-- **Flexible backend switching** - choose OpenCL or PTX at runtime (need to build TornadoVM with both enabled).
-- **Cross-platform compatibility**:
-  - ✅ NVIDIA GPUs (OpenCL & PTX)
-  - ✅ Intel GPUs (OpenCL)
-  - ✅ Apple GPUs (OpenCL)
346+
Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/RUN_DEBUB.md) for more run and debugging tips, also how to use the ./llama-tornado cli to run the model with different flags.
 
 Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/TORNADOVM_TRANSFORMER_OPTIMIZATIONS.md) to view a more detailed list of the transformer optimizations implemented in TornadoVM.
 
-1.22 MB (binary file not shown)

docs/RUN_DEBUG.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
## GPU Memory Requirements by Model Size

| Model Size  | Recommended GPU Memory |
|-------------|------------------------|
| 1B models   | 7GB (default)          |
| 3-7B models | 15GB                   |
| 8B models   | 20GB+                  |

**Note**: If you still encounter memory issues, try:

1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
2. Closing other GPU-intensive applications in your system.

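As a quick illustration of the table above, the allocation can be raised explicitly with `--gpu-memory` (the command below is the README example for an 8B model; the model file name is only a placeholder):

```bash
# 8B models need roughly 20GB of GPU memory, well above the 7GB default allocation
./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
```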
## Command Line Options

Supported command-line options include:

```bash
cmd ➜ llama-tornado --help
usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
                     [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
                     [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
                     [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]

GPU-accelerated LLaMA.java model runner using TornadoVM

options:
  -h, --help            show this help message and exit
  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)

LLaMA Configuration:
  --prompt PROMPT       Input prompt for the model (default: None)
  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
                        System prompt for the model (default: None)
  --temperature TEMPERATURE
                        Sampling temperature (0.0 to 2.0) (default: 0.1)
  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
  --seed SEED           Random seed (default: current timestamp) (default: None)
  -n MAX_TOKENS, --max-tokens MAX_TOKENS
                        Maximum number of tokens to generate (default: 512)
  --stream STREAM       Enable streaming output (default: True)
  --echo ECHO           Echo the input prompt (default: False)
  --suffix SUFFIX       Suffix for fill-in-the-middle request (Codestral) (default: None)

Mode Selection:
  -i, --interactive     Run in interactive/chat mode (default: False)
  --instruct            Run in instruction mode (default) (default: True)

Hardware Configuration:
  --gpu                 Enable GPU acceleration (default: False)
  --opencl              Use OpenCL backend (default) (default: None)
  --ptx                 Use PTX/CUDA backend (default: None)
  --gpu-memory GPU_MEMORY
                        GPU memory allocation (default: 7GB)
  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)

Debug and Profiling:
  --debug               Enable debug output (default: False)
  --profiler            Enable TornadoVM profiler (default: False)
  --profiler-dump-dir PROFILER_DUMP_DIR
                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)

TornadoVM Execution Verbose:
  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)

Command Display Options:
  --show-command        Display the full Java command that will be executed (default: False)
  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)

Advanced Options:
  --opencl-flags OPENCL_FLAGS
                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
  --max-wait-events MAX_WAIT_EVENTS
                        Maximum wait events for TornadoVM event pool (default: 32000)
  --verbose, -v         Verbose output (default: False)

```

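As a hedged sketch of how the options above compose (the model file name is a placeholder; every flag is taken from the help text), a typical single-shot run might look like:

```bash
# Sampling flags combined with GPU execution; --show-command prints the underlying
# Java invocation and --execute-after-show then runs it
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf \
    --prompt "Explain TornadoVM in one paragraph" \
    --temperature 0.1 --top-p 0.95 -n 256 --seed 42 \
    --show-command --execute-after-show
```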
## Debug & Profiling Options
View TornadoVM's internal behavior:
```bash
# Print thread information during execution
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads

# Show bytecode compilation details
./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes

# Display generated GPU kernel code
./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel

# Enable full debug output with all details
./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump

# Combine debug options
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
```
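The profiler switches from the CLI reference can be combined in the same way (a sketch; the dump path below is a placeholder):

```bash
# Capture TornadoVM profiler output into a JSON file of your choice
./llama-tornado --gpu --model model.gguf --prompt "..." --profiler --profiler-dump-dir /tmp/llama-prof.json
```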
