@@ -264,82 +264,8 @@ Check models below.
264264
265265## Download Model Files
266266
267- Download `FP16` quantized `Llama-3` `.gguf` files from:
268- - https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
269- - https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
270- - https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
271-
272- Download `FP16` quantized `Mistral` `.gguf` files from:
273- - https://huggingface.co/collections/beehive-lab/mistral-gpullama3java-684afabb206136d2e9cd47e0
274-
275- Download `FP16` quantized `Qwen3` `.gguf` files from:
276- - https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF
277- - https://huggingface.co/ggml-org/Qwen3-1.7B-GGUF
278- - https://huggingface.co/ggml-org/Qwen3-4B-GGUF
279- - https://huggingface.co/ggml-org/Qwen3-8B-GGUF
280-
281- Download `FP16` quantized `Qwen2.5` `.gguf` files from:
282- - https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF
283- - https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
284-
285- Download `FP16` quantized `DeepSeek-R1-Distill-Qwen` `.gguf` files from:
286- - https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
287-
288- Please be gentle with [huggingface.co](https://huggingface.co) servers:
289-
290- **Note**: FP16 models are first-class citizens in the current version.
291- ```
292- # Llama 3.2 (1B) - FP16
293- wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
294-
295- # Llama 3.2 (3B) - FP16
296- wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf
297-
298- # Llama 3.2 (8B) - FP16
299- wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
300-
301- # Mistral (7B) - FP16
302- wget https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.fp16.gguf
303-
304- # Qwen3 (0.6B) - FP16
305- wget https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-f16.gguf
306-
307- # Qwen3 (1.7B) - FP16
308- wget https://huggingface.co/ggml-org/Qwen3-1.7B-GGUF/resolve/main/Qwen3-1.7B-f16.gguf
309-
310- # Qwen3 (4B) - FP16
311- wget https://huggingface.co/ggml-org/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-f16.gguf
312-
313- # Qwen3 (8B) - FP16
314- wget https://huggingface.co/ggml-org/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-f16.gguf
315-
316- # Phi-3-mini-4k - FP16
317- wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
318-
319- # Qwen2.5 (0.5B)
320- wget https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/Qwen2.5-0.5B-Instruct-f16.gguf
321-
322- # Qwen2.5 (1.5B)
323- wget https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-fp16.gguf
324-
325- # DeepSeek-R1-Distill-Qwen (1.5B)
326- wget https://huggingface.co/hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
327- ```
328-
329- **[Experimental]** You can download the Q8_0 and Q4_0 models used in the original implementation of Llama3.java, but for now they are dequantized to FP16 on load for TornadoVM support:
330- ```
331- # Llama 3.2 (1B) - Q4_0
332- curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
333- # Llama 3.2 (3B) - Q4_0
334- curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
335- # Llama 3 (8B) - Q4_0
336- curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
337- # Llama 3.2 (1B) - Q8_0
338- curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
339- # Llama 3.1 (8B) - Q4_0
340- curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
341- ```
342-
267+ We provide a collection of models that we have tested on [Hugging Face](https://huggingface.co/beehive-lab/collections).
268+ However, any Llama3, Mistral, Qwen2, Qwen3, or Phi-3 model in `gguf` format can be used with **GPULlama3.java**.
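For example, the 1B FP16 Llama 3.2 model from that collection can be fetched directly; any other `.gguf` file can be downloaded the same way through its `resolve/main` URL:

```bash
# Llama 3.2 (1B) - FP16, from the beehive-lab collection on Hugging Face
wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
```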
343269-----------
344270
345271## Running ` llama-tornado `
@@ -413,120 +339,11 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
413339./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
414340```
415341
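As a rule of thumb, an FP16 model needs about 2 bytes per parameter (roughly 16 GB of weights for an 8B model) plus space for activations and the KV cache, so check the VRAM actually available before picking a value. On NVIDIA hardware, for example:

```bash
# Query the GPU name and total VRAM (NVIDIA only)
nvidia-smi --query-gpu=name,memory.total --format=csv
```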
416- ### GPU Memory Requirements by Model Size
417-
418- | Model Size | Recommended GPU Memory |
419- | -------------| ------------------------|
420- | 1B models | 7GB (default) |
421- | 3-7B models | 15GB |
422- | 8B models | 20GB+ |
423-
424- **Note**: If you still encounter memory issues, try:
425- 
426- 1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
427- 2. Closing other GPU-intensive applications running on your system.
428-
429342-----------
430343
431- ## Command Line Options
432-
433- Supported command-line options include:
434-
435- ```bash
436- cmd ➜ llama-tornado --help
437- usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
438- [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
439- [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
440- [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]
441-
442- GPU-accelerated LLaMA.java model runner using TornadoVM
443-
444- options:
445- -h, --help show this help message and exit
446- --model MODEL_PATH Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)
447-
448- LLaMA Configuration:
449- --prompt PROMPT Input prompt for the model (default: None)
450- -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
451- System prompt for the model (default: None)
452- --temperature TEMPERATURE
453- Sampling temperature (0.0 to 2.0) (default: 0.1)
454- --top-p TOP_P Top-p sampling parameter (default: 0.95)
455- --seed SEED Random seed (default: current timestamp) (default: None)
456- -n MAX_TOKENS, --max-tokens MAX_TOKENS
457- Maximum number of tokens to generate (default: 512)
458- --stream STREAM Enable streaming output (default: True)
459- --echo ECHO Echo the input prompt (default: False)
460- --suffix SUFFIX Suffix for fill-in-the-middle request (Codestral) (default: None)
461-
462- Mode Selection:
463- -i, --interactive Run in interactive/chat mode (default: False)
464- --instruct Run in instruction mode (default) (default: True)
465-
466- Hardware Configuration:
467- --gpu Enable GPU acceleration (default: False)
468- --opencl Use OpenCL backend (default) (default: None)
469- --ptx Use PTX/CUDA backend (default: None)
470- --gpu-memory GPU_MEMORY
471- GPU memory allocation (default: 7GB)
472- --heap-min HEAP_MIN Minimum JVM heap size (default: 20g)
473- --heap-max HEAP_MAX Maximum JVM heap size (default: 20g)
474-
475- Debug and Profiling:
476- --debug Enable debug output (default: False)
477- --profiler Enable TornadoVM profiler (default: False)
478- --profiler-dump-dir PROFILER_DUMP_DIR
479- Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)
480-
481- TornadoVM Execution Verbose:
482- --print-bytecodes Print bytecodes (tornado.print.bytecodes=true) (default: False)
483- --print-threads Print thread information (tornado.threadInfo=true) (default: False)
484- --print-kernel Print kernel information (tornado.printKernel=true) (default: False)
485- --full-dump Enable full debug dump (tornado.fullDebug=true) (default: False)
486- --verbose-init Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)
487-
488- Command Display Options:
489- --show-command Display the full Java command that will be executed (default: False)
490- --execute-after-show Execute the command after showing it (use with --show-command) (default: False)
491-
492- Advanced Options:
493- --opencl-flags OPENCL_FLAGS
494- OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
495- --max-wait-events MAX_WAIT_EVENTS
496- Maximum wait events for TornadoVM event pool (default: 32000)
497- --verbose, -v Verbose output (default: False)
498-
499- ```
500-
501- ## Debug & Profiling Options
502- View TornadoVM's internal behavior:
503- ```bash
504- # Print thread information during execution
505- ./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads
506- 
507- # Show bytecode compilation details
508- ./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes
509- 
510- # Display generated GPU kernel code
511- ./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel
512- 
513- # Enable full debug output with all details
514- ./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump
515- 
516- # Combine debug options
517- ./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
518- ```
519-
520- ## Current Features & Roadmap
344+ ## Miscellaneous
521345
522- - **Support for GGUF format models** with full FP16 support and partial support for Q8_0 and Q4_0 quantization.
523- - **Instruction-following and chat modes** for various use cases.
524- - **Interactive CLI** with `--interactive` and `--instruct` modes.
525- - **Flexible backend switching** - choose OpenCL or PTX at runtime (requires building TornadoVM with both backends enabled).
526- - **Cross-platform compatibility**:
527-   - ✅ NVIDIA GPUs (OpenCL & PTX)
528-   - ✅ Intel GPUs (OpenCL)
529-   - ✅ Apple GPUs (OpenCL)
346+ Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/RUN_DEBUB.md) for more tips on running and debugging, including how to use the `./llama-tornado` CLI to run models with different flags.
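As a quick reference, a typical invocation combines the hardware and generation flags listed by `llama-tornado --help`, for example:

```bash
# Chat interactively on the GPU with a fixed seed and a larger token budget
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --interactive --seed 42 --max-tokens 1024
```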
530347
531348Click [here](https://github.com/beehive-lab/GPULlama3.java/tree/main/docs/TORNADOVM_TRANSFORMER_OPTIMIZATIONS.md) to view a more detailed list of the transformer optimizations implemented in TornadoVM.
532349