Commit 95c9459

Update llm finetune scripts and doc (#4722)
1 parent 4847352 commit 95c9459

18 files changed: +576 -192 lines changed

README.md

Lines changed: 23 additions & 11 deletions
@@ -46,28 +46,40 @@ In the current technological landscape, Generative AI (GenAI) workloads and mode
 
 #### LLM fine-tuning
 
-**Note**:
-Intel® Data Center Max 1550 GPU: supports all the models in the model list above. Intel® Core™ Ultra Processors with Intel® Arc™ Graphics: supports Llama 2 7B, Llama 3 8B and Phi-3-Mini 3.8B.
-
-| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA | Intel® Data Center Max 1550 GPU | Intel® Core™ Ultra Processors with Intel® Arc™ Graphics |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Llama 2 7B | "meta-llama/Llama-2-7b-hf" | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
-| Llama 2 70B | "meta-llama/Llama-2-70b-hf" | 🟩 | 🟥 | 🟩 | 🟩 | 🟥 |
-| Llama 3 8B | "meta-llama/Meta-Llama-3-8B" | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
-| Qwen 7B | "Qwen/Qwen-7B" | 🟩 | 🟩 | 🟩 | 🟩 | 🟥 |
-| Phi-3-mini 3.8B | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 | 🟥 | 🟩 |
+##### LLM fine-tuning optimized with Intel® Data Center Max 1550 GPU on Linux
 
 
 | Benchmark mode | Full fine-tuning | LoRA |
 |---|:---:|:---:|
 | Single-GPU | 🟥 | 🟩 |
 | Multi-GPU (FSDP) | 🟩 | 🟩 |
 
+| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA |
+|---|:---:|:---:|:---:|:---:|
+| [Llama 2 7B](./Llama2/README.md) | "meta-llama/Llama-2-7b-hf" | 🟩 | 🟩 | 🟩 |
+| [Llama 2 70B](./Llama2/README.md) | "meta-llama/Llama-2-70b-hf" | 🟩 | 🟥 | 🟩 |
+| [Llama 3 8B](./Llama3/README.md) | "meta-llama/Meta-Llama-3-8B" | 🟩 | 🟩 | 🟩 |
+| [Llama 3 70B](./Llama3/README.md) | "meta-llama/Meta-Llama-3-70B" | 🟩 | 🟥 | 🟩 |
+| [Qwen 7B](./Qwen/README.md) | "Qwen/Qwen-7B" | 🟩 | 🟩 | 🟩 |
+| [Phi-3-mini 3.8B](./Phi3/README.md#fine-tuning-on-intel-data-center-max-1550-gpu-on-linux) | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 |
+
+
+\* Intel® Data Center Max 1550 GPU: supports all the models in the list above.
+
+##### LLM fine-tuning optimized with Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
+
+| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA |
+|---|:---:|:---:|:---:|:---:|
+| [Phi-3-mini 3.8B](./Phi3/README.md#fine-tuning-on-intel-core-ultra-processors-with-intel-arc-graphics) | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 |
+
+
 - 🟩 signifies that it is supported.
 
 - 🟥 signifies that it is not supported yet.
 
+\* Intel® Core™ Ultra Processors with Intel® Arc™ Graphics: supports Phi-3-Mini 3.8B.
+
+
 
 ## Installation
 

docs/tutorials/features/torch_compile_gpu.md

Lines changed: 12 additions & 0 deletions
@@ -13,6 +13,9 @@ Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph c
 - `intel_extension_for_pytorch` : v2.3
 - `triton` : >= v3.0.0
 
+
+Install [Intel® oneAPI Base Toolkit 2024.2.1](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
+
 Follow [Intel® Extension for PyTorch\* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) to install `torch` and `intel_extension_for_pytorch` first.
 
 Triton can be installed directly using the following command:
@@ -21,8 +24,17 @@ Triton can be installed directly using the following command:
 pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu
 ```
 
+Remember to activate the oneAPI base toolkit with the following commands.
+
+```bash
+# {dpcpproot} is the DPC++ root path, i.e. where you installed the oneAPI DPC++ compiler;
+# usually /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
+source {dpcpproot}/env/vars.sh
+```
+
+
 # Example Usage
 
+
 ## Inference with torch.compile
 
 ```python
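
The hunk ends at the opening of the doc's own example. For orientation, a minimal `torch.compile` inference sketch for an XPU device might look like the following; it assumes a working `torch` + `intel_extension_for_pytorch` install, and the toy model and tensor shapes are illustrative, not taken from the file.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the "xpu" device

# Illustrative model; any nn.Module is handled the same way.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).to("xpu")
model.eval()

# Compilation is lazy: Triton kernels are generated on the first call.
compiled_model = torch.compile(model)

x = torch.randn(8, 64, device="xpu")
with torch.no_grad():
    y = compiled_model(x)
print(y.shape)  # torch.Size([8, 64])
```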

docs/tutorials/llm.rst

Lines changed: 19 additions & 13 deletions
@@ -67,8 +67,8 @@ LLM Inference
 
 *Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLM families, we are still working to cover those optimizations, which will expand the model list above.
 
-LLM fine-tuning
-~~~~~~~~~~~~~~~
+LLM fine-tuning on Intel® Data Center Max 1550 GPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. list-table::
    :widths: auto
@@ -79,42 +79,48 @@ LLM fine-tuning
      - Mixed Precision (BF16+FP32)
      - Full fine-tuning
      - LoRA
-     - Intel® Data Center Max 1550 GPU
-     - Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
    * - Llama2
      - "meta-llama/Llama-2-7b-hf"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ✅
    * - Llama2
      - "meta-llama/Llama-2-70b-hf"
      - ✅
      - ❎
      - ✅
-     - ✅
-     - ❎
    * - Llama3
      - "meta-llama/Meta-Llama-3-8B"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ✅
    * - Qwen
      - "Qwen/Qwen-7B"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ❎
    * - Phi-3-mini 3.8B
      - "Phi-3-mini-4k-instruct"
      - ✅
      - ✅
      - ✅
-     - ❎
+
+LLM fine-tuning on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. list-table::
+   :widths: auto
+   :header-rows: 1
+
+   * - Model Family
+     - Verified < MODEL ID > (Huggingface hub)
+     - Mixed Precision (BF16+FP32)
+     - Full fine-tuning
+     - LoRA
+   * - Phi-3-mini 3.8B
+     - "Phi-3-mini-4k-instruct"
+     - ✅
+     - ✅
      - ✅
 
 Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.3.110/examples/gpu/llm>`_ for instructions to install/set up the environment and example scripts.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+## Llama2 fine-tuning
+
+
+
+### Download a Model
+During execution, you may need to log in to your Hugging Face account to download model files. Refer to [HuggingFace Login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
+
+```
+huggingface-cli login --token <your_token_here>
+```
+
+**Note**: If you have downloaded a Llama2 model from Meta's official GitHub, you can also convert it to Hugging Face format by following the [guide](https://huggingface.co/docs/transformers/main/en/model_doc/llama2#usage-tips).
+
+### Download a Dataset
+
+You can get the Alpaca dataset here: [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json).
+```
+git clone https://github.com/tatsu-lab/stanford_alpaca
+cd stanford_alpaca
+mv alpaca_data.json <Llama2_folder>
+```
+
+
+**Note**: During execution, you need to log in to your wandb account. Refer to [Wandb Login](https://docs.wandb.ai/ref/cli/wandb-login).
+```
+wandb login
+```
+
+### Fine-tuning on multi-GPU
+
+**Note**:
+The default `fsdp_config.yaml` is set for 1 machine with 4 cards (8 tiles). If you use a different setup, change `num_processes` accordingly.
+
+#### Full fine-tuning
+
+
+Example: Llama 2 7B full fine-tuning with the Alpaca dataset; you can change the model name/path to use another Llama2 model.
+
+
+**Note**:
+We provide examples for the Alpaca dataset (52k samples) and the guanaco-llama2-1k dataset from Hugging Face. We recommend the [Alpaca dataset](#download-a-dataset), which has been recognized by some popular projects.
+Removing the `--data_path` flag from the fine-tuning command makes `llama2_ft.py` load the guanaco-llama2-1k dataset from Hugging Face by default.
+
+
+```bash
+export CCL_PROCESS_LAUNCHER=none
+export TORCH_LLM_ALLREDUCE=1
+
+export model='meta-llama/Llama-2-7b-hf'
+
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
+    --model_name_or_path ${model} \
+    --data_path ./alpaca_data.json \
+    --bf16 True \
+    --use_flashattn True \
+    --output_dir ./result \
+    --num_train_epochs 3 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 4 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 2000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --optim "adamw_torch_fused"
+```
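
As a sanity check on the flags above: the effective global batch size is the per-device batch size times the gradient accumulation steps times the number of processes. A small worked example, assuming the default 8-process launch (1 machine, 4 cards, 2 tiles each) from `fsdp_config.yaml`:

```python
# Assumed launch topology; adjust num_processes to your accelerate config.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_processes = 8

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
print(global_batch_size)  # 128
```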
+
+
+#### LoRA fine-tuning
+
+Example: Llama 2 7B LoRA fine-tuning with the Alpaca dataset; you can change the model name/path to use another Llama2 model.
+
+**Note**:
+We provide examples for the Alpaca dataset (52k samples) and the guanaco-llama2-1k dataset from Hugging Face. We recommend the [Alpaca dataset](#download-a-dataset), which has been recognized by some popular projects.
+Removing the `--data_path` flag from the fine-tuning command makes `llama2_ft.py` load the guanaco-llama2-1k dataset from Hugging Face by default.
+
+
+```bash
+export CCL_PROCESS_LAUNCHER=none
+export TORCH_LLM_ALLREDUCE=1
+
+export model='meta-llama/Llama-2-7b-hf'
+
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
+    --model_name_or_path ${model} \
+    --data_path ./alpaca_data.json \
+    --bf16 True \
+    --use_flashattn True \
+    --use_peft True \
+    --output_dir ./result \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 2000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --optim "adamw_torch_fused"
+```
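
`llama2_ft.py` itself is not shown in this diff, but a `--use_peft True` code path typically wraps the base model with Hugging Face `peft` before handing it to the trainer. A rough sketch, where the rank, alpha, and target modules are assumptions rather than the script's actual values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model in bf16, matching the --bf16 True flag above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice for Llama attention
    task_type="CAUSAL_LM",
)

# Freeze the base weights and inject trainable low-rank adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights receive gradients during fine-tuning, which is why LoRA remains feasible for Llama 2 70B while full fine-tuning is marked unsupported in the tables above.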
File renamed without changes.

examples/gpu/llm/fine-tuning/Llama2/run_llama2_70b_fsdp.sh

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ export TORCH_LLM_ALLREDUCE=1
 # llama2-70b alpaca dataset peft lora
 Run_llama2-70b_fsdp_alpaca_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -31,7 +31,7 @@ Run_llama2-70b_fsdp_alpaca_dataset_peft() {
 # llama2-70b huggingface dataset peft lora
 Run_llama2-70b_fsdp_huggingface_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \
@@ -56,7 +56,7 @@ Run_llama2-70b_fsdp_huggingface_dataset_peft() {
 
 main() {
 
-model=meta-llama/Llama-2-70b-hf
+model="meta-llama/Llama-2-70b-hf"
 
 Run_llama2-70b_fsdp_alpaca_dataset_peft
 # Run_llama2-70b_fsdp_huggingface_dataset_peft

examples/gpu/llm/fine-tuning/Llama2/run_llama2_7b_fsdp.sh

Lines changed: 4 additions & 4 deletions
@@ -7,7 +7,7 @@ export TORCH_LLM_ALLREDUCE=1
 ## alpaca dataset full-ft
 Run_llama2-7b_fsdp_alpaca_dataset() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -33,7 +33,7 @@ Run_llama2-7b_fsdp_alpaca_dataset() {
 ## alpaca dataset peft lora
 Run_llama2-7b_fsdp_alpaca_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -59,7 +59,7 @@ Run_llama2-7b_fsdp_alpaca_dataset_peft() {
 ## huggingface dataset full-ft
 Run_llama2-7b_fsdp_huggingface_dataset() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \
@@ -85,7 +85,7 @@ Run_llama2-7b_fsdp_huggingface_dataset() {
 ## huggingface dataset peft lora
 Run_llama2-7b_fsdp_huggingface_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \

examples/gpu/llm/fine-tuning/Llama2/run_llama2_converge.sh

Lines changed: 3 additions & 16 deletions
@@ -1,26 +1,13 @@
-
 export CCL_PROCESS_LAUNCHER=none
 
-# profiling set
-# export PROFILE=1
-# export KINETO=1
-
 # settings for torch-ccl
 export TORCH_LLM_ALLREDUCE=1
 
-# torch-ccl verbose
-# export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1
-
-# oneccl runtime
-# source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
-# source /home2/zhuhong/LLM/ccl-inference-dev-3/build/_install/env/setvars.sh
-
 ## alpaca dataset full-ft
 Run_llama2-7b_fsdp_alpaca_converge() {
 
 model='meta-llama/Llama-2-7b-hf'
-
-torchrun --nproc_per_node=8 --master_port='29900' train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -50,7 +37,7 @@ Run_llama2-7b_fsdp_alpaca_peft_converge() {
 
 model='meta-llama/Llama-2-7b-hf'
 
-torchrun --nproc_per_node=8 --master_port='29900' train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -81,7 +68,7 @@ Run_llama2-70b_fsdp_alpaca_peft_converge() {
 
 model='meta-llama/Llama-2-70b-hf'
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
