
Commit 90fdb70

Update llm doc and scripts (#5076)

* update example/llm README
* update llm inference README with llama optimization guide for both fp16 and woq
* install inc instead of inc&itrex
* woq run with xetla path
* remove accuracy failed log, verified with yuhua
* run with static cache, better perf
* add static cache for woq example
* remove wa OCL_ICD_VENDORS
* remove OCL_ICD_VENDORS in docs
* remove CCL_ROOT setting in docs
* add inc in inference requirements.txt
* add static cache in fp16/woq and add comment to explain it
* install and activate standalone dpcpp compiler for using torch.compile
* update from jun
* update transforemrs to 4.44.2
* ensure param format consistency for bash
* update optimize_transformers to ipex.llm.optimize
* remove IPEX_COMPUTE_ENGINE for common case, add it only in save quantized model scenario
* update optimize_transformers to ipex.llm.optimize
* Create README.md for cpp example
* update triton doc
* update for torch compile
* update known_issue for triton installation, link known issue to torch.compile guide
* update known issues
* fix typo
* format fix
* update 2 scenario for triton library issue
* update triton related issue
* Update torch_compile_gpu.md inference example
* Update requirements.txt for accelerate 1.1.1
* Update requirements.txt to specify huggingface-hub==0.25.2

1 parent d4d41f9 commit 90fdb70

20 files changed: +468 / -236 lines

docs/tutorials/features/torch_compile_gpu.md

Lines changed: 37 additions & 9 deletions
@@ -9,22 +9,22 @@ Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph c
 # Required Dependencies
 
 **Verified version**:
-- `torch` : v2.3
-- `intel_extension_for_pytorch` : v2.3
-- `triton` : >= v3.0.0
+- `torch` : v2.5
+- `intel_extension_for_pytorch` : v2.5
+- `triton` : v3.1.0+91b14bf559
 
 
-Install [Intel® oneAPI Base Toolkit 2024.2.1](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
+Install [Intel® oneAPI DPC++/C++ Compiler 2025.0.4](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler-download.html).
 
 Follow [Intel® Extension for PyTorch\* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) to install `torch` and `intel_extension_for_pytorch` firstly.
 
 Triton could be directly installed using the following command:
 
 ```Bash
-pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu
+pip install --pre pytorch-triton-xpu==3.1.0+91b14bf559 --index-url https://download.pytorch.org/whl/nightly/xpu
 ```
 
-Remember to activate the oneAPI basekit by following commands.
+Remember to activate the oneAPI DPC++/C++ Compiler with the following commands.
 
 ```bash
 # {dpcpproot} is the location for dpcpp ROOT path and it is where you installed oneAPI DPCPP, usually it is /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
@@ -39,19 +39,43 @@ source {dpcpproot}/env/vars.sh
 
 ```python
 import torch
+import torch.nn as nn
 import intel_extension_for_pytorch
 
-# create model
+# Define the SimpleNet model
+class SimpleNet(nn.Module):
+    def __init__(self):
+        super(SimpleNet, self).__init__()
+        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
+        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
+        self.fc1 = nn.Linear(32 * 56 * 56, 128)
+        self.fc2 = nn.Linear(128, 10)
+        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
+        self.relu = nn.ReLU()
+
+    def forward(self, x):
+        x = self.pool(self.relu(self.conv1(x)))
+        x = self.pool(self.relu(self.conv2(x)))
+        x = x.view(-1, 32 * 56 * 56)
+        x = self.relu(self.fc1(x))
+        x = self.fc2(x)
+        return x
+
+# Create model
 model = SimpleNet().to("xpu")
 
-# compile model
+# Compile model
 compiled_model = torch.compile(model, options={"freezing": True})
 
-# inference main
+# Inference main
 input = torch.rand(64, 3, 224, 224, device=torch.device("xpu"))
 with torch.no_grad():
     with torch.xpu.amp.autocast(dtype=torch.float16):
         output = compiled_model(input)
+
+# Print the output shape
+print(output.shape)
+print("Done for inference with torch.compile")
 ```
 
 ## Training with torch.compile
@@ -76,3 +100,7 @@ optimizer.zero_grad()
 loss.backward()
 optimizer.step()
 ```
+
+## Troubleshooting
+
+If you encounter any issue related to `torch.compile` or `triton`, please refer to the Library Dependencies section in [known_issues](../known_issues.md).

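The training hunk above shows only the tail of the loop (`loss.backward()` / `optimizer.step()`). For context, here is a minimal sketch of one full training step with `torch.compile` on XPU; the model, loss, and optimizer below are illustrative assumptions, not taken from the commit.

```python
# Minimal training-step sketch with torch.compile on XPU.
# The tiny model, CrossEntropyLoss, and SGD are assumptions for illustration only.
import torch
import torch.nn as nn
import intel_extension_for_pytorch  # noqa: F401  # enables the "xpu" device

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).to("xpu")
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Compile once; later calls reuse the compiled graph
compiled_model = torch.compile(model)

input = torch.rand(64, 3, 224, 224, device="xpu")
target = torch.randint(0, 10, (64,), device="xpu")

# One training step: forward under autocast, then the backward/step tail shown in the diff
with torch.xpu.amp.autocast(dtype=torch.bfloat16):
    output = compiled_model(input)
    loss = criterion(output, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("loss:", loss.item())
```
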
docs/tutorials/getting_started.md

Lines changed: 0 additions & 9 deletions
@@ -51,12 +51,3 @@ More examples, including training and usage of low precision data types are avai
 
 There are some environment variables in runtime that can be used to configure executions on GPU. Please check [Advanced Configuration](./features/advanced_configuration.html#runtime-configuration) for more detailed information.
 
-Set `OCL_ICD_VENDORS` with default path `/etc/OpenCL/vendors`.
-Set `CCL_ROOT` if you are using multi-GPU.
-
-```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
-export CCL_ROOT=${CONDA_PREFIX}
-python <script>
-```
-
docs/tutorials/known_issues.md

Lines changed: 34 additions & 9 deletions
@@ -83,24 +83,49 @@ Troubleshooting
 
 If you continue seeing similar issues for other shared object files, add the corresponding files under `${MKL_DPCPP_ROOT}/lib/intel64/` by `LD_PRELOAD`. Note that the suffix of the libraries may change (e.g. from .1 to .2), if more than one oneMKL library is installed on the system.
 
-- **Problem**: RuntimeError: could not create an engine.
-- **Cause**: `OCL_ICD_VENDORS` path is wrongly set when activate a exist conda environment.
-- **Solution**: `export OCL_ICD_VENDORS=/etc/OpenCL/vendors` after `conda activate`
-
-- **Problem**: If you encounter issues related to CCL environment variable configuration when running distributed tasks.
-- **Cause**: `CCL_ROOT` path is wrongly set.
-- **Solution**: `export CCL_ROOT=${CONDA_PREFIX}`
-
 - **Problem**: If you encounter issues related to MPI environment variable configuration when running distributed tasks.
 - **Cause**: MPI environment variable configuration not correct.
 - **Solution**: `conda deactivate` and then `conda activate` to activate the correct MPI environment variable automatically.
 
 ```
 conda deactivate
 conda activate
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 ```
 
+
+- **Problem**: Runtime error related to the C++ compiler when using `torch.compile`: `Runtime Error: Failed to find C++ compiler. Please specify via CXX environment variable.`
+- **Cause**: The DPC++/C++ Compiler is not installed or not activated correctly.
+- **Solution**: [Install the DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler-download.html) and activate it with the following commands.
+
+```bash
+# {dpcpproot} is the location for dpcpp ROOT path and it is where you installed oneAPI DPCPP, usually it is /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
+source {dpcpproot}/env/vars.sh
+```
+
+- **Problem**: RuntimeError: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at https://github.com/openai/triton
+- **Cause**: `pytorch-triton-xpu` is not installed.
+- **Solution**: Resolve the issue with the following command:
+
+```bash
+# Install the correct version of pytorch-triton-xpu
+pip install --pre pytorch-triton-xpu==3.1.0+91b14bf559 --index-url https://download.pytorch.org/whl/nightly/xpu
+```
+
+
+- **Problem**: LoweringException: ImportError: cannot import name 'intel' from 'triton._C.libtriton'
+- **Cause**: Installing the `triton` package causes `pytorch-triton-xpu` to stop working.
+- **Solution**: Resolve the issue with the following commands:
+
+```bash
+pip list | grep triton
+# If triton-related packages are listed, remove them
+pip uninstall triton
+pip uninstall pytorch-triton-xpu
+# Reinstall the correct version of pytorch-triton-xpu
+pip install --pre pytorch-triton-xpu==3.1.0+91b14bf559 --index-url https://download.pytorch.org/whl/nightly/xpu
+```
+
+
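After reinstalling `pytorch-triton-xpu`, a quick import check can confirm the fix took effect. This is a minimal sketch, not part of the documented solution; the `triton._C.libtriton.intel` module path is assumed from the error message quoted above.

```python
# Sanity check after reinstalling pytorch-triton-xpu (sketch; module path taken
# from the ImportError message above, not from official documentation).
import triton
print("triton version:", triton.__version__)

from triton._C.libtriton import intel  # raises ImportError if the XPU backend is missing
print("XPU Triton backend imported successfully")
```
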
 ## Performance Issue
 
 - **Problem**: Extended durations for data transfers from the host system to the device (H2D) and from the device back to the host system (D2H).

examples/gpu/inference/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Run Model Inference
+
+
+1. Command to compile example-app
+
+```
+$ cd example-app
+$ mkdir build
+$ cd build
+$ CC=icx CXX=icpx cmake -DCMAKE_PREFIX_PATH=<LIBPYTORCH_PATH> ..
+$ make
+```
+
+2. Use model_gen.py to generate the ResNet-50 JIT model and save it as resnet50.pt
+
+```
+python ../../model_gen.py
+```
+
+
+3. Run the example
+
+```
+./example-app resnet50.pt
+```

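Step 2 invokes `model_gen.py` without showing its contents. A minimal sketch of what such a script might look like (an assumption for illustration, not the script shipped in the repository) traces a torchvision ResNet-50 on XPU and saves the TorchScript module as `resnet50.pt`, which step 3 then passes to `example-app`.

```python
# Hypothetical model_gen.py-style sketch: trace ResNet-50 to TorchScript and save it
# for the C++ example-app to load. Not the actual script from the repository.
import torch
import torchvision.models as models
import intel_extension_for_pytorch  # noqa: F401  # enables the "xpu" device

model = models.resnet50(weights=None).eval().to("xpu")
example_input = torch.rand(1, 3, 224, 224, device="xpu")

with torch.no_grad():
    traced = torch.jit.trace(model, example_input)

traced.save("resnet50.pt")
print("Saved traced model to resnet50.pt")
```
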
examples/gpu/llm/README.md

Lines changed: 58 additions & 9 deletions
@@ -8,21 +8,70 @@ Here you can find benchmarking scripts for large language models (LLM) text gene
 
 ## Environment Setup
 
-### [Recommended] Docker-based environment setup with compilation from source
+### [Recommended] Docker-based environment setup with prebuilt wheel files
 
 ```bash
 # Get the Intel® Extension for PyTorch* source code
 git clone https://github.com/intel/intel-extension-for-pytorch.git
 cd intel-extension-for-pytorch
-git checkout xpu-main
+git checkout v2.5.10+xpu
+git submodule sync
+git submodule update --init --recursive
+
+# Build an image with the provided Dockerfile by installing Intel® Extension for PyTorch* with prebuilt wheels
+docker build -f examples/gpu/llm/Dockerfile -t ipex-llm:2510 .
+
+# Run the container with command below
+docker run -it --rm --privileged -v /dev/dri/by-path:/dev/dri/by-path ipex-llm:2510 bash
+
+# When the command prompt shows inside the docker container, enter llm examples directory
+cd llm
+
+# Activate environment variables
+source ./tools/env_activate.sh [inference|fine-tuning]
+```
+
+### Conda-based environment setup with prebuilt wheel files
+
+Make sure the driver packages are installed. Refer to [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/#installation?platform=gpu&version=v2.5.10%2Bxpu&os=linux%2Fwsl2&package=pip).
+
+```bash
+
+# Get the Intel® Extension for PyTorch* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout v2.5.10+xpu
+git submodule sync
+git submodule update --init --recursive
+
+# Make sure GCC >= 11 is installed on your system.
+# Create a conda environment
+conda create -n llm python=3.10 -y
+conda activate llm
+# Setup the environment with the provided script
+cd examples/gpu/llm
+# If you want to install Intel® Extension for PyTorch\* with prebuilt wheels, use the commands below:
+bash ./tools/env_setup.sh 0x07
+conda deactivate
+conda activate llm
+source ./tools/env_activate.sh [inference|fine-tuning]
+```
+
+### Docker-based environment setup with compilation from source
+
+```bash
+# Get the Intel® Extension for PyTorch* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout v2.5.10+xpu
 git submodule sync
 git submodule update --init --recursive
 
 # Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch* from source
-docker build -f examples/gpu/llm/Dockerfile --build-arg COMPILE=ON -t ipex-llm:xpu-main .
+docker build -f examples/gpu/llm/Dockerfile --build-arg COMPILE=ON -t ipex-llm:2510 .
 
 # Run the container with command below
-docker run -it --rm --privileged -v /dev/dri/by-path:/dev/dri/by-path ipex-llm:xpu-main bash
+docker run -it --rm --privileged -v /dev/dri/by-path:/dev/dri/by-path ipex-llm:2510 bash
 
 # When the command prompt shows inside the docker container, enter llm examples directory
 cd llm
@@ -33,14 +82,14 @@ source ./tools/env_activate.sh [inference|fine-tuning]
 
 ### Conda-based environment setup with compilation from source
 
-Make sure the driver and Base Toolkit are installed. Refer to [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/#installation?platform=gpu&version=v2.3.110%2Bxpu&os=linux%2Fwsl2&package=source).
+Make sure the driver and Base Toolkit are installed. Refer to [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/#installation?platform=gpu&version=v2.5.10%2Bxpu&os=linux%2Fwsl2&package=source).
 
 ```bash
 
 # Get the Intel® Extension for PyTorch* source code
 git clone https://github.com/intel/intel-extension-for-pytorch.git
 cd intel-extension-for-pytorch
-git checkout xpu-main
+git checkout v2.5.10+xpu
 git submodule sync
 git submodule update --init --recursive
 
@@ -51,12 +100,12 @@ conda activate llm
 # Setup the environment with the provided script
 cd examples/gpu/llm
 # If you want to install Intel® Extension for PyTorch\* from source, use the commands below:
-# e.g. bash ./tools/env_setup.sh 3 /opt/intel/oneapi pvc
-bash ./tools/env_setup.sh 3 <ONEAPI_ROOT_DIR> <AOT>
+
+# e.g. bash ./tools/env_setup.sh 0x03 /opt/intel/oneapi/compiler/latest /opt/intel/oneapi/mkl/latest /opt/intel/oneapi/ccl/latest /opt/intel/oneapi/mpi/latest /opt/intel/oneapi/pti/latest pvc
+bash ./tools/env_setup.sh 0x03 <DPCPP_ROOT> <ONEMKL_ROOT> <ONECCL_ROOT> <MPI_ROOT> <PTI_ROOT> <AOT>
 
 conda deactivate
 conda activate llm
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 source ./tools/env_activate.sh [inference|fine-tuning]
 ```
 
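The inference scripts activated by `env_activate.sh inference` center on `ipex.llm.optimize`, which this commit adopts in place of the former `optimize_transformers` call. A minimal FP16 sketch of that entry point on XPU follows; the model checkpoint, generation settings, and exact keyword arguments are assumptions for illustration rather than the commit's benchmarking code.

```python
# Hedged sketch of FP16 LLM inference with ipex.llm.optimize on XPU.
# The Llama-2 checkpoint and generation settings are illustrative assumptions.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("xpu")

# Apply the LLM-specific optimizations (replaces the former ipex.optimize_transformers)
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("What is Intel Extension for PyTorch?", return_tensors="pt").to("xpu")
with torch.no_grad(), torch.xpu.amp.autocast(dtype=torch.float16):
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
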
examples/gpu/llm/fine-tuning/Llama2/README.md

Lines changed: 0 additions & 2 deletions
@@ -45,7 +45,6 @@ Remove the flags `--data_path` in fine-tuning command will load the guanaco-llam
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
@@ -84,7 +83,6 @@ Remove the flags `--data_path` in fine-tuning command will load the guanaco-llam
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
examples/gpu/llm/fine-tuning/Llama3/README.md

Lines changed: 0 additions & 3 deletions
@@ -26,7 +26,6 @@ Full-finetuning on single card will cause OOM.
 Example: Llama 3 8B LoRA fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `llama3_ft.py`.
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export TORCH_LLM_ALLREDUCE=1
 
 export model="meta-llama/Meta-Llama-3-8B"
@@ -57,7 +56,6 @@ Example: Llama 3 8B full fine-tuning, you can change the model name/path for ano
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
@@ -85,7 +83,6 @@ Example: Llama 3 8B LoRA fine-tuning, you can change the model name/path for ano
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
examples/gpu/llm/fine-tuning/Phi3/README.md

Lines changed: 0 additions & 6 deletions
@@ -21,7 +21,6 @@ wandb login
 **Note**: Not support full finetuning and flash attention on this platform.
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export model="microsoft/Phi-3-mini-4k-instruct"
 
 python phi3_ft.py \
@@ -47,7 +46,6 @@ python phi3_ft.py \
 Example: Phi-3 Mini 4k full fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `phi3_ft.py`.
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export TORCH_LLM_ALLREDUCE=1
 
 export model="microsoft/Phi-3-mini-4k-instruct"
@@ -71,7 +69,6 @@ python phi3_ft.py \
 Example: Phi-3 Mini 4k LoRA fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `phi3_ft.py`.
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export TORCH_LLM_ALLREDUCE=1
 
 export model="microsoft/Phi-3-mini-4k-instruct"
@@ -102,7 +99,6 @@ Example: Phi-3 Mini 4k full fine-tuning.
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
@@ -130,7 +126,6 @@ Example: Phi-3 Mini 4k LoRA fine-tuning.
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
@@ -159,7 +154,6 @@ Example: Phi3-Mini 4k LoRA fine-tuning.
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
examples/gpu/llm/fine-tuning/Qwen/README.md

Lines changed: 0 additions & 2 deletions
@@ -29,7 +29,6 @@ Example: Qwen 7B full fine-tuning.
 
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
@@ -61,7 +60,6 @@ accelerate launch --config_file "fsdp_config.yaml" qwen2_ft.py \
 Example: Qwen 7B LoRA fine-tuning.
 
 ```bash
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
 export CCL_PROCESS_LAUNCHER=none
 export TORCH_LLM_ALLREDUCE=1
 
examples/gpu/llm/fine-tuning/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ fire
 tokenizers>=0.13.3
 wandb==0.17.5
 trl==0.9.4
-accelerate==0.28.0
+accelerate==1.1.1
