Commit 407fd71

[doc] sync docs to release branch (#1645)
* Add api_doc.rst, features.rst, optimizer_fusion and installation
* remove DDP, hvd, DLpack from features.rst
1 parent 31875e7 commit 407fd71

File tree

4 files changed: +322 additions, -0 deletions

docs/tutorials/api_doc.rst

Lines changed: 108 additions & 0 deletions
API Documentation
#################

General
*******

.. currentmodule:: intel_extension_for_pytorch
.. autofunction:: optimize
.. currentmodule:: intel_extension_for_pytorch.xpu
.. StreamContext
.. can_device_access_peer
.. current_blas_handle
.. autofunction:: current_device
.. autofunction:: current_stream
.. default_stream
.. autoclass:: device
.. autofunction:: device_count
.. autoclass:: device_of
.. autofunction:: getDeviceIdListForCard
.. autofunction:: get_device_name
.. autofunction:: get_device_properties
.. get_gencode_flags
.. get_sync_debug_mode
.. autofunction:: init
.. ipc_collect
.. autofunction:: is_available
.. autofunction:: is_initialized
.. memory_usage
.. autofunction:: set_device
.. set_stream
.. autofunction:: stream
.. autofunction:: synchronize


Random Number Generator
***********************

.. currentmodule:: intel_extension_for_pytorch.xpu
.. autofunction:: get_rng_state
.. autofunction:: get_rng_state_all
.. autofunction:: set_rng_state
.. autofunction:: set_rng_state_all
.. autofunction:: manual_seed
.. autofunction:: manual_seed_all
.. autofunction:: seed
.. autofunction:: seed_all
.. autofunction:: initial_seed


Streams and events
******************

.. currentmodule:: intel_extension_for_pytorch.xpu
.. autoclass:: Stream
   :members:
.. ExternalStream
.. autoclass:: Event
   :members:

Memory management
*****************

.. currentmodule:: intel_extension_for_pytorch.xpu
.. autofunction:: empty_cache
.. list_gpu_processes
.. mem_get_info
.. autofunction:: memory_stats
.. autofunction:: memory_summary
.. autofunction:: memory_snapshot
.. autofunction:: memory_allocated
.. autofunction:: max_memory_allocated
.. reset_max_memory_allocated
.. autofunction:: memory_reserved
.. autofunction:: max_memory_reserved
.. set_per_process_memory_fraction
.. memory_cached
.. max_memory_cached
.. reset_max_memory_cached
.. autofunction:: reset_peak_memory_stats
.. caching_allocator_alloc
.. caching_allocator_delete


.. autofunction:: memory_stats_as_nested_dict
.. autofunction:: reset_accumulated_memory_stats

Other
*****

.. currentmodule:: intel_extension_for_pytorch.xpu
.. autofunction:: get_fp32_math_mode
.. autofunction:: set_fp32_math_mode

.. .. automodule:: intel_extension_for_pytorch.quantization
..    :members:

C++ API
*******

.. doxygenenum:: xpu::FP32_MATH_MODE

.. doxygenfunction:: xpu::set_fp32_math_mode

.. doxygenfunction:: xpu::get_queue_from_stream
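A note on tooling: the `autofunction`/`autoclass` directives above come from Sphinx's `sphinx.ext.autodoc` extension, while the `doxygenenum`/`doxygenfunction` directives require the Breathe bridge to a prior Doxygen XML build of the C++ sources. A minimal `conf.py` sketch follows; the project name and XML path are illustrative assumptions, not taken from this repository:

```python
# conf.py (sketch) -- Sphinx configuration enabling the directives used above.
extensions = [
    "sphinx.ext.autodoc",   # provides .. autofunction:: / .. autoclass::
    "breathe",              # provides .. doxygenenum:: / .. doxygenfunction::
]

# Breathe reads the XML output of a Doxygen run over the C++ sources.
# Both the project name "ipex" and the path are assumed examples.
breathe_projects = {"ipex": "../build/doxygen/xml"}
breathe_default_project = "ipex"
```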

docs/tutorials/features.rst

Lines changed: 90 additions & 0 deletions
Features
========

Ease-of-use Python API
----------------------

Intel® Extension for PyTorch\* provides simple frontend Python APIs and utilities to get performance optimizations such as operator optimization.

Check the `API Documentation <api_doc.html>`_ for details of API functions and `Examples <examples.md>`_ for helpful usage tips.

DPC++ Extension
---------------

Intel® Extension for PyTorch\* provides C++ APIs to get the DPC++ queue and configure the floating-point math mode.

Check the `API Documentation`_ for the details of API functions. `DPC++ Extension <features/DPC++_Extension.md>`_ describes how to write customized DPC++ kernels with a practical example and build them with setuptools and CMake.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/DPC++_Extension

Here are detailed discussions of specific feature topics, summarized in the rest of this document:

Channels Last
-------------

Compared with the default NCHW memory format, using the channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch\*, the NHWC memory format has been enabled for most key GPU operators.

For more detailed information, check `Channels Last <features/nhwc.md>`_.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/nhwc

Auto Mixed Precision (AMP)
--------------------------

Support for Auto Mixed Precision (AMP) with BFloat16 and Float16 optimization of operators has been enabled in Intel® Extension for PyTorch\*. BFloat16 is the default low-precision floating-point data type when AMP is enabled. We suggest using AMP to accelerate convolutional and matmul-based neural networks.

For more detailed information, check `Auto Mixed Precision (AMP) <features/amp.md>`_.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/amp

Advanced Configuration
----------------------

The default settings for Intel® Extension for PyTorch\* are sufficient for most use cases. However, if you want to customize Intel® Extension for PyTorch\*, advanced configuration is available at build time and runtime.

For more detailed information, check `Advanced Configuration <features/advanced_configuration.md>`_.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/advanced_configuration

Optimizer Optimization
----------------------

Optimizers are a key part of training workloads. Intel® Extension for PyTorch\* supports operator fusion for computation in the optimizers.

For more detailed information, check `Optimizer Fusion <features/optimizer_fusion.md>`_.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/optimizer_fusion

Simple Trace Tool
-----------------

Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. Once enabled, it automatically prints verbose messages about called operators in a stack format, with indentation to distinguish the context.

For more detailed information, check `Simple Trace Tool <features/simple_trace.md>`_.

.. toctree::
   :hidden:
   :maxdepth: 1

   features/simple_trace
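To make the NCHW-vs-NHWC distinction above concrete, here is a plain-Python sketch of how the same logical `(n, c, h, w)` element maps to a different linear storage offset under each layout (the helper names are illustrative, not part of the extension's API); the point is that in NHWC the channels of one pixel sit next to each other in memory, which is what convolution kernels exploit:

```python
def nchw_offset(n, c, h, w, C, H, W):
    # Default contiguous (NCHW) layout: w varies fastest, then h, c, n.
    return ((n * C + c) * H + h) * W + w

def nhwc_offset(n, c, h, w, C, H, W):
    # channels_last (NHWC) layout: c varies fastest, then w, h, n.
    return ((n * H + h) * W + w) * C + c
```

For a 3-channel image, `nhwc_offset(0, 1, 0, 0, 3, 4, 5)` is 1 (channel neighbors are adjacent), while `nchw_offset(0, 1, 0, 0, 3, 4, 5)` is 20 (channel neighbors are a whole H*W plane apart).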
docs/tutorials/features/optimizer_fusion.md

Lines changed: 40 additions & 0 deletions
Optimizer Fusion
================

## Introduction

As with TorchScript, operation fusion reduces the number of operators that will be executed and reduces overhead time. This methodology is also applied in Intel® Extension for PyTorch\* optimizer optimization. SGD and AdamW fusion for both FP32 and BF16 are supported at the current stage.

Let's examine the code in [SGD update](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html?highlight=sgd#torch.optim.SGD) as an example.

```python
# original version
if weight_decay != 0:
    grad = grad.add(param, alpha=weight_decay)
if momentum != 0:
    buf = momentum_buffer_list[i]
    if buf is None:
        buf = torch.clone(grad).detach()
        momentum_buffer_list[i] = buf
    else:
        buf.mul_(momentum).add_(grad, alpha=1 - dampening)
    if nesterov:
        grad = grad.add(buf, alpha=momentum)
    else:
        grad = buf

param.add_(grad, alpha=-lr)
```

## Operation Fusion

One problem with the native implementation above is that it accesses the storage of `grad`, `param`, and `buf` several times. For large topologies, `grad` and `param` might not fit in the cache, so when the storage of `grad` is accessed again while executing the remaining clauses, the processor must read the data out of slow memory again instead of the much faster cache. This memory-bound bottleneck prevents good performance.

Operation fusion is a way to solve this problem. The clauses in the pseudo-code above are all element-wise operations, so we can fuse them into a single operation, as in the pseudo-code below.

```python
# fused version
sgd_fused_step(param, grad, buf, ...(other args))
```

After fusion, the single operation `sgd_fused_step` provides equivalent functionality but much better performance compared with the original version of [SGD update](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html?highlight=sgd#torch.optim.SGD).
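The memory-access argument above can be illustrated with a plain-Python sketch. The functions below are hypothetical toy models (lists stand in for tensor storage; only the `weight_decay != 0`, `momentum != 0`, non-Nesterov path is covered), not the extension's actual fused kernels:

```python
def sgd_step(param, grad, buf, lr, weight_decay, momentum, dampening):
    # Unfused: three separate passes over the storage, so each element
    # of grad/param/buf is read from memory several times.
    grad = [g + weight_decay * p for g, p in zip(grad, param)]   # pass 1
    for i in range(len(buf)):                                    # pass 2
        buf[i] = momentum * buf[i] + (1.0 - dampening) * grad[i]
    return [p - lr * b for p, b in zip(param, buf)]              # pass 3

def sgd_fused_step(param, grad, buf, lr, weight_decay, momentum, dampening):
    # Fused: one pass; each element is loaded once, fully updated,
    # and written back while still hot in cache/registers.
    out = []
    for i in range(len(param)):
        g = grad[i] + weight_decay * param[i]
        buf[i] = momentum * buf[i] + (1.0 - dampening) * g
        out.append(param[i] - lr * buf[i])
    return out
```

Both functions compute the same update; the fused variant simply touches each element's storage once instead of three times, which is the entire source of the speedup on memory-bound optimizers.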

docs/tutorials/installation.md

Lines changed: 84 additions & 0 deletions
# Build and Install from Source Code

This is a guide to build an Intel® Extension for PyTorch* PyPI package from source and install it on Linux.

## Prepare

### Hardware Requirement

Verified Hardware Platforms:
- Intel® Data Center GPU Flex Series 170

### Software Requirements

- Ubuntu 20.04 (64-bit)
- Intel GPU Drivers
  - Intel® Data Center GPU Flex Series [419.40](https://dgpu-docs.intel.com/releases/stable_419_40_20220914.html)
- Intel® oneAPI Base Toolkit 2022.3
- Python 3.7-3.10

### Install Intel GPU Driver

|Release|OS|Intel GPU|Install Intel GPU Driver|
|-|-|-|-|
|v1.0.0|Ubuntu 20.04|Intel® Data Center GPU Flex Series|Refer to the [Installation Guides](https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-focal-dc.html) for the latest driver installation. If installing the verified Intel® Data Center GPU Flex Series driver [419.40](https://dgpu-docs.intel.com/releases/stable_419_40_20220914.html), append the specific version after each component, such as `sudo apt-get install intel-opencl-icd=22.28.23726.1+i419~u20.04`.|

### Install oneAPI Base Toolkit

Please refer to [Install oneAPI Base Toolkit Packages](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit).

The following components of Intel® oneAPI Base Toolkit need to be installed:
- Intel® oneAPI DPC++ Compiler
- Intel® oneAPI Math Kernel Library (oneMKL)

The default installation location is `/opt/intel/oneapi` for the root account and `${HOME}/intel/oneapi` for other accounts.

### Configure the AOT

Please refer to the [AOT documentation](./AOT.md) for how to configure AOT (ahead-of-time compilation).

### Build and Install from Source Code

Make sure PyTorch is installed so that the extension works properly. For each PyTorch release, there is a corresponding release of the extension. Here are the PyTorch versions that we support and the mapping relationship:

|PyTorch Version|Intel® Extension for PyTorch* Version|
|--|--|
|[v1.10.\*](https://github.com/pytorch/pytorch/tree/v1.10.0 "v1.10.0")|[v1.10.\*](https://github.com/intel/intel-extension-for-pytorch/tree/v1.10.200)|

Build and install PyTorch:

```bash
$ git clone https://github.com/pytorch/pytorch.git
$ cd pytorch
# check out a specific release branch if needed
$ git checkout ${PYTORCH_RELEASE_BRANCH_NAME}
# apply the git patch to the PyTorch code, e.g., apply the patch for PyTorch v1.10
$ git apply ${intel_extension_for_pytorch_directory}/torch_patches/{xpu-1.10}.patch
$ git submodule update --init --recursive
$ pip install -r requirements.txt
# configure the MKL env to enable MKL features
$ source ${oneAPI_HOME}/mkl/latest/env/vars.sh
# build the PyPI package and install it locally
$ python setup.py bdist_wheel
$ pip install dist/*.whl
```

Build and install Intel® Extension for PyTorch*:

```bash
$ git clone -b xpu-master https://github.com/intel/intel-extension-for-pytorch.git
$ cd intel-extension-for-pytorch
# check out a specific release branch if needed
$ git checkout ${IPEX_RELEASE_BRANCH_NAME}
$ git submodule update --init --recursive
$ pip install -r requirements.txt
# configure the DPC++ compiler env
$ source ${oneAPI_HOME}/compiler/latest/env/vars.sh
# configure the MKL env to enable MKL features
$ source ${oneAPI_HOME}/mkl/latest/env/vars.sh
# build the PyPI package and install it locally
# (set USE_AOT_DEVLIST beforehand if AOT is configured)
$ USE_AOT_DEVLIST=${USE_AOT_DEVLIST} python setup.py bdist_wheel
$ pip install dist/*.whl
```
