.. meta::
:google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI
.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.

------
################################################
💫 IPEX-LLM
################################################
**IPEX-LLM** is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1].

.. note::

   - It is built on top of Intel Extension for PyTorch (``IPEX``), as well as the excellent work of ``llama.cpp``, ``bitsandbytes``, ``vLLM``, ``qlora``, ``AutoGPTQ``, ``AutoAWQ``, etc.
   - It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
   - 50+ models have been optimized/verified on ``ipex-llm`` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
************************************************
Latest update 🔥
************************************************
* [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_.
* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
* [2024/04] ``ipex-llm`` now provides C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU.
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).
.. dropdown:: More updates
:color: primary
* [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
* [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
* [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
* [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
* [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
* [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
* [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
* [2023/09] ``ipex-llm`` `tutorial `_ is released.
************************************************
``ipex-llm`` Demos
************************************************
See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
************************************************
``ipex-llm`` Quickstart
************************************************
============================================
Install ``ipex-llm``
============================================
* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
* `Docker `_: using ``ipex-llm`` dockers on Intel CPU and GPU
.. seealso::
For more details, please refer to the `installation guide `_
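As a rough sketch of what installation looks like (the package extras and wheel index shown here are illustrative and may lag behind the platform-specific instructions; the installation guide above is authoritative):

```shell
# Illustrative ipex-llm install commands; consult the installation
# guide for the current platform-specific instructions.

# CPU-only install:
pip install --pre --upgrade "ipex-llm[all]"

# Intel GPU (XPU) install, pulling XPU wheels from Intel's extra index:
pip install --pre --upgrade "ipex-llm[xpu]" \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```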
============================================
Run ``ipex-llm``
============================================
* `llama.cpp `_: running **llama.cpp** (*using C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp``) on Intel GPU
* `ollama `_: running **ollama** (*using C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``ollama``) on Intel GPU
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU
============================================
Code Examples
============================================
* Low bit inference
* `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
* `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
* `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
* `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
* FP16/BF16 inference
* **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
* **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization
* Save and load
* `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
* `GGUF `_: directly loading GGUF models into ``ipex-llm``
* `AWQ `_: directly loading AWQ models into ``ipex-llm``
* `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
* Finetuning
* LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
* QLoRA finetuning on Intel `CPU `_
* Integration with community libraries
* `HuggingFace transformers `_
* `Standard PyTorch model `_
* `DeepSpeed-AutoTP `_
* `HuggingFace PEFT `_
* `HuggingFace TRL `_
* `LangChain `_
* `LlamaIndex `_
* `AutoGen `_
* `ModelScope `_
* `Tutorials `_
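As a minimal sketch of the low-bit inference flow listed above (assuming ``ipex-llm`` and ``transformers`` are installed; the model id and prompt are illustrative, and running this requires downloading the model weights):

```python
# Minimal INT4 inference sketch with ipex-llm.
# The model id and prompt below are illustrative placeholders.
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"

# load_in_4bit=True quantizes the weights to INT4 as the model is loaded
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# On Intel GPU, move the quantized model to the XPU device:
# model = model.to("xpu")

input_ids = tokenizer.encode("What is AI?", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The key point is that ``ipex_llm.transformers.AutoModelForCausalLM`` mirrors the HuggingFace ``transformers`` API, so existing loading and generation code carries over with only the import changed.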
.. seealso::
For more details, please refer to the |ipex_llm_document|_.
.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
************************************************
Verified Models
************************************************
.. list-table::
   :header-rows: 1

   * - Model
     - CPU Example
     - GPU Example
   * - LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)
     - link1, link2
     - link, link
   * - LLaMA 2
     - link1, link2
     - link, link
   * - LLaMA 3
     - link
     - link
   * - ChatGLM
     - link
     -
   * - ChatGLM2
     - link
     - link
   * - ChatGLM3
     - link
     - link
   * - Mistral
     - link
     - link
   * - Mixtral
     - link
     - link
   * - Falcon
     - link
     - link
   * - MPT
     - link
     - link
   * - Dolly-v1
     - link
     - link
   * - Dolly-v2
     - link
     - link
   * - Replit Code
     - link
     - link
   * - RedPajama
     - link1, link2
     -
   * - Phoenix
     - link1, link2
     -
   * - StarCoder
     - link1, link2
     - link
   * - Baichuan
     - link
     - link
   * - Baichuan2
     - link
     - link
   * - InternLM
     - link
     - link
   * - Qwen
     - link
     - link
   * - Qwen1.5
     - link
     - link
   * - Qwen-VL
     - link
     - link
   * - Aquila
     - link
     - link
   * - Aquila2
     - link
     - link
   * - MOSS
     - link
     -
   * - Whisper
     - link
     - link
   * - Phi-1_5
     - link
     - link
   * - Flan-t5
     - link
     - link
   * - LLaVA
     - link
     - link
   * - CodeLlama
     - link
     - link
   * - Skywork
     - link
     -
   * - InternLM-XComposer
     - link
     -
   * - WizardCoder-Python
     - link
     -
   * - CodeShell
     - link
     -
   * - Fuyu
     - link
     -
   * - Distil-Whisper
     - link
     - link
   * - Yi
     - link
     - link
   * - BlueLM
     - link
     - link
   * - Mamba
     - link
     - link
   * - SOLAR
     - link
     - link
   * - Phixtral
     - link
     - link
   * - InternLM2
     - link
     - link
   * - RWKV4
     -
     - link
   * - RWKV5
     -
     - link
   * - Bark
     - link
     - link
   * - SpeechT5
     -
     - link
   * - DeepSeek-MoE
     - link
     -
   * - Ziya-Coding-34B-v1.0
     - link
     -
   * - Phi-2
     - link
     - link
   * - Phi-3
     - link
     - link
   * - Yuan2
     - link
     - link
   * - Gemma
     - link
     - link
   * - DeciLM-7B
     - link
     - link
   * - Deepseek
     - link
     - link
   * - StableLM
     - link
     - link
   * - CodeGemma
     - link
     - link
************************************************
Get Support
************************************************
* Please report a bug or raise a feature request by opening a `GitHub Issue `_
* Please report a vulnerability by opening a draft `GitHub Security Advisory `_
------
.. [1] Performance varies by use, configuration and other factors. ``ipex-llm`` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.