FEAT: Dynamic batching for the state-of-the-art FLUX.1 `text_to_image` interface (#2380)
ChengjieLi28 authored Oct 18, 2024
1 parent 7b1f0b4 commit 948b99a
Showing 11 changed files with 800 additions and 100 deletions.
1 change: 1 addition & 0 deletions .github/workflows/python.yaml
@@ -174,6 +174,7 @@ jobs:
${{ env.SELF_HOST_PYTHON }} -m pip install -U "ormsgpack"
${{ env.SELF_HOST_PYTHON }} -m pip uninstall -y opencc
${{ env.SELF_HOST_PYTHON }} -m pip uninstall -y "faster_whisper"
${{ env.SELF_HOST_PYTHON }} -m pip install -U accelerate
${{ env.SELF_HOST_PYTHON }} -m pytest --timeout=1500 \
-W ignore::PendingDeprecationWarning \
--cov-config=setup.cfg --cov-report=xml --cov=xinference xinference/model/image/tests/test_stable_diffusion.py && \
113 changes: 75 additions & 38 deletions doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2024-09-06 14:26+0800\n"
"POT-Creation-Date: 2024-10-17 18:49+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <[email protected]>\n"
@@ -18,8 +18,8 @@ msgstr ""
"Generated-By: Babel 2.11.0\n"

#: ../../source/user_guide/continuous_batching.rst:5
msgid "Continuous Batching (experimental)"
msgstr "连续批处理(实验性质)"
msgid "Continuous Batching"
msgstr "连续批处理"

#: ../../source/user_guide/continuous_batching.rst:7
msgid ""
@@ -35,11 +35,15 @@ msgstr ""
msgid "Usage"
msgstr "使用方式"

#: ../../source/user_guide/continuous_batching.rst:12
#: ../../source/user_guide/continuous_batching.rst:14
msgid "LLM"
msgstr "大语言模型"

#: ../../source/user_guide/continuous_batching.rst:15
msgid "Currently, this feature can be enabled under the following conditions:"
msgstr "当前,此功能在满足以下条件时开启:"

#: ../../source/user_guide/continuous_batching.rst:14
#: ../../source/user_guide/continuous_batching.rst:17
msgid ""
"First, set the environment variable "
"``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting "
@@ -48,13 +52,22 @@ msgstr ""
"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_"
"BATCHING`` 置为 ``1`` 。"

#: ../../source/user_guide/continuous_batching.rst:21
#: ../../source/user_guide/continuous_batching.rst:25
msgid ""
"Since ``v0.16.0``, this feature is turned on by default and is no longer "
"required to set the ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` "
"environment variable. This environment variable has been removed."
msgstr ""
"自 ``v0.16.0`` 开始,此功能默认开启,不再需要设置 ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` 环境变量,"
"且该环境变量已被移除。"

#: ../../source/user_guide/continuous_batching.rst:30
msgid ""
"Then, ensure that the ``transformers`` engine is selected when launching "
"the model. For example:"
msgstr "然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"

#: ../../source/user_guide/continuous_batching.rst:57
#: ../../source/user_guide/continuous_batching.rst:66
msgid ""
"Once this feature is enabled, all requests for LLMs will be managed by "
"continuous batching, and the average throughput of requests made to a "
@@ -64,54 +77,92 @@ msgstr ""
"一旦此功能开启,LLM 模型的所有接口将被此功能接管。所有接口的使用方式没有"
"任何变化。"

#: ../../source/user_guide/continuous_batching.rst:63
#: ../../source/user_guide/continuous_batching.rst:71
msgid "Image Model"
msgstr "图像模型"

#: ../../source/user_guide/continuous_batching.rst:72
msgid ""
"Currently, for image models, only the ``text_to_image`` interface is "
"supported for ``FLUX.1`` series models."
msgstr ""
"当前只有 ``FLUX.1`` 系列模型的 ``text_to_image`` (文生图)接口支持此功能。"

#: ../../source/user_guide/continuous_batching.rst:74
msgid ""
"Enabling this feature requires setting the environment variable "
"``XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE``, which indicates the ``size`` "
"of the generated images."
msgstr ""
"图像模型开启此功能需要在启动 xinference 时指定 ``XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE`` 环境变量,"
"表示生成图片的大小。"

#: ../../source/user_guide/continuous_batching.rst:76
msgid "For example, starting xinference like this:"
msgstr ""
"例如,像这样启动 xinference:"

#: ../../source/user_guide/continuous_batching.rst:83
msgid ""
"Then just use the ``text_to_image`` interface as before, and nothing else"
" needs to be changed."
msgstr ""
"接下来正常使用 ``text_to_image`` 接口即可,其他什么都不需要改变。"

#: ../../source/user_guide/continuous_batching.rst:86
msgid "Abort your request"
msgstr "中止请求"

#: ../../source/user_guide/continuous_batching.rst:64
#: ../../source/user_guide/continuous_batching.rst:87
msgid "In this mode, you can abort requests that are in the process of inference."
msgstr "此功能中,你可以优雅地中止正在推理中的请求。"

#: ../../source/user_guide/continuous_batching.rst:66
#: ../../source/user_guide/continuous_batching.rst:89
msgid "First, add ``request_id`` option in ``generate_config``. For example:"
msgstr "首先,在推理请求的 ``generate_config`` 中指定 ``request_id`` 选项。例如:"

#: ../../source/user_guide/continuous_batching.rst:75
#: ../../source/user_guide/continuous_batching.rst:98
msgid ""
"Then, abort the request using the ``request_id`` you have set. For "
"example:"
msgstr "接着,带着你指定的 ``request_id`` 去中止该请求。例如:"

#: ../../source/user_guide/continuous_batching.rst:83
#: ../../source/user_guide/continuous_batching.rst:106
msgid ""
"Note that if your request has already finished, aborting the request will"
" be a no-op."
" be a no-op. Image models also support this feature."
msgstr "注意,如果你的请求已经结束,那么此操作将什么都不做。"

#: ../../source/user_guide/continuous_batching.rst:86
#: ../../source/user_guide/continuous_batching.rst:110
msgid "Note"
msgstr "注意事项"

#: ../../source/user_guide/continuous_batching.rst:88
#: ../../source/user_guide/continuous_batching.rst:112
msgid ""
"Currently, this feature only supports the ``generate``, ``chat`` and "
"``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not "
"supported."
"Currently, for ``LLM`` models, this feature only supports the "
"``generate``, ``chat``, ``tool call`` and ``vision`` tasks."
msgstr ""
"当前,此功能仅支持 LLM 模型的 ``generate``, ``chat`` 和 ``vision`` (多"
"模态) 功能。``tool call`` (工具调用)暂时不支持。"
"当前,此功能仅支持 LLM 模型的 ``generate``, ``chat``, ``tool call`` (工具调用)和 ``vision`` (多"
"模态) 功能。"

#: ../../source/user_guide/continuous_batching.rst:90
#: ../../source/user_guide/continuous_batching.rst:114
msgid ""
"Currently, for ``image`` models, this feature only supports the "
"``text_to_image`` tasks. Only ``FLUX.1`` series models are supported."
msgstr ""
"当前,对于图像模型,仅支持 `FLUX.1`` 系列模型的 ``text_to_image`` (文生图)功能。"

#: ../../source/user_guide/continuous_batching.rst:116
msgid ""
"For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, "
"``glm-4v`` and ``MiniCPM-V-2.6`` (only for image tasks) models are "
"supported. More models will be supported in the future. Please let us "
"know your requirements."
msgstr ""
"对于多模态任务,当前支持 ``qwen-vl-chat`` ,``cogvlm2``, ``glm-4v`` 和 ``MiniCPM-V-2.6`` (仅对于图像任务)"
"模型。未来将加入更多模型,敬请期待。"
"对于多模态任务,当前支持 ``qwen-vl-chat`` ,``cogvlm2``, ``glm-4v`` 和 `"
"`MiniCPM-V-2.6`` (仅对于图像任务)模型。未来将加入更多模型,敬请期待。"

#: ../../source/user_guide/continuous_batching.rst:92
#: ../../source/user_guide/continuous_batching.rst:118
msgid ""
"If using GPU inference, this method will consume more GPU memory. Please "
"be cautious when increasing the number of concurrent requests to the same"
@@ -123,17 +174,3 @@ msgstr ""
"请求量。``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度"
",默认值为 ``16`` 。"

#: ../../source/user_guide/continuous_batching.rst:95
msgid ""
"This feature is still in the experimental stage, and we welcome your "
"active feedback on any issues."
msgstr "此功能仍处于实验阶段,欢迎反馈任何问题。"

#: ../../source/user_guide/continuous_batching.rst:97
msgid ""
"After a period of testing, this method will remain enabled by default, "
"and the original inference method will be deprecated."
msgstr ""
"一段时间的测试之后,此功能将代替原来的 transformers 推理逻辑成为默认行为"
"。原来的推理逻辑将被摒弃。"

38 changes: 30 additions & 8 deletions doc/source/user_guide/continuous_batching.rst
@@ -1,14 +1,17 @@
.. _user_guide_continuous_batching:

==================================
Continuous Batching (experimental)
==================================
===================
Continuous Batching
===================

Continuous batching, as a means to improve throughput during model serving, has already been implemented in inference engines like ``VLLM``.
Xinference aims to provide this optimization capability when using the transformers engine as well.

Usage
=====

LLM
---
Currently, this feature can be enabled under the following conditions:

* First, set the environment variable ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting xinference. For example:
@@ -18,6 +21,12 @@ Currently, this feature can be enabled under the following conditions:
    XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1 xinference-local --log-level debug

.. note::

    Since ``v0.16.0``, this feature is turned on by default, and setting the
    ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` environment variable is no longer required.
    This environment variable has been removed.


* Then, ensure that the ``transformers`` engine is selected when launching the model. For example:

.. tabs::
@@ -58,6 +67,20 @@
Once this feature is enabled, all requests for LLMs will be managed by continuous batching,
and the average throughput of requests made to a single model will increase.
The usage of the LLM interface remains exactly the same as before, with no differences.
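
For illustration, a minimal client-side sketch (the endpoint and model UID are
placeholders, and this assumes a generate-capable LLM launched with the
``transformers`` engine):

.. code-block:: python

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")
    model = client.get_model("<model_uid>")
    # An ordinary call; concurrent requests are batched transparently server-side.
    completion = model.generate("Hello, world!", generate_config={"max_tokens": 64})
    print(completion)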

Image Model
-----------
Currently, for image models, only the ``text_to_image`` interface is supported for ``FLUX.1`` series models.

Enabling this feature requires setting the environment variable ``XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE``, which indicates the ``size`` of the generated images.

For example, starting xinference like this:

.. code-block::

    XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE=1024*1024 xinference-local --log-level debug

Then just use the ``text_to_image`` interface as before, and nothing else needs to be changed.
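
A minimal sketch of such a call (the endpoint and model UID are placeholders;
it assumes batched requests share the size fixed at startup, so the requested
``size`` matches the environment variable):

.. code-block:: python

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")
    model = client.get_model("<image_model_uid>")
    # The requested size matches XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE
    # configured at startup, so this request can join the shared batch.
    result = model.text_to_image("an astronaut riding a horse", size="1024*1024")
    print(result)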

Abort your request
==================
@@ -81,17 +104,16 @@ In this mode, you can abort requests that are in the process of inference.
    client.abort_request("<model_uid>", "<your_unique_request_id>")

Note that if your request has already finished, aborting the request will be a no-op.
Image models also support this feature.
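
Putting the two steps together, a sketch of the full flow (the request ID and
model UID are placeholders; because ``generate`` blocks, the abort is issued
from a timer thread):

.. code-block:: python

    import threading

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")
    model = client.get_model("<model_uid>")

    # Abort the tagged request shortly after it starts.
    threading.Timer(
        1.0, client.abort_request, args=("<model_uid>", "<your_unique_request_id>")
    ).start()

    # Tag the request via generate_config so it can be aborted by its ID.
    model.generate(
        "Write a very long story about continuous batching.",
        generate_config={"request_id": "<your_unique_request_id>", "max_tokens": 2048},
    )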

Note
====

* Currently, this feature only supports the ``generate``, ``chat`` and ``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not supported.
* Currently, for ``LLM`` models, this feature only supports the ``generate``, ``chat``, ``tool call`` and ``vision`` tasks.

* Currently, for ``image`` models, this feature only supports the ``text_to_image`` task. Only ``FLUX.1`` series models are supported.

* For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, ``glm-4v`` and ``MiniCPM-V-2.6`` (only for image tasks) models are supported. More models will be supported in the future. Please let us know your requirements.

* If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model.
The ``launch_model`` interface provides the ``max_num_seqs`` parameter to adjust the concurrency level, with a default value of ``16`` (see the sketch after this list).

* This feature is still in the experimental stage, and we welcome your active feedback on any issues.

* After a period of testing, this method will remain enabled by default, and the original inference method will be deprecated.
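
A sketch of adjusting that concurrency knob at launch time (the model name,
engine, and size are examples only, not prescriptions):

.. code-block:: python

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")
    # Cap the number of concurrently batched sequences at 8 (the default is 16).
    model_uid = client.launch_model(
        model_name="qwen2-instruct",
        model_engine="transformers",
        model_size_in_billions=7,
        max_num_seqs=8,
    )
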
4 changes: 4 additions & 0 deletions xinference/constants.py
@@ -28,6 +28,7 @@
XINFERENCE_ENV_DISABLE_HEALTH_CHECK = "XINFERENCE_DISABLE_HEALTH_CHECK"
XINFERENCE_ENV_DISABLE_METRICS = "XINFERENCE_DISABLE_METRICS"
XINFERENCE_ENV_DOWNLOAD_MAX_ATTEMPTS = "XINFERENCE_DOWNLOAD_MAX_ATTEMPTS"
XINFERENCE_ENV_TEXT_TO_IMAGE_BATCHING_SIZE = "XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE"


def get_xinference_home() -> str:
@@ -82,3 +83,6 @@ def get_xinference_home() -> str:
XINFERENCE_DOWNLOAD_MAX_ATTEMPTS = int(
os.environ.get(XINFERENCE_ENV_DOWNLOAD_MAX_ATTEMPTS, 3)
)
XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE = os.environ.get(
XINFERENCE_ENV_TEXT_TO_IMAGE_BATCHING_SIZE, None
)
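
The constant holds the raw string form (e.g. ``1024*1024``) or ``None``. A hypothetical helper, not part of this commit, showing how a consumer might turn it into a width/height pair:

from typing import Optional, Tuple

def parse_text_to_image_batching_size(value: Optional[str]) -> Optional[Tuple[int, int]]:
    # Turn a size string such as "1024*1024" into a (width, height) pair.
    if value is None:
        return None
    width, height = (int(part.strip()) for part in value.split("*"))
    return width, height

assert parse_text_to_image_batching_size("1024*1024") == (1024, 1024)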
