Commit 5c23e2b

Update 2025-03-13 23:03:19
1 parent 76cbb82 commit 5c23e2b

34 files changed, with 4526 additions and 4421 deletions

_sources/backend/function_calling.ipynb

+116-110
Large diffs are not rendered by default.

_sources/backend/native_api.ipynb

+231-219
Large diffs are not rendered by default.

_sources/backend/offline_engine_api.ipynb

+433-428
Large diffs are not rendered by default.

_sources/backend/openai_api_completions.ipynb

+181-166
Large diffs are not rendered by default.

_sources/backend/openai_api_embeddings.ipynb

+63-69
Large diffs are not rendered by default.

_sources/backend/openai_api_vision.ipynb

+87-92
Large diffs are not rendered by default.

_sources/backend/send_request.ipynb

+76-90
Large diffs are not rendered by default.

_sources/backend/separate_reasoning.ipynb

+126-124
Large diffs are not rendered by default.

_sources/backend/speculative_decoding.ipynb

+170-152
Large diffs are not rendered by default.

_sources/backend/structured_outputs.ipynb

+117-117
Large diffs are not rendered by default.

_sources/frontend/frontend.ipynb

+231-215
Large diffs are not rendered by default.

backend/function_calling.html

+48-48
Large diffs are not rendered by default.

backend/function_calling.ipynb

+116-110
Large diffs are not rendered by default.

backend/native_api.html

+139-139
Large diffs are not rendered by default.

backend/native_api.ipynb

+231-219
Large diffs are not rendered by default.

backend/offline_engine_api.html

+48-47
Large diffs are not rendered by default.

backend/offline_engine_api.ipynb

+433-428
Large diffs are not rendered by default.

backend/openai_api_completions.html

+114-110
Large diffs are not rendered by default.

backend/openai_api_completions.ipynb

+181-166
Large diffs are not rendered by default.

backend/openai_api_embeddings.html

+40-40
Original file line number | Diff line number | Diff line change
@@ -481,39 +481,39 @@ <h2>Launch A Server<a class="headerlink" href="#Launch-A-Server" title="Link to
481 481
</div>
482 482
<div class="output_area docutils container">
483 483
<div class="highlight"><pre>
484-
[2025-03-13 21:58:22] server_args=ServerArgs(model_path=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, tokenizer_path=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, tokenizer_mode=&#39;auto&#39;, skip_tokenizer_init=False, load_format=&#39;auto&#39;, trust_remote_code=False, dtype=&#39;auto&#39;, kv_cache_dtype=&#39;auto&#39;, quantization=None, quantization_param_path=None, context_length=None, device=&#39;cuda&#39;, served_model_name=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, chat_template=None, is_embedding=True, revision=None, host=&#39;0.0.0.0&#39;, port=37446, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy=&#39;fcfs&#39;, schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=162798959, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level=&#39;info&#39;, log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path=&#39;sglang_storage&#39;, enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method=&#39;round_robin&#39;, ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args=&#39;{}&#39;, lora_paths=None, max_loras_per_batch=8, lora_backend=&#39;triton&#39;, attention_backend=&#39;flashinfer&#39;, sampling_backend=&#39;flashinfer&#39;, grammar_backend=&#39;outlines&#39;, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type=&#39;qk&#39;, 
ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config=&#39;&#39;, enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
485-
[2025-03-13 21:58:28] Downcasting torch.float32 to torch.float16.
486-
[2025-03-13 21:58:42 TP0] Downcasting torch.float32 to torch.float16.
487-
[2025-03-13 21:58:42 TP0] Overlap scheduler is disabled for embedding models.
488-
[2025-03-13 21:58:42 TP0] Downcasting torch.float32 to torch.float16.
489-
[2025-03-13 21:58:42 TP0] Init torch distributed begin.
490-
[2025-03-13 21:58:43 TP0] Init torch distributed ends. mem usage=0.00 GB
491-
[2025-03-13 21:58:43 TP0] Load weight begin. avail mem=62.69 GB
492-
[2025-03-13 21:58:43 TP0] The following error message &#39;operation scheduled before its operands&#39; can be ignored.
493-
[2025-03-13 21:58:43 TP0] Using model weights format [&#39;*.safetensors&#39;]
484+
[2025-03-13 22:54:38] server_args=ServerArgs(model_path=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, tokenizer_path=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, tokenizer_mode=&#39;auto&#39;, skip_tokenizer_init=False, load_format=&#39;auto&#39;, trust_remote_code=False, dtype=&#39;auto&#39;, kv_cache_dtype=&#39;auto&#39;, quantization=None, quantization_param_path=None, context_length=None, device=&#39;cuda&#39;, served_model_name=&#39;Alibaba-NLP/gte-Qwen2-7B-instruct&#39;, chat_template=None, is_embedding=True, revision=None, host=&#39;0.0.0.0&#39;, port=32349, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy=&#39;fcfs&#39;, schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=499511450, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level=&#39;info&#39;, log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path=&#39;sglang_storage&#39;, enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method=&#39;round_robin&#39;, ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args=&#39;{}&#39;, lora_paths=None, max_loras_per_batch=8, lora_backend=&#39;triton&#39;, attention_backend=&#39;flashinfer&#39;, sampling_backend=&#39;flashinfer&#39;, grammar_backend=&#39;outlines&#39;, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type=&#39;qk&#39;, 
ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config=&#39;&#39;, enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
485+
[2025-03-13 22:54:43] Downcasting torch.float32 to torch.float16.
486+
[2025-03-13 22:54:57 TP0] Downcasting torch.float32 to torch.float16.
487+
[2025-03-13 22:54:57 TP0] Overlap scheduler is disabled for embedding models.
488+
[2025-03-13 22:54:57 TP0] Downcasting torch.float32 to torch.float16.
489+
[2025-03-13 22:54:57 TP0] Init torch distributed begin.
490+
[2025-03-13 22:54:57 TP0] Init torch distributed ends. mem usage=0.00 GB
491+
[2025-03-13 22:54:57 TP0] Load weight begin. avail mem=62.69 GB
492+
[2025-03-13 22:54:58 TP0] The following error message &#39;operation scheduled before its operands&#39; can be ignored.
493+
[2025-03-13 22:54:58 TP0] Using model weights format [&#39;*.safetensors&#39;]
494 494
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00&lt;?, ?it/s]
495-
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:01&lt;00:10, 1.79s/it]
496-
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:03&lt;00:09, 1.88s/it]
497-
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:05&lt;00:07, 1.86s/it]
498-
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:07&lt;00:05, 1.85s/it]
499-
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:09&lt;00:03, 1.85s/it]
500-
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:10&lt;00:01, 1.53s/it]
501-
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:11&lt;00:00, 1.47s/it]
502-
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:11&lt;00:00, 1.64s/it]
503-
504-
[2025-03-13 21:58:55 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=28.27 GB, mem usage=34.42 GB.
505-
[2025-03-13 21:58:55 TP0] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
506-
[2025-03-13 21:58:55 TP0] Memory pool end. avail mem=26.90 GB
507-
[2025-03-13 21:58:56 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
508-
[2025-03-13 21:58:56] INFO: Started server process [3156713]
509-
[2025-03-13 21:58:56] INFO: Waiting for application startup.
510-
[2025-03-13 21:58:56] INFO: Application startup complete.
511-
[2025-03-13 21:58:56] INFO: Uvicorn running on http://0.0.0.0:37446 (Press CTRL+C to quit)
512-
[2025-03-13 21:58:56] INFO: 127.0.0.1:52396 - &#34;GET /v1/models HTTP/1.1&#34; 200 OK
513-
[2025-03-13 21:58:57] INFO: 127.0.0.1:52398 - &#34;GET /get_model_info HTTP/1.1&#34; 200 OK
514-
[2025-03-13 21:58:57 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
515-
[2025-03-13 21:58:58] INFO: 127.0.0.1:52402 - &#34;POST /encode HTTP/1.1&#34; 200 OK
516-
[2025-03-13 21:58:58] The server is fired up and ready to roll!
495+
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:01&lt;00:08, 1.39s/it]
496+
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:03&lt;00:08, 1.68s/it]
497+
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:04&lt;00:05, 1.35s/it]
498+
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:05&lt;00:04, 1.50s/it]
499+
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:07&lt;00:03, 1.63s/it]
500+
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:10&lt;00:01, 1.82s/it]
501+
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12&lt;00:00, 1.91s/it]
502+
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12&lt;00:00, 1.73s/it]
503+
504+
[2025-03-13 22:55:10 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=28.27 GB, mem usage=34.42 GB.
505+
[2025-03-13 22:55:10 TP0] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
506+
[2025-03-13 22:55:10 TP0] Memory pool end. avail mem=26.90 GB
507+
[2025-03-13 22:55:11 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
508+
[2025-03-13 22:55:11] INFO: Started server process [3185084]
509+
[2025-03-13 22:55:11] INFO: Waiting for application startup.
510+
[2025-03-13 22:55:11] INFO: Application startup complete.
511+
[2025-03-13 22:55:11] INFO: Uvicorn running on http://0.0.0.0:32349 (Press CTRL+C to quit)
512+
[2025-03-13 22:55:11] INFO: 127.0.0.1:33278 - &#34;GET /v1/models HTTP/1.1&#34; 200 OK
513+
[2025-03-13 22:55:12] INFO: 127.0.0.1:33280 - &#34;GET /get_model_info HTTP/1.1&#34; 200 OK
514+
[2025-03-13 22:55:12 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
515+
[2025-03-13 22:55:13] INFO: 127.0.0.1:33288 - &#34;POST /encode HTTP/1.1&#34; 200 OK
516+
[2025-03-13 22:55:13] The server is fired up and ready to roll!
517 517
</pre></div></div>
518 518
</div>
519 519
<div class="nboutput nblast docutils container">
@@ -549,8 +549,8 @@ <h2>Using cURL<a class="headerlink" href="#Using-cURL" title="Link to this headi
549 549
</div>
550 550
<div class="output_area docutils container">
551 551
<div class="highlight"><pre>
552-
[2025-03-13 21:59:01 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
553-
[2025-03-13 21:59:01] INFO: 127.0.0.1:55556 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
552+
[2025-03-13 22:55:17 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
553+
[2025-03-13 22:55:17] INFO: 127.0.0.1:33296 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
554 554
</pre></div></div>
555 555
</div>
556 556
<div class="nboutput nblast docutils container">
@@ -586,8 +586,8 @@ <h2>Using Python Requests<a class="headerlink" href="#Using-Python-Requests" tit
586 586
</div>
587 587
<div class="output_area docutils container">
588 588
<div class="highlight"><pre>
589-
[2025-03-13 21:59:01 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
590-
[2025-03-13 21:59:01] INFO: 127.0.0.1:55558 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
589+
[2025-03-13 22:55:17 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
590+
[2025-03-13 22:55:17] INFO: 127.0.0.1:33298 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
591 591
</pre></div></div>
592 592
</div>
593 593
<div class="nboutput nblast docutils container">
@@ -623,8 +623,8 @@ <h2>Using OpenAI Python Client<a class="headerlink" href="#Using-OpenAI-Python-C
623 623
</div>
624 624
<div class="output_area docutils container">
625 625
<div class="highlight"><pre>
626-
[2025-03-13 21:59:01 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
627-
[2025-03-13 21:59:02] INFO: 127.0.0.1:55562 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
626+
[2025-03-13 22:55:17 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
627+
[2025-03-13 22:55:17] INFO: 127.0.0.1:33306 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
628 628
</pre></div></div>
629 629
</div>
630 630
<div class="nboutput nblast docutils container">
@@ -666,8 +666,8 @@ <h2>Using Input IDs<a class="headerlink" href="#Using-Input-IDs" title="Link to
666 666
</div>
667 667
<div class="output_area docutils container">
668 668
<div class="highlight"><pre>
669-
[2025-03-13 21:59:07 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
670-
[2025-03-13 21:59:07] INFO: 127.0.0.1:55572 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
669+
[2025-03-13 22:55:22 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
670+
[2025-03-13 22:55:22] INFO: 127.0.0.1:50782 - &#34;POST /v1/embeddings HTTP/1.1&#34; 200 OK
671 671
</pre></div></div>
672 672
</div>
673 673
<div class="nboutput nblast docutils container">
