
Commit 57e3d94

Update 2025-03-14 03:54:51

1 parent 3f9ec70 · commit 57e3d94

34 files changed: +4664 -4403 lines

_sources/backend/function_calling.ipynb         +117 -119
_sources/backend/native_api.ipynb               +230 -217
_sources/backend/offline_engine_api.ipynb       +441 -443
_sources/backend/openai_api_completions.ipynb   +197 -172
_sources/backend/openai_api_embeddings.ipynb    +61 -61
_sources/backend/openai_api_vision.ipynb        +101 -99
_sources/backend/send_request.ipynb             +102 -77
_sources/backend/separate_reasoning.ipynb       +113 -112
_sources/backend/speculative_decoding.ipynb     +165 -166
_sources/backend/structured_outputs.ipynb       +130 -126
_sources/frontend/frontend.ipynb                +237 -185
backend/function_calling.html                   +48 -56
backend/function_calling.ipynb                  +117 -119
backend/native_api.html                         +141 -140
backend/native_api.ipynb                        +230 -217
backend/offline_engine_api.html                 +53 -39
backend/offline_engine_api.ipynb                +441 -443
backend/openai_api_completions.html             +114 -105
backend/openai_api_completions.ipynb            +197 -172

Large diffs for the files above are not rendered by default.

backend/openai_api_embeddings.html              +37 -37
@@ -481,39 +481,39 @@ <h2>Launch A Server<a class="headerlink" href="#Launch-A-Server" title="Link to
 </div>
 <div class="output_area docutils container">
 <div class="highlight"><pre>
-[2025-03-14 01:02:18] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, revision=None, host='0.0.0.0', port=37830, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=986969624, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
-[2025-03-14 01:02:23] Downcasting torch.float32 to torch.float16.
-[2025-03-14 01:02:39 TP0] Downcasting torch.float32 to torch.float16.
-[2025-03-14 01:02:39 TP0] Overlap scheduler is disabled for embedding models.
-[2025-03-14 01:02:39 TP0] Downcasting torch.float32 to torch.float16.
-[2025-03-14 01:02:39 TP0] Init torch distributed begin.
-[2025-03-14 01:02:40 TP0] Init torch distributed ends. mem usage=1.16 GB
-[2025-03-14 01:02:40 TP0] Load weight begin. avail mem=58.16 GB
-[2025-03-14 01:02:40 TP0] The following error message 'operation scheduled before its operands' can be ignored.
-[2025-03-14 01:02:40 TP0] Using model weights format ['*.safetensors']
+[2025-03-14 03:46:12] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, revision=None, host='0.0.0.0', port=39092, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=37673618, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
+[2025-03-14 03:46:18] Downcasting torch.float32 to torch.float16.
+[2025-03-14 03:46:31 TP0] Downcasting torch.float32 to torch.float16.
+[2025-03-14 03:46:31 TP0] Overlap scheduler is disabled for embedding models.
+[2025-03-14 03:46:31 TP0] Downcasting torch.float32 to torch.float16.
+[2025-03-14 03:46:31 TP0] Init torch distributed begin.
+[2025-03-14 03:46:31 TP0] Init torch distributed ends. mem usage=0.02 GB
+[2025-03-14 03:46:31 TP0] Load weight begin. avail mem=63.16 GB
+[2025-03-14 03:46:31 TP0] The following error message 'operation scheduled before its operands' can be ignored.
+[2025-03-14 03:46:32 TP0] Using model weights format ['*.safetensors']
 Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
-Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:01<00:09, 1.51s/it]
-Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:03<00:09, 1.86s/it]
-Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:04<00:05, 1.48s/it]
-Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:06<00:04, 1.65s/it]
-Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:08<00:03, 1.73s/it]
+Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:01<00:09, 1.55s/it]
+Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:03<00:09, 1.88s/it]
+Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:04<00:06, 1.53s/it]
+Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:06<00:05, 1.68s/it]
+Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:08<00:03, 1.76s/it]
 Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:10<00:01, 1.83s/it]
-Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00, 1.86s/it]
+Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00, 1.83s/it]
 Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00, 1.77s/it]
 
-[2025-03-14 01:02:53 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=28.27 GB, mem usage=29.89 GB.
-[2025-03-14 01:02:53 TP0] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
-[2025-03-14 01:02:53 TP0] Memory pool end. avail mem=26.89 GB
-[2025-03-14 01:02:53 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
-[2025-03-14 01:02:54] INFO: Started server process [3422886]
-[2025-03-14 01:02:54] INFO: Waiting for application startup.
-[2025-03-14 01:02:54] INFO: Application startup complete.
-[2025-03-14 01:02:54] INFO: Uvicorn running on http://0.0.0.0:37830 (Press CTRL+C to quit)
-[2025-03-14 01:02:54] INFO: 127.0.0.1:53690 - "GET /v1/models HTTP/1.1" 200 OK
-[2025-03-14 01:02:55] INFO: 127.0.0.1:53696 - "GET /get_model_info HTTP/1.1" 200 OK
-[2025-03-14 01:02:55 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
-[2025-03-14 01:02:56] INFO: 127.0.0.1:53706 - "POST /encode HTTP/1.1" 200 OK
-[2025-03-14 01:02:56] The server is fired up and ready to roll!
+[2025-03-14 03:46:45 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=28.27 GB, mem usage=34.90 GB.
+[2025-03-14 03:46:45 TP0] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
+[2025-03-14 03:46:45 TP0] Memory pool end. avail mem=26.89 GB
+[2025-03-14 03:46:45 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
+[2025-03-14 03:46:45] INFO: Started server process [284110]
+[2025-03-14 03:46:45] INFO: Waiting for application startup.
+[2025-03-14 03:46:45] INFO: Application startup complete.
+[2025-03-14 03:46:45] INFO: Uvicorn running on http://0.0.0.0:39092 (Press CTRL+C to quit)
+[2025-03-14 03:46:46] INFO: 127.0.0.1:60148 - "GET /v1/models HTTP/1.1" 200 OK
+[2025-03-14 03:46:46] INFO: 127.0.0.1:60150 - "GET /get_model_info HTTP/1.1" 200 OK
+[2025-03-14 03:46:46 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
+[2025-03-14 03:46:48] INFO: 127.0.0.1:60158 - "POST /encode HTTP/1.1" 200 OK
+[2025-03-14 03:46:48] The server is fired up and ready to roll!
 </pre></div></div>
 </div>
 <div class="nboutput nblast docutils container">
@@ -549,8 +549,8 @@ <h2>Using cURL<a class="headerlink" href="#Using-cURL" title="Link to this headi
 </div>
 <div class="output_area docutils container">
 <div class="highlight"><pre>
-[2025-03-14 01:02:59 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
-[2025-03-14 01:02:59] INFO: 127.0.0.1:35462 - "POST /v1/embeddings HTTP/1.1" 200 OK
+[2025-03-14 03:46:51 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
+[2025-03-14 03:46:51] INFO: 127.0.0.1:42728 - "POST /v1/embeddings HTTP/1.1" 200 OK
 </pre></div></div>
 </div>
 <div class="nboutput nblast docutils container">
@@ -586,8 +586,8 @@ <h2>Using Python Requests<a class="headerlink" href="#Using-Python-Requests" tit
 </div>
 <div class="output_area docutils container">
 <div class="highlight"><pre>
-[2025-03-14 01:02:59 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
-[2025-03-14 01:02:59] INFO: 127.0.0.1:35472 - "POST /v1/embeddings HTTP/1.1" 200 OK
+[2025-03-14 03:46:51 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
+[2025-03-14 03:46:51] INFO: 127.0.0.1:42740 - "POST /v1/embeddings HTTP/1.1" 200 OK
 </pre></div></div>
 </div>
 <div class="nboutput nblast docutils container">
@@ -623,8 +623,8 @@ <h2>Using OpenAI Python Client<a class="headerlink" href="#Using-OpenAI-Python-C
 </div>
 <div class="output_area docutils container">
 <div class="highlight"><pre>
-[2025-03-14 01:03:00 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
-[2025-03-14 01:03:00] INFO: 127.0.0.1:35480 - "POST /v1/embeddings HTTP/1.1" 200 OK
+[2025-03-14 03:46:51 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
+[2025-03-14 03:46:51] INFO: 127.0.0.1:42754 - "POST /v1/embeddings HTTP/1.1" 200 OK
 </pre></div></div>
 </div>
 <div class="nboutput nblast docutils container">
@@ -666,8 +666,8 @@ <h2>Using Input IDs<a class="headerlink" href="#Using-Input-IDs" title="Link to
 </div>
 <div class="output_area docutils container">
 <div class="highlight"><pre>
-[2025-03-14 01:03:05 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
-[2025-03-14 01:03:05] INFO: 127.0.0.1:35486 - "POST /v1/embeddings HTTP/1.1" 200 OK
+[2025-03-14 03:46:57 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,
+[2025-03-14 03:46:57] INFO: 127.0.0.1:42758 - "POST /v1/embeddings HTTP/1.1" 200 OK
 </pre></div></div>
 </div>
 <div class="nboutput nblast docutils container">
