
Commit d31f15f

Fix duplicate PDF buttons and update abstracts
1 parent 283e38e commit d31f15f

4 files changed: +4, -49 lines


content/publications/chronological-thinking/index.md

Lines changed: 1 addition & 25 deletions
@@ -11,38 +11,14 @@ publication_types: ['article']
 publication: arXiv preprint
 publication_short: arXiv
 
-abstract: This work explores chronological thinking capabilities in full-duplex spoken dialogue language models, enabling more natural and coherent conversational interactions.
+abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening-and-speaking design enables real-time interaction, and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches such as Chain-of-Thought, as it is purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
 
 featured: true
 
-# Featured image - suggestions for a beautiful abstract cover:
-# Option 1: colorful fluid art (blue, purple, gold)
-image:
-  url: 'https://images.unsplash.com/photo-1541701494587-cb58502866ab?w=1200&q=80'
-  caption: 'Abstract Art'
-# Option 2: gradient watercolor texture (pink-purple palette)
-# image:
-#   url: 'https://images.unsplash.com/photo-1557672172-298e090bd0f1?w=1200&q=80'
-#   caption: 'Abstract Watercolor'
-# Option 3: abstract light-and-shadow texture (warm tones)
-# image:
-#   url: 'https://images.unsplash.com/photo-1550859492-d5da9d8e45f3?w=1200&q=80'
-#   caption: 'Abstract Light'
-# Option 4: flowing colors (multicolor gradient)
-# image:
-#   url: 'https://images.unsplash.com/photo-1506259091721-347e791bab0f?w=1200&q=80'
-#   caption: 'Fluid Colors'
-# Option 5: abstract smoke texture (blue and orange)
-# image:
-#   url: 'https://images.unsplash.com/photo-1553356084-58ef4a67b2a7?w=1200&q=80'
-#   caption: 'Abstract Smoke'
-
 links:
   - type: pdf
     url: https://arxiv.org/pdf/2510.05150
 
-url_pdf: 'https://arxiv.org/pdf/2510.05150'
-
 # Hide page metadata
 share: false
 show_date: false
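The cause of the duplicate buttons is visible in the hunk above: the front matter declared the PDF twice, once as a `links` entry of `type: pdf` and once via the top-level `url_pdf` field. In Hugo publication themes of the Wowchemy/Hugo Blox family (an assumption; the commit does not name the theme), each of these independently renders a PDF button, so listing both yields two. A sketch of the cleaned-up front matter after this commit:

```yaml
# Hypothetical reconstruction from the diff context above;
# theme-specific keys may vary.
publication: arXiv preprint
publication_short: arXiv

abstract: Recent advances in spoken dialogue language models (SDLMs) ...

featured: true

# Single source of truth for the PDF button.
links:
  - type: pdf
    url: https://arxiv.org/pdf/2510.05150

# Hide page metadata
share: false
show_date: false
```

Keeping only the `links` list also leaves room to add further button types later (e.g. a `type: code` entry) without reintroducing parallel `url_*` fields.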

content/publications/step-audio-2/index.md

Lines changed: 1 addition & 8 deletions
@@ -11,21 +11,14 @@ publication_types: ['article']
 publication: arXiv preprint
 publication_short: arXiv
 
-abstract: Step-Audio 2 is the world's first industrial-grade end-to-end audio LLM with deep thinking capabilities, introducing Chain-of-Thought reasoning and audio reinforcement learning into speech models for the first time.
+abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit this https URL for more information.
 
 featured: true
 
-# Featured image - beautiful abstract visual
-image:
-  url: 'https://images.unsplash.com/photo-1550859492-d5da9d8e45f3?w=1200&q=80'
-  caption: 'Abstract Art'
-
 links:
   - type: pdf
     url: https://arxiv.org/pdf/2507.16632
 
-url_pdf: 'https://arxiv.org/pdf/2507.16632'
-
 # Hide page metadata
 share: false
 show_date: false

content/publications/step-audio-aqaa/index.md

Lines changed: 1 addition & 8 deletions
@@ -11,21 +11,14 @@ publication_types: ['article']
 publication: arXiv preprint
 publication_short: arXiv
 
-abstract: Step-Audio-AQAA presents a fully end-to-end expressive large audio language model, pushing the boundaries of expressive speech synthesis and understanding.
+abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.
 
 featured: true
 
-# Featured image - beautiful abstract visual
-image:
-  url: 'https://images.unsplash.com/photo-1506259091721-347e791bab0f?w=1200&q=80'
-  caption: 'Abstract Art'
-
 links:
   - type: pdf
     url: https://arxiv.org/pdf/2506.08967
 
-url_pdf: 'https://arxiv.org/pdf/2506.08967'
-
 # Hide page metadata
 share: false
 show_date: false

content/publications/step-audio/index.md

Lines changed: 1 addition & 8 deletions
@@ -11,21 +11,14 @@ publication_types: ['article']
 publication: arXiv preprint
 publication_short: arXiv
 
-abstract: Step-Audio represents a unified approach to understanding and generation in intelligent speech interaction systems, advancing the capabilities of audio language models.
+abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at this https URL.
 
 featured: true
 
-# Featured image - beautiful abstract visual
-image:
-  url: 'https://images.unsplash.com/photo-1557672172-298e090bd0f1?w=1200&q=80'
-  caption: 'Abstract Art'
-
 links:
   - type: pdf
     url: https://arxiv.org/pdf/2502.11946
 
-url_pdf: 'https://arxiv.org/pdf/2502.11946'
-
 # Hide page metadata
 share: false
 show_date: false
