fix: Fixed spaces being incorrectly treated as split points when sending Western-character sentences #487

Dax233 · 2025-03-19T02:47:59Z

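🔴 The project is currently being refactored (since 2025.3.14)
✅ Accepted: bug fixes directly related to main; submit them to the main-fix branch
⚠️ Frozen: all new feature development and non-urgent refactoring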

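Please fill in the following

(Remove the space inside the square brackets and replace it with a lowercase x)

- Modifying the main branch is prohibited; please confirm that this submission does not target the main branch
- This update contains breaking changes (e.g., database schema changes, configuration file modifications)
- This update has been tested
- Please do not add a group_id field to the database; doing so would affect this project's compatibility with other platforms
Please describe the breaking changes in detail (if any):
Please briefly describe the content and purpose of this update:
Adds detection of Western-character paragraphs. When a paragraph consists of Western characters, spaces are no longer used as separators. This fixes the issue where, when the bot sent text in Western scripts such as English or Russian, the spaces between words were wrongly treated as split points and the text was sent as multiple messages.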

Other information

Related issue: Close #
Screenshots/GIF:
Additional information:

Summary by Sourcery

Bug Fixes:

Fixes an issue where spaces in Western character sentences were incorrectly split when sending messages with the bot.

sourcery-ai · 2025-03-19T02:48:07Z

Reviewer's Guide by Sourcery

This pull request introduces changes to correctly handle sentences containing Western characters (e.g., English, Russian). It prevents the incorrect splitting of words separated by spaces when the bot sends messages. The implementation involves detecting Western character paragraphs and using a different splitting strategy that relies on a temporary separator instead of spaces. Additionally, the maximum response length for Western character paragraphs has been adjusted to accommodate longer sentences.
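The detection step described above could look roughly like the following. This is a minimal sketch and not the PR's actual code: it assumes "Western" means Latin or Cyrillic letters and uses the standard `unicodedata` module to classify them, whereas the real `is_western_char` and `is_western_paragraph` in `src/plugins/chat/utils.py` may use a different rule.

```python
import unicodedata


def is_western_char(char: str) -> bool:
    """Assumed rule: a letter whose Unicode name starts with LATIN or CYRILLIC."""
    if not char.isalpha():
        return False
    return unicodedata.name(char, "").startswith(("LATIN", "CYRILLIC"))


def is_western_paragraph(paragraph: str) -> bool:
    """Assumed rule: the text has at least one letter and all letters are Western."""
    letters = [c for c in paragraph if c.isalpha()]
    return bool(letters) and all(is_western_char(c) for c in letters)


print(is_western_paragraph("Hello, world!"))  # True
print(is_western_paragraph("Привет, мир!"))   # True
print(is_western_paragraph("你好 world"))      # False
```

Digits, punctuation, and whitespace are ignored in this sketch, so only letters decide the outcome. Treating mixed Chinese and English text as non-Western is a choice made here for illustration.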

Sequence diagram for splitting a sentence with Western characters

sequenceDiagram
  participant User
  participant split_into_sentences_w_remove_punctuation
  participant is_western_paragraph

  User->>split_into_sentences_w_remove_punctuation: Send text with Western characters
  activate split_into_sentences_w_remove_punctuation
  split_into_sentences_w_remove_punctuation->>is_western_paragraph: Check if the text is a Western paragraph
  activate is_western_paragraph
  is_western_paragraph-->>split_into_sentences_w_remove_punctuation: Return True
  deactivate is_western_paragraph
  split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Replace punctuation with temporary separator
  split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Split the text using the temporary separator
  split_into_sentences_w_remove_punctuation-->>User: Return list of sentences
  deactivate split_into_sentences_w_remove_punctuation
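
The two self-calls in the diagram (replace punctuation with a temporary separator, then split on it) can be sketched as below. The separator `|seg|` is taken from the review comment further down; the function name `split_western_text` and the punctuation set are assumptions made for this example, and the probabilistic merging done by the real function is left out.

```python
import re
from typing import List

# Temporary separator quoted in the review comment; assumed not to occur in normal text.
SEPARATOR = "|seg|"


def split_western_text(text: str) -> List[str]:
    """Split on sentence punctuation only, so spaces between words stay intact."""
    # Assumed punctuation set: runs of . ! ? ; together with any whitespace that follows.
    marked = re.sub(r"[.!?;]+\s*", SEPARATOR, text)
    return [part.strip() for part in marked.split(SEPARATOR) if part.strip()]


print(split_western_text("Hello there. How are you today? Fine, thanks"))
# ['Hello there', 'How are you today', 'Fine, thanks']
```

Because the split happens on the separator rather than on whitespace, a sentence such as "How are you today" stays in one piece instead of being sent word by word.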

Updated class diagram for text processing functions

classDiagram
  class utils {
    +split_into_sentences_w_remove_punctuation(text: str) : List[str]
    +process_llm_response(text: str) : List[str]
    +is_western_char(char) : bool
    +is_western_paragraph(paragraph) : bool
  }

  note for utils.split_into_sentences_w_remove_punctuation "Modified to handle Western character paragraphs"
  note for utils.process_llm_response "Modified to adjust max length for Western character paragraphs"
  note for utils.is_western_char "New function to detect Western characters"
  note for utils.is_western_paragraph "New function to detect Western character paragraphs"

File-Level Changes

Change: Added functionality to handle Western character paragraphs differently, preventing incorrect splitting of words separated by spaces.
Details:
- Added `is_western_char` function to detect Western characters.
- Added `is_western_paragraph` function to detect Western paragraphs.
- Modified `split_into_sentences_w_remove_punctuation` to handle Western paragraphs by using a temporary separator `|seg|` …

sourcery-ai

Hey @Dax233 - I've reviewed your changes - here's some feedback:

Overall Comments:

Consider extracting the repeated is_western_paragraph checks into a variable to improve readability.
It might be helpful to add a unit test for the is_western_paragraph function.
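
A unit test along the lines suggested above might look like this sketch. The `is_western_paragraph` defined here is only a stand-in so that the example runs on its own; in the project the test would import the real function from `src/plugins/chat/utils.py`, and the expected result for mixed-script text is an assumption that should be checked against the real implementation.

```python
import unicodedata
import unittest


def is_western_paragraph(paragraph: str) -> bool:
    """Stand-in for the PR's function (assumed rule: all letters are Latin or Cyrillic)."""
    letters = [c for c in paragraph if c.isalpha()]
    return bool(letters) and all(
        unicodedata.name(c, "").startswith(("LATIN", "CYRILLIC")) for c in letters
    )


class IsWesternParagraphTest(unittest.TestCase):
    def test_english_is_western(self):
        self.assertTrue(is_western_paragraph("Hello, how are you?"))

    def test_russian_is_western(self):
        self.assertTrue(is_western_paragraph("Привет, как дела?"))

    def test_chinese_is_not_western(self):
        self.assertFalse(is_western_paragraph("你好，今天天气不错"))

    def test_mixed_text_is_not_western(self):
        # Assumed behaviour; adjust if the real function treats mixed text differently.
        self.assertFalse(is_western_paragraph("你好 world"))
```

Saved as a test module, this can be run with `python -m unittest`.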

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

sourcery-ai · 2025-03-19T02:48:54Z

src/plugins/chat/utils.py

@@ -276,21 +282,29 @@ def split_into_sentences_w_remove_punctuation(text: str) -> List[str]:
    for sentence in sentences:
        parts = sentence.split('，')


issue (complexity): Consider extracting the common splitting and replacement operations into helper functions to reduce nested conditionals and duplicated logic. For example, a single function could process the parts of a sentence for a given splitter and split strength, and be applied to both Chinese and Western paragraphs.

It looks like the added branching and duplicated loops could benefit from extracting the common splitting/replacement operations into helper functions. This will reduce nested conditionals and duplicate logic while keeping functionality intact.

For example, you can extract the loop that processes parts into a helper function:

    import random

    def process_parts(parts, split_strength, splitter):
        current = parts[0]
        new_sentences = []
        for part in parts[1:]:
            # With probability split_strength, end the current sentence here
            if random.random() < split_strength:
                new_sentences.append(current.strip())
                current = part
            else:
                # Otherwise rejoin the part using the original splitter
                current += splitter + part
        new_sentences.append(current.strip())
        return new_sentences

Then refactor the core function to use it:

    # For Chinese paragraphs:
    parts = sentence.split('，')
    new_sentences = process_parts(parts, split_strength, '，')

    # For Western paragraphs:
    parts = current_sentence.split('|seg|')
    new_sentences = process_parts(parts, split_strength, '|seg|')

Also consider a similar helper for the punctuation removal loops if their logic is near-identical. This centralizes the logic and reduces cognitive overhead.
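
To make the role of `split_strength` concrete, here is a small runnable sketch of the proposed helper at its two extremes. The helper body follows the review suggestion with type hints added; the sample sentence is made up for illustration. Since `random.random()` returns a value in [0.0, 1.0), a strength of 0.0 never splits and a strength of 1.0 always splits.

```python
import random
from typing import List


def process_parts(parts: List[str], split_strength: float, splitter: str) -> List[str]:
    """Helper proposed in the review: split between parts with probability split_strength."""
    current = parts[0]
    new_sentences = []
    for part in parts[1:]:
        if random.random() < split_strength:
            new_sentences.append(current.strip())
            current = part
        else:
            current += splitter + part
    new_sentences.append(current.strip())
    return new_sentences


# Illustrative input; str.split always returns at least one element, so parts[0] is safe.
chinese_parts = "你好，今天天气不错，出去走走".split("，")

print(process_parts(chinese_parts, 0.0, "，"))  # ['你好，今天天气不错，出去走走']
print(process_parts(chinese_parts, 1.0, "，"))  # ['你好', '今天天气不错', '出去走走']
```

For values between the extremes the output varies from run to run, which is why tests of the real function would likely need a seeded or injected random source.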

Dax233 and others added 3 commits March 19, 2025 10:08

Fix incorrect line breaks in Western-character sentences

4e73f66

Fix incorrect line breaks for Western characters

50d2239

Fix bug where Western-character sentences were split incorrectly

61007ff

sourcery-ai bot reviewed Mar 19, 2025

modified: src/plugins/chat/utils.py

65c26af

HYY1116 approved these changes Mar 21, 2025
