fix: Fixed spaces being incorrectly split when sending Western-character sentences #487

Open
wants to merge 4 commits into
base: main-fix

Conversation

Dax233
Contributor

@Dax233 Dax233 commented Mar 19, 2025

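  • 🔴 The project is currently being refactored (since 2025.3.14)
  • ✅ Accepted: bug fixes directly related to main, submitted to the main-fix branch
  • ⚠️ Frozen: all new feature development and non-urgent refactoring

Please fill in the following

(Delete the space inside the square brackets and replace it with a lowercase x)

    • Modifying the main branch is prohibited; please confirm that the target branch of this submission is not main
    • This update contains breaking changes (such as database schema changes or configuration file modifications)
    • Whether this update has been tested
    • Do not add a group_id field to the database; doing so would affect this project's compatibility with other platforms
  1. Please describe the breaking changes (if any):
  2. Please briefly describe the content and purpose of this update:
    Added detection of Western-character paragraphs. When a paragraph is Western-character text, spaces are no longer used as separators. This fixes the issue where, when the bot sent Western-character text such as English or Russian, the spaces between words caused the message to be incorrectly split into multiple messages.

Other information

  • Related issue: Close #
  • Screenshots/GIF
    image
    image
    image
  • Additional information: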

Summary by Sourcery

Bug Fixes:

  • Fixes an issue where spaces in Western character sentences were incorrectly split when sending messages with the bot.

Contributor

sourcery-ai bot commented Mar 19, 2025

Reviewer's Guide by Sourcery

This pull request introduces changes to correctly handle sentences containing Western characters (e.g., English, Russian). It prevents the incorrect splitting of words separated by spaces when the bot sends messages. The implementation involves detecting Western character paragraphs and using a different splitting strategy that relies on a temporary separator instead of spaces. Additionally, the maximum response length for Western character paragraphs has been adjusted to accommodate longer sentences.
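The temporary-separator strategy described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `split_western_text` and the punctuation set are assumptions, and only the `|seg|` token is taken from the review discussion further down.

```python
import re
from typing import List

# Temporary separator token; the value mirrors the one quoted in the review,
# everything else in this sketch is hypothetical.
SEPARATOR = "|seg|"


def split_western_text(text: str) -> List[str]:
    """Split on sentence punctuation only, so spaces between words survive."""
    # Replace each run of sentence punctuation (and trailing whitespace)
    # with the temporary separator. The punctuation set is an assumption.
    marked = re.sub(r"[.!?;]+\s*", SEPARATOR, text)
    # Split on the separator and drop empty fragments.
    return [part.strip() for part in marked.split(SEPARATOR) if part.strip()]
```

With this sketch, `split_western_text("Hello world. How are you? Fine")` yields `["Hello world", "How are you", "Fine"]`: the words inside each sentence stay together because spaces are never used as split points.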

Sequence diagram for splitting a sentence with Western characters

sequenceDiagram
  participant User
  participant split_into_sentences_w_remove_punctuation
  participant is_western_paragraph

  User->>split_into_sentences_w_remove_punctuation: Send text with Western characters
  activate split_into_sentences_w_remove_punctuation
  split_into_sentences_w_remove_punctuation->>is_western_paragraph: Check if the text is a Western paragraph
  activate is_western_paragraph
  is_western_paragraph-->>split_into_sentences_w_remove_punctuation: Return True
  deactivate is_western_paragraph
  split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Replace punctuation with temporary separator
  split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Split the text using the temporary separator
  split_into_sentences_w_remove_punctuation-->>User: Return list of sentences
  deactivate split_into_sentences_w_remove_punctuation

Updated class diagram for text processing functions

classDiagram
  class utils {
    +split_into_sentences_w_remove_punctuation(text: str) : List[str]
    +process_llm_response(text: str) : List[str]
    +is_western_char(char) : bool
    +is_western_paragraph(paragraph) : bool
  }

  note for utils.split_into_sentences_w_remove_punctuation "Modified to handle Western character paragraphs"
  note for utils.process_llm_response "Modified to adjust max length for Western character paragraphs"
  note for utils.is_western_char "New function to detect Western characters"
  note for utils.is_western_paragraph "New function to detect Western character paragraphs"

File-Level Changes

Change Details Files
Added functionality to handle Western character paragraphs differently, preventing incorrect splitting of words separated by spaces.
  • Added is_western_char function to detect Western characters.
  • Added is_western_paragraph function to detect Western paragraphs.
  • Modified split_into_sentences_w_remove_punctuation to handle Western paragraphs by using a temporary separator `|seg|`
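The two detection helpers listed above are not shown on this page. One plausible shape, assuming "Western" means Latin and Cyrillic letters and that a paragraph counts as Western only when all of its letters do, is sketched below. The Unicode ranges and the all-letters rule are assumptions, not the PR's actual logic.

```python
def is_western_char(char: str) -> bool:
    """Return True for Latin or Cyrillic code points (assumed definition)."""
    code = ord(char)
    return (
        0x0041 <= code <= 0x005A      # A-Z
        or 0x0061 <= code <= 0x007A   # a-z
        or 0x00C0 <= code <= 0x024F   # accented Latin-1 through Latin Extended-B
        or 0x0400 <= code <= 0x04FF   # Cyrillic
    )


def is_western_paragraph(paragraph: str) -> bool:
    """Return True when the text has letters and every letter is Western."""
    # Digits, spaces and punctuation are ignored; only alphabetic
    # characters take part in the decision (an assumption of this sketch).
    letters = [c for c in paragraph if c.isalpha()]
    return bool(letters) and all(is_western_char(c) for c in letters)
```

Under these assumptions, mixed text such as a Chinese sentence containing one English word is treated as non-Western, so it would keep the existing splitting behaviour.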

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @Dax233 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider extracting the repeated is_western_paragraph checks into a variable to improve readability.
  • It might be helpful to add a unit test for the is_western_paragraph function.
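A unit test along the lines suggested above could start from a small table of clear-cut inputs. The sketch below takes the function under test as a parameter, so it makes no claim about where `is_western_paragraph` lives in the project; the `CASES` table and the helper name `failing_cases` are hypothetical, and the expected values are assumptions about the intended behaviour.

```python
from typing import Callable, List, Tuple

# (input, expected) pairs. Only unambiguous inputs are listed; the expected
# values are assumptions about the intended behaviour, not taken from the PR.
CASES: List[Tuple[str, bool]] = [
    ("Hello world, how are you?", True),
    ("Привет, как дела?", True),
    ("你好,今天天气不错", False),
]


def failing_cases(is_western_paragraph: Callable[[str], bool]) -> List[str]:
    """Return the inputs for which the function disagrees with CASES."""
    return [
        text
        for text, expected in CASES
        if is_western_paragraph(text) != expected
    ]
```

A test would then assert that `failing_cases(is_western_paragraph)` is an empty list, with the function imported from wherever the project defines it.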
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

@@ -276,21 +282,29 @@ def split_into_sentences_w_remove_punctuation(text: str) -> List[str]:
for sentence in sentences:
parts = sentence.split(',')

issue (complexity): Consider extracting the common splitting and replacement operations into helper functions to reduce nested conditionals and duplicated logic. For example, a single function could process the parts of a sentence given a splitter and a split strength, and be applied to both Chinese and Western paragraphs.

It looks like the added branching and duplicated loops could benefit from extracting the common splitting/replacement operations into helper functions. This will reduce nested conditionals and duplicate logic while keeping functionality intact.

For example, you can extract the loop that processes parts into a helper function:

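import random

def process_parts(parts, split_strength, splitter):
    """Merge adjacent parts, randomly starting a new sentence at each boundary."""
    current = parts[0]
    new_sentences = []
    for part in parts[1:]:
        # With probability split_strength, close the current sentence here.
        if random.random() < split_strength:
            new_sentences.append(current.strip())
            current = part
        else:
            # Otherwise rejoin the part using the original separator.
            current += splitter + part
    new_sentences.append(current.strip())
    return new_sentences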

Then refactor the core function to use it:

# For Chinese paragraphs:
parts = sentence.split(',')
new_sentences = process_parts(parts, split_strength, ',')
# For Western paragraphs:
parts = current_sentence.split('|seg|')
new_sentences = process_parts(parts, split_strength, '|seg|')

Also consider a similar helper for the punctuation removal loops if their logic is near-identical. This centralizes the logic and reduces cognitive overhead.
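Because the helper relies on `random.random()`, its output is hard to assert on in a test. One possible variation, sketched here, injects the random source. The `rng` parameter is an assumption added for illustration and is not part of the review's suggestion.

```python
import random
from typing import Callable, List


def process_parts(
    parts: List[str],
    split_strength: float,
    splitter: str,
    rng: Callable[[], float] = random.random,  # hypothetical injection point
) -> List[str]:
    """Same logic as the helper above, with a replaceable random source."""
    current = parts[0]
    new_sentences = []
    for part in parts[1:]:
        if rng() < split_strength:
            # Start a new sentence at this boundary.
            new_sentences.append(current.strip())
            current = part
        else:
            # Keep the parts together, restoring the separator.
            current += splitter + part
    new_sentences.append(current.strip())
    return new_sentences
```

Passing `rng=lambda: 0.0` with a positive `split_strength` forces a split at every boundary, while `rng=lambda: 0.9` with `split_strength=0.5` forces every part to be rejoined, so both branches can be checked deterministically.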
