-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: 修复了发送西文字符句子时,空格被错误分割的问题 #487
base: main-fix
Are you sure you want to change the base?
Conversation
文本处理函数的更新后的类图classDiagram
class utils {
+split_into_sentences_w_remove_punctuation(text: str) : List[str]
+process_llm_response(text: str) : List[str]
+is_western_char(char) : bool
+is_western_paragraph(paragraph) : bool
}
note for utils.split_into_sentences_w_remove_punctuation "已修改以处理西文字符段落"
note for utils.process_llm_response "已修改以调整西文字符段落的最大长度"
note for utils.is_western_char "用于检测西文字符的新函数"
note for utils.is_western_paragraph "用于检测西文字符段落的新函数"
文件级别更改
提示和命令与 Sourcery 交互
自定义您的体验访问您的 仪表板 以:
获得帮助Original review guide in EnglishReviewer's Guide by SourceryThis pull request introduces changes to correctly handle sentences containing Western characters (e.g., English, Russian). It prevents the incorrect splitting of words separated by spaces when the bot sends messages. The implementation involves detecting Western character paragraphs and using a different splitting strategy that relies on a temporary separator instead of spaces. Additionally, the maximum response length for Western character paragraphs has been adjusted to accommodate longer sentences. Sequence diagram for splitting a sentence with Western characterssequenceDiagram
participant User
participant split_into_sentences_w_remove_punctuation
participant is_western_paragraph
User->>split_into_sentences_w_remove_punctuation: Send text with Western characters
activate split_into_sentences_w_remove_punctuation
split_into_sentences_w_remove_punctuation->>is_western_paragraph: Check if the text is a Western paragraph
activate is_western_paragraph
is_western_paragraph-->>split_into_sentences_w_remove_punctuation: Return True
deactivate is_western_paragraph
split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Replace punctuation with temporary separator
split_into_sentences_w_remove_punctuation->>split_into_sentences_w_remove_punctuation: Split the text using the temporary separator
split_into_sentences_w_remove_punctuation-->>User: Return list of sentences
deactivate split_into_sentences_w_remove_punctuation
Updated class diagram for text processing functionsclassDiagram
class utils {
+split_into_sentences_w_remove_punctuation(text: str) : List[str]
+process_llm_response(text: str) : List[str]
+is_western_char(char) : bool
+is_western_paragraph(paragraph) : bool
}
note for utils.split_into_sentences_w_remove_punctuation "Modified to handle Western character paragraphs"
note for utils.process_llm_response "Modified to adjust max length for Western character paragraphs"
note for utils.is_western_char "New function to detect Western characters"
note for utils.is_western_paragraph "New function to detect Western character paragraphs"
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Dax233 - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider extracting the repeated
is_western_paragraph
checks into a variable to improve readability. - It might be helpful to add a unit test for the
is_western_paragraph
function.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@@ -276,21 +282,29 @@ def split_into_sentences_w_remove_punctuation(text: str) -> List[str]: | |||
for sentence in sentences: | |||
parts = sentence.split(',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider extracting common splitting and replacement operations into helper functions to reduce nested conditionals and duplicated logic, such as creating a function to process parts of a sentence based on a splitter and split strength, and applying this function to both Chinese and Western paragraphs.
It looks like the added branching and duplicated loops could benefit from extracting the common splitting/replacement operations into helper functions. This will reduce nested conditionals and duplicate logic while keeping functionality intact.
For example, you can extract the loop that processes parts into a helper function:
def process_parts(parts, split_strength, splitter):
current = parts[0]
new_sentences = []
for part in parts[1:]:
if random.random() < split_strength:
new_sentences.append(current.strip())
current = part
else:
current += splitter + part
new_sentences.append(current.strip())
return new_sentences
Then refactor the core function to use it:
# For Chinese paragraphs:
parts = sentence.split(',')
new_sentences = process_parts(parts, split_strength, ',')
# For Western paragraphs:
parts = current_sentence.split('|seg|')
new_sentences = process_parts(parts, split_strength, '|seg|')
Also consider a similar helper for the punctuation removal loops if their logic is near-identical. This centralizes the logic and reduces cognitive overhead.
请填写以下内容
(删除掉中括号内的空格,并替换为小写的x)
main
分支 禁止修改,请确认本次提交的分支 不是main
分支增加西文字符段落判断,如为西文字符段落则不采用空格作为分隔符,修复了bot发送西文字符,如英语、俄语等时,分隔单词的空格会被错误分割至多条发送的问题。
其他信息
好的,这是翻译成中文的 pull request 总结:
Sourcery 总结
Bug 修复:
Original summary in English
Summary by Sourcery
Bug Fixes: