-
-
Notifications
You must be signed in to change notification settings - Fork 118
Text splitter bugfixes #560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
* additional hardening against running on windows (removing instances of "\r" from the text to be split) for the text splitters that did not already have this. * fixed a bug in the MarkdownHeaderTextSplitter where the header breadcrumb line was not losing the proper number of segments on depth reduction (it only ever removed 1 segment, regardless of how much less deep the next header was than the previous one). * added a test to validate that abovementioned bug was fixed.
WalkthroughThe changes introduce the FluentAssertions package as a dependency in multiple test project files and centralize its version in the package management file. Additionally, text splitting logic in several splitter classes is updated to normalize line endings by removing carriage returns. A new markdown resource and a comprehensive test for markdown header splitting are also added. Changes
Sequence Diagram(s)sequenceDiagram
participant TestRunner
participant MarkdownResource
participant MarkdownHeaderTextSplitter
TestRunner->>MarkdownResource: Load markdown_test_material.md
TestRunner->>MarkdownHeaderTextSplitter: SplitText(resource)
MarkdownHeaderTextSplitter-->>TestRunner: Return segments
TestRunner->>TestRunner: Assert segment count and header/text correctness
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1)
7-9
: Duplicate of previous comment – applyPrivateAssets="all"
for consistency.src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1)
7-9
: Duplicate of previous comment – applyPrivateAssets="all"
for consistency.src/Meta/test/LangChain.IntegrationTests.csproj (1)
8-10
: Duplicate of previous comment – applyPrivateAssets="all"
for consistency.
🧹 Nitpick comments (2)
src/Directory.Packages.props (1)
12-12
: Mark test-only FluentAssertions package as PrivateAssets to prevent it from leaking into production artifacts
Directory.Packages.props
centrally defines the version, but because every test project now references FluentAssertions withoutPrivateAssets
, a library project that happens to reference a test project (rare but possible in multi-targeted solutions) would transitively ship the assertion library.
AddingPrivateAssets="all"
at each reference (or centrally via aPackageVersion
entry that carries metadata) keeps the dependency graph for published packages clean.- <PackageVersion Include="FluentAssertions" Version="8.5.0" /> + <PackageVersion Include="FluentAssertions" Version="8.5.0"> + <PrivateAssets>all</PrivateAssets> + </PackageVersion>src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1)
9-11
: AddPrivateAssets="all"
so the assertion library stays test-scopeSame rationale as explained for
Directory.Packages.props
: marking the reference asPrivateAssets
guarantees FluentAssertions never slips into runtime packages produced from this project (should you ever pack it).- <PackageReference Include="FluentAssertions" /> + <PackageReference Include="FluentAssertions" PrivateAssets="all" />
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj
(1 hunks)src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj
(1 hunks)src/Directory.Packages.props
(1 hunks)src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj
(1 hunks)src/Meta/test/LangChain.IntegrationTests.csproj
(1 hunks)src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj
(1 hunks)src/Splitters/Abstractions/test/Resources/markdown_test_material.md
(1 hunks)src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs
(1 hunks)src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj
(1 hunks)src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)
src/Utilities/Pollyfils/src/StringExtensions.Replace.cs (1)
Replace
(8-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test / Build, test and publish
🔇 Additional comments (9)
src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1)
9-9
: LGTM! Standard test dependency addition.Adding FluentAssertions is a good improvement for test readability and assertion capabilities. The centralized version management approach is also a best practice.
src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1)
7-9
: LGTM! Consistent test improvement.The FluentAssertions package addition follows the same pattern as other test projects and will improve assertion readability and capabilities.
src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1)
10-12
: LGTM! Systematic test enhancement.Consistent addition of FluentAssertions across test projects. This systematic approach ensures uniform testing capabilities across the codebase.
src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (2)
39-39
: LGTM! Windows compatibility improvement.Adding carriage return removal provides better cross-platform compatibility, specifically addressing Windows line ending handling as mentioned in the PR objectives.
79-79
: LGTM! Correct fix for header breadcrumb depth calculation.The updated calculation
(1 - hLen + currentHeaderLen)
properly handles the depth difference between headers. This ensures the correct number of segments are removed when encountering headers at different levels, fixing the bug where only one segment was always removed regardless of the actual depth change.Example: When transitioning from level 4 to level 2, this now correctly removes 3 segments instead of just 1.
src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1)
1-71
: LGTM! Comprehensive test resource for header hierarchy validation.This markdown file provides excellent test coverage for the header breadcrumb fix with:
- Multiple nesting levels (H1, H2, H3)
- Various header level transitions
- Associated text content for each section
The hierarchical structure will effectively validate the corrected breadcrumb depth calculation logic.
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)
22-23
: LGTM! Good cross-platform compatibility fix.Removing carriage return characters normalizes Windows-style line endings (
\r\n
) to Unix-style (\n
), ensuring consistent text processing across different operating systems. The inline comment clearly explains the purpose.src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1)
18-19
: LGTM! Consistent normalization across splitter implementations.This change mirrors the fix in
RecursiveCharacterTextSplitter.cs
, ensuring both text splitter classes handle Windows-style line endings consistently. The carriage return removal normalizes input text for reliable cross-platform behavior.src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)
65-128
: Excellent comprehensive test for nested header functionality.This test thoroughly validates the MarkdownHeaderTextSplitter's ability to handle complex nested header structures. The test effectively covers:
- Multiple nesting levels (A → A.A → A.B.A, etc.)
- Proper header concatenation with colons as separators
- Correct association of text content with header paths
- Expected count validation (18 elements)
This comprehensive testing approach is valuable for ensuring the header breadcrumb fix mentioned in the PR objectives works correctly across various nesting scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1)
9-11
: SamePrivateAssets="all"
recommendation as in Core testsSee earlier comment in
LangChain.Core.UnitTests.csproj
.src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1)
10-12
: SamePrivateAssets="all"
recommendation as in Core testsSee earlier comment in
LangChain.Core.UnitTests.csproj
.src/Meta/test/LangChain.IntegrationTests.csproj (1)
8-10
: SamePrivateAssets="all"
recommendation as in Core testsSee earlier comment in
LangChain.Core.UnitTests.csproj
.
🧹 Nitpick comments (4)
src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1)
7-9
: AddPrivateAssets="all"
to keep test-only dependency from leaking transitivelyFluentAssertions is only needed while compiling/running the test assembly. Marking it as private avoids accidentally flowing it to downstream consumers should the test project ever be packed.
- <PackageReference Include="FluentAssertions" /> + <PackageReference Include="FluentAssertions" PrivateAssets="all" />src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1)
79-79
: LGTM! Header breadcrumb logic fix looks correct.The updated calculation
(1 - hLen + currentHeaderLen)
correctly computes how many segments to remove when transitioning from a deeper header level to a shallower one. This fixes the bug where only one segment was removed regardless of the actual depth difference.Consider adding a comment to explain the calculation for future maintainers:
+ // Remove segments to match the new header depth: (currentHeaderLen - hLen + 1) segments string prevHeader = string.Join("|", existingHeader.Take(existingHeader.Length - (1 - hLen + currentHeaderLen)));
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)
22-23
: LGTM! Consistent Windows line ending normalization.The carriage return removal ensures consistent behavior across different operating systems, matching the implementation in other text splitters.
Consider using a more professional comment:
- text = text.Replace("\r", ""); // some people are using windows + text = text.Replace("\r", ""); // Normalize line endings for cross-platform compatibilitysrc/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)
75-127
: Comprehensive validation of hierarchical header breadcrumbs.The test thoroughly validates the breadcrumb logic fix mentioned in the PR objectives. The colon-separated header hierarchy format correctly demonstrates that headers now properly reduce segments according to depth differences.
However, consider refactoring to reduce repetition and improve maintainability:
+ var expectedResults = new[] + { + ("Header A", "Text A"), + ("Header A: Header A.A", "Text A.A"), + ("Header A: Header A.B", "Text A.B"), + // ... continue for all expected results + }; + + for (int i = 0; i < expectedResults.Length; i++) + { + res[i].Split("\n")[0].Should().Be(expectedResults[i].Item1); + res[i].Split("\n")[1].Should().Be(expectedResults[i].Item2); + }This would make the test more maintainable and reduce code duplication while preserving the same validation logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj
(1 hunks)src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj
(1 hunks)src/Directory.Packages.props
(1 hunks)src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj
(1 hunks)src/Meta/test/LangChain.IntegrationTests.csproj
(1 hunks)src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs
(1 hunks)src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj
(1 hunks)src/Splitters/Abstractions/test/Resources/markdown_test_material.md
(1 hunks)src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs
(1 hunks)src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj
(1 hunks)src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj
(1 hunks)
🔇 Additional comments (8)
src/Directory.Packages.props (1)
12-12
: Confirmed: FluentAssertions 8.5.0 is available on NuGet
- Verified via
curl https://api.nuget.org/v3-flatcontainer/fluentassertions/index.json | jq '.versions | last'
returning"8.5.0"
.- No changes needed; the specified version is correct and restore will succeed.
src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1)
9-9
: LGTM! FluentAssertions package addition enhances test readability.The addition of FluentAssertions will improve test assertions and make them more expressive and readable.
src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1)
7-10
: LGTM! Consistent FluentAssertions integration.The new ItemGroup with FluentAssertions package reference follows proper MSBuild conventions and maintains consistency with other test projects in the repository.
src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1)
39-39
: LGTM! Windows compatibility improvement.Adding carriage return removal ensures consistent line ending handling across different operating systems, addressing the Windows hardening mentioned in the PR objectives.
src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1)
7-10
: LGTM! Consistent test infrastructure improvement.The FluentAssertions package addition follows the same pattern as other test projects and will enhance assertion readability in the splitter tests.
src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1)
18-18
: LGTM! Effective normalization of line endings.The carriage return removal correctly addresses Windows compatibility issues by normalizing
\r\n
to\n
line endings. The placement after null validation and before splitting logic is appropriate.src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1)
1-71
: Excellent test resource structure for comprehensive markdown testing.The hierarchical markdown structure effectively covers multiple testing scenarios:
- Multi-level header nesting (H1 → H2 → H3)
- Header level transitions (deep to shallow)
- Sibling headers at the same level
- Consistent naming pattern for easy verification
This resource should thoroughly validate the header breadcrumb logic fixes mentioned in the PR objectives.
src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)
65-73
: Well-structured test setup and count validation.The test correctly loads the markdown resource and validates the expected segment count. The assertion of exactly 18 segments aligns with the comprehensive markdown structure.
Summary by CodeRabbit
New Features
Bug Fixes
Chores