Text splitter bugfixes #560

Aexalon · 2025-08-05T12:21:27Z

additional hardening against running on windows (removing instances of "\r" from the text to be split) for the text splitters that did not already have this.
fixed a bug in the MarkdownHeaderTextSplitter where the header breadcrumb line was not losing the proper number of segments on depth reduction (it only ever removed 1 segment, regardless of how much less deep the next header was than the previous one).
added a test to validate that abovementioned bug was fixed.

Summary by CodeRabbit

New Features
- Added a new markdown test resource to improve testing of markdown parsing and splitting.
- Introduced a new test to validate markdown header splitting, ensuring correct handling of nested headers.
Bug Fixes
- Improved text splitting by normalizing line endings, ensuring consistent handling of carriage returns across different platforms.
- Enhanced markdown header splitting logic for more accurate header hierarchy truncation.
Chores
- Added FluentAssertions as a dependency to various test projects to improve test assertions.

* additional hardening against running on windows (removing instances of "\r" from the text to be split) for the text splitters that did not already have this. * fixed a bug in the MarkdownHeaderTextSplitter where the header breadcrumb line was not losing the proper number of segments on depth reduction (it only ever removed 1 segment, regardless of how much less deep the next header was than the previous one). * added a test to validate that abovementioned bug was fixed.

CLAassistant · 2025-08-05T12:21:35Z

All committers have signed the CLA.

coderabbitai · 2025-08-05T12:25:31Z

Walkthrough

The changes introduce the FluentAssertions package as a dependency in multiple test project files and centralize its version in the package management file. Additionally, text splitting logic in several splitter classes is updated to normalize line endings by removing carriage returns. A new markdown resource and a comprehensive test for markdown header splitting are also added.

Changes

Cohort / File(s)	Change Summary
FluentAssertions Package Reference in Test Projects `src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj`, `src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj`, `src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj`, `src/Meta/test/LangChain.IntegrationTests.csproj`, `src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj`, `src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj`, `src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj`	Added FluentAssertions as a package reference to various test project files.
Centralized Package Version Management `src/Directory.Packages.props`	Added FluentAssertions version "8.5.0" to centralized package management.
Text Splitter Line Ending Normalization `src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs`, `src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs`	Updated methods to remove carriage return characters (`\r`) from input text before splitting.
Markdown Header Splitting Logic `src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs`	Modified how the header hierarchy is truncated when encountering headers of equal or lesser level during markdown splitting.
Markdown Test Resource `src/Splitters/Abstractions/test/Resources/markdown_test_material.md`	Added a new markdown file with nested headers and sample text for testing markdown splitting.
Markdown Header Splitter Test `src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs`	Added a new test method `TestMarkdown4` to validate the markdown header splitter using the new resource file.

Sequence Diagram(s)

sequenceDiagram
    participant TestRunner
    participant MarkdownResource
    participant MarkdownHeaderTextSplitter

    TestRunner->>MarkdownResource: Load markdown_test_material.md
    TestRunner->>MarkdownHeaderTextSplitter: SplitText(resource)
    MarkdownHeaderTextSplitter-->>TestRunner: Return segments
    TestRunner->>TestRunner: Assert segment count and header/text correctness

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A bunny hopped through code today,
Adding FluentAssertions along the way.
Splitters now trim those carriage returns,
While markdown headers get their turns.
New tests and resources, neat and bright—
This patch brings clarity and delight!
🐇✨

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1)

7-9: Duplicate of previous comment – apply PrivateAssets="all" for consistency.

src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1)

7-9: Duplicate of previous comment – apply PrivateAssets="all" for consistency.

src/Meta/test/LangChain.IntegrationTests.csproj (1)

8-10: Duplicate of previous comment – apply PrivateAssets="all" for consistency.

🧹 Nitpick comments (2)

src/Directory.Packages.props (1)
12-12: Mark test-only FluentAssertions package as PrivateAssets to prevent it from leaking into production artifacts

Directory.Packages.props centrally defines the version, but because every test project now references FluentAssertions without PrivateAssets, a library project that happens to reference a test project (rare but possible in multi-targeted solutions) would transitively ship the assertion library.
Adding PrivateAssets="all" at each reference (or centrally via a PackageVersion entry that carries metadata) keeps the dependency graph for published packages clean.
-    <PackageVersion Include="FluentAssertions" Version="8.5.0" />
+    <PackageVersion Include="FluentAssertions" Version="8.5.0">
+      <PrivateAssets>all</PrivateAssets>
+    </PackageVersion>
src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1)
9-11: Add PrivateAssets="all" so the assertion library stays test-scope

Same rationale as explained for Directory.Packages.props: marking the reference as PrivateAssets guarantees FluentAssertions never slips into runtime packages produced from this project (should you ever pack it).
-  <PackageReference Include="FluentAssertions" />
+  <PackageReference Include="FluentAssertions" PrivateAssets="all" />

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b20bc9 and 8b13a5e.

📒 Files selected for processing (13)

src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1 hunks)
src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1 hunks)
src/Directory.Packages.props (1 hunks)
src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1 hunks)
src/Meta/test/LangChain.IntegrationTests.csproj (1 hunks)
src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1 hunks)
src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1 hunks)
src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1 hunks)
src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1 hunks)
src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)

src/Utilities/Pollyfils/src/StringExtensions.Replace.cs (1)

Replace (8-31)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Test / Build, test and publish

🔇 Additional comments (9)

src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1)

9-9: LGTM! Standard test dependency addition.

Adding FluentAssertions is a good improvement for test readability and assertion capabilities. The centralized version management approach is also a best practice.

src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1)

7-9: LGTM! Consistent test improvement.

The FluentAssertions package addition follows the same pattern as other test projects and will improve assertion readability and capabilities.

src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1)

10-12: LGTM! Systematic test enhancement.

Consistent addition of FluentAssertions across test projects. This systematic approach ensures uniform testing capabilities across the codebase.

src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (2)

39-39: LGTM! Windows compatibility improvement.

Adding carriage return removal provides better cross-platform compatibility, specifically addressing Windows line ending handling as mentioned in the PR objectives.

79-79: LGTM! Correct fix for header breadcrumb depth calculation.

The updated calculation (1 - hLen + currentHeaderLen) properly handles the depth difference between headers. This ensures the correct number of segments are removed when encountering headers at different levels, fixing the bug where only one segment was always removed regardless of the actual depth change.

Example: When transitioning from level 4 to level 2, this now correctly removes 3 segments instead of just 1.

src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1)

1-71: LGTM! Comprehensive test resource for header hierarchy validation.

This markdown file provides excellent test coverage for the header breadcrumb fix with:

Multiple nesting levels (H1, H2, H3)

Various header level transitions

Associated text content for each section

The hierarchical structure will effectively validate the corrected breadcrumb depth calculation logic.

src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)

22-23: LGTM! Good cross-platform compatibility fix.

Removing carriage return characters normalizes Windows-style line endings (\r\n) to Unix-style (\n), ensuring consistent text processing across different operating systems. The inline comment clearly explains the purpose.

src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1)

18-19: LGTM! Consistent normalization across splitter implementations.

This change mirrors the fix in RecursiveCharacterTextSplitter.cs, ensuring both text splitter classes handle Windows-style line endings consistently. The carriage return removal normalizes input text for reliable cross-platform behavior.

src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)

65-128: Excellent comprehensive test for nested header functionality.

This test thoroughly validates the MarkdownHeaderTextSplitter's ability to handle complex nested header structures. The test effectively covers:

Multiple nesting levels (A → A.A → A.B.A, etc.)

Proper header concatenation with colons as separators

Correct association of text content with header paths

Expected count validation (18 elements)

This comprehensive testing approach is valuable for ensuring the header breadcrumb fix mentioned in the PR objectives works correctly across various nesting scenarios.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1)

9-11: Same PrivateAssets="all" recommendation as in Core tests

See earlier comment in LangChain.Core.UnitTests.csproj.

src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1)

10-12: Same PrivateAssets="all" recommendation as in Core tests

See earlier comment in LangChain.Core.UnitTests.csproj.

src/Meta/test/LangChain.IntegrationTests.csproj (1)

8-10: Same PrivateAssets="all" recommendation as in Core tests

See earlier comment in LangChain.Core.UnitTests.csproj.

🧹 Nitpick comments (4)

src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1)
7-9: Add PrivateAssets="all" to keep test-only dependency from leaking transitively

FluentAssertions is only needed while compiling/running the test assembly. Marking it as private avoids accidentally flowing it to downstream consumers should the test project ever be packed.
-  <PackageReference Include="FluentAssertions" />
+  <PackageReference Include="FluentAssertions" PrivateAssets="all" />
src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1)
79-79: LGTM! Header breadcrumb logic fix looks correct.

The updated calculation (1 - hLen + currentHeaderLen) correctly computes how many segments to remove when transitioning from a deeper header level to a shallower one. This fixes the bug where only one segment was removed regardless of the actual depth difference.

Consider adding a comment to explain the calculation for future maintainers:
+ // Remove segments to match the new header depth: (currentHeaderLen - hLen + 1) segments
  string prevHeader = string.Join("|", existingHeader.Take(existingHeader.Length - (1 - hLen + currentHeaderLen)));
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1)
22-23: LGTM! Consistent Windows line ending normalization.

The carriage return removal ensures consistent behavior across different operating systems, matching the implementation in other text splitters.

Consider using a more professional comment:
- text = text.Replace("\r", ""); // some people are using windows
+ text = text.Replace("\r", ""); // Normalize line endings for cross-platform compatibility
src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)
75-127: Comprehensive validation of hierarchical header breadcrumbs.

The test thoroughly validates the breadcrumb logic fix mentioned in the PR objectives. The colon-separated header hierarchy format correctly demonstrates that headers now properly reduce segments according to depth differences.

However, consider refactoring to reduce repetition and improve maintainability:
+        var expectedResults = new[]
+        {
+            ("Header A", "Text A"),
+            ("Header A: Header A.A", "Text A.A"),
+            ("Header A: Header A.B", "Text A.B"),
+            // ... continue for all expected results
+        };
+
+        for (int i = 0; i < expectedResults.Length; i++)
+        {
+            res[i].Split("\n")[0].Should().Be(expectedResults[i].Item1);
+            res[i].Split("\n")[1].Should().Be(expectedResults[i].Item2);
+        }
This would make the test more maintainable and reduce code duplication while preserving the same validation logic.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b20bc9 and 8b13a5e.

📒 Files selected for processing (13)

src/Cli/test/LangChain.Cli.IntegrationTests/LangChain.Cli.IntegrationTests.csproj (1 hunks)
src/Core/test/UnitTests/LangChain.Core.UnitTests.csproj (1 hunks)
src/Directory.Packages.props (1 hunks)
src/DocumentLoaders/IntegrationTests/LangChain.DocumentLoaders.IntegrationTests.csproj (1 hunks)
src/Meta/test/LangChain.IntegrationTests.csproj (1 hunks)
src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/src/Text/RecursiveCharacterTextSplitter.cs (1 hunks)
src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1 hunks)
src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1 hunks)
src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1 hunks)
src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1 hunks)
src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1 hunks)

🔇 Additional comments (8)

src/Directory.Packages.props (1)

12-12: Confirmed: FluentAssertions 8.5.0 is available on NuGet

Verified via curl https://api.nuget.org/v3-flatcontainer/fluentassertions/index.json | jq '.versions | last' returning "8.5.0".

No changes needed; the specified version is correct and restore will succeed.

src/Utilities/Postgres/test/LangChain.Utilities.Postgres.IntegrationTests.csproj (1)

9-9: LGTM! FluentAssertions package addition enhances test readability.

The addition of FluentAssertions will improve test assertions and make them more expressive and readable.

src/Splitters/CSharp/test/LangChain.Splitters.CSharp.Tests.csproj (1)

7-10: LGTM! Consistent FluentAssertions integration.

The new ItemGroup with FluentAssertions package reference follows proper MSBuild conventions and maintains consistency with other test projects in the repository.

src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs (1)

39-39: LGTM! Windows compatibility improvement.

Adding carriage return removal ensures consistent line ending handling across different operating systems, addressing the Windows hardening mentioned in the PR objectives.

src/Splitters/Abstractions/test/LangChain.Splitters.Abstractions.Tests.csproj (1)

7-10: LGTM! Consistent test infrastructure improvement.

The FluentAssertions package addition follows the same pattern as other test projects and will enhance assertion readability in the splitter tests.

src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs (1)

18-18: LGTM! Effective normalization of line endings.

The carriage return removal correctly addresses Windows compatibility issues by normalizing \r\n to \n line endings. The placement after null validation and before splitting logic is appropriate.

src/Splitters/Abstractions/test/Resources/markdown_test_material.md (1)

1-71: Excellent test resource structure for comprehensive markdown testing.

The hierarchical markdown structure effectively covers multiple testing scenarios:

Multi-level header nesting (H1 → H2 → H3)

Header level transitions (deep to shallow)

Sibling headers at the same level

Consistent naming pattern for easy verification

This resource should thoroughly validate the header breadcrumb logic fixes mentioned in the PR objectives.

src/Splitters/Abstractions/test/Tests.MarkdownHeader.cs (1)

65-73: Well-structured test setup and count validation.

The test correctly loads the markdown resource and validates the expected segment count. The assertion of exactly 18 segments aligns with the comprehensive markdown structure.

Added missing package references

8b13a5e

coderabbitai bot reviewed Aug 5, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Text splitter bugfixes #560

Text splitter bugfixes #560

Uh oh!

Aexalon commented Aug 5, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

CLAassistant commented Aug 5, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Aug 5, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Text splitter bugfixes #560

Are you sure you want to change the base?

Text splitter bugfixes #560

Uh oh!

Conversation

Aexalon commented Aug 5, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

CLAassistant commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Aexalon commented Aug 5, 2025 •

edited by coderabbitai bot

Loading

CLAassistant commented Aug 5, 2025 •

edited

Loading

coderabbitai bot commented Aug 5, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)