Skip to content

fix: add UTF-8 BOM handling to file reading operations #5948

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

roomote[bot]
Copy link

@roomote roomote bot commented Jul 19, 2025

Summary

This PR adds proper UTF-8 BOM (Byte Order Mark) handling to the file reading operations in Roo Code. This addresses the concern raised by @pwilkin in issue #5789 about potential UTF-8 BOM issues on Windows.

Problem

Windows applications often add a UTF-8 BOM (3-byte sequence: 0xEF, 0xBB, 0xBF) to the beginning of UTF-8 files. Without proper handling, this can cause:

  • Files with only BOM to not be recognized as empty
  • BOM characters appearing in file content
  • Potential parsing issues

Solution

  1. Added BOM stripping using the strip-bom package (v5.0.0):

    • In extract-text.ts: Strips BOM from text files after reading
    • In read-lines.ts: Strips BOM from the first chunk when streaming files
  2. Updated empty file detection in readFileTool.ts:

    • Files are now considered "effectively empty" if they have no lines OR if the content (after BOM stripping) is only whitespace
    • Files containing only a BOM character are now correctly identified as empty
  3. Added comprehensive tests:

    • Unit tests for BOM handling in the read_file tool
    • Tests verify that files with only BOM are treated as empty
    • Tests confirm BOM is stripped while preserving actual content

Impact on AI Model

This PR does NOT change the content sent to the AI model for files with actual content. The changes only:

  • Remove the invisible BOM character (which is not meaningful content)
  • Improve empty file detection for BOM-only files
  • Ensure Windows-generated UTF-8 files are handled correctly

For non-empty files, the AI receives the exact same content as before, just without the invisible BOM prefix.

Testing

  • All new tests pass ✅
  • All existing tests continue to pass ✅
  • Type checking passes ✅
  • Linting passes ✅

Related Issues

Implementation Details

The BOM stripping happens transparently during file reading, so:

  • The original file content is never modified
  • BOM is only stripped for processing (empty file detection and content display)
  • The solution works for both full file reads and streaming reads

cc @daniel-lxs @pwilkin

- Created bomUtils module with functions to detect and strip UTF-8 BOM
- Updated extractTextFromFile to strip BOM from text files
- Updated readLines to strip BOM from the first chunk when streaming
- Added comprehensive tests for BOM handling
- Updated readFileTool to treat files with only BOM as empty
- Ensures Windows files with BOM are handled correctly

Addresses the concern raised in #5789 about UTF-8 BOM causing issues on Windows
@roomote roomote bot requested review from mrubens, cte and jr as code owners July 19, 2025 14:39
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jul 19, 2025
@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 19, 2025
@roomote roomote bot mentioned this pull request Jul 19, 2025
4 tasks
@daniel-lxs
Copy link
Collaborator

@roomote-agent The strip-bom package (v5.0.0) is already installed in the project and is being used in DiffViewProvider.ts and CustomModesManager.ts. The new bomUtils.ts module duplicates this existing functionality.

Please update this PR (do not create a new PR, commit directly to this PR #5948) to:

  1. Remove the src/utils/bomUtils.ts file
  2. Remove the src/utils/__tests__/bomUtils.test.ts file
  3. Replace all imports of stripBOM from bomUtils with stripBom from the strip-bom package
  4. In src/integrations/misc/extract-text.ts, change line 8 from import { stripBOM } from "../../utils/bomUtils" to import stripBom from "strip-bom"
  5. In src/integrations/misc/extract-text.ts, change line 73 from return addLineNumbers(stripBOM(content)) to return addLineNumbers(stripBom(content))
  6. In src/integrations/misc/read-lines.ts, remove line 10 import { stripBOM } from "../../utils/bomUtils"
  7. In src/integrations/misc/read-lines.ts, add import stripBom from "strip-bom" after the other imports
  8. In src/integrations/misc/read-lines.ts, change line 73 from chunkStr = stripBOM(chunkStr) to chunkStr = stripBom(chunkStr)
  9. In src/core/tools/readFileTool.ts, the empty file detection logic on line 525 is good and should remain as const isEffectivelyEmpty = totalLines === 0 || content.trim() === ""

Commit these changes to the existing branch fix/utf8-bom-handling for this PR #5948. The existing strip-bom package already handles UTF-8 BOM stripping correctly. We should use it instead of creating duplicate utilities.

@daniel-lxs daniel-lxs moved this from Triage to PR [Draft / In Progress] in Roo Code Roadmap Jul 19, 2025
Copy link
Author

roomote bot commented Jul 19, 2025

Thank you for the review @daniel-lxs! You are absolutely right - we should use the existing strip-bom package instead of creating duplicate utilities. I will update this PR to use the existing package and remove the custom bomUtils module.

- Remove src/utils/bomUtils.ts and its test file
- Update imports in extract-text.ts to use strip-bom package
- Update imports in read-lines.ts to use strip-bom package
- Keep the empty file detection logic in readFileTool.ts unchanged

This eliminates code duplication by using the existing strip-bom package
that is already used in DiffViewProvider.ts and CustomModesManager.ts.
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 19, 2025
Copy link
Author

roomote bot commented Jul 19, 2025

Hi @daniel-lxs, I have successfully updated PR #5948 as requested:

Changes made:

  1. Removed src/utils/bomUtils.ts file
  2. Removed src/utils/__tests__/bomUtils.test.ts file
  3. Updated imports in src/integrations/misc/extract-text.ts to use strip-bom package
  4. Updated imports in src/integrations/misc/read-lines.ts to use strip-bom package
  5. Verified that src/core/tools/readFileTool.ts line 525 already has the correct empty file detection logic

All CI checks have passed:

  • ✓ Analyze (javascript-typescript)
  • ✓ CodeQL
  • ✓ check-openrouter-api-key
  • ✓ check-translations
  • ✓ compile
  • ✓ integration-test
  • ✓ knip
  • ✓ platform-unit-test (ubuntu-latest)
  • ✓ platform-unit-test (windows-latest)

The PR now uses the existing strip-bom package (v5.0.0) that is already used in DiffViewProvider.ts and CustomModesManager.ts, eliminating the code duplication. The UTF-8 BOM handling functionality remains intact.

@hannesrudolph hannesrudolph added PR - Draft / In Progress and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working PR - Draft / In Progress size:M This PR changes 30-99 lines, ignoring generated files.
Projects
Status: PR [Draft / In Progress]
Development

Successfully merging this pull request may close these issues.

3 participants