Skip to content

Conversation

Copy link

Copilot AI commented Dec 23, 2025

Performance Optimization Complete ✅

  • Create a RecursiveFilterIterator class to filter out ignored directories before iteration
  • Update get_file_list() to use the filter iterator to prevent descending into ignored directories
  • Manual testing confirms ignored directories are no longer scanned
  • Add Behat test to verify ignored directories are not scanned
  • Fix filter to properly handle both zip and targz formats
  • Handle complex gitignore patterns with negations correctly
  • All distignore tests pass (27/27 scenarios)
  • Address code review feedback (path manipulation, portability, comments)
  • All static analysis checks pass (lint, phpcs, phpstan)
  • Security review complete (CodeQL)
  • Optimize to avoid duplicate isPathIgnored() calls with caching
  • Refactor filtering into iterator's accept() method for profound optimization
  • Add comprehensive PHPUnit tests for Distignore_Filter_Iterator class

Summary

This PR fixes a critical performance issue where RecursiveIteratorIterator was scanning ALL files in ignored directories before filtering them out. For projects with large node_modules directories (30,000+ files), this made the command nearly unusable.

Solution: Introduced a RecursiveFilterIterator that filters ignored files directly in the accept() method, preventing them from being yielded to the foreach loop. The iterator:

  • Filters out ignored files immediately (not yielded)
  • Yields ignored directories for tracking but prevents descent via hasChildren()
  • Uses caching to avoid duplicate isPathIgnored() checks
  • Correctly handles complex gitignore patterns with negations

Performance Impact:

  • Before: Foreach loop iterated 200+ times for 200 ignored files
  • After: Foreach loop iterates ~5 times (only non-ignored items)
  • Execution time: <0.01s for 200 ignored files

Changes:

  • Refactored Distignore_Filter_Iterator with smart filtering in accept() method
  • Updated get_file_list() to leverage iterator's filtering (reduced logic)
  • New Behat performance test scenario
  • Comprehensive PHPUnit test suite with 9 tests covering all filter iterator functionality
  • Total: 250+ lines added/modified across 4 files

Test Coverage:

  • 27 Behat scenarios passing (integration tests)
  • 9 PHPUnit tests with 24 assertions passing (unit tests for Distignore_Filter_Iterator)
    • File filtering behavior
    • Directory tracking and descent prevention
    • Caching functionality
    • Negation patterns
    • Error handling
    • Nested directory handling
    • State sharing across child iterators

Compatibility: All 27 existing Behat test scenarios pass, plus new performance test added. All 9 PHPUnit tests pass.

Original prompt

This section details on the original issue you should resolve

<issue_title>Performance Issue: RecursiveIteratorIterator scans all files in ignored directories before filtering</issue_title>
<issue_description>I'll create the issue following the repository's template format:


Bug Report

Describe the current, buggy behavior

The get_file_list() method in src/Dist_Archive_Command. php (lines 493-536) uses RecursiveIteratorIterator which iterates through every single file in the directory tree, including all files inside directories that should be ignored according to .distignore. This causes severe performance degradation when dealing with large ignored directories like node_modules.

The problem occurs because:

  1. The RecursiveIteratorIterator descends into every subdirectory, including ignored ones like node_modules/
  2. For each file (potentially 30,000+ files in node_modules), the iterator:
    • Creates an SplFileInfo object
    • Enters the foreach loop
    • Calculates the relative filepath
    • Calls $this->checker->isPathIgnored() to check if it should be excluded
  3. Only after all this work does it decide not to include the file in the archive

The check if ( $this->checker->isPathIgnored( $relative_filepath ) ) happens inside the loop for every single item found, rather than preventing descent into ignored directories in the first place.

Describe how other contributors can replicate this bug

  1. Create a WordPress plugin with a node_modules directory containing 30,000+ files (or any large directory)
  2. Add node_modules to your .distignore file:
node_modules
. git
vendor
tests
  1. Run wp dist-archive . build. zip from the plugin directory
  2. Observe that the command takes several minutes to complete
  3. The command iterates through all 30,000+ files in node_modules even though the entire directory is ignored

Describe what you would expect as the correct outcome

The iterator should skip descending into directories that are marked as ignored in .distignore, avoiding the need to iterate through their contents entirely.

For a project with node_modules containing 30,000 files:

  • Current behavior: 30,000+ iterations and isPathIgnored() checks, taking several minutes
  • Expected behavior: Skip the node_modules directory at the directory level, only iterate through files that need to be processed, completing in seconds

Let us know what environment you are running this on

OS: [Various - affects all operating systems]
PHP version: [Various - affects all PHP versions]
WP-CLI version: [Various - affects current versions using this package]

The issue is present in the current implementation regardless of environment.

Provide a possible solution

The solution is to use a RecursiveFilterIterator to filter out ignored directories before the RecursiveIteratorIterator descends into them. This way, the iterator never enters ignored directories, avoiding thousands of unnecessary iterations.

The filter would check at the directory level: "If this is a directory and it's ignored in .distignore, don't descend into it." This is fundamentally different from the current approach which checks every file after already finding it.

Provide additional context/Screenshots

This is a well-known performance pattern issue with RecursiveIteratorIterator. The iterator is designed to "flatten" the entire directory tree before filtering can occur.

Related discussion: The issue was previously mentioned in #81 (comment)

For WordPress plugins with modern build processes (npm packages, composer vendor directories, etc.), this performance issue makes the dist-archive command nearly unusable without manually deleting these directories first, which defeats the purpose of having a .distignore file.


To submit this issue:

Go to https://github.com/wp-cli/dist-archive-command/issues/new and paste the content above, or would you like me to attempt to create it directly for you? </issue_description>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 4 commits December 23, 2025 17:30
Updated the RecursiveFilterIterator to use hasChildren() instead of accept() to prevent descending into ignored directories. This ensures:
1. Ignored directories are still yielded (needed for exclude lists in targz format)
2. We don't descend into them (performance optimization)
3. All existing tests pass

Co-authored-by: swissspidy <[email protected]>
Updated hasChildren() to be more conservative when skipping descent:
- Only skip for top-level ignored directories where children would also be ignored
- This handles complex patterns like "frontend/*" with "!/frontend/build/"
- Still provides performance benefit for common cases like node_modules and .git
- All 27 distignore test scenarios now pass

Co-authored-by: swissspidy <[email protected]>
- Use substr() instead of str_replace() for path manipulation to avoid edge cases
- Add comment explaining the use of 'test' as a probe filename
- Replace Bash-specific syntax with portable POSIX shell commands in tests
- Fix code style (trailing whitespace)

Co-authored-by: swissspidy <[email protected]>
Copilot AI changed the title [WIP] Fix performance issue with RecursiveIteratorIterator in file scanning Optimize dist-archive by preventing descent into ignored directories Dec 23, 2025
Copilot AI requested a review from swissspidy December 23, 2025 17:45
@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 91.02564% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/Distignore_Filter_Iterator.php 89.83% 6 Missing ⚠️
src/Dist_Archive_Command.php 94.73% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@swissspidy

This comment was marked as resolved.

Introduced an ignored_cache array in Distignore_Filter_Iterator to cache the results of isPathIgnored() checks. This eliminates duplicate checks:
- Once in hasChildren() when deciding whether to descend into directories
- Once in get_file_list() when categorizing files/directories

The cache is shared between both methods via the new isPathIgnoredCached() method, reducing redundant gitignore pattern matching for directories.

Co-authored-by: swissspidy <[email protected]>

This comment was marked as resolved.

@swissspidy

This comment was marked as resolved.

Moved the filtering logic from get_file_list() into the iterator's accept() method as suggested. This provides a more profound optimization:

- accept() now filters out ignored FILES immediately (not yielded)
- Ignored DIRECTORIES are still yielded (for exclude tracking) but won't be descended into per hasChildren()
- get_file_list() loop now only processes items that pass the filter, significantly reducing iterations
- Excluded files are tracked in the iterator and available via getExcludedFiles()

Performance: Iterator now yields only ~5 items instead of 200+ for a directory with 200 ignored files, while maintaining correct behavior for complex gitignore patterns with negations.

Co-authored-by: swissspidy <[email protected]>

This comment was marked as resolved.

@swissspidy

This comment was marked as resolved.

Created 9 test methods covering all major functionality:
- test_filters_ignored_files: Verifies ignored files are not yielded
- test_tracks_ignored_directories: Ensures directories are tracked but files inside are filtered
- test_get_excluded_files: Tests getExcludedFiles() method
- test_caching_avoids_duplicate_checks: Verifies caching behavior
- test_has_children_prevents_descent: Tests that hasChildren() prevents descent into ignored dirs
- test_negation_patterns: Validates complex negation patterns work correctly
- test_get_error_for_item_returns_null: Tests error handling
- test_nested_directory_filtering: Ensures multi-level directories are handled
- test_children_share_state: Verifies child iterators share cache and excluded files

All 9 tests pass with 24 assertions. Tests include setup/teardown with temporary directories.

Co-authored-by: swissspidy <[email protected]>

This comment was marked as resolved.

@swissspidy swissspidy marked this pull request as ready for review December 24, 2025 09:52
@swissspidy swissspidy requested a review from a team as a code owner December 24, 2025 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance Issue: RecursiveIteratorIterator scans all files in ignored directories before filtering

2 participants