Fixed the limit bug and added test for count() method and documentation for count() #2423

Open: kaustuvnandy wants to merge 5 commits into apache:main from kaustuvnandy:test-count-method-2121 (base: main)

Commits:

- ca777d6: Added test for count() method and documentation for count()
- d8f9411: Enhanced documentation on the test for count() recipe
- ecf72d1: Some changes as per review comments
- 782bea5: Fixed DataScan.count() limit parameter
- 8b59b81: SQL-Like expressions
---
title: Count Recipe - Efficiently Count Rows in Iceberg Tables
---

# Counting Rows in an Iceberg Table

This recipe demonstrates how to use the `count()` function to efficiently count rows in an Iceberg table using PyIceberg. The count operation is optimized for performance by reading file metadata rather than scanning actual data.

## How Count Works

The `count()` method leverages Iceberg's metadata architecture to provide fast row counts by:

1. **Reading file manifests**: Examines metadata about data files without loading the actual data
2. **Aggregating record counts**: Sums up record counts stored in Parquet file footers
3. **Applying filters at metadata level**: Pushes down predicates to skip irrelevant files
4. **Handling deletes**: Automatically accounts for delete files and tombstones
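The steps above can be sketched in plain Python. This is a simplified model of the fast path, not PyIceberg's actual internals; `FileTask` and `metadata_count` are illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class FileTask:
    """Stand-in for one planned data file: the record count recorded in
    table metadata, plus any delete files that apply to it."""
    record_count: int
    delete_files: list = field(default_factory=list)

def metadata_count(tasks):
    """Sum record counts from metadata; no data file is ever opened."""
    total = 0
    for task in tasks:
        if task.delete_files:
            # A real implementation must read the file to apply deletes;
            # this sketch only covers the pure-metadata fast path.
            raise NotImplementedError("files with deletes require a data read")
        total += task.record_count
    return total

tasks = [FileTask(record_count=500_000), FileTask(record_count=500_000)]
print(metadata_count(tasks))  # 1000000
```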

## Basic Usage

Count all rows in a table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("default.cities")

# Get total row count
row_count = table.scan().count()
print(f"Total rows in table: {row_count}")
```

## Count with Filters

Count rows matching specific conditions:

```python
# Count rows with population > 1,000,000
large_cities = table.scan().filter("population > 1000000").count()
print(f"Large cities: {large_cities}")

# Count rows with specific country and population criteria
filtered_count = table.scan().filter("country = 'Netherlands' AND population > 100000").count()
print(f"Dutch cities with population > 100k: {filtered_count}")
```

## Count with Limit

The `count()` method supports a `limit` parameter for efficient counting when you only need to know whether a table has at least N rows, or when working with very large datasets:

```python
# Check if the table has at least 1000 rows (stops counting after reaching 1000)
has_enough_rows = table.scan().count(limit=1000) >= 1000
print(f"Table has at least 1000 rows: {has_enough_rows}")

# Get the count up to a maximum of 10,000 rows
limited_count = table.scan().count(limit=10000)
print(f"Row count (max 10k): {limited_count}")

# Combine limit with filters for efficient targeted counting
recent_orders_sample = table.scan().filter("order_date > '2023-01-01'").count(limit=5000)
print(f"Recent orders (up to 5000): {recent_orders_sample}")
```

### Performance Benefits of Limit

Using the `limit` parameter provides significant performance improvements:

- **Early termination**: Stops processing files once the limit is reached
- **Reduced I/O**: Avoids reading metadata from unnecessary files
- **Memory efficiency**: Processes only the minimum required data
- **Faster response**: Ideal for existence checks and sampling operations
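The early-termination behavior can be sketched as a loop over per-file record counts that stops as soon as the running total reaches the limit. This is a simplified model of the idea, not the library's implementation; `count_with_limit` is an illustrative name:

```python
def count_with_limit(record_counts, limit=None):
    """Sum per-file record counts, stopping once `limit` is reached.
    Files after the cut-off are never examined."""
    total = 0
    for count in record_counts:
        total += count
        if limit is not None and total >= limit:
            return limit  # cap the result at the requested limit
    return total

print(count_with_limit([30, 30, 30], limit=50))  # 50
print(count_with_limit([30, 12], limit=1000))    # 42
```

Capping the return value at `limit` matches the behavior the tests below expect: the limit is returned when it is reached, and the true count is returned when the table is smaller than the limit.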

!!! tip "When to Use Limit"

    **Use `limit` when:**

    - Checking if a table has "enough" data (existence checks)
    - Sampling row counts from very large tables
    - Building dashboards that show approximate counts
    - Validating data ingestion without full table scans

    **Example use cases:**

    - Data quality gates: "Does this partition have at least 1000 rows?"
    - Monitoring alerts: "Are there more than 100 error records today?"
    - Approximate statistics: "Show roughly how many records per hour"

## Performance Characteristics

The count operation is highly efficient because:

- **No data scanning**: Reads only table metadata and manifest files, not row data
- **Parallel processing**: Can process multiple files concurrently
- **Filter pushdown**: Eliminates files that don't match criteria
- **Cached statistics**: Utilizes pre-computed record counts
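Filter pushdown at the metadata level can be illustrated with per-file column statistics: files whose min/max range cannot match the predicate are skipped outright, files that fully match contribute their record count with no data read, and only files whose bounds straddle the predicate would need to be opened. This is a simplified sketch under those assumptions; `DataFileStats` and `prune_for_greater_than` are illustrative names:

```python
from dataclasses import dataclass

@dataclass
class DataFileStats:
    record_count: int
    min_value: int  # lower bound of the filtered column in this file
    max_value: int  # upper bound of the filtered column in this file

def prune_for_greater_than(files, threshold):
    """Return (rows known to match from metadata alone, files needing a data read)."""
    exact = 0
    needs_read = []
    for f in files:
        if f.max_value <= threshold:
            continue                 # no row can match: skip the file entirely
        elif f.min_value > threshold:
            exact += f.record_count  # every row matches: count from metadata
        else:
            needs_read.append(f)     # bounds straddle the threshold
    return exact, needs_read

files = [
    DataFileStats(100, 0, 500),        # entirely below the threshold -> pruned
    DataFileStats(200, 2_000, 9_000),  # entirely above -> counted from metadata
    DataFileStats(50, 500, 1_500),     # straddles -> would need a data read
]
exact, pending = prune_for_greater_than(files, 1_000)
print(exact, len(pending))  # 200 1
```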

!!! tip "Even Faster: Use Snapshot Properties"

    For the fastest possible total row count (without filters), you can access the cached count directly from snapshot properties, avoiding any table scanning:

    ```python
    # Get total records from snapshot metadata (fastest method)
    total_records = table.current_snapshot().summary.additional_properties["total-records"]
    print(f"Total rows from snapshot: {total_records}")
    ```

    **When to use this approach:**

    - When you need the total table row count without any filters
    - For dashboard queries that need instant response times
    - When working with very large tables where even metadata scanning takes time
    - For monitoring and alerting systems that check table sizes frequently

    **Note:** This method only works for total counts. For filtered counts, use `table.scan().filter(...).count()`. Also note that `current_snapshot()` returns `None` for a table with no snapshots, so guard against empty tables.

## Test Scenarios

Our test suite validates count behavior across different scenarios:

### Basic Counting (test_count_basic)

```python
# Simulates a table with a single file containing 42 records
assert table.scan().count() == 42
```

### Empty Tables (test_count_empty)

```python
# Handles tables with no data files
assert empty_table.scan().count() == 0
```

### Large Datasets (test_count_large)

```python
# Aggregates counts across multiple files (2 files × 500,000 records each)
assert large_table.scan().count() == 1000000
```

### Limit Functionality (test_count_with_limit_mock)

```python
# Tests that the limit parameter is respected and provides early termination
limited_count = table.scan().count(limit=50)
assert limited_count == 50  # Stops at the limit even if more rows exist

# Test with a limit larger than the available data
all_rows = small_table.scan().count(limit=1000)
assert all_rows == 42  # Returns the actual count when limit > total rows
```

### Integration Testing (test_datascan_count_respects_limit)

```python
# Full end-to-end validation with real table operations
# Creates a table, adds data, and verifies limit behavior in realistic scenarios
assert table.scan().count(limit=1) == 1
assert table.scan().count() > 1  # Unlimited count returns more
```

## Best Practices

1. **Use count() for data validation**: Verify expected row counts after ETL operations
2. **Combine with filters**: Get targeted counts without full table scans
3. **Leverage limit for existence checks**: Use `count(limit=N)` when you only need to know whether a table has at least N rows
4. **Monitor table growth**: Track record counts over time for capacity planning
5. **Validate partitions**: Count rows per partition to ensure balanced distribution
6. **Use appropriate limits**: Set sensible limits for dashboard queries and monitoring to improve response times
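The partition-validation practice can be sketched as a small balance check: gather one count per partition value (for example with `table.scan().filter(f"region = '{r}'").count()`, where `region` is a hypothetical partition column) and compare the spread. `partition_skew` and the sample counts below are illustrative:

```python
def partition_skew(counts):
    """Ratio of the largest to the smallest non-empty partition count;
    1.0 means perfectly balanced."""
    values = [v for v in counts.values() if v > 0]
    return max(values) / min(values)

# Counts per hypothetical `region` partition, e.g. gathered with
#   {r: table.scan().filter(f"region = '{r}'").count() for r in regions}
counts = {"emea": 1200, "apac": 900, "amer": 1500}
print(round(partition_skew(counts), 2))  # 1.67
```

A skew ratio well above 1.0 flags an unbalanced partition layout worth investigating.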

!!! warning "Limit Considerations"

    When using `limit`, remember that:

    - The count may be less than the actual total if the limit is reached
    - Which rows end up counted depends on file processing order, so a limited count is not a representative sample
    - Use an unlimited count when you need exact totals
    - Combine with filters for more targeted limited counting

## Common Use Cases

- **Data quality checks**: Verify ETL job outputs
- **Partition analysis**: Compare record counts across partitions
- **Performance monitoring**: Track table growth and query patterns
- **Cost estimation**: Understand data volume before expensive operations

For more details and complete API documentation, see the [API documentation](api.md#count-rows-in-a-table).
Review comment:

> It could be worth mentioning as a note that we could get the total count of a table from snapshot properties doing this:
>
> `table.current_snapshot().summary.additional_properties["total-records"]`
>
> so users can avoid doing a full table scan

Author reply (kaustuvnandy):

> Thank you for the comments, I will work on them 😊