disagg: Add O11y on object store usage summary of each tiflash store#10764
Conversation
Signed-off-by: JaySon-Huang <tshent@qq.com>
|
Review Complete Findings: 6 issues ℹ️ Learn more details on Pantheon AI. |
📝 WalkthroughWalkthroughAdds periodic remote S3 storage summary reporting: new setting to control interval, S3 GC manager/service ownership and scheduling for a periodic summary task, and Prometheus metrics to record per-store data and DT file byte totals. Changes
Sequence DiagramsequenceDiagram
participant BP as BackgroundProcessingPool
participant SVC as S3GCManagerService
participant MGR as S3GCManager
participant S3 as S3 Storage
participant Metrics as TiFlashMetrics
rect rgba(100,200,150,0.5)
Note over SVC: Periodic Summary Task
BP->>SVC: run scheduled task
SVC->>MGR: isOwner()?
MGR-->>SVC: owner status
alt manager is owner
SVC->>MGR: getStoreStorageSummary(store_id)
MGR->>S3: list objects (abort if shutdown)
S3-->>MGR: object metadata
MGR->>MGR: compute data_file.bytes & dt_file.bytes
MGR->>Metrics: setS3StoreSummaryBytes(store_id, data, dt)
Metrics->>Metrics: lazy-create/update per-store gauges
Metrics-->>MGR: metrics recorded
else not owner or shutdown
SVC-->>BP: skip/stop scheduling
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: JaySon-Huang <tshent@qq.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/grafana/tiflash_summary.json`:
- Line 1116: The dashboard currently sets the panel option "nullPointMode":
"null as zero" for the owner-only remote-summary metric, which causes missing
summaries to display as 0 B; change the JSON to stop turning nulls into zeros by
replacing "nullPointMode": "null as zero" with a non-zero-filling mode (e.g.,
"nullPointMode": "null as null") or remove the nullPointMode entry for the
remote-summary panels (the JSON key is "nullPointMode" in tiflash_summary.json
where the remote-summary metric is defined).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f03894c1-c201-42c1-8324-238a765dfe8e
📒 Files selected for processing (1)
metrics/grafana/tiflash_summary.json
There was a problem hiding this comment.
Pull request overview
Adds owner-only, configurable periodic S3 object-store usage summarization for TiFlash disaggregated mode, exporting per-store usage as Prometheus metrics and surfacing it on the Grafana summary dashboard to improve low-noise observability of remote footprint.
Changes:
- Add configurable periodic S3 storage summary scheduling to
S3GCManagerService(owner-only; disabled when interval ≤ 0) and early-exit behavior during shutdown. - Introduce new Prometheus gauge family
tiflash_storage_s3_store_summary_bytes{store_id,type}and update it after each per-store summary scan. - Wire a new setting
remote_summary_interval_seconds(default 24h) throughTMTContextinto S3GC config, and update Grafana dashboard panels.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| metrics/grafana/tiflash_summary.json | Adds panels for remote store summary metric and adjusts dashboard layout. |
| dbms/src/Storages/S3/S3GCManager.h | Extends S3GC config with summary interval and adds owner-check API + service task handle. |
| dbms/src/Storages/S3/S3GCManager.cpp | Implements owner check, shutdown-aware listing break, per-store metric update, and periodic summary task scheduling/shutdown. |
| dbms/src/Storages/KVStore/TMTContext.cpp | Wires the new settings value into S3GCConfig. |
| dbms/src/Interpreters/Settings.h | Introduces remote_summary_interval_seconds setting (default 24h, ≤0 disables). |
| dbms/src/Common/TiFlashMetrics.h | Declares setter and storage for per-store summary gauge metrics. |
| dbms/src/Common/TiFlashMetrics.cpp | Registers the new gauge family and implements setS3StoreSummaryBytes. |
Comments suppressed due to low confidence (1)
metrics/grafana/tiflash_summary.json:13769
- In this panel,
seriesOverridescontains an override with an emptyaliasand setsyaxis: 2, while the panel only has a single target (tiflash_storage_io_limiter_pending_countrate) that should use the left "ops" axis. This override is effectively dead configuration and the right Y-axis is formatted as seconds but has no series mapped to it. Consider removing the unused override and the unused seconds Y-axis (or reintroduce the intended duration series if that was the goal) to avoid confusion when editing the dashboard later.
"seriesOverrides": [
{
"alias": "",
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "sum(rate(tiflash_storage_io_limiter_pending_count{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\", instance=~\"$tiflash_role\"}[1m])) by (type, instance)",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{type}}-{{instance}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "I/O Limiter Pending Rate",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": 0,
"format": "ops",
"label": null,
"logBase": 1,
"max": null,
"min": "0",
"show": true
},
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: JaySon-Huang <tshent@qq.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
dbms/src/Interpreters/Settings.h (1)
243-243: Use camelCase for the new C++ setting field name.
remote_summary_interval_secondsintroduces a new snake_case C++ member. Please rename it to camelCase (for example,remoteSummaryIntervalSeconds) and keep external config-key mapping explicit where needed.As per coding guidelines, Method and variable names in C++ must use camelCase (e.g.,
readBlock,totalBytes).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Interpreters/Settings.h` at line 243, Rename the C++ setting identifier remote_summary_interval_seconds to camelCase remoteSummaryIntervalSeconds (the SettingInt64 declaration using remote_summary_interval_seconds should be updated to remoteSummaryIntervalSeconds) and ensure the external config key remains explicit (keep the string "remote_summary_interval_seconds" as the config name or add an explicit mapping initializer if required) so only the C++ member follows camelCase while external configuration key stays unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@dbms/src/Interpreters/Settings.h`:
- Line 243: Rename the C++ setting identifier remote_summary_interval_seconds to
camelCase remoteSummaryIntervalSeconds (the SettingInt64 declaration using
remote_summary_interval_seconds should be updated to
remoteSummaryIntervalSeconds) and ensure the external config key remains
explicit (keep the string "remote_summary_interval_seconds" as the config name
or add an explicit mapping initializer if required) so only the C++ member
follows camelCase while external configuration key stays unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f0d5407e-e2cf-41f4-9f93-e0b8c1e2d663
📒 Files selected for processing (2)
dbms/src/Interpreters/Settings.hdbms/src/Storages/S3/S3GCManager.cpp
✅ Files skipped from review due to trivial changes (1)
- dbms/src/Storages/S3/S3GCManager.cpp
|
/cherry-pick release-nextgen-20251011 |
|
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CalvinNeo, JinheLin, yudongusa The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
|
@JaySon-Huang: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What problem does this PR solve?
Issue Number: ref #10763
Problem Summary:
What is changed and how it works?
Add periodic S3 storage summary in
S3GCManagerService:summary_interval_seconds.summary_interval_seconds <= 0, the periodic task is not registered.http://${TIFLASH_IP}:${TIFLASH_STATUS_PORT}/tiflash/remote/infohttps://github.com/pingcap/tiflash/blob/master/docs/tiflash_http_api.md#fetch-the-remote-storage-summary-from-tiflash-write-nodeAdd graceful shutdown behavior during summary scan:
getStoreStorageSummarychecksshutdown_calledinlistPrefixloop and exits early.Add per-store summary metrics:
tiflash_storage_s3_store_summary_bytes{store_id, type}typein{data_file_bytes, dt_file_bytes}getStoreStorageSummarycompletion.Wire configuration from settings to S3GC config:
profiles.default.remote_summary_interval_seconds(default 0, disabled)TMTContextand passed intoS3GCManagerService.Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Documentation