Solutions: Refurbish "long-term store" #440
Conversation
Walkthrough

Added a new Long-term store doc and navigation entry, removed the legacy time-series longterm page, updated time-series references to point to the new longterm page, added a Community Wetterdienst card to sample applications, and normalized an Airflow doc heading.

Sequence Diagram(s)

No sequence diagram — changes are documentation-only and do not alter runtime control flow.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/solution/longterm/index.md (2)
14-19: Actionable TODO: Elaborate topic domain and ingredients.

The instruction is clear. Expand this section following the pattern of `docs/solution/analytics/index.md` and `docs/solution/industrial/index.md` to summarize benefits, key ingredients, and motivations for keeping raw data long-term without exporting to external systems.

Would you like me to draft an expanded section based on the analytics and industrial examples?
23-40: Actionable TODO: Prometheus adapter benefits already researched.

Excellent groundwork—the Q&A block and references are ready to be incorporated. The excerpt clearly explains CrateDB's role as a long-term storage backend for Prometheus. Integrate this content into the main section text, replacing or contextualizing the TODO block.

Would you like me to refactor this TODO into prose and integrate it into the section?
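For orientation, connecting Prometheus to CrateDB for long-term storage usually goes through the CrateDB Prometheus Adapter, which Prometheus addresses via its remote write/read protocol. A minimal sketch of the Prometheus-side configuration, assuming the adapter runs locally on its commonly documented port 9268 with `/write` and `/read` endpoints (verify against the adapter's README; this snippet is not taken from the PR):

```yaml
# prometheus.yml (excerpt). Hypothetical local setup; the adapter
# host, port, and endpoint paths are assumptions to adjust per deployment.
remote_write:
  - url: "http://localhost:9268/write"
remote_read:
  - url: "http://localhost:9268/read"
```

With this in place, Prometheus streams samples into CrateDB as they are scraped, and can read historic data back for queries that reach beyond its local retention window.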
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/integrate/airflow/data-retention-hot-cold.md (1 hunks)
- docs/solution/index.md (3 hunks)
- docs/solution/longterm/index.md (1 hunks)
- docs/solution/time-series/index.md (1 hunks)
- docs/solution/time-series/longterm.md (0 hunks)
- docs/start/application/index.md (2 hunks)
💤 Files with no reviewable changes (1)
- docs/solution/time-series/longterm.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/solution/longterm/index.md
58-58: Bare URL used
(MD034, no-bare-urls)
docs/start/application/index.md
99-99: Bare URL used
(MD034, no-bare-urls)
🔇 Additional comments (9)
docs/integrate/airflow/data-retention-hot-cold.md (1)
2-2: Concise heading text.

The abbreviation "hot/cold" improves readability without sacrificing clarity.
docs/start/application/index.md (2)
2-2: Style normalization.

Header case change ("Sample Applications" → "Sample applications") improves consistency across documentation.
90-120: Community section structure is well-organized; wrap bare URL.

The Community rubric and grid-item-card follow established patterns effectively. However, the bare URL at line 99 violates markdownlint (MD034). The `:link:` attribute already references the URL, so the bare URL in the document is redundant.

Apply this diff to remove the bare URL:

```diff
 ::::{grid-item-card}
-:link: https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export
-:link-type: url
 (weather-data-storage)=
 :::{rubric} Store and analyze massive amounts of synoptic weather data
 :::
```

Move the URL into a proper grid-item-card link attribute (it is already there at line 99–100). Remove lines 99–100 as the link URL is now specified via the `:link:` and `:link-type:` grid-item-card attributes, and the reference anchor and rubric should follow on subsequent lines.

Here's the corrected structure:

```diff
 ::::{grid-item-card}
+(weather-data-storage)=
+:link: https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export
+:link-type: url
 :::{rubric} Store and analyze massive amounts of synoptic weather data
 :::
```

The grid-item-card's `:link:` attribute handles the URL, so no bare text is needed in the body.

⛔ Skipped due to learnings
Learnt from: amotl
PR: crate/cratedb-guide#364
File: docs/tutorial/index.md:48-55
Timestamp: 2025-10-07T10:19:41.866Z
Learning: In MyST (Markdown Structured Text) syntax used in the CrateDB Guide docs, bare URLs in directive options like `:link:` are perfectly valid and expected. For example, `:::{card} :link: https://learn.cratedb.com/ :::` is correct MyST syntax. The markdownlint MD034 warning for bare URLs in these directive options is a false positive because markdownlint doesn't understand MyST-specific syntax.

docs/solution/longterm/index.md (2)
1-3: Smart cross-linking anchors.

Multiple reference anchors (longterm-store, timeseries-longterm, timeseries-long-term-storage) support flexible navigation from different sections and maintain backward compatibility.
52-68: No bare URL violation found; cross-references resolve correctly.

The file contains a properly formatted reference-style link (line 54 references line 71's definition), not a bare URL. There is no violation of MD034. Both cross-references, `airflow-data-retention-hot-cold` and `weather-data-storage`, resolve to existing anchors in the documentation.

Likely an incorrect or invalid review comment.
docs/solution/time-series/index.md (1)
75-94: Appropriate cross-linking to new long-term store domain.

The addition of {ref}`longterm-store` in the Domains list correctly references the new dedicated long-term storage section. This refactoring maintains navigation coherence while promoting long-term storage to peer status with analytics and machine-learning domains.

docs/solution/index.md (3)
10-10: Toctree entry enables proper navigation.

Adding `longterm/index` to the toctree ensures the new long-term storage section is accessible from the sidebar and properly integrated into the documentation hierarchy.
19-19: Clear scope update.

The updated tagline now correctly reflects the expanded documentation scope to include long-term storage alongside time series and analytics.
40-52: Well-structured grid card with clear value proposition.

The Long-term store card clearly articulates the core benefit: keeping raw data accessible for analysis without forced retirement or export. The "What's inside" bullets are informative and promise practical guidance (time-based bucketing, Dask import, storage optimization).

However, verify that each bullet corresponds to substantial content in `docs/solution/longterm/index.md`. Currently, the longterm/index.md file contains only placeholder TODOs and cross-references; these bullets should either be delivered in expanded content or adjusted to match current reality.
docs/solution/longterm/index.md (outdated)

```markdown
:::{todo}
**Instructions:**
Elaborate a bit longer about the topic domain and the ingredients of this section
in an abstract way, concisely highlighting and summarizing relevant benefits,
like the `../analytics/index` and `../industrial/index` pages are doing it already.
:::
```
@coderabbitai: Let me repeat my instructions and better reference corresponding pages here, in Markdown format, so you might be able to use them better for coming up with relevant improvements.
Elaborate a bit longer about the topic domain and the ingredients of this section in an abstract way, concisely highlighting and summarizing relevant benefits, like the Real-time raw-data analytics and Industrial data landing pages are doing it already.
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Fetch the Real-time raw-data analytics landing page
echo "=== Real-time raw-data analytics page ==="
cat docs/solution/analytics/index.md
echo ""
echo "=== Industrial data page ==="
cat docs/solution/industrial/index.md
```

Length of output: 4816
@amotl: Based on the structure and style of the example pages, here's an elaborated version for the long-term store section:
```markdown
(longterm-store)=
(timeseries-longterm)=
(timeseries-long-term-storage)=

# Long-term store

:::{div} sd-text-muted
Never retire data just because your other systems can't handle the cardinality.
:::

Many organizations need to retain data for years or decades to meet regulatory
requirements, support historical analysis, or preserve valuable insights for
future use. However, traditional storage systems force you to choose between
accessibility and affordability, often leading to data exports, archival
systems, or downsampling that sacrifice query capabilities.

CrateDB eliminates this trade-off by storing large volumes of data efficiently
while keeping it fully accessible for querying and analysis. Unlike systems
that struggle with high cardinality or require expensive tiered architectures,
CrateDB handles billions of unique records in a single platform, maintaining
fast query performance even on historic datasets spanning years.

By keeping all your data in one place, you avoid the complexity and costs of
exporting to specialized long-term storage systems, data lakes, or cold storage
tiers. Your historical data remains as queryable as your recent data, enabling
seamless analysis across any time range without data movement, ETL pipelines,
or rehydration processes.

With CrateDB, compatible with PostgreSQL, you can do all of that using plain SQL.
Besides integrating well with commodity systems using standard database
access interfaces like ODBC or JDBC, it provides a proprietary HTTP interface
on top.
```

This elaboration:
- Paragraph 1: Establishes the domain and business need for long-term storage
- Paragraph 2: Highlights CrateDB's core capabilities (efficiency, accessibility, cardinality handling)
- Paragraph 3: Emphasizes the key benefit of avoiding exports and maintaining simplicity
- Paragraph 4: Mentions technical capabilities (PostgreSQL compatibility, SQL, standard interfaces)
The content follows the same concise, active style as the example pages while specifically addressing the benefits mentioned in your PR objectives about not exporting data and handling cardinality.
This is your chance to have your say on this section. If you ever thought about dropping a few words about this very topic and why CrateDB does an excellent job here, now it's the right time. 🍀
Please let us know what you think about this explanation, added with 015a182, so you can comment inline on the GitHub PR. In the same fashion, the subsequent commit cc50c75 adds corresponding teaser information about OpenTelemetry, also up for review. 🙏
/cc @karynzv, @hammerhead, @surister, @zolbatar, @kneth
```markdown
## Related sections

{ref}`metrics-store` includes information about how to
store and analyze high volumes of system monitoring information
like metrics and log data with CrateDB.

{ref}`analytics` describes how
CrateDB provides real-time analytics on raw data stored for the long term.
Keep massive amounts of data ready in the hot zone for analytics purposes.

[Optimizing storage efficiency for historic time series data]
illustrates how to reduce table storage size by 80%,
by using arrays for time-based bucketing, a historical table having
a dedicated layout, and querying using the UNNEST table function.

{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.

{ref}`weather-data-storage` provides information about how to
use CrateDB for mass storage of synoptic weather observations,
allowing you to query them efficiently.

[Optimizing storage efficiency for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762
```
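The storage-optimization article referenced above centers on time-based bucketing: collapsing many narrow rows into fewer rows that hold arrays. As a rough, hypothetical illustration of that idea in Python (the article's actual schema and SQL live in the linked community post), this sketch groups per-second readings into hourly array buckets:

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_hourly(readings):
    """Group (timestamp, value) readings into hourly buckets,
    storing values as arrays instead of one row per reading."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the hour to form the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return dict(buckets)

# 7200 per-second readings spanning two hours.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
readings = [
    (start.replace(hour=h, minute=m, second=s), float(h))
    for h in range(2) for m in range(60) for s in range(60)
]

buckets = bucket_hourly(readings)
# Two buckets remain instead of 7200 rows, each holding 3600 values.
print(len(buckets), len(next(iter(buckets.values()))))  # -> 2 3600
```

In the database, the same shape maps to a historical table with array columns, queried back into individual readings via the UNNEST table function, as the article describes.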
That's an example of how to apply another little proposal, to check how it will be received by others. In this case, it's the first page that received a "Related sections" section at the bottom, instead of a "See also" section.

Here, each item can easily include a bit more explanatory prose (teaser text), without overdoing it.

/cc @karynzv, @hammerhead, @surister, @kneth
👍
docs/solution/longterm/index.md (outdated)

```markdown
{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.
```
While the Airflow-based system was a first mover and acted as a role model, CrateDB Toolkit's retention and expiration subsystem [1] effectively implements the core ideas [2] of the same thing today, so it can be reused in more flexible ways without depending on Airflow.

When it's considered "ready" or "good enough", we may link it into the documentation here.

/cc @hammerhead, @joerg84

Footnotes

[1] What is missing yet for an "as a service" MVP is a miniature scheduler to invoke corresponding retention tasks recurrently, and accompanying the CLI by an HTTP API, so the code can become a service daemon that clients can talk to remotely. After establishing this, the service can be ~~whipped~~ deployed into any sort of container/cloud environment, where it can possibly gain matureness. People who can use cron or any other scheduler system can immediately leverage the CLI interface and the routines behind it.
21fe148 implements the proposal outlined above, bundles two available automatic data retention variants, and tries to find the right jargon to convey what's inside.
- Airflow-based data retention
- CTK-based data retention
```markdown
## Tools

### Automatic retention and expiration

When operating a system storing and processing large amounts of data,
it is crucial to manage data flows and life-cycles well, which includes
handling concerns of data expiry, size reduction, and archival.

Optimally, corresponding tasks are automated rather than manually
performed. CrateDB provides relevant integrations and standalone
applications for automatic data retention purposes.

:::{rubric} Apache Airflow
:::

{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.

:::{rubric} CrateDB Toolkit
:::

[CrateDB Toolkit Retention and Expiration] is a data retention and
expiration policy management system for CrateDB, providing multiple
retention strategies.

:::{note}
The system derives its concepts from [InfluxDB data retention] ideas and
from the {ref}`Airflow-based data retention tasks for CrateDB <airflow-data-retention-policy>`,
but aims to be usable as a standalone system in different software environments.
Effectively, it is a Python library and CLI around a policy management
table defined per [retention-policy-ddl.sql].
:::
```
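To make the retention-policy concept concrete, here is a small, hypothetical Python sketch of age-based policy evaluation. The names and fields are illustrative only; they do not mirror CrateDB Toolkit's actual API or the schema in retention-policy-ddl.sql:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Policy:
    """Illustrative stand-in for one row of a policy management table."""
    table: str
    retention_days: int   # partitions older than this are affected
    strategy: str         # e.g. "delete", or "reallocate" for hot -> cold moves

def expired_partitions(policy, partitions, today):
    """Return the partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=policy.retention_days)
    return [p for p in partitions if p < cutoff]

policy = Policy(table="doc.sensor_readings", retention_days=90, strategy="delete")
partitions = [date(2024, 1, 1), date(2024, 3, 1), date(2024, 6, 1)]

# Partitions older than 90 days relative to 2024-06-15 would be acted upon.
print(expired_partitions(policy, partitions, today=date(2024, 6, 15)))
# -> [datetime.date(2024, 1, 1), datetime.date(2024, 3, 1)]
```

A scheduler (cron, Airflow, or the miniature scheduler mentioned above) would run such an evaluation recurrently and then execute the policy's strategy against the matching partitions.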
Nice content, but it obstructs the conciseness of this page. Please relocate to a separate one and reference it within the "Related sections" section.
About
Refactor sloppy placeholder page into dedicated thing that educates readers well about its ingredients. This time, it is about CrateDB's benefits when using it as a long-term store, without the need to export data to different systems.
The page is still in its infancy, so every kind of support is welcome to shape it into the right direction.
Preview
https://cratedb-guide--440.org.readthedocs.build/solution/longterm/
References
Review
Please come up with any suggestions for improvements you can think about. Adding certain content, cross-linking to other resources, fixing bugs or clarifying ambiguities: Every bit counts. 🙏
Trivia
@coderabbitai: Please consider all `todo` comments as instructions to come up with specific suggestions how to improve the individual spots. Please use concise language and active voice and don't tell users unnecessary or too detailed stuff. Suggestions should well include recommendations to refactor content from other sections, when applicable. Please come up with suggestions using inline comments, so they can be processed efficiently, or respond on the conversations we've already started about them. Thanks!