
Conversation

@amotl
Member

@amotl amotl commented Oct 24, 2025

About

Refactor the sloppy placeholder page into a dedicated page that educates readers well about its ingredients. This time, it is about the benefits of using CrateDB as a long-term store, without the need to export data to different systems.

The page is still in its infancy, so every kind of support is welcome to steer it in the right direction.

Preview

https://cratedb-guide--440.org.readthedocs.build/solution/longterm/

References

Review

Please share any suggestions for improvement you can think of. Adding content, cross-linking to other resources, fixing bugs, or clarifying ambiguities: Every bit counts. 🙏

Trivia

@coderabbitai: Please treat all todo comments as instructions to come up with specific suggestions on how to improve the individual spots. Please use concise language and active voice, and don't tell users unnecessary or overly detailed things. Suggestions may well include recommendations to refactor content from other sections, when applicable. Please post suggestions as inline comments, so they can be processed efficiently, or respond in the conversations we've already started about them. Thanks!

@coderabbitai

coderabbitai bot commented Oct 24, 2025

Walkthrough

Added a new Long-term store doc and navigation entry, removed the legacy time-series longterm page, updated time-series references to point to the new longterm page, added a Community Wetterdienst card to sample applications, and normalized an Airflow doc heading.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Long-term store reorganization: `docs/solution/longterm/index.md`, `docs/solution/index.md`, `docs/solution/time-series/index.md`, `docs/solution/time-series/longterm.md` (deleted) | Added new long-term store documentation and TOC/card entry; removed legacy `time-series/longterm.md`; updated time-series overview, See Also, and Domains to reference the new longterm page. |
| Docs content tweaks & Community card: `docs/start/application/index.md`, `docs/integrate/airflow/data-retention-hot-cold.md` | Inserted a Community grid-item-card (Wetterdienst) into Sample applications and normalized the Airflow document heading text (minor stylistic change). |

Sequence Diagram(s)

No sequence diagram — changes are documentation-only and do not alter runtime control flow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify cross-links between docs/solution/index.md, docs/solution/longterm/index.md, and docs/solution/time-series/index.md.
  • Confirm deletion of docs/solution/time-series/longterm.md leaves no stray references.
  • Check new grid-item-card markup in docs/start/application/index.md for layout and TOC consistency.

Possibly related PRs

Suggested labels

refurbish

Suggested reviewers

  • matriv
  • bmunkholm
  • kneth

Poem

🐇
I hopped through docs with tiny, tidy paws,
Moved old pages to new well-marked stalls.
Wetterdienst blooms in community light,
Headings smoothed and links set right.
A rabbit cheers for docs that shine tonight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "Solutions: Refurbish 'long-term store'" directly corresponds to the primary changes in this PR. The raw summary confirms the changeset involves reorganizing and improving the long-term store documentation: moving content from a time-series subsection to a dedicated solution page at `docs/solution/longterm/index.md`, updating navigation structures, and refactoring related documentation. The term "Refurbish" accurately captures the intent to convert a placeholder page into a well-structured guide. The title is concise, specific, and clearly communicates the main objective to anyone scanning the commit history. |
| Description Check | ✅ Passed | The PR description is clearly related to the changeset and provides meaningful context about the changes. It explains that the PR refactors a placeholder page into a dedicated guide for CrateDB's long-term store capabilities, emphasizing retaining data without exporting to other systems. The description includes a preview link, references to related issues (GH-227), and explicitly requests feedback on content additions, cross-linking, and improvements. While some portions contain review instructions rather than traditional description content, the overall description meaningfully documents the purpose and scope of the changes being made. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch solution-longterm

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e10ec2e and 1c611bb.

📒 Files selected for processing (1)
  • docs/solution/longterm/index.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/solution/longterm/index.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build docs


@amotl amotl added the labels reorganize (Moving content around, inside and between other systems.), cross linking (Linking to different locations of the documentation.), and guidance (Matters of layout, shape, and structure.) Oct 24, 2025
@amotl amotl marked this pull request as ready for review October 24, 2025 00:21

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/solution/longterm/index.md (2)

14-19: Actionable TODO: Elaborate topic domain and ingredients.

The instruction is clear. Expand this section following the pattern of docs/solution/analytics/index.md and docs/solution/industrial/index.md to summarize benefits, key ingredients, and motivations for keeping raw data long-term without exporting to external systems.

Would you like me to draft an expanded section based on the analytics and industrial examples?


23-40: Actionable TODO: Prometheus adapter benefits already researched.

Excellent groundwork—the Q&A block and references are ready to be incorporated. The excerpt clearly explains CrateDB's role as a long-term storage backend for Prometheus. Integrate this content into the main section text, replacing or contextualizing the TODO block.

Would you like me to refactor this TODO into prose and integrate it into the section?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48fc073 and 93e31fb.

📒 Files selected for processing (6)
  • docs/integrate/airflow/data-retention-hot-cold.md (1 hunks)
  • docs/solution/index.md (3 hunks)
  • docs/solution/longterm/index.md (1 hunks)
  • docs/solution/time-series/index.md (1 hunks)
  • docs/solution/time-series/longterm.md (0 hunks)
  • docs/start/application/index.md (2 hunks)
💤 Files with no reviewable changes (1)
  • docs/solution/time-series/longterm.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/solution/longterm/index.md

58-58: Bare URL used

(MD034, no-bare-urls)

docs/start/application/index.md

99-99: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (9)
docs/integrate/airflow/data-retention-hot-cold.md (1)

2-2: Concise heading text.

The abbreviation "hot/cold" improves readability without sacrificing clarity.

docs/start/application/index.md (2)

2-2: Style normalization.

Header case change ("Sample Applications" → "Sample applications") improves consistency across documentation.


90-120: Community section structure is well-organized; wrap bare URL.

The Community rubric and grid-item-card follow established patterns effectively. However, the bare URL at line 99 violates markdownlint (MD034). The :link: attribute already references the URL, so the bare URL in the document is redundant.

Apply this diff to remove the bare URL:

 ::::{grid-item-card}
-:link: https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export
-:link-type: url
 (weather-data-storage)=
 :::{rubric} Store and analyze massive amounts of synoptic weather data
 :::

Move the URL into a proper grid-item-card link attribute (it is already there at line 99–100). Remove lines 99–100 as the link URL is now specified via the :link: and :link-type: grid-item-card attributes, and the reference anchor and rubric should follow on subsequent lines.

Here's the corrected structure:

 ::::{grid-item-card}
+(weather-data-storage)=
+:link: https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export
+:link-type: url
 
 :::{rubric} Store and analyze massive amounts of synoptic weather data
 :::

The grid-item-card's :link: attribute handles the URL, so no bare text is needed in the body.

⛔ Skipped due to learnings
Learnt from: amotl
PR: crate/cratedb-guide#364
File: docs/tutorial/index.md:48-55
Timestamp: 2025-10-07T10:19:41.866Z
Learning: In MyST (Markdown Structured Text) syntax used in the CrateDB Guide docs, bare URLs in directive options like `:link:` are perfectly valid and expected. For example, `:::{card} :link: https://learn.cratedb.com/ :::` is correct MyST syntax. The markdownlint MD034 warning for bare URLs in these directive options is a false positive because markdownlint doesn't understand MyST-specific syntax.
docs/solution/longterm/index.md (2)

1-3: Smart cross-linking anchors.

Multiple reference anchors (longterm-store, timeseries-longterm, timeseries-long-term-storage) support flexible navigation from different sections and maintain backward compatibility.


52-68: No bare URL violation found; cross-references resolve correctly.

The file contains a properly formatted reference-style link (line 54 references line 71's definition), not a bare URL. There is no violation of MD034. Both cross-references—airflow-data-retention-hot-cold and weather-data-storage—resolve to existing anchors in the documentation.

Likely an incorrect or invalid review comment.

docs/solution/time-series/index.md (1)

75-94: Appropriate cross-linking to new long-term store domain.

The addition of {ref}`longterm-store` in the Domains list correctly references the new dedicated long-term storage section. This refactoring maintains navigation coherence while promoting long-term storage to peer status with analytics and machine-learning domains.

docs/solution/index.md (3)

10-10: Toctree entry enables proper navigation.

Adding longterm/index to the toctree ensures the new long-term storage section is accessible from the sidebar and properly integrated into the documentation hierarchy.


19-19: Clear scope update.

The updated tagline now correctly reflects the expanded documentation scope to include long-term storage alongside time series and analytics.


40-52: Well-structured grid card with clear value proposition.

The Long-term store card clearly articulates the core benefit—keeping raw data accessible for analysis without forced retirement or export. The "What's inside" bullets are informative and promise practical guidance (time-based bucketing, Dask import, storage optimization).

However, verify that each bullet corresponds to substantial content in docs/solution/longterm/index.md. Currently, the longterm/index.md file contains only placeholder TODOs and cross-references; these bullets should either be delivered in expanded content or adjusted to match current reality.

Comment on lines 14 to 19
:::{todo}
**Instructions:**
Elaborate a bit longer about the topic domain and the ingredients of this section
in an abstract way, concisely highlighting and summarizing relevant benefits,
like the `../analytics/index` and `../industrial/index` pages are doing it already.
:::
Member Author


@coderabbitai: Let me repeat my instructions and better reference corresponding pages here, in Markdown format, so you might be able to use them better for coming up with relevant improvements.

Elaborate a bit more on the topic domain and the ingredients of this section in an abstract way, concisely highlighting and summarizing relevant benefits, as the Real-time raw-data analytics and Industrial data landing pages already do.



🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the Real-time raw-data analytics landing page
echo "=== Real-time raw-data analytics page ==="
cat docs/solution/analytics/index.md
echo ""
echo "=== Industrial data page ==="
cat docs/solution/industrial/index.md

Length of output: 4816


@amotl: Based on the structure and style of the example pages, here's an elaborated version for the long-term store section:

(longterm-store)=
(timeseries-longterm)=
(timeseries-long-term-storage)=

# Long-term store

:::{div} sd-text-muted
Never retire data just because your other systems can't handle the cardinality.
:::

Many organizations need to retain data for years or decades to meet regulatory
requirements, support historical analysis, or preserve valuable insights for
future use. However, traditional storage systems force you to choose between
accessibility and affordability, often leading to data exports, archival
systems, or downsampling that sacrifice query capabilities.

CrateDB eliminates this trade-off by storing large volumes of data efficiently
while keeping it fully accessible for querying and analysis. Unlike systems
that struggle with high cardinality or require expensive tiered architectures,
CrateDB handles billions of unique records in a single platform, maintaining
fast query performance even on historic datasets spanning years.

By keeping all your data in one place, you avoid the complexity and costs of
exporting to specialized long-term storage systems, data lakes, or cold storage
tiers. Your historical data remains as queryable as your recent data, enabling
seamless analysis across any time range without data movement, ETL pipelines,
or rehydration processes.

With CrateDB, which is compatible with PostgreSQL, you can do all of that using
plain SQL. Besides integrating well with commodity systems using standard
database access interfaces like ODBC or JDBC, it provides a proprietary HTTP
interface on top.

This elaboration:

  • Paragraph 1: Establishes the domain and business need for long-term storage
  • Paragraph 2: Highlights CrateDB's core capabilities (efficiency, accessibility, cardinality handling)
  • Paragraph 3: Emphasizes the key benefit of avoiding exports and maintaining simplicity
  • Paragraph 4: Mentions technical capabilities (PostgreSQL compatibility, SQL, standard interfaces)

The content follows the same concise, active style as the example pages while specifically addressing the benefits mentioned in your PR objectives about not exporting data and handling cardinality.

Member Author


This is your chance to have your say on this section. If you ever thought about dropping a few words about this very topic and why CrateDB does an excellent job here, now is the right time. 🍀

Please let us know what you think about this explanation, added with 015a182, so you can comment inline on the GitHub PR. In the same fashion, the subsequent commit cc50c75 adds corresponding teaser information about OpenTelemetry, also up for review. 🙏

/cc @karynzv, @hammerhead, @surister, @zolbatar, @kneth

Comment on lines 42 to 135
## Related sections

{ref}`metrics-store` includes information about how to
store and analyze high volumes of system monitoring information
like metrics and log data with CrateDB.

{ref}`analytics` describes how
CrateDB provides real-time analytics on raw data stored for the long term.
Keep massive amounts of data ready in the hot zone for analytics purposes.

[Optimizing storage efficiency for historic time series data]
illustrates how to reduce table storage size by 80%,
by using arrays for time-based bucketing, a historical table having
a dedicated layout, and querying using the UNNEST table function.

{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.

{ref}`weather-data-storage` provides information about how to
use CrateDB for mass storage of synoptic weather observations,
allowing you to query them efficiently.


[Optimizing storage efficiency for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762
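
To illustrate the time-based bucketing idea that the snippet above refers to, here is a minimal pure-Python sketch. It is only an analogy, not the approach of the linked article, which works with SQL arrays and the UNNEST table function inside CrateDB. The function names `bucket_by_day` and `unnest` are hypothetical, and the sketch makes no claim about storage savings.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_day(readings):
    """Group (timestamp, value) pairs into one row per day holding parallel arrays."""
    buckets = defaultdict(lambda: {"ts": [], "value": []})
    for ts, value in readings:
        day = ts.date().isoformat()
        buckets[day]["ts"].append(ts)
        buckets[day]["value"].append(value)
    return dict(buckets)


def unnest(buckets):
    """Expand the bucketed arrays back into individual (timestamp, value) rows."""
    for day in sorted(buckets):
        row = buckets[day]
        yield from zip(row["ts"], row["value"])


readings = [
    (datetime(2025, 10, 24, 0, 0), 1.0),
    (datetime(2025, 10, 24, 12, 0), 2.0),
    (datetime(2025, 10, 25, 0, 0), 3.0),
]
buckets = bucket_by_day(readings)
rows = list(unnest(buckets))
```

Each day becomes a single row with array columns, and `unnest` restores the original per-reading rows when a query needs them.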
Member Author


This is an example of how to apply another small proposal, to see how others receive it. In this case, it's the first page to receive a "Related sections" section at the bottom, instead of a "See also" section.

Here, each item can easily include a bit more explanatory prose (teaser text), without overdoing it.

/cc @karynzv, @hammerhead, @surister, @kneth

@amotl amotl changed the title from "Solution: Refurbish section about "long-term store"" to "Solutions: Refurbish section about "long-term store"" Oct 24, 2025
@amotl amotl changed the title from "Solutions: Refurbish section about "long-term store"" to "Solutions: Refurbish "long-term store"" Oct 24, 2025
@amotl amotl force-pushed the reorganize-solutions branch 3 times, most recently from 48df3d7 to 12c998f Compare October 24, 2025 03:14
@amotl amotl force-pushed the solution-longterm branch from 93e31fb to f20fa7b Compare October 24, 2025 03:43
coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the solution-longterm branch from f20fa7b to cc50c75 Compare October 24, 2025 04:00
@amotl amotl requested review from matriv and seut October 24, 2025 04:33
Contributor

@matriv matriv left a comment


👍

@amotl amotl force-pushed the reorganize-solutions branch from 1132546 to 79d12d4 Compare October 24, 2025 18:39
@amotl amotl force-pushed the solution-longterm branch from cc50c75 to 328cfe9 Compare October 24, 2025 18:43
@amotl amotl force-pushed the reorganize-solutions branch 2 times, most recently from 9af6323 to 9aa307a Compare October 24, 2025 18:48
@amotl amotl force-pushed the solution-longterm branch from 328cfe9 to 16d03d5 Compare October 24, 2025 18:49
@amotl amotl force-pushed the reorganize-solutions branch from 9aa307a to b73c681 Compare October 24, 2025 19:43
@amotl amotl force-pushed the solution-longterm branch from 16d03d5 to 26cd67c Compare October 24, 2025 19:45
Comment on lines 91 to 96
{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.
Member Author

@amotl amotl Oct 24, 2025


While the Airflow-based system was a first mover and acted as a role model, CrateDB Toolkit's retention and expiration subsystem [1] effectively implements the core ideas [2] of the same thing today, so it can be reused in more flexible ways without depending on Airflow.

When it's considered "ready" or "good enough", we may link it into the documentation here.

/cc @hammerhead, @joerg84

Footnotes

  1. https://cratedb-toolkit.readthedocs.io/retention.html

  2. What is still missing for an "as a service" MVP is a miniature scheduler to invoke the corresponding retention tasks recurrently, and an HTTP API to accompany the CLI, so the code can become a service daemon that clients can talk to remotely. Once this is established, the service can be deployed into any sort of container/cloud environment, where it can mature. People who can use cron or any other scheduler system can make immediate use of the CLI interface and the routines behind it.
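
As a rough illustration of the "miniature scheduler" mentioned in footnote 2, here is a minimal Python sketch that invokes a task recurrently. It is not part of CrateDB Toolkit: `run_recurrently` and `retention_task` are hypothetical names, and the task body is a placeholder where a real setup would call the retention CLI or library.

```python
import time


def run_recurrently(task, interval_seconds, iterations, sleep=time.sleep):
    """Invoke `task` a fixed number of times, pausing between consecutive runs."""
    results = []
    for i in range(iterations):
        results.append(task())
        # Do not wait after the final run.
        if i < iterations - 1:
            sleep(interval_seconds)
    return results


def retention_task():
    # Hypothetical placeholder: a real task would trigger the retention routines here.
    return "retention run completed"


# Use a recording stand-in for `sleep`, so the example finishes instantly.
pauses = []
results = run_recurrently(retention_task, interval_seconds=5, iterations=3, sleep=pauses.append)
```

Passing `sleep` as a parameter keeps the loop testable; anyone who already runs cron or another scheduler would not need such a loop at all.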

Member Author


21fe148 implements the proposal outlined above, bundles two available automatic data retention variants, and tries to find the right jargon to convey what's inside.

Base automatically changed from reorganize-solutions to main October 25, 2025 00:05
@amotl amotl force-pushed the solution-longterm branch from 26cd67c to 21fe148 Compare October 25, 2025 01:02
- Airflow-based data retention
- CTK-based data retention
@amotl amotl force-pushed the solution-longterm branch from 21fe148 to e10ec2e Compare October 25, 2025 01:14
coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl added the label refurbish (A more significant refurbishment.) and removed the label reorganize (Moving content around, inside and between other systems.) Oct 25, 2025
Comment on lines +76 to +111
## Tools

### Automatic retention and expiration

When operating a system storing and processing large amounts of data,
it is crucial to manage data flows and life-cycles well, which includes
handling concerns of data expiry, size reduction, and archival.

Optimally, corresponding tasks are automated rather than manually
performed. CrateDB provides relevant integrations and standalone
applications for automatic data retention purposes.

:::{rubric} Apache Airflow
:::

{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>`
describes how to manage aging data by leveraging CrateDB cluster
features to mix nodes with different hardware setups, i.e. hot
nodes using the latest generation of NVMe drives for responding
to analytics queries quickly, and cold nodes that have access to
cheap mass storage for retaining historic data.

:::{rubric} CrateDB Toolkit
:::

[CrateDB Toolkit Retention and Expiration] is a data retention and
expiration policy management system for CrateDB, providing multiple
retention strategies.

:::{note}
The system derives its concepts from [InfluxDB data retention] ideas and
from the {ref}`Airflow-based data retention tasks for CrateDB <airflow-data-retention-policy>`,
but aims to be usable as a standalone system in different software environments.
Effectively, it is a Python library and CLI around a policy management
table defined per [retention-policy-ddl.sql].
:::
Member Author


Nice content, but it obstructs the conciseness of this page. Please relocate to a separate one and reference it within the "Related sections" section.


Labels

cross linking (Linking to different locations of the documentation.), guidance (Matters of layout, shape, and structure.), refurbish (A more significant refurbishment.)


3 participants