
Conversation

@aminghadersohi
Contributor

SUMMARY

This PR moves datetime format detection from query-time to dataset configuration time, addressing performance concerns and log noise issues raised in PR #35042.

What changed:

  • Added datetime_format column to the table_columns database table to persist detected formats
  • Created DatetimeFormatDetector service class that samples column data and detects datetime formats during dataset creation/refresh
  • Updated RefreshDatasetCommand to automatically detect and store formats when datasets are synced
  • Modified normalize_dttm_col() utility to accept pre-detected format mappings, eliminating runtime detection
  • Added REST API endpoint POST /api/v1/dataset/<pk>/detect_datetime_formats for manual format detection
  • Included configuration options: DATASET_AUTO_DETECT_DATETIME_FORMATS and DATETIME_FORMAT_DETECTION_SAMPLE_SIZE

Why this matters:

  • Performance: Eliminates repeated format detection on every query execution
  • Reliability: Provides consistent datetime parsing across all queries for the same column
  • Maintainability: Reduces "Could not infer format" warnings that spam application logs
  • User Control: Allows manual format override and visibility into detected formats
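Concretely, the query-time path becomes a dictionary lookup instead of per-query inference. A simplified stand-in for the updated `normalize_dttm_col()` (the `{column_name: format}` mapping shape comes from the summary above; the function signature here is assumed):

```python
from typing import Dict, Optional

import pandas as pd


def normalize_dttm(
    df: pd.DataFrame,
    col: str,
    format_map: Optional[Dict[str, str]] = None,
) -> pd.DataFrame:
    fmt = (format_map or {}).get(col)
    if fmt:
        # Fast path: format was detected once at dataset-refresh time.
        df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
    else:
        # Backward-compatible fallback: infer per query, as before.
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```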

Technical approach:

  • Detection happens once during dataset metadata refresh (existing workflow)
  • Falls back to runtime detection if no format is stored (backward compatible)
  • Uses SQLAlchemy's quoted_name for safe SQL identifier handling
  • Comprehensive test coverage with 7 unit tests
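On the identifier-handling point: the PR uses SQLAlchemy's `quoted_name` when building its sampling SQL. An equivalent Core-level sketch of the same idea (the helper name and table/column names below are illustrative):

```python
from typing import Optional

import sqlalchemy as sa
from sqlalchemy.dialects import postgresql


def build_sample_query(
    table: str, column: str, schema: Optional[str] = None, limit: int = 1000
):
    """Select non-NULL sample values with safely quoted identifiers."""
    tbl = sa.table(table, sa.column(column), schema=schema)
    col = tbl.c[column]
    return sa.select(col).where(col.is_not(None)).limit(limit)


# Compiling against a dialect applies that database's quoting rules, so
# odd identifiers like "date col" never produce injectable SQL.
query = build_sample_query("test_dates", "date col")
sql = str(query.compile(dialect=postgresql.dialect()))
```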

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A - Backend-only change

TESTING INSTRUCTIONS

  1. Set up test database with datetime columns:

    CREATE TABLE test_dates (
      id INT,
      date_iso VARCHAR(20),     -- stores "2023-01-15"
      date_us VARCHAR(20),      -- stores "01/15/2023"
      timestamp_col VARCHAR(30) -- stores "2023-01-15 14:30:00"
    );
    INSERT INTO test_dates VALUES
      (1, '2023-01-15', '01/15/2023', '2023-01-15 14:30:00'),
      (2, '2023-02-20', '02/20/2023', '2023-02-20 16:45:00');
  2. Add dataset to Superset:

    • Navigate to Data → Datasets → + Dataset
    • Select the database and test_dates table
    • Click "Create Dataset and Create Chart"
  3. Verify format detection:

    • Check application logs - should see "Detected format" messages
    • Query the database to verify formats were stored:
      SELECT column_name, datetime_format, is_dttm 
      FROM table_columns 
      WHERE table_id = (SELECT id FROM tables WHERE table_name = 'test_dates');
  4. Test manual detection API:

    curl -X POST "http://localhost:8088/api/v1/dataset/<dataset_id>/detect_datetime_formats?force=true" \
      -H "Authorization: Bearer <token>"
  5. Verify query performance:

    • Run a query using the datetime columns
    • Check logs - should NOT see format detection warnings during query execution
    • Verify datetime columns are parsed correctly in query results
  6. Run the unit tests:

    pytest tests/unit_tests/datasets/test_datetime_format_detector.py -v

ADDITIONAL INFORMATION

  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided:
      • Migration adds a nullable column - near-instant on all databases
      • No data migration required
      • Zero downtime - existing queries continue to work
  • Introduces new feature or API
    • New API endpoint: POST /api/v1/dataset/<pk>/detect_datetime_formats
    • New config options: DATASET_AUTO_DETECT_DATETIME_FORMATS, DATETIME_FORMAT_DETECTION_SAMPLE_SIZE
  • Removes existing feature or API
  • Changes UI
  • Has associated issue
  • Required feature flags

Related work:

@github-actions github-actions bot added risk:db-migration PRs that require a DB migration api Related to the REST API labels Nov 18, 2025

bito-code-review bot commented Nov 18, 2025

Code Review Agent Run #ad70d8

Actionable Suggestions - 0
Additional Suggestions - 20
  • superset/datasets/api.py - 4
    • Broad exception catching without specificity · Line 770-770
      Catching generic `Exception` is too broad. Consider catching specific exception types like `SupersetSecurityException` or other relevant exceptions for better error handling.
      Code suggestion
       @@ -769,2 +769,2 @@
      -            try:
      -                security_manager.raise_for_ownership(dataset)
      -            except Exception:  # pylint: disable=broad-except
      -                return self.response_403()
      +            try:
      +                security_manager.raise_for_ownership(dataset)
      +            except SupersetSecurityException:
      +                return self.response_403()
    • Unused lambda arguments in event logger · Line 713-713
      Lambda function has unused arguments `args` and `kwargs`. Consider using `*_` and `**_` to indicate intentionally unused parameters or remove them if not needed.
      Code suggestion
       @@ -712,4 +712,4 @@
      -    @event_logger.log_this_with_context(
      -        action=lambda self, *args, **kwargs: f"{self.__class__.__name__}"
      -        ".detect_datetime_formats",
      -        log_to_statsd=False,
      +    @event_logger.log_this_with_context(
      +        action=lambda self, *_, **__: f"{self.__class__.__name__}"
      +        ".detect_datetime_formats",
      +        log_to_statsd=False,
    • Missing blank line after docstring summary · Line 718-718
      Docstring is missing a blank line between the summary line and description. Add a blank line after the summary for proper docstring formatting.
      Code suggestion
       @@ -717,2 +717,3 @@
      -    def detect_datetime_formats(self, pk: int) -> Response:
      -        """Detect and store datetime formats for dataset columns.
      +    def detect_datetime_formats(self, pk: int) -> Response:
      +        """Detect and store datetime formats for dataset columns.
      +        
      ---
    • Redundant exception in logging exception call · Line 786-786
      The `logging.exception` call includes redundant exception object `ex`. The `logging.exception` method automatically includes exception details, so passing `str(ex)` is unnecessary.
      Code suggestion
       @@ -783,5 +783,4 @@
      -        except Exception as ex:
      -            logger.exception(
      -                "Error detecting datetime formats for dataset %s: %s",
      -                pk,
      -                str(ex),
      -            )
      +        except Exception as ex:
      +            logger.exception(
      +                "Error detecting datetime formats for dataset %s",
      +                pk,
      +            )
  • superset/datasets/datetime_format_detector.py - 8
    • Use keyword only boolean parameter · Line 141-141
      Boolean parameter `force` should be keyword-only to improve API clarity and prevent accidental positional usage.
      Code suggestion
       @@ -138,3 +138,3 @@
            def detect_all_formats(
                self,
                dataset: SqlaTable,
      -        force: bool = False,
      +        *,
      +        force: bool = False,
            ) -> dict[str, str | None]:
    • Move imports to type checking block · Line 27-27
      Move `SqlaTable` and `TableColumn` imports into the `TYPE_CHECKING` block since they're only used for type annotations. This follows the typing-only-first-party-import rule.
      Code suggestion
       @@ -24,9 +24,6 @@
        from flask import current_app
       
        from superset import db
      -from superset.connectors.sqla.models import SqlaTable, TableColumn
        from superset.utils.pandas import detect_datetime_format
       
        if TYPE_CHECKING:
      +    from superset.connectors.sqla.models import SqlaTable, TableColumn
            from superset.models.core import Database
    • Fix docstring summary line placement · Line 37-37
      Multi-line docstring summary should start at the first line. There are 4 similar docstring issues in this file (lines 37, 45, 59, 143).
      Code suggestion
       @@ -36,4 +36,3 @@
        class DatetimeFormatDetector:
      -    """
      -    Service for detecting and storing datetime formats in dataset columns.
      +    """Service for detecting and storing datetime formats in dataset columns.
       
            This service samples data from datetime columns to detect their format,
    • Add missing trailing comma · Line 51-51
      Missing trailing comma after `"DATETIME_FORMAT_DETECTION_SAMPLE_SIZE", 1000`.
      Code suggestion
       @@ -50,2 +50,2 @@
                self.sample_size = sample_size or current_app.config.get(
      -            "DATETIME_FORMAT_DETECTION_SAMPLE_SIZE", 1000
      +            "DATETIME_FORMAT_DETECTION_SAMPLE_SIZE", 1000,
                )
    • Remove unused noqa directives · Line 92-92
      Unused `noqa` directives for `S608` on lines 92 and 94. These can be removed since the warnings are no longer triggered.
      Code suggestion
       @@ -91,4 +91,4 @@
                    # Build SQL query string with quoted identifiers
      -            # S608: false positive - using SQLAlchemy quoted_name
      -            sql = (  # noqa: S608
      -                f"SELECT {column_name_quoted} FROM {full_table} "  # noqa: S608
      -                f"WHERE {column_name_quoted} IS NOT NULL "
      +            sql = (
      +                f"SELECT {column_name_quoted} FROM {full_table} "
      +                f"WHERE {column_name_quoted} IS NOT NULL "
    • Use descriptive DataFrame variable name · Line 99-99
      Avoid using the generic variable name `df` for DataFrames. Use a more descriptive name like `sample_data` or `column_data`.
      Code suggestion
       @@ -98,3 +98,3 @@
                    # Execute query and get results
      -            df = database.get_df(sql, dataset.schema)
      +            sample_data = database.get_df(sql, dataset.schema)
       
      -            if df.empty or column.column_name not in df.columns:
      +            if sample_data.empty or column.column_name not in sample_data.columns:
                        logger.warning(
       @@ -108,2 +108,2 @@
       
                    # Detect format using existing utility
      -            series = df[column.column_name]
      +            series = sample_data[column.column_name]
    • Move return to else block · Line 127-127
      Consider moving the `return detected_format` statement to an `else` block to improve code structure and clarity.
      Code suggestion
       @@ -125,8 +125,8 @@
                        )
       
      -            return detected_format
      -
                except Exception as ex:
                    logger.exception(
                        "Error detecting format for column %s.%s: %s",
                        dataset.table_name,
                        column.column_name,
                        str(ex),
                    )
                    return None
      +        else:
      +            return detected_format
    • Remove redundant exception from logging · Line 134-134
      Remove redundant exception object `str(ex)` from `logging.exception` call since the exception details are automatically included.
      Code suggestion
       @@ -129,7 +129,6 @@
                except Exception as ex:
                    logger.exception(
                        "Error detecting format for column %s.%s: %s",
                        dataset.table_name,
                        column.column_name,
      -                str(ex),
                    )
  • superset/migrations/versions/2025-11-17_19-03_a9c01ec10479_add_datetime_format_to_table_columns.py - 4
    • Missing return type annotation for upgrade · Line 39-39
      The `upgrade` function is missing a return type annotation. The `downgrade` function has the same issue on line 56.
      Code suggestion
       @@ -39,1 +39,1 @@
      -def upgrade():
      +def upgrade() -> None:
    • Missing return type annotation for downgrade · Line 56-56
      The `downgrade` function is missing a return type annotation, similar to the `upgrade` function.
      Code suggestion
       @@ -56,1 +56,1 @@
      -def downgrade():
      +def downgrade() -> None:
    • Module docstring missing ending period · Line 17-17
      The module docstring should end with a period for consistency with documentation standards.
      Code suggestion
       @@ -17,1 +17,1 @@
      -"""add_datetime_format_to_table_columns
      +"""add_datetime_format_to_table_columns.
    • Missing trailing comma in logger info · Line 46-46
      Trailing commas are missing in multiple logger.info calls (lines 46, 63) and the Column constructor (line 52).
      Code suggestion
       @@ -45,2 +45,2 @@
      -        logger.info(
      -            "Column table_columns.datetime_format already exists. Skipping migration."
      +        logger.info(
      +            "Column table_columns.datetime_format already exists. Skipping migration.",
  • superset/commands/dataset/refresh.py - 2
    • Add missing trailing comma · Line 57-57
      Add trailing comma after `self._model.table_name` for consistency with project style guidelines.
      Code suggestion
       @@ -56,2 +56,2 @@
      -                logger.info(
      -                    "Detected datetime formats for dataset %s", self._model.table_name
      +                logger.info(
      +                    "Detected datetime formats for dataset %s", self._model.table_name,
    • Remove redundant exception in logging call · Line 63-63
      Remove `str(ex)` from `logging.exception` call as the exception details are automatically included by the `exception` method.
      Code suggestion
       @@ -60,5 +60,4 @@
      -                logger.exception(
      -                    "Failed to detect datetime formats for dataset %s: %s",
      -                    self._model.table_name,
      -                    str(ex),
      -                )
      +                logger.exception(
      +                    "Failed to detect datetime formats for dataset %s",
      +                    self._model.table_name,
      +                )
  • tests/unit_tests/datasets/test_datetime_format_detector.py - 2
    • Missing trailing comma in function parameters · Line 48-48
      Missing trailing comma after `mock_column: MagicMock` parameter. This issue occurs in 4 other function definitions (lines 67, 80, 93). Adding trailing commas improves code consistency and reduces diff noise.
      Code suggestion
       @@ -47,2 +47,2 @@
      -def test_detect_column_format_success(
      -    mock_dataset: MagicMock, mock_column: MagicMock
      +def test_detect_column_format_success(
      +    mock_dataset: MagicMock, mock_column: MagicMock,
    • Missing trailing comma in dictionary literal · Line 53-53
      Missing trailing comma after `"date_column": ["2023-01-01", "2023-01-02", "2023-01-03"]` in dictionary literal. Adding trailing comma improves code consistency.
      Code suggestion
       @@ -52,2 +52,2 @@
      -    sample_data = pd.DataFrame(
      -        {"date_column": ["2023-01-01", "2023-01-02", "2023-01-03"]}
      +    sample_data = pd.DataFrame(
      +        {"date_column": ["2023-01-01", "2023-01-02", "2023-01-03"]},
Review Details
  • Files reviewed - 9 · Commit Range: e468b6f..e468b6f
    • superset/commands/dataset/refresh.py
    • superset/config.py
    • superset/connectors/sqla/models.py
    • superset/datasets/api.py
    • superset/datasets/datetime_format_detector.py
    • superset/datasets/schemas.py
    • superset/migrations/versions/2025-11-17_19-03_a9c01ec10479_add_datetime_format_to_table_columns.py
    • superset/utils/core.py
    • tests/unit_tests/datasets/test_datetime_format_detector.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


@dosubot dosubot bot added change:backend Requires changing the backend data:dataset Related to dataset configurations labels Nov 18, 2025

@korbit-ai korbit-ai bot left a comment


I've completed my review and didn't find any issues.

Files scanned
  • superset/commands/dataset/refresh.py
  • superset/migrations/versions/2025-11-17_19-03_a9c01ec10479_add_datetime_format_to_table_columns.py
  • superset/datasets/datetime_format_detector.py
  • superset/datasets/schemas.py
  • superset/datasets/api.py
  • superset/utils/core.py
  • superset/connectors/sqla/models.py
  • superset/config.py


@aminghadersohi aminghadersohi marked this pull request as draft November 18, 2025 07:16

codecov bot commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 34.86239% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.99%. Comparing base (76d897e) to head (abd72ca).
⚠️ Report is 2965 commits behind head on master.

Files with missing lines                        Patch %   Lines
superset/datasets/datetime_format_detector.py   28.12%    44 Missing and 2 partials ⚠️
superset/datasets/api.py                        27.27%    16 Missing ⚠️
superset/commands/dataset/refresh.py            66.66%    2 Missing and 1 partial ⚠️
superset/models/helpers.py                      40.00%    1 Missing and 2 partials ⚠️
superset/utils/core.py                          0.00%     1 Missing and 1 partial ⚠️
superset/connectors/sqla/models.py              75.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #36150      +/-   ##
==========================================
+ Coverage   60.48%   67.99%   +7.50%     
==========================================
  Files        1931      636    -1295     
  Lines       76236    46817   -29419     
  Branches     8568     5081    -3487     
==========================================
- Hits        46114    31831   -14283     
+ Misses      28017    13710   -14307     
+ Partials     2105     1276     -829     
Flag        Coverage Δ
hive        43.75% <24.77%> (-5.41%) ⬇️
javascript  ?
mysql       67.09% <34.86%> (?)
postgres    67.14% <34.86%> (?)
presto      47.35% <26.60%> (-6.46%) ⬇️
python      67.95% <34.86%> (+4.45%) ⬆️
sqlite      66.76% <34.86%> (?)
unit        100.00% <ø> (+42.36%) ⬆️

Flags with carried forward coverage won't be shown.


@aminghadersohi aminghadersohi force-pushed the feat/dataset-datetime-format-detection branch from e1e54c9 to a86cdfe Compare November 18, 2025 11:14
@aminghadersohi aminghadersohi marked this pull request as ready for review November 18, 2025 11:19

@korbit-ai korbit-ai bot left a comment


I've completed my review and didn't find any issues.


aminghadersohi and others added 3 commits November 25, 2025 21:11
Move datetime format detection from query-time to dataset
configuration time to improve performance and reduce log noise.

**Changes:**
- Add `datetime_format` column to `table_columns` table
- Create `DatetimeFormatDetector` service to detect and store formats
- Update `RefreshDatasetCommand` to auto-detect formats on dataset sync
- Modify `normalize_dttm_col` to accept pre-detected format mapping
- Add API endpoint for manual format detection
- Add configuration options for feature enablement
- Include comprehensive unit tests

**Benefits:**
- Eliminates repeated format detection on every query
- Reduces "Could not infer format" warnings in logs
- Provides consistent datetime parsing across queries
- Allows manual format override via API

Addresses performance concerns raised in PR apache#35042
- Fix SQL LIMIT portability by using database.apply_limit_to_sql() instead of hard-coded LIMIT syntax (supports SQL Server, Oracle, etc.)
- Add virtual dataset and expression column handling to skip unsupported cases
- Fix test mocking to include engine context manager and apply_limit_to_sql
- Add test coverage for virtual datasets and expression columns

Note: format_map integration will be added to superset/models/helpers.py after rebase
Build format_map from detected datetime formats stored in dataset columns
and pass it to normalize_dttm_col() so the detected formats are actually
used during query execution, eliminating the need for runtime detection.
@mistercrunch mistercrunch force-pushed the feat/dataset-datetime-format-detection branch from 9c477a4 to abd72ca Compare November 25, 2025 21:11
Member

@mistercrunch mistercrunch left a comment


LGTM!

@mistercrunch mistercrunch merged commit 06a8f4d into apache:master Nov 25, 2025
61 checks passed
kshi020302 pushed a commit to jl141/superset that referenced this pull request Nov 30, 2025
sadpandajoe pushed a commit that referenced this pull request Dec 4, 2025

Labels

api Related to the REST API change:backend Requires changing the backend data:dataset Related to dataset configurations risk:db-migration PRs that require a DB migration size/XL
