chore: Improve CLI startup with lazy heavy import cleanup#330

Merged
johnnygreco merged 17 commits into main from johnny/chore/improve-cli-startup-time
Feb 18, 2026

@johnnygreco
Contributor

@johnnygreco johnnygreco commented Feb 16, 2026

Summary

  • Standardize heavy third-party imports around import data_designer.lazy_heavy_imports as lazy across config/engine/interface modules.
  • Clean up TYPE_CHECKING imports in non-__init__.py files so they only include symbols actually used for type hints.
  • Preserve TYPE_CHECKING export blocks in package __init__.py facades for IDE UX and lazy-export behavior.
  • Update test patch targets to the lazy namespace where needed (for example, managed_dataset_repository.lazy.duckdb).
  • Keep direct pandas import in the Pydantic model ColumnConfigWithDataFrame to ensure runtime type resolution.
  • Clarify import performance test wording to "average pure import time" and add lazy import coverage via packages/data-designer/tests/test_lazy_imports.py.
  • Add a benchmark to measure CLI cold and warm startup times.
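
The lazy namespace pattern summarized above can be illustrated with a minimal sketch. This is not the contents of `lazy_heavy_imports.py`; it assumes a hypothetical alias table in which cheap standard-library modules stand in for numpy, pandas, and duckdb, and it builds the namespace as an in-memory module so the example is self-contained. The `import_calls` list exists only to make the deferral visible.

```python
import importlib
import types

# Hypothetical alias table; cheap stdlib modules stand in for numpy, pandas, duckdb.
_LAZY_MODULES = {"js": "json", "stats": "statistics"}

import_calls: list[str] = []  # records every fallback import, for illustration only


def build_lazy_namespace(name: str) -> types.ModuleType:
    """Return a module object whose attributes are imported on first access."""
    namespace = types.ModuleType(name)

    def _module_getattr(attr: str):
        if attr not in _LAZY_MODULES:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        target = _LAZY_MODULES[attr]
        import_calls.append(target)
        module = importlib.import_module(target)
        # Cache in the module dict so later lookups never reach __getattr__.
        namespace.__dict__[attr] = module
        return module

    # PEP 562: a module-level __getattr__ is consulted only when normal lookup fails.
    namespace.__getattr__ = _module_getattr
    return namespace


lazy = build_lazy_namespace("lazy_sketch")
```

Accessing `lazy.js` triggers one import and caches the result, so a second access is an ordinary attribute lookup. A `from ... import js` against such a namespace would resolve the attribute immediately, which is presumably why the PR standardizes on the `import ... as lazy` form.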

Validation

  • uv run ruff check --select F401
  • make test-engine

Benchmark Reference

before changes in this branch:

     ======================================================================
     CLI Startup Benchmark Results
     ======================================================================
       Python:    3.11.9
       Platform:  Darwin (arm64)
       Git:       15cbc9bc (main)
       Venv setup: 9.6s
       Warm runs: 10

       import_only
         Cold:  1.665s
         Warm:  0.731s mean, 0.604s median, 0.400s stdev [0.581s - 1.867s]

       cli_help
         Cold:  19.032s
         Warm:  0.649s mean, 0.637s median, 0.034s stdev [0.610s - 0.711s]

       config_list
         Cold:  18.391s
         Warm:  0.715s mean, 0.621s median, 0.303s stdev [0.594s - 1.576s]

       compilation_overhead
         Without precompile:  19.032s
         With precompile:     15.477s
         Overhead:            3.555s

after changes in this branch:

======================================================================
CLI Startup Benchmark Results
======================================================================
  Python:    3.11.9
  Platform:  Darwin (arm64)
  Git:       0b6edb28 (johnny/chore/improve-cli-startup-time)
  Venv setup: 8.0s
  Warm runs: 10

  import_only
    Cold:  0.087s
    Warm:  0.043s mean, 0.041s median, 0.009s stdev [0.038s - 0.070s]

  cli_help
    Cold:  0.985s
    Warm:  0.146s mean, 0.081s median, 0.147s stdev [0.079s - 0.537s]

  config_list
    Cold:  3.458s
    Warm:  0.653s mean, 0.293s median, 0.864s stdev [0.283s - 2.950s]

  compilation_overhead
    Without precompile:  0.985s
    With precompile:     0.551s
    Overhead:            0.434s

======================================================================
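
For readers who want to reproduce the warm-run statistics in spirit, here is a minimal sketch of timing a command in fresh interpreter processes. It is not `benchmark_cli_startup.py`; the function name `time_command` is hypothetical, and a true cold measurement (fresh venv, no bytecode cache) is outside its scope.

```python
import statistics
import subprocess
import sys
import time


def time_command(args: list[str], runs: int = 5) -> dict[str, float]:
    """Time repeated runs of a command, each in a new process."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(args, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations) if runs > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in target: importing json in a new interpreter. A CLI benchmark
# would time a command such as `data-designer --help` instead.
stats = time_command([sys.executable, "-c", "import json"], runs=3)
```

Reporting the median alongside the mean matters here: a single slow outlier, like the 2.950s run in the `config_list` results, pulls the mean well above the typical run.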

@johnnygreco johnnygreco requested a review from a team as a code owner February 16, 2026 20:39
@greptile-apps
Contributor

greptile-apps bot commented Feb 16, 2026

Greptile Summary

This PR dramatically improves CLI startup performance by implementing comprehensive lazy loading for heavy third-party dependencies and CLI commands. The changes standardize on the `import data_designer.lazy_heavy_imports as lazy` pattern across config/engine/interface modules and introduce a new LazyTyperGroup that defers command module loading until invocation.

Key improvements:

  • CLI cold start reduced from 19s to 1s (~19x faster)
  • CLI warm start reduced from 0.65s to 0.15s (~4.3x faster)
  • Pure import time reduced from 1.7s to 0.09s (~19x faster)
  • Added comprehensive CLI startup benchmarks and lazy import test coverage
  • Cleaned up TYPE_CHECKING imports to only include symbols actually used for type hints
  • Updated test mock targets to match the lazy namespace pattern
  • Direct pandas import in seed_source_dataframe.py is appropriate for Pydantic model type resolution

The refactoring maintains clean architecture by separating lazy loading concerns while preserving type safety through TYPE_CHECKING blocks where needed.

Confidence Score: 5/5

  • Safe to merge - well-tested performance optimization with comprehensive validation
  • The PR demonstrates excellent engineering: dramatic performance improvements (19x faster cold start), comprehensive test coverage including lazy import verification, proper handling of unavoidable eager imports in Pydantic models, consistent pattern application across 87 files, and thorough benchmarking infrastructure. All test mock targets correctly updated to match refactored code.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/lazy_heavy_imports.py | Enhanced lazy imports facade with improved documentation explaining usage pattern and avoiding eager imports |
| packages/data-designer/src/data_designer/cli/main.py | Refactored to use lazy command loading via create_lazy_typer_group, removing eager imports of all commands |
| packages/data-designer/src/data_designer/cli/lazy_group.py | New lazy loading mechanism for CLI commands that defers module imports until command invocation |
| packages/data-designer-config/src/data_designer/config/models.py | Converted from eager import pattern to lazy.np access, removed TYPE_CHECKING import of np |
| packages/data-designer-engine/src/data_designer/engine/analysis/utils/column_statistics_calculations.py | Converted numpy/pandas imports to lazy pattern and added lru_cache to defer tokenizer initialization |
| scripts/benchmarks/benchmark_cli_startup.py | New comprehensive CLI startup benchmark measuring cold/warm times, compilation overhead, and import traces |

Flowchart

flowchart TD
    A[CLI Entry Point] -->|Uses| B[LazyTyperGroup]
    B -->|Defers loading| C[Command Modules]
    C -->|Import on demand| D[Heavy Dependencies]
    
    E[Runtime Code] -->|Uses| F[lazy_heavy_imports]
    F -->|__getattr__| G[importlib.import_module]
    G -->|First access only| H[numpy/pandas/duckdb/etc]
    F -->|Cache in globals| I[Subsequent accesses]
    
    J[Pydantic Models] -->|Direct import| H
    
    K[TYPE_CHECKING] -->|Type hints only| H
    
    style B fill:#90EE90
    style F fill:#90EE90
    style G fill:#FFD700
    style J fill:#FFA07A
    style H fill:#87CEEB

Last reviewed commit: 4f711f5

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import-time from ~1.67s to ~0.46s.
- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
  with module-level __getattr__ (for backwards-compatible external
  access / test mocks) and function-level imports in the 3 functions
  that actually use them (read_parquet_dataset, smart_load_dataframe,
  _convert_to_serializable). Importing io_helpers no longer triggers
  pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
  bodies to avoid loading repositories, Rich, and prompt_toolkit at
  module import time.
- Add `config_list` (data-designer config list) measurement to the
  CLI startup benchmark with isolated cold measurement in a separate
  venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.
@johnnygreco johnnygreco force-pushed the johnny/chore/improve-cli-startup-time branch from 0b6edb2 to 1a95f75 Compare February 16, 2026 20:49
@johnnygreco johnnygreco changed the title Improve CLI startup with lazy heavy import cleanup chore: Improve CLI startup with lazy heavy import cleanup Feb 16, 2026
Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.
  ) from None

- case litellm.exceptions.UnprocessableEntityError():
+ case lazy.litellm.exceptions.UnprocessableEntityError():
Contributor

@nabinchha nabinchha Feb 17, 2026


I'd double check this works for litellm. Last time I was working on this all tests and local/CI runs were passing, but later failing when pulling from a published package.

Contributor Author


ah, thanks for reminding me!

@nabinchha
Contributor

@jogreco — here's a detailed review of this PR with suggestions.

What the PR Does Well

  • lazy_heavy_imports.py with __getattr__ + globals() caching is correct and effective
  • Lazy __init__.py facades for config/ (92 exports) and interface/ (4 exports) work well
  • Moving configure_logging() and resolve_seed_default_model_settings() out of module-level into _initialize_interface_runtime() is the right call
  • test_lazy_imports.py with the guardrail enforcing no from lazy_heavy_imports import X in source is great
  • CLI startup benchmark tooling is a nice addition
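
The guardrail mentioned above can be sketched with the standard library's `ast` module. This is not the code in `test_lazy_imports.py`; the helper name `find_forbidden_imports` and the sample strings are hypothetical, and a real test would presumably walk the package's source files rather than inline strings.

```python
import ast


def find_forbidden_imports(source: str, module_suffix: str = "lazy_heavy_imports") -> list[int]:
    """Return line numbers of `from ...lazy_heavy_imports import X` statements."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Only `from <module> import ...` nodes carry a module name to inspect.
        if isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[-1] == module_suffix:
                hits.append(node.lineno)
    return sorted(hits)


# The source is parsed, never executed, so nothing here gets imported.
ALLOWED = "import data_designer.lazy_heavy_imports as lazy\nx = lazy.np\n"
FORBIDDEN = "from data_designer.lazy_heavy_imports import pd, np\n"
```

The `from ... import pd` form is worth forbidding because it resolves the attribute at import time, which defeats the deferral that the lazy namespace provides.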

Core Problem: Anti-Pattern in the Fix

The PR defers heavy imports by putting them inside function bodies throughout the CLI layer (create.py, reset.py, list.py, generation_controller.py). This violates the project's own guideline: "Avoid importing Python modules inside method definitions."

This is a symptom, not a solution. The root cause is that main.py eagerly loads all 10 command modules at startup.


Suggestion 1: Lazy Command Loading in main.py

Instead of making every command module defer its own imports, make main.py load command modules only when their command is invoked. Typer wraps Click, which supports this via a custom Group:

# cli/lazy_group.py
import importlib
import click

class LazyGroup(click.Group):
    """Click group that defers loading command modules until invocation."""

    def __init__(self, *args, lazy_subcommands=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._lazy_subcommands = lazy_subcommands or {}

    def list_commands(self, ctx):
        base = super().list_commands(ctx)
        return base + sorted(self._lazy_subcommands.keys())

    def get_command(self, ctx, cmd_name):
        if cmd_name in self._lazy_subcommands:
            module_path, func_name = self._lazy_subcommands[cmd_name]
            module = importlib.import_module(module_path)
            return getattr(module, func_name)
        return super().get_command(ctx, cmd_name)

Then main.py registers commands by name without importing their modules:

app = typer.Typer(cls=LazyGroup, lazy_subcommands={
    "preview": ("data_designer.cli.commands.preview", "preview_command"),
    "create":  ("data_designer.cli.commands.create",  "create_command"),
    ...
})

Result: Command modules can use normal module-level imports — they only execute when the command is actually invoked, not for --help. No function-level import hacks needed.
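
Independent of Click or Typer, the deferral idea in this suggestion reduces to a name-to-location registry that is resolved only on demand. The sketch below uses the standard library alone so it runs anywhere; the registry contents are hypothetical, with stdlib callables standing in for command functions.

```python
import importlib
from typing import Callable

# Hypothetical registry: command name -> (module path, attribute name).
# Stdlib callables stand in for real command functions.
LAZY_COMMANDS = {
    "dedent": ("textwrap", "dedent"),
    "quote": ("shlex", "quote"),
}

resolved: list[str] = []  # records which commands triggered a lookup


def list_commands() -> list[str]:
    """Names are known up front, so listing them needs no imports."""
    return sorted(LAZY_COMMANDS)


def get_command(name: str) -> Callable[..., object]:
    """Import the owning module only when the command is requested."""
    module_path, attr = LAZY_COMMANDS[name]
    resolved.append(name)
    return getattr(importlib.import_module(module_path), attr)
```

One caveat when mapping this onto Click: a group's `get_command` is expected to return a `click.Command` object, so the attribute that the registry points at would need to be a command, or be wrapped in one, rather than a plain function.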


Suggestion 2: Split seed_source.py to Break the pandas Chain

The single biggest transitive contamination:

DataDesignerConfigBuilder
  → config_builder.py imports DataFrameSeedSource
    → seed_source.py line 24: pd = lazy.pd  ← triggers pandas+numpy for ENTIRE module

seed_source.py contains LocalFileSeedSource, HuggingFaceSeedSource, DataFrameSeedSource, and DatabricksVolumeSeedSource. Only DataFrameSeedSource needs pandas.

Split into:

  • seed_source.py → base class + LocalFile, HuggingFace, DatabricksVolume (no pandas)
  • seed_source_dataframe.py → DataFrameSeedSource only (pd = lazy.pd stays here)

Then config_builder.py can import DataFrameSeedSource from the isolated file — only that file pays the pandas cost.


Suggestion 3: Fix Remaining Eager-Import Offenders

| File | Problem | Fix |
| --- | --- | --- |
| configurable_task.py:15 | `TypeVar("DataT", dict, lazy.pd.DataFrame)` evaluates immediately | Use unconstrained `TypeVar("DataT")` at runtime; put constrained version in TYPE_CHECKING |
| sampling_gen/base.py:9,20 | `from numpy.typing import NDArray` + `TypeAlias = int \| lazy.np.random.RandomState` | Move both to TYPE_CHECKING (`from __future__ import annotations` already present) |
| sampling_gen/people_gen.py:30 | `TypeAlias = lazy.faker.Faker \| ...` | Move to TYPE_CHECKING |
| phone_number.py:13 | `ZIP_AREA_CODE_DATA = lazy.pd.read_parquet(...)` performs I/O + pandas at import time | Wrap in `@functools.lru_cache` function |
| gsonschema/validators.py:17 | `DEFAULT_JSONSCHEMA_VALIDATOR = lazy.jsonschema.Draft202012Validator` | Wrap in `@functools.lru_cache` function |
| gsonschema/exceptions.py:9 | `class JSONSchemaValidationError(lazy.jsonschema.ValidationError)` | Unavoidable; isolate into its own file so importing other gsonschema utilities doesn't trigger jsonschema |

Suggestion 4: Minor Cleanup

  • list.py:9-13: Dead TYPE_CHECKING block — imports repository types but they're never used in type annotations, only at runtime inside function bodies. Remove or move to runtime imports.
  • engine/models/errors.py:128-196: match/case with lazy.litellm.exceptions.APIError() works at runtime but is fragile — if litellm changes its exception hierarchy, the match silently falls through. Consider adding a test to verify match arms resolve correctly, or use explicit isinstance checks.

Priority Order

  1. Lazy command loading in main.py — eliminates the anti-pattern entirely, biggest architectural win
  2. Split seed_source.py — breaks the pandas chain for DataDesignerConfigBuilder
  3. Fix TypeVar/TypeAlias eager evaluations — straightforward TYPE_CHECKING moves
  4. Wrap module-level constants in lru_cache: phone_number.py, validators.py
  5. Isolate unavoidable eager imports: exceptions.py class inheritance

@nabinchha
Contributor

May be an alternative for suggestion 2:
replace isinstance(seed_config.source, DataFrameSeedSource) check with seed_config.source.seed_type == "df" to remove the need to import DataFrameSeedSource in this module.
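
A minimal sketch of this alternative, assuming each source class carries a `seed_type` tag as the comment implies. The two dataclasses and the `"local"` tag value are hypothetical; only `seed_type == "df"` comes from the comment above.

```python
from dataclasses import dataclass


@dataclass
class LocalFileSource:
    path: str
    seed_type: str = "local"  # hypothetical tag value


@dataclass
class DataFrameSource:
    df: object  # a DataFrame in practice; kept generic here
    seed_type: str = "df"


def is_dataframe_source(source: object) -> bool:
    """Check the tag instead of isinstance, so no class import is needed."""
    return getattr(source, "seed_type", None) == "df"
```

In a real layout the checking function would live in a module that never imports the DataFrame-backed class, which is the point of the suggestion. The trade-off is that a string comparison is not verified by a type checker the way an `isinstance` check is.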

@mikeknep
Contributor

RE: the suggestion to split seed_source.py and extract the DF one into its own module—I like the idea in theory, but in practice wouldn't that kind of be undone by the top-level data_designer.config.__init__.py file importing it?

If so, would it make sense to do some sort of getattr trick in that top-level config module to defer import/eval of the DataFrameSeedSource object?

@nabinchha
Contributor

nabinchha commented Feb 17, 2026

but in practice wouldn't that kind of be undone by the top-level data_designer.config.__init__.py file importing it?

Yes, might be a good idea to stop exporting everything out from __init__.py files if it can be avoided.

Also starting to cut down on module size might help to reduce the import blast radius. For example, models.py has lots of things that all get imported when the consumer might only want one thing.

Having a file per class/module method, etc might be overkill (though that's how I always did things in compiled languages), but there's probably a middle ground in python

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files

- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes

- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks

- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use

- Update test mock targets to patch at usage-site for module-level imports
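
The "TYPE_CHECKING blocks with runtime fallbacks" approach from the commit message above can be sketched as below. This is an illustration, not the project's code: `decimal.Decimal` stands in for a heavy type such as a pandas DataFrame, and `passthrough` is a hypothetical function.

```python
from typing import TYPE_CHECKING, TypeVar

if TYPE_CHECKING:
    # Seen only by type checkers; Decimal stands in for a heavy library type.
    from decimal import Decimal

    DataT = TypeVar("DataT", dict, Decimal)
else:
    # Runtime fallback: unconstrained, so nothing heavy is imported.
    DataT = TypeVar("DataT")


def passthrough(data: DataT) -> DataT:
    """Constraints matter only to the type checker, not at runtime."""
    return data
```

At runtime only the `else` branch executes, so the constrained definition and its import are never evaluated, while static analysis still sees the stricter constraint.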
@johnnygreco
Contributor Author

@jogreco — here's a detailed review of this PR with suggestions.

@nabinchha – had a go at implementing in 9acacdd

Contributor

@greptile-apps greptile-apps bot left a comment


87 files reviewed, 2 comments


Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.
Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.
Contributor

@nabinchha nabinchha left a comment


wdyt about updating these thresholds since there's a nice improvement on cold starts?

# Maximum allowed average import time in seconds
# Average of 1 cold start + 4 warm cache runs
# Cold starts vary 4-13s due to OS caching, system load, CPU scaling
# Warm cache consistently <3s. Average should be well under 6s.
MAX_IMPORT_TIME_SECONDS = 6.0
PERF_TEST_TIMEOUT_SECONDS = 30.0

Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.
Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.
nabinchha previously approved these changes Feb 18, 2026
Contributor

@nabinchha nabinchha left a comment


🚢

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted
Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in save
results scenarios.
@johnnygreco johnnygreco merged commit 1439bbe into main Feb 18, 2026
46 checks passed