chore: Improve CLI startup with lazy heavy import cleanup by johnnygreco · Pull Request #330 · NVIDIA-NeMo/DataDesigner

johnnygreco · 2026-02-16T20:39:16Z

Summary

Standardize heavy third-party imports around `import data_designer.lazy_heavy_imports as lazy` across config/engine/interface modules.
Clean up `TYPE_CHECKING` imports in non-`__init__.py` files so they only include symbols actually used for type hints.
Preserve `TYPE_CHECKING` export blocks in package `__init__.py` facades for IDE UX and lazy-export behavior.
Update test patch targets to the lazy namespace where needed (for example, `managed_dataset_repository.lazy.duckdb`).
Keep the direct pandas import in the Pydantic model `ColumnConfigWithDataFrame` to ensure runtime type resolution.
Clarify import performance test wording to "average pure import time" and add lazy import coverage via `packages/data-designer/tests/test_lazy_imports.py`.
Add a benchmark to measure CLI cold and warm startup times.
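
To illustrate why test patch targets move to the lazy namespace, here is a minimal sketch. The `lazy` and `repo` objects are hypothetical stand-ins built with `types.SimpleNamespace`, not the project's real modules; the point is only that code which reaches `duckdb` through `repo.lazy.duckdb` has to be patched at that same name.

```python
import types
from unittest import mock

# Hypothetical stand-in for the lazy import namespace.
lazy = types.SimpleNamespace(
    duckdb=types.SimpleNamespace(connect=lambda: "real connection")
)

# Hypothetical stand-in for a module that does `import ... as lazy`
# and looks duckdb up through that namespace at call time.
repo = types.SimpleNamespace(lazy=lazy)


def open_connection():
    # The lookup goes through `repo.lazy.duckdb`, so that is the name to patch.
    return repo.lazy.duckdb.connect()


with mock.patch.object(repo.lazy, "duckdb") as fake_duckdb:
    fake_duckdb.connect.return_value = "fake connection"
    patched_result = open_connection()

# Outside the context manager the original attribute is restored.
restored_result = open_connection()
```

With a string target, the equivalent would be `mock.patch("<module path>.lazy.duckdb")`, which matches the shape of the example target named in the list above.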

Validation

uv run ruff check --select F401
make test-engine

Benchmark Reference

before changes in this branch:

======================================================================
CLI Startup Benchmark Results
======================================================================
  Python:    3.11.9
  Platform:  Darwin (arm64)
  Git:       15cbc9bc (main)
  Venv setup: 9.6s
  Warm runs: 10

  import_only
    Cold:  1.665s
    Warm:  0.731s mean, 0.604s median, 0.400s stdev [0.581s - 1.867s]

  cli_help
    Cold:  19.032s
    Warm:  0.649s mean, 0.637s median, 0.034s stdev [0.610s - 0.711s]

  config_list
    Cold:  18.391s
    Warm:  0.715s mean, 0.621s median, 0.303s stdev [0.594s - 1.576s]

  compilation_overhead
    Without precompile:  19.032s
    With precompile:     15.477s
    Overhead:            3.555s

after changes in this branch:

======================================================================
CLI Startup Benchmark Results
======================================================================
  Python:    3.11.9
  Platform:  Darwin (arm64)
  Git:       0b6edb28 (johnny/chore/improve-cli-startup-time)
  Venv setup: 8.0s
  Warm runs: 10

  import_only
    Cold:  0.087s
    Warm:  0.043s mean, 0.041s median, 0.009s stdev [0.038s - 0.070s]

  cli_help
    Cold:  0.985s
    Warm:  0.146s mean, 0.081s median, 0.147s stdev [0.079s - 0.537s]

  config_list
    Cold:  3.458s
    Warm:  0.653s mean, 0.293s median, 0.864s stdev [0.283s - 2.950s]

  compilation_overhead
    Without precompile:  0.985s
    With precompile:     0.551s
    Overhead:            0.434s

======================================================================

greptile-apps · 2026-02-16T20:42:06Z

Greptile Summary

This PR dramatically improves CLI startup performance by implementing comprehensive lazy loading for heavy third-party dependencies and CLI commands. The changes standardize on the `import data_designer.lazy_heavy_imports as lazy` pattern across config/engine/interface modules and introduce a new LazyTyperGroup that defers command module loading until invocation.

Key improvements:

CLI cold start reduced from 19s to 1s (~19x faster)
CLI warm start reduced from 0.65s to 0.15s (~4.3x faster)
Pure import time reduced from 1.7s to 0.09s (~19x faster)
Added comprehensive CLI startup benchmarks and lazy import test coverage
Cleaned up TYPE_CHECKING imports to only include symbols actually used for type hints
Updated test mock targets to match the lazy namespace pattern
Direct pandas import in seed_source_dataframe.py is appropriate for Pydantic model type resolution
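
The headline ratios can be rechecked from the benchmark figures quoted in the PR description. The sketch below uses the cold times and warm means exactly as printed there; the dictionary keys are just labels chosen for this example.

```python
# Figures copied from the benchmark output in the PR description (seconds).
before = {"import_only_cold": 1.665, "cli_help_cold": 19.032, "cli_help_warm_mean": 0.649}
after = {"import_only_cold": 0.087, "cli_help_cold": 0.985, "cli_help_warm_mean": 0.146}

# Speedup = time before / time after.
speedups = {name: before[name] / after[name] for name in before}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster")
```

Both cold ratios land close to 19x. The warm ratio comes out slightly above the quoted ~4.3x when the unrounded means are used, since that figure corresponds to the rounded 0.65s and 0.15s.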

The refactoring maintains clean architecture by separating lazy loading concerns while preserving type safety through TYPE_CHECKING blocks where needed.

Confidence Score: 5/5

Safe to merge - well-tested performance optimization with comprehensive validation
The PR demonstrates excellent engineering: dramatic performance improvements (19x faster cold start), comprehensive test coverage including lazy import verification, proper handling of unavoidable eager imports in Pydantic models, consistent pattern application across 87 files, and thorough benchmarking infrastructure. All test mock targets correctly updated to match refactored code.
No files require special attention

Important Files Changed

Filename	Overview
packages/data-designer-config/src/data_designer/lazy_heavy_imports.py	Enhanced lazy imports facade with improved documentation explaining usage pattern and avoiding eager imports
packages/data-designer/src/data_designer/cli/main.py	Refactored to use lazy command loading via create_lazy_typer_group, removing eager imports of all commands
packages/data-designer/src/data_designer/cli/lazy_group.py	New lazy loading mechanism for CLI commands that defers module imports until command invocation
packages/data-designer-config/src/data_designer/config/models.py	Converted from eager import pattern to lazy.np access, removed TYPE_CHECKING import of np
packages/data-designer-engine/src/data_designer/engine/analysis/utils/column_statistics_calculations.py	Converted numpy/pandas imports to lazy pattern and added lru_cache to defer tokenizer initialization
scripts/benchmarks/benchmark_cli_startup.py	New comprehensive CLI startup benchmark measuring cold/warm times, compilation overhead, and import traces

Flowchart

flowchart TD
    A[CLI Entry Point] -->|Uses| B[LazyTyperGroup]
    B -->|Defers loading| C[Command Modules]
    C -->|Import on demand| D[Heavy Dependencies]
    
    E[Runtime Code] -->|Uses| F[lazy_heavy_imports]
    F -->|__getattr__| G[importlib.import_module]
    G -->|First access only| H[numpy/pandas/duckdb/etc]
    F -->|Cache in globals| I[Subsequent accesses]
    
    J[Pydantic Models] -->|Direct import| H
    
    K[TYPE_CHECKING] -->|Type hints only| H
    
    style B fill:#90EE90
    style F fill:#90EE90
    style G fill:#FFD700
    style J fill:#FFA07A
    style H fill:#87CEEB

Last reviewed commit: 4f711f5

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import-time from ~1.67s to ~0.46s.

- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time.
- Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.

Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__.

packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/base.py

nabinchha · 2026-02-17T16:04:18Z

packages/data-designer-engine/src/data_designer/engine/models/errors.py

            ) from None

-        case litellm.exceptions.UnprocessableEntityError():
+        case lazy.litellm.exceptions.UnprocessableEntityError():


I'd double check this works for litellm. Last time I worked on this, all tests and local/CI runs passed, but it later failed when pulling from a published package.

ah, thanks for reminding me!

packages/data-designer/src/data_designer/cli/commands/reset.py

nabinchha · 2026-02-17T19:24:55Z

@jogreco — here's a detailed review of this PR with suggestions.

What the PR Does Well

lazy_heavy_imports.py with __getattr__ + globals() caching is correct and effective
Lazy __init__.py facades for config/ (92 exports) and interface/ (4 exports) work well
Moving configure_logging() and resolve_seed_default_model_settings() out of module-level into _initialize_interface_runtime() is the right call
test_lazy_imports.py with the guardrail enforcing no from lazy_heavy_imports import X in source is great
CLI startup benchmark tooling is a nice addition
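
A guardrail of the kind praised above could be sketched with the standard `ast` module. This is not the contents of `test_lazy_imports.py`, which is not shown on this page; the function name and sample sources are hypothetical. The check matters because `from module import name` resolves `name` immediately, so it would trigger the lazy `__getattr__` at import time.

```python
import ast


def find_forbidden_imports(source: str) -> list[int]:
    """Return line numbers of `from ...lazy_heavy_imports import X` statements."""
    forbidden = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # `node.module` is None for relative imports like `from . import x`.
            last_part = (node.module or "").split(".")[-1]
            if last_part == "lazy_heavy_imports":
                forbidden.append(node.lineno)
    return sorted(forbidden)


# Namespace-style usage: attribute access is deferred to call time.
GOOD = "import data_designer.lazy_heavy_imports as lazy\n\ndef f():\n    return lazy.pd\n"
# From-import: resolves `pd` and `np` as soon as the module is imported.
BAD = "from data_designer.lazy_heavy_imports import pd, np\n"

good_hits = find_forbidden_imports(GOOD)
bad_hits = find_forbidden_imports(BAD)
```

A real test would presumably run such a function over every source file and assert that the result is empty.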

Core Problem: Anti-Pattern in the Fix

The PR defers heavy imports by putting them inside function bodies throughout the CLI layer (create.py, reset.py, list.py, generation_controller.py). This violates the project's own guideline: "Avoid importing Python modules inside method definitions."

This is a symptom, not a solution. The root cause is that main.py eagerly loads all 10 command modules at startup.

Suggestion 1: Lazy Command Loading in `main.py`

Instead of making every command module defer its own imports, make main.py load command modules only when their command is invoked. Typer wraps Click, which supports this via a custom Group:

# cli/lazy_group.py
import importlib
import click

class LazyGroup(click.Group):
    """Click group that defers loading command modules until invocation."""

    def __init__(self, *args, lazy_subcommands=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._lazy_subcommands = lazy_subcommands or {}

    def list_commands(self, ctx):
        base = super().list_commands(ctx)
        return base + sorted(self._lazy_subcommands.keys())

    def get_command(self, ctx, cmd_name):
        if cmd_name in self._lazy_subcommands:
            module_path, func_name = self._lazy_subcommands[cmd_name]
            module = importlib.import_module(module_path)
            return getattr(module, func_name)
        return super().get_command(ctx, cmd_name)

Then main.py registers commands by name without importing their modules:

app = typer.Typer(cls=LazyGroup, lazy_subcommands={
    "preview": ("data_designer.cli.commands.preview", "preview_command"),
    "create":  ("data_designer.cli.commands.create",  "create_command"),
    ...
})

Result: Command modules can use normal module-level imports — they only execute when the command is actually invoked, not for --help. No function-level import hacks needed.
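
The deferral itself can be observed without Click or Typer. The sketch below is framework-free and uses only the standard library: it writes a throwaway command module to a temporary directory (module and function names are hypothetical), registers it by name, and checks `sys.modules` before and after resolution. It illustrates the registry idea only and makes no claim about how a particular CLI framework renders help text.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway "command module" to disk so the import is observable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "sketch_preview_cmd.py").write_text(
    "def preview_command():\n    return 'preview ran'\n"
)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

# Command name -> (module path, attribute), mirroring lazy_subcommands above.
LAZY_COMMANDS = {"preview": ("sketch_preview_cmd", "preview_command")}


def list_commands():
    # Listing the registered names requires no imports.
    return sorted(LAZY_COMMANDS)


def get_command(name):
    module_path, func_name = LAZY_COMMANDS[name]
    module = importlib.import_module(module_path)  # the import happens here
    return getattr(module, func_name)


names = list_commands()
loaded_before = "sketch_preview_cmd" in sys.modules
result = get_command("preview")()
loaded_after = "sketch_preview_cmd" in sys.modules
```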

Suggestion 2: Split `seed_source.py` to Break the pandas Chain

The single biggest transitive contamination:

DataDesignerConfigBuilder
  → config_builder.py imports DataFrameSeedSource
    → seed_source.py line 24: pd = lazy.pd  ← triggers pandas+numpy for ENTIRE module

seed_source.py contains LocalFileSeedSource, HuggingFaceSeedSource, DataFrameSeedSource, and DatabricksVolumeSeedSource. Only DataFrameSeedSource needs pandas.

Split into:

seed_source.py → base class + LocalFile, HuggingFace, DatabricksVolume (no pandas)
seed_source_dataframe.py → DataFrameSeedSource only (pd = lazy.pd stays here)

Then config_builder.py can import DataFrameSeedSource from the isolated file — only that file pays the pandas cost.

Suggestion 3: Fix Remaining Eager-Import Offenders

File	Problem	Fix
`configurable_task.py:15`	`TypeVar("DataT", dict, lazy.pd.DataFrame)` — evaluates immediately	Use unconstrained `TypeVar("DataT")` at runtime; put constrained version in `TYPE_CHECKING`
`sampling_gen/base.py:9,20`	`from numpy.typing import NDArray` + `TypeAlias = int | lazy.np.random.RandomState`	Move both to `TYPE_CHECKING` (`from __future__ import annotations` already present)
`sampling_gen/people_gen.py:30`	`TypeAlias = lazy.faker.Faker \| ...`	Move to `TYPE_CHECKING`
`phone_number.py:13`	`ZIP_AREA_CODE_DATA = lazy.pd.read_parquet(...)` — I/O + pandas at import time	Wrap in `@functools.lru_cache` function
`gsonschema/validators.py:17`	`DEFAULT_JSONSCHEMA_VALIDATOR = lazy.jsonschema.Draft202012Validator`	Wrap in `@functools.lru_cache` function
`gsonschema/exceptions.py:9`	`class JSONSchemaValidationError(lazy.jsonschema.ValidationError)`	Unavoidable — isolate into its own file so importing other gsonschema utilities doesn't trigger jsonschema
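
The `@functools.lru_cache` fix suggested for module-level constants can be sketched like this. The loader and its return value are hypothetical placeholders for something expensive such as reading a parquet file; the getter name is also made up for the example.

```python
import functools

load_calls = []


def _expensive_load():
    # Hypothetical stand-in for I/O plus a heavy import.
    load_calls.append("load")
    return {"90210": "310"}


@functools.lru_cache(maxsize=1)
def get_zip_area_code_data():
    """Load the data on first call and reuse the cached result afterwards."""
    return _expensive_load()


# Nothing has been loaded just by defining the getter.
calls_at_import = len(load_calls)

first = get_zip_area_code_data()
second = get_zip_area_code_data()
```

Callers replace references to the former constant with a call to the getter, so the cost moves from import time to first use.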

Suggestion 4: Minor Cleanup

list.py:9-13: Dead TYPE_CHECKING block — imports repository types but they're never used in type annotations, only at runtime inside function bodies. Remove or move to runtime imports.
engine/models/errors.py:128-196: match/case with lazy.litellm.exceptions.APIError() works at runtime but is fragile — if litellm changes its exception hierarchy, the match silently falls through. Consider adding a test to verify match arms resolve correctly, or use explicit isinstance checks.

Priority Order

Lazy command loading in main.py — eliminates the anti-pattern entirely, biggest architectural win
Split seed_source.py — breaks the pandas chain for DataDesignerConfigBuilder
Fix TypeVar/TypeAlias eager evaluations — straightforward TYPE_CHECKING moves
Wrap module-level constants in lru_cache — phone_number.py, validators.py
Isolate unavoidable eager imports — exceptions.py class inheritance

nabinchha · 2026-02-17T19:35:32Z

Maybe an alternative for suggestion 2:
replace the isinstance(seed_config.source, DataFrameSeedSource) check with seed_config.source.seed_type == "df" to remove the need to import DataFrameSeedSource in this module.
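
A minimal sketch of that alternative, using simplified dataclasses in place of the real seed source models. Only the `"df"` value comes from the comment above; the `"local"` value, the `path` field, and the helper name are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocalFileSeedSource:
    path: str
    seed_type: str = "local"  # hypothetical discriminator value


@dataclass
class DataFrameSeedSource:
    seed_type: str = "df"  # value taken from the comment above


def is_dataframe_source(source) -> bool:
    # Compares the discriminator string, so the calling module does not
    # need to import DataFrameSeedSource (or anything it pulls in).
    return getattr(source, "seed_type", None) == "df"


checks = [
    is_dataframe_source(DataFrameSeedSource()),
    is_dataframe_source(LocalFileSeedSource(path="seed.csv")),
]
```

The trade-off is that a string comparison is not checked by the type system the way an `isinstance` test is.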

mikeknep · 2026-02-17T19:42:42Z

RE: the suggestion to split seed_source.py and extract the DF one into its own module—I like the idea in theory, but in practice wouldn't that kind of be undone by the top-level data_designer.config.__init__.py file importing it?

If so, would it make sense to do some sort of getattr trick in that top-level config module to defer import/eval of the DataFrameSeedSource object?

nabinchha · 2026-02-17T19:55:41Z

but in practice wouldn't that kind of be undone by the top-level data_designer.config.__init__.py file importing it?

Yes, might be a good idea to stop exporting everything out from __init__.py files if it can be avoided.

Also, starting to cut down on module size might help reduce the import blast radius. For example, models.py has lots of things that all get imported when the consumer might only want one thing.

Having a file per class/module method, etc might be overkill (though that's how I always did things in compiled languages), but there's probably a middle ground in python

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files
- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes
- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks
- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use
- Update test mock targets to patch at usage-site for module-level imports
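
The "TYPE_CHECKING blocks with runtime fallbacks" item can be sketched as below. This follows the shape suggested in the review table for `DataT`; the `passthrough` function is a hypothetical consumer, and the exact definitions in the repository may differ.

```python
from typing import TYPE_CHECKING, TypeVar

if TYPE_CHECKING:
    # Seen only by type checkers, so this import never runs at runtime.
    import pandas as pd

    DataT = TypeVar("DataT", dict, pd.DataFrame)
else:
    # Runtime fallback: unconstrained, so defining it needs no pandas.
    DataT = TypeVar("DataT")


def passthrough(data: DataT) -> DataT:
    """Hypothetical consumer; type checkers see the constrained TypeVar."""
    return data


runtime_constraints = DataT.__constraints__
echoed = passthrough({"a": 1})
```

At runtime the TypeVar has no constraints, while static analysis still sees `dict` or `pd.DataFrame`.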

johnnygreco · 2026-02-17T21:49:52Z

@jogreco — here's a detailed review of this PR with suggestions.

@nabinchha – had a go at implementing in 9acacdd

greptile-apps

87 files reviewed, 2 comments


packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py

packages/data-designer/tests/test_lazy_imports.py

Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity.

Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern.

packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py

nabinchha

wdyt about updating these thresholds since there's a nice improvement on cold starts?

DataDesigner/packages/data-designer/tests/test_import_perf.py

Lines 8 to 13 in cbf7182

    
    # Maximum allowed average import time in seconds
    # Average of 1 cold start + 4 warm cache runs
    # Cold starts vary 4-13s due to OS caching, system load, CPU scaling
    # Warm cache consistently <3s. Average should be well under 6s.
    MAX_IMPORT_TIME_SECONDS = 6.0
    PERF_TEST_TIMEOUT_SECONDS = 30.0

Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner.

Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support.

nabinchha

🚢

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted

Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.

johnnygreco requested a review from a team as a code owner February 16, 2026 20:39

johnnygreco added 5 commits February 16, 2026 15:47

Refine lazy import usage and TYPE_CHECKING cleanup

820815a

Run license header updater on PR-touched files

92c61ec

fix: update sqlfluff mock target for lazy imports in test_sql

1a95f75

johnnygreco force-pushed the johnny/chore/improve-cli-startup-time branch from 0b6edb2 to 1a95f75 February 16, 2026 20:49

johnnygreco changed the title ~~Improve CLI startup with lazy heavy import cleanup~~ chore: Improve CLI startup with lazy heavy import cleanup Feb 16, 2026

perf: cache globals() in lazy __getattr__ to avoid repeated lookups

fb812c1


nabinchha reviewed Feb 17, 2026

packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/base.py

nabinchha reviewed Feb 17, 2026

packages/data-designer/src/data_designer/cli/commands/reset.py (outdated)

greptile-apps bot reviewed Feb 17, 2026

packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py (outdated)

packages/data-designer/tests/test_lazy_imports.py

johnnygreco added 3 commits February 17, 2026 17:19

refactor: use direct pandas import in seed_source_dataframe

4f711f5


update lazy import pattern

4ed31a5

update tests to use lazy import namespace

17fc695


johnnygreco requested a review from nabinchha February 18, 2026 00:36

Merge branch 'main' into johnny/chore/improve-cli-startup-time

5a8f3ca

nabinchha reviewed Feb 18, 2026

packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py

nabinchha reviewed Feb 18, 2026


johnnygreco added 3 commits February 18, 2026 15:09

tighten import perf test thresholds

ed89d14


document pandas import requirement

2b87adb


increase timeout time

da6350f

nabinchha previously approved these changes Feb 18, 2026


Merge branch 'main' into johnny/chore/improve-cli-startup-time

0294d3b

johnnygreco dismissed nabinchha’s stale review via 0294d3b February 18, 2026 21:08

johnnygreco added 2 commits February 18, 2026 16:10

use lazy pandas imports in visualization tests

3cb4957


nabinchha approved these changes Feb 18, 2026


johnnygreco merged commit 1439bbe into main Feb 18, 2026
46 checks passed

