chore: Improve CLI startup with lazy heavy import cleanup #330
johnnygreco merged 17 commits into main
Greptile Summary
This PR dramatically improves CLI startup performance by implementing comprehensive lazy loading for heavy third-party dependencies and CLI commands. The changes standardize usage of `import data_designer.lazy_heavy_imports as lazy`. The refactoring maintains clean architecture by separating lazy loading concerns while preserving type safety through `TYPE_CHECKING` imports.

| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/lazy_heavy_imports.py | Enhanced lazy imports facade with improved documentation explaining usage pattern and avoiding eager imports |
| packages/data-designer/src/data_designer/cli/main.py | Refactored to use lazy command loading via create_lazy_typer_group, removing eager imports of all commands |
| packages/data-designer/src/data_designer/cli/lazy_group.py | New lazy loading mechanism for CLI commands that defers module imports until command invocation |
| packages/data-designer-config/src/data_designer/config/models.py | Converted from eager import pattern to lazy.np access, removed TYPE_CHECKING import of np |
| packages/data-designer-engine/src/data_designer/engine/analysis/utils/column_statistics_calculations.py | Converted numpy/pandas imports to lazy pattern and added lru_cache to defer tokenizer initialization |
| scripts/benchmarks/benchmark_cli_startup.py | New comprehensive CLI startup benchmark measuring cold/warm times, compilation overhead, and import traces |
Flowchart
flowchart TD
A[CLI Entry Point] -->|Uses| B[LazyTyperGroup]
B -->|Defers loading| C[Command Modules]
C -->|Import on demand| D[Heavy Dependencies]
E[Runtime Code] -->|Uses| F[lazy_heavy_imports]
F -->|__getattr__| G[importlib.import_module]
G -->|First access only| H[numpy/pandas/duckdb/etc]
F -->|Cache in globals| I[Subsequent accesses]
J[Pydantic Models] -->|Direct import| H
K[TYPE_CHECKING] -->|Type hints only| H
style B fill:#90EE90
style F fill:#90EE90
style G fill:#FFD700
style J fill:#FFA07A
style H fill:#87CEEB
Last reviewed commit: 4f711f5
Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import-time from ~1.67s to ~0.46s.
- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time.
- Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.
0b6edb2 to 1a95f75
Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__.
packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/base.py
    ) from None

    - case litellm.exceptions.UnprocessableEntityError():
    + case lazy.litellm.exceptions.UnprocessableEntityError():
I'd double check this works for litellm. Last time I was working on this all tests and local/CI runs were passing, but later failing when pulling from a published package.
ah, thanks for reminding me!
@jogreco, here's a detailed review of this PR with suggestions.
What the PR Does Well
Core Problem: Anti-Pattern in the Fix
The PR defers heavy imports by putting them inside function bodies throughout the CLI layer (…). This is a symptom, not a solution. The root cause is that …
Suggestion 1: Lazy Command Loading in main.py

| File | Problem | Fix |
|---|---|---|
| configurable_task.py:15 | `TypeVar("DataT", dict, lazy.pd.DataFrame)` evaluates immediately | Use unconstrained `TypeVar("DataT")` at runtime; put constrained version in `TYPE_CHECKING` |
| sampling_gen/base.py:9,20 | `from numpy.typing import NDArray` + `TypeAlias = int \| lazy.np.random.RandomState` | Move both to `TYPE_CHECKING` (`from __future__ import annotations` already present) |
| sampling_gen/people_gen.py:30 | `TypeAlias = lazy.faker.Faker \| ...` | Move to `TYPE_CHECKING` |
| phone_number.py:13 | `ZIP_AREA_CODE_DATA = lazy.pd.read_parquet(...)`: I/O + pandas at import time | Wrap in `@functools.lru_cache` function |
| gsonschema/validators.py:17 | `DEFAULT_JSONSCHEMA_VALIDATOR = lazy.jsonschema.Draft202012Validator` | Wrap in `@functools.lru_cache` function |
| gsonschema/exceptions.py:9 | `class JSONSchemaValidationError(lazy.jsonschema.ValidationError)` | Unavoidable; isolate into its own file so importing other gsonschema utilities doesn't trigger jsonschema |
Suggestion 4: Minor Cleanup
- list.py:9-13: Dead `TYPE_CHECKING` block. It imports repository types but they're never used in type annotations, only at runtime inside function bodies. Remove or move to runtime imports.
- engine/models/errors.py:128-196: `match`/`case` with `lazy.litellm.exceptions.APIError()` works at runtime but is fragile: if litellm changes its exception hierarchy, the match silently falls through. Consider adding a test to verify match arms resolve correctly, or use explicit `isinstance` checks.
Priority Order
- Lazy command loading in main.py: eliminates the anti-pattern entirely, biggest architectural win
- Split seed_source.py: breaks the pandas chain for DataDesignerConfigBuilder
- Fix TypeVar/TypeAlias eager evaluations: straightforward TYPE_CHECKING moves
- Wrap module-level constants in lru_cache: phone_number.py, validators.py
- Isolate unavoidable eager imports: exceptions.py class inheritance
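The lazy command loading idea from the top priority item can be sketched without Typer or Click. The `LazyCommandRegistry` class below is hypothetical and is not the `LazyTyperGroup` from this PR; it maps command names to `"module:attribute"` strings and imports a module only when its command is requested, with standard-library functions standing in for command modules.

```python
import importlib
from typing import Callable


class LazyCommandRegistry:
    """Maps command names to "module:attribute" strings and imports on demand."""

    def __init__(self, lazy_commands: dict[str, str]) -> None:
        self._lazy_commands = lazy_commands
        self._loaded: dict[str, Callable] = {}

    def list_commands(self) -> list[str]:
        # Listing names (for example to render --help) needs no imports at all.
        return sorted(self._lazy_commands)

    def get_command(self, name: str) -> Callable:
        # Import the owning module only the first time the command is requested.
        if name not in self._loaded:
            module_name, _, attr = self._lazy_commands[name].partition(":")
            module = importlib.import_module(module_name)
            self._loaded[name] = getattr(module, attr)
        return self._loaded[name]

    def loaded(self) -> list[str]:
        return sorted(self._loaded)


# Stdlib functions stand in for CLI command callables.
registry = LazyCommandRegistry({"dump": "json:dumps", "wrap": "textwrap:shorten"})
```

With this shape, command modules can keep ordinary module-level imports, because the module itself is not loaded until its command runs.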
Maybe an alternative for suggestion 2: …
RE: the suggestion to split … If so, would it make sense to do some sort of …
Yes, might be a good idea to stop exporting everything out from … Also, starting to cut down on module size might help to reduce the import blast radius. For example, … Having a file per class/module method, etc. might be overkill (though that's how I always did things in compiled languages), but there's probably a middle ground in Python.
- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files
- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes
- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks
- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use
- Update test mock targets to patch at usage-site for module-level imports
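The "TYPE_CHECKING blocks with runtime fallbacks" item could look roughly like the sketch below. It is an assumption about the shape, not the PR's code: `decimal.Decimal` stands in for `pandas.DataFrame`, and `passthrough` is a hypothetical function.

```python
from typing import TYPE_CHECKING, TypeVar

if TYPE_CHECKING:
    # Seen only by type checkers; never executed, so nothing heavy is imported.
    import decimal  # stand-in for pandas

    DataT = TypeVar("DataT", dict, decimal.Decimal)
else:
    # Runtime fallback: unconstrained, so no heavy attribute is evaluated at import.
    DataT = TypeVar("DataT")


def passthrough(data: DataT) -> DataT:
    """Type checkers see the constrained TypeVar; the runtime sees the plain one."""
    return data
```

The trade-off is that the constraint is enforced only by static analysis, which is usually acceptable because `TypeVar` constraints are not checked at runtime anyway.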
@nabinchha – had a go at implementing in 9acacdd
packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py
Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity.
Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern.
packages/data-designer-config/src/data_designer/config/seed_source_dataframe.py
nabinchha left a comment
wdyt about updating these thresholds since there's a nice improvement on cold starts?
DataDesigner/packages/data-designer/tests/test_import_perf.py
Lines 8 to 13 in cbf7182
Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner.
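For context, an import-time check of this kind can be sketched as follows. The function name, run count, and timeout are hypothetical and are not the values in test_import_perf.py; the idea is to time `import <module>` in fresh interpreters so earlier imports in the test process do not hide the cost.

```python
import subprocess
import sys
import time


def average_import_seconds(module: str, runs: int = 3, timeout: float = 10.0) -> float:
    """Average wall-clock time of `python -c "import <module>"` in fresh interpreters."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new interpreter per run gives a cold sys.modules each time.
        subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            check=True,
            timeout=timeout,
        )
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

A test would then assert that the average stays below a documented threshold; the measurement includes interpreter startup, so the threshold needs some headroom above the bare import cost.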
Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support.
- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted
Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.
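Why the patch target has to be the module-local symbol can be shown with a small sketch. The modules `engine_demo` and `controller_demo` are hypothetical and built in memory; `controller.DataDesigner` plays the role of a name bound by a module-level `from ... import DataDesigner`.

```python
import types
from unittest import mock

# Hypothetical "engine" module that defines the real class.
engine = types.ModuleType("engine_demo")


class DataDesigner:
    def run(self) -> str:
        return "real"


engine.DataDesigner = DataDesigner

# Hypothetical "controller" module that imported the class at module level,
# which binds its own reference to the same class object.
controller = types.ModuleType("controller_demo")
controller.DataDesigner = engine.DataDesigner


def preview() -> str:
    # Mirrors a controller function looking the name up in its own module namespace.
    return controller.DataDesigner().run()
```

Patching `engine.DataDesigner` leaves the controller's reference untouched, so `preview()` still runs the real class; patching `controller.DataDesigner` replaces the name that `preview()` actually looks up.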
Summary
- `import data_designer.lazy_heavy_imports as lazy` across config/engine/interface modules.
- `TYPE_CHECKING` imports in non-`__init__.py` files so they only include symbols actually used for type hints.
- `TYPE_CHECKING` export blocks in package `__init__.py` facades for IDE UX and lazy-export behavior.
- `managed_dataset_repository` (`lazy.duckdb`).
- `pandas` import in the Pydantic model `ColumnConfigWithDataFrame` to ensure runtime type resolution.
- `packages/data-designer/tests/test_lazy_imports.py`.

Validation
- `uv run ruff check --select F401`
- `make test-engine`

Benchmark Reference
before changes in this branch:
after changes in this branch: