Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor bulk_create calls #9047

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from
Open

Conversation

zhiltsov-max
Copy link
Contributor

@zhiltsov-max zhiltsov-max commented Feb 3, 2025

Motivation and context

  • Added default batch size in bulk create calls
  • Removed SQLite workaround in bulk_create() calls

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

@zhiltsov-max zhiltsov-max requested a review from SpecLad as a code owner February 3, 2025 17:58
@SpecLad
Copy link
Contributor

SpecLad commented Feb 4, 2025

What's the purpose of this?

@zhiltsov-max
Copy link
Contributor Author

As the description states - use chunk size by default. It's often unintentionally forgotten.

@SpecLad
Copy link
Contributor

SpecLad commented Feb 4, 2025

As the description states - use chunk size by default. It's often unintentionally forgotten.

Sure, but why do we want this?

@zhiltsov-max
Copy link
Contributor Author

Because we don't want it to be unintentionally forgotten in random places.

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 87.69231% with 8 lines in your changes missing coverage. Please review.

Project coverage is 73.75%. Comparing base (908a47d) to head (8fe0455).
Report is 16 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9047      +/-   ##
===========================================
+ Coverage    73.43%   73.75%   +0.32%     
===========================================
  Files          419      428       +9     
  Lines        44351    44523     +172     
  Branches      3875     3881       +6     
===========================================
+ Hits         32569    32839     +270     
+ Misses       11782    11684      -98     
Components Coverage Δ
cvat-ui 77.47% <ø> (+0.03%) ⬆️
cvat-server 70.69% <87.69%> (+0.58%) ⬆️

Comment on lines +34 to +37
ignore_conflicts: bool = ...,
update_conflicts: bool | None = ...,
update_fields: Sequence[str] | None = ...,
unique_fields: Sequence[str] | None = ...,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be replaced with **kwargs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can, I just wanted it to be a "drop-in" replacement with the same signature. It can't be a pure drop-in, because now we require the first model argument, but this design offers good enough compatibility.

if "postgresql" in settings.DATABASES["default"]["ENGINE"]:
return db_model.objects.bulk_create(objs, batch_size=batch_size, **kwargs)
else:
# imitate RETURNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this workaround can be removed now? The only relevant database that this would be used for is SQLite; according to the Django docs, SQLite 3.35 supports returning the primary key; and Ubuntu 22.04 already has SQLite 3.37.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably.

@@ -1153,7 +1153,7 @@ def _append_upload_info_entries(self, client_files: list[dict[str, Any]]):
task_data.client_files.bulk_create([
ClientFile(file=self._prepare_upload_info_entry(cf['file'].name), data=task_data)
for cf in client_files
])
], batch_size=settings.DEFAULT_DB_BULK_CREATE_BATCH_SIZE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the helper function here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving this to cvat/utils. Doesn't seem like it has much to do with engine.

@@ -95,3 +95,5 @@
MAX_CONSENSUS_REPLICAS = int(os.getenv("CVAT_MAX_CONSENSUS_REPLICAS", 11))
if MAX_CONSENSUS_REPLICAS < 1:
raise ImproperlyConfigured(f"MAX_CONSENSUS_REPLICAS must be >= 1, got {MAX_CONSENSUS_REPLICAS}")

DEFAULT_DB_BULK_CREATE_BATCH_SIZE = int(os.getenv("CVAT_DEFAULT_DB_BULK_CREATE_BATCH_SIZE", 1000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be a setting, and an environment variable? Under what circumstances would one want to adjust this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, it's just a named constant. I can't say why this may require changes so far, so I'm open to removing configurability. Typically, it's better have such performance constants configurable.

update_conflicts: bool | None = ...,
update_fields: Sequence[str] | None = ...,
unique_fields: Sequence[str] | None = ...,
*,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to move this above batch_size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used API from the documentation (and from the VS code django stub). Do you think the argument order should be changed?

bulk_create(objs, batch_size=None, ignore_conflicts=False, update_conflicts=False, update_fields=None, unique_fields=None)

Copy link

sonarqubecloud bot commented Feb 6, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants