
Usage of generators instead of list/tuples in executemany #200

Open
meisn opened this issue Jul 4, 2018 · 9 comments

Comments

meisn commented Jul 4, 2018

Answer the following questions:

  1. What is your version of Python? Is it 32-bit or 64-bit?
    Python 3.6.4/3.7 64bit

  2. What is your version of cx_Oracle?
    6.4

  3. What is your version of the Oracle client (e.g. Instant Client)? How was it
    installed? Where is it installed?
    11.2, packaged installation in my company

  4. What is your version of the Oracle Database?
    11gR2

  5. What is your OS and version?
    Windows 7 64bit Enterprise

  6. What compiler version did you use?
    none

  7. What environment variables did you set? How exactly did you set them?

  8. What exact command caused the problem (e.g. what command did you try to
    install with)? Who were you logged in as?
    Nothing actually; I intended to use executemany() to bulk-insert a large dataset that first has to be retrieved from a file (i.e. the parameter list is generated). I was logged in as a regular user, not SYS or any privileged account; my personal user has write privileges on the relevant database tables, of course.

  9. What error(s) you are seeing?
    I have to insert many records from large XML files. I use a function for this, and I also have a generator version of it. On small files (Python process below 300 MB) I can build the full list of parameters and call executemany(stmnt, params) without any issue. But I also have some larger files of the same kind, around 90 MB each, which push memory consumption far above that (other modules are involved in producing the parameter set). If I could pass my generator instead, i.e. if the whole list did not need to be produced up front, memory consumption would be much lower and my process would not fail on these files. I experimented with the bundled sqlite3 module, whose executemany() accepts a generator (https://docs.python.org/3/library/sqlite3.html), and compared the two approaches: with the list-building function the Python process grew to nearly 1 GB (though it did not fail), while the generator version consumed only about 250 MB. It ran somewhat longer, of course, but that is acceptable when the insert itself is what matters.

Can there be an enhancement like this in one of the future versions?
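For reference, the sqlite3 behaviour described above can be reproduced with a short sketch (the table, columns, and data below are made up for illustration):

```python
import sqlite3

def rows(n):
    # Yield parameter tuples one at a time instead of building
    # the whole list in memory first.
    for i in range(n):
        yield (i, "name-%d" % i)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
# sqlite3's executemany() accepts any iterable, including a generator.
conn.executemany("INSERT INTO t VALUES (?, ?)", rows(1000))
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 1000
```

Only one batch of rows ever needs to exist at a time, which is what keeps the memory footprint low.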

@anthony-tuininga (Member)
That's an interesting request and it seems perfectly reasonable to request, too. :-) I'll consider this for a future release.

meisn (Author) commented Jul 4, 2018

Hi Anthony,
I'm really impressed now, not just by how quickly you responded but also that you accepted this as a request :-)
Thank you. I would appreciate the help.
Regards,
Ronny

cjbj (Member) commented Jul 4, 2018

@anthony-tuininga I guess you are thinking of the internal batching & inserting of data that was once discussed? Otherwise cx_Oracle would still need to instantiate the full data set in memory before calling OCI.

@meisn you would have seen this in Anthony's blog post:

For really huge data sets there may be external buffer or network limits to how many rows can be processed at one time. These limits are based on both the number of rows being processed as well as the "size" of each row that is being processed. The sweet spot can be found by tuning your application. Repeated calls to executemany() are still better than repeated calls to execute().

To work with the current functionality, try calling executemany() multiple times.

@anthony-tuininga I wonder whether direct path loading would be more interesting for large data sets?

@anthony-tuininga (Member)
@cjbj, yes, I was thinking of batching the inserts and giving the caller the chance to specify how many rows would be processed in each batch, with a reasonable default. I do hope to take a look at direct path loading as well -- but in both cases batching will be needed for very large data sets.

meisn (Author) commented Jul 5, 2018

@cjbj Thanks for the link and the hint. In fact, you recommended there that I open this "issue" here :-)
I'm not sure about calling executemany() multiple times; I would need to split my XML file into reasonably sized chunks first, which is not that easy either. But I will investigate that route as well.

@anthony-tuininga (Member)
@meisn, what I think @cjbj is suggesting is that you would create your own version of executemany() in Python code which would accept the iterator. That method would then consume the iterator up to a certain number of rows (100? 1000? 10000?) and then call the real executemany() and continue doing this until the iterator was exhausted. This would be fairly easy to do in Python code, I believe.
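A minimal sketch of such a wrapper, under the assumption that the function name and default batch size are free choices; it works with any DB-API cursor, so it is demonstrated here with sqlite3 rather than cx_Oracle:

```python
import itertools
import sqlite3

def executemany_iter(cursor, statement, rows, batch_size=1000):
    """Consume `rows` (any iterable, e.g. a generator) in batches of
    `batch_size`, calling the real executemany() once per batch."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        cursor.executemany(statement, batch)

# Demonstration with sqlite3; with cx_Oracle only the connection and
# the bind-placeholder style (:1, :2, ...) would differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
executemany_iter(conn.cursor(), "INSERT INTO t VALUES (?)",
                 ((i,) for i in range(2500)), batch_size=1000)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 2500
```

Only `batch_size` rows are ever materialized at once, so memory stays bounded regardless of how many rows the iterator ultimately yields.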

meisn (Author) commented Jul 5, 2018

@anthony-tuininga I can try that. I just realized that the bulk of this issue lies in my own code: I create a list of dictionaries (to have named parameters and to accommodate the different types of XML tags the files can contain). This works fine up to a certain file size but is terrible for the larger ones.
I will also need to think about alternatives.
Thanks so far; I hope you still implement this enhancement at some point, even if it is just a convenience.

@anthony-tuininga (Member)
No problem. And I will certainly consider this enhancement request further when I get some time!

meisn (Author) commented Jul 11, 2018

@anthony-tuininga I have worked around my issue for now by producing smaller input files and hence smaller parameter sets. I use dictionaries as parameters, which, it turned out, was probably not my best choice. But with smaller parameter sets and more executemany() calls it works fine.
Thanks for your suggestions, and to @cjbj as well.
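For readers wondering about the dictionary parameter style mentioned here: executemany() can take a sequence of dictionaries keyed by bind-variable name. The sketch below uses sqlite3 (table and columns are invented); cx_Oracle uses the same :name placeholder syntax, though each dictionary carries more per-row overhead than a plain tuple, which matters for large parameter sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, value TEXT)")

# Named placeholders (:name, :value) are bound from dictionary keys,
# one dictionary per row.
params = [{"name": "a", "value": "1"},
          {"name": "b", "value": "2"}]
conn.executemany("INSERT INTO tags VALUES (:name, :value)", params)
print(conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0])  # prints 2
```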


3 participants