Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make use of URLConnection cache configurable in PlexusIoURLResource #12

Closed
rbjorklin opened this issue May 23, 2018 · 27 comments
Closed

Comments

@rbjorklin
Copy link

rbjorklin commented May 23, 2018

See #2 on why the cache was initially disabled.

Maven-assembly-plugin depends on plexus-io and having the cache enabled does wonders for the build performance when building a jar-with-dependencies and the dependencies are fairly large such as eclipse-collections.

EDIT: https://github.com/codehaus-plexus/plexus-io/blob/master/src/main/java/org/codehaus/plexus/components/io/resources/PlexusIoURLResource.java#L42

@michael-o
Copy link
Member

Please read my comment at #2.

@plamentotev
Copy link
Member

Hi,

How does enabling the cache speeds up your build? Does that means that without the cache a jar is downloaded multiple times(during the build not necessary by the Assembly plugin)? My point is that if a jar is downloaded multiple times, when it is perfectly safe to use cache, then the local .m2 cache should be used instead. So maybe that is the root cause of the issue.

@rbjorklin
Copy link
Author

rbjorklin commented May 24, 2018

@plamentotev My current understanding is that the file is kept in memory and each class file is read from the cached jar. When the cache isn't used the 9.6MB eclipse-collections jar is read from disk once per class file which takes unnecessary time. The build does not download the file multiple times, the build works and takes the same amount of time with the network cable unplugged.

@michael-o
Copy link
Member

@rbjorklin Did you check your assumption with VisualVM or similar?

@rbjorklin
Copy link
Author

rbjorklin commented May 24, 2018

@michael-o With the help of YourKit we could see that the maven build spent a majority of it's time in this class and we decided to try turning the cache on. We're currently using a snapshot build of plexus-io with the cache enabled and seeing good build performance with maven-assembly 3.1.0.

@michael-o
Copy link
Member

@rbjorklin Thanks for the update. Willing to review a PR.

@plamentotev
Copy link
Member

@rbjorklin thanks for the information. Plexus Archiver uses PlexusIoZipFileResourceCollection to add the content of the Jar dependencies into the resulting archive. It uses separate PlexusIoURLResource for each entry in the Jar file and that leads to the performance penalty you see. I just didn't expect the difference to be so huge if the Jar file is stored on the file system.

If Plexus Archiver uses PlexusArchiverZipFileResourceCollection the issue with opening the Jar file for each entry would be solved. Would you please try if that solves you performance issue? It easy to make the change. Here is the diff based on Plexus Archiver 3.6.0:

diff --git a/src/main/resources/META-INF/plexus/components.xml b/src/main/resources/META-INF/plexus/components.xml
index 71d47ac..1c613b7 100644
--- a/src/main/resources/META-INF/plexus/components.xml
+++ b/src/main/resources/META-INF/plexus/components.xml
@@ -321,7 +321,7 @@
       <role>org.codehaus.plexus.components.io.resources.PlexusIoResourceCollection</role>
       <role-hint>jar</role-hint>
       <!-- there is no implementation of PlexusIoJarFileResourceCollection, but PlexusIoZipFileResourceCollection will do the job -->
-      <implementation>org.codehaus.plexus.archiver.zip.PlexusIoZipFileResourceCollection</implementation>
+      <implementation>org.codehaus.plexus.archiver.zip.PlexusArchiverZipFileResourceCollection</implementation>
       <instantiation-strategy>per-lookup</instantiation-strategy>
     </component>
     <component>

As far I know PlexusArchiverZipFileResourceCollection is the better implementation but there are some IT failing (but that was some time ago) preventing it from being the default. I guess its worth investigating if PlexusArchiverZipFileResourceCollection could be used for all ZIP based formats - Jar, War, etc

@michael-o
Copy link
Member

@plamentotev What is the difference between PlexusIoZipFileResourceCollection and PlexusArchiverZipFileResourceCollection in brief?

@plamentotev
Copy link
Member

@michael-o if you ask why there are two implementations - all I know is this comment.

If you ask what is the difference between the implementations:

  • PlexusIoZipFileResourceCollection uses org.apache.commons.compress.archivers.zip.ZipFile to get the list of all entries inside the ZIP file, but the InputStream for each entry is retrieved using PlexusIoURLResource (each entry can be represented as jar:{fileURL}!/{entryName}). But that means the ZIP headers are parsed for each entry to locate it inside the ZIP file.
  • PlexusArchiverZipFileResourceCollection also uses org.apache.commons.compress.archivers.zip.ZipFile to get the list of all entries inside the ZIP file, but it uses it to retrieve the InputStream for each entry as well. So the extra work to open the ZIP file and read the headers for each entry is avoided.

At first sight looks like they should be interchangeable, but maybe there is some subtle difference in the behavior. PlexusArchiverZipFileResourceCollection is used for ZIP files, so in the base case it works. PlexusIoZipFileResourceCollection is used for Jar, War, etc.

@michael-o
Copy link
Member

That's really confusing because they are all ZIP files after all.

@plamentotev
Copy link
Member

Yes, that is why I think is better to replace PlexusIoZipFileResourceCollection with PlexusArchiverZipFileResourceCollection for all ZIP formats than fixing the PlexusArchiverZipFileResourceCollection performance. I've created codehaus-plexus/plexus-archiver#90 so we can track it.

That being said, making the cache configurable may still make sense if PlexusIoURLResource have other uses.

@rbjorklin
Copy link
Author

@plamentotev Just want to make sure I understood you correctly. The change you proposed was in the maven-assembly-plugin itself, right? If so then changing to PlexusArchiverZipFileResourceCollection made no performance difference unfortunately. So for now we're sticking with our patched plexus-io.

Any suggestions on a sane way forward?

@plamentotev
Copy link
Member

plamentotev commented May 28, 2018 via email

@rbjorklin
Copy link
Author

Gotcha! Making the suggested changes to plexus-archiver did make the build performance on-par with our hack to turn the cache on in PlexusIoURLResource. Your suggested change in codehaus-plexus/plexus-archiver#90 has definitely got my vote!

@rbjorklin
Copy link
Author

@plamentotev Is there anything required on my part for you suggested changes to be implemented and merged into master?

@plamentotev
Copy link
Member

@rbjorklin I suppose you mean using PlexusArchiverZipFileResourceCollection for all ZIP based archive types, right? I want to make sure that the change does not bring any regressions so I've decided to implement the change and see if all of the integration tests of the plugins using Plexus Archiver are passing[1]. Unfortunately I encounter problems with the first one I've tried - Maven Assembly Plugin. It looks like the master (even without the changes) is not stable and some of the integrations tests are failing. The official build does contain some failures, but I think I got even more when I ran it on my machine. Any help with making the Maven Assembly Plugin build stable and green will be greatly appreciated and will help with proceeding with this issue as it'll give us confidence that the change does not introduce any regressions.

[1] I think those are Maven Assembly Plugin, Maven Dependency Plugin, Maven Shade Plugin, Maven JAR Plugin, Maven WAR Plugin, Spring Boot Maven Plugin.

@slachiewicz
Copy link
Member

With cache disabled in class sun.net.www.protocol.jar.JarURLConnection.JarURLInputStream with every close() - jar file is closed also. For jars like Bouncycastle with signatures inside - reading every entry (ie open jar) involves signature verification - that is why we see slow performance.

java.util.jar.JarFile#getInputStream

@plamentotev
Copy link
Member

@slachiewicz thanks for the update - that makes sense. I wonder if PlexusArchiverZipFileResourceCollection verifies the signatures. I should check that - that may explain why it is not used for Jar based files.

@vojtechhabarta
Copy link

We have the same problem. Creating jar with dependencies using maven-assembly-plugin is really slow when we depend on signed jars like Bouncy Castle.

Removing following line speeds up our build from 6 minutes to 30 seconds:

I understand that caching should not be enabled by default. So how should this be configurable? Do you think system property (-D) would be fine? I think ideally this should be configurable using some configuration parameter in maven-assembly-plugin but this is more complex.

I can submit PR if there is an agreement how to solve it.

@michael-o
Copy link
Member

@plamentotev. I know system props is a bit ugly, but making it configurable through the entire Maven chain will take way more time.

@vojtechhabarta
Copy link

I investigated how this flag could be propagated from maven-assembly-plugin through plexus-archiver into plexus-io. It seems it could be done similarly to how encoding propagation was done: apache/maven-assembly-plugin@f9869e2, codehaus-plexus/plexus-archiver@530c2ee, de3b52d.
But this means to add yet another overloads to several methods in 3 repositories.

Maybe some shortcut would be better solution in this case. One possibility would be to control the caching using system property (as I mentioned before), another would be to add public static flag to PlexusIoURLResource which could be set directly from AbstractAssemblyMojo.

Significant disadvantage of not allowing users to set this flag from plugin (i.e. only system property) would be really bad discoverability/documentation of this feature.

@plamentotev
Copy link
Member

In general I think if we're going to add the caching flag all the way to the plugins we should add it to the interface (PlexusIoResourceCollection). Adding it only to the concrete implementation (PlexusIoZipFileResourceCollection) will effectively tied the plugins to it and it make the code more tightly coupled. Keep in mind that if we add the flag we should fix #2 in some other way. A flag that speeds the build but opens a bug (that is not manifested on each run so essentially you can't be really sure if your build is exposed to it unless you're familiar with how Maven works internally and most users are not) should not be added lightly.

About the issue discussed here. The root cause for the performance problems is the PlexusIoZipFileResourceCollection implementation and I think we should fix it as a long term solution. One proposed solution is codehaus-plexus/plexus-archiver#90, but other are welcome too. For example we can modify it to use JarFile instead of PlexusIoURLResource (URLConnection internally uses JarFile ) so we'll preserve the same functionality and fix the performance issues without having the same problem as in #2.

Having system prop is really ugly IMHO. Especially when we have better option (like fixing the root cause of the performance issues). The only advantage is that is quick. Correct me if I'm wrong but I don't think this issue is that critical to justify making the code ugly or hard to maintain only to fix it quickly.

@plamentotev
Copy link
Member

I did some more investigation. If PlexusIoZipFileResourceCollection uses JarFile instead of PlexusIoURLResource that would allow a single JarFile instance to be used (the initialization of the instance is what causes the performance issues when the JAR is signed). On other hand we avoid using the cache. The problem with the cache is that is global and as shown by #2 may cause problems in certain situations. URL and URLConnection( used by PlexusIoURLResource) in our case are just wrappers around JarFile (they can work with any protocol but for JAR files are just creating JarFile) so this change will keep the existing PlexusIoZipFileResourceCollection and Plexus Archiver behavior and those reducing the risk that comes with codehaus-plexus/plexus-archiver#90.

I think that is the better solution and if you agree I can implement it (ETA this or next week).

@vojtechhabarta
Copy link

This issue is important for us so I created our own version of this module (with enabled caching) which we can use until performance is solved.

Having "right" long term solution would be ideal, I just don't know how exactly this should be done. If you could do it, it would be really good.

@snago
Copy link

snago commented Jan 4, 2019

I've just verified that when using maven-assembly-plugin 3.1.1 we don't need the locally modified snapshot build of plexus-io that rbjorklin did for performance anymore.

Thanks @plamentotev!

@rbjorklin This issue can be closed.

@vojtechhabarta
Copy link

I've also verified that maven-assembly-plugin 3.1.1 (with default dependencies) works well.

Thanks.

@plamentotev
Copy link
Member

I'll close the issue as the performance issue is solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants