
JARs should be self-contained and not rely on external virtualenvs [or use Storm hooks and get rid of SSH] #99

Open
dan-blanchard opened this issue Jan 22, 2015 · 14 comments

Comments

@dan-blanchard
Member

I've been looking over the Pyleus code a little, and one thing they do that makes deployment simpler is that they create the entire virtualenv inside the JAR instead of having it reside on the servers. They don't require SSH at all, because they require people to have Storm installed somewhere on their PATH, and then they just use the storm command directly and pass it the host and port for Nimbus. I end up installing Storm with streamparse anyway so that I can run storm ui, and I don't think I'm alone there.

If we switched to putting everything in the JAR, we wouldn't have to worry about SSH at all anymore and could hopefully get rid of our dependency on fabric, since that's not Python 3 compatible (as I keep mentioning 😄).
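As a rough illustration of the Pyleus-style, SSH-free flow described above (not streamparse's actual code), the submission step amounts to shelling out to the storm CLI and pointing it at Nimbus via config overrides. The JAR path, main class, and exact config keys below are assumptions:

```python
# Hypothetical sketch of SSH-free submission: invoke the `storm` CLI directly
# and pass the Nimbus location as config overrides. The JAR path, main class,
# and the nimbus.host / nimbus.thrift.port keys are illustrative assumptions.
import subprocess

def submit_jar(jar_path, main_class, nimbus_host, nimbus_port=6627):
    subprocess.check_call([
        "storm",
        "-c", "nimbus.host={}".format(nimbus_host),
        "-c", "nimbus.thrift.port={}".format(nimbus_port),
        "jar", jar_path, main_class,
    ])

if __name__ == "__main__":
    submit_jar("_build/wordcount.jar", "com.example.WordCountTopology", "nimbus.example.com")
```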

@codywilbourn
Contributor

If we switch to putting everything in the JAR, wouldn't that imply building the venv on the user's machine, which may be different from the deployment target?

@dan-blanchard
Member Author

Yes, but isn't it probably a good idea for people to be developing in a VM that's the same as the deployment target anyway?

@coffenbacher

This seems like a great idea to me. As long as it's configurable for anyone who ends up with compilation issues, this would be a nice win for simplicity IMO.

@dan-blanchard
Member Author

I actually just thought of an interesting way we might be able to make this work. Apparently Storm supports hooks that get triggered on certain events. As part of that, each hook can have a prepare method that is called when the TopologyContext is put together, which happens before the actual ShellBolt and ShellSpout prepare methods are called. It would take a tiny bit of JVM code, but we could implement a hook whose only purpose is to build the virtualenv from a requirements file we put in the JAR. That way you're always building on the same architecture, and we wouldn't be bloating the JAR size.

From what I can tell, we would just need to make sure that the virtualenv only gets built once, because TopologyContext.addHook gets called for every component in the topology if we use the topology.auto.task.hooks config setting, which is the simplest way to add hooks.

@amontalenti @msukmanowsky Any thoughts on this?
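The hook itself would be a small piece of JVM code, but here is a rough Python sketch of the build step such a hook could kick off on each worker, with all paths assumed and a file lock so the environment is only built once per machine even though the hook is registered for every component:

```python
# Hypothetical sketch of the venv-building work a Storm hook could trigger on
# each worker. Paths are assumptions; fcntl.flock serializes concurrent calls
# so only the first component on a machine actually builds the environment.
import fcntl
import os
import subprocess

def ensure_virtualenv(resources_dir, venv_dir):
    lock_path = venv_dir + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until any in-progress build finishes
        try:
            if not os.path.exists(os.path.join(venv_dir, "bin", "python")):
                subprocess.check_call(["virtualenv", venv_dir])
                subprocess.check_call([
                    os.path.join(venv_dir, "bin", "pip"), "install",
                    "-r", os.path.join(resources_dir, "requirements.txt"),
                ])
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```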

@msukmanowsky
Contributor

The hooks thing seems interesting, but I think I'd like to try pinging the Apache team about supporting a topology-level hook, as opposed to us doing some flock work to avoid race conditions from multiple components all executing the same code.

No gotchas with this approach for other package managers like conda, right? I haven't used conda yet.

@dan-blanchard
Member Author

It would have to be more than a topology-level thing, since it would have to run on every machine the topology is running on. One way to avoid locks and race conditions would be to make each shell component run in its own independent environment.

If we were using conda, we wouldn't even need to store multiple copies of everything that way, because conda hardlinks packages from its package cache into each environment (and has its own locking mechanism to make sure two conda commands aren't messing with the package index at the same time).
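As a sketch of the one-environment-per-component idea, assuming conda is available on the workers and with made-up component names and paths, each shell component could get its own prefix while conda's hardlinking from its package cache keeps the extra disk cost small:

```python
# Hypothetical sketch: one conda environment per shell component. conda
# hardlinks packages from its package cache into each environment, so many
# environments with the same dependencies add little extra disk usage.
import os
import subprocess

def ensure_component_env(component_name, requirements_file, envs_root="/opt/storm/envs"):
    prefix = os.path.join(envs_root, component_name)
    if not os.path.exists(os.path.join(prefix, "bin", "python")):
        subprocess.check_call(["conda", "create", "--yes", "--prefix", prefix, "python", "pip"])
        subprocess.check_call([
            os.path.join(prefix, "bin", "pip"), "install", "-r", requirements_file,
        ])
    return prefix
```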

@sixers

sixers commented Apr 25, 2015

This is important for me as well; we're using streamparse to deploy machine learning models. Compiling numerical and machine learning libraries during deployment is very painful and takes a lot of time :)

@westover

+1 for this based on the mailing list request I put in and feedback from @rduplain

@rduplain rduplain added the ready label Aug 27, 2015
@rduplain rduplain changed the title JARs should be self-contained and not rely on external virtualenvs JARs should be self-contained and not rely on external virtualenvs [or use Storm hooks] Sep 1, 2015
@rduplain
Contributor

This depends on #84 as a prerequisite.

@rduplain rduplain changed the title JARs should be self-contained and not rely on external virtualenvs [or use Storm hooks] JARs should be self-contained and not rely on external virtualenvs [or use Storm hooks and get rid of SSH] Oct 16, 2015
@rduplain rduplain removed the ready label Oct 26, 2015
@dan-blanchard dan-blanchard modified the milestones: v3.0, v3.1 Jul 27, 2016
@dan-blanchard dan-blanchard modified the milestones: v3.1, v3.2 Sep 1, 2016
@westover

What's the status on this?

@dan-blanchard
Member Author

This is mostly waiting on pex-tool/pex#316 being merged. Once that's in, we'll transition from primarily using virtualenvs to using PEX (see #212). With a PEX, we can ship everything we need inside the JAR. There will need to be a little bit of work to get around the fact that executable permissions are lost when you create a JAR, but the main holdup is PEX not supporting manylinux wheels. Without those, you can't really deploy to a Linux machine from OS X or Windows if one of your project's dependencies needs to be compiled.
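As an illustration of the permissions wrinkle mentioned above (not the actual streamparse implementation), once the .pex is pulled back out of the JAR on a worker, the executable bit that was lost at packaging time has to be restored by hand. The file names here are assumptions:

```python
# Hypothetical sketch: extract a .pex shipped inside the topology JAR and
# restore the executable bit that gets lost when the file is packed into the
# JAR. The archive member name and destination are illustrative.
import os
import stat
import zipfile

def extract_pex(jar_path, pex_name="resources/topology.pex", dest_dir="."):
    with zipfile.ZipFile(jar_path) as jar:
        extracted = jar.extract(pex_name, dest_dir)
    mode = os.stat(extracted).st_mode
    os.chmod(extracted, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return extracted
```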

@westover

@dan-blanchard can I clarify that the current `sparse jar` command does not bundle the venv like Pyleus did?

@dan-blanchard
Member Author

dan-blanchard commented Sep 15, 2017 via email

@Richard-Mathie

If you're interested, my workaround for this is using Docker. It's a bit ugly, but bear with me.

See here: https://github.com/Richard-Mathie/storm_swarm

Basically the idea is to deploy the Storm workers as a Docker service using Docker swarm, mounting a volume for the virtual environments to live in. You can then have another service that builds the virtual environment for those Storm workers to run from.

If you deploy the services as global, any nodes you add to the swarm automatically get added to the Storm cluster, and your venvs get built.

Building is done using pip and a requirements.txt file, which is distributed to the nodes using a Docker secret (though they have configs now as well). Change requirements? Then update the Docker secret, which will trigger a restart of the storm_venv service. Finally, I have to disable SSH in streamparse and put dummy entries into the nodes list so that it can set the number of workers to deploy to.

I'm looking forward to the day, though, when I just have to submit a JAR to Nimbus.
