WeeklyTelcon_20190723
Geoffrey Paulsen edited this page Jul 23, 2019 · 2 revisions
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Brendan Cunningham (Intel)
 - Brian Barrett (Amazon)
 - Edgar Gabriel (UH)
 - Geoff Paulsen (IBM)
 - Harumi Kuno (HPE)
 - Howard Pritchard (LANL)
 - Jeff Squyres (Cisco)
 - Josh Hursey (IBM)
 - Joshua Ladd (Mellanox)
 - Noah Evans (Sandia)
 - Ralph Castain (Intel)
 - Thomas Naughton
 - Todd Kordenbrock
 
- Akshay Venkatesh (nVidia)
 - Aravind Gopalakrishnan (Intel)
 - Arm (UTK)
 - Artem Polyakov (Mellanox)
 - Brandon Yates (Intel)
 - Dan Topa (LANL)
 - David Bernhold
 - Geoffroy Vallee
 - George Bosilca (UTK)
 - Jake Hemstad
 - Mark Allen (IBM)
 - Matias Cabral
 - Matthew Dosanjh (Sandia)
 - Michael Heinz (Intel)
 - Nathan Hjelm
 - Peter Gottesman (Cisco)
 - Xin Zhao (Mellanox)
 - mohan
 
- 
Git submodules
- This PR is in progress. It requires CI owners to add --recursive to their Jenkins git clone commands.
 - As a first step, Jeff created:
- PR 6821 "hwloc201 use a submodule"
 
 
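As a rough illustration of the CI change, the sketch below adds `--recursive` to a git clone argument list when it is missing. The helper name `with_recursive` and the argument-list representation are assumptions made for illustration and are not taken from any Jenkins configuration; `--recursive` itself is the standard git clone option that also checks out submodules.

```python
def with_recursive(clone_cmd):
    """Return a copy of a `git clone` argument list that includes --recursive.

    Hypothetical helper: it only shows the one-flag change CI owners would make.
    """
    if clone_cmd[:2] != ["git", "clone"]:
        raise ValueError("expected a 'git clone' command")
    if "--recursive" in clone_cmd or "--recurse-submodules" in clone_cmd:
        # Already clones submodules; return an unchanged copy.
        return list(clone_cmd)
    # Insert the flag right after 'clone' so it precedes the repository URL.
    return clone_cmd[:2] + ["--recursive"] + clone_cmd[2:]


cmd = with_recursive(["git", "clone", "https://github.com/open-mpi/ompi.git"])
print(" ".join(cmd))
```

An existing checkout that was cloned without the flag can fetch its submodules afterwards with `git submodule update --init --recursive`.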
 - 
What to do with OFI BTL and OFI MTL
- Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
 - mail archive: https://www.mail-archive.com/[email protected]/msg20736.html
 - The OFI BTL and MTL components can step on each other.
 - PSM2: when a user of PSM2 calls PSM2_Finalize while there is also a PSM2 provider, PSM2's refcounting is only observed when initializing, not when finalizing, so the first finalize was finalizing PSM2 for the entire job.
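The problem described above is a reference count that is honored on initialization but not on finalization. A minimal sketch of the symmetric behavior one would expect is shown here. `RefCountedLibrary` is a hypothetical toy class and not the PSM2 API; it only illustrates that the first finalize should leave the library usable for the remaining user.

```python
class RefCountedLibrary:
    """Toy stand-in for a library shared by several components (not the PSM2 API)."""

    def __init__(self):
        self.refcount = 0
        self.active = False

    def init(self):
        # The first caller actually initializes; later callers only bump the count.
        if self.refcount == 0:
            self.active = True
        self.refcount += 1

    def finalize(self):
        if self.refcount == 0:
            raise RuntimeError("finalize called without a matching init")
        self.refcount -= 1
        # Only the last user tears the library down.
        if self.refcount == 0:
            self.active = False


lib = RefCountedLibrary()
lib.init()      # e.g. the first component using the library
lib.init()      # e.g. the second component using the library
lib.finalize()  # the first finalize must not shut the library down
print(lib.active)
lib.finalize()  # the last finalize does
print(lib.active)
```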
 
 - 
Status of Scale testing
- No update
 - Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
 - Issue 6198 "SSH launch fails when host file has more than 64 hosts"
 - IBM is also working on something like this (for ssh launch)
- Prefer to run this every night instead of on each PR.
 
 
 - 
Issue 6799 "UVM buffers failing in cuIpcGetMemHandle ?"
- No update
 
 - 
- https://engineering.mongodb.com/post/succeeding-with-clangformat-part-1-pitfalls-and-planning
 - Should get this cleaned up. Need one big PR fix.
 - Whitespace vs Tab cleanup.
 - Good conversation on PR.
 - Should we have CI for this?
 - MongoDB did something similar; their post covers branches, issues, and why they went with clang.
 - After folks write the scripts, then adding to CI is no problem.
 - Want it to be EASY to add local githooks so CI isn't the first line of defense for these.
 - Giant clean up commits should be done on each
 - Implementation details:
- It might be easy to use clang for the CI / formatting.
 - clang enforces a set of things, but it may require more than
 - We have a requirement in Open MPI that says you write 'if (NULL == var)'
- This is very hard to enforce in perl, and gcc can't give us an AST to check at that level.
 
 - run clang far enough to get AST, to do formatting.
- you can now run clang_format.py reformat-branch T R (using T and R from the algorithm above) to easily bring a stranded topic branch forward after a reformat commit.
 
 - Adding yet another dependency (like clang) would be painful, since most of us don't use clang.
 
 - Whitespace is how this started, so perhaps just fix the whitespace issues, and use both githooks and CI to enforce them.
 - Scripts are mentioned in the PR.
 - Most of these scripts UPDATE the git commit, so for CI we want them to only check.
 - Command line example on how to add to git hooks.
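The point about scripts that check rather than update can be sketched as a read-only whitespace scan. This is only an illustration: the function name `find_whitespace_problems` and the two rules it applies (no tab characters, no trailing whitespace) are assumptions, not the scripts referenced in the PR.

```python
def find_whitespace_problems(text):
    """Return (line_number, description) pairs without modifying the input."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


# Line 2 contains a tab and line 3 ends with a space.
sample = "int x;\n\tint y;\nint z; \n"
for number, description in find_whitespace_problems(sample):
    print(f"line {number}: {description}")
```

A git hook or CI job could run such a check on the changed files and fail when the list is non-empty, leaving any rewriting of the commit to a separate, opt-in step.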
 
 
- Complete
 
- No update
 
- Suggest just doing hwloc (stable and not too much development) first
 - No update
 
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- v3.0.x MPIR_Breakpoint issue need a bit more data why -O3
 - Tested new PMIx
- Exposed a few new test suite issues in "ibm", but fixed
 
 
Review v4.0.x Milestones v4.0.2
- George is back from vacation, want two things before rc1
- Datatype work, master PR for datatypes
 - Also ob1 get/put path problem
 
 - Howard is verifying 6613 MPIR Disappearing queue on re-attach.
 - PR6806 - Want to wait until CI is back.  Do we have any tests to test this?
- Howard will reproduce and add to ibm suite
 
 - 2nd Put issue PR 6568 (Vader deadlocking with 4MB transfers)
- waiting on George to return (end of the month)
 
 - New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)
- Want for v4.0.2
 - Now approved for master.
 - waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
 
 - 
https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.
- Right now only shows in vader, because all others prefer get protocol.
 - Vader generates a bunch of 32K frags, so a 4MB transfer overwhelms vader.
 - Does NOT occur with single copy like CMA or KNEM.
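To give a sense of scale for the fragment count, here is a small arithmetic sketch. It assumes "32K" means 32 KiB and "4MB" means 4 MiB, which the notes do not state explicitly; the helper name `fragment_count` is hypothetical.

```python
def fragment_count(message_bytes, fragment_bytes):
    """Number of fragments needed to carry message_bytes (ceiling division)."""
    return -(-message_bytes // fragment_bytes)


KIB = 1024
MIB = 1024 * KIB

# Assumed sizes: a 4 MiB message split into 32 KiB fragments.
print(fragment_count(4 * MIB, 32 * KIB))  # prints 128
```

Under those assumptions a single 4MB put turns into 128 fragments, which is consistent with the note that the loss of pipelining overwhelms vader.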
 
 - Issue 6789 - OMPI crashes when configured with ucx version
- Issue with PML UCX conflicting with btl_uct - memory hooks
 - New this week: Howard not convinced it's memory hooks.
 
 
Review Master Master Pull Requests
- PR6556 and 6621 should go to the release branches.
- no update
 
 - Good reminder that we now need to be careful about OPAL's ABI.
 
- Not a great way to test CI before
 
- When do we get rid of 32bit?
 - Still don't have any release manager.
- Need to identify someone in next few months.
 
 
- A bunch of stuff is going on, but nothing necessarily impacting OMPI.
 - Made a change for Nathan - allow you to get locality of other processes on node.
- Allows you to hook up with shared memory
 
 - The master version of PMIx can support network coordinates of any NIC and, depending on the type of network, can map them for each process.
- "network coordinates" - map to MPI network topology definition.
 - Fujitsu and Cray are implementing this.
 
 - In PMIx, when you do instant-on, the scheduler queries the ___ plugin to get a payload of the info you want. If the process is bound to a certain socket, the payload says which NIC it should use and which others are available. Then you assign the endpoint to that NIC.
- Requires Instant-On? - simple to do without instant-on if you want to.
 
 
- Aug 7th - web-ex meeting.
 - Gilles' PRRTE work was done differently than what we're now proposing. The new proposal uses submodules, etc.
- PR6339 - he has closed it and opened a new branch to look at.
 - Howard reviewed PR6339 and likes everything that Gilles did.
 
 
- IBM has to triage some failures on master and v4.0.x