
Multi-threaded cores and HPC-HDF5

What are the issues we face in the long term on these exascale systems that will likely involve multi-core chips running multiple threads? How do we expect applications to use those threads to do their work? How do we expect the existence of threads to impact how/what the application does in the way of I/O?

Thread Safety versus Thread Implementation

These two issues can be confused. In the context of libraries like Silo and HDF5, threads can have implications on two levels:

  • Is the library thread safe?
  • Is the library threaded?

Thread Safety & Concurrency

By thread safe we mean that an application running on multiple threads can make calls into the library from any of those threads without encountering problems endemic to threads, such as corruption of global data structures or race conditions. When a library is not thread safe, a multi-threaded application has to be re-engineered (slightly) to ensure that all use of that library occurs on only one thread at a time (see the sketch after the next paragraph). Concurrency refers to a thread-safe library’s ability to be used simultaneously by multiple threads (see Note 2).

Silo is not a thread safe library. HDF5 is thread safe but is not concurrent; the whole library is locked when a thread enters an API routine. The problem with HDF5’s thread safety is that locking is done on code (functions in the API or below it) when it should instead be done on data. It’s a challenging enough problem to fix that, although it has been discussed many times over the last several years, no funding agency has wanted to support it.
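
To make the re-engineering point above concrete, here is a minimal sketch, assuming a pthreads application; the mutex name and the write_domain() worker are hypothetical, and the Silo calls appear only in comments. The idea is simply that only one thread at a time is ever allowed inside the non-thread-safe library.

```c
/* Sketch: serializing calls into a library that is not thread safe.
 * silo_mutex and write_domain() are illustrative names, not part of Silo. */
#include <pthread.h>

static pthread_mutex_t silo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical per-thread worker in a multi-threaded application. */
void *write_domain(void *arg)
{
    (void)arg;

    /* ... purely thread-local compute may proceed in parallel ... */

    pthread_mutex_lock(&silo_mutex);   /* only one thread in the library */
    /* e.g. DBPutQuadmesh(...), DBPutQuadvar1(...) calls would go here */
    pthread_mutex_unlock(&silo_mutex);

    return NULL;
}
```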

Threaded Implementation

By threaded implementation, we mean the library is designed to use multiple threads to do its work. For a computational kernel library like LINPACK, a threaded implementation can be important because the primary service the library provides is a computational one. The additional threads parallelize the computational work of the library.

For an I/O library, where the primary service the library performs is to move data between memory and disk, the value of employing threads is unclear. A purely I/O library is one that engages in no problem-sized work (operations on arrays/buffers passed into the library from the caller) and simply passes application data it is handed on to the underlying I/O interfaces (section 2 calls, stdio, MPI-IO, etc.). However, both Silo and HDF5 do support operations on the data as it moves between memory and disk. These operations include

  • compression (both Silo and HDF5)
  • numeric architecture conversion (e.g. Cray float to IEEE-754 – HDF5)
  • datatype conversion (e.g. int to double – HDF5)
  • transposition (e.g. row-major to column-major)
  • sieving/subsetting (e.g. hyperslabbing (relevant only for partial I/O) – HDF5)
  • other?

For a detailed description and flowchart of the operations HDF5 performs during an H5Dwrite call, see chunk write actions.
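
As a minimal sketch of two of the operations listed above, the example below (the file name, dataset name, sizes, and deflate level are arbitrary choices, not taken from this page) creates a chunked, gzip-compressed dataset stored as 32-bit IEEE floats and writes a buffer of native doubles into it, so HDF5 performs both compression and datatype conversion as part of the H5Dwrite.

```c
/* Sketch: compression and datatype conversion happening inside H5Dwrite. */
#include <hdf5.h>

int main(void)
{
    hsize_t dims[1]  = {1024};
    hsize_t chunk[1] = {256};
    double  data[1024];
    for (size_t i = 0; i < 1024; i++) data[i] = (double)i;

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* Dataset creation property list: chunking is required for compression. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);               /* gzip, level 6 */

    /* Dataset stored as 32-bit IEEE float; the memory buffer is double,
       so HDF5 converts the datatype during the write. */
    hid_t dset = H5Dcreate2(file, "vals", H5T_IEEE_F32LE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```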

Nonetheless, performance studies have shown that even without threads, the HDF5 library can perform these operations at speeds well above the associated disk I/O bandwidth. So, what value is there in making them any faster by employing threads when the associated disk I/O is going to dominate any particular data movement operation (see Note 1)? The answer is unclear.

We have identified a few ways in which threads could perhaps be gainfully employed within HDF5 at exascale to facilitate I/O:

  • Asynchronous I/O: to reduce lock latency in the existing thread safety mechanism.
    Here, we introduce asynchronous I/O and employ threads in HDF5 so that a given HDF5 call holds the lock for as short a period of time as possible: the call issues an asynchronous I/O request and hands the work of servicing that request off to another thread (see the sketch after this list). Latencies in HDF5’s existing locking mechanism are then reduced because the lock need only be held long enough to (perhaps make a buffer copy and) issue the asynchronous request and return.
  • Compression: We can employ more expensive and exotic compression schemes. Why not? We’ll have the compute resources to do it, and if we can gain an extra 25-50% compression at no cost, we’d be crazy not to try. Quincey observed that for Rich Man’s parallel I/O, compression is not easily handled due to the unpredictability of block sizes (and their subsequent locations within the file). But employing variable-loss, fixed-size compression schemes (for plot purposes only), such as wavelets, would be a potential big win here.
  • What about progressive I/O, where data is re-organized before writing such that the most important stuff gets written first, followed by less important, and so on? Would that be useful? (QAK: Sounds like an interesting idea)
  • I/O Aggregation and Optimization: We could employ extra threads in HDF5 (along with perhaps local flash memory) to aggregate a lot of smaller I/O requests from the caller into a single larger I/O request that then gets shipped out (Neely/Keasler). For example, if HDF5 had the windowed group or group-in-core (e.g. like core VFD but works on a group-by-group basis), then a single domain’s worth of data could be written to a group on local flash memory and that whole shootin-match of data could then be written to the real HDF5 file in a single I/O request from the processor (Neely/Keasler) (QAK: Yes, I like this idea also :-)
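
A minimal sketch of the asynchronous hand-off described in the first bullet above, assuming a plain pthreads producer/consumer queue; all names here (async_write, io_service, request_t) are hypothetical and this is not HDF5’s internal mechanism. The caller’s lock need only be held long enough to copy the buffer and post the request; a single service thread performs the slow I/O later.

```c
/* Sketch of the async hand-off idea: the caller's "write" only copies the
 * buffer and posts a request; a service thread performs the real I/O.
 * All names are illustrative; this is not HDF5's internal design. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct request {
    void           *buf;       /* private copy of the caller's data */
    size_t          nbytes;
    struct request *next;
} request_t;

static request_t      *queue_head = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* Called by application threads: returns as soon as the request is posted. */
void async_write(const void *data, size_t nbytes)
{
    request_t *req = malloc(sizeof *req);
    req->buf    = malloc(nbytes);
    req->nbytes = nbytes;
    memcpy(req->buf, data, nbytes);          /* the buffer copy noted above */

    pthread_mutex_lock(&queue_lock);         /* lock held only briefly */
    req->next  = queue_head;
    queue_head = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Service thread: drains the queue and does the actual (slow) I/O. */
void *io_service(void *arg)
{
    (void)arg;
    for (;;) {                               /* runs for the life of the app */
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        request_t *req = queue_head;
        queue_head = req->next;
        pthread_mutex_unlock(&queue_lock);

        /* ... actual write (e.g. H5Dwrite or a POSIX write) goes here ... */

        free(req->buf);
        free(req);
    }
    return NULL;
}
```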

MPI and OpenMP types of parallelism

MPI-like parallelism is where the application is engineered to explicitly handle messaging between tasks using the MPI message-passing library (or equivalent). OpenMP-like parallelism is much finer grained and typically handled via #pragma statements interpreted by the compiler. Work on a large array of data is then assigned to a variable number of tasks (threads) and the compiler handles all the issues (messaging, locking, whatever) under the covers.

Could/would every thread in an exascale app look like any other MPI task does now? Apparently, the MPI community is aiming to enable this degree of transparency, such that a quality MPI-2/3 implementation would handle sending messages with whatever native efficiencies are possible between threads/chips. However, for an application like Ale3d, such an approach is likely to be impractical given the memory constraints of each domain-level task. So, instead, the way Ale3d might handle this is to run with one or just a few uber domains on a chip. MPI parallelism would occur between uber domains (chips) while OpenMP parallelism would be used within the chip to operate on the one or few domains over the threads there (Neely/Keasler). How would this affect the way Ale3d might do its I/O?
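
A minimal sketch of the hybrid layout just described, assuming one MPI rank per uber domain with OpenMP threads working within it; the array size and arithmetic are made up for illustration.

```c
/* Sketch of hybrid MPI + OpenMP parallelism: one MPI rank per "uber domain",
 * OpenMP threads doing the fine-grained work within it. Illustrative only. */
#include <mpi.h>
#include <omp.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double domain[N];            /* this rank's (uber) domain data */

    /* OpenMP parallelism within the chip */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        domain[i] = rank + 0.001 * i;

    /* MPI parallelism between uber domains (e.g. reductions, halo exchange) */
    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++)
        local_sum += domain[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```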

For many of the ways of employing a threaded implementation of HDF5 described above, OpenMP-style parallelism makes the most (and perhaps the only) sense.

Threads provide more parallelism in compute, not I/O

The I/O pathways on/off chip are not improving. Indeed, the gap between processor speeds and disk I/O has continued to widen over the last decade. This is true on single-CPU systems as well as parallel systems. Increasing parallelism on chip with extra cores and threads is great from a compute standpoint but can be leveraged very little, if at all, to help move data on and off chip. About the only thing threads could possibly help with is hiding some I/O behind compute by doing asynchronous I/O and handing the actual I/O work off to a different thread. At the same time, few codes are designed with asynchronous I/O in mind; they would have to be either retro-fitted substantially to add logic checking that buffer reads and writes have completed before reusing those buffers (see the sketch below), or the underlying I/O libraries would have to make buffer copies (chewing up precious memory).
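
To illustrate that retro-fit burden, here is a minimal sketch using POSIX aio rather than any HDF5 facility (the file name and buffer size are arbitrary): once the write is issued asynchronously, the application itself must add the logic that confirms completion before the buffer may be reused.

```c
/* Sketch of the retro-fit burden async I/O places on a code: with POSIX aio,
 * the application must confirm the write finished before reusing the buffer. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buffer[1 << 20];                /* data to be written */
    int fd = open("dump.dat", O_WRONLY | O_CREAT, 0644);

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = sizeof buffer;

    aio_write(&cb);                             /* returns immediately */

    /* ... overlap compute here, but do NOT touch buffer yet ... */

    while (aio_error(&cb) == EINPROGRESS)       /* completion check the code  */
        ;                                       /* must now contain somewhere */
    ssize_t written = aio_return(&cb);
    (void)written;

    /* only now is it safe to reuse or free buffer */
    close(fd);
    return 0;
}
```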

Conclusions

The existence of threaded execution in applications on exascale systems is unlikely to change how I/O needs to be done. We don’t want to do I/O independently from each thread, do we? Is there any reason we should try? Threads are really for compute anyway. There is no additional off-processor bandwidth that using multiple threads for I/O will give us. So, although it seems counter-intuitive, from an I/O standpoint the thread headaches we’re anticipating with exascale don’t seem to be relevant.

Notes

Note 1:
None of HDF5’s data operations listed in this section are performed apart from I/O operations, at least none that I know of. That means they are only performed as part of some larger I/O operation and are not an operation HDF5 provides apart from I/O. This may not be true of numeric architecture conversions.

Note 2:
Quincey explained that the thread safety mechanism affects the ability of an app running on multiple threads to call into a library concurrently. HDF5’s existing thread safety (locking) mechanism permits NO concurrency. He introduced the concept of internal vs. external concurrency: internal concurrency is where the library spawns/uses threads to perform work, while external concurrency is where the library can be called from multiple threads.