Conversation

esyr
Member

@esyr esyr commented Sep 25, 2025

Includes the relevant update to the pkeyread test, as it already tries to report some thread indices in the -v mode.

@esyr esyr requested review from Sashan and jogme September 25, 2025 23:53
Contributor

@Sashan Sashan left a comment

Looks good. I could find just a few nits in threads.c that you might want to address.
Thanks.

unsigned int ret = 0;

for (size_t i = 0; i < sizeof(a) * CHAR_BIT; i++)
ret += ((a & (1ULL << i)) == 0);
Contributor

I think this is what I messed up in my suggested change. We discussed this off-list: we are supposed to count the bits which are set, right? If so, then we need ret += ((a & (1ULL << i)) != 0); here.
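For reference, a minimal sketch of the loop with the comparison flipped so that it counts set bits. The `affinity_t` typedef and the name `popcount_portable` are assumptions made for illustration; the real definitions live in perflib and may differ.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: affinity_t is an unsigned integer bitmask type. */
typedef unsigned long affinity_t;

/* Hypothetical name; counts the bits that are set in a, without builtins. */
static unsigned int popcount_portable(affinity_t a)
{
    unsigned int ret = 0;

    /* Test every bit position; add 1 for each bit that is set. */
    for (size_t i = 0; i < sizeof(a) * CHAR_BIT; i++)
        ret += ((a & ((affinity_t)1 << i)) != 0);

    return ret;
}
```

With `== 0` in place of `!= 0` the same loop would count the cleared bits instead, which is the mix-up described above.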

goto err;
}

ta = OPENSSL_malloc(sizeof(*ta) * threadcount);
Contributor

I think it is good to use OPENSSL_malloc() so tests work with libraries which don't provide OPENSSL_malloc_array()

args[i].num = i;
perflib_run_thread(&threads[i], &args[i]);
if (!(run_threads[i] = perflib_run_thread_(&threads[i], &args[i],
ta + i)))
Contributor

Can we use &ta[i] here so it is clear we are working with an array? Thanks.

@esyr esyr force-pushed the esyr/thread-affinity branch 2 times, most recently from efab928 to c033606 Compare October 16, 2025 12:18
esyr and others added 8 commits October 16, 2025 15:15
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Co-Authored-by: Alexandr Nedvedicky <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
@esyr esyr force-pushed the esyr/thread-affinity branch from c033606 to 30f2a1f Compare October 16, 2025 13:16
@esyr esyr requested a review from Sashan October 16, 2025 13:30
@esyr esyr marked this pull request as ready for review October 16, 2025 13:31
static ossl_inline unsigned int popcount(affinity_t a)
{
return __builtin_popcountl(a);
}
Contributor

do we really need to special case the ability to use a compiler built in here? It seems like the balance between the ifdeffery here and a single function that counts up to sizeof(unsigned long) * 8 bits is biased in favor of just having one function.

Member Author

I just don't like the idea of rolling our own implementation when the built-in is right here, but I don't really care here.

Contributor

if we want to use compiler built-ins can we also enable them for clang?

diff --git a/source/perflib/threads.c b/source/perflib/threads.c
index 8cf3a76..4f9187c 100644
--- a/source/perflib/threads.c
+++ b/source/perflib/threads.c
@@ -22,7 +22,7 @@
 /** affinity_t-typed value with nth bit set. */
 #define AFFINITY_BIT(n) ((affinity_t)1U << (n))
 
-#if defined(__GNUC__)
+#if defined(__GNUC__) || defined(__clang__)
 
 static ossl_inline unsigned int popcount(affinity_t a)
 {
@@ -41,7 +41,7 @@ static ossl_inline unsigned int popcount(affinity_t a)
     return ret;
 }
 
-#endif /* __GNUC__ */
+#endif /* __GNUC__ or __clang__ */
 
 int perflib_roundrobin_affinity(affinity_t *cpu_set_bits, size_t cpu_set_size,
                                 size_t num, size_t cnt, void *arg)

To be honest, I'm with Neal here. My reasoning is that the perftools need to be portable to as many platforms/compilers as (conveniently) possible. You are rolling your own implementation anyway, so using a builtin one here does not buy us much.

On the other hand, if we limit ourselves to clang and GCC tools, then I'm fine with going with the builtin-only one.

The true reason I don't like the if/else here is that it leaves dead/untested code behind. In my opinion the true choice here should be:

  • being portable, then roll your own
    or
  • let's rely on the compiler, then the code will work on platforms where a builtin is provided

In my view the perftools are a roll-your-own case.

"\t-v verbose output, includes min, max, stddev, and median times\n"
"\t-T timeout for each test run in seconds, can be fractional"
"\t-T timeout for each test run in seconds, can be fractional\n"
"\t-b Set CPU affinity for the threads (in round robin fashion)\n"
Contributor

what about adding this option to all the other tests in the repo?

Member Author

I was prototyping on pkeyread, but, yeah, adding it to other tests should be trivial.

Contributor

I think I understand Nikola's question better now, and I think he is making a good point. Let me ask the question a different way: what is the difference between running the test using the command:

./pkeyread -f all -k all -b 16

and

taskset  0xffff ./pkeyread -f all -k all 16

If I understand things right, then -b is a shortcut so people don't need to think of using taskset(1). Is my understanding correct?

OSSL_TIME max_time;

int err = 0;
int error = 0;
Contributor

Why this change? There's a good portion of this PR dedicated to renaming variables that doesn't really have anything to do with the addition of thread affinity management.

Contributor

This addresses linker issues. There is a function err() which conflicts with variables named err. The changes in this PR just uncovered this conflict, so the rename got included here.


#include <string.h>
#include <openssl/crypto.h>
#include <openssl/macros.h>
Contributor

why is this needed? There is no other change in this file

}

void
err(int status, const char *fmt, ...)
Contributor

Why duplicate the errx function? Same for warn and warnx.

Member Author

err/warn append the output of perror() to the message, while errx/warnx just print the provided string (along with the program name as a prefix).
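To make the distinction concrete, here is a rough sketch that builds both message shapes into a buffer. `format_diag` is a hypothetical helper written for this illustration, not the code added in this PR: passing a non-zero `errnum` gives the err/warn shape, passing 0 gives the errx/warnx shape.

```c
#include <assert.h>
#include <errno.h>   /* E* constants that callers pass as errnum */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes "prog: message" into buf and, when errnum is
 * non-zero, appends ": " followed by strerror(errnum).
 * Returns 0 on success, -1 if formatting failed or the buffer is too small.
 */
static int format_diag(char *buf, size_t size, const char *prog, int errnum,
                       const char *fmt, ...)
{
    va_list ap;
    int off, n;

    assert(buf != NULL && size > 0);

    /* Program name prefix, common to both flavours. */
    off = snprintf(buf, size, "%s: ", prog);
    if (off < 0 || (size_t)off >= size)
        return -1;

    /* The caller-supplied message. */
    va_start(ap, fmt);
    n = vsnprintf(buf + off, size - (size_t)off, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= size - (size_t)off)
        return -1;
    off += n;

    /* err/warn flavour only: append the error description. */
    if (errnum != 0) {
        n = snprintf(buf + off, size - (size_t)off, ": %s", strerror(errnum));
        if (n < 0 || (size_t)n >= size - (size_t)off)
            return -1;
    }

    return 0;
}
```

For example, calling it with `errnum` set to 0 and the message "bad option 7" for program "pkeyread" yields "pkeyread: bad option 7", while a non-zero `errnum` adds a colon and the corresponding error description at the end.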

Contributor

I see now; sorry for the noise

# include <err.h>

# else /* _WIN32 */

We don't use new lines around include in #if.

@npajkovsky

The work is ok, but I'm a little bit lost why the work is needed.

@Sashan
Contributor

Sashan commented Oct 17, 2025

> The work is ok, but I'm a little bit lost why the work is needed.

My understanding is that you want to pin a thread to a CPU so the scheduler does not migrate the thread which runs the performance test around the system. I think this is not a problem on systems with a low number of cores. It becomes more important on large multicore systems.

@nhorman
Contributor

nhorman commented Oct 17, 2025

> I think I understand Nikola's question better now, and I think he is making a good point. Let me ask the question a different way: what is the difference between running the test using the command:

I think the difference between:

./pkeyread -f all -k all -b 16

and

taskset  0xffff ./pkeyread -f all -k all 16

Is that in the latter case we rely on the OS scheduler to place threads on unique cores.

In the former case thread 1 is guaranteed to have an affinity of 0x1, thread 2 an affinity of 0x2, thread 3 an affinity of 0x4, etc.

In the latter, all threads can run on any core in the affinity set. Will they likely be scheduled to unique cores? Probably. Are they guaranteed to be? No.

I guess the question to ask is "Does that matter to us?", and honestly, I'm not sure of the answer there.
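The per-thread masks described above can be sketched as a pure computation. `roundrobin_mask` is a hypothetical helper for illustration only (it is not perflib_roundrobin_affinity from this PR), it assumes a system with at most 64 CPUs, and it uses 0-based thread indices, so index 0 corresponds to "thread 1" above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the set of CPUs the process may use (allowed)
 * and a 0-based thread index, return a mask with exactly one bit set,
 * namely the (idx mod N)-th set bit of allowed, where N is the number of
 * set bits. Returns 0 if allowed is empty.
 */
static uint64_t roundrobin_mask(uint64_t allowed, size_t idx)
{
    unsigned int n = 0;
    size_t target;

    /* Count the CPUs available to the process. */
    for (unsigned int i = 0; i < 64; i++)
        n += (unsigned int)((allowed >> i) & 1);

    if (n == 0)
        return 0;

    /* Wrap around when there are more threads than CPUs. */
    target = idx % n;

    /* Walk the set bits until the target one is reached. */
    for (unsigned int i = 0; i < 64; i++) {
        if ((allowed >> i) & 1) {
            if (target == 0)
                return (uint64_t)1 << i;
            target--;
        }
    }

    return 0; /* not reached when allowed != 0 */
}
```

With an allowed set of 0xffff, indices 0, 1 and 2 map to 0x1, 0x2 and 0x4, and index 16 wraps back to 0x1. With taskset alone, every thread would instead keep the full 0xffff mask.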

@esyr
Member Author

esyr commented Oct 17, 2025

> The work is ok, but I'm a little bit lost why the work is needed.

So, the original reason I ended up writing that is that while working on x509storeissuer updates, I started seeing some anomalous results, and wanted to exclude that aspect from the list of possible factors. In general, pinning threads helps with the following:

  • it minimises noise from rescheduling and from differences in the performance of specific CPU cores across test runs;
  • it allows referring to thread numbers (which is sometimes useful in cases of anomalous performance of some of them), as they correlate with CPU cores that way;
  • it allows providing specific thread mappings onto the system's topology, which is useful in conjunction with some other aspects of test runs, like the way some resources are shared across threads or the way some threads perform work, and/or the CPU mask set for the whole test.

All those factors are predominantly relevant only when running on NUMA systems, naturally.

@Sashan
Contributor

Sashan commented Oct 17, 2025

> All those factors are predominantly relevant only when running on NUMA systems, naturally.

Understood. My preference here is to get away with taskset(1) (if possible). It also looks like Windows offers a similar mechanism, according to Stack Overflow. taskset seems to be available on FreeBSD. Solaris has psrset(1M) to set affinity for a process. I believe other systems which can manage thread affinity expose their own command-line tooling.

In my opinion the less we do here the better.

Development

Successfully merging this pull request may close these issues.

[perftools] Add support for setting thread affinity in tests

5 participants