|  | 1 | +.. SPDX-License-Identifier: GPL-2.0 | 
|  | 2 | + | 
|  | 3 | +============= | 
|  | 4 | +False Sharing | 
|  | 5 | +============= | 
|  | 6 | + | 
|  | 7 | +What is False Sharing | 
|  | 8 | +===================== | 
|  | 9 | +False sharing is related to the cache mechanism that maintains the | 
|  | 10 | +coherence of a cache line stored in multiple CPUs' caches; the | 
|  | 11 | +academic definition for it is in [1]_. Consider a struct with a | 
|  | 12 | +refcount and a string:: | 
|  | 13 | + | 
|  | 14 | +	struct foo { | 
|  | 15 | +		refcount_t refcount; | 
|  | 16 | +		... | 
|  | 17 | +		char name[16]; | 
|  | 18 | +	} ____cacheline_internodealigned_in_smp; | 
|  | 19 | + | 
|  | 20 | +Members 'refcount' (A) and 'name' (B) _share_ one cache line, as shown below:: | 
|  | 21 | + | 
|  | 22 | +                +-----------+                     +-----------+ | 
|  | 23 | +                |   CPU 0   |                     |   CPU 1   | | 
|  | 24 | +                +-----------+                     +-----------+ | 
|  | 25 | +               /                                        | | 
|  | 26 | +              /                                         | | 
|  | 27 | +             V                                          V | 
|  | 28 | +         +----------------------+             +----------------------+ | 
|  | 29 | +         | A      B             | Cache 0     | A       B            | Cache 1 | 
|  | 30 | +         +----------------------+             +----------------------+ | 
|  | 31 | +                             |                  | | 
|  | 32 | +  ---------------------------+------------------+----------------------------- | 
|  | 33 | +                             |                  | | 
|  | 34 | +                           +----------------------+ | 
|  | 35 | +                           |                      | | 
|  | 36 | +                           +----------------------+ | 
|  | 37 | +              Main Memory  | A       B            | | 
|  | 38 | +                           +----------------------+ | 
|  | 39 | + | 
|  | 40 | +'refcount' is modified frequently, but 'name' is set once at object | 
|  | 41 | +creation time and is never modified.  When many CPUs access 'foo' at | 
|  | 42 | +the same time, with 'refcount' frequently bumped by only one CPU and | 
|  | 43 | +'name' read by the other CPUs, all those reading CPUs have to reload | 
|  | 44 | +the whole cache line over and over due to the 'sharing', even though | 
|  | 45 | +'name' is never changed. | 
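
The sharing described above can be checked from member offsets alone. The following is a minimal userspace sketch, not kernel code: it assumes a 64-byte cache line, a cache-line-aligned object, and replaces refcount_t with a plain unsigned int. CACHE_LINE_SIZE, cache_line_of() and foo_members_share_line() are hypothetical names used only for illustration.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* assumption: the real size is CPU specific */

/* Userspace stand-in for the example struct; 'unsigned int' replaces refcount_t. */
struct foo {
	unsigned int refcount;	/* written frequently */
	char name[16];		/* written once, then only read */
};

/* Index of the cache line holding byte 'offset' of a line-aligned object. */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE_SIZE;
}

/* Nonzero if 'refcount' and 'name' start in the same cache line. */
static int foo_members_share_line(void)
{
	return cache_line_of(offsetof(struct foo, refcount)) ==
	       cache_line_of(offsetof(struct foo, name));
}
```

With this layout both members fall into line 0 of the object, so every write to 'refcount' touches the same line that the readers of 'name' keep in their caches.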
|  | 46 | + | 
|  | 47 | +There are many real-world cases of performance regressions caused by | 
|  | 48 | +false sharing.  One of these is the rw_semaphore 'mmap_lock' inside | 
|  | 49 | +struct mm_struct, whose cache line layout change triggered a | 
|  | 50 | +regression that Linus analyzed in [2]_. | 
|  | 51 | + | 
|  | 52 | +There are two key factors for harmful false sharing: | 
|  | 53 | + | 
|  | 54 | +* A global datum accessed (shared) by many CPUs | 
|  | 55 | +* In the concurrent accesses to the data, there is at least one write | 
|  | 56 | +  operation: write/write or write/read cases. | 
|  | 57 | + | 
|  | 58 | +The sharing could be from totally unrelated kernel components, or | 
|  | 59 | +different code paths of the same kernel component. | 
|  | 60 | + | 
|  | 61 | + | 
|  | 62 | +False Sharing Pitfalls | 
|  | 63 | +====================== | 
|  | 64 | +Back when a platform had only one or a few CPUs, hot data members | 
|  | 65 | +could be purposely put in the same cache line to make them cache hot | 
|  | 66 | +and save cache lines/TLB entries, like a lock and the data protected | 
|  | 67 | +by it.  But on recent large systems with hundreds of CPUs, this may | 
|  | 68 | +not work when the lock is heavily contended, as the lock owner CPU | 
|  | 69 | +could write to the data while other CPUs are busy spinning on the lock. | 
|  | 70 | + | 
|  | 71 | +Looking at past cases, there are several frequently occurring patterns | 
|  | 72 | +for false sharing: | 
|  | 73 | + | 
|  | 74 | +* lock (spinlock/mutex/semaphore) and data protected by it are | 
|  | 75 | +  purposely put in one cache line. | 
|  | 76 | +* global data being put together in one cache line. Some kernel | 
|  | 77 | +  subsystems have many global parameters of small size (4 bytes), | 
|  | 78 | +  which can easily be grouped together and put into one cache line. | 
|  | 79 | +* data members of a big data structure randomly sitting together | 
|  | 80 | +  without being noticed (cache line is usually 64 bytes or more), | 
|  | 81 | +  like 'mem_cgroup' struct. | 
|  | 82 | + | 
|  | 83 | +The 'Possible Mitigations' section below provides real-world examples. | 
|  | 84 | + | 
|  | 85 | +False sharing can easily creep in unless it is intentionally checked | 
|  | 86 | +for, so it is valuable to run specific tools on performance-critical | 
|  | 87 | +workloads to detect cases where false sharing affects performance | 
|  | 88 | +and optimize accordingly. | 
|  | 89 | + | 
|  | 90 | + | 
|  | 91 | +How to detect and analyze False Sharing | 
|  | 92 | +======================================== | 
|  | 93 | +perf record/report/stat are widely used for performance tuning, and | 
|  | 94 | +once hotspots are detected, tools like 'perf-c2c' and 'pahole' can | 
|  | 95 | +be further used to detect and pinpoint the possible false sharing | 
|  | 96 | +data structures.  'addr2line' is also good at decoding instruction | 
|  | 97 | +pointers when there are multiple layers of inline functions. | 
|  | 98 | + | 
|  | 99 | +perf-c2c can capture the cache lines with the most false sharing hits, | 
|  | 100 | +the decoded functions (with file and line number) accessing each cache | 
|  | 101 | +line, and the in-line offset of the data. Simple commands are:: | 
|  | 102 | + | 
|  | 103 | +  $ perf c2c record -ag sleep 3 | 
|  | 104 | +  $ perf c2c report --call-graph none -k vmlinux | 
|  | 105 | + | 
|  | 106 | +When running the above while testing will-it-scale's tlb_flush1 case, | 
|  | 107 | +perf reports something like:: | 
|  | 108 | + | 
|  | 109 | +  Total records                     :    1658231 | 
|  | 110 | +  Locked Load/Store Operations      :      89439 | 
|  | 111 | +  Load Operations                   :     623219 | 
|  | 112 | +  Load Local HITM                   :      92117 | 
|  | 113 | +  Load Remote HITM                  :        139 | 
|  | 114 | + | 
|  | 115 | +  #---------------------------------------------------------------------- | 
|  | 116 | +      4        0     2374        0        0        0  0xff1100088366d880 | 
|  | 117 | +  #---------------------------------------------------------------------- | 
|  | 118 | +    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1 | 
|  | 119 | +    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1 | 
|  | 120 | +    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1 | 
|  | 121 | +    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1 | 
|  | 122 | +    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1 | 
|  | 123 | + | 
|  | 124 | +A nice introduction for perf-c2c is [3]_. | 
|  | 125 | + | 
|  | 126 | +'pahole' decodes data structure layouts at cache line | 
|  | 127 | +granularity.  Users can match the offset in perf-c2c output with | 
|  | 128 | +pahole's decoding to locate the exact data members.  For global | 
|  | 129 | +data, users can search the data address in System.map. | 
|  | 130 | + | 
|  | 131 | + | 
|  | 132 | +Possible Mitigations | 
|  | 133 | +==================== | 
|  | 134 | +False sharing does not always need to be mitigated.  False sharing | 
|  | 135 | +mitigations should balance performance gains with complexity and | 
|  | 136 | +space consumption.  Sometimes, lower performance is OK, and it's | 
|  | 137 | +unnecessary to hyper-optimize every rarely used data structure or | 
|  | 138 | +a cold data path. | 
|  | 139 | + | 
|  | 140 | +Cases of false sharing hurting performance are seen more frequently | 
|  | 141 | +as core counts increase.  Because of these detrimental effects, many | 
|  | 142 | +patches have been proposed across a variety of subsystems (like | 
|  | 143 | +networking and memory management) and merged.  Some common mitigations | 
|  | 144 | +(with examples) are: | 
|  | 145 | + | 
|  | 146 | +* Separate hot global data in its own dedicated cache line, even if it | 
|  | 147 | +  is just a 'short' type. The downside is more consumption of memory, | 
|  | 148 | +  cache lines and TLB entries. | 
|  | 149 | + | 
|  | 150 | +  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated") | 
|  | 151 | + | 
|  | 152 | +* Reorganize the data structure to separate the interfering members | 
|  | 153 | +  into different cache lines.  One downside is that it may introduce | 
|  | 154 | +  new false sharing of other members. | 
|  | 155 | + | 
|  | 156 | +  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing") | 
|  | 157 | + | 
|  | 158 | +* Replace 'write' with 'read' when possible, especially in loops. | 
|  | 159 | +  For some global variables, use compare(read)-then-write instead | 
|  | 160 | +  of an unconditional write. For example, use:: | 
|  | 161 | + | 
|  | 162 | +	if (!test_bit(XXX)) | 
|  | 163 | +		set_bit(XXX); | 
|  | 164 | + | 
|  | 165 | +  instead of directly "set_bit(XXX);", similarly for atomic_t data:: | 
|  | 166 | + | 
|  | 167 | +	if (atomic_read(XXX) == AAA) | 
|  | 168 | +		atomic_set(XXX, BBB); | 
|  | 169 | + | 
|  | 170 | +  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing") | 
|  | 171 | +  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP") | 
|  | 172 | + | 
|  | 173 | +* Turn hot global data into 'per-cpu data + global data' when possible, | 
|  | 174 | +  or reasonably increase the threshold for syncing per-cpu data to | 
|  | 175 | +  global data, to reduce or postpone the 'write' to that global data. | 
|  | 176 | + | 
|  | 177 | +  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses") | 
|  | 178 | +  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy") | 
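
The compare(read)-then-write mitigation listed above can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: set_flag_if_clear() is a hypothetical helper, not a kernel API, and C11 atomics do not have the same ordering rules as the kernel's test_bit()/set_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set 'bit' in '*flags' only if a plain read shows it is still clear.
 * Returns true if a write was issued, false if the read showed that
 * the write was unnecessary.
 *
 * The read and the write are not one atomic step, but setting a bit
 * is idempotent, so a racing setter only costs a redundant write.
 */
static bool set_flag_if_clear(atomic_ulong *flags, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	/* Read-only check: the cache line can stay shared among CPUs. */
	if (atomic_load_explicit(flags, memory_order_relaxed) & mask)
		return false;

	/* Slow path: this write needs exclusive ownership of the line. */
	atomic_fetch_or(flags, mask);
	return true;
}
```

Once the bit is set, repeated callers take only the read path, so the line no longer bounces between the writers' caches.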
|  | 179 | + | 
|  | 180 | +Of course, all mitigations should be carefully verified to not cause | 
|  | 181 | +side effects.  To avoid introducing false sharing when coding, it's | 
|  | 182 | +better to: | 
|  | 183 | + | 
|  | 184 | +* Be aware of cache line boundaries | 
|  | 185 | +* Group mostly read-only fields together | 
|  | 186 | +* Group things that are written at the same time together | 
|  | 187 | +* Separate frequently read and frequently written fields on | 
|  | 188 | +  different cache lines. | 
|  | 189 | + | 
|  | 190 | +It is also better to add a comment stating the false sharing consideration. | 
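
The layout guidelines above might look as follows in a userspace C11 sketch. 'struct foo_stats' and its fields are hypothetical, and a 64-byte cache line is assumed; in kernel code, an annotation such as ____cacheline_aligned_in_smp would typically take the place of alignas().

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* assumption: check the target CPU */

struct foo_stats {
	/* Read-mostly fields, grouped so that readers share one line. */
	unsigned int limit;
	unsigned int flags;
	char name[16];

	/*
	 * Written together on the hot path.  Kept on a separate cache
	 * line from the read-mostly fields above to avoid false sharing.
	 */
	alignas(CACHE_LINE_SIZE) unsigned long nr_events;
	unsigned long nr_bytes;
};

/* Fail the build if a later change moves the hot counters back. */
static_assert(offsetof(struct foo_stats, nr_events) / CACHE_LINE_SIZE !=
	      offsetof(struct foo_stats, name) / CACHE_LINE_SIZE,
	      "hot counters must not share a cache line with read-mostly data");
```

The compile-time check documents the intent next to the data and catches accidental re-layouts, at the cost of the padding that the alignment introduces.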
|  | 191 | + | 
|  | 192 | +Note that sometimes, even after a severe case of false sharing is | 
|  | 193 | +detected and solved, performance may show no obvious improvement | 
|  | 194 | +because the hotspot moves to a new place. | 
|  | 195 | + | 
|  | 196 | + | 
|  | 197 | +Miscellaneous | 
|  | 198 | +============= | 
|  | 199 | +One open issue is that the kernel has an optional data structure | 
|  | 200 | +randomization mechanism, which also randomizes how data members | 
|  | 201 | +share cache lines. | 
|  | 202 | + | 
|  | 203 | + | 
|  | 204 | +.. [1] https://en.wikipedia.org/wiki/False_sharing | 
|  | 205 | +.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/ | 
|  | 206 | +.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/ | 