tabs ↹ over ␣ ␣ ␣ spaces

by Jiří {x2} Činčura

How is Thread.SpinWait actually implemented?

17 Jul 2018 .NET, .NET Core, JIT, Multithreading/Parallelism/Asynchronous/Concurrency, RyuJIT

I’m always drawn to disassembling stuff and learning how something works under the hood. This time it’s Thread.SpinWait I’m going to explore. Because .NET Core is open source, I can attack this from both sides: the sources as well as pure disassembly.

Sources

Let’s start simply with the sources. Following where Thread.SpinWait goes, I eventually ended up in the internal (yes, internal) class Thread, which derives from RuntimeThread, where the SpinWait method lives. This method calls SpinWaitInternal, which is conveniently right above it. That’s where the C# code ends and we need to go lower (in this case into the “VM”).

The implementation is in the comsynchronizable.cpp file, using the FCIMPL1 macro (which I think is an abbreviation for “fastcall function implementation with one argument”). It simply checks the number of iterations. If it’s over 100000, preemptive mode is used to avoid stalling a GC; otherwise the code stays in cooperative mode. In both cases YieldProcessorNormalized is called, passing the result from YieldProcessorNormalizationInfo.
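To make that control flow concrete, here is a minimal C++ sketch of the decision. It is not the runtime's code: FakeThread, PreemptiveScope and spin_wait_sketch are names I made up for illustration, and the mode switch is modelled as a plain field. Only the 100000 threshold comes from the description above.

```cpp
#include <cassert>

// All names below are hypothetical; only the 100000 threshold comes from the text.
enum class GcMode { Cooperative, Preemptive };

constexpr int kSourceThreshold = 100000;

// Toy stand-in for a managed thread that only tracks its current GC mode.
struct FakeThread {
    GcMode mode = GcMode::Cooperative;
};

// RAII guard: switches to preemptive mode and restores cooperative mode on exit.
class PreemptiveScope {
public:
    explicit PreemptiveScope(FakeThread& thread) : thread_(thread) {
        assert(thread_.mode == GcMode::Cooperative);
        thread_.mode = GcMode::Preemptive;
    }
    ~PreemptiveScope() { thread_.mode = GcMode::Cooperative; }
    PreemptiveScope(const PreemptiveScope&) = delete;
    PreemptiveScope& operator=(const PreemptiveScope&) = delete;

private:
    FakeThread& thread_;
};

// Runs yield_normalized(iterations) once, in the mode the threshold selects.
template <typename YieldFn>
void spin_wait_sketch(FakeThread& thread, int iterations, YieldFn yield_normalized) {
    if (iterations > kSourceThreshold) {
        PreemptiveScope scope(thread);   // long spin: do not hold up a GC
        yield_normalized(iterations);
    } else {
        yield_normalized(iterations);    // short spin: stay cooperative
    }
}
```

Passing the yield routine in as a callable keeps the sketch testable: a caller can hand in a lambda that records thread.mode and check which branch was taken.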

The YieldProcessorNormalized method calls YieldProcessor a number of times based on YieldProcessorNormalizationInfo.yieldsPerNormalizedYield. YieldProcessor is again a macro, defined in gcenv.base.h (together with MemoryBarrier). Looking at it shows that the implementation differs by platform. For example, on AMD64 with Visual C++ it uses the _mm_pause intrinsic, which eventually puts a pause instruction into the resulting code. For x86 it simply uses rep nop, which has the same encoding (F3 90) as pause. The important part of that file is included at the bottom as a reference.
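The relationship between a "normalized" yield and the raw instruction can be sketched like this. It is an illustration under assumptions, not the runtime's implementation: NormalizationInfo, yield_processor and yield_processor_normalized are hypothetical stand-ins, the nested loop is just one way to express "call it that many times", and the return value exists only so the count can be observed.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the YieldProcessor macro (hypothetical helper).
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void yield_processor() { _mm_pause(); }                 // emits pause
#elif defined(__aarch64__)
inline void yield_processor() { __asm__ volatile("yield"); }   // ARM64 hint
#else
inline void yield_processor() {}                               // no-op fallback
#endif

// Hypothetical stand-in for YieldProcessorNormalizationInfo.
struct NormalizationInfo {
    unsigned yields_per_normalized_yield;
};

// Performs `count` normalized yields; returns how many raw yields were issued.
inline std::uint64_t yield_processor_normalized(const NormalizationInfo& info,
                                                unsigned count) {
    assert(info.yields_per_normalized_yield > 0);
    std::uint64_t issued = 0;
    for (unsigned i = 0; i < count; ++i) {
        for (unsigned j = 0; j < info.yields_per_normalized_yield; ++j) {
            yield_processor();
            ++issued;
        }
    }
    return issued;
}
```

With yields_per_normalized_yield set to 7, three normalized yields issue 21 raw ones. The point of the scaling factor is that the caller's iteration count no longer depends on how long a single pause takes on a given CPU.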

Looks like I have the implementation. On the platforms where I run my code most often, it’s simply the pause instruction.

Disassembly

All the above is nice, but what if I’ve made some mistake? I should be able to see the result in pure disassembly, right?

I compiled a simple .NET Framework (non-Core) console application with full optimizations enabled and loaded it into WinDbg. Using the Disassembly window and F11, I went deeper and deeper into the code. Eventually I ended up in this piece of code for 32-bit.

7217f56c 8bf1            mov     esi,ecx
7217f56e 8975e4          mov     dword ptr [ebp-1Ch],esi
7217f571 bf60f51772      mov     edi,offset clr!ThreadNative::SpinWait (7217f560)
7217f576 897de0          mov     dword ptr [ebp-20h],edi ss:002b:00f3ee58=00000000
7217f579 81fe40420f00    cmp     esi,0F4240h
7217f57f 0f8f26492100    jg      clr!ThreadNative::SpinWait+0x35 (72393eab)
7217f585 85f6            test    esi,esi
7217f587 7e07            jle     clr!ThreadNative::SpinWait+0x123 (7217f590)
7217f589 f390            pause
7217f58b 83ee01          sub     esi,1
7217f58e 75f9            jne     clr!ThreadNative::SpinWait+0x29 (7217f589)

The pause instruction is nicely there and the esi register is used to count (down) the iterations.

For 64-bit, the code still uses pause, but the looping is done slightly differently.

00007ffe`b5a2b556 33db            xor     ebx,ebx
00007ffe`b5a2b558 81f940420f00    cmp     ecx,0F4240h
00007ffe`b5a2b55e 7f0e            jg      clr!ThreadNative::SpinWait+0x4e (00007ffe`b5a2b56e)
00007ffe`b5a2b560 3bd9            cmp     ebx,ecx
00007ffe`b5a2b562 0f8dc5010000    jge     clr!ThreadNative::SpinWait+0x20d (00007ffe`b5a2b72d)
00007ffe`b5a2b568 f390            pause
00007ffe`b5a2b56a ffc3            inc     ebx
00007ffe`b5a2b56c ebf2            jmp     clr!ThreadNative::SpinWait+0x40 (00007ffe`b5a2b560)

The ebx (rbx) register is incremented and compared with ecx (rcx), where the total number of iterations is stored.

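The decision between cooperative and preemptive mode is visible in both listings as the cmp with the value 0F4240h. That is 1,000,000 in decimal, not the 100000 from the .NET Core sources above, so the .NET Framework build I disassembled evidently uses a different threshold.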
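For readers who prefer C++ to assembly, the two loop shapes can be written out roughly as follows. This is my reading of the listings, not decompiled source: PauseCounter, spin_count_down, spin_count_up and takes_long_path are hypothetical names, and the pause instruction is replaced by a counter so the behaviour can be checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hook standing in for the pause instruction; it only counts calls.
struct PauseCounter {
    std::uint64_t count = 0;
    void pause() { ++count; }
};

// Immediate operand of the cmp in both listings.
constexpr int kDisassemblyThreshold = 0xF4240;
static_assert(kDisassemblyThreshold == 1000000, "0F4240h is one million");

// jg is a signed greater-than, so only counts above the threshold branch away.
constexpr bool takes_long_path(int iterations) {
    return iterations > kDisassemblyThreshold;
}

// Shape of the 32-bit listing: guard once, then count down to zero.
inline void spin_count_down(int iterations, PauseCounter& cpu) {
    if (iterations <= 0) return;     // test esi,esi / jle
    do {
        cpu.pause();                 // pause
        iterations -= 1;             // sub esi,1
    } while (iterations != 0);       // jne
    assert(iterations == 0);
}

// Shape of the 64-bit listing: compare at the top, count up.
inline void spin_count_up(int iterations, PauseCounter& cpu) {
    int i = 0;                       // xor ebx,ebx
    while (i < iterations) {         // cmp ebx,ecx / jge leaves the loop
        cpu.pause();                 // pause
        ++i;                         // inc ebx
    }                                // jmp back to the cmp
}
```

Both shapes issue exactly one pause per requested iteration and none for zero or negative counts; they differ only in which register does the counting and where the exit test sits.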

Summary

True, for day-to-day programming in .NET one does not need to know this; heck, one does not need Thread.SpinWait at all, and I know it. So what’s the reason for all this? I like such disassembling (pun intended). It keeps my brain occupied and sometimes stretches my abilities, so I keep learning new stuff.

Appendix

YieldProcessor macro etc. in gcenv.base.h
#if defined(_MSC_VER) 
 #if defined(_ARM_)

  __forceinline void YieldProcessor() { }
  extern "C" void __emit(const unsigned __int32 opcode);
  #pragma intrinsic(__emit)
  #define MemoryBarrier() { __emit(0xF3BF); __emit(0x8F5F); }

 #elif defined(_ARM64_)

  extern "C" void __yield(void);
  #pragma intrinsic(__yield)
  __forceinline void YieldProcessor() { __yield();}

  extern "C" void __dmb(const unsigned __int32 _Type);
  #pragma intrinsic(__dmb)
  #define MemoryBarrier() { __dmb(_ARM64_BARRIER_SY); }

 #elif defined(_AMD64_)
  
  extern "C" void
  _mm_pause (
      void
      );
  
  extern "C" void
  _mm_mfence (
      void
      );

  #pragma intrinsic(_mm_pause)
  #pragma intrinsic(_mm_mfence)
  
  #define YieldProcessor _mm_pause
  #define MemoryBarrier _mm_mfence

 #elif defined(_X86_)
  
  #define YieldProcessor() __asm { rep nop }
  #define MemoryBarrier() MemoryBarrierImpl()
  __forceinline void MemoryBarrierImpl()
  {
      int32_t Barrier;
      __asm {
          xchg Barrier, eax
      }
  }

 #else // !_ARM_ && !_AMD64_ && !_X86_
  #error Unsupported architecture
 #endif
#else // _MSC_VER

// Only clang defines __has_builtin, so we first test for a GCC define
// before using __has_builtin.

#if defined(__i386__) || defined(__x86_64__)

#if (__GNUC__ > 4 && __GNUC_MINOR > 7) || __has_builtin(__builtin_ia32_pause)
 // clang added this intrinsic in 3.8
 // gcc added this intrinsic by 4.7.1
 #define YieldProcessor __builtin_ia32_pause
#endif // __has_builtin(__builtin_ia32_pause)

#if defined(__GNUC__) || __has_builtin(__builtin_ia32_mfence)
 // clang has had this intrinsic since at least 3.0
 // gcc has had this intrinsic since forever
 #define MemoryBarrier __builtin_ia32_mfence
#endif // __has_builtin(__builtin_ia32_mfence)

// If we don't have intrinsics, we can do some inline asm instead.
#ifndef YieldProcessor
 #define YieldProcessor() asm volatile ("pause")
#endif // YieldProcessor

#ifndef MemoryBarrier
 #define MemoryBarrier() asm volatile ("mfence")
#endif // MemoryBarrier

#endif // defined(__i386__) || defined(__x86_64__)

#ifdef __aarch64__
 #define YieldProcessor() asm volatile ("yield")
 #define MemoryBarrier __sync_synchronize
#endif // __aarch64__

#ifdef __arm__
 #define YieldProcessor()
 #define MemoryBarrier __sync_synchronize
#endif // __arm__

#endif // _MSC_VER