Skip to content

Use runtime feature detection for fma routines on x86 #896

New issue

Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? No Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

tgross35
Copy link
Contributor

@tgross35 tgross35 commented Apr 30, 2025

Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with +fma (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic.

Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of std-detect.

Fixes with sync: rust-lang/rust#140452

@tgross35 tgross35 force-pushed the x86-asm branch 5 times, most recently from 3a6e471 to 7cb49a7 Compare April 30, 2025 09:08
@tgross35
Copy link
Contributor Author

tgross35 commented Apr 30, 2025

 icount::icount_bench_fma_group::icount_bench_fma logspace:setup_fma()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       10072|53426                (-81.1478%) [-5.30441x]
  L1 Hits:                            13762|60191                (-77.1361%) [-4.37371x]
  L2 Hits:                                4|3                    (+33.3333%) [+1.33333x]
  RAM Hits:                              15|19                   (-21.0526%) [-1.26667x]
  Total read+write:                   13781|60213                (-77.1129%) [-4.36928x]
  Estimated Cycles:                   14307|60871                (-76.4962%) [-4.25463x]
icount::icount_bench_fmaf_group::icount_bench_fmaf logspace:setup_fmaf()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       10072|16587                (-39.2777%) [-1.64684x]
  L1 Hits:                            13763|19234                (-28.4444%) [-1.39752x]
  L2 Hits:                                2|3                    (-33.3333%) [-1.50000x]
  RAM Hits:                              16|6                    (+166.667%) [+2.66667x]
  Total read+write:                   13781|19243                (-28.3843%) [-1.39634x]
  Estimated Cycles:                   14333|19459                (-26.3426%) [-1.35764x]

@sarah-quinones
Copy link

sarah-quinones commented Apr 30, 2025

might be nitpicking, but this adds more branches levels of indirection than strictly needed. this is the implementation i had in mind

pseudocode (i think data ptr -> fn ptr requires a transmute)

fn fma_select() {
    let fn_ptr = todo!();
    // x86/x86_64 function pointers and data pointers are the same
    FMA_DYNAMIC.store(fn_ptr as *mut (), Relaxed);
    fn_ptr();
}

static FMA_DYNAMIC: AtomicPtr<()> = AtomicPtr::new(fma_select as fn() as *mut ());

pub fn fma() { (FMA_DYNAMIC.load(Relaxed) as fn())(); }

@jieyouxu
Copy link
Member

jieyouxu commented Apr 30, 2025

@tgross35 you might want to edit the magic "Fixes" magic keyword in ur PR description since u have write access to r-l/r, it will also close the r-l/r issue if this PR merges I think

@tgross35
Copy link
Contributor Author

tgross35 commented Apr 30, 2025

might be nitpicking, but this adds more branches levels of indirection than strictly needed. this is the implementation i had in mind

Not at all, thanks for pointing this out. Updated, that saved another four instructions per invocation (CI runs 500 samples).

 icount::icount_bench_fma_group::icount_bench_fma logspace:setup_fma()
  Baselines:                      hardfloat|hardfloat
  Instructions:                        8031|53426                (-84.9680%) [-6.65247x]
  L1 Hits:                            11722|60191                (-80.5253%) [-5.13487x]
  L2 Hits:                                4|3                    (+33.3333%) [+1.33333x]
  RAM Hits:                              15|19                   (-21.0526%) [-1.26667x]
  Total read+write:                   11741|60213                (-80.5009%) [-5.12844x]
  Estimated Cycles:                   12267|60871                (-79.8475%) [-4.96217x]
icount::icount_bench_fmaf_group::icount_bench_fmaf logspace:setup_fmaf()
  Baselines:                      hardfloat|hardfloat
  Instructions:                        8031|16587                (-51.5826%) [-2.06537x]
  L1 Hits:                            11723|19234                (-39.0506%) [-1.64071x]
  L2 Hits:                                3|3                    (No change)
  RAM Hits:                              15|6                    (+150.000%) [+2.50000x]
  Total read+write:                   11741|19243                (-38.9856%) [-1.63896x]
  Estimated Cycles:                   12263|19459                (-36.9803%) [-1.58681x]


/// Stores a pointer that is immediately jumped to upon entering $name. By default it
/// is an init function that sets FUNC to something else.
static FUNC: AtomicPtr<Func> = AtomicPtr::new({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than holding a pointer to a function pointer, you should instead use an AtomicPtr<()> and hold the function pointer directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch, I have updated this. Saves a mov per iteration as expected.

@tgross35 tgross35 force-pushed the x86-asm branch 5 times, most recently from e16aa60 to 34af030 Compare May 1, 2025 18:47
tgross35 added 3 commits May 1, 2025 19:06
This module is used for both i686 and x86-64.
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
No Sign up for free to join this conversation on GitHub. Already have an account? No Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants