Improve performance scaling of `fmod` using modular exponentiation #898

quaternic · 2025-05-01T16:03:14Z

In the previous implementation, the cost of evaluating fmod(x, y) was linear in log(|x/y|) (or, the difference in exponents). Especially for the larger float types, this made some inputs very slow to evaluate. E.g. fmod(f64::MAX, 13.0) would run a loop with over 1000 iterations.

This implementation changes that scaling to log(log(|x/y|)), giving a significant improvement for the worst cases.

Todos:

For some more typical inputs this can be a small regression, so I still need to look into that closer.
Currently not available for f128, even though that's the one that would benefit the most

tgross35 · 2025-05-01T18:50:04Z

libm/src/math/generic/fmod.rs

+    // FIXME: This is a temporary hack to get around the lack of `u256 / u256`.
+    // Actually, the algorithm only needs the operation `(x << I::BITS) / y`
+    // where `x < y`. That is, a division `u256 / u128` where the quotient must
+    // not overflow `u128` would be sufficient for `f128`.


Would this be easier with u256? We have an implementation here

compiler-builtins/libm/src/math/support/big.rs

Line 15 in f456aa8

pub struct u256 {

but only shifts and widening multiplication are supported currently.

If u256 already implemented Div, then the generic code here could just use that, so it would be easy, yes. But implementing that sounds like extra work just to achieve a suboptimal solution.

a division u2N / uN where the quotient must not overflow uN

For context, x86's div-instruction works just like this. Take the double-wide dividend in a pair of registers, divide by a value in a third register, and replace the low and high halves of the dividend with the quotient and remainder respectively. If the quotient would overflow (which is exactly when x.hi() >= y), signal divide error.

So that's the abstraction I'd like to use; something like

unsafe fn unchecked_wide_div_rem<U: HInt>(U::D, U) -> (U, U);

But of course, that would be even more work to implement since it doesn't exist yet, and I don't think other arches have a native operation for it.

Another idea would be to get rid of the integer division altogether, and compute the reciprocal from the original floating point value. I expect that this could have better performance, but it needs more careful analysis.

quaternic · 2025-05-02T20:04:58Z

It looks like CI was previously failing only because there were a few debug_assert_eq which include formatting code for the arguments. Switching those to debug_assert!(x == y) seems to have been sufficient.

quaternic · 2025-05-02T21:24:05Z

icount::icount_bench_fmod_group::icount_bench_fmod logspace:setup_fmod()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       74379|1019076              (-92.7013%) [-13.7011x]
  L1 Hits:                            84569|1021276              (-91.7193%) [-12.0762x]
  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]
  RAM Hits:                              18|11                   (+63.6364%) [+1.63636x]
  Total read+write:                   84593|1021288              (-91.7170%) [-12.0730x]
  Estimated Cycles:                   85229|1021666              (-91.6578%) [-11.9873x]
icount::icount_bench_fmodf128_group::icount_bench_fmodf128 logspace:setup_fmodf128()
  Baselines:                      hardfloat|hardfloat
  Instructions:                    30701786|30701786             (No change)
  L1 Hits:                         30717674|30717679             (-0.00002%) [-1.00000x]
  L2 Hits:                               10|5                    (+100.000%) [+2.00000x]
  RAM Hits:                              37|37                   (No change)
  Total read+write:                30717721|30717721             (No change)
  Estimated Cycles:                30719019|30718999             (+0.00007%) [+1.00000x]
icount::icount_bench_fmodf16_group::icount_bench_fmodf16 logspace:setup_fmodf16()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       50481|50124                (+0.71223%) [+1.00712x]
  L1 Hits:                            62816|56769                (+10.6519%) [+1.10652x]
  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]
  RAM Hits:                              23|15                   (+53.3333%) [+1.53333x]
  Total read+write:                   62845|56785                (+10.6718%) [+1.10672x]
  Estimated Cycles:                   63651|57299                (+11.0857%) [+1.11086x]
icount::icount_bench_fmodf_group::icount_bench_fmodf logspace:setup_fmodf()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       54988|151547               (-63.7155%) [-2.75600x]
  L1 Hits:                            63383|153749               (-58.7750%) [-2.42571x]
  L2 Hits:                                4|1                    (+300.000%) [+4.00000x]
  RAM Hits:                              15|9                    (+66.6667%) [+1.66667x]
  Total read+write:                   63402|153759               (-58.7653%) [-2.42514x]
  Estimated Cycles:                   63928|154069               (-58.5069%) [-2.41004x]

If I understand correctly, the inputs for these is a sampling with log(|x|) and log(|y|) each independently evenly spaced, so it should give a reasonable estimate of overall performance for the general case.

Implement Barrett reduction for modular multiplication

bf85502

tgross35 reviewed May 1, 2025

View reviewed changes

Optimize performance of fmod with Barrett multiplication

c811592

quaternic force-pushed the fmod-opt branch from 77429e2 to c811592 Compare May 2, 2025 19:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance scaling of `fmod` using modular exponentiation #898

Improve performance scaling of `fmod` using modular exponentiation #898

quaternic commented May 1, 2025

tgross35 May 1, 2025

quaternic May 1, 2025 •

edited

Loading

quaternic commented May 2, 2025

quaternic commented May 2, 2025

Improve performance scaling of fmod using modular exponentiation #898

Are you sure you want to change the base?

Improve performance scaling of fmod using modular exponentiation #898

Conversation

quaternic commented May 1, 2025

tgross35 May 1, 2025

Choose a reason for hiding this comment

quaternic May 1, 2025 • edited Loading

Choose a reason for hiding this comment

quaternic commented May 2, 2025

quaternic commented May 2, 2025

Improve performance scaling of `fmod` using modular exponentiation #898

Improve performance scaling of `fmod` using modular exponentiation #898

quaternic May 1, 2025 •

edited

Loading