Skip to content

Improve performance scaling of fmod using modular exponentiation #898

New issue

Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? No Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

quaternic
Copy link
Contributor

In the previous implementation, the cost of evaluating fmod(x, y) was linear in log(|x/y|) (or, the difference in exponents). Especially for the larger float types, this made some inputs very slow to evaluate. E.g. fmod(f64::MAX, 13.0) would run a loop with over 1000 iterations.

This implementation changes that scaling to log(log(|x/y|)), giving a significant improvement for the worst cases.

Todos:

  • For some more typical inputs this can be a small regression, so I still need to look into that closer.
  • Currently not available for f128, even though that's the one that would benefit the most

Comment on lines +63 to +66
// FIXME: This is a temporary hack to get around the lack of `u256 / u256`.
// Actually, the algorithm only needs the operation `(x << I::BITS) / y`
// where `x < y`. That is, a division `u256 / u128` where the quotient must
// not overflow `u128` would be sufficient for `f128`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be easier with u256? We have an implementation here

but only shifts and widening multiplication are supported currently.

Copy link
Contributor Author

@quaternic quaternic May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If u256 already implemented Div, then the generic code here could just use that, so it would be easy, yes. But implementing that sounds like extra work just to achieve a suboptimal solution.

a division u2N / uN where the quotient must not overflow uN

For context, x86's div-instruction works just like this. Take the double-wide dividend in a pair of registers, divide by a value in a third register, and replace the low and high halves of the dividend with the quotient and remainder respectively. If the quotient would overflow (which is exactly when x.hi() >= y), signal divide error.

So that's the abstraction I'd like to use; something like

unsafe fn unchecked_wide_div_rem<U: HInt>(U::D, U) -> (U, U);

But of course, that would be even more work to implement since it doesn't exist yet, and I don't think other arches have a native operation for it.

Another idea would be to get rid of the integer division altogether, and compute the reciprocal from the original floating point value. I expect that this could have better performance, but it needs more careful analysis.

@quaternic
Copy link
Contributor Author

It looks like CI was previously failing only because there were a few debug_assert_eq which include formatting code for the arguments. Switching those to debug_assert!(x == y) seems to have been sufficient.

@quaternic
Copy link
Contributor Author

icount::icount_bench_fmod_group::icount_bench_fmod logspace:setup_fmod()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       74379|1019076              (-92.7013%) [-13.7011x]
  L1 Hits:                            84569|1021276              (-91.7193%) [-12.0762x]
  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]
  RAM Hits:                              18|11                   (+63.6364%) [+1.63636x]
  Total read+write:                   84593|1021288              (-91.7170%) [-12.0730x]
  Estimated Cycles:                   85229|1021666              (-91.6578%) [-11.9873x]
icount::icount_bench_fmodf128_group::icount_bench_fmodf128 logspace:setup_fmodf128()
  Baselines:                      hardfloat|hardfloat
  Instructions:                    30701786|30701786             (No change)
  L1 Hits:                         30717674|30717679             (-0.00002%) [-1.00000x]
  L2 Hits:                               10|5                    (+100.000%) [+2.00000x]
  RAM Hits:                              37|37                   (No change)
  Total read+write:                30717721|30717721             (No change)
  Estimated Cycles:                30719019|30718999             (+0.00007%) [+1.00000x]
icount::icount_bench_fmodf16_group::icount_bench_fmodf16 logspace:setup_fmodf16()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       50481|50124                (+0.71223%) [+1.00712x]
  L1 Hits:                            62816|56769                (+10.6519%) [+1.10652x]
  L2 Hits:                                6|1                    (+500.000%) [+6.00000x]
  RAM Hits:                              23|15                   (+53.3333%) [+1.53333x]
  Total read+write:                   62845|56785                (+10.6718%) [+1.10672x]
  Estimated Cycles:                   63651|57299                (+11.0857%) [+1.11086x]
icount::icount_bench_fmodf_group::icount_bench_fmodf logspace:setup_fmodf()
  Baselines:                      hardfloat|hardfloat
  Instructions:                       54988|151547               (-63.7155%) [-2.75600x]
  L1 Hits:                            63383|153749               (-58.7750%) [-2.42571x]
  L2 Hits:                                4|1                    (+300.000%) [+4.00000x]
  RAM Hits:                              15|9                    (+66.6667%) [+1.66667x]
  Total read+write:                   63402|153759               (-58.7653%) [-2.42514x]
  Estimated Cycles:                   63928|154069               (-58.5069%) [-2.41004x]

If I understand correctly, the inputs for these is a sampling with log(|x|) and log(|y|) each independently evenly spaced, so it should give a reasonable estimate of overall performance for the general case.

No Sign up for free to join this conversation on GitHub. Already have an account? No Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants