-
Notifications
You must be signed in to change notification settings - Fork 229
Improve performance scaling of fmod
using modular exponentiation
#898
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
// FIXME: This is a temporary hack to get around the lack of `u256 / u256`. | ||
// Actually, the algorithm only needs the operation `(x << I::BITS) / y` | ||
// where `x < y`. That is, a division `u256 / u128` where the quotient must | ||
// not overflow `u128` would be sufficient for `f128`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be easier with u256? We have an implementation here
pub struct u256 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If u256
already implemented Div
, then the generic code here could just use that, so it would be easy, yes. But implementing that sounds like extra work just to achieve a suboptimal solution.
a division
u2N / uN
where the quotient must not overflowuN
For context, x86's div
-instruction works just like this. Take the double-wide dividend in a pair of registers, divide by a value in a third register, and replace the low and high halves of the dividend with the quotient and remainder respectively. If the quotient would overflow (which is exactly when x.hi() >= y
), signal divide error.
So that's the abstraction I'd like to use; something like
unsafe fn unchecked_wide_div_rem<U: HInt>(U::D, U) -> (U, U);
But of course, that would be even more work to implement since it doesn't exist yet, and I don't think other arches have a native operation for it.
Another idea would be to get rid of the integer division altogether, and compute the reciprocal from the original floating point value. I expect that this could have better performance, but it needs more careful analysis.
It looks like CI was previously failing only because there were a few |
If I understand correctly, the inputs for these is a sampling with |
In the previous implementation, the cost of evaluating
fmod(x, y)
was linear inlog(|x/y|)
(or, the difference in exponents). Especially for the larger float types, this made some inputs very slow to evaluate. E.g.fmod(f64::MAX, 13.0)
would run a loop with over 1000 iterations.This implementation changes that scaling to
log(log(|x/y|))
, giving a significant improvement for the worst cases.Todos:
f128
, even though that's the one that would benefit the most