The purpose of this exercise was to create an algorithm to find integer square roots using only integer operations. I initially devised this method in 1990 for the ZX Spectrum +3, which had a Zilog Z80 (Intel 8080-compatible) processor that doesn't even have multiply or divide assembly instructions!

This document was started in 1994, before I founded Cynergi, but wasn't finished until July 2000. While finishing the document, I found an improvement over the algorithm (the binary search method) that seems to be novel and, if implemented in hardware, could perhaps generate 32-bit floating-point square roots in 25 clock cycles or less, which is even faster than the latest generation of Intel processors (the Itanium, as of August 2000)! FPU (floating-point unit) emulators built into operating systems can also use this to implement a quick square root only about twice as slow as the hardware version (on an i486).
Given their simplicity and speed, both algorithms can also be used on systems where memory is short, an FPU is not available and/or 3D graphics (which most often require square root calculations) are required. However, since the binary search algorithm is faster and cheaper than current square root implementations, it could replace them in every implementation.
I have split the document into two parts, but left it pretty much in the same item order as the ideas were developed. This helps to show the path of discovery. Part I deals with an empirical discovery of a sequential algorithm to find the square root, and its subsequent mathematical proof, and Part II will deal with a binary search variation that greatly improves the root's calculation speed (even though this sequential search algorithm is the smallest you'll find, if size is what matters to you). Both algorithms have the same accuracy.
The entire document consists of:

Part I
Part II

NOTES: The "official" address for this document is http://www.pedrofreire.com/sqrt. Please do not link directly to the address that shows up in your browser — this may be changed without notice. Also note that at the time I write this (August 2000) I am not aware of these algorithms being in use (at least the binary search version), but I cannot confirm this.
Starting with the most used root (the square root or sqrt(x)), we can build a table of square roots (where r is the integer result):
 x   sqrt(x)   r        x   sqrt(x)   r
 0    0.00     0       20    4.47     4
 1    1.00     1       21    4.58     5
 2    1.41     1        …
 3    1.73     2       30    5.48     5
 4    2.00     2       31    5.57     6
 5    2.24     2        …
 6    2.45     2       42    6.48     6
 7    2.65     3       43    6.56     7
 …                      …
12    3.46     3       56    7.48     7
13    3.61     4       57    7.55     8
 …                      …
From this table, you may notice that for successive numbers, the results repeat themselves with a pattern, namely
r   no. of similar results       r   no. of similar results
0   1                            5   30-21+1 = 10
1   2                            6   42-31+1 = 12
2   4                            7   56-43+1 = 14
3   12-7+1 = 6                   8   16
4   20-13+1 = 8                  …   2r (+1 if r=0)
With this in mind, we can now easily build the following code in C

register unsigned int x, r;
/* … */
x = (x+1) >> 1;   /* init x */
for( r=0; x>r; x-=r++ );

which assumes x unsigned. Upon exit, r holds the integer root, and x holds some sort of remainder (this or any other remainder will not be explored in this document).
Let's try some examples to see if it works:

  x   init x   for loop                                        r
  0     0                                                      0
  1     1      1-0=1                                           1
  2     1      1-0=1                                           1
  3     2      2-0=2, 2-1=1                                    2
  4     2      2-0=2, 2-1=1                                    2
  5     3      3-0=3, 3-1=2                                    2
  6     3      3-0=3, 3-1=2                                    2
  7     4      4-0=4, 4-1=3, 3-2=1                             3
  …
 12     6      6-0=6, 6-1=5, 5-2=3                             3
 13     7      7-0=7, 7-1=6, 6-2=4, 4-3=1                      4
  …
 15     8      8-0=8, 8-1=7, 7-2=5, 5-3=2                      4
  …
144    72      72-0=72, 72-1=71, 71-2=69, 69-3=66, 66-4=62,
               62-5=57, 57-6=51, 51-7=44, 44-8=36, 36-9=27,
               27-10=17, 17-11=6                               12
The maximum number of iterations of the for loop for n-bit words is the root of the largest number that can be coded in that many bits, or

`sqrt(2^n - 1) ~~ sqrt(2^n) = 2^(n/2)`

which, for some values of n, might prove the algorithm to be more efficient than integer-to-real conversion, followed by a floating-point square root and by real-to-integer conversion again. Let's find out if that is so.
An Intel APX/IA assembly optimisation of the loop could be (assuming AX=x and BX=r, both unsigned integers):

        ; …
        inc   AX          ; i486 cycles: 1
        shr   AX, 1       ;              2
        mov   BX, -1      ;              1
loop1:  inc   BX          ;              1
        sub   AX, BX      ;              1
        ja    loop1       ;              3/1
Upon exit, BX holds the result r, and AX some sort of remainder. The semicolons simply mark the comments as such.
The corresponding code using floating-point arithmetic would be something like

ctrlword DW ?

        ; Set rounding to nearest integer
        fstcw ctrlword
        and   ctrlword, 1111001111111111b
        fldcw ctrlword
        ; …
        fild  WORD PTR [BX]   ; i486 cycles: 13-16
        fsqrt                 ;              83-87
        fistp WORD PTR [BX]   ;              29-34

assuming WORD PTR [BX] holds x. The result is returned in WORD PTR [BX] (we don't account for remainders in floating-point). 32-bit registers can be used in place of their 16-bit counterparts in both pieces of code, without slowing them down.
There can be a mismatch between the result of this code and the algorithm's. If a result has a decimal part of exactly .5, the algorithm will always round the result up, whereas the i486's FPU (floating-point unit) will round it to the nearest even integer. Both results would be correct; in any case, no integer number has a square root with a decimal part of exactly .5, so the mismatch never actually occurs.
To compare the speed of the two methods, we first show that for the algorithm (let's call it soft-sqrt), the clock count is

1+2+1 + (1+1+3)(r+1) - 2 = 7+5r

where r is again the calculated root. For the method using the FPU (let's call it hard-sqrt), summing the cycle counts listed with the code, you have

13+83+29 = 125 (minimum) to 16+87+34 = 137 (maximum)

clock cycles. Therefore, the highest root range that can be calculated faster by soft-sqrt is
`7+5r_min = 125`  =>  `r_min = 23.6`  =>  `x_min = r_min^2 = 556.96`
to
`7+5r_max = 137`  =>  `r_max = 26`  =>  `x_max = r_max^2 = 676`

This means that soft-sqrt is faster than hard-sqrt for values of x between 0 and 557~676. Similar calculations for other APX/IA (x86) processors yield similar results (the best ratio being that of the i286 processor).
If you often need to calculate an integer root in the above range (roughly 9-bit numbers), then this algorithm is of greatest use to you — remember, only its high-range values are as slow as the hardware floating-point solution! Better yet, older processors don't have a built-in FPU, so on those the job is probably being done by some sluggish software algorithm (sluggish since it has to take floating-point into account).
The major drawback of the method is that it lacks the accuracy you have with floating-point (it returns the result rounded to the nearest integer, but further operations with this result may yield a greater rounding error).
The square of a number r is of course
`r^2`
and the square of its successor (next integer number) is
`(r+1)^2 = r^2 + 2r + 1`
that is, it is offset 2r+1 from the original r². This tells us how far apart (in terms of x, which = r²) a new r+1 result will be.
However, the r values you feed to this formula should be chosen with care. The formula works with real numbers, and my algorithm with integer numbers. My algorithm was devised to return the square root result rounded to the nearest integer, which means the switch from r = 0 to r = 1 for instance, will occur in real numbers when r = 0.5. This means we should apply 0.5 to the formula to find out how many r = 1 results exist, since all r = 0.5 to r = 1.49999… results will be rounded to the integer 1, and 0.5 is the lowest value in this range. You want to find out how far apart are (what is the offset between) r = 0.5 and r = 0.5+1.
So instead of applying
r = 0, 1, 2, 3, 4, 5, …
you should actually apply
r = 0, 0.5, 1.5, 2.5, 3.5, 4.5, …
since 0 is the lowest number (greater than or equal to 0) eligible to be rounded to 0, 0.5 to 1, 1.5 to 2 and so on. The above formula will thus yield
 r     2r+1    x        sqrt(x)
 0      1      0        0
 0.5    2      1..2     1
 1.5    4      3..6     2
 2.5    6      7..12    3
 3.5    8      13..20   4
 4.5   10      21..30   5
 …
which is obviously the same integer succession as shown in the initial examples. However, if you still doubt it, then you should realise that

2×0+1 = 1
    for the odd case, r = 0 (which gives us how many x values return r = 0)

2×0.5+1 = 2
    for the first regular offset (which gives us how many x values return r = 1)

(2(r+1)+1)-(2r+1) = 2r+2+1-2r-1 = 2
    for the changing of the offset (which gives us how many x values return r+1, relative to how many x values return r)

as we assumed.
Q.E.D.
Following the same reasoning as above, we can still solve for roots of any positive integer index (say i). For that, let's make use of Newton's binomial theorem

`(r+1)^i = sum_(k=0)^i (i!r^(i-k)) / (k!(i-k)!)`

which reduces to

`(r+1)^2 = (2!r^2)/(0!*2!) + (2!r^1)/(1!*1!) + (2!r^0)/(2!*0!) = r^2 + 2r + 1`

for the special case of the square (i=2) root (and is equal to the formula we got above, verifying both). Since the summation term for k=0 is

`(i!r^(i-0)) / (0!(i-0)!) = r^i`

the offset can then be generally described as

`(r+1)^i - r^i = sum_(k=1)^i (i!r^(i-k)) / (k!(i-k)!)`

and the changing of the offset, as

`(((r+1)+1)^i - (r+1)^i) - ((r+1)^i - r^i) = (r+2)^i - 2(r+1)^i + r^i`

or

`sum_(k=1)^i (i!(r+1)^(i-k)) / (k!(i-k)!) - sum_(k=1)^i (i!r^(i-k)) / (k!(i-k)!) = sum_(k=1)^(i-1) (i!((r+1)^(i-k)-r^(i-k))) / (k!(i-k)!)`

where the top summation index is now i-1 because when k=i,

`(i!((r+1)^(i-i)-r^(i-i))) / (i!(i-i)!) = 0`
Applying i=2 to either of these two formulas will return the constant 2, verifying them and the results previously calculated.

Now you understand why I started using Newton's binomial: even though the first of these two formulas doesn't show it, the second explicitly shows that the changing of the offset is never of a degree higher than i-2. Why i-2 and not i-1? Well, because if you take

`(r+1)^(i-k) - r^(i-k)`

out of the formula with the summation, you'll see it resembles the offset formula, and its expansion lowers the result by one more degree.
Since the algorithm explained above needs the changing of the offset to work, high exponents (degrees) mean a slow algorithm on high values of i. But we can build generic C code for any root index:

register int x, s;   /* use double if needed */
register int r;
/* … */
r = 0;
if( x > 0 )
{
    s = FO(0.5);
    for( r=1; (x-=s)>0; r++ )
        s += FCO(R-0.5);
}
This code only works with positive numbers: if you want it to work with negative numbers just reverse the sign of r if x was negative upon entry, and work with abs(x). The operation x-=s prior to the comparison with 0 is the reason why x isn't unsigned. s and r aren't unsigned to avoid unnecessary conversions.
You should replace FO(0.5) with the result of the offset formula when you make r=0.5, and FCO(R-0.5) with the formula of the changing of the offset, replacing r in that formula by r-0.5. You should also make the variables x and s double (real) instead of int (integer) if any of these replacements return non-integer numbers. This will slow down the code, but you'll have to optimise it manually, anyway. x should hold the value to calculate the root of, upon reaching the code, and r will hold the calculated root upon exit.
For example, for the square root, you would make FO(0.5) = 2 and FCO(R-0.5) = 2 since it is constant. x and s remain int.
How fast is this, in comparison with a hardware operation? Well, it depends on four factors:

- how fast floating-point calculations (and the root in particular) are on your CPU;
- whether you have to use doubles instead of ints;
- the degree of FCO(R-0.5);
- how well you can optimise the final code.
The first point tells you whether all this work is really worthwhile. On most CPUs, floating-point calculations take tens (if not hundreds) of clock cycles, and the only root they implement is the square root (i=2). Other roots need to be calculated using logarithms, which makes that operation even slower and this algorithm even faster (relatively). Also, on high i values, this algorithm's loop spans more numbers per iteration (in terms of x — this means that s grows faster), so the code quickly gets to a solution. But if your CPU can achieve a fast root operation, all the work described here is useless — just use that operation!
The second point (using double instead of int) can be avoided if you know enough math and know how to optimise the code. For instance, an FO(0.5) value of 3.25 can be transformed into 3, and a new counter be set up to subtract 1 from x on every 4 iterations. Depending on how fast floating-point (real) additions (and subtractions) are on your CPU, this may or may not be worth it.
The third point determines how far up the ladder of i you can climb before the algorithm is no longer fast enough. Some CPUs execute an integer multiply operation in one clock cycle. For those, it really won't matter much what the degree of FCO(R-0.5) is, since you just need a sequence of multiplications, and these are cheap enough (in terms of speed) on such CPUs.
Finally, the last point is entirely up to you. For i=2, for instance, the above code turns into the code I showed at the beginning of this document — with no ifs and a very tight loop. Only experience can help you with this stage — it is not the point of this document to give you a series of optimised algorithms for several values of i, especially since optimisation should be done in assembly, and variations of this language abound. I will however focus on the cube root to prepare you for such analysis.
As an exercise and example, let's create a cube root algorithm. For the cube root, we have i=3. Let us apply this value to the offset formula:

`sum_(k=1)^3 (3!r^(3-k)) / (k!(3-k)!) = (3!r^2)/(1!*2!) + (3!r^1)/(2!*1!) + (3!r^0)/(3!*0!) = 3r^2 + 3r + 1`

When you make r=0.5, you get the value 3.25 from this formula. This means that FO(0.5) = 3.25. Now for the changing of the offset formula:

`sum_(k=1)^2 (3!((r+1)^(3-k)-r^(3-k))) / (k!(3-k)!) = (3!((r+1)^2-r^2))/(1!*2!) + (3!((r+1)^1-r^1))/(2!*1!) = 6r + 6`

Here, replacing r by r-0.5, you get 6(r-0.5)+6 = 6r+3. This means FCO(R-0.5) = 6r+3. This gives us the generic algorithm
register double x, s;   /* double *is* needed */
register int r;
/* … */
r = 0;
if( x > 0 )
{
    s = 3.25;
    for( r=1; (x-=s)>0; r++ )
        s += 6*r+3;
}
Again, this algorithm only works with positive numbers: if you want it to work with negative numbers just reverse the sign of r if x was negative upon entry, and work with abs(x).
We had to use x and s as doubles because even though FCO(R-0.5) never returns a non-integer number, the start value FO(0.5) is not an integer. Therefore, our first step in optimising this code is to get rid of the double values, making the entire loop integer-based.
Do note that the following steps only apply to this particular root, and may not apply to others. For those not familiar with C, also note that x-=s means x=x-s, and is a valid expression returning the result of the subtraction.
The .25 decimal part of s never changes: s is incremented by 6r+3, which always returns integers if fed with integer values of r (which it is). The decimal part present in the start value of s does have an impact on each iteration of x, however.

Since .25 = ¼, it turns 1 on every 4 iterations of the loop. This means that 1 should be subtracted from x when its decimal part is .00, prior to the subtraction of s. Why prior to the subtraction? Well, because we would need to check whether this decrement would make x<0, and that never occurs at that point, saving us another comparison.

The decimal part of x is .00 at the start of the first iteration of the loop and at the start of every 4 iterations after that. Let's take a look at an example that starts at x=20 and s=3.25 (holding s at 3.25 to isolate the behaviour of the decimal part):
iteration   x after x-=s   int(x after x-=s)   int(x before) - int(x after)
    1          16.75             16                     4
    2          13.50             13                     3
    3          10.25             10                     3
    4           7.00              7                     3
    5           3.75              3                     4
    6           0.50              0                     3
    …
So you decrement x prior to the first iteration of the loop, and do that again prior to x-=s when the lower two bits of r equal a certain value. This eliminates the need for a counter — after all, r behaves much like a counter, and the lower two bits of a counter loop on every 4 counts, automatically. But which value to choose? Well, the value for iteration in the table above is the same as r inside the loop (r is incremented at the end of the loop, after everything else). Here you can see that you need to decrement x prior to iteration 1 and prior to iteration 5. This means iterations 0 and 4, or when the lower 2 bits of r = 00b.
That helps keep the values of s and x synchronised. Comparing both (in the subtraction x-=s) is now trivial except when they are equal in value. Unlike the code with double, we now cannot dismiss x-=s being equal to 0 as a sign for the loop to end, because even if their integer parts are equal, their decimal parts might not be, so we need a way to figure out what decimal part x had just after the subtraction. For that, let's take a better look at the loop iteration at that point:
lower 2 bits of r   x - int(x)   x-=s > 0?
      01b              .75          yes
      10b              .50          yes
      11b              .25          yes
      00b              .00          no
This is valid for the first 4 iterations of the loop, and for every 4 after that. Now, as this table shows, only when the lower 2 bits of r are 00b would the loop actually end in that iteration; otherwise one more iteration would occur. So all the code has to do in such a case (one more iteration) is increment r before returning.
Let's take a look at a different version of the code that takes all this into account:

register int x, s;
register int r;
/* … */
r = 0;
if( x > 0 )
{
    x--;
    s = 3;
    for( r=1; (x-=s)>=0; r++ )
    {
        if( x==0 )
        {
            if( (r & 3) != 0 )
                r++;
            break;
        }
        s += 6*r + 3;
        if( (r & 3) == 0 )   /* 3 == 11b */
            x--;
    }
}
Now all operations are of type int, but the code still only works with positive numbers. There is a double comparison of x with 0, but that can be easily fixed in assembly code. What we could still improve is the multiplication: it can be removed, making the code faster. This can be done by setting up a counter that starts at 6 and increments by 6 on every iteration. It will therefore hold 6r, since r is a counter that starts at 1 and increments by 1 on every iteration. This replaces a multiplication with an addition, a very welcome change.
One could also argue that the counter for r is no longer necessary, since we can determine r by dividing the multiplication counter by 6 when the loop ends. This is true, and you could even test bits 1 and 2 of this counter for 00b instead of testing bits 0 and 1 of r for the same value (I will not get into showing this). But the divide operation is still too expensive (24 to 40 clock cycles) versus what you save (1 clock cycle per loop iteration), if you want to pit this "soft-cube-root" against a "hard-cube-root". On some CPUs (and generally, if you don't have a hardware FPU available) it may be worth it, however.
Since the C code for that last change is trivial, let's move straight to an assembly version of this algorithm (assuming AX=x and BX=r, both unsigned integers):
        ; …
        mov  BX, 0      ; BX = r                        i486 cycles: 1
        sub  AX, 1      ;                                            1
        jb   done       ;                                            3/1
        mov  CX, 3      ; CX = s                                     1
        mov  DX, 0      ; DX = multiplication counter                1
loop1:  inc  BX         ;                                            1
        sub  AX, CX     ;                                            1
        jb   done       ;                                            3/1
        je   equal      ;                                            3/1
        add  DX, 6      ;                                            1
        add  CX, DX     ;                                            1
        add  CX, 3      ;                                            1
        test BX, 11b    ;                                            1
        jnz  loop1      ;                                            3/1
        dec  AX         ;                                            1
        jmp  loop1      ;                                            3
equal:  test BX, 11b    ;                                            1
        jz   done       ;                                            3/1
        inc  BX         ;                                            1
done:
Upon exit, BX holds the result, r. This code is optimised for speed on the i486, not size. Also, the test operations are not available on processors older than the i386, which means you would have to find an alternative using a temporary copy of BX, and using and.
Now let's write the floating-point code for the same processor. The Intel APX/IA-16/32 family of processors (i.e., the x86 up to and including the Pentium III) does not have a hardware operation for the root of any index. We could then use
`root(i) x = x^(1/i)`
but they don't even have an operation for raising any base to any power! They do have select logarithms and select bases raised to powers, so you can use the rule
`x = b^(log_(b) x)`
to build the formula
`x^(1/i) = b^(log_(b) x^(1/i))`
`x^(1/i) = b^(1/i log_(b) x)`
which returns the root of index i of x, given any real base, b.
Computers work in number base 2, so it makes sense that the logarithms and powers they have are of base 2. This is true for the i486 (and therefore we make b=2). The i486 has the following relevant operations:

`x*2^y,`            `y in NN_(0)`
`2^x - 1,`          `-1 <= x <= 1`
`y * log_(2) x,`    `x in RR_(0)^(+)`
`y * log_(2) (x+1),`    `0 <= x < (2-sqrt(2))/2`
Each line is a CPU operation: each performs more than one mathematical operation at once, which may or may not help us. Some are limited in range (the variables that are not listed in the limitations are not limited). One thing stands out in plain view — there is no operation to raise 2 (or any other base, for that matter) to an arbitrary power, and we need this! Fortunately you can use

`b^m * b^n = b^(m+n)`

to solve that problem using the first two operations shown above. For that you make m=int(z) (where z is the power you want to raise 2 to), and n=z-int(z). You give these values to the powers of each of the first two operations shown above, respectively, and that's it! Since int(z) is always an integer and z-int(z) is always within -1 and 1 (exclusively), regardless of the rounding method used, this is the final link to translate our root to:
`z = 1/i * log_(2) x`

`r = ((2^(z-"int"(z))-1)+1) * 2^("int"(z))`

We will be using rounding to the nearest integer in the code, because we need that type of rounding for the final result, and don't want to waste time changing the rounding type in the middle of a calculation. This means z-int(z) will actually be within -0.5 and 0.5 (inclusively). The final code then looks like:
ctrlword DW ?
inv_i    REAL8 1/i      ; replace i when assembling
one      REAL8 1

        ; Set rounding to nearest integer
        fstcw ctrlword
        and   ctrlword, 1111001111111111b
        fldcw ctrlword
        ; …
        fld   inv_i             ; i486 cycles:                    3
        fild  WORD PTR [BX]     ;                             13-16
        fyl2x                   ; ST(1) = (1/i) * log2 [BX]  196-329
        fld   ST                ; dup ST: ST(1) now = ST          4
        frndint                 ; ST = int(ST)                21-30
        fxch                    ; ST <-> ST(1)                    4
        fsub  ST, ST(1)         ; ST = ST - ST(1)         8-20 (*)
        f2xm1                   ; ST = 2^ST - 1              140-279
        fadd  one               ; ST++                        8-20
        fscale                  ; ST = ST * 2^ST(1)          30-32
        fistp WORD PTR [BX]     ;                             29-34
        ffree ST                ; (see next line)                 3
        fincstp                 ; pop the old ST(1)               3
;
; (*) at this point (after the fsub):
;     if z = (1/i) * log2 [BX], then
;     ST(1) = int(z)
;     ST    = z - int(z)
;     ST is within [-0.5 .. 0.5]
The operation f2xm1 will only work on the range [0 .. 0.5] on an 8087 FPU. On that processor you would need to check whether ST was negative; if so, you would feed its absolute value to f2xm1 and use the reciprocal of the resulting power of 2. The last two operations are required to remove from the stack a temporary calculation value that is not removed by the last operation, fscale. If you ignore the warnings the FPU can give you about the stack being full, then you can drop these two operations and save a little time.
The "fun" part about this code is that it actually raises the integer value stored at WORD PTR [BX] to the real power stored in inv_i. If you stored the value 8.25 in the latter location, for instance, this would still be the optimum code to raise an integer in WORD PTR [BX] to that power. To calculate a root of index i, you must store the inverse of i (as shown) in that location, prior to the first execution of the code. This can be done by the assembler itself. This also means that this same code applies to all roots other than the square root.
Let us now compare the speed of the algorithm versus this hardware-based code. We first show that for the soft-cube-root, the clock count is 5 if r=0, or

5×1 + (8×1+3)(r-1) + (-2+1+3)(¼)(r-1) + 5 = 11.5r - 1.5

if r>0, where r is again the calculated root. The clock count of the last 3 operations (between the labels "equal" and "done"), occurring on ¼ (25%) of the possible final values for r — or on a fraction

`(1/4 * sqrt(2^n))/(2^n) = 2^(-n/2-2)`

of random cube-root calculations of n-bit numbers — is negligible and therefore not added to this formula (e.g., for n=16 those operations run less than 0.1% of the time, and the value added to the formula would be less than 0.004). For the hard-cube-root, summing the cycle counts listed with the code, you have

3+13+196+4+21+4+8+140+8+30+29+3+3 = 462 (minimum)
to
3+16+329+4+30+4+20+279+20+32+34+3+3 = 777 (maximum)

clock cycles.
Therefore, the highest root range that can be calculated faster by soft-cube-root is

`11.5r_min - 1.5 = 462`  =>  `r_min ~~ 40.3`  =>  `x_min = r_min^3 ~~ 65451`
to
`11.5r_max - 1.5 = 777`  =>  `r_max ~~ 67.7`  =>  `x_max = r_max^3 ~~ 310289`

This means that soft-cube-root is faster than hard-cube-root for values of x between 0 and 65451~310289 — which is to say, always, for 16-bit values of x. Similar calculations for other APX/IA (x86) processors yield similar results (the best ratio being that of the i286 processor).

If you often need to calculate an integer cube root (of roughly 16- to 18-bit numbers or smaller), then you need this algorithm. It is a much more appealing option than even the soft-sqrt algorithm was for square roots!
This work can still be extended in at least two areas: precision and speed.
In terms of precision, you can figure out the math to somehow make use of the returned values (other than r) as remainders, so you can do more precise calculations. The application of this is limited, though.
You can also fine-tune the algorithm so that when the decimal part of the real result r is .5, it rounds the result to the nearest even number, and not always up as it does now. As previously stated, the square roots of integer numbers never have decimal parts of precisely .5, so this would be nothing more than an exercise for soft-sqrt. For other roots, such accuracy is in my opinion pointless since you're dealing with integer numbers after all and precision has gone out the window, but it would make the algorithm perhaps a bit more precise.
In terms of speed, you can always try and use a binary search (the type of search you use when you're looking for a word in a dictionary) if it proves quick enough to find the roots, or simply reverse the search direction for the root (starting at the maximum possible values for x and r and working your way backwards). This would mean that the bigger roots are searched for first, so a greater number of values of x are covered in the first few iterations, and on average a root is found sooner. The worst case is still a full search, though.
For instance, an Intel APX/IA assembly optimisation of this variation could be (assuming AX=x and BX=r, both unsigned integers, and n=16, where n stands for the number of bits being calculated):

        ; …
        not  AX         ; AX = (2^n)-1-AX           i486 cycles: 1
        mov  BX, 256    ; BX = 2^(n/2)                           1
        sub  AX, 254    ; AX = AX - ( (2^(n/2))-2 )              1
        jbe  done       ;                                        3/1
        inc  AX         ;                                        1
        shr  AX, 1      ;                                        2
loop1:  dec  BX         ;                                        1
        sub  AX, BX     ;                                        1
        ja   loop1      ;                                        3/1
done:
For n=16, this algorithm has a clock count of 6 if r=256, or

7+5×(256-r)-2 = 1285-5r

if r<256, which means it can calculate square roots resulting in 230~232 to 256 in the same time it takes to calculate hard-sqrt. This is a span of 12636 values of x (19.3%), compared to only 677 (1%) for the previously shown algorithm. The worst case is identical in both, however (or slightly slower in this one, but the difference is fixed and not proportional to r).
We will also take a look at the binary search variation in Part II of this document, but other work is left for the reader to do.