# Enhanced Vector Math Support on the Intel®AVX-512 Architecture 

Cristina S. Anderson<br>Intel Corporation<br>2111 NE 25th Ave<br>Hillsboro, OR 97124<br>cristina.s.anderson@intel.com

Jingwei Zhang<br>Intel Corporation<br>2111 NE 25th Ave<br>Hillsboro, OR 97124<br>jingwei.zhang@intel.com

Marius Cornea<br>Intel Corporation<br>2111 NE 25th Ave<br>Hillsboro, OR 97124<br>marius.cornea@intel.com


#### Abstract

The Intel®AVX-512 architecture adds new capabilities such as masked execution, floating-point exception suppression and static rounding modes, as well as a small set of new instructions for mathematical library support. These new features allow for better compliance with floating-point or language standards (e.g. no spurious floating-point exceptions, and faster or more accurate code for directed rounding modes), as well as simpler, smaller footprint implementations that eliminate branches and special case paths. Performance is also improved, in particular for vector mathematical functions (which benefit from easier processing in the main path, and fast access to small lookup tables). In this paper, we describe the relevant new features and their possible applications to floating-point computation. The code examples include a few compact implementation sequences for some common vector mathematical functions.


Index Terms-SIMD, vector mathematical function, floatingpoint

## I. INTRODUCTION

SIMD (single instruction, multiple data) architecture is an effective way to exploit data level parallelism on CPUs. The Intel $\circledR_{\circledR ~ A V X-512 ~ i n s t r u c t i o n ~ s e t ~ i s ~ t h e ~ l a t e s t ~ S I M D ~ e x t e n s i o n ~}^{\text {A }}$ for the x 86 instruction set architecture [1]. In addition to the register widening from 256 -bit to 512 -bit, there are architectural enhancements added in AVX-512 such as new types of SIMD instructions, conditional execution, and instruction embedded floating-point control. Here is the subset of new AVX-512 instructions discussed in this paper:

- Setting of special floating-point outputs (and exception signaling if specified)
- Cross-lane permutation with two sources
- Conversion of exponent of floating-point values to floating-point values and extraction of normalized mantissa from floating-point values
- Scaling by powers of 2, with support for special cases
- Enhanced reciprocal and reciprocal square root approximation
- Testing types of floating-point values
- Rounding to specific number of fractional bits and subsequential argument reduction
- Range restriction calculation

As a result, many vector mathematical functions can have compact, branch-free AVX-512 implementations, conforming to the IEEE Standard 754-2008 for Floating-Point Arithmetic
[2]. The following sections will describe each new feature in detail, and we will show short code sequences to illustrate usage in the implementation of common vector mathematical functions.

## II. NEW FEATURES OF INTEL® AVX-512 INSTRUCTION SET

## A. Static Rounding Modes and Suppress-all-exception Bit

The rounding mode for a given floating-point operation on the x86 architecture is usually specified dynamically in the floating-point environment control register, except for a small number of Intel® AVX-512 round and convert instructions, e.g. VROUNDPD. In these cases, the rounding mode can be specified statically using an instruction encoding attribute in the instruction opcode, which can override the default rounding mode stored in the control register. The static rounding mode override in AVX-512 also implies suppress-allexceptions (SAE), which treats all floating-point exceptions as masked, and in addition ensures that no floating-point status flag is changed. In the absence of SAE, there would have to be a trade-off between standard conformance [2], [3], and performance (additional steps or branches would be needed to ensure strict standard conformance). The static rounding mode feature is also used, e.g. round-toward-zero to avoid generating infinities in intermediate steps (an example will follow in Section E); a static round-to-nearest could also be useful in accuracy-sensitive applications.

## B. Handling of Special Cases

Intel® AVX-512 provides mask registers that can be used for conditional execution. Each AVX-512 instruction in EVEX encoding can take an extra operand called mask register, which contains the predicate bits to control the output of each vector element. The SIMD operation on a given element will be carried out only if its corresponding mask predicate bit is set, otherwise, the destination element is either preserved (mergemasking) or zeroed out (zeroing-masking.) Masked execution is an efficient way of handling special cases in the main path. As an example, consider

VADDPS dest $\{$ mask $\}\{z\}, \operatorname{src} 1,[s r c 2]\{1 t o 16\},\{r z-s a e\}$. This instruction will first load a 32-bit element from memory using the address in $s r c 2$, and will replicate that element 16
times to form a vector 512-bit wide, also known as broadcasting in AVX-512. Then, it will add that vector withsrc 1, element by element, and place the result into dest if the corresponding mask bit is 1 . Otherwise, it will write the value zero to dest as indicated by $\{z\}$ for zeroing-masking (the default is mergemasking.) The addition is performed in round-toward-zero mode, with SAE as indicated by $\{r z-s a e\}$.

There is also a new "fixup" instruction that can be used to set special outputs, and optionally raise Invalid or Divide-by-Zero exceptions (the status flags are set when exceptions are masked). However, in most cases masked execution is sufficient for handling special cases. Its format is:

VFIXUPIMMPS/PD dest $\{$ mask $\}\{z\}$, src1, src2, imm8.
The destination register dest contains the results of a function computation; these results must be correct for valid inputs in the function domain, but not necessarily for special inputs (e.g. the computation may have incorrectly produced a NaN output for an infinity input). VFIXUPIMM can be used to correct the outputs for special cases. The first source srcl is treated as a 32 -bit table ( 8 entries of 4 bits each), which defines the desired outputs for each of the following input classes: qNaN , sNaN , Zero, $+1.0, \pm \infty$, finite positive (and not 1.0), finite negative. A 4-bit table entry is read for each element, based on the class of the input src2. This table entry selects one of 16 possible outputs, including: unmodified destination (used when $\operatorname{src} 2$ is in the function domain and dest already contains the desired result), destination equal to the unmodified input (e.g. to correctly set $\operatorname{sqrt}( \pm 0)= \pm 0, \mathrm{qNaN}$ Indefinite, $\pm \infty$, $\pm 0$, and other values that can be useful for fixing up common functions, e.g. $\pi / 2$ for $\operatorname{atan}(+\infty)$. The 8 bits in the immediate field imm 8 can be set to raise either Invalid, or Divide-by-Zero when $\operatorname{src} 2$ is one of the following: $\pm \infty$, finite negative, sNaN , $+1.0, \pm 0$.

## C. Permute Instructions

The syntax for these instructions is:

- VPERMI2PS/PD dest $\{$ mask $\}\{z\}, s r c 1, s r c 2$.

Permute elements in $s r c l$ and $s r c 2$, using indices from dest; dest is overwritten by the results

- VPERMT2PS/PD dest $\{$ mask $\}\{z\}$, srcl, src2.

Permute elements in $s r c 2$ and dest, using indices from srcl; dest is overwritten by the results

- VPERMPS/PD dest $\{$ mask $\}\{z\}, \operatorname{srcl}, \operatorname{src} 2$.

Permute elements in src2 using indices in srcl; store results in dest
Lookup tables are frequently used in mathematical library design [4]; however, poor vector gather support limits their benefit for vector functions. A small table lookup can be implemented using one of the VPERM instructions. The size of the table would be 32 single or 16 double precision elements for VPERMI2 and VPERMT2, and 16 single or 8 double for VPERM. Larger table lookup can be implemented with a sequence of permute and blend instructions. These instructions can be a lot less expensive than hardware vector gather operation in term of performance.

## D. Exponent and Mantissa Extraction

The new VGETEXP instruction returns the unbiased exponent of the input, in floating-point format. It is implemented as $\operatorname{VGETEXP}(x)=$ floor $(\log 2(|x|))$ for all inputs, including subnormals and special cases. The associated instruction VGETMANT normalizes the mantissa to one of the four following intervals, according to bits in the immediate field: $[1,2),[0.5,1),[0.75,1.5)$, and $[0.5,2)$, where the last one takes into account whether the exponent is even or odd. Other bits in the immediate field can be used to allow VGETMANT to raise Invalid exception on negative inputs, and to specify whether the sign of the input should be preserved in the output.

VGETEXPPS/PD dest $\{$ mask $\}\{z\}$, srcl.
VGETMANTPS/PD dest $\{$ mask $\}\{z\}$, src1, imm8.
The following identity holds for these two instructions: $x=$ $2^{\operatorname{VGETEXP}(x)} * \operatorname{VGETMANT}(x, 0)$, where 0 in the immediate field means the mantissa is normalized to [1,2), the sign of $x$ is preserved, and no exception is raised.

In Example 1, $\log 2(x)$ approximation accurate to 21 bits is computed using piecewise polynomial interpolation. The polynomial coefficients for each of the 16 intervals considered are stored in separate tables, and accessed with VPERMPS instructions. The mantissa is normalized to $m_{x} \in[0.75,1.5)$, and the leading mantissa bits serve as table indices for the coefficient tables. The polynomial approximates $\log 2\left(m_{x}\right)=$ $\log 2\left(\left(m_{x}-1\right)+1\right)$.

The VGETMANT immediate field also requests that the Invalid exception be signaled when the input is negative; the VGETMANT result is set to qNaN Indefinite in that case. VGETMANT returns 1.0 for $\pm \infty$ and $\pm 0$, while VGETEXP returns $+\infty$ and $-\infty$ for $\pm \infty$ and $\pm 0$ respectively. Both instructions generate NaNs for NaN inputs. This ensures that special cases are treated correctly, without additional steps. The only standard requirement that is not met in Example 1 is signaling Divide-by-Zero for $\log 2( \pm 0)$, which could be solved by adding a VFIXUPIMM operation at the end.

```
; input, output in zmm0
vgetmantps zmm1, zmm0, Obh ; mantissa mx in [3/4, 3/2)
vgetexpps zmm2, zmm0 ; exponent
vpsrld zmm3, zmm1, 23-4 ; index for coefficient tables
vsubps zmm1, zmm1, [One]{1to16} ; reduction r=mx-1
vpermps zmm14, zmm3, ZMMWORD PTR [C4] ; c4
vpermps zmm13, zmm3, ZMMWORD PTR [C3] ; c3
vpermps zmm12, zmm3, ZMMWORD PTR [C2] ; c2
vpermps zmm11, zmm3, ZMMWORD PTR [C1] ; c1
vmulps zmm5, zmm1, zmm1 ; r^2
vfmadd213ps zmm14, zmm1, zmm13 ; c3+c4*r
vpermps zmm4, zmm3, ZMMWORD PTR [Exp] ; exponent adjustment
vfmadd213ps zmm12, zmm1, zmm11 ; c1+c2*r
vaddps zmm0, zmm2, zmm4 ; adjusted exponent k
; p = (c1+c2*r) +r^2*(c3+c4*r)
vfmadd213ps zmm14, zmm5, zmm12
vfmadd231ps zmm0, zmm1, zmm14 ; result = k + r*p
```

Example 1. SIMD single precision $\log 2(x)$ accurate to 21 bits

## E. Scaling by Powers of 2

The new instruction is
VSCALEFPD/PS dest $\{$ mask $\}\{z\}$, srcl, src2,
which computes dest $=\operatorname{src} 1 * 2^{\lfloor s r c 2\rfloor}$. Overflow and Underflow exceptions are treated as described in the IEEE Standard 7542008 [2], even though this is not an IEEE-defined operation.

Special case behavior is carefully defined, in a manner that ensures that VSCALEF can function as a last step as well as a fixup instruction in implementations of functions such as $\exp ()$ or pow () (meaning that no other steps are needed to set the correct outputs for special inputs, e.g. $\pm \infty$ ). A branch-free double precision $\exp ()$ computation accurate to 50 bits is shown in Example 2. $\exp (x)$ is formed as $\exp (x-$ $\left.Z_{0} \log (2)\right) * \exp \left(Z_{0} \log (2)\right)=e^{R_{*}} * 2^{Z_{0}}=e^{R_{*}} * 2^{Z_{0}-\left\lfloor Z_{0}\right\rfloor} * 2^{\left\lfloor Z_{0}\right\rfloor}$, where $R=x-Z_{0} \log (2)$ and $Z_{0}$ is $x / \log (2)$ rounded down to 4 fractional bits. $e^{R}$ is calculated through polynomial approximation, $2^{Z_{0}-\left\lfloor Z_{0}\right\rfloor}$ through table-lookup, and a final VSCALEF instruction is used to scale their product by $2^{\lfloor Z 0\rfloor}$ and to generate the final result. VSCALEF also yields correct results for special $\exp ()$ cases ( $\pm \infty$ and NaNs ). Infinity inputs to our computation sequence generate an intermediate NaN result, which is then passed as the first argument to VSCALEF. However, $\operatorname{VSCALEF}$ is defined so that $\operatorname{VSCALEF}(\mathrm{qNaN},+\infty)$ $=+\infty$, and $\operatorname{VSCALEF}(\mathrm{qNaN},-\infty)=0$. Inputs that are very large in magnitude are also handled successfully in the main path: spurious overflow to infinity (which would later lead to an intermediate NaN result) is avoided by executing $x / \log (2)$ in round-toward-zero rounding mode, and other unwanted Overflow/Underflow exceptions in the polynomial computation are avoided by forcing the reduced argument to a limited range. By ensuring a finite and positive result for the $\exp (R) * 2^{\mathrm{frac}\left(Z_{0}\right)}$ computation, the final VSCALEF instruction can correctly generate overflow/underflow results for all out-of-range inputs. Spurious status flags (such as Invalid for infinity inputs) are avoided by using SAE for the affected operations.

```
; input, output in zmm0
; 2^\((52-4) * 1.5+x * \log 2(e), R Z\) mode
vmovapd zmm3, ZMMWORD PTR [Log2E]
vmovapd zmm4, ZMMWORD PTR [Shifter]
vfmadd213pd zmm3, zmm0, zmm4, \{rz-sae\}
; \(Z 0 \sim\) ~ \(x * \log 2(e)\), rounded down to 4 fractional bits
vsubpd zmm5, zmm3, zmm4
; \(R=x-z 0 * \log (2)\), SAE on
vfnmadd231pd zmm0, zmm5, [Ln2Hi]\{1to16\}, \{rn-sae\}
vfnmadd231pd zmm0, zmm5, [Ln2Lo]\{1to16\}, \{rn-sae\}
; Table lookup: \(T{ }^{\sim}=2.0^{\wedge} \mathrm{f}\), where \(\mathrm{f}=\mathrm{ZO}\) - floor(Z0)
vmovapd zmm10, ZMMWORD PTR [ExpTbl]
vpermt2pd zmm10, zmm3, ZMMWORD PTR [ExpTbl+64]
; mask exponent bit to ensure \(|\mathrm{R}|<2\) even for special cases
vandpd zmm0, zmm0, [EMask]\{1to16\}
<Polynomial computation (of degree 6) omitted where
P \((z m m 11)^{\sim}=\exp (R)-1>\)
vfmadd213pd zmm11, zmm10, zmm10; T+T*P
vscalefpd zmm0, zmm11, zmm5 ; \(2^{\wedge}\) floor \((Z 0) *(T+T * P)\)
```

Example 2. SIMD double precision $\exp (x)$ accurate to 50 bits
Some other examples of VSCALEF, VGETMANT, VGETEXP usage are:

- Software division, where $a / b=m_{a} 2^{e_{a}} / m_{b} 2^{e_{b}}=$ $\left(m_{a} / m_{b}\right) * 2^{e_{a}-e_{b}}$ with $m_{x}=\operatorname{VGETMANT}(x, 0), e_{x}=$ $\operatorname{VGETEXP}(x)$, and the final scaling through VSCALEF. This reduction allows for a branch-free implementation of division, which covers overflow, underflow, and also special inputs (zeroes, infinities, subnormals). $\left|m_{x}\right|$ is in $[1,2)$ for all non-NaN inputs. $m_{a} / m_{b}$ can be computed to the desired accuracy with Newton-Raphson iterations. The SAE feature can help ensure spurious flag settings
do not occur. Flags can be set correctly as part of the computation (except for Divide-by-Zero, which may require an additional step).
- $x^{\alpha}$, where $\alpha$ is constant (e.g. $\alpha=1 / 3$ for the cube root function). The basic reduction for this computation is: $x^{\alpha}=\left(m_{x} 2^{e_{x}}\right)^{\alpha}=\left(m_{x}\right)^{\alpha} 2^{e_{x} \alpha}=$ $\left(m_{x}\right)^{\alpha} 2^{e_{x} \alpha-\left\lfloor e_{x} \alpha\right\rfloor} 2^{\left\lfloor e_{x} \alpha\right\rfloor}$.


## F. Approximation Instructions

Approximations for the reciprocal and reciprocal square root functions provided by the following instructions can be refined easily to higher accuracy, and can often be used to get software implements of division, reciprocal, square root, and reciprocal square root with better throughput performance than hardware.

VRCP14PS/PD dest $\{$ mask $\}\{z\}$, srcl.
VRSQRT14PS/PD dest $\{$ mask $\}\{z\}$, src 1 .
The new instructions have an accuracy of at least 14 bits (the relative error of the approximation is less than $2^{-14}$ ), and treat subnormals correctly (unlike the legacy instructions $R C P P S$ and RSQRTPS). They are offered in both single and double precision, where double precision computations benefit the most, since double precision reciprocal and reciprocal square root approximation instructions were not available before. The increased accuracy (from 11.5 bits to 14 bits) is sufficient to eliminate one Newton-Raphson iteration from a 50-52 bit reciprocal calculation.

For some mathematical functions, a rounded VRCP14PD result can be used in place of an expensive reciprocal table lookup. The same technique could be used before via the legacy $R C P P S$, however, less efficient due to the lack of $R C P P D$. Examples of functions that can benefit are logarithm functions (discussed in more detail in Section $\mathrm{H})$ and the cube root functions. For cube root, $\left(m_{x}\right)^{1 / 3}$ with $m_{x} \in[1,2)$ can be rewritten as $\left(m_{x}\right)^{1 / 3}=\left(m_{x} *\right.$ $\left.R C P\left(m_{x}\right)\right)^{1 / 3} *\left(R C P\left(m_{x}\right)\right)^{-1 / 3}=\left(1+\left(m_{x} * R C P\left(m_{x}\right)-\right.\right.$ 1) $)^{1 / 3} *\left(R C P\left(m_{x}\right)\right)^{-1 / 3}=(1+r)^{1 / 3} *\left(R C P\left(m_{x}\right)\right)^{-1 / 3}$, where $R C P\left(m_{x}\right)$ is the reciprocal approximation of $m_{x}$ rounded to a fixed number of fraction bits using VRCP14 and VRNDSCALE instructions. $(1+r)^{1 / 3}$ is calculated with polynomial approximation of $r$ and $\left(R C P\left(m_{x}\right)\right)^{-1 / 3}$ is usually obtained by table lookup.

## G. Testing of Floating-Point Categories

This is achieved with the new instruction VFPCLASS:
VFPCLASSPS/PD dest \{mask\}\{z\}, srcl, imm8.
VFPCLASS checks whether the input belongs to any of the special floating-point classes specified in the immediate field imm8. The instruction sets a bit in the dest mask for each element in the input register, to indicate whether it is in any one of the special classes it was checked against. The different classes that can be checked for are $\mathrm{sNaN}, \mathrm{qNaN},+0$, $-0,+\infty,-\infty$, subnormal, and finite negative value. The user can check for one or more of these classes, by setting the corresponding bits in the imm8 field. VFPCLASS is used to detect special cases so they can be directed to a special path, or alternatively, handled with masked operations in the main path. Two examples are shown below.

```
; input, output in zmm0
vrcp14pd zmm1, zmm0 ; R0 ~ = 1/x
vmovapd zmm4, zmm0 ; Copy x
vfnmadd213pd zmm0, zmm1, [One]{1to16}, {rn-sae} ; e0=1-x*R0
vfpclasspd k2, zmm1, leh ; x +/-Inf or +/-0?
vfmadd213pd zmm0, zmm1, zmm1, {rn-sae} ; R1 = R0+R0*e0
knotw k3, k2 ; non-special mask (k3 = NOT k2)
vfnmadd213pd zmm4, zmm0, [One]{1to16}, {rn-sae} ; e1=1-x*R1
; special cases: return RCP14(x) if k2=1
vblendmpd zmm0 {k2}, zmm0, zmm1
; return result of computation if k3=1
vfmadd213pd zmm0 {k3}, zmm4, zmm0, {rn-sae} ; R1+R1*e1
```

Example 3. SIMD double precision $1 / x$ accurate to 52 bits

1) Reciprocal Sequence, Square Root Sequence: The reduced argument for the $1 / x$ computation is $e=1-x *$ $\operatorname{VRCP} 14(x)$. This expression evaluates to NaN when $x$ is $\pm 0$ or $\pm \infty$, as VRCP14 returns the correct result for these special cases. VFPCLASS allows the programmer to set a mask value for $x= \pm 0$ or $\pm \infty$, and to clear the mask for all other $x$. One can then use this mask to select between the $V R C P 14$ output (result for special cases), or the result of a reciprocal refinement computation starting with VRCP14 (for typical inputs). Example 3 shows a double precision reciprocal computation that follows the steps outlined above. In a similar manner, a square root computation based on VRSQRT14 would use VFPCLASS to create a mask for $x= \pm 0$ or $x=+\infty$ (the result is equal to the input in these special cases).
2) Power function: The main path of pow $(x, y)=2^{y \log 2(x)}$ does not work when $x \leq 0, x=\infty$ or NaN, or $y=$ $\infty$ or NaN. One VFPCLASS can be used to set xSpecialMask to 1 for $x \leq 0$ or $x=\infty$ or NaN. A second VFPCLASS would be used to set ySpecialMask to 1 for $y=\infty$ or NaN. A branch to a secondary path is taken when either mask is set.

## H. Enhanced Rounding and Fraction Computation

The new instructions used for this purpose are VRNDSCALE and VREDUCE:

VRNDSCALEPS/PD dest $\{$ mask $\}\{z\}$, srcl, imm8,
VREDUCEPS/PD dest \{mask\}\{z\}, srcl, imm8.
The VRNDSCALE instruction rounds the input to integer, plus $M$ fraction bits, with result stored in floating-point format. The operation of $V R N D S C A L E(s r c, i m m 8)$ is equivalent to roundToInt $\left(s r c \cdot 2^{M}\right) 2^{-M}$, performed as if with unbounded exponent (i.e. there is no overflow), where $M$ is an integer retrieved form the upper 4 bits of imm 8 and the lower 4 bits of imm8 specifies rounding and exception reporting controls similar to the legacy VROUND instruction. The operation of $\operatorname{VREDUCE}(\mathrm{src}, \mathrm{imm})$ is equivalent to $s r c-V R N D S C A L E(s r c$, imm8). VRNDSCALE can be used in argument reduction step for the $\log ()$ function. The argument reduction for $\log (x)$, where $1 \leq x<2$, can be done like this:

```
y = RCP14(x); // y is in [0.5, 1]
y0}=\operatorname{RNDSCALE (y, k<<4); // y0 has k mantissa bits
R = x*y0-1; // |R|<2^(-14)+2^(-k)
```

We have now $\log (x)=-\log \left(y_{0}\right)+\log (1+R)$, where $\log (1+R)$ can be computed via polynomial approximation,
and $\log \left(y_{0}\right)$ can be retrieved from a lookup table of $2^{k-1}+1$ elements.
The most significant benefit of VREDUCE is latency reduction in common transcendental functions such as $\exp 2()$ and pow(). Uses in other transcendental functions such as $\operatorname{atan}()$ are also possible. Example 4 shows such usage in a single precision $\exp 2()$ implementation. As with other code examples, all inputs are handled correctly in the main path (including special cases.)

```
; input, output in zmm0, K = floor(x)
vreduceps zmm3, zmm0, 0x41 ; reduction r=x - K.b1b2b3b4
vaddps zmm1, zmm0, [C]{1to16}, {rd-sae} ; C=1.5*2^{23-4}
vbroadcastps zmm2, [c3] ; c3
vfmadd213ps zmm2, zmm3, [C2]{1to16} ; c3*r+c2
vpermps zmm4, zmm1, ZMMWORD PTR [T] ;table for 2^(.b1b2b3b4)
vmulps zmm1, zmm4, zmm3 ; T*r
vfmadd213ps zmm2, zmm3, [c1]{1to16} ; c3*r^2+c2*r+c1
vfmadd213ps zmm2, zmm1, zmm4 ; T+T*r*(c3*r^2+c2*r+c1)
vscalefps zmm0, zmm2, zmm0 ; scale by 2^K
```

Example 4. SIMD single-precision $\exp 2(x)$ accurate to 22 bits

## I. Range Restriction Calculation

The range restriction calculation instructions VRANGEP$S / P D$ in AVX-512 can be seen as an extension of the legacy MAXPS/PD and MINPS/PD instructions. The new instructions add a control field imm 8 to select a comparison operation out of minNum, maxNum, minNumMag, and maxNumMag according to the IEEE Standard 754-2008 [2], using imm8[1:0] and the sign of the output using imm8[3:2]. The instruction format is:

VRANGEPS/PD dest $\{$ mask $\}\{z\}, \operatorname{src} 1, \operatorname{src} 2$, imm8.
An example of using such an instruction other than in the IEEE comparison operations mentioned above, could be in the Fast2Sum [5] algorithms, when the relative order of two numbers to be summed is not known in advance.
The Fast2Sum algorithm below requires $|a| \geq|b|$, which could be achieved by using VRANGE instruction to select $a=$ $\operatorname{maxNumMag}(a, b)$ and $b=\operatorname{minNumMag}(a, b)$.

$$
\begin{aligned}
& s=(a+b)_{r n} \\
& z=(s-a)_{r n} \\
& t=(b-z)_{r n}
\end{aligned}
$$

## III. PERFORMANCE EVALUATION AND CONCLUSIONS

Most common functions in the vector mathematical library from the Intel® Math Kernel Library were optimized for AVX512 , using the new features presented in this paper, with their performance numbers published online [6].

Throughput-oriented sequences such as those used in the software-pipelined vector mathematical library typically achieve the highest speedup ratios with AVX-512 instructions. This is no surprise, since the new instructions were primarily designed to reduce the total number of operations in numerical computations; the effect on latency is less pronounced. It is also not unexpected that higher accuracy sequences (requiring more complex computations) tend to benefit more from the new instructions. Implementations using simple arithmetic
evaluations (such as polynomials) and no bit manipulations have little opportunity for further optimization.

TABLE I

| Function | Single precision |  | Double precision |  |
| :---: | :---: | :---: | :---: | :---: |
| AVX-512 new instructions | w/o | with | w/o | with |
| exp2 | .83 | .59 | 1.68 | 1.13 |
| exp | .84 | .67 | 1.69 | 1.33 |
| $\log 2$ | 1.07 | .81 | 2.55 | 2.31 |
| log | .98 | .82 | 2.46 | 2.30 |
| pow | 5.30 | 3.49 | 9.45 | 6.74 |
| cbrt | 2.61 | 1.34 | 3.47 | 3.07 |

Table I compares the performance of selected mathematical functions having ulp errors of at most 4 ulps (units-in-the-last-place), with and then without using the new AVX-512 instructions. The performance numbers are in clocks per element, for vectors of 1024 elements (the SIMD code are not unrolled), on Intel® Xeon® Platinum 8180 CPU @ 2.50 GHz .

## REFERENCES

[1] Intel® Architecture Software Developers Manual, https://software.intel. com/en-us/articles/intel-sdm.
[2] 754-2008 - IEEE Standard for Floating-Point Arithmetic http://standards. ieee.org/findstds/standard/754-2008.html.
[3] ISO/IEC 9899:2011 - Programming languages C https://www.iso.org/ standard/57853.html.
[4] Ping Tak Peter Tang, "Table-driven implementation of the logarithm function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, 16(4):378-400, 1990.
[5] T. J. Dekker, "A floating-point technique for extending the available precision," Numerische Mathematik, 18(3):224-242, 1971.
[6] Intel® Math Kernel Library 2018 Vector Mathematics Performance and Accuracy Data https://software.intel.com/sites/products/documentation/ doclib/mkl/vm/vmdata.htm.

