A RADIX-16 FFT ALGORITHM SUITABLE FOR MULTIPLY-ADD INSTRUCTION
BASED ON GOEDECKER METHOD

Daisuke Takahashi

Institute of Information Sciences and Electronics, University of Tsukuba
1-1-1 Tennodai, Tsukuba-shi, Ibaraki 305-8573, Japan
daisuke@is.tsukuba.ac.jp

ABSTRACT

A radix-16 fast Fourier transform (FFT) algorithm suitable for multiply-add instruction is proposed. The proposed radix-16 FFT algorithm requires fewer floating-point instructions than the conventional radix-16 FFT algorithm on processors that have a multiply-add instruction. Moreover, this algorithm has the advantage of fewer loads and stores than either the radix-2, 4 and 8 FFT algorithms or the split-radix FFT algorithm. We use Goedecker’s method to obtain an algorithm for computing radix-16 FFT with fewer floating-point instructions than the conventional radix-16 FFT algorithm. The number of floating-point instructions for the proposed radix-16 FFT algorithm is compared with those of conventional power-of-two FFT algorithms on processors with multiply-add instruction.

1. INTRODUCTION

For computing an $N = 2^n$-point FFT, radix-2, 4, 8 and 16 FFT algorithms and split-radix FFT algorithms have been proposed [1, 2, 3, 4, 5].

Until several years ago, floating-point addition was faster than floating-point multiplication on most processors. For this reason, FFT algorithms that reduced real multiplications, e.g., the Winograd Fourier transform algorithm (WFTA) [6] and the prime factor FFT algorithm (PFA) [7, 8], have been intensively studied. These algorithms show an advantage over processors that require more time for multiplication than addition. Today, floating-point multiplication is as fast as floating-point addition on the latest processors. Moreover, many processors have a multiply-add instruction.

As for related works, Linzer and Feig [9] have shown radix-2, 4 and 8 FFT algorithms and split-radix FFT algorithm for fused multiply-add architectures. These FFT algorithms are based on the Cooley-Tukey FFT algorithm [1] and the split-radix FFT algorithm [3, 4]. On the other hand, radix-2, 3, 4 and 5 FFT algorithms on computers with overlapping multiply-add instructions have been proposed by Goedecker [10] and Karner et al. [11].

The higher radices are more efficient in terms of both memory and floating-point operations. A high ratio of floating-point instructions to memory operations is particularly important in a cache-based processor. In view of the high ratio of floating-point instructions to memory operations, the radix-16 FFT is more advantageous than the radix-2, 4 and 8 FFTs. Thus the FFTW [12], is known as one of the fastest FFT libraries for many processors, contains a radix-16 FFT routine.

Although higher radix FFTs require more floating-point registers to hold intermediate results, some processors have sufficient floating-point registers (e.g., Intel Itanium processor has 128 floating-point registers [13]).

However, efficient implementations of a radix-16 FFT algorithm suitable for multiply-add instruction have not yet been presented.

In this paper, we propose a radix-16 FFT algorithm suitable for multiply-add instruction based on Goedecker method.

Throughout this paper, we use a multiply-add instruction, which computes $d = \pm a \pm b \pm c$ where $a$, $b$, $c$ and $d$ are floating-point registers. Also, we assume that an addition, a multiplication, or a multiply-add each requires one machine cycle on processors that have a multiply-add instruction. We will call any of these computations a floating-point instruction, and assign a unit cost to each.

2. A RADIX-16 FFT ALGORITHM SUITABLE FOR MULTIPLY-ADD INSTRUCTION

The DFT of $N$ points is given by

$$X_k = \sum_{n=0}^{N-1} x_n W_N^k \quad k = 0, \cdots, N - 1$$

(1)

where $W_N = \exp(-j2\pi/N)$, $X_k$ and $x_n$ are sequences of complex numbers.

An FFT kernel [10, 14] calculates the innermost part in a transformation, which has the form [10]

$$Z_{out}(k) = \sum_{n=0}^{P-1} Z_{in}(n)\Omega^{n} W_P^k$$

(2)

for $k = 0, 1, \cdots, P - 1$. The radix of the kernel is given by the prime factor $P$ which is 16 in this paper. $\Omega^n$ is called the twiddle factor and $W_P = \exp(-j2\pi/P)$.

In the radix-$P$ FFT kernel, an input data $Z_{in}(n)$ multiplied by the twiddle factor $\Omega^n$ is performed with “short DFT” [15].

The adaptability of the conventional radix-16 FFT algorithm on processors that have a multiply-add instruction is here discussed.

The conventional radix-16 FFT (decimation-in-time) is split into the first step and the remaining part. The first step is the complex multiplication of $Z_{in}(n) \times \Omega^n \quad (n = 1, 2, \ldots, 15)$. Then 15 complex multiplications are necessary. We assume that a complex multiplication in the $(1, j)$ plane is done with four real multiplications and two real additions. In the first step, since the ratio of
multiplications to additions is two to one, the addition unit cannot be exploited on processors with multiply-add instruction. Then a 16-point DFT is performed in the remaining part. Since the conventional radix-16 FFT has 24 real multiplications and 144 real additions in the remaining part, the multiply unit also cannot be exploited. As a result, the conventional radix-16 FFT algorithm has 220 floating-point instructions on processors with multiply-add instruction.

We conclude that the conventional radix-16 FFT algorithm is therefore not suitable for the multiply-add instruction.

As we mentioned in the above, the multiply-add unit cannot be exploited in the conventional radix-16 FFT algorithm. We will make full use of the multiply-add unit to transform the conventional radix-16 FFT.

Goedecker’s method [10] consists of repeated transformations of the expression:

$$ax + by \rightarrow a(x + (b/a)y) \quad (3)$$

where $a \neq 0$.

Applying repeated transformations of (3) to the conventional radix-16 FFT, a radix-16 FFT algorithm suitable for multiply-add instruction is obtained.

The proposed radix-16 FFT algorithm suitable for multiply-add instruction is shown in Fig. 1. Here, the real part of the array $Z_{in}$ is denoted by $z_{in}$, the imaginary part by $z_{ini}$, and correspondingly for $Z_{out}$. The real part and imaginary part of the twiddle factor $\Omega^r$ are $\tau_{in}$ and $\tau_{ini}$, respectively.

In the proposed radix-16 FFT algorithm, a table for twiddle factors of $\tau_1 = \tau_{in}$, $\tau_{15}$, $\tau_{17}$, $\cos \theta_{81} = \cos \theta_{81}$, and $\tau_{181} = \tau_{282}$ is needed. Since these values can be computed in advance, the overhead of making the table is negligible.

The proposed radix-16 FFT algorithm has only 174 floating-point instructions on processors with multiply-add instruction. We can see that the multiply-add unit can be exploited in the proposed radix-16 FFT algorithm from Fig. 1.

### 3. EVALUATION

In order to evaluate the effectiveness of power-of-two FFT algorithms, we compare the number of floating-point instructions, loads, and stores.

The number of floating-point instructions for various FFT algorithms of $N$ points with multiply-add instruction is shown in Tables 1 and 2. The proposed radix-16 FFT algorithm asymptotically saves about 21% of the floating-point instructions over the conventional radix-16 FFT on processors that have a multiply-add instruction.

In comparison with the Linzer and Feig radix-4 and 8 FFT algorithms [9], the proposed radix-16 FFT algorithm asymptotically saves about 1% of the floating-point instructions. On the other hand, the proposed radix-16 FFT algorithm asymptotically increases about 2% of the floating-point instructions over the Linzer and Feig split-radix FFT algorithm [9].

The number of loads, stores, floating-point instructions used for various FFT butterflies is given in Table 3. In calculating the number of loads and the number of stores, we assume that enough registers are available to perform an entire butterfly in the registers without using any intermediate stores or loads.

In Table 4, the asymptotic number of loads, stores, and floating-point instructions used by each algorithm is given. We will make full use of the multiply-add unit to transform the conventional radix-16 FFT.

### Table 1. Number of Floating-Point Instructions for FFT Algorithms of $N$ Points with Multiply-Add Instruction

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>52</td>
<td>52</td>
<td>52</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>16</td>
<td>144</td>
<td>144</td>
<td>144</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>32</td>
<td>372</td>
<td>372</td>
<td>372</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>64</td>
<td>920</td>
<td>928</td>
<td>912</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>128</td>
<td>2164</td>
<td>2164</td>
<td>2164</td>
<td>6016</td>
<td>5056</td>
</tr>
<tr>
<td>256</td>
<td>5008</td>
<td>5008</td>
<td>5008</td>
<td>25488</td>
<td>5056</td>
</tr>
<tr>
<td>512</td>
<td>11632</td>
<td>11380</td>
<td>11380</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>1024</td>
<td>25944</td>
<td>25488</td>
<td>25488</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>2048</td>
<td>56436</td>
<td>56436</td>
<td>56436</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>4096</td>
<td>126296</td>
<td>126832</td>
<td>123792</td>
<td>152512</td>
<td>125408</td>
</tr>
</tbody>
</table>

### Table 2. Number of Floating-Point Instructions for FFT Algorithms of $N$ Points with Multiply-Add Instruction

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>52</td>
<td>52</td>
<td>52</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>16</td>
<td>144</td>
<td>144</td>
<td>144</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>32</td>
<td>372</td>
<td>372</td>
<td>372</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>64</td>
<td>920</td>
<td>928</td>
<td>912</td>
<td>160</td>
<td>144</td>
</tr>
<tr>
<td>128</td>
<td>2164</td>
<td>2164</td>
<td>2164</td>
<td>6016</td>
<td>5056</td>
</tr>
<tr>
<td>256</td>
<td>5008</td>
<td>5008</td>
<td>5008</td>
<td>25488</td>
<td>5056</td>
</tr>
<tr>
<td>512</td>
<td>11632</td>
<td>11380</td>
<td>11380</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>1024</td>
<td>25944</td>
<td>25488</td>
<td>25488</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>2048</td>
<td>56436</td>
<td>56436</td>
<td>56436</td>
<td>56436</td>
<td></td>
</tr>
<tr>
<td>4096</td>
<td>126296</td>
<td>126832</td>
<td>123792</td>
<td>152512</td>
<td>125408</td>
</tr>
</tbody>
</table>
Table 3. Number of Loads, Stores, and Floating-Point Instructions for General Butterflies Used in FFT Algorithms with Multiply-Add Instruction. The Number of Loads Does Not Include Loading All of the Constants

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Loads</th>
<th>Stores</th>
<th>Floating-point instructions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Linzer and Feig radix-4 [9]</td>
<td>8</td>
<td>8</td>
<td>22</td>
</tr>
<tr>
<td>Linzer and Feig radix-8 [9]</td>
<td>16</td>
<td>16</td>
<td>66</td>
</tr>
<tr>
<td>Linzer and Feig split-radix [9]</td>
<td>8</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td>Conventional radix-16</td>
<td>32</td>
<td>32</td>
<td>220</td>
</tr>
<tr>
<td>Proposed radix-16</td>
<td>32</td>
<td>32</td>
<td>174</td>
</tr>
</tbody>
</table>

Table 4. Number of Loads, Stores, and Floating-Point Instructions Divided by $N \log_2 N$ Used by FFT Algorithms to Compute an $N$ Point DFT. Lower Order Terms Have Been Omitted

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Loads</th>
<th>Stores</th>
<th>Floating-point instructions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Linzer and Feig radix-4 [9]</td>
<td>1</td>
<td>1</td>
<td>11/4</td>
</tr>
<tr>
<td>Linzer and Feig radix-8 [9]</td>
<td>2/3</td>
<td>2/3</td>
<td>11/4</td>
</tr>
<tr>
<td>Conventional radix-16</td>
<td>1/2</td>
<td>1/2</td>
<td>55/16</td>
</tr>
<tr>
<td>Proposed radix-16</td>
<td>1/2</td>
<td>1/2</td>
<td>87/32</td>
</tr>
</tbody>
</table>

4. CONCLUSION

A radix-16 FFT algorithm suitable for multiply-add instruction has been presented. We reduced the number of floating-point instructions necessary for a radix-16 FFT algorithm by maximizing the use of multiply-add instructions.

The proposed radix-16 FFT algorithm requires fewer floating-point instructions than the conventional radix-16 FFT algorithm on processors that have a multiply-add instruction.

If the FFTs are being computed on a machine that has enough registers to perform an entire radix-16 FFT algorithm, those FFTs will use fewer loads and stores than the radix-2, 4 and 8 FFT algorithms or the split-radix FFT algorithm.

5. REFERENCES

/* twiddle factors can be
computed in advance */
\[ r_{1} = \sin(\theta) \]
\[ r_{2} = \cos(\theta) \]
\[ v_{1} = r_{1} + r_{2} \]
\[ v_{2} = r_{1} - r_{2} \]
\[ s_{1} = 1 + r_{1} + r_{2} \]
\[ s_{2} = 1 - r_{1} - r_{2} \]
\[ u_{1} = 1 + i \cdot r_{1} + i \cdot r_{2} \]
\[ u_{2} = 1 - i \cdot r_{1} - i \cdot r_{2} \]
\[ s_{3} = 1 + r_{1} - r_{2} \]
\[ s_{4} = 1 - r_{1} + r_{2} \]
\[ u_{3} = 1 + i \cdot r_{1} - i \cdot r_{2} \]
\[ u_{4} = 1 - i \cdot r_{1} + i \cdot r_{2} \]
\[ s_{5} = 1 + r_{1} + r_{2} \]
\[ s_{6} = 1 - r_{1} + r_{2} \]
\[ u_{5} = 1 + i \cdot r_{1} - i \cdot r_{2} \]
\[ u_{6} = 1 - i \cdot r_{1} + i \cdot r_{2} \]
\[ s_{7} = 1 + r_{1} - r_{2} \]
\[ s_{8} = 1 - r_{1} + r_{2} \]
\[ u_{7} = 1 + i \cdot r_{1} + i \cdot r_{2} \]
\[ u_{8} = 1 - i \cdot r_{1} - i \cdot r_{2} \]
\[ s_{9} = 1 + r_{1} - r_{2} \]
\[ s_{10} = 1 - r_{1} + r_{2} \]
\[ u_{9} = 1 + i \cdot r_{1} + i \cdot r_{2} \]
\[ u_{10} = 1 - i \cdot r_{1} - i \cdot r_{2} \]
\[ s_{11} = 1 + r_{1} + r_{2} \]
\[ s_{12} = 1 - r_{1} + r_{2} \]
\[ u_{11} = 1 + i \cdot r_{1} - i \cdot r_{2} \]
\[ u_{12} = 1 - i \cdot r_{1} + i \cdot r_{2} \]
\[ s_{13} = 1 + r_{1} - r_{2} \]
\[ s_{14} = 1 - r_{1} + r_{2} \]
\[ u_{13} = 1 + i \cdot r_{1} + i \cdot r_{2} \]
\[ u_{14} = 1 - i \cdot r_{1} - i \cdot r_{2} \]
\[ s_{15} = 1 + r_{1} + r_{2} \]
\[ s_{16} = 1 - r_{1} - r_{2} \]
\[ u_{15} = 1 + i \cdot r_{1} - i \cdot r_{2} \]
\[ u_{16} = 1 - i \cdot r_{1} + i \cdot r_{2} \]

ror = r1 + v9 * cr28
s0 = v1 + v9 * cr28
r1 = u1 - u9 * cr28
s1 = v1 - v9 * cr28
r2 = u5 + u13 * cr31
s2 = v5 + v13 * cr31
r3 = v5 - v13 * cr31
s3 = u13 * cr31 - u5
zoutu(0) = r0 + r2 * cr181
zoutu(1) = s0 + s2 * cr181
zoutu(2) = r1 + r3 * cr181
zoutu(3) = s1 + s3 * cr181
zoutu(4) = r2 + r4 * cr181
zoutu(5) = s2 + s4 * cr181
zoutu(6) = r3 + r5 * cr181
zoutu(7) = s3 + s5 * cr181
zoutu(8) = r4 + r6 * cr181
zoutu(9) = s4 + s6 * cr181
zoutu(10) = r5 + r7 * cr181
zoutu(11) = s5 + s7 * cr181
zoutu(12) = r6 + r8 * cr181
zoutu(13) = s6 + s8 * cr181
zoutu(14) = r7 + r9 * cr181
zoutu(15) = s7 + s9 * cr181
zoutu(16) = r8 + r10 * cr181
zoutu(17) = s8 + s10 * cr181
zoutu(18) = r9 + r11 * cr181
zoutu(19) = s9 + s11 * cr181
zoutu(20) = r10 + r12 * cr181
zoutu(21) = s10 + s12 * cr181
zoutu(22) = r11 + r13 * cr181
zoutu(23) = s11 + s13 * cr181
zoutu(24) = r12 + r14 * cr181
zoutu(25) = s12 + s14 * cr181
zoutu(26) = r13 + r15 * cr181
zoutu(27) = s13 + s15 * cr181
zoutu(28) = r14 + r16 * cr181
zoutu(29) = s14 + s16 * cr181
zoutu(30) = r15 + r17 * cr181
zoutu(31) = s15 + s17 * cr181
zoutu(32) = r16 + r18 * cr181
zoutu(33) = s16 + s18 * cr181
zoutu(34) = r17 + r19 * cr181
zoutu(35) = s17 + s19 * cr181
zoutu(36) = r18 + r20 * cr181
zoutu(37) = s18 + s20 * cr181
zoutu(38) = r19 + r21 * cr181
zoutu(39) = s19 + s21 * cr181
zoutu(40) = r20 + r22 * cr181
zoutu(41) = s20 + s22 * cr181
zoutu(42) = r21 + r23 * cr181
zoutu(43) = s21 + s23 * cr181
zoutu(44) = r22 + r24 * cr181
zoutu(45) = s22 + s24 * cr181
zoutu(46) = r23 + r25 * cr181
zoutu(47) = s23 + s25 * cr181
zoutu(48) = r24 + r26 * cr181
zoutu(49) = s24 + s26 * cr181
zoutu(50) = r25 + r27 * cr181
zoutu(51) = s25 + s27 * cr181
zoutu(52) = r26 + r28 * cr181
zoutu(53) = s26 + s28 * cr181
zoutu(54) = r27 + r29 * cr181
zoutu(55) = s27 + s29 * cr181
zoutu(56) = r28 + r30 * cr181
zoutu(57) = s28 + s30 * cr181
zoutu(58) = r29 + r31 * cr181
zoutu(59) = s29 + s31 * cr181
zoutu(60) = r30 + r32 * cr181
zoutu(61) = s30 + s32 * cr181
zoutu(62) = r31 + r33 * cr181
zoutu(63) = s31 + s33 * cr181

Fig. 1. Proposed radix-16 FFT algorithm suitable for multiply-add instruction.