# The Fast Fourier Transform Intuition

##### π A Project Cauchy Op-ed

Named after French mathematician Augustin-Louis Cauchy, Project Cauchy column is where I invite some of the HFI Programming club members to provide neat proofs or explanations about some number theory puzzles. I highly suggest that you read these articles with a pencil and paper so you can sketch things out and scribble solutions to exercises as you come across them. This week, I will continue to explore the proof of correctness of Cooley-Tukey's FFT Algorithm and its merits over ordinary DFT.

The cover image is from xkcd: Fourier. The title text is: "That cat has some serious periodic components".

Cueball has applied this Fourier transform to his cat. Although it seems to still be alive and possibly even unharmed, it is clearly not in its familiar shape, and it is not clear if this condition is permanent or not.

"Periodic components" in the title text refers to the spikes in the graph. Because sine waves repeat themselves as you go along, the presence of large amounts of one particular sine wave in the Fourier transform graph (each spike) shows that the overall result (the initial graph) is likely to have parts that also repeat themselves, like a periodic function. In other words, the cat has repeating parts. Β―_(γ)_/Β―

## The Story So Far

A month ago we saw how researchers and engineers switch between real space and frequency space using Discrete Fourier Transform (DFT); however, if there were ever an algorithm to radically change the landscape of computer science and engineering by making seemingly impossible problems possible, it would be the Fast Fourier Transform (FFT).

On the surface, the algorithm seems like a simple application of recursion, and in principle, it is; however, it is such ingenious usage of recursion that allows people bounce back and forth between the two spaces. Nowadays, from calculating superfluid vortex positions to super-resolution imaging, Fast Fourier Transforms lay at the heart of many scientific discipline.

## The Cooley-Tukey Algorithm

### Raison D'Γͺtre

Vanilla DFTs are notoriously slow, but in order that you see how slow it actually is, I record the time it takes for vanilla DFT, Cooley-Tukey Algorithm, and Iterative Cooley-Tukey Algorithm to handle an array of 4096 elements, respectively. Notice that the second prompt executes much faster than the first one.

Simply put, the Discrete Fourier Transform is a beautiful application of complex number systems; however, rarely would it be used, were it not for the ability to quickly perform the operation with Fast Fourier Transform, first introduced by the great Frederick Gauss in 1805 and later independently discovered by James Cooley and John Tukey in 1965 1. Gauss, of course, already had way too many concepts named after him and Cooley and Tukey both had cooler names, so the most common algorithm for FFTs today is coined the Cooley-Tukey algorithm.

### Proof of Correctness

The Discrete Fourier Transform (DFT) can be written as follows:

$\mathbf{X}[k] = \mathbf{F_N}(\mathbf{x}[p]) = \sum_{p=0}^{N-1}\mathbf x[p]\mathbf W^{kp}_N, \ \mathbf W = \exp(-\frac {2 \pi i} N)$

To determine the DFT of a discrete signal $x[p]$ (where $N$ is the size of its domain), we multiply each of its value by $\mathbf e$ raised to some function of $p$. We then sum the results obtained for a given $p$. If we used a computer to calculate the Discrete Fourier Transform of a signal, it would need to perform N (multiplications) x N (additions) = $O(N^2)$ operations.

Suppose, we separated the Fourier Transform into even and odd indexed sub-sequences, namely $X[2k]$ and $X[2k+1]$, and after performing a bit of algebra, we end up with the summation of two terms. The advantage of this approach lies in the fact that the even and odd indexed sub-sequences can be computed concurrently.

\begin{aligned} X[2k] & = \sum_{p=0}^{N-1} x[p]W^{2kp}_N = \sum^{N-1}_{p=0} x[p](e^{\frac {2 \pi i} N})^{2kp}\\ & = \sum_{p=0}^{N-1} x[p]W^{kp}_{\frac N 2}\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2} + \sum_{p=N/2}^{N-1} x[p]W^{kp}_{N/2}\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2} + \sum_{p=0}^{N/2-1} x[p+N/2]W^{k(p+N/2)}_{N/2}\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2} + \sum_{p=0}^{N/2-1} x[p+N/2]W^{kp}_{N/2}\\ & = \sum_{p=0}^{N/2-1} (x[p] +x[p+N/2])W^{kp}_{N/2}\\ & = F_{N/2}(x[p] +x[p+N/2])\\ \\ X[2k+1] & = \sum_{p=0}^{N-1} x[p]W^{(2k+1)p}_N = \sum_{p=0}^{N-1} x[p]W^{2kp}_NW^p_N\\ & = \sum_{p=0}^{N-1} x[p]W^{2kp}_NW^p_N\\ & = \sum_{p=0}^{N-1} x[p]W^{kp}_{N/2}W^p_N\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2}W^p_N + \sum_{p=N/2}^{N-1} x[p]W^{kp}_{N/2}W^p_N\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2}W^p_N + \sum_{p=0}^{N/2-1} x[p+N/2]W^{k(p+N/2)}_{N/2}W^{p+N/2}_N\\ & = \sum_{p=0}^{N/2-1} x[p]W^{kp}_{N/2}W^p_N - \sum_{p=0}^{N/2-1} x[p+N/2]W^{kp}_{N/2}W^{p}_N\\ & = \sum_{p=0}^{N/2-1} (x[p]- x[p+N/2])W^{kp}_{N/2}W^{p}_N\\ & = F_{N/2}(x[p] -x[p+N/2])W^p_N\\ \end{aligned}

As those who are familiar with asymptotic analysis may notice, the FFT reduces the number of computations needed for a problem of size $N$ from $O(N^2)$ to $O(NlogN)$. On the surface, this might not seem like a big deal. However, when $N$ grows large it can make a world of difference, as shown in the following table.

$N$ $N^2$ $Nlog_2N$
$1 \times 10^3$ $1 \times 10^6$ $1 \times 10^4$
$1 \times 10^6$ $1 \times 10^{12}$ $2 \times 10^7$
$1 \times 10^{9}$ $1 \times 10^{18}$ $3 \times 10^{10}$

Say it took 1 nanosecond to perform one operation. It would take the Fast Fourier Transform algorithm approximately 30 seconds to compute the Discrete Fourier Transform for a problem of size $N = 10^9$. In contrast, the regular algorithm would need several decades.

\begin{aligned} 1 \times 10^{18}\ ns & \to 31.2\ years\\ 3 \times 10^{10}\ ns & \to 30\ seconds \end{aligned}

### Implementation

As mentioned in the text, the Cooley-Tukey algorithm may be implemented either recursively or non-recursively, with the recursive method being much easier to implement. I would ask that you implement either the recursive or non-recursive methods (or both, if you feel so inclined).

In the end, the code looks like:

#### Python

def cooley_tukey(X):
N = len(X)
if N <= 1:
return X
even = cooley_tukey(X[0::2])
odd = cooley_tukey(X[1::2])

temp = [i for i in range(N)]
for k in range(N // 2):
temp[k] = even[k] + exp(-2.0j * pi * k / N) * odd[k]
temp[k + N // 2] = even[k] - exp(-2.0j * pi * k / N) * odd[k]
return temp

#### C

void cooley_tukey(double complex *X, const size_t N) {
if (N >= 2) {
double complex tmp [N / 2];
for (size_t i = 0; i < N / 2; ++i) {
tmp[i] = X[2*i + 1];
X[i] = X[2*i];
}
for (size_t i = 0; i < N / 2; ++i) {
X[i + N / 2] = tmp[i];
}

cooley_tukey(X, N / 2);
cooley_tukey(X + N / 2, N / 2);

for (size_t i = 0; i < N / 2; ++i) {
X[i + N / 2] = X[i] - cexp(-2.0 * I * M_PI * i / N) * X[i + N / 2];
X[i] -= (X[i + N / 2]-X[i]);
}
}
}

As a side note, we are enforcing that the array must be a power of 2 for the operation to work. This is a limitation of the fact that we are using recursion and dividing the array in 2 every time; however, if your array is not a power of 2, you can simply pad the leftover space with 0's until your array is a power of 2.

The above method is a perfectly valid FFT; however, it is missing the pictorial heart and soul of the Cooley-Tukey algorithm: Butterfly Diagrams.

## Butterfly Diagrams

Visualization of an FFT with Butterfly Diagrams helps us keep track of each element in the array. As mentioned, an FFT must ultimately perform several DFTs in their base case. This means that we are still to perform the following operation,

$\mathbf{X}[k] = \mathbf{F_N}(\mathbf{x}[p]) = \sum_{p=0}^{N-1}\mathbf x[p]\mathbf W^{kp}_N, \ \mathbf W = \exp(-\frac {2 \pi i} N).$

However, after shuffling the initial array (by bit reversing or recursive subdivision), we perform the matrix multiplication of the $\mathbf W$ terms in pieces, and each step, we use the appropriate term. For instance, imagine we only need to transform a 2-element array. We can use this Radix-2 Butterfly,

to represent the operation,

$b[0] = a[0] + \mathbf W^0_2 a[1]\\ b[1] = a[0] + \mathbf W^1_2 a[0]\\.$

However, it turns out that the $\mathbf W$ term of the second half is the opposite of the first half ($\mathbf W^0_2 = -\mathbf W^1_2$), since $\mathbf W^{kp}_N = e^{-2\pi ikp/N}$, which means,

$b[0] = a[0] + \mathbf W^0_2 a[1]\\ b[1] = a[0] - \mathbf W^0_2 a[0]\\.$

By swapping out the second $\mathbf W$ value in this way, we can save a good amount of space and computation. Now imagine we need to combine more elements. In this case, we start with simple butterflies, as shown above, and then sum butterflies of butterflies. For example, if we have 8 elements, this might look like this:

Ultimately, that's all I want to say about Fourier Transforms for now, but the story doesn't end here, as there are so many Fourier-related Transforms like DCTs out there. We'll definitely come back to this at some point, so let us know what you liked and didn't like and we can go from there!

## Example Code

Note that I do not claim that this is the most efficient way to implement the Cooley-Tukey method, so if you have a better way to do it, feel free to implement it that way.

#### Python

from random import random
from cmath import exp, pi
from math import log2

def dft(X):
N = len(X)
temp = [0] * N
for i in range(N):
for k in range(N):
temp[i] += X[k] * exp(-2.0j * pi * i * k / N)
return temp

def cooley_tukey(X):
N = len(X)
if N <= 1:
return X
even = cooley_tukey(X[0::2])
odd = cooley_tukey(X[1::2])

temp = [i for i in range(N)]
for k in range(N // 2):
temp[k] = even[k] + exp(-2.0j * pi * k / N) * odd[k]
temp[k + N // 2] = even[k] - exp(-2.0j * pi * k / N) * odd[k]
return temp

def bit_reverse(X):
N = len(X)
temp = [i for i in range(N)]
for k in range(N):
b = sum(1 << int(log2(N)) - 1 -
i for i in range(int(log2(N))) if k >> i & 1)
temp[k] = X[b]
temp[b] = X[k]
return temp

def iterative_cooley_tukey(X):
N = len(X)

X = bit_reverse(X)

for i in range(1, int(log2(N)) + 1):
stride = 2 ** i
w = exp(-2.0j * pi / stride)
for j in range(0, N, stride):
v = 1
for k in range(stride // 2):
X[k + j + stride // 2] = X[k + j] - v * X[k + j + stride // 2]
X[k + j] -= X[k + j + stride // 2] - X[k + j]
v *= w
return X

X = []

for i in range(4096):
X.append(random())

Y = cooley_tukey(X)
Z = iterative_cooley_tukey(X)
T = dft(X)

#### C

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fftw3.h>

void fft(double complex *x, int n) {
double complex y[n];
memset(y, 0, sizeof(y));
fftw_plan p;

p = fftw_plan_dft_1d(n, (fftw_complex*)x, (fftw_complex*)y,
FFTW_FORWARD, FFTW_ESTIMATE);

fftw_execute(p);
fftw_destroy_plan(p);

for (size_t i = 0; i < n; ++i) {
x[i] = y[i] / sqrt((double)n);
}
}

void dft(double complex *X, const size_t N) {
double complex tmp[N];
for (size_t i = 0; i < N; ++i) {
tmp[i] = 0;
for (size_t j = 0; j < N; ++j) {
tmp[i] += X[j] * cexp(-2.0 * M_PI * I * j * i / N);
}
}

memcpy(X, tmp, N * sizeof(*X));
}

void cooley_tukey(double complex *X, const size_t N) {
if (N >= 2) {
double complex tmp [N / 2];
for (size_t i = 0; i < N / 2; ++i) {
tmp[i] = X[2*i + 1];
X[i] = X[2*i];
}
for (size_t i = 0; i < N / 2; ++i) {
X[i + N / 2] = tmp[i];
}

cooley_tukey(X, N / 2);
cooley_tukey(X + N / 2, N / 2);

for (size_t i = 0; i < N / 2; ++i) {
X[i + N / 2] = X[i] - cexp(-2.0 * I * M_PI * i / N) * X[i + N / 2];
X[i] -= (X[i + N / 2]-X[i]);
}
}
}

void bit_reverse(double complex *X, size_t N) {
for (int i = 0; i < N; ++i) {
int n = i;
int a = i;
int count = (int)log2((double)N) - 1;

n >>= 1;
while (n > 0) {
a = (a << 1) | (n & 1);
count--;
n >>= 1;
}
n = (a << count) & ((1 << (int)log2((double)N)) - 1);

if (n > i) {
double complex tmp = X[i];
X[i] = X[n];
X[n] = tmp;
}
}
}

void iterative_cooley_tukey(double complex *X, size_t N) {
bit_reverse(X, N);

for (int i = 1; i <= log2((double)N); ++i) {
int stride = pow(2, i);
double complex w = cexp(-2.0 * I * M_PI / stride);
for (size_t j = 0; j < N; j += stride) {
double complex v = 1.0;
for (size_t k = 0; k < stride / 2; ++k) {
X[k + j + stride / 2] = X[k + j] - v * X[k + j + stride / 2];
X[k + j] -= (X[k + j + stride / 2] - X[k + j]);
v *= w;
}
}
}
}
int main() {
srand(time(NULL));
double complex x[4096], y[4096], z[4096];
for (size_t i = 0; i < 4096; ++i) {
x[i] = rand() / (double) RAND_MAX;
y[i] = x[i];
z[i] = x[i];
}

fft(x, 4096);
cooley_tukey(y, 4096);
iterative_cooley_tukey(z, 4096);

return 0;
}