A simple and efficient parallel FFT algorithm using the BSP model
We present a new parallel radix-4 FFT algorithm based on the BSP model. Our parallel algorithm uses the group-cyclic distribution family, which makes it simple to understand and easy to implement. We show how to reduce the communication cost of the algorithm by a factor of 3, in the case that the input/output vector is in the cyclic distribution. We also show how to reduce computation time on computers with a cache-based architecture. We present performance results on a Cray T3E with up to 64 processors, obtaining reasonable efficiency levels for local problem sizes as small as 256 and very good efficiency levels for local sizes larger than 2048.
Inda, M. A, & Bisseling, R. H. (2001). A simple and efficient parallel FFT algorithm using the BSP model. Parallel Computing, 27, 1847–1878.