The optimized compilation of FFTW3 (version 3.3.10) is explored in this post. Even though some questions remain to be answered, the procedure below currently seems to be the optimal one. Tested on NHPC101 (OpenSUSE Leap 15.3), Dec. 29, 2022.

FFTW is a portable library for the fast Fourier transform (FFT), which is useful in many applications. One of its important applications in DFT is transforming and diagonalizing matrices, such as the Fock and overlap matrices, between reciprocal and real space (I am not very clear about the details at the current stage), so it is a common prerequisite of DFT codes. It also helps solve the long-range Coulomb interactions in molecular dynamics; for example, the KSPACE package of LAMMPS requires FFT as well.

Environment

Compiler: GCC 11.3.0
MPI: MPICH 4.0.2

Paths to the binary executables, header files and libraries of GCC and MPICH are added to the environment variables ${PATH}, ${INCLUDE} and ${LD_LIBRARY_PATH}, respectively. To use the GCC compilers, set CC=gcc, CXX=g++ and FC=gfortran (the latter two are probably not needed since FFTW3 is written in C). Using environment modules is recommended.
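The environment setup described above can be sketched as follows. This is only a minimal sketch: the module names in the comment are assumptions (they are site-specific), and the loop merely reports whether each tool is visible in ${PATH}.

```shell
#!/bin/sh
# With environment modules, something like the line below would load the
# toolchain first. The module names are hypothetical and differ per cluster:
#   module load gcc/11.3.0 mpich/4.0.2

# Select the GCC compilers for ./configure
export CC=gcc CXX=g++ FC=gfortran

# Report where each tool is found, without aborting if one is missing
for tool in "$CC" mpicc; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: not found in PATH"
    fi
done
```

If mpicc is reported as missing, the MPI build of FFTW3 is unlikely to configure, so it is worth fixing the environment before going further.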

Compilation

Minimal procedures

Download the source code from its website. The compilation of FFTW3 follows the traditional make procedures, so the minimal installation steps are:

  
$ tar -zxvf fftw-3.3.10.tar.gz
$ cd fftw-3.3.10
$ ./configure
$ make
$ make install

That gives a serial, double-precision version of FFTW3. To optimize the compilation, extra configure options are needed. The command ./configure --help lists the available command-line flags.

Explanation of options

The following example builds executables and libs optimized for an Intel i7-7700 CPU, covering single and double precision, static and dynamic libs, MPI, OpenMP and threading.

Firstly, build the single-precision libs:

  
$ ./configure --prefix=PREFIX --enable-single --enable-shared=yes --enable-static=yes --enable-mpi --enable-openmp --enable-threads --enable-sse --enable-sse2 --enable-avx --enable-avx2 --enable-fma 
$ make
$ make install

Then, clean the object files from the previous step, keep the other settings and build the double-precision libs:

  
$ make clean
$ ./configure --prefix=PREFIX --enable-shared=yes --enable-static=yes --enable-mpi --enable-openmp --enable-threads --enable-sse2 --enable-avx --enable-avx2 --enable-fma 
$ make
$ make install
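The two passes can also be collected in one small script. The sketch below is a dry run: it only prints the commands, so they can be reviewed before anything is executed. The function name print_build_commands and the default PREFIX are assumptions made for illustration; the configure options are the ones used above.

```shell
#!/bin/sh
# Dry run: prints the commands for both passes instead of executing them.
# PREFIX is a hypothetical install location; adjust it to your system.
PREFIX="${PREFIX:-$HOME/opt/fftw-3.3.10}"

# Options shared by the single- and double-precision passes
COMMON="--prefix=$PREFIX --enable-shared=yes --enable-static=yes --enable-mpi --enable-openmp --enable-threads --enable-sse2 --enable-avx --enable-avx2 --enable-fma"

print_build_commands() {
    for precision in single double; do
        case "$precision" in
            single)
                # Only the single-precision pass takes these two options
                extra=" --enable-single --enable-sse"
                ;;
            double)
                extra=""
                # Remove the single-precision objects before reconfiguring
                echo "make clean"
                ;;
        esac
        echo "./configure $COMMON$extra"
        echo "make"
        echo "make install"
    done
}

print_build_commands
```

Once the printed commands look right, the output can be piped to sh from inside the fftw-3.3.10 directory.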

By default, make install performs a global installation under ‘/usr/local/’, which requires sudo. Alternatively, the --prefix=PREFIX option installs the executables, headers and libs at a user-specified location.

The double-precision version is compiled without --enable-single. Some codes support the single-precision libs to trade precision for speed.

Static libs ending with ‘.a’ are generated by default. Dynamic libs ending with ‘.so’ are generated with --enable-shared. Programs linked against dynamic libs take less space, since the library code is loaded at run time instead of being copied into each executable. It is recommended to build dynamic libs against the dynamic FFTW3 libs and static libs against the static ones. Mixing the two should be possible, but another flag, -fPIC, has to be added. Not tested.

--enable-mpi is valid only if MPI has been loaded in the environment. --enable-openmp and --enable-threads build the OpenMP and POSIX-threads versions of the library; they rely on compiler and system support rather than on MPI, and both are widely available on modern systems.
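After installing to a custom prefix, the new location has to be made visible to the shell and the linker. A minimal sketch follows, assuming a hypothetical PREFIX under the home directory; both ‘lib’ and ‘lib64’ are added because the library directory depends on the system.

```shell
#!/bin/sh
# PREFIX is a hypothetical install location; use the one given to ./configure
PREFIX="${HOME:-/tmp}/opt/fftw-3.3.10"

# Executables installed by FFTW3 go to PREFIX/bin
export PATH="$PREFIX/bin:$PATH"

# Shared libs: prepend both candidates, keeping any existing value
export LD_LIBRARY_PATH="$PREFIX/lib64:$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "PATH starts with: ${PATH%%:*}"
echo "LD_LIBRARY_PATH:  $LD_LIBRARY_PATH"
```

These lines would typically go into a modulefile or a shell startup file, so that codes built against this FFTW3 find it at run time.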

Different CPU architectures support different sets of instructions. FFTW3 can optimize its performance according to specific CPU architectures during compilation. To get the supported sets of instructions of the current server, use the following command:

  
$ cat /proc/cpuinfo | grep flags | uniq
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Note: --enable-sse requires a single-precision compilation.
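Picking the SIMD options by eye from a long flag list is error-prone, so the mapping can be scripted. The helper below, fftw_simd_options, is hypothetical (it is not part of FFTW3) and only covers the five options used in this post; it skips --enable-sse for double precision, following the note above.

```shell
#!/bin/sh
# fftw_simd_options is a hypothetical helper, not part of FFTW3.
# $1: space-separated CPU flags, $2: "single" or "double".
fftw_simd_options() {
    cpu_flags=" $1 "     # pad with spaces so whole words can be matched
    precision="$2"
    opts=""
    for isa in sse sse2 avx avx2 fma; do
        case "$cpu_flags" in
            *" $isa "*)
                # --enable-sse is only valid for single precision
                if [ "$isa" = "sse" ] && [ "$precision" != "single" ]; then
                    continue
                fi
                opts="$opts --enable-$isa"
                ;;
        esac
    done
    echo "${opts# }"
}

# Flags of the first core; empty on systems without /proc/cpuinfo
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
echo "single: $(fftw_simd_options "$cpu_flags" single)"
echo "double: $(fftw_simd_options "$cpu_flags" double)"
```

Matching on whole words matters here: a plain substring search for sse would also hit ssse3 and sse4_1.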

Available libs

In ‘PREFIX/lib64’ or ‘PREFIX/lib’, 4 versions are generated, with names ending in nothing, ‘_mpi’, ‘_omp’ and ‘_threads’; the names are self-explanatory. Libs named ‘libfftw3’ are the double-precision ones and libs named ‘libfftw3f’ are the single-precision ones. Both sets can be installed under the same directory but need 2 makes as instructed above.
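To check that both passes installed everything, the expected names can be generated and looked up. This is a sketch under assumptions: fftw_lib_names is a hypothetical helper, PREFIX is a hypothetical location, and the threads variant is taken to be named libfftw3_threads as in the FFTW documentation.

```shell
#!/bin/sh
# fftw_lib_names is a hypothetical helper, not part of FFTW3.
fftw_lib_names() {
    for precision in "" f; do                  # "" = double, "f" = single
        for variant in "" _mpi _omp _threads; do
            echo "libfftw3${precision}${variant}"
        done
    done
}

# PREFIX is a hypothetical install location; adjust it to your system.
PREFIX="${PREFIX:-$HOME/opt/fftw-3.3.10}"

for name in $(fftw_lib_names); do
    found="missing"
    for dir in "$PREFIX/lib64" "$PREFIX/lib"; do
        # Accept either the static or the dynamic lib
        if [ -e "$dir/$name.a" ] || [ -e "$dir/$name.so" ]; then
            found="$dir"
        fi
    done
    echo "$name: $found"
done
```

When linking a program, the parallel variant goes before the base lib, for example -lfftw3_omp -lfftw3 -lm for the double-precision OpenMP version.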

Problems with Intel compilers

Compilation with the Intel OneAPI compilers (versions 2022.1.2.146 and 2023.0.0) always fails with an error message about the unknown flag -ansi-alias. This flag seems to be a universal one, but the error only occurs when compiling the MPI version, so the Intel compilers should be OK if only the serial version is needed, i.e., the --enable-mpi, --enable-openmp and --enable-threads options should be avoided (it is not clear whether the latter two work without the first one). Neither the classic C/C++ compiler, CC=icc, nor the modern DPC++/C++ one, CC=icx, addresses this problem. Compiling with the more recent MPICH 4.0.2 + Intel DPC++/C++, rather than the built-in Intel MPI, did not solve the problem either. This section might be updated if more hints are available.

Further Reading

Structure and usage of clusters

Compile ONETEP and run quality tests

LAMMPS compilation and integration with Python