
Hyak HPC#

Just keeping track of any useful tips, tricks, and hard-fought successful build scripts I find for working with UW’s compute clusters, Klone and Tillicum.

GPU Klone Nodes#

In the aaplasma account we don’t have direct access to any GPU-enabled partitions, but there’s always the checkpoint partition. To simply get an interactive session for a single node with a GPU attached, we can:

salloc -A aaplasma --partition=ckpt-all --gpus-per-node=1  --mem=24G --time=2:00:00

Unfortunately, the nodes are a real mix of different GPU architectures, so you can’t tell in advance what sort of device you’re going to get that way. As of January 2026, these seem to be the different flavors available on the public checkpoint partition:

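| GPU | Architecture | CUDA Arch | Klone restriction | Memory per GPU | GPU/node | # nodes |
| --- | --- | --- | --- | --- | --- | --- |
| P100 | Pascal | 6.0 | p100 | 16GB HBM2 | 4 | 2 |
| 2080 Ti | Turing | 7.5 | 2080ti | 11GB GDDR6 | 8 | 10 |
| RTX 6000 | Turing | 7.5 | rtx6k | 48GB GDDR6 | 8 | 11 |
| A40 | Ampere | 8.6 | a40 | 48GB GDDR6 | 8 | 32 |
| A100 | Ampere | 8.0 | a100 | 40GB HBM2 | 8 | 8 |
| L40 | Lovelace | 8.9 | l40 / l40s | 48GB GDDR6 | 8 | 15 / 25 |
| H200 | Hopper | 9.0 | h200 | 141 (not allowed on checkpoint) | 8 | 8 |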
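
Since an unconstrained request can land on any of these, it helps to check which flavor a session actually got. Here is a minimal sketch: gpu_tag_from_name is a hypothetical helper (not part of any Hyak tooling) that maps the product name reported by nvidia-smi onto the Klone restriction names in the table. The name patterns are my assumption about how the driver reports each card, so adjust them if a node prints something different.

```bash
# Hypothetical helper: map a GPU product name to the Klone restriction tag.
# The patterns below are assumptions about the names nvidia-smi reports.
gpu_tag_from_name() {
    local name
    # Lowercase the name so the patterns can ignore case
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *p100*)     echo p100 ;;
        *2080*ti*)  echo 2080ti ;;
        *rtx*6000*) echo rtx6k ;;
        *a100*)     echo a100 ;;
        *a40*)      echo a40 ;;
        *l40s*)     echo l40s ;;   # must be tested before the plain l40 pattern
        *l40*)      echo l40 ;;
        *h200*)     echo h200 ;;
        *)          echo unknown ;;
    esac
}

# Inside an interactive GPU session, something like this should print the tag:
#   gpu_tag_from_name "$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
```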

To get a sense of just what these cards can do, I’ve used this exceptionally useful OpenCL benchmark demo: https://github.com/ProjectPhysX/OpenCL-Benchmark

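| Metric | P100 | 2080 Ti | RTX 6000 | A40 | A100 | L40 |
| --- | --- | --- | --- | --- | --- | --- |
| Compute Units | 56 | 68 | 72 | 84 | 108 | 142 |
| Clock (MHz) | 1328 | 1545 | 1770 | 1740 | 1410 | 2490 |
| Cores | 3584 | 4352 | 4608 | 10752 | 6912 | 18176 |
| VRAM (MB) | 16269 | 10820 | 24021 | 45488 | 81151 | 45457 |
| Global Cache (KB) | 1344 | 2176 | 2304 | 2352 | 3024 | 3976 |
| FP64 (TFLOPs/s) | 3.508 | 0.490 | 0.561 | 0.575 | 9.557 | 1.391 |
| FP32 (TFLOPs/s) | 8.634 | 15.753 | 17.944 | 35.403 | 19.234 | 84.753 |
| FP64/FP32 | 1/3 | 1/32 | 1/32 | 1/64 | 1/2 | 1/64 |
| FP16 (TFLOPs/s) | 17.573 | 31.114 | 35.632 | 36.845 | 72.486 | 88.687 |
| INT64 (TIOPs/s) | 1.199 | 3.400 | 3.588 | 2.848 | 2.652 | 3.696 |
| INT32 (TIOPs/s) | 3.119 | 15.274 | 17.475 | 18.467 | 19.416 | 44.439 |
| INT16 (TIOPs/s) | 8.871 | 12.920 | 14.607 | 16.685 | 17.616 | 35.483 |
| INT8 (TIOPs/s) | 1.765 | 53.973 | 60.096 | 64.745 | 71.156 | 103.977 |
| Mem Read coalesced (GB/s) | 544.50 | 535.62 | 557.27 | 581.17 | 1514.39 | 713.57 |
| Mem Write coalesced (GB/s) | 595.68 | 538.19 | 594.67 | 549.21 | 1804.67 | 474.25 |
| Mem Read misaligned (GB/s) | 303.00 | 491.31 | 559.38 | 568.02 | 1219.07 | 725.41 |
| Mem Write misaligned (GB/s) | 90.52 | 149.62 | 170.52 | 177.77 | 209.38 | 240.57 |
| PCIe Send (GB/s) | 10.97 | 4.97 | 4.85 | 4.57 | 5.70 | 21.92 |
| PCIe Receive (GB/s) | 8.96 | 5.04 | 5.27 | 4.56 | 8.24 | 21.94 |
| PCIe Bidirectional (GB/s) | 9.59 | 4.82 | 4.96 | 7.10 | 6.54 | 20.65 |
| PCIe Generation | Gen4 x16 | Gen3 x16 | Gen3 x16 | Gen3 x16 | Gen3 x16 | Gen4 x16 |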

Checking GPU availability#

The sinfo command given in the Hyak documentation (https://hyak.uw.edu/docs/gpus/gpu_start/#gpu-jobs) lists the nodes with GPUs and their usage status, but the output is hard to take in at a glance. I’ve got a big awk command to parse it and make it a bit easier to read:

gpucount bash command
# Show available GPU counts by type on the ckpt-all partition
gpucount() {
    sinfo -p ckpt-all -O nodehost,cpusstate,freemem,gres,gresused -S nodehost \
  | grep -v null \
  | tail -n +2 \
  | awk '
{
    # Parse GRES column: gpu:<type>:<total>
    split($4, gres, ":")
    gpu_type = gres[2]
    total = gres[3] + 0

    # Parse GRES_USED column: gpu:<type>:<used>(IDX:...)
    split($5, used_parts, ":")
    # used count is the number before "(" in the third field
    split(used_parts[3], used_num, "(")
    used = used_num[1] + 0

    # Skip nodes that are entirely down/offline (every CPU in the O state of A/I/O/T)
    split($2, cpu, "/")
    offline = cpu[3] + 0
    total_cpus = cpu[4] + 0
    if (offline == total_cpus && total_cpus > 0) next

    avail = total - used
    available[gpu_type] += avail
    total_gpus[gpu_type] += total
    in_use[gpu_type] += used
    nodes[gpu_type]++
    if (used == 0) empty[gpu_type]++
}
END {
    # Sort by GPU type (asorti is a gawk extension, so this needs gawk)
    n = asorti(available, sorted)

    printf "%-12s  %5s  %5s  %5s  %5s  %5s\n", \
        "GPU TYPE", "AVAIL", "USED", "TOTAL", "NODES", "EMPTY"
    printf "%-12s  %5s  %5s  %5s  %5s  %5s\n", \
        "--------", "-----", "----", "-----", "-----", "-----"

    total_avail = 0
    total_used = 0
    total_total = 0
    total_nodes_up = 0
    total_empty = 0

    for (i = 1; i <= n; i++) {
        t = sorted[i]
        printf "%-12s  %5d  %5d  %5d  %5d  %5d\n", \
            t, available[t], in_use[t], total_gpus[t], nodes[t], empty[t]+0
        total_avail += available[t]
        total_used += in_use[t]
        total_total += total_gpus[t]
        total_nodes_up += nodes[t]
        total_empty += empty[t]+0
    }

    printf "%-12s  %5s  %5s  %5s  %5s  %5s\n", \
        "--------", "-----", "----", "-----", "-----", "-----"
    printf "%-12s  %5d  %5d  %5d  %5d  %5d\n", \
        "TOTAL", total_avail, total_used, total_total, total_nodes_up, total_empty
}
'
}

This gives us something like this, which is a bit easier for a human to parse:

[embluhm@klone-login03 runs]$ gpucount
GPU TYPE      AVAIL   USED  TOTAL  NODES  EMPTY
--------      -----   ----  -----  -----  -----
2080ti           13     47     60      8      1
a100             12     52     64      8      0
a40              78    162    240     30      2
h200              8     56     64      8      1
l40              31     81    112     14      0
l40s             83    101    184     23      6
p100              0      8      8      2      0
rtx6k            50     38     88     11      3
--------      -----   ----  -----  -----  -----
TOTAL           275    545    820    104     13
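
The table can also feed straight into a job request. As a rough sketch, constraint_from_gpucount is a hypothetical filter (my own name, not a Hyak tool) that reads the gpucount table on stdin and prints a Slurm-style constraint string containing only the GPU types that currently have free devices. It skips p100 and h200 by default, which is an assumption based on the notes above; pass a different "|"-separated list as the first argument to change that.

```bash
# Hypothetical filter: turn gpucount output into a --constraint string.
# $1 (optional): "|"-separated GPU types to leave out (default: p100|h200).
constraint_from_gpucount() {
    awk -v skip="${1:-p100|h200}" '
        BEGIN {
            n = split(skip, s, "|")
            for (i = 1; i <= n; i++) excluded[s[i]] = 1
        }
        # Skip the two header rows, the separator row, and the TOTAL row
        NR <= 2 || $1 == "TOTAL" || $1 ~ /^-+$/ { next }
        # Keep GPU types that are not excluded and have AVAIL > 0
        !($1 in excluded) && ($2 + 0) > 0 {
            out = out (out == "" ? "" : "|") $1
        }
        END { print out }
    '
}

# Possible usage on a login node:
#   salloc -A aaplasma --partition=ckpt-all --gpus-per-node=1 \
#       --constraint="$(gpucount | constraint_from_gpucount)" --mem=24G --time=2:00:00
```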

Kokkos/RAJA on Klone#

Requesting Nodes#

The CUDA installation is available on each node through Lmod: module load cuda.

RAJA and Kokkos are happy to compile GPU kernels for whatever CUDA architectures are supported by the cuda module. AFAICT Kokkos only lets you pick one, while RAJA lets you specify multiple architectures. We should build for the highest compute capability that is supported by every device we might need to run on. Unfortunately, there are parts of RAJA/Kokkos that require compute capability 7.0+, so we can’t target the old P100 cards (6.0). To work around this, we can supply a constraint to our Slurm command to tell it which GPUs we are willing to work with:

salloc -A aaplasma --partition=ckpt-all --gpus-per-node=1 --constraint="2080ti|a40|a100|l40|l40s|rtx6k" --mem=24G --time=2:00:00

This puts us on a node that does not have a P100 card. That way we can target compute capability 7.5, the lowest among the remaining cards, and it should work for all of the others.
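
Rather than typing the constraint by hand, it can be derived from the compute capabilities listed in the GPU table. This is only a sketch: klone_gpu_table and constraint_for_min_cc are hypothetical helper names, the values are copied from the table in these notes (as of January 2026), and h200 is left out on the assumption that it is not usable from the checkpoint partition.

```bash
# Hypothetical lookup table: Klone restriction tag and CUDA compute capability,
# copied from the GPU table in these notes. h200 is deliberately omitted.
klone_gpu_table() {
    cat <<'EOF'
p100 6.0
2080ti 7.5
rtx6k 7.5
a40 8.6
a100 8.0
l40 8.9
l40s 8.9
EOF
}

# Print a "|"-joined constraint of all GPU types whose compute capability
# is at least $1 (for example 7.0).
constraint_for_min_cc() {
    local min_cc="$1"
    klone_gpu_table | awk -v min="$min_cc" '
        ($2 + 0) >= (min + 0) { out = out (out == "" ? "" : "|") $1 }
        END { print out }
    '
}

# Possible usage:
#   salloc -A aaplasma --partition=ckpt-all --gpus-per-node=1 \
#       --constraint="$(constraint_for_min_cc 7.0)" --mem=24G --time=2:00:00
```

With a minimum of 7.0 this selects the same set of cards as the hand-written constraint, just in table order.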

To build RAJA on Klone, I’ve used this set of commands after cloning the RAJA repo (with --recurse-submodules):

module purge
module load cuda gcc/13.2.0
cd /gscratch/aaplasma/embluhm/tools/src/RAJA
mkdir -p build
cmake -B build/ \
    -DCMAKE_INSTALL_PREFIX=/gscratch/aaplasma/embluhm/tools/RAJA \
    -DRAJA_ENABLE_CUDA=ON -DENABLE_CUDA=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/sw/cuda/12.4.1 \
    -DCMAKE_CUDA_COMPILER=/sw/cuda/12.4.1/bin/nvcc \
    -DCMAKE_CUDA_ARCHITECTURES="75" \
    -DRAJA_ENABLE_EXAMPLES=ON -DRAJA_ENABLE_TESTS=ON \
    -DRAJA_ENABLE_BENCHMARKS=ON -DENABLE_BENCHMARKS=ON \
    .
cmake --build build/ -j12

That seems to be working so far! The full build took about 30 minutes. I’m using the gcc/13.2.0 compiler module because that’s the one used for the latest CUDA-aware OpenMPI module on Klone. Gotta make sure all the versions match up if we don’t want to be chasing down the most annoying linking problems imaginable.

Building METIS#

When linking with the latest CUDA and OpenMPI modules, I want a METIS built with the same gcc that was used for OpenMPI. That means building METIS from scratch.

Building METIS is pretty easy, but by default it does not statically link in the GKlib dependencies. To create a static libmetis.a library that WARPXM can use without also needing to link in a GKlib installation, we can manually combine GKlib into the libmetis.a:

module load gcc/13.2.0
git clone https://github.com/KarypisLab/METIS.git
git clone https://github.com/KarypisLab/GKlib.git
cd GKlib
make config prefix=/gscratch/aaplasma/embluhm/tools/GKlib
make install
cd ../METIS
make config gklib_path=/mmfs1/gscratch/aaplasma/embluhm/tools/GKlib prefix=/mmfs1/gscratch/aaplasma/embluhm/tools/METIS
make install

# Create a combined static libmetis.a, and copy it over top of the old libmetis.a
mkdir /tmp/metis_combined && cd /tmp/metis_combined
ar -x /mmfs1/gscratch/aaplasma/embluhm/tools/GKlib/lib64/libGKlib.a
ar -x /mmfs1/gscratch/aaplasma/embluhm/tools/METIS/lib/libmetis.a
ar -rcs /mmfs1/gscratch/aaplasma/embluhm/tools/METIS/lib/libmetis.a *.o
cd && rm -rf /tmp/metis_combined
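
One caveat with extracting both archives into the same directory: if the two libraries ever contain object files with the same member name, the second ar -x overwrites the first and one object silently goes missing. I have not checked whether that actually happens for GKlib and METIS, so treat this as a precaution. The sketch below uses a hypothetical combine_static_libs helper that extracts each archive into its own subdirectory and appends with ar's q (quick append) operation, which does not replace same-named members.

```bash
# Hypothetical helper: merge several static archives into one.
# Usage: combine_static_libs OUTPUT.a INPUT1.a [INPUT2.a ...]
combine_static_libs() {
    local out="$1"; shift
    local work lib i=0
    work=$(mktemp -d) || return 1
    for lib in "$@"; do
        # ar -x extracts into the current directory, so make the path absolute
        case "$lib" in /*) ;; *) lib="$PWD/$lib" ;; esac
        i=$((i + 1))
        mkdir -p "$work/$i"
        # One subdirectory per archive keeps same-named objects apart
        (cd "$work/$i" && ar -x "$lib") || { rm -rf "$work"; return 1; }
    done
    # Start from a fresh archive; the inputs have already been extracted,
    # so OUTPUT may be the same file as one of the inputs
    rm -f "$out"
    # q = quick append (no replacement check), c = create, s = write symbol index
    ar qcs "$out" "$work"/*/* || { rm -rf "$work"; return 1; }
    rm -rf "$work"
}

# Possible usage with the paths from the steps above:
#   combine_static_libs /mmfs1/gscratch/aaplasma/embluhm/tools/METIS/lib/libmetis.a \
#       /mmfs1/gscratch/aaplasma/embluhm/tools/GKlib/lib64/libGKlib.a \
#       /mmfs1/gscratch/aaplasma/embluhm/tools/METIS/lib/libmetis.a
```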

Static build with -fPIC#

At some point, for reasons I never tracked down, my build stopped working unless METIS was built with support for position-independent code (the -fPIC option for most compilers). This means building METIS and GKlib with the shared=1 option, and making some fixes to their build pipeline. For this, I’ve chosen to use the same prefix for both GKlib and METIS, since that solves a lot of dependency resolution problems and gets rid of the need for that archive mess above:

module load gcc/13.2.0
git clone https://github.com/KarypisLab/METIS.git
git clone https://github.com/KarypisLab/GKlib.git

cd GKlib
make config shared=1 prefix=/mmfs1/gscratch/aaplasma/embluhm/tools/METIS
cd build/Linux-x86_64 && make -j12 install
# For reasons I do not understand, METIS is looking for the new shared libGKlib.so under lib/, but the
# GKlib build puts it under lib64/libGKlib.so.0.0.1, so we need to add a symbolic link to where METIS is looking
ln -s /gscratch/aaplasma/embluhm/tools/METIS/lib64/libGKlib.so /gscratch/aaplasma/embluhm/tools/METIS/lib/
ln -s /gscratch/aaplasma/embluhm/tools/METIS/lib64/libGKlib.so.0 /gscratch/aaplasma/embluhm/tools/METIS/lib/
ln -s /gscratch/aaplasma/embluhm/tools/METIS/lib64/libGKlib.so.0.0.1 /gscratch/aaplasma/embluhm/tools/METIS/lib/

cd ../../../METIS
make config shared=1 prefix=/mmfs1/gscratch/aaplasma/embluhm/tools/METIS
cd build && make -j12 install
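
The three ln -s lines can be collapsed into a loop that also copes with a missing lib/ directory and with re-running the build. A minimal sketch, where link_lib64_into_lib is a hypothetical helper name and the links are made relative so they keep working under both the /gscratch and /mmfs1/gscratch spellings of the path:

```bash
# Hypothetical helper: for every file in PREFIX/lib64 matching PATTERN,
# create a same-named relative symlink in PREFIX/lib.
# Usage: link_lib64_into_lib PREFIX 'PATTERN'
link_lib64_into_lib() {
    local prefix="$1" pattern="$2" f base
    mkdir -p "$prefix/lib" || return 1
    # $pattern is intentionally unquoted so the shell expands the glob
    for f in "$prefix"/lib64/$pattern; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$f" ] || continue
        base=$(basename "$f")
        # -f replaces an existing link, -n avoids descending into a linked directory
        ln -sfn "../lib64/$base" "$prefix/lib/$base"
    done
}

# Possible usage after installing GKlib:
#   link_lib64_into_lib /gscratch/aaplasma/embluhm/tools/METIS 'libGKlib.so*'
```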

Building HDF5-parallel#

With no HDF5 module to rely on, we can just build it ourselves. It’s not too bad, as long as we get lucky :)

salloc -A aaplasma -c 20 --mem=24G --time=2:00:00
module purge
module load ompi/4.1.6-2 gcc/13.2.0
wget https://github.com/HDFGroup/hdf5/releases/download/hdf5_1.14.6/hdf5-1.14.6.tar.gz
tar -zxvf hdf5-1.14.6.tar.gz
cd hdf5-1.14.6/
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
    -DHDF5_ENABLE_PARALLEL=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DBUILD_TESTING=ON \
    -DCMAKE_INSTALL_PREFIX=/gscratch/aaplasma/embluhm/tools/hdf5-mpi-1.14.6 \
    -DCMAKE_C_COMPILER=/sw/ompi/4.1.6-2/bin/mpicc \
    -DMPI_HOME=/sw/ompi/4.1.6-2 \
    -DMPIEXEC_MAX_NUMPROCS=4 \
    ..
make -j20 install
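
After the install, it is worth confirming that the result really is a parallel build before linking anything against it. As far as I know, HDF5 installs a build summary file named libhdf5.settings that contains a "Parallel HDF5" line. Where exactly it lands under the prefix is an assumption here, so this sketch searches for it. hdf5_is_parallel is a hypothetical helper name.

```bash
# Hypothetical check: succeed if the HDF5 install under PREFIX reports a
# parallel build in its libhdf5.settings summary.
# Returns 0 = parallel, 1 = not parallel, 2 = settings file not found.
hdf5_is_parallel() {
    local prefix="$1" settings
    settings=$(find "$prefix" -type f -name libhdf5.settings 2>/dev/null | head -n 1)
    [ -n "$settings" ] || return 2
    # Accept either ON or yes, depending on which build system wrote the file
    grep -Eiq 'Parallel HDF5:[[:space:]]*(on|yes)' "$settings"
}

# Possible usage:
#   hdf5_is_parallel /gscratch/aaplasma/embluhm/tools/hdf5-mpi-1.14.6 \
#       && echo "parallel HDF5 looks good"
```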

Building Kokkos#

  1. Grab the Kokkos source from a GitHub release (latest is probably good)
  2. Create a build script (e.g. build_kokkos.sh). I’ve chosen to only enable the CUDA backend and not the OpenMP backend, because I’m not using OpenMP thread parallelism and I don’t want to have to deal with setting OpenMP environment variables every time I run something just to avoid it spinning up OpenMP threads to do nothing. If you want OpenMP enabled, just turn on that option. I’ve picked CUDA arch 7.5 as the minimum version to use.
    #!/bin/bash -ex
    
    module purge
    module load cuda/12.9.1 ompi/4.1.6-2 gcc/13.2.0
    
    rm -rf build
    mkdir -p build
    cmake -B build/ \
      -DCMAKE_INSTALL_PREFIX=/gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1 \
      -DCMAKE_BUILD_TYPE=Release \
      -DKokkos_ENABLE_CUDA=ON -DCUDA_ROOT=/sw/cuda/12.9.1 -DKokkos_ARCH_TURING75=ON \
      -DKokkos_ENABLE_OPENMP=OFF \
      -DKokkos_ENABLE_SERIAL=ON  
    cmake --build build/ -j20
    cd build
    make install
    ln -s /gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1/lib64 /gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1/lib
  3. Make it executable
    chmod +x build_kokkos.sh
  4. Submit a job to build Kokkos using the build script. This will take a little while. Afterwards, Kokkos will be installed at the location specified by CMAKE_INSTALL_PREFIX in the build script (can be wherever you want)
    salloc -A aaplasma -c 20 --mem=24G --time=2:00:00 srun ./build_kokkos.sh

Building Kokkos-enabled WARPXM#

Now that I’ve added a whole bunch of Kokkos-related dependencies to WARPXM, I can’t just use the existing PETSc-based modules for all of the other dependencies. At present, the dependencies required to build all of the WARPXM features (other than the old PETSc-based implicit solver) are:

  • MPI implementation (OpenMPI)
  • CMake
  • METIS
  • HDF5 (linked against MPI)
  • Kokkos (with CUDA)

There are modules for CUDA-aware OpenMPI, so we don’t have to build that ourselves. We just load them:

module load cuda/12.9.1 ompi/4.1.6-2 gcc/13.2.0

We do need HDF5 and METIS, so let’s go ahead and build them first, following the instructions above.
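
Because a missing dependency only shows up partway through a long configure step, a quick preflight check can save an allocation. A minimal sketch, assuming the install prefixes used elsewhere in these notes; require_paths is a hypothetical helper name:

```bash
# Hypothetical preflight helper: report every path that does not exist and
# return non-zero if any were missing.
require_paths() {
    local missing=0 p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage before configuring WARPXM:
#   require_paths \
#       /gscratch/aaplasma/embluhm/tools/hdf5-mpi-1.14.6 \
#       /gscratch/aaplasma/embluhm/tools/METIS \
#       /gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1/bin/nvcc_wrapper \
#       || exit 1
```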

The script that I’m using to put all of this together and build WARPXM looks like this:

#!/bin/bash -ex
module purge
module load cuda/12.9.1 ompi/4.1.6-2 gcc/13.2.0
export HDF5_DIR=/gscratch/aaplasma/embluhm/tools/hdf5-mpi-1.14.6/cmake
export OMPI_CXX=/gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1/bin/nvcc_wrapper
export PKG_CONFIG_PATH=/sw/ompi/4.1.6-2/lib/pkgconfig:$PKG_CONFIG_PATH
# The script is executed (not sourced), so bail out with exit rather than return
cd /gscratch/aaplasma/embluhm/code/warpxm || exit 1

if [ ! -d build ]; then
    mkdir build || exit 1
else
    rm -f build/CMakeCache.txt
fi

# RelWithDebInfo
cmake -B build/ \
    -DWXM_ENABLE_KOKKOS=ON \
    -DHDF5_ROOT=/gscratch/aaplasma/embluhm/tools/hdf5-mpi-1.14.6 \
    -DMetis_ROOT=/gscratch/aaplasma/embluhm/tools/METIS \
    -DKokkos_ROOT=/gscratch/aaplasma/embluhm/tools/kokkos-cuda12.9.1 \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_C_COMPILER=/sw/ompi/4.1.6-2/bin/mpicc \
    -DCMAKE_CXX_COMPILER=/sw/ompi/4.1.6-2/bin/mpicxx \
    -DWXM_ENABLE_TRACY= \
    -DCUDAToolkit_ROOT=/sw/cuda/12.9.1 \
    -DPython_EXECUTABLE=/gscratch/aaplasma/embluhm/conda/python3-11/bin/python3 \
    -DMPI_HOME=/sw/ompi/4.1.6-2 \
    . || exit 1

cmake --build build/ -j20 || exit 1

cd build || exit 1
ctest -j20 --output-on-failure --label-regex "Unit" || exit 1

I put all of this into a klone_build_script.sh in the warpxm project root, and then kick off a build with:

salloc -A aaplasma -c 20 --mem=24G --time=2:00:00 srun ./klone_build_script.sh