Apollo dynamic tuning framework for RAJA

Uri mentioned a project by David Beckingsale at LLNL called “Apollo” that attempts to tune kernel parameters at runtime in order to optimize for the available hardware. It’s especially for codes with adaptive capabalities (like adaptive mesh refinement) in which the runtime characteristics of the compute kernels tend to be very input dependent.

This is the main paper they put out: https://computing.llnl.gov/sites/default/files/Apollo-Fast-Lightweight-Dynamic-Tuning-Data-Dependent-Code-LLNL-paper_0.pdf

And here’s the github repository: https://github.com/LLNL/apollo

The approach they take is: At runtime, for each kernel execution apollo collects the runtime and a “feature vector” describing various features of the kernel. These features include:

  • Parameters passed to the RAJA forall
  • Instruction features gathered from the kernel body
  • Runtime measurements
  • Optional features specified by the developer

img/research/apollo-features.png

To be able to change the execution policy at runtime (which is generally not possible in RAJA), they added a RAJA extension

I’d really like to see it in action, so let’s see if we can get a simple project using their framework to compile. We can eventually get Claude to spit out something like this:

#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <cmath>
#include <chrono>
#include <algorithm>
#include <numeric>
#include <random>
#include <cstring>
#include <thread>

#include "RAJA/RAJA.hpp"
#include "apollo/Apollo.h"
#include "apollo/Region.h"
#include "kernels.h"

// Define execution policies for CPU
using sequential_policy = RAJA::seq_exec;

#ifdef RAJA_ENABLE_OPENMP
using omp_policy_1 = RAJA::omp_parallel_for_exec;
using omp_policy_static = RAJA::omp_parallel_for_static_exec<16>;
using omp_policy_dynamic = RAJA::omp_parallel_for_dynamic_exec<8>;
using omp_policy_guided = RAJA::omp_parallel_for_guided_exec<>;

// Policy list for Apollo to choose from
using CPUPolicyList = RAJA::list<
    sequential_policy,
    omp_policy_1,
    omp_policy_static,
    omp_policy_dynamic,
    omp_policy_guided
>;

constexpr int NUM_POLICIES = 5;
const char* policy_names[] = {
    "Sequential",
    "OpenMP Default",
    "OpenMP Static-16",
    "OpenMP Dynamic-8",
    "OpenMP Guided"
};
#else
// If OpenMP is not available, use only sequential
using CPUPolicyList = RAJA::list<sequential_policy>;
constexpr int NUM_POLICIES = 1;
const char* policy_names[] = {"Sequential"};
#endif

namespace apollo_demo {

// Timer utility
class Timer {
private:
    std::chrono::high_resolution_clock::time_point start_time;
    std::chrono::high_resolution_clock::time_point end_time;

public:
    void start() {
        start_time = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        end_time = std::chrono::high_resolution_clock::now();
    }

    double getElapsedMilliseconds() const {
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
        return duration.count() / 1000.0;
    }
};

// Benchmark runner for a specific kernel
class KernelBenchmark {
private:
    std::unique_ptr<KernelBase> kernel;
    Apollo::Region* region;
    std::string kernel_name;
    bool use_apollo;

public:
    KernelBenchmark(const std::string& name, bool enable_apollo = true)
        : kernel_name(name), use_apollo(enable_apollo), region(nullptr) {
        kernel = createKernel(name);
        if (!kernel) {
            throw std::runtime_error("Unknown kernel: " + name);
        }

        if (use_apollo) {
            // Create Apollo region with 5 features (from KernelFeatures)
            region = new Apollo::Region(
                5,  // number of features
                kernel->getName(),
                NUM_POLICIES,
                50  // minimum training data before making predictions
            );
        }
    }

    ~KernelBenchmark() {
        if (region) delete region;
        cleanup();
    }

    void cleanup() {
        if (region) {
            delete region;
            region = nullptr;
        }
    }

    template<typename KernelType>
    double runWithApollo(KernelType* k, const ProblemConfig& config) {
        if (!region) {
            throw std::runtime_error("Apollo not enabled for this benchmark");
        }

        // Extract features for Apollo
        KernelFeatures features = k->extractFeatures(config);
        std::vector<float> feature_vec = features.toVector();

        // Start Apollo region
        Apollo::RegionContext* context = region->begin(feature_vec);

        // Get policy recommendation from Apollo
        int policy_idx = region->getPolicyIndex(context);

        Timer timer;
        timer.start();

        // Execute with selected policy
        switch(policy_idx) {
            case 0:
                k->template execute<sequential_policy>();
                break;
#ifdef RAJA_ENABLE_OPENMP
            case 1:
                k->template execute<omp_policy_1>();
                break;
            case 2:
                k->template execute<omp_policy_static>();
                break;
            case 3:
                k->template execute<omp_policy_dynamic>();
                break;
            case 4:
                k->template execute<omp_policy_guided>();
                break;
#endif
            default:
                k->template execute<sequential_policy>();
                break;
        }

        timer.stop();
        double elapsed = timer.getElapsedMilliseconds();

        // End Apollo region with performance metric
        region->end(context, elapsed);

        if (policy_idx >= 0 && policy_idx < NUM_POLICIES) {
            std::cout << "  Apollo selected: " << policy_names[policy_idx]
                      << " (policy " << policy_idx << ")" << std::endl;
        }

        return elapsed;
    }

    template<typename KernelType, typename ExecPolicy>
    double runWithPolicy(KernelType* k) {
        Timer timer;
        timer.start();
        k->template execute<ExecPolicy>();
        timer.stop();
        return timer.getElapsedMilliseconds();
    }

    void runBenchmark(const ProblemConfig& config, int iterations = 10) {
        std::cout << "\n=== Running " << kernel_name << " Kernel ===" << std::endl;
        std::cout << "Problem size: " << config.size << std::endl;
        std::cout << "Iterations: " << iterations << std::endl;

        kernel->initialize(config);

        std::vector<double> times;

        for (int iter = 0; iter < iterations; ++iter) {
            double elapsed = 0.0;

            if (use_apollo) {
                if (auto* k = dynamic_cast<VectorAddKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<DaxpyKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<MatVecKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<Stencil1DKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<ReductionKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<SpMVKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                } else if (auto* k = dynamic_cast<IndirectKernel*>(kernel.get())) {
                    elapsed = runWithApollo(k, config);
                }
            } else {
                // Run with default policy (sequential)
                if (auto* k = dynamic_cast<VectorAddKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<VectorAddKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<DaxpyKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<DaxpyKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<MatVecKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<MatVecKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<Stencil1DKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<Stencil1DKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<ReductionKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<ReductionKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<SpMVKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<SpMVKernel, sequential_policy>(k);
                } else if (auto* k = dynamic_cast<IndirectKernel*>(kernel.get())) {
                    elapsed = runWithPolicy<IndirectKernel, sequential_policy>(k);
                }
            }

            times.push_back(elapsed);
            std::cout << "  Iteration " << std::setw(2) << iter + 1
                      << ": " << std::fixed << std::setprecision(3)
                      << elapsed << " ms" << std::endl;
        }

        // Compute statistics
        double avg_time = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
        double min_time = *std::min_element(times.begin(), times.end());
        double max_time = *std::max_element(times.begin(), times.end());

        std::cout << "\nStatistics:" << std::endl;
        std::cout << "  Average: " << std::fixed << std::setprecision(3) << avg_time << " ms" << std::endl;
        std::cout << "  Min:     " << std::fixed << std::setprecision(3) << min_time << " ms" << std::endl;
        std::cout << "  Max:     " << std::fixed << std::setprecision(3) << max_time << " ms" << std::endl;

        kernel->cleanup();
    }

    void trainApollo(int step) {
        if (region) {
            region->train(step);
            std::cout << "Apollo training triggered at step " << step << std::endl;
        }
    }

    void cleanupRegion() {
        cleanup();
    }
};

// Scenario runner - runs different problem configurations
class ScenarioRunner {
private:
    Apollo* apollo;
    bool verbose;

public:
    ScenarioRunner(bool verbose_mode = false)
        : apollo(Apollo::instance()), verbose(verbose_mode) {}

    void runAdaptiveScenario() {
        std::cout << "\n" << std::string(60, '=') << std::endl;
        std::cout << "ADAPTIVE SCENARIO: Apollo learns optimal policies" << std::endl;
        std::cout << std::string(60, '=') << std::endl;

        // Create benchmarks for different kernels
        std::vector<std::unique_ptr<KernelBenchmark>> benchmarks;
        benchmarks.push_back(std::make_unique<KernelBenchmark>("VectorAdd", true));
        benchmarks.push_back(std::make_unique<KernelBenchmark>("DAXPY", true));
        benchmarks.push_back(std::make_unique<KernelBenchmark>("Stencil1D", true));
        benchmarks.push_back(std::make_unique<KernelBenchmark>("Reduction", true));

        // Different problem configurations to test adaptation
        std::vector<ProblemConfig> configs;

        // Small problems
        ProblemConfig small_config;
        small_config.size = 10000;
        small_config.iterations = 5;
        configs.push_back(small_config);

        // Medium problems
        ProblemConfig medium_config;
        medium_config.size = 100000;
        medium_config.iterations = 5;
        configs.push_back(medium_config);

        // Large problems
        ProblemConfig large_config;
        large_config.size = 1000000;
        large_config.iterations = 5;
        configs.push_back(large_config);

        // Very large problems
        ProblemConfig xlarge_config;
        xlarge_config.size = 10000000;
        xlarge_config.iterations = 5;
        configs.push_back(xlarge_config);

        int global_step = 0;

        // Run exploration phase
        std::cout << "\n--- EXPLORATION PHASE ---" << std::endl;
        std::cout << "Apollo explores different policies..." << std::endl;

        for (size_t config_idx = 0; config_idx < 2; ++config_idx) {
            for (auto& benchmark : benchmarks) {
                benchmark->runBenchmark(configs[config_idx], 5);
                global_step++;
            }
        }

        // Trigger training
        std::cout << "\n--- TRAINING PHASE ---" << std::endl;
        apollo->train(0);
        std::cout << "Apollo has trained models based on exploration data" << std::endl;

        // Run exploitation phase
        std::cout << "\n--- EXPLOITATION PHASE ---" << std::endl;
        std::cout << "Apollo uses learned models to select optimal policies..." << std::endl;

        for (size_t config_idx = 2; config_idx < configs.size(); ++config_idx) {
            for (auto& benchmark : benchmarks) {
                benchmark->runBenchmark(configs[config_idx], 5);
                global_step++;

                // Periodically retrain to adapt
                if (global_step % 10 == 0) {
                    apollo->train(global_step / 10);
                }
            }
        }

        // Clean up benchmarks explicitly
        for (auto& benchmark : benchmarks) {
            benchmark->cleanupRegion();
        }
        benchmarks.clear();
    }

    void runComparisonScenario() {
        std::cout << "\n" << std::string(60, '=') << std::endl;
        std::cout << "COMPARISON SCENARIO: Apollo vs Fixed Policies" << std::endl;
        std::cout << std::string(60, '=') << std::endl;

        ProblemConfig config;
        config.size = 5000000;
        config.iterations = 20;

        std::cout << "\n--- Running with Apollo (adaptive) ---" << std::endl;
        KernelBenchmark apollo_bench("VectorAdd", true);
        apollo_bench.runBenchmark(config, 20);

        // Train after exploration
        apollo->train(1);

        std::cout << "\n--- Running with Apollo (after training) ---" << std::endl;
        apollo_bench.runBenchmark(config, 10);

        // Clean up benchmark
        apollo_bench.cleanupRegion();

#ifdef RAJA_ENABLE_OPENMP
        std::cout << "\n--- Running with fixed Sequential policy ---" << std::endl;
        KernelBenchmark seq_bench("VectorAdd", false);
        seq_bench.runBenchmark(config, 10);
        seq_bench.cleanupRegion();
#endif
    }

    void runDataDependentScenario() {
        std::cout << "\n" << std::string(60, '=') << std::endl;
        std::cout << "DATA-DEPENDENT SCENARIO: Apollo adapts to data patterns" << std::endl;
        std::cout << std::string(60, '=') << std::endl;

        KernelBenchmark spmv_bench("SpMV", true);
        KernelBenchmark indirect_bench("Indirect", true);

        // Test with different sparsity levels
        std::vector<double> sparsity_levels = {0.01, 0.1, 0.5, 0.9, 0.99};

        for (double sparsity : sparsity_levels) {
            std::cout << "\n--- Sparsity Level: " << (sparsity * 100) << "% ---" << std::endl;

            ProblemConfig config;
            config.size = 100000;
            config.sparsity = sparsity;
            config.iterations = 5;

            spmv_bench.runBenchmark(config, 10);

            // Train after each sparsity level
            apollo->train(static_cast<int>(sparsity * 100));
        }

        // Clean up benchmarks
        spmv_bench.cleanupRegion();

        // Test with different access patterns
        std::cout << "\n--- Testing Indirect Access Patterns ---" << std::endl;

        ProblemConfig seq_config;
        seq_config.size = 1000000;
        seq_config.use_indirect = false;
        seq_config.iterations = 5;

        ProblemConfig rand_config;
        rand_config.size = 1000000;
        rand_config.use_indirect = true;
        rand_config.iterations = 5;

        std::cout << "\nSequential access pattern:" << std::endl;
        indirect_bench.runBenchmark(seq_config, 10);

        std::cout << "\nRandom access pattern:" << std::endl;
        indirect_bench.runBenchmark(rand_config, 10);

        // Clean up benchmark
        indirect_bench.cleanupRegion();
    }
};

} // namespace apollo_demo

// Main function
int main(int argc, char** argv) {
    std::cout << "Apollo-RAJA CPU Demonstration" << std::endl;
    std::cout << "==============================" << std::endl;

    // Print configuration
    std::cout << "\nConfiguration:" << std::endl;
    std::cout << "  Number of policies: " << NUM_POLICIES << std::endl;
    std::cout << "  Available policies:" << std::endl;
    for (int i = 0; i < NUM_POLICIES; ++i) {
        std::cout << "    " << i << ": " << policy_names[i] << std::endl;
    }

#ifdef RAJA_ENABLE_OPENMP
    int num_threads = 1;
    #pragma omp parallel
    {
        #pragma omp single
        num_threads = omp_get_num_threads();
    }
    std::cout << "  OpenMP threads: " << num_threads << std::endl;
#else
    std::cout << "  OpenMP: DISABLED" << std::endl;
#endif

    // Check for Apollo environment variables
    const char* policy_model = std::getenv("APOLLO_POLICY_MODEL");
    if (policy_model) {
        std::cout << "  Apollo Policy Model: " << policy_model << std::endl;
    } else {
        std::cout << "  Apollo Policy Model: Default (Static)" << std::endl;
        std::cout << "  Tip: Set APOLLO_POLICY_MODEL env var to enable tuning" << std::endl;
        std::cout << "  Example: APOLLO_POLICY_MODEL=DecisionTree,explore=RoundRobin,max_depth=4" << std::endl;
    }

    const char* train_period = std::getenv("APOLLO_GLOBAL_TRAIN_PERIOD");
    if (train_period) {
        std::cout << "  Apollo Global Train Period: " << train_period << std::endl;
    }

    // Parse command line arguments
    bool run_all = true;
    bool run_adaptive = false;
    bool run_comparison = false;
    bool run_data_dependent = false;

    for (int i = 1; i < argc; ++i) {
        std::string arg(argv[i]);
        if (arg == "--adaptive") {
            run_all = false;
            run_adaptive = true;
        } else if (arg == "--comparison") {
            run_all = false;
            run_comparison = true;
        } else if (arg == "--data-dependent") {
            run_all = false;
            run_data_dependent = true;
        } else if (arg == "--help" || arg == "-h") {
            std::cout << "\nUsage: " << argv[0] << " [options]" << std::endl;
            std::cout << "Options:" << std::endl;
            std::cout << "  --adaptive        Run adaptive scenario" << std::endl;
            std::cout << "  --comparison      Run comparison scenario" << std::endl;
            std::cout << "  --data-dependent  Run data-dependent scenario" << std::endl;
            std::cout << "  --help, -h        Show this help message" << std::endl;
            std::cout << "\nEnvironment variables:" << std::endl;
            std::cout << "  APOLLO_POLICY_MODEL       Set Apollo tuning model" << std::endl;
            std::cout << "  APOLLO_GLOBAL_TRAIN_PERIOD  Set automatic training period" << std::endl;
            std::cout << "  APOLLO_TRACE_CSV          Enable CSV tracing (0/1)" << std::endl;
            std::cout << "  APOLLO_STORE_MODELS       Store/load trained models (0/1)" << std::endl;
            return 0;
        }
    }

    // Create scenario runner
    apollo_demo::ScenarioRunner runner(false);

    // Run selected scenarios
    if (run_all || run_adaptive) {
        runner.runAdaptiveScenario();
    }

    if (run_all || run_comparison) {
        runner.runComparisonScenario();
    }

    if (run_all || run_data_dependent) {
        runner.runDataDependentScenario();
    }

    std::cout << "\n" << std::string(60, '=') << std::endl;
    std::cout << "Apollo-RAJA CPU Demonstration Complete" << std::endl;
    std::cout << std::string(60, '=') << std::endl;

    // Give some time for cleanup
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    return 0;
}

with a wide variety of compute kernels defined in kernels.h:

// kernels.h
//
#ifndef APOLLO_RAJA_DEMO_KERNELS_H
#define APOLLO_RAJA_DEMO_KERNELS_H

#include <vector>
#include <memory>
#include <cmath>
#include <algorithm>

#include "RAJA/RAJA.hpp"

namespace apollo_demo {

// Structure to hold problem configuration
struct ProblemConfig {
    int size;              // Problem size
    int iterations;        // Number of iterations
    double sparsity;       // Sparsity level (0.0 to 1.0)
    int stencil_size;      // Size of stencil pattern
    bool use_indirect;     // Use indirect access pattern

    ProblemConfig()
        : size(1000000), iterations(100), sparsity(0.0),
          stencil_size(5), use_indirect(false) {}
};

// Features that Apollo can use for tuning decisions
struct KernelFeatures {
    float problem_size_log;     // Log of problem size
    float memory_footprint;      // Estimated memory footprint in MB
    float arithmetic_intensity;  // Ops per byte
    float sparsity_level;        // Sparsity of data (0-1)
    float access_pattern;        // 0 for sequential, 1 for random

    KernelFeatures()
        : problem_size_log(0), memory_footprint(0),
          arithmetic_intensity(0), sparsity_level(0), access_pattern(0) {}

    std::vector<float> toVector() const {
        return {problem_size_log, memory_footprint, arithmetic_intensity,
                sparsity_level, access_pattern};
    }
};

// Base class for kernels
class KernelBase {
public:
    virtual ~KernelBase() = default;
    virtual const char* getName() const = 0;
    virtual KernelFeatures extractFeatures(const ProblemConfig& config) const = 0;
    virtual void initialize(const ProblemConfig& config) = 0;
    virtual void cleanup() = 0;
};

// Vector addition kernel: c = a + b
class VectorAddKernel : public KernelBase {
private:
    double* a;
    double* b;
    double* c;
    int size;

public:
    VectorAddKernel() : a(nullptr), b(nullptr), c(nullptr), size(0) {}
    ~VectorAddKernel() { cleanup(); }

    const char* getName() const override { return "VectorAdd"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, size),
            [=] RAJA_DEVICE (int i) {
                c[i] = a[i] + b[i];
            });
    }

    double* getResult() { return c; }
};

// DAXPY kernel: y = alpha * x + y
class DaxpyKernel : public KernelBase {
private:
    double* x;
    double* y;
    double alpha;
    int size;

public:
    DaxpyKernel() : x(nullptr), y(nullptr), alpha(2.0), size(0) {}
    ~DaxpyKernel() { cleanup(); }

    const char* getName() const override { return "DAXPY"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        double a = alpha;
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, size),
            [=] RAJA_DEVICE (int i) {
                y[i] = a * x[i] + y[i];
            });
    }

    double* getResult() { return y; }
};

// Matrix-vector multiplication: y = A * x
class MatVecKernel : public KernelBase {
private:
    double* A;  // Matrix in row-major format
    double* x;  // Input vector
    double* y;  // Output vector
    int rows;
    int cols;

public:
    MatVecKernel() : A(nullptr), x(nullptr), y(nullptr), rows(0), cols(0) {}
    ~MatVecKernel() { cleanup(); }

    const char* getName() const override { return "MatVec"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, rows),
            [=] RAJA_DEVICE (int i) {
                double sum = 0.0;
                for (int j = 0; j < cols; ++j) {
                    sum += A[i * cols + j] * x[j];
                }
                y[i] = sum;
            });
    }

    double* getResult() { return y; }
};

// 1D Stencil computation
class Stencil1DKernel : public KernelBase {
private:
    double* input;
    double* output;
    double* weights;
    int size;
    int stencil_radius;

public:
    Stencil1DKernel() : input(nullptr), output(nullptr), weights(nullptr),
                        size(0), stencil_radius(2) {}
    ~Stencil1DKernel() { cleanup(); }

    const char* getName() const override { return "Stencil1D"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        int radius = stencil_radius;
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(radius, size - radius),
            [=] RAJA_DEVICE (int i) {
                double sum = 0.0;
                for (int j = -radius; j <= radius; ++j) {
                    sum += input[i + j] * weights[j + radius];
                }
                output[i] = sum;
            });
    }

    double* getResult() { return output; }
};

// Reduction kernel: sum all elements
class ReductionKernel : public KernelBase {
private:
    double* data;
    int size;
    double result;

public:
    ReductionKernel() : data(nullptr), size(0), result(0.0) {}
    ~ReductionKernel() { cleanup(); }

    const char* getName() const override { return "Reduction"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        RAJA::ReduceSum<RAJA::seq_reduce, double> sum(0.0);

        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, size),
            [=] RAJA_DEVICE (int i) {
                sum += data[i];
            });

        result = sum.get();
    }

    double getResult() { return result; }
};

// Sparse matrix-vector multiplication
class SpMVKernel : public KernelBase {
private:
    // CSR format
    double* values;     // Non-zero values
    int* col_indices;   // Column indices
    int* row_ptr;       // Row pointers
    double* x;          // Input vector
    double* y;          // Output vector
    int num_rows;
    int num_nonzeros;

public:
    SpMVKernel() : values(nullptr), col_indices(nullptr), row_ptr(nullptr),
                   x(nullptr), y(nullptr), num_rows(0), num_nonzeros(0) {}
    ~SpMVKernel() { cleanup(); }

    const char* getName() const override { return "SpMV"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, num_rows),
            [=] RAJA_DEVICE (int row) {
                double sum = 0.0;
                for (int idx = row_ptr[row]; idx < row_ptr[row + 1]; ++idx) {
                    sum += values[idx] * x[col_indices[idx]];
                }
                y[row] = sum;
            });
    }

    double* getResult() { return y; }
};

// Indirect memory access kernel (gather-scatter pattern)
class IndirectKernel : public KernelBase {
private:
    double* src;
    double* dst;
    int* indices;
    int size;

public:
    IndirectKernel() : src(nullptr), dst(nullptr), indices(nullptr), size(0) {}
    ~IndirectKernel() { cleanup(); }

    const char* getName() const override { return "Indirect"; }

    void initialize(const ProblemConfig& config) override;
    void cleanup() override;
    KernelFeatures extractFeatures(const ProblemConfig& config) const override;

    template<typename ExecPolicy>
    void execute() {
        RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, size),
            [=] RAJA_DEVICE (int i) {
                dst[i] = src[indices[i]] * 2.0;
            });
    }

    double* getResult() { return dst; }
};

// Factory function to create kernels
std::unique_ptr<KernelBase> createKernel(const std::string& kernel_name);

// Utility functions
void initializeRandomData(double* data, int size, double min_val = 0.0, double max_val = 1.0);
void initializeRandomIndices(int* indices, int size, int max_index);
double computeChecksum(const double* data, int size);

} // namespace apollo_demo

#endif // APOLLO_RAJA_DEMO_KERNELS_H
// kernels.cpp
#include "kernels.h"
#include <cstring>
#include <random>
#include <algorithm>
#include <iostream>

namespace apollo_demo {

// Utility functions implementation

void initializeRandomData(double* data, int size, double min_val, double max_val) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dist(min_val, max_val);

    for (int i = 0; i < size; ++i) {
        data[i] = dist(gen);
    }
}

void initializeRandomIndices(int* indices, int size, int max_index) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<int> dist(0, max_index - 1);

    for (int i = 0; i < size; ++i) {
        indices[i] = dist(gen);
    }
}

double computeChecksum(const double* data, int size) {
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}

// VectorAddKernel implementation

void VectorAddKernel::initialize(const ProblemConfig& config) {
    size = config.size;

    // Allocate memory
    a = new double[size];
    b = new double[size];
    c = new double[size];

    // Initialize with random data
    initializeRandomData(a, size, 0.0, 1.0);
    initializeRandomData(b, size, 0.0, 1.0);
    std::memset(c, 0, size * sizeof(double));
}

void VectorAddKernel::cleanup() {
    delete[] a;
    delete[] b;
    delete[] c;
    a = b = c = nullptr;
    size = 0;
}

KernelFeatures VectorAddKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    features.memory_footprint = (3.0f * config.size * sizeof(double)) / (1024.0f * 1024.0f); // MB
    features.arithmetic_intensity = 1.0f / (3.0f * sizeof(double)); // 1 op per 3 memory accesses
    features.sparsity_level = 0.0f; // Dense operation
    features.access_pattern = 0.0f; // Sequential access
    return features;
}

// DaxpyKernel implementation

void DaxpyKernel::initialize(const ProblemConfig& config) {
    size = config.size;

    // Allocate memory
    x = new double[size];
    y = new double[size];

    // Initialize with random data
    initializeRandomData(x, size, 0.0, 1.0);
    initializeRandomData(y, size, 0.0, 1.0);
    alpha = 2.5;
}

void DaxpyKernel::cleanup() {
    delete[] x;
    delete[] y;
    x = y = nullptr;
    size = 0;
}

KernelFeatures DaxpyKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    features.memory_footprint = (2.0f * config.size * sizeof(double)) / (1024.0f * 1024.0f);
    features.arithmetic_intensity = 2.0f / (3.0f * sizeof(double)); // 2 ops per 3 memory accesses
    features.sparsity_level = 0.0f;
    features.access_pattern = 0.0f;
    return features;
}

// MatVecKernel implementation

void MatVecKernel::initialize(const ProblemConfig& config) {
    // Make it a square matrix for simplicity
    rows = static_cast<int>(std::sqrt(config.size));
    cols = rows;

    // Allocate memory
    A = new double[rows * cols];
    x = new double[cols];
    y = new double[rows];

    // Initialize with random data
    initializeRandomData(A, rows * cols, 0.0, 1.0);
    initializeRandomData(x, cols, 0.0, 1.0);
    std::memset(y, 0, rows * sizeof(double));
}

void MatVecKernel::cleanup() {
    delete[] A;
    delete[] x;
    delete[] y;
    A = x = y = nullptr;
    rows = cols = 0;
}

KernelFeatures MatVecKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    int matrix_size = static_cast<int>(std::sqrt(config.size));
    features.problem_size_log = std::log10(static_cast<float>(matrix_size * matrix_size));
    features.memory_footprint = ((matrix_size * matrix_size + 2 * matrix_size) * sizeof(double)) / (1024.0f * 1024.0f);
    features.arithmetic_intensity = 2.0f / sizeof(double); // 2 ops per memory access
    features.sparsity_level = 0.0f;
    features.access_pattern = 0.2f; // Some non-sequential due to matrix access
    return features;
}

// Stencil1DKernel implementation

void Stencil1DKernel::initialize(const ProblemConfig& config) {
    size = config.size;
    stencil_radius = std::min(config.stencil_size / 2, 4); // Cap at 4 for safety

    // Allocate memory
    input = new double[size];
    output = new double[size];
    weights = new double[2 * stencil_radius + 1];

    // Initialize with random data
    initializeRandomData(input, size, 0.0, 1.0);
    std::memset(output, 0, size * sizeof(double));

    // Initialize stencil weights (normalized)
    double total_weight = 0.0;
    for (int i = 0; i < 2 * stencil_radius + 1; ++i) {
        weights[i] = 1.0 / (std::abs(i - stencil_radius) + 1.0);
        total_weight += weights[i];
    }
    for (int i = 0; i < 2 * stencil_radius + 1; ++i) {
        weights[i] /= total_weight;
    }
}

void Stencil1DKernel::cleanup() {
    delete[] input;
    delete[] output;
    delete[] weights;
    input = output = weights = nullptr;
    size = 0;
    stencil_radius = 0;
}

KernelFeatures Stencil1DKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    features.memory_footprint = (2.0f * config.size * sizeof(double)) / (1024.0f * 1024.0f);
    int stencil_size = 2 * std::min(config.stencil_size / 2, 4) + 1;
    features.arithmetic_intensity = static_cast<float>(stencil_size) / sizeof(double);
    features.sparsity_level = 0.0f;
    features.access_pattern = 0.1f; // Mostly sequential with some locality
    return features;
}

// ReductionKernel implementation

void ReductionKernel::initialize(const ProblemConfig& config) {
    size = config.size;

    // Allocate memory
    data = new double[size];

    // Initialize with random data
    initializeRandomData(data, size, 0.0, 1.0);
    result = 0.0;
}

void ReductionKernel::cleanup() {
    delete[] data;
    data = nullptr;
    size = 0;
    result = 0.0;
}

KernelFeatures ReductionKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    features.memory_footprint = (config.size * sizeof(double)) / (1024.0f * 1024.0f);
    features.arithmetic_intensity = 1.0f / sizeof(double); // 1 op per memory access
    features.sparsity_level = 0.0f;
    features.access_pattern = 0.0f; // Pure sequential
    return features;
}

// SpMVKernel implementation

void SpMVKernel::initialize(const ProblemConfig& config) {
    num_rows = static_cast<int>(std::sqrt(config.size));
    int num_cols = num_rows;

    // Generate sparse matrix in CSR format
    // Number of non-zeros based on sparsity level
    double density = 1.0 - config.sparsity;
    num_nonzeros = static_cast<int>(num_rows * num_cols * density);
    num_nonzeros = std::max(num_rows, num_nonzeros); // At least one per row

    // Allocate memory
    values = new double[num_nonzeros];
    col_indices = new int[num_nonzeros];
    row_ptr = new int[num_rows + 1];
    x = new double[num_cols];
    y = new double[num_rows];

    // Initialize sparse matrix
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> val_dist(0.0, 1.0);
    std::uniform_int_distribution<int> col_dist(0, num_cols - 1);

    int nnz_idx = 0;
    row_ptr[0] = 0;

    for (int row = 0; row < num_rows; ++row) {
        int nnz_per_row = num_nonzeros / num_rows;
        if (row < num_nonzeros % num_rows) nnz_per_row++;

        std::vector<int> cols;
        for (int i = 0; i < nnz_per_row && nnz_idx < num_nonzeros; ++i) {
            cols.push_back(col_dist(gen));
        }
        std::sort(cols.begin(), cols.end());

        for (int col : cols) {
            values[nnz_idx] = val_dist(gen);
            col_indices[nnz_idx] = col;
            nnz_idx++;
        }

        row_ptr[row + 1] = nnz_idx;
    }

    // Initialize vectors
    initializeRandomData(x, num_cols, 0.0, 1.0);
    std::memset(y, 0, num_rows * sizeof(double));
}

void SpMVKernel::cleanup() {
    delete[] values;
    delete[] col_indices;
    delete[] row_ptr;
    delete[] x;
    delete[] y;
    values = nullptr;
    col_indices = row_ptr = nullptr;
    x = y = nullptr;
    num_rows = num_nonzeros = 0;
}

KernelFeatures SpMVKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    float density = 1.0f - config.sparsity;
    features.memory_footprint = ((config.size * density + config.size) * sizeof(double)) / (1024.0f * 1024.0f);
    features.arithmetic_intensity = 2.0f / (sizeof(double) + sizeof(int)); // 2 ops per value+index access
    features.sparsity_level = config.sparsity;
    features.access_pattern = 0.5f + 0.5f * config.sparsity; // More random with higher sparsity
    return features;
}

// IndirectKernel implementation

void IndirectKernel::initialize(const ProblemConfig& config) {
    size = config.size;

    // Allocate memory
    src = new double[size];
    dst = new double[size];
    indices = new int[size];

    // Initialize data
    initializeRandomData(src, size, 0.0, 1.0);
    std::memset(dst, 0, size * sizeof(double));

    // Initialize indices
    if (config.use_indirect) {
        // Random access pattern
        initializeRandomIndices(indices, size, size);
    } else {
        // Sequential access pattern
        for (int i = 0; i < size; ++i) {
            indices[i] = i;
        }
    }
}

void IndirectKernel::cleanup() {
    delete[] src;
    delete[] dst;
    delete[] indices;
    src = dst = nullptr;
    indices = nullptr;
    size = 0;
}

KernelFeatures IndirectKernel::extractFeatures(const ProblemConfig& config) const {
    KernelFeatures features;
    features.problem_size_log = std::log10(static_cast<float>(config.size));
    features.memory_footprint = (2.0f * config.size * sizeof(double) + config.size * sizeof(int)) / (1024.0f * 1024.0f);
    features.arithmetic_intensity = 1.0f / (2.0f * sizeof(double) + sizeof(int));
    features.sparsity_level = 0.0f;
    features.access_pattern = config.use_indirect ? 1.0f : 0.0f;
    return features;
}

// Factory function

std::unique_ptr<KernelBase> createKernel(const std::string& kernel_name) {
    if (kernel_name == "VectorAdd") {
        return std::make_unique<VectorAddKernel>();
    } else if (kernel_name == "DAXPY") {
        return std::make_unique<DaxpyKernel>();
    } else if (kernel_name == "MatVec") {
        return std::make_unique<MatVecKernel>();
    } else if (kernel_name == "Stencil1D") {
        return std::make_unique<Stencil1DKernel>();
    } else if (kernel_name == "Reduction") {
        return std::make_unique<ReductionKernel>();
    } else if (kernel_name == "SpMV") {
        return std::make_unique<SpMVKernel>();
    } else if (kernel_name == "Indirect") {
        return std::make_unique<IndirectKernel>();
    }
    return nullptr;
}

} // namespace apollo_demo

Reading through the Apollo documentation, I think this setup could be a lot less verbose if I were to use the Apollo-enabled RAJA fork from ( https://github.com/LLNL/RAJA/compare/develop...ggeorgakoudis:RAJA:feature/apollo). But at this point that’s an extremely old version of RAJA, and I don’t really trust Claude to be able to cope with the older API and any important changes that might have happened since.

For the parameters that it chose by default, we don’t really get anything exciting out of Apollo. It determines that for all of those kernels, with an input size of around 10^5 the fastest execution policy is just the serial one and that the OpenMP thread overhead isn’t worth it. But if I increase the problem size by a factor of 100, it starts picking other OpenMP policies. It also really helps if I make sure my system isn’t doing anything else at the time, as the OpenMP policies with explicit batch policies are the ones that suffer the most if other processes are taking resources:

Command: APOLLO_POLICY_MODEL='DecisionTree,explore=RoundRobin,max_depth=4' ./apollo_raja_demo_cpu --adaptive

Apollo-RAJA CPU Demonstration
==============================

Configuration:
  Number of policies: 5
  Available policies:
    0: Sequential
    1: OpenMP Default
    2: OpenMP Static-16
    3: OpenMP Dynamic-8
    4: OpenMP Guided
  OpenMP threads: 8
  Apollo Policy Model: DecisionTree,explore=RoundRobin,max_depth=4
== APOLLO: Looked for APOLLO_COLLECTIVE_TRAINING with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_LOCAL_TRAINING with getenv(), found nothing, using '1' (default) instead.
== APOLLO: Looked for APOLLO_SINGLE_MODEL with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_REGION_MODEL with getenv(), found nothing, using '1' (default) instead.
== APOLLO: Looked for APOLLO_GLOBAL_TRAIN_PERIOD with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_PER_REGION_TRAIN_PERIOD with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_TRACE_POLICY with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_STORE_MODELS with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_TRACE_RETRAIN with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_TRACE_ALLGATHER with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_TRACE_BEST_POLICIES with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_RETRAIN_ENABLE with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_RETRAIN_TIME_THRESHOLD with getenv(), found nothing, using '2.0' (default) instead.
== APOLLO: Looked for APOLLO_RETRAIN_REGION_THRESHOLD with getenv(), found nothing, using '0.5' (default) instead.
== APOLLO: Looked for APOLLO_TRACE_CSV with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_PERSISTENT_DATASETS with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_STORE_EXEC_INFO with getenv(), found nothing, using '0' (default) instead.
== APOLLO: Looked for APOLLO_OUTPUT_DIR with getenv(), found nothing, using '.apollo' (default) instead.
== APOLLO: Looked for APOLLO_DATASETS_DIR with getenv(), found nothing, using 'datasets' (default) instead.
== APOLLO: Looked for APOLLO_TRACES_DIR with getenv(), found nothing, using 'traces' (default) instead.
== APOLLO: Looked for APOLLO_MODELS_DIR with getenv(), found nothing, using 'models' (default) instead.

============================================================
ADAPTIVE SCENARIO: Apollo learns optimal policies
============================================================

--- EXPLORATION PHASE ---
Apollo explores different policies...

=== Running VectorAdd Kernel ===
Problem size: 100000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 0.032 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.182 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.173 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.799 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.092 ms

Statistics:
  Average: 0.256 ms
  Min:     0.032 ms
  Max:     0.799 ms

=== Running DAXPY Kernel ===
Problem size: 100000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 0.025 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.054 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.086 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.126 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.124 ms

Statistics:
  Average: 0.083 ms
  Min:     0.025 ms
  Max:     0.126 ms

=== Running Stencil1D Kernel ===
Problem size: 100000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 0.132 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.131 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.145 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.172 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.140 ms

Statistics:
  Average: 0.144 ms
  Min:     0.131 ms
  Max:     0.172 ms

=== Running Reduction Kernel ===
Problem size: 100000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 0.100 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.086 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.077 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.129 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.102 ms

Statistics:
  Average: 0.099 ms
  Min:     0.077 ms
  Max:     0.129 ms

=== Running VectorAdd Kernel ===
Problem size: 1000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 1.090 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.569 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 1.090 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.679 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.522 ms

Statistics:
  Average: 0.790 ms
  Min:     0.522 ms
  Max:     1.090 ms

=== Running DAXPY Kernel ===
Problem size: 1000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 1.183 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.459 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.733 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.709 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.369 ms

Statistics:
  Average: 0.691 ms
  Min:     0.369 ms
  Max:     1.183 ms

=== Running Stencil1D Kernel ===
Problem size: 1000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 1.275 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.764 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.950 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.965 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.618 ms

Statistics:
  Average: 0.914 ms
  Min:     0.618 ms
  Max:     1.275 ms

=== Running Reduction Kernel ===
Problem size: 1000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 0.983 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 0.416 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 0.385 ms
  Apollo selected: OpenMP Dynamic-8 (policy 3)
  Iteration  4: 0.616 ms
  Apollo selected: OpenMP Guided (policy 4)
  Iteration  5: 0.305 ms

Statistics:
  Average: 0.541 ms
  Min:     0.305 ms
  Max:     0.983 ms

--- TRAINING PHASE ---
Apollo has trained models based on exploration data

--- EXPLOITATION PHASE ---
Apollo uses learned models to select optimal policies...

=== Running VectorAdd Kernel ===
Problem size: 10000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 18.672 ms
  Apollo selected: Sequential (policy 0)
  Iteration  2: 5.530 ms
  Apollo selected: Sequential (policy 0)
  Iteration  3: 5.048 ms
  Apollo selected: Sequential (policy 0)
  Iteration  4: 5.151 ms
  Apollo selected: Sequential (policy 0)
  Iteration  5: 5.259 ms

Statistics:
  Average: 7.932 ms
  Min:     5.048 ms
  Max:     18.672 ms

=== Running DAXPY Kernel ===
Problem size: 10000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 4.582 ms
  Apollo selected: Sequential (policy 0)
  Iteration  2: 4.827 ms
  Apollo selected: Sequential (policy 0)
  Iteration  3: 4.528 ms
  Apollo selected: Sequential (policy 0)
  Iteration  4: 4.759 ms
  Apollo selected: Sequential (policy 0)
  Iteration  5: 4.752 ms

Statistics:
  Average: 4.690 ms
  Min:     4.528 ms
  Max:     4.827 ms

=== Running Stencil1D Kernel ===
Problem size: 10000000
Iterations: 5
  Apollo selected: OpenMP Default (policy 1)
  Iteration  1: 5.894 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 6.527 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  3: 6.348 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  4: 6.901 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  5: 6.106 ms

Statistics:
  Average: 6.355 ms
  Min:     5.894 ms
  Max:     6.901 ms

=== Running Reduction Kernel ===
Problem size: 10000000
Iterations: 5
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  1: 3.036 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  2: 2.883 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 3.074 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  4: 2.620 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  5: 2.647 ms


Statistics:
  Average: 2.852 ms
  Min:     2.620 ms
  Max:     3.074 ms

=== Running VectorAdd Kernel ===
Problem size: 100000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 526.750 ms
  Apollo selected: Sequential (policy 0)
  Iteration  2: 84.982 ms
  Apollo selected: Sequential (policy 0)
  Iteration  3: 48.321 ms
  Apollo selected: Sequential (policy 0)
  Iteration  4: 59.781 ms
  Apollo selected: Sequential (policy 0)
  Iteration  5: 751.197 ms

Statistics:
  Average: 294.206 ms
  Min:     48.321 ms
  Max:     751.197 ms

=== Running DAXPY Kernel ===
Problem size: 100000000
Iterations: 5
  Apollo selected: Sequential (policy 0)
  Iteration  1: 81.666 ms
  Apollo selected: Sequential (policy 0)
  Iteration  2: 72.213 ms
  Apollo selected: Sequential (policy 0)
  Iteration  3: 46.080 ms
  Apollo selected: Sequential (policy 0)
  Iteration  4: 43.956 ms
  Apollo selected: Sequential (policy 0)
  Iteration  5: 44.062 ms

Statistics:
  Average: 57.595 ms
  Min:     43.956 ms
  Max:     81.666 ms

=== Running Stencil1D Kernel ===
Problem size: 100000000
Iterations: 5
  Apollo selected: OpenMP Default (policy 1)
  Iteration  1: 63.077 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  2: 52.470 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  3: 52.742 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  4: 52.084 ms
  Apollo selected: OpenMP Default (policy 1)
  Iteration  5: 53.496 ms

Statistics:
  Average: 54.774 ms
  Min:     52.084 ms
  Max:     63.077 ms

=== Running Reduction Kernel ===
Problem size: 100000000
Iterations: 5
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  1: 41.429 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  2: 25.145 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  3: 25.065 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  4: 24.666 ms
  Apollo selected: OpenMP Static-16 (policy 2)
  Iteration  5: 25.017 ms

Statistics:
  Average: 28.264 ms
  Min:     24.666 ms
  Max:     41.429 ms

So overall pretty interesting! In the state the project is currently in, it is not at all obvious to me how I would use it in a codebase that isn’t already structured around RAJA execution policies.