LinearSolve.jl Autotuning: Community-Driven Algorithm Selection for Optimal Performance

Linear algebra operations form the computational backbone of scientific computing, yet choosing the optimal algorithm for a given problem and hardware configuration remains a persistent challenge. Today, we're excited to introduce LinearSolveAutotune.jl, a new community-driven autotuning system that automatically benchmarks and selects the best linear solver algorithms for your specific hardware configuration.

The Challenge: One Size Doesn't Fit All

LinearSolve.jl provides a unified interface to over 20 different linear solving algorithms, from generic Julia implementations to highly optimized vendor libraries like Intel MKL, Apple Accelerate, and GPU-accelerated solvers. Each algorithm excels in different scenarios.

The optimal choice depends on matrix size, sparsity, numerical type, and critically, your specific hardware. An M2 MacBook Pro has very different performance characteristics than an AMD Threadripper workstation with an NVIDIA GPU.

Enter LinearSolveAutotune: Community-Powered Performance

LinearSolveAutotune addresses this challenge through a unique approach: collaborative benchmarking with optional telemetry sharing. Here's how it works:

  1. Local Benchmarking

Run comprehensive benchmarks on your machine with a simple command:

using LinearSolve, LinearSolveAutotune

# Run benchmarks across different matrix sizes and types
results = autotune_setup()

# View performance summary
display(results)

# Generate performance visualization
plot(results)

The system automatically:

  2. Smart Recommendations

Based on your benchmarks, LinearSolveAutotune generates tailored recommendations for each scenario:

# Example output from an Apple M2 system:
# ┌─────────────┬────────────────────────────────┐
# │ Size Range  │ Best Algorithm                 │
# ├─────────────┼────────────────────────────────┤
# │ tiny (5-20) │ RFLUFactorization              │
# │ small       │ RFLUFactorization              │
# │ medium      │ AppleAccelerateLUFactorization │
# │ large       │ AppleAccelerateLUFactorization │
# │ huge        │ MetalLUFactorization           │
# └─────────────┴────────────────────────────────┘
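
To make the idea behind such a table concrete, here is a minimal sketch that times two candidate solvers at a few matrix sizes and records the fastest one for each size. It uses only Julia's LinearAlgebra standard library as a stand-in for the LinearSolve.jl algorithms; the names CANDIDATES, best_time, and recommend are hypothetical, and this is not how LinearSolveAutotune itself is implemented.

```julia
using LinearAlgebra

# Candidate solvers keyed by a label. These stdlib factorizations stand in for
# LinearSolve.jl algorithms; the labels are illustrative only.
const CANDIDATES = Dict(
    "lu" => (A, b) -> lu(A) \ b,
    "qr" => (A, b) -> qr(A) \ b,
)

# Best-of-`reps` wall-clock time in seconds for one solver on one problem.
function best_time(solver, A, b; reps = 5)
    solver(A, b)  # warm-up call so compilation is not timed
    return minimum(@elapsed(solver(A, b)) for _ in 1:reps)
end

# For each matrix size, time every candidate and keep the fastest label.
function recommend(sizes; reps = 5)
    table = Dict{Int,String}()
    for n in sizes
        A, b = rand(n, n), rand(n)
        best_name, best_t = "", Inf
        for (name, solver) in CANDIDATES
            t = best_time(solver, A, b; reps = reps)
            if t < best_t
                best_name, best_t = name, t
            end
        end
        table[n] = best_name
    end
    return table
end

recommendations = recommend([10, 100, 500])
for n in sort(collect(keys(recommendations)))
    println("n = ", n, "  =>  ", recommendations[n])
end
```

Which label wins at each size depends on the machine it runs on, which is exactly why the recommendations are generated locally.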

  3. Community Telemetry (Optional)

The real innovation lies in opt-in community telemetry. By sharing your benchmark results, you contribute to a growing database that helps improve algorithm selection heuristics for everyone:

# Share your results with the community
share_results(results)

This creates an automatic GitHub comment on our results collection issue with:

Privacy First: The telemetry system:

Real-World Impact: Performance Gains in the Wild

The community has already contributed benchmarks from diverse hardware configurations, revealing fascinating insights:

Apple Silicon Optimization

On Apple M2 processors, we discovered that Apple's Accelerate framework delivers exceptional performance for medium and large matrices, reaching 750+ GFLOPS on large Float32 matrices. However, for tiny matrices (smaller than 20×20), the pure Julia RFLUFactorization is 3-5x faster due to lower call overhead.
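
For readers who want to relate a GFLOPS figure to a measured time, here is a rough sketch. It assumes the common convention of counting an LU factorization of an n×n matrix as about (2/3)n³ floating-point operations; the post does not state which convention the benchmark suite uses, and the helper names lu_flops, gflops, and measure_lu_gflops are hypothetical.

```julia
using LinearAlgebra

# Nominal floating-point operation count for LU factorization of an n×n matrix
# (assumed convention: roughly 2/3 * n^3).
lu_flops(n) = 2 * n^3 / 3

# Convert a measured time in seconds into GFLOPS under that convention.
gflops(n, seconds) = lu_flops(n) / seconds / 1e9

# Time the stdlib LU on a random matrix and report the implied GFLOPS.
function measure_lu_gflops(n; T = Float32, reps = 3)
    A = rand(T, n, n)
    lu(A)  # warm-up call so compilation is not timed
    t = minimum(@elapsed(lu(A)) for _ in 1:reps)
    return gflops(n, t)
end

println("LU on 500×500 Float32: ", round(measure_lu_gflops(500); digits = 1), " GFLOPS")
```

The number printed here reflects whichever BLAS/LAPACK backend your Julia installation uses, so it is not directly comparable to the Accelerate figure quoted above.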

GPU Acceleration Patterns

Metal acceleration on Apple Silicon shows interesting threshold behavior:

Complex Number Performance

For complex arithmetic, we found that specialized algorithms matter even more:

Using the Results: Automatic Algorithm Selection

The beauty of LinearSolve.jl's autotuning system is that you don't need to manually specify algorithms. The benchmark results from the community directly improve the default heuristics, so you simply use:

using LinearSolve

# Create your linear problem
A = rand(100, 100)
b = rand(100)
prob = LinearProblem(A, b)

# Just solve - LinearSolve automatically picks the best algorithm!
sol = solve(prob)  # Uses optimized heuristics based on community benchmarks
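
If you do want to pin a specific algorithm, for example to compare it against the default choice on your own machine, LinearSolve.jl accepts the algorithm as a second argument to solve. A small sketch using LUFactorization and QRFactorization, with the residual check added here purely for illustration:

```julia
using LinearSolve, LinearAlgebra

A = rand(100, 100)
b = rand(100)
prob = LinearProblem(A, b)

sol_default = solve(prob)                     # default heuristic picks the algorithm
sol_lu      = solve(prob, LUFactorization())  # explicitly request LU
sol_qr      = solve(prob, QRFactorization())  # explicitly request QR

# Relative residual norm(A*x - b) / norm(b) for a computed solution x.
rel_residual(x) = norm(A * x - b) / norm(b)

println("default: ", rel_residual(sol_default.u))
println("LU:      ", rel_residual(sol_lu.u))
println("QR:      ", rel_residual(sol_qr.u))
```

All three calls solve the same system, so the residuals should be comparably small; only the runtime is expected to differ.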

The autotuning results you and others share help LinearSolve.jl make intelligent decisions about:

By contributing your benchmark results with share_results(), you're directly improving the default algorithm selection for everyone. The more diverse hardware configurations we collect, the smarter the automatic selection becomes.

Performance Visualization: A Picture Worth 1000 Benchmarks

LinearSolveAutotune generates comprehensive performance visualizations showing:

Here's an example from recent community submissions showing the dramatic performance differences across algorithms:

Metal GPU vs CPU Performance (Apple M2)
┌──────────────────────────────────────────────┐
│ 1000 ┤ ▁▁▁▁▁▂▂▃▄▅▆▇█ Metal GPU               │
│      │                                       │
│  500 ┤     ▅▆▇██████ Apple Accelerate        │
│      │   ▂▄████▅▃▂▁                          │
│  100 ┤ ▆████▃▁       Generic LU              │
│      │████▁                                  │
│   10 ┤██             RF Factorization        │
│      │                                       │
│    1 └────────────────────────────────────── │
│       10   100   1000   10000                │
│            Matrix Size (n×n)                 │
└──────────────────────────────────────────────┘

How the Telemetry System Works

The telemetry system is designed with transparency and user control at its core:

  1. Local Execution: All benchmarks run locally on your machine

  2. Data Generation: Results are formatted as markdown tables and plots

  3. Authentication: Uses GitHub OAuth for secure, transparent submission

  4. Public Sharing: Creates a comment on a public GitHub issue

  5. Community Analysis: Results feed into improved algorithm selection heuristics
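
Step 2 can be pictured with a short sketch that renders benchmark rows as a markdown table. The BenchRow type, its fields, and the to_markdown helper are hypothetical and do not reflect the package's actual output format; the numbers in the example are placeholders, not measurements.

```julia
using Printf

# Hypothetical result row: algorithm label, matrix size, and measured GFLOPS.
struct BenchRow
    algorithm::String
    n::Int
    gflops::Float64
end

# Render rows as a GitHub-flavored markdown table (numeric columns right-aligned).
function to_markdown(rows::Vector{BenchRow})
    io = IOBuffer()
    println(io, "| Algorithm | Size | GFLOPS |")
    println(io, "|---|---:|---:|")
    for r in rows
        @printf(io, "| %s | %d | %.2f |\n", r.algorithm, r.n, r.gflops)
    end
    return String(take!(io))
end

# Placeholder values for illustration only, not real benchmark results.
rows = [
    BenchRow("LUFactorization", 100, 1.0),
    BenchRow("QRFactorization", 100, 0.5),
]
print(to_markdown(rows))
```

Because the output is plain markdown, it can be pasted into a GitHub comment and read by anyone without extra tooling.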

The collected data helps us:

Getting Started

Ready to optimize your linear algebra performance? Here's how to get started:

# Install the packages
using Pkg
Pkg.add(["LinearSolve", "LinearSolveAutotune"])

# Run comprehensive benchmarks
using LinearSolve, LinearSolveAutotune
results = autotune_setup()

# Analyze your results
display(results)
plot(results)

# Optional: Share with the community
share_results(results)

The Road Ahead

LinearSolveAutotune represents a new paradigm in scientific computing: community-driven performance optimization. By aggregating performance data across diverse hardware configurations, we can:

We envision expanding this approach to other SciML packages, creating a comprehensive performance knowledge base that benefits the entire Julia scientific computing ecosystem.

Join the Community Effort

The success of LinearSolveAutotune depends on community participation. Whether you're running on a laptop, workstation, or HPC cluster, your benchmarks provide valuable data that helps improve performance for everyone.

Visit our results collection issue to see community submissions, and consider running the autotuning suite on your hardware. Together, we're building a faster, smarter linear algebra ecosystem for Julia.

Acknowledgments

LinearSolveAutotune was developed as part of the SciML ecosystem with contributions from the Julia community. Special thanks to all early adopters who have shared their benchmark results and helped refine the system.


For more information, see the LinearSolve.jl documentation and join the discussion on Julia Discourse.