Leveraging LLMs for SIMD Optimization - A Journey from Knowledge Base to AI Assistant

1. Introduction

Problem statement: Need for efficient SIMD optimization across architectures

Software optimization using SIMD (Single Instruction, Multiple Data) instructions remains a critical yet challenging aspect of high-performance computing. While SIMD instructions can dramatically improve performance by processing multiple data elements simultaneously, developers face several significant challenges:

  1. Architecture Diversity: Different processor architectures (x86, ARM, POWER, RISC-V, MIPS, Loongson) implement SIMD instructions differently, making cross-platform optimization complex.

  2. Steep Learning Curve: Each architecture's SIMD instruction set (SSE4.2, AVX2, AVX512, NEON, SVE, SVE2) requires specific knowledge and expertise, creating a significant barrier to entry.

  3. Documentation Complexity: Official documentation for SIMD intrinsics is often scattered across multiple sources, making it time-consuming to find equivalent instructions across architectures.

  4. Constant Evolution: As new processor generations introduce enhanced SIMD capabilities, keeping track of the latest optimizations becomes increasingly challenging.

Overview of SIMD.info knowledge base

To address these challenges, we developed SIMD.info, a comprehensive online knowledge base that serves as a centralized repository for SIMD instruction information. The platform contains:

  • A database of over 10,000 intrinsics across multiple architectures
  • Detailed documentation including purpose, results, and usage examples
  • Cross-reference capabilities for finding equivalent instructions
  • Code examples and assembly output
  • Architecture-specific information and compatibility details
  • Advanced search capabilities (intrinsic/architecture specific, natural language)
  • Community features (discussion forum - StackOverflow-like Q&A system)

Why we considered LLMs as a solution

While our knowledge base significantly improved access to SIMD information, we recognized an opportunity to enhance its utility through Large Language Models (LLMs). Our motivation for exploring LLMs included:

  1. Leveraging Structured Data Assets: With over 10,000 well-structured, categorized SIMD intrinsics in our database, we recognized that data is the new currency in the AI era. Our carefully curated knowledge base represented a valuable asset that could be transformed into an intelligent system through LLMs, potentially creating additional value from our existing documentation efforts.

  2. Code Optimization Assistant: Enable developers to input code snippets that need optimization, and receive suggestions for SIMD-based improvements, along with relevant instruction alternatives and implementation strategies.

  3. Cross-Architecture Translation: Automatically identify equivalent SIMD instructions across different architectures.

  4. Context-Aware Assistance: Provide optimization suggestions based on the specific code context and performance requirements.

  5. Knowledge Synthesis: Combine information from multiple intrinsics to suggest optimal implementation strategies.

Vision: AI-Powered SIMD Optimization Ecosystem

Our roadmap for transforming the static SIMD knowledge base into an intelligent optimization system consists of three key initiatives:

  1. Web Platform Integration
  • Direct integration with simd.ai
  • Users input functions and specify target architectures
  • System provides optimized SIMD translations in real-time
  2. CI/CD Pipeline Automation
  • Automated SIMD optimization within existing development workflows
  • Developers provide source code and unit tests
  • AI system performs:
    • Function translation
    • Iterative testing
    • Performance optimization
    • Best implementation selection
  3. IDE Integration
  • VS Code extension for seamless developer experience
  • Select code directly in the editor
  • Choose target architecture
  • Receive SIMD-optimized translations instantly

These initiatives represent our commitment to making SIMD optimization accessible and efficient for developers at every stage of their development process.

2. Initial Exploration Phase & Data Engineering

LLM Landscape Analysis

Model Categories Evaluated

  1. Proprietary Models
  • GPT-4, GPT-3.5, Claude, PaLM
  • High capability, but costly with vendor lock-in
  2. Open-Source Models
  • LLaMA derivatives, BERT, GPT-J, Falcon
  • Greater flexibility and control
  3. Code-Specific Models
  • StarCoder, CodeLlama, Codestral
  • Optimized for code tasks

Key Technical Factors

  1. Model Characteristics
  • Size range: 3B-405B parameters
  • CPU deployment focus: 3-7B models (with an exception for Nemotron-70B)
  • Memory and speed trade-offs
  2. Implementation Concerns
  • Licensing (Apache, MIT, proprietary)
  • CPU-only deployment requirements
  • Usage of the available system memory
  • Quantization options
  • Framework selection (starting with Hugging Face)

Implementation Steps

Our implementation began with Hugging Face integration, leveraging the StarCoder2-3B and 7B models to establish our baseline workflow. Through this process, we developed effective prompt engineering techniques and established fundamental operational patterns.
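As an illustration of that baseline, here is a minimal sketch of the Hugging Face workflow, assuming the publicly available bigcode/starcoder2-3b checkpoint; the prompt and generation settings are simplified examples rather than the exact ones we used.

# Minimal sketch of the baseline Hugging Face workflow on CPU (FP32 by default).
# The prompt below is illustrative; our real prompts were iterated on extensively.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "// Translate the following SSE4.2 intrinsic call to ARM NEON:\n"
    "__m128d r = _mm_mul_pd(a, b);\n"
    "// NEON equivalent:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))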

Data processing presented unique challenges in converting our YAML-based knowledge into LLM-compatible formats. We optimized context lengths and standardized terminology across different architectures while preserving technical accuracy.

For CPU optimization, we experimented with 8-bit and 16-bit quantization (where available), but found significant degradation in response quality, particularly for coding tasks, which are our main concern. We ultimately maintained FP32 precision to ensure accuracy, while implementing efficient batch processing and threading strategies.
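For reference, a quantization experiment of this kind can be sketched as follows; PyTorch dynamic int8 quantization is used here purely as an example of a CPU-friendly option, not as the exact procedure we applied to every model.

# Sketch of an 8-bit quantization experiment on CPU. Dynamic quantization of the
# Linear layers is one option; comparing the two outputs is how we spotted the
# quality degradation on code-generation tasks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # FP32 baseline

# Replace Linear layers with dynamically quantized int8 versions.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

prompt = "// NEON equivalent of the SSE4.2 intrinsic _mm_mul_pd:\n"
inputs = tokenizer(prompt, return_tensors="pt")
for label, m in (("fp32", model), ("int8", quantized)):
    out = m.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(label, tokenizer.decode(out[0], skip_special_tokens=True))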

Data Engineering

Our SIMD knowledge base contains over 10,000 intrinsics across multiple processor architectures, structured in YAML format with comprehensive technical details for each instruction. We transformed this data into LLM-friendly formats using a source-target approach, where each example pair contained structured information about architectures and instruction mappings.

Key challenges in this process included managing the large volume of technical data, maintaining accuracy across different architectures, and balancing detailed context with model limitations. We developed automated Python scripts for data processing and evaluation to address these challenges effectively.
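A simplified sketch of that conversion is shown below; the YAML field names (name, simd, equivalents) and file paths are illustrative, since the real schema carries considerably more detail per intrinsic.

# Sketch of the YAML -> source-target pair conversion used to build LLM-friendly data.
# Field names and paths are illustrative, not the actual knowledge-base schema.
import json
import yaml

def yaml_to_pairs(path):
    """Turn knowledge-base entries into source->target examples."""
    with open(path) as f:
        entries = yaml.safe_load(f)

    pairs = []
    for entry in entries:
        for eq in entry.get("equivalents", []):
            pairs.append({
                "input":  {"simd": entry["simd"], "intrinsic": entry["name"]},
                "output": {"simd": eq["simd"], "intrinsic": eq["name"]},
            })
    return pairs

if __name__ == "__main__":
    with open("simd_pairs.jsonl", "w") as out:
        for pair in yaml_to_pairs("intrinsics.yaml"):
            out.write(json.dumps(pair) + "\n")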

3. Learning Through Failure and Model Selection

Our first ambitious attempt at fine-tuning StarCoder2-7B resulted in a significant setback. The training process was projected to take 22 days and failed after 10 days, revealing our overconfidence in a single model and underestimation of CPU constraints. This failure prompted a more methodical approach to model selection.

We developed a systematic evaluation framework focusing mainly on models within the 3-7B parameter range, with strong CPU inference support and open-source licensing. Our testing methodology included:

  1. Reference Testing
  • Curated set of equivalent SIMD functions across architectures
  • Automated testing pipeline
  • Structured response analysis criteria
  2. Response Analysis
  • Format: Code block presentation (XML-style) and documentation style
  • Structure: Consistent organization and clear sections
  • Style Categories:
    • Comprehensive analysts: Detailed but verbose (models like Reflection & NeuralChat)
    • Direct implementers: Concise, code-focused (models like Codestral & CodeLlama)
    • Academic theorists: Theory-heavy approaches (earlier model versions)

This structured approach helped identify models that could provide accurate, well-formatted, and practical optimization suggestions, forming the foundation for our subsequent development phases.
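To make the response analysis concrete, here is a minimal sketch of the kind of check we ran on each reply; the XML-style tag name and the verbosity threshold are illustrative.

# Sketch of the structured response analysis: extract code from XML-style tags
# and use a crude length heuristic to separate "direct implementers" from more
# verbose styles. Tag name and threshold are examples only.
import re

CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def analyze_response(text):
    code_blocks = CODE_TAG.findall(text)
    line_count = len([l for l in text.splitlines() if l.strip()])
    style = "direct implementer" if line_count <= 20 else "comprehensive analyst"
    return {"has_code": bool(code_blocks), "line_count": line_count, "style": style}

print(analyze_response("<code>vmulq_f32(a, b);</code>\nShort explanation."))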

4. Refinement and Evaluation

Integration with Ollama

Ollama is a local AI model management tool that gives you full control to download, update, and delete models on your system. Our transition to it marked a significant improvement in model interaction and testing efficiency. Ollama provided two key interfaces:

  • CLI: Enabled rapid prototyping and testing through simple terminal commands
  • Python API: Facilitated automated testing and systematic evaluation

The platform's efficient model management system significantly reduced friction in our testing process, allowing quick iteration through different configurations.
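A minimal sketch of the Python API usage looks like this; the model name and prompt are examples, and the CLI equivalent is simply ollama run codestral "...".

# Sketch of the Ollama Python API call that replaced our ad-hoc CLI runs.
# Any locally pulled model works the same way.
import ollama

response = ollama.chat(
    model="codestral",
    messages=[{
        "role": "user",
        "content": "Translate the SSE4.2 intrinsic _mm_mul_pd to its NEON equivalent.",
    }],
)
print(response["message"]["content"])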

SYSTEM Prompts Implementation

The discovery of SYSTEM prompts proved pivotal for optimization. Our experimentation revealed three approaches:

  1. Behavioral Guidance: Focused on role and response format
  2. Technical Data: Direct SIMD knowledge base integration
  3. Hybrid Approach: Combined technical knowledge with behavioral guidelines (most effective)

The hybrid approach proved most successful, maintaining our source-target JSON structure while adding clear interaction parameters.
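As a sketch of how such a hybrid prompt can be baked into a model, the following assumes a Modelfile built on Codestral with a heavily abbreviated SYSTEM body; the actual SYSTEM prompt embeds far more of the knowledge base.

# Sketch: bake the hybrid SYSTEM prompt into an Ollama Modelfile and register it.
# The SYSTEM body is abbreviated; in practice it embeds the source->target JSON
# entries from our knowledge base plus the behavioral guidelines.
import subprocess

system_prompt = """You are a SIMD porting assistant.
Always answer with code wrapped in <code> tags and keep explanations brief.
Known mappings (excerpt):
{"input": {"simd": "SSE4.2", "intrinsic": "_mm_mul_pd"},
 "output": {"simd": "NEON", "intrinsic": "vmul_f64"}}
"""

modelfile = f'FROM codestral\nSYSTEM """{system_prompt}"""\n'

with open("Modelfile", "w") as f:
    f.write(modelfile)

# Register the customized model locally, then use it like any other:
#   ollama run simd-codestral "Translate _mm_add_ps to NEON"
subprocess.run(["ollama", "create", "simd-codestral", "-f", "Modelfile"], check=True)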

Evaluation Framework

We placed particular emphasis on measuring the quality of code-related responses, as this was crucial for our SIMD optimization and cross-architecture translation goals. While our research uncovered various established metrics like BLEU score and CodeBLEU, we decided to develop a custom suite of metrics that specifically addressed our unique requirements for SIMD instruction comparison and code similarity.

  1. Core Metrics
  • Levenshtein Similarity for syntactic accuracy
  • Abstract Syntax Tree (AST) Similarity for structural correctness
  • Token Overlap Similarity for functional completeness
  • Response Line Count for output efficiency
  2. Selection Process
  • Initial pool of 30 models reduced to 4 final selections
  • Over 20 comprehensive test runs per model
  • Focus on consistent high performance across metrics
  • Emphasis on reliable output formatting and resource efficiency

This metric suite provided a balanced view of model performance, considering both syntactic accuracy and functional correctness while remaining sensitive to the specific requirements of SIMD optimization tasks.
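The following sketch shows two of the core metrics and the combined score; the AST comparison is omitted here because it requires a C parser, and the 0.5/0.3/0.2 weighting is shown as an example that is consistent with the Weighted Score column reported below, not as a definitive formula.

# Sketch of two core metrics plus the combined score. AST similarity is handled
# by separate tooling and omitted here; the weights are illustrative.
import re

def levenshtein_similarity(a: str, b: str) -> float:
    """1 - normalized edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b), 1)

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of identifier-like tokens."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    return len(ta & tb) / max(len(ta | tb), 1)

def weighted_score(lev: float, ast: float, tok: float) -> float:
    # Example weighting; combines syntactic, structural, and token-level scores.
    return 0.5 * lev + 0.3 * ast + 0.2 * tok

ref = "float32x4_t r = vmulq_f32(a, b);"
hyp = "float32x4_t r = vaddq_f32(a, b);"
print(levenshtein_similarity(ref, hyp), token_overlap(ref, hyp))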

Testing Methodology

Building upon our initial response analysis, we developed a more structured and comprehensive approach to model evaluation. This refined methodology leveraged our experience with Ollama modelfiles and SYSTEM prompts, implementing a multi-phase testing strategy to systematically evaluate model performance across different complexity levels.

Phase 1: Direct Mapping

  • Focus on one-to-one mapping of SIMD intrinsics
  • Sample size: 100 selected intrinsics with a direct one-to-one mapping from one architecture to another
  • Multiple source-target architecture pairs (SSE4.2 -> NEON, SSE4.2 -> VSX, NEON -> VSX)
  • Clean, focused code output priority

For example, we prompted the LLMs to translate specific SIMD intrinsics between architectures, providing the source intrinsic and target architecture. Each prompt was enhanced with systematic context through the SYSTEM prompt, which included architecture-specific documentation from our knowledge base.

Sample prompt structure:

input:
  simd: SSE4.2
  intrinsic: _mm_mul_pd
output:
  simd: NEON
  intrinsic: vmul_f64
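A simplified version of the automated Phase 1 loop is sketched below, using difflib's ratio as a stand-in for the full metric suite and the hypothetical simd-codestral model created with our SYSTEM prompt.

# Sketch of the Phase 1 loop: for each curated 1-1 mapping, ask the model for the
# target intrinsic and score its reply against the reference.
from difflib import SequenceMatcher
import ollama

test_cases = [
    {"input": {"simd": "SSE4.2", "intrinsic": "_mm_mul_pd"},
     "output": {"simd": "NEON", "intrinsic": "vmul_f64"}},
    # ... the remaining curated one-to-one mappings
]

scores = []
for case in test_cases:
    prompt = (
        f"Give only the {case['output']['simd']} equivalent of the "
        f"{case['input']['simd']} intrinsic {case['input']['intrinsic']}."
    )
    reply = ollama.generate(model="simd-codestral", prompt=prompt)["response"]
    scores.append(SequenceMatcher(None, case["output"]["intrinsic"], reply.strip()).ratio())

print(f"mean similarity over {len(scores)} cases: {sum(scores) / len(scores):.4f}")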

Phase 2: Function-Level Translation

  • Complete function translation between architectures
  • Validation of intrinsic translation accuracy
  • Structure maintenance verification
  • Type conversion accuracy checks

For our function-level testing, we selected a diverse set of SIMD functions with escalating complexity levels. While the code below is shown in ARM NEON syntax, we maintained equivalent implementations across all target architectures (SSE4.2, VSX) in our test suite. This approach allowed us to validate translations between any architecture pair while maintaining functional equivalence.

#include <arm_neon.h>

// Basic Level: Performs a series of NEON vector operations
float32x4_t perform_calculations_neon(float32x4_t a, float32x4_t b) {

  uint32x4_t cmp_result = vcgtq_f32(a, b);   // comparison result intentionally unused in this test
  float32x4_t add_result = vaddq_f32(a, b);
  float32x4_t mul_result = vmulq_f32(add_result, b);
  float32x4_t sqrt_result = vsqrtq_f32(mul_result);
    
  return sqrt_result;
}

// Intermediate Level: Multiply and add arrays with scalar
void mul_add_arrays(float* a, float* b, float* result, float scalar) {

  float32x4_t vec_a = vld1q_f32(a);
  float32x4_t vec_b = vld1q_f32(b);
  
  float32x4_t vec_res = vmulq_f32(vec_a, vec_b);   
  float32x4_t vec_scalar = vdupq_n_f32(scalar);
  vec_res = vaddq_f32(vec_res, vec_scalar);
  
  vst1q_f32(result, vec_res);
}

// Advanced Level: Dot product with scaling operation
void dot_product_and_scale_neon(float *a, float *b, float *c, float *result) {
  float32x4_t va = vld1q_f32(a);
  float32x4_t vb = vld1q_f32(b);
  float32x4_t vc = vld1q_f32(c);
  
  float32x4_t mul = vmulq_f32(va, vb);
  float32x2_t sum = vpadd_f32(vget_low_f32(mul), vget_high_f32(mul));
  sum = vpadd_f32(sum, sum);   
  float32x4_t scale_factor = vdupq_lane_f32(sum, 0);   
  float32x4_t scaled_result = vmulq_f32(vc, scale_factor);
  
  vst1q_f32(result, scaled_result);
}

Each function tests different aspects of SIMD optimization, from basic arithmetic operations to more complex algorithms involving horizontal operations and data reorganization.

While not yet implemented, we have designed Phase 3 to test more complex scenarios. We deliberately postponed Phase 3 until we could establish strong baseline performance on the more straightforward translation tasks.

Results Analysis

Our analysis revealed distinct performance patterns across the tested models. The integration of our knowledge base via SYSTEM prompts significantly improved translation accuracy and consistency. Here is a representative sample that shows the difference between the unmodified models and the same models using SYSTEM prompts:

Code Snippets Results (each model was evaluated twice: with a Modelfile containing our custom SYSTEM directive, and unmodified; the metrics per run are Levenshtein, AST, Token Overlap, Weighted Score, and Line Count):

Model              | With SYSTEM directive (Lev / AST / Token / Weighted / Lines) | Unmodified (Lev / AST / Token / Weighted / Lines)
neural-chat        | 0.9702 / 0.9091 / 0.8947 / 0.9368 / 8   | 0.7345 / 0.6071 / 0.6000 / 0.6694 / 8
codestral          | 0.9281 / 0.9167 / 0.9048 / 0.9200 / 16  | 0.3708 / 0.3385 / 0.3016 / 0.3472 / 21
nemotron           | 0.7910 / 0.8976 / 0.8043 / 0.8257 / 13  | 0.6728 / 0.4315 / 0.3116 / 0.3872 / 15
solar              | 0.9459 / 0.8800 / 0.8636 / 0.9097 / 9   | 0.7343 / 0.5405 / 0.5000 / 0.6293 / 9
codellama          | 0.8900 / 0.8750 / 0.8571 / 0.8789 / 11  | 0.2143 / 0.2211 / 0.1868 / 0.2108 / 25
codegemma          | 0.9435 / 0.7391 / 0.7000 / 0.8335 / 18  | 0.9474 / 0.7391 / 0.7000 / 0.8354 / 24
llama3.2           | 0.8970 / 0.7391 / 0.7000 / 0.8102 / 23  | 0.2987 / 0.2424 / 0.1875 / 0.2596 / 32
reflection         | 0.7845 / 0.8182 / 0.7895 / 0.7956 / 76  | 0.6348 / 0.5556 / 0.5152 / 0.5871 / 79
llama3.1:8b        | 0.8235 / 0.6429 / 0.5600 / 0.7166 / 28  | 0.3851 / 0.2857 / 0.1667 / 0.3116 / 45
qwen2.5:14b        | 0.7861 / 0.6562 / 0.6207 / 0.7141 / 21  | 0.5983 / 0.3878 / 0.3478 / 0.4851 / 35
qwen2.5-coder:7b   | 0.7598 / 0.6250 / 0.5667 / 0.6807 / 22  | 0.2996 / 0.3134 / 0.2812 / 0.3001 / 47
phi3:medium        | 0.7172 / 0.6400 / 0.5217 / 0.6549 / 17  | 0.6051 / 0.4706 / 0.4194 / 0.5276 / 10
codellama:70b      | 0.5928 / 0.5676 / 0.5000 / 0.5667 / 18  | 0.7174 / 0.7333 / 0.7037 / 0.7194 / 17
qwen2.5:7b         | 0.8602 / 0.9655 / 0.9600 / 0.9117 / 9   | 0.7942 / 0.8750 / 0.8621 / 0.8320 / 26
gemma2             | 0.5856 / 0.3953 / 0.3250 / 0.4764 / 29  | 0.7955 / 0.6452 / 0.6071 / 0.7127 / 29
llama3.1           | 0.5585 / 0.3953 / 0.2895 / 0.4557 / 26  | 0.1973 / 0.1635 / 0.1200 / 0.1717 / 52
mistral            | 0.5945 / 0.3182 / 0.2308 / 0.4388 / 16  | 0.8033 / 0.6207 / 0.6522 / 0.7183 / 21
starcoder:7b       | 0.4263 / 0.4737 / 0.4000 / 0.4352 / 22  | 0.5988 / 0.7083 / 0.6842 / 0.6487 / 35
starling-lm        | 0.4983 / 0.3542 / 0.3111 / 0.4176 / 32  | 0.5762 / 0.4500 / 0.4167 / 0.5065 / 37
gemma2:2b          | 0.3550 / 0.2857 / 0.2222 / 0.3077 / 50  | 0.4172 / 0.3478 / 0.2791 / 0.3687 / 30
starcoder2:7b      | 0.1884 / 0.2353 / 0.1705 / 0.1989 / 34  | 0.0101 / 0.0000 / 0.0000 / 0.0051 / 3
phi3               | 0.1396 / 0.0724 / 0.0541 / 0.1023 / 17  | 0.1081 / 0.0609 / 0.0359 / 0.0795 / 72
1-1 Mapping Results:

Model            | Levenshtein
codestral:latest | 0.7473
nemotron:latest  | 0.7674
codellama:latest | 0.5141
qwen2.5-coder:7b | 0.4559
...

Through this rigorous process, we narrowed our selection to 4 models. We deliberately included models that might not have topped every metric but demonstrated:

  • Consistently high metric scores
  • Reliable performance across different test scenarios
  • Efficient resource utilization
  • Clean response formatting (code wrapped in XML tags, without excessive verbosity)

For example, while Nemotron showed slightly lower accuracy in function translations, it excelled in direct mapping tasks.

These models showed particular strength across all key selection criteria and became our final choices (for now at least). They also demonstrated complementary strengths, allowing us to potentially leverage different models for translations targeting different architectures.

5. Current Implementation

Selected Models

After the extensive evaluation, four models demonstrated superior performance for SIMD optimization:

  1. Codestral
  2. Nemotron
  3. CodeLlama
  4. Qwen2.5-7B

Performance Analysis

Our extensive testing revealed clear patterns in model performance across different tasks. In one-to-one SIMD intrinsic mapping, Codestral and Nemotron consistently emerged as the top performers, maintaining Levenshtein similarity scores above 0.74. For complete function translations, all four selected models (Codestral, Nemotron, CodeLlama, and Qwen2.5-7B) consistently ranked in the top 4-5 positions, though their relative ordering showed some variability across different test cases. Notably, Qwen2.5-7B often emerged as the number one model in handling function translations, indicating its particular strength in this area. To visualize these patterns, we've created two key visualizations:

  1. A bar chart showing the consistently superior performance of Codestral and Nemotron in one-to-one mapping tasks
  [Figure: One-to-One Mapping Performance]
  2. A line graph demonstrating the varying but consistently strong performance of all four models across multiple function translation tests
  [Figure: Function Mapping Performance]

A notable observation from the performance graph is test case 6, where all models except Codestral show a significant performance drop. This particular test incorporated the most challenging one-to-one SIMD mappings available in our knowledge base.

Next Steps

With these promising results, we are proceeding to the model fine-tuning phase to further optimize performance. Our next steps include:

  • Leveraging GPU acceleration for the fine-tuning process to enhance training efficiency
  • Publishing comprehensive benchmarks and performance metrics in our follow-up analysis

Stay tuned for our next blog where we'll share detailed insights from the fine-tuning experiments and their impact on model performance.

Event about "Software Optimization: Using SIMD on modern architectures"

We also organized an event in collaboration with the ACM Student Chapter at the Computer Engineering & Informatics Department (CEID) of the University of Patras, Greece. The event was sponsored by Arm. To learn more about the event, check out our LinkedIn post.

You can download the presentation here.

DB statistics
SIMD Engines: 5
C Intrinsics: 10702
NEON: 4232
AVX2: 462
AVX512: 4955
SSE4.2: 652
VSX: 401

Recent Updates
July 2025
  • Intrinsics Organization: Ongoing restructuring of uncategorized intrinsics for improved accessibility
  • Enhanced Filtering: New advanced filters added to the intrinsics tree for more precise results
  • Search Validation: Improved empty search handling with better user feedback
  • Changelog Display: Recent changes now visible to users for better transparency
  • New Blog Post: "Best Practices & API Integration" guide added to the blogs section
  • Dark Theme: Added support for dark theme for improved accessibility and user experience