The Optimization Layer Between Models and Silicon
MarsCompute is the adaptive optimization layer for heterogeneous AI hardware, continuously improving latency, efficiency, and small-batch behavior across NVIDIA, AMD, and mixed accelerator fleets.
- Architecture-aware optimization across heterogeneous hardware
- Self-evolving performance engine driven by live execution behavior
- Cross-platform execution intelligence for production AI infrastructure

The Problem
AI infrastructure bottlenecks are increasingly caused by optimization limits, not a lack of theoretical compute.
- Hardware heterogeneity across NVIDIA, AMD, and custom ASIC environments
- Underutilized accelerators caused by weak architecture-specific execution plans
- Poor small-batch latency in real production serving paths (illustrated in the sketch after this list)
- Manual kernel-level tuning loops that do not scale
- Framework-hardware fragmentation across compiler, runtime, and deployment layers
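To make the small-batch latency point concrete, here is a minimal measurement sketch. The model call is a stand-in NumPy layer, not a MarsCompute API, and the shapes and batch sizes are assumptions for illustration; the pattern it shows is that fixed per-call overheads amortize poorly at batch size 1, which is exactly the regime production serving paths hit.

```python
# Illustrative only: per-request latency vs. batch size on a stand-in
# workload (a single dense layer). Not MarsCompute code.
import time
import numpy as np

HIDDEN = 2048
weights = np.random.rand(HIDDEN, HIDDEN).astype(np.float32)

def serve(batch: np.ndarray) -> np.ndarray:
    """Stand-in for one inference step."""
    return batch @ weights

for batch_size in (1, 2, 4, 8, 32, 128):
    batch = np.random.rand(batch_size, HIDDEN).astype(np.float32)
    serve(batch)  # warm-up to exclude one-time allocation costs
    start = time.perf_counter()
    serve(batch)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    # Per-request latency: small batches amortize fixed overheads poorly.
    print(f"batch={batch_size:4d}  total={elapsed_ms:7.3f} ms  "
          f"per-request={elapsed_ms / batch_size:7.3f} ms")
```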
The Solution: Mars Optimization Brain
A system layer that continuously optimizes execution across heterogeneous hardware instead of relying on static, one-time tuning.
- Continuously discovers optimization opportunities from live workload behavior
- Learns hardware topology and architecture-level constraints
- Models kernel, memory, and runtime interactions before applying changes
- Plans non-conflicting optimization sequences using conflict-aware scheduling (sketched after this list)
- Evolves as new architectures, drivers, and serving patterns are introduced
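As a rough illustration of the conflict-aware scheduling idea, the sketch below groups candidate optimizations into sequential passes so that no two changes in the same pass touch the same resource. The `Opt` model, the resource tags, and `plan_passes` are assumptions made for illustration, not MarsCompute's actual planner.

```python
# Minimal sketch of conflict-aware optimization planning. All names and
# gain estimates are hypothetical, not the MarsCompute API.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Opt:
    name: str
    gain: float                 # estimated speedup, e.g. from live profiling
    touches: frozenset = field(default_factory=frozenset)  # resources mutated

def conflicts(a: Opt, b: Opt) -> bool:
    """Two optimizations conflict if they mutate overlapping resources."""
    return bool(a.touches & b.touches)

def plan_passes(candidates: list[Opt]) -> list[list[Opt]]:
    """Greedily group non-conflicting optimizations into sequential passes,
    highest estimated gain first, so each pass can be applied atomically."""
    pending = sorted(candidates, key=lambda o: o.gain, reverse=True)
    passes: list[list[Opt]] = []
    while pending:
        current: list[Opt] = []
        leftover: list[Opt] = []
        for opt in pending:
            if any(conflicts(opt, chosen) for chosen in current):
                leftover.append(opt)     # defer to a later pass
            else:
                current.append(opt)
        passes.append(current)
        pending = leftover
    return passes

candidates = [
    Opt("fuse_attention_kernels", 1.30, frozenset({"attn_kernel"})),
    Opt("retile_attention_blocks", 1.15, frozenset({"attn_kernel"})),
    Opt("pin_kv_cache_memory", 1.08, frozenset({"allocator"})),
]
for i, step in enumerate(plan_passes(candidates), start=1):
    print(f"pass {i}: {[o.name for o in step]}")
```

The two attention-kernel changes share a resource, so the planner defers the lower-gain one to a second pass while the allocator change rides along in the first.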
Why AMD / MI300 Matters
MI300-class hardware has major headroom. Turning that potential into production performance still requires deeper optimization intelligence.
Performance Evidence
Representative optimization result from our current execution pipeline.
Automated optimization delivered a 52% improvement on Flash Attention 2 on H100.
A recorded demonstration of the MarsCompute optimization workflow on a production-relevant inference kernel path.
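For context on how a before/after result like this is typically measured, here is a generic benchmarking sketch. The two kernels are hypothetical stand-ins (a loop and its closed form); the 52% figure above comes from the actual H100 run, not from this code.

```python
# Generic before/after kernel benchmark sketch. The kernels are stand-ins;
# the real measurement ran Flash Attention 2 on an H100.
import time
import statistics

def bench(fn, *args, iters: int = 50) -> float:
    """Median wall-clock time per call, in milliseconds."""
    fn(*args)                                  # warm-up
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def baseline_kernel(n: int) -> int:            # stand-in workload
    return sum(i * i for i in range(n))

def optimized_kernel(n: int) -> int:           # stand-in "tuned" workload
    return n * (n - 1) * (2 * n - 1) // 6      # closed form of the same sum

base_ms = bench(baseline_kernel, 200_000)
opt_ms = bench(optimized_kernel, 200_000)
print(f"baseline {base_ms:.3f} ms, optimized {opt_ms:.3f} ms, "
      f"improvement {100 * (base_ms - opt_ms) / base_ms:.1f}%")
```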
Product Layers
Three coordinated layers connect architecture signals to production execution improvements, as the sketch after this list illustrates.
Layer 1: Architecture Discovery
Layer 2: Optimization Planning Engine
Layer 3: Execution & Continuous Adaptation
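One way to picture how the three layers compose is the toy loop below: discovery feeds planning, planning feeds execution, and execution feeds fresh measurements back into the next cycle. Every function name, field, and threshold here is an illustrative assumption, not a real MarsCompute interface.

```python
# Toy composition of the three layers. Shapes and thresholds are invented
# for illustration only.

def discover_architecture() -> dict:
    """Layer 1: probe hardware topology and architecture constraints."""
    return {"device": "gpu0", "sram_kb": 228, "warp_size": 32}

def plan_optimizations(arch: dict, profile: dict) -> list[str]:
    """Layer 2: turn architecture + live profile into an ordered plan."""
    plan = []
    if profile["occupancy"] < 0.6:
        plan.append(f"retile for {arch['sram_kb']} KB shared memory")
    if profile["small_batch_latency_ms"] > 5.0:
        plan.append("specialize decode path for batch size 1")
    return plan

def execute_and_adapt(plan: list[str], profile: dict) -> dict:
    """Layer 3: apply the plan, then feed fresh measurements back."""
    for step in plan:
        print("applying:", step)
    # In a real system, re-profiling here closes the adaptation loop.
    return {"occupancy": 0.72, "small_batch_latency_ms": 3.1}

profile = {"occupancy": 0.45, "small_batch_latency_ms": 8.2}
for _ in range(2):                       # continuous adaptation, abbreviated
    arch = discover_architecture()
    profile = execute_and_adapt(plan_optimizations(arch, profile), profile)
```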
Who This Is For
MarsCompute is built for teams managing infrastructure-critical AI performance at system scale.
Built for
- Hyperscalers operating multi-architecture AI fleets
- AI infrastructure and platform engineering teams
- GPU and accelerator vendors scaling production adoption
- Advanced model serving teams focused on latency and efficiency
Not designed for
- Casual developers
- Hobby ML users
- Non-production experimentation stacks
Long-Term Vision
Execution OS for heterogeneous AI hardware.
In five years, multi-architecture AI infrastructure will be the default operating condition. Manual optimization workflows will not scale, and static compilers alone will not keep pace with workload and silicon change. MarsCompute is building the adaptive intelligence layer between models and hardware.
Team
Led by Prof. Bingsheng He (National University of Singapore). We are a focused team of experts in compilers, kernels, and AI systems.
Mars began as a pioneering project on GPU acceleration led by Prof. He in 2007, the year CUDA was born. You can read more about the original work here.
Blog
Follow the MarsCompute weekly release notes, engineering updates, and technical posts here:
FA4 Forward Kernel Optimization Report
Author: Zhuobin Huang @ mars-compute | Date: March 9, 2026
Read Post

Flash Attention Kernel Optimization Results with `mars-compute`
Author: Zhuobin Huang @ mars-compute | Date: March 1, 2026
Read Post

Building the Adaptive Optimization Layer for AI Infrastructure
Contact
Use the form to start a technical discussion, submit a partnership inquiry, or apply as a design partner.
We work with teams operating production AI systems where architecture-aware optimization is infrastructure-critical.
- Technical architecture reviews
- Partnership and ecosystem collaboration
- Design partner program participation