AI-native workflow orchestration

Run workflows across
any cluster - intelligently

The standardised orchestration platform for HPC. Define workflows as code, run them across any cluster, and let AI ensure your workloads complete successfully. Focus on research, not infrastructure.

Request Early Access · Join Discord
$ expanse run workflow.yaml
✓ Authenticated as eren@imperial.ac.uk
⚡ Analysing job requirements...
⚠ Memory estimate: 189GB peak — consider requesting 256GB
✓ No failure risks detected for this configuration
Submitting workflow 'training-pipeline' to cluster 'archer2'...
✓ Job scheduled! ID: job-1734012847
[stage 1/3] preprocess (cluster: local)
[stage 2/3] train (cluster: archer2)
[stage 3/3] evaluate (cluster: local)
✓ Job completed successfully
The Problem

HPC is broken

💥
Jobs fail unpredictably
OOM errors, walltime exceeded, silent crashes — hours of compute wasted
🔥
Resources get wasted
Over-provisioning "just to be safe" burns through allocations
😤
Every cluster is different
New job scripts, new schedulers, new quirks — for every system

Expanse fixes this.
One workflow definition. Any cluster. ML that predicts failures before they happen.

Features

Built for modern HPC

Expanse standardises how you define, share, and run computational workflows across any infrastructure — from your laptop to the world's largest supercomputers.

📐

Standardised orchestration

Define nodes as code functions with simple configuration files. No more custom job scripts — workflows are declarative, version-controlled, and portable across clusters.

🌐

Cross-language data transfer

Data flows seamlessly from Python preprocessing on your laptop to Fortran simulations on a supercomputer and back to C analysis — automatically. Expanse uses a zero-copy transfer protocol for maximum efficiency.

🧠

Smart scheduling decisions

State-of-the-art ML models predict resource needs and failure risks before your workload runs. Get suggestions to prevent crashes and optimise allocations — so you can focus on research, not debugging.

📚

Node repository

Share and discover reusable computational nodes through a centralised registry. Eliminate version mismatches and scattered code. Make HPC accessible to everyone with reproducible, verified components.

🔒

Governance & compliance

Complete audit trails track who ran what, when, and with which changes. Built for regulated industries requiring compliance with FDA 21 CFR Part 11, MiFID II, and other standards. Administrators get full visibility into usage patterns and modifications.

Multi-cluster execution

Run different stages on local machines, SLURM clusters, or Ray — all within a single workflow. Expanse handles the complexity of heterogeneous compute environments.

Intelligent workload optimisation

Expanse uses machine learning to analyse your workflows before execution, predicting resource requirements and potential failures. Get actionable suggestions to ensure your workloads run end-to-end without crashes.

🎯

Failure prediction AI

Before execution, Expanse analyses your code, data, and cluster state to predict out-of-memory errors, walltime overruns, and other failure modes. Get warnings and recommendations to prevent crashes.

📊

Resource optimisation AI

ML models analyse workload patterns to suggest optimal CPU, memory, and walltime allocations. Reduce wasted resources and queue wait times with data-driven recommendations.

Smart scheduling AI

Predict queue wait times and suggest alternative clusters or resource configurations. Make informed decisions about when and where to run your workloads for maximum efficiency.

Pre-flight analysis
Memory Usage · Needs attention
Predicted peak: 189GB · Requested: 128GB
High risk of OOM at ~47 minutes into training.
Walltime · On track
Predicted runtime: 18h 34m · Requested: 24h
Comfortable margin. Job should complete successfully.
Queue Wait · Estimated
Expected wait: 2h 15m on archer2 standard queue.
Jobs like this typically start within 1-4 hours.
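
To make the memory recommendation concrete, here is a minimal sketch of how the flagged train stage might be adjusted in the workflow file. It reuses the fields from the workflow.yaml example further down this page; the 256GB figure follows the suggestion shown in the terminal output above, and the exact layout is illustrative rather than a prescribed schema.

# workflow.yaml excerpt (illustrative): raising memory for the flagged train stage
  - name: train
    node: ./nodes/train
    cluster: archer2
    resources:
      cpu: 128
      memory: 256GB    # was 128GB; pre-flight predicted a 189GB peak
      walltime: 24h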

Nodes as code, workflows as config

Define nodes as reusable code functions with simple YAML configuration. Compose workflows by referencing nodes — each can target different clusters, and data flows between them automatically with zero-copy efficiency.

local · slurm · ray · archer2 · cirrus
workflow.yaml
name: training-pipeline
stages:
  - name: preprocess
    node: ./nodes/preprocess
    cluster: local
  - name: train
    node: ./nodes/train
    cluster: archer2
    resources:
      cpu: 128
      memory: 256GB
      walltime: 24h
  - name: evaluate
    node: ./nodes/evaluate
    cluster: local
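
To illustrate what a node referenced above might look like on disk, here is a minimal sketch of a manifest for ./nodes/preprocess. The schema shown (entrypoint, runtime, inputs, outputs) is an assumption made for this sketch, not Expanse's published format.

# ./nodes/preprocess/node.yaml (hypothetical manifest; field names are assumed)
name: preprocess
entrypoint: preprocess.py:run      # the code function this node wraps
runtime:
  language: python
  version: "3.11"
inputs:
  - name: raw_data
    type: dataset
outputs:
  - name: features
    type: dataset                  # handed to the next stage automatically

In this sketch the workflow only needs the node's path and a target cluster; everything else the node requires is declared alongside its code, which is what makes nodes shareable through a registry and portable across clusters.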

Standardise your HPC workflows

Join researchers using Expanse to make computational science more reproducible, shareable, and efficient.

Request Early Access · Read the docs