ML training reliability platform

Deterministic reliability for large-scale AI training

Denpex isolates hardware degradation, resolves collective-communication deadlocks, and prevents silent data corruption — automatically. Maximize cluster utilization and take your engineers out of log archaeology.

The console runs on simulated failures — nothing to install, no logs required.

denpex console — llama3-70b-finetune-run-47 diagnosed

$ denpex diagnose --last-run

parsing 64-rank log stream… done

SYMPTOM  NCCL watchdog timeout (ALLREDUCE #184729)

CAUSE    rank 42 · node-05 · Xid 48 double-bit ECC error

CLASS    HARDWARE — route to infra, not ML

# fix: drain node, relaunch with elastic restarts

scontrol update nodename=node-05 state=drain

torchrun --max-restarts=3 train.py --resume latest

root cause in seconds · resume from step 84000 (verified)

39 + 1

failure classes, plus AI fallback for novel errors

Seconds

to root cause — not days of cross-rank grep

64 ranks

cascade-traced to the first failed GPU

Your VPC

on-prem agent: logs never leave your cluster

Your logs say "NCCL timeout." The timeline says rank 17.

Distributed timeline reconstruction orders every event across every rank — clock-drift corrected — so the cascade reads in causal order, not log order. The watchdog is the last thing that happened, never the first.

t+0 msrank 17ECC uncorrectable error — the true initiatorROOT CAUSE
t+340 msrank 4stalls waiting on the wedged collective
t+600 sall ranksNCCL watchdog times out — the only line your logs showed you

The complete reliability surface

Diagnosis is the entry point. The platform covers the whole failure lifecycle — before launch, during training, and after the fix ships.

Twenty-five failure classes diagnosed deterministically, with AI fallback analysis for anything novel. Every diagnosis ends in a prescriptive fix — the exact command or config change, not an essay.

Root cause analysis

The originating fault — never the symptom that woke you up

First-failed-rank detection

Which rank failed first, isolated across the whole fleet

Distributed rank correlation

Cross-rank telemetry stitched into one causal picture

Cascade failure analysis

How one bad GPU took 63 healthy ranks down with it

NCCL timeout diagnosis

The initiator behind the watchdog's generic timeout

CUDA OOM diagnosis + memory attribution

Not "CUDA OOM" — the exact tensor that caused the OOM

Memory fragmentation diagnosis

Reserved-but-unallocated signatures, allocator-level fixes

Gradient explosion diagnosis

Norm spikes traced back to layer and step

NaN loss diagnosis

The propagation source, not just the first poisoned batch

Weight divergence diagnosis

Drift measured against your own healthy baselines

Silent hang diagnosis

Heartbeat detection for jobs that die without a stack trace

Device assert diagnosis

Device-side asserts mapped to the offending operation

Checkpoint corruption diagnosis

Torn writes and truncated shards caught before resume

Import error diagnosis

Environment faults separated from training faults

Version mismatch diagnosis

PyTorch × CUDA × cuDNN conflicts flagged precisely

Disk full diagnosis

Storage exhaustion before it masquerades as a framework crash

AI fallback analysis

Unknown failures get deep analysis on masked excerpts — never a shrug

Prescriptive fixes

Copy-paste resolution paths, verified against the failure class

Resume checkpoint recommendations

The last verified-good step to restart from

Hardware vs software classification

"Infra issue" or "ML issue" — instantly, so the right team moves

Architecture your security team can say yes to

No paste-your-logs surprises. The agent is transparent about what it touches, what it masks, and what — if anything — leaves your cluster.

Request the full architecture brief
01 · your boundary

On-prem agent

A single file, stdlib only, running in user space — no root, no kernel module. It wraps your training command, heartbeats every 30s, and ships the last 500 log lines on a non-zero exit. PII/PHI masking runs client-side, before anything leaves your VPC.

02 · pattern-first, AI-last

Deterministic engine

A pattern corpus over 39 failure classes plus clock-drift-corrected timeline reconstruction does the work — deterministically. The AI fallback only sees masked excerpts of novel failures, and your logs are never used to train anything.

03 · one incident, one owner

Routed resolution

Ownership mapping sends one correlated incident — root cause, classification, and the exact fix — to the engineer who owns the job, on Slack, PagerDuty, SMS, iMessage or webhook. Hardware issues route to infra; ML issues route to research.

What do failures cost your cluster?

Cost intelligence translates every incident into GPU hours wasted, money wasted and engineering time lost — the same math, fleet-wide, lives in your dashboard. Run your numbers, not ours.

Your cluster wastes this much every month
$16,537
12 failures × 64 H100 SXM5 (80GB) × 3 hrs avg
Cluster size
GPU type
Training failures / month12
Meta Llama 3 averaged 7.7 failures/day on 16k GPUs
Hours lost per failure3 hrs
Average 34.7 hrs at enterprise scale (Huawei/Platform-X FSE 2025). Estimate conservatively for your team.
Compute cost wasted
$5,737
2,304 GPU-hours/mo
Engineering time wasted
$10,800
2 engineers @ $150/hr
Most popular
Team
$499/mo
up to 64 GPUs
Monthly savings
$13,557
Pays back in 2 days
Scale
$2,499/mo
up to 512 GPUs
Monthly savings
$11,557
Pays back in 6 days
Data Center
$9,999/mo
unlimited GPUs
Monthly savings
$4,057
Pays back in 22 days
Stop losing GPU budget to failures you can't diagnose.
No credit card. No setup. Diagnosis in under 12 seconds.
89.9% of failures require 3+ hrs (Huawei Cloud 2025) · Average 34.7 hrs at enterprise scale (FSE 2025) · ByteDance FALCON study

Priced against your GPU bill, not your seat count

A single failure on a 64-GPU cluster wastes ~$847 in compute and an afternoon of engineering time. Every plan starts free, no credit card needed.

Free

$0/ month

Prove it on your last failure. No credit card.

  • 3 diagnoses per month
  • Paste logs in the web UI, nothing to install
  • 15 most common failure types
  • Root cause + exact fix, not an essay
  • 7-day history
  • 1 seat

Your last NCCL timeout, diagnosed in 12 seconds.

Most popular

Team

$499/ month

For teams training on up to 64 GPUs.

  • Everything in Free, plus:
  • Unlimited diagnoses
  • All 39 failure types + AI fallback for novel errors
  • Up to 64 GPUs monitored
  • Multi-rank cascade analysis: which rank failed first
  • Slack, email + iMessage/SMS alerts
  • Cross-run comparison (last 5 runs)
  • Team knowledge base: every confirmed fix, shared
  • 5 seats
  • 90-day history

$847 avg recovered per incident. Pays for itself on the first failure.

Scale

$2,499/ month

For training infrastructure up to 512 GPUs.

  • Everything in Team, plus:
  • On-premise agent: logs never leave your cluster
  • Silent data corruption (SDC) detection
  • Straggler + gray failure detection
  • Zombie process detection + auto-kill
  • Per-layer checkpoint weight delta analysis
  • Checkpoint integrity validation before you resume
  • PyTorch × CUDA × cuDNN compatibility database
  • PagerDuty, webhooks + custom alert routing
  • Cross-run comparison (unlimited history)
  • Unlimited seats · 1-year history
  • Pre-flight cluster validation: block doomed runs before launch

One caught SDC incident saves a multi-day training run.

Data Center

$9,999/ month

For GPU clouds and enterprise data centers. Custom contracts available.

  • Everything in Scale, plus:
  • Unlimited GPUs, multi-tenant deployment
  • White-label / OEM: offer diagnosis to your customers
  • SLURM, Ray + Kubernetes scheduler integration
  • Predictive failure scoring (early access)
  • Auto-remediation engine (early access)
  • Configurable PII/PHI log masking
  • Custom knowledge base ingestion
  • 99.9% uptime SLA with credits
  • SOC 2 Type II (in progress) · GDPR DPA · HIPAA BAA
  • Dedicated Customer Success Manager
  • Custom contracts, invoicing + procurement

Turn 'why did my job fail' tickets into self-service diagnoses.

Logs deleted after diagnosis · On-premise agent on Scale+ · Upgrade or cancel anytime

From diagnosis to autonomy

Diagnosis closes the loop on understanding. The autonomy layer closes the loop on recovery — available today in early access on the Data Center tier.

Early access

Predictive failure scoring

Telemetry models flag deteriorating GPUs and nodes before the crash, so jobs migrate instead of dying.

Early access

Automated node cordoning

A bad GPU is marked unhealthy and removed from the scheduler automatically — it never takes down a second run.

Early access

Automated checkpoint rollback

On failure: find the last good checkpoint, validate it, resume training. No human in the loop.

Early access

Auto-remediation engine

Detect → cordon → roll back → resume. Autonomous recovery that closes the loop end to end.

Security & compliance posture

Encryption in transit + at restPII / PHI masking before egressOn-prem & air-gapped deploymentRetention + purge controlsRBAC + SSO (SAML/OIDC)GDPR DPA · HIPAA BAA · SOC 2 Type II in progress

We label compliance honestly: SOC 2 Type II is in progress, not claimed. Ask for the current audit status in your architecture brief.

Frequently asked questions

Your next failure is already scheduled.

The only question is whether it costs you twelve hours of grep — or twelve seconds of Denpex.