The ML failure diagnosis platform

A deterministic pipeline that collects logs from all ranks, correlates failure events using clock-drift-corrected ordering, and outputs the exact root cause and fix in under 12 seconds.
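
To make "clock-drift-corrected ordering" concrete, here is a minimal sketch of the idea: subtract an estimated per-rank clock offset from each event's local timestamp, then sort. The `Event` type, the `order_events` function, and the offset values are hypothetical illustrations, not Denpex's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    rank: int
    timestamp: float  # seconds, as reported by that rank's local clock
    message: str


def order_events(events, clock_offsets):
    """Sort events by local timestamp minus the rank's estimated clock offset."""
    def corrected(event):
        # Ranks with no known offset are assumed to be in sync (offset 0.0).
        return event.timestamp - clock_offsets.get(event.rank, 0.0)
    return sorted(events, key=corrected)


events = [
    Event(rank=0, timestamp=100.0, message="NCCL timeout"),
    Event(rank=17, timestamp=100.5, message="out of memory"),
]
# Hypothetical offsets: rank 17's clock runs 0.8 s ahead of the reference clock.
offsets = {0: 0.0, 17: 0.8}

ordered = order_events(events, offsets)
first = ordered[0]
```

By raw timestamps, rank 0 appears to fail first. After correction, rank 17's event moves to 99.7 s and sorts ahead of rank 0's, which is the kind of reordering that changes which rank looks like the origin of a failure.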

How Denpex diagnoses failures in under 12 seconds

Paste your logs. Get a diagnosis. Fix the problem.

01

Paste your logs

Copy the error output from your training job and paste it into the diagnosis box. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl.

Paste your training output:
[ERROR] NCCL timeout on ranks 0-63
Rank 17: out of memory
Checkpoint failed to save
✓ Diagnosis: 11 seconds
02

Get instant diagnosis

Denpex pattern-matches your logs against known failure types. For common issues like CUDA OOM, NCCL timeout, gradient explosion, and checkpoint corruption, you get an instant match with the root cause and fix.

Pattern match found:
Rank 17: OOM at step 8,432
Root cause: memory fragmentation
Fix: PYTORCH_CUDA_ALLOC_CONF
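
For reference, a fix of this kind is usually applied by setting the environment variable before PyTorch makes its first CUDA allocation. The sketch below assumes the `expandable_segments:True` option, which is one documented setting aimed at fragmentation. The value a real diagnosis recommends may differ, so treat it as an illustration only.

```python
import os

# Set the allocator config before the first CUDA allocation; the safest place
# is before `import torch`. The value is an assumed example, not a
# recommendation taken from an actual diagnosis.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

allocator_conf = os.environ["PYTORCH_CUDA_ALLOC_CONF"]
```

`setdefault` leaves any value already exported by your launcher untouched, so a cluster-wide setting is not silently overridden.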
03

39 failure types covered

CUDA OOM, memory fragmentation, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, silent hangs, and more. Each with a prescriptive fix. 99.7% accuracy on known patterns.

CUDA_OOM, OOM_FRAGMENTATION, NCCL_TIMEOUT, GRADIENT_EXPLOSION, CHECKPOINT_CORRUPTION, IMPORT_ERROR, VERSION_MISMATCH, DEVICE_ASSERT, SILENT_HANG, NAN_LOSS, DISK_FULL, WEIGHT_DIVERGENCE
12 of 39 failure types, each with a prescriptive fix
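
As a rough illustration of matching logs against labels like these, the sketch below checks a log against a small table of regular expressions. The regexes and the `classify` helper are assumptions made for this example. Only the label names come from the list above, and Denpex's real patterns are not shown here.

```python
import re

# Hypothetical pattern table: label names are from the list above,
# the regular expressions are illustrative guesses.
PATTERNS = [
    ("CUDA_OOM", re.compile(r"out of memory", re.IGNORECASE)),
    ("NCCL_TIMEOUT", re.compile(r"NCCL.*timeout", re.IGNORECASE)),
    ("NAN_LOSS", re.compile(r"loss\s*(=|:|is)\s*nan", re.IGNORECASE)),
    ("DISK_FULL", re.compile(r"No space left on device", re.IGNORECASE)),
]


def classify(log_text):
    """Return the labels of all patterns that match, in table order."""
    return [label for label, pattern in PATTERNS if pattern.search(log_text)]


sample = """[ERROR] NCCL timeout on ranks 0-63
Rank 17: out of memory
Checkpoint failed to save"""

labels = classify(sample)
```

A single log often matches several labels at once, as the sample does. Deciding which match is the root cause and which are downstream symptoms is a separate step that this sketch does not attempt.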
04

Unknown failures get AI analysis

If your failure doesn't match a known pattern, Denpex uses AI to analyze the logs and suggest what happened. You always get a next step, even for novel errors.

AI Analysis
Novel failure detected

This error pattern doesn't match known failures. AI analysis suggests checking memory allocator configuration and batch size settings.

Confidence: Lower, verify suggestions manually
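
The two-tier behavior described here can be sketched as a simple dispatch: return a known fix with high confidence when a signature matches, otherwise fall back to an analysis step and flag the result as lower confidence. The function names, the dictionary shape, and the placeholder analysis are all hypothetical, and the fallback below is a stub, not a real model call.

```python
def diagnose(log_text, known_fixes, fallback):
    """Return a known fix when a signature matches, else a flagged fallback."""
    lowered = log_text.lower()
    for signature, fix in known_fixes.items():
        if signature.lower() in lowered:
            return {"source": "pattern", "confidence": "high", "suggestion": fix}
    # No known signature matched: defer to the fallback and mark it clearly.
    return {"source": "ai", "confidence": "lower", "suggestion": fallback(log_text)}


def placeholder_analysis(log_text):
    # Stand-in for an AI analysis step; returns a fixed suggestion.
    return "Check memory allocator configuration and batch size settings."


known = {"out of memory": "Set PYTORCH_CUDA_ALLOC_CONF"}

hit = diagnose("Rank 17: out of memory", known, placeholder_analysis)
miss = diagnose("unexpected barrier mismatch", known, placeholder_analysis)
```

Carrying the `source` and `confidence` fields with every result keeps a fallback suggestion from being mistaken for a verified pattern match.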

One line. No config. Works on your next failure.

train.py
import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"]  # optional
)

# The rest of your training code is unchanged
trainer.train()

Monitoring tells you the job died. Denpex tells you why.

W&B and Grafana are dashboards. ChatGPT has no idea what rank 42 was doing. Denpex is the only tool built specifically to diagnose distributed GPU training failures and hand you the fix.

Diagnose from paste-only logs

No agent or integration needed. Paste your error output, get a diagnosis. Works with any framework: PyTorch, DeepSpeed, Megatron, Axolotl, whatever you're using.

Prescriptive fixes, not just error codes

Denpex doesn't just tell you what broke. It tells you how to fix it. Every diagnosis ends in a specific env var to set, config to change, or checkpoint to resume from. No essays. Just the fix.

39 failure types, 99.7% accuracy

CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, stragglers, zombie processes, weight delta anomalies, and more, all with exact pattern matching and known fixes.

Instant diagnosis, no waiting

Pattern matching runs in seconds. No AI hallucination risk for known failures. Novel errors get AI analysis with confidence scores so you know how reliable the suggestion is.

Capability | Manual debugging | W&B / Grafana | Generic LLM chat | Denpex
Tells you which rank failed first | 3-34 hrs of log diving | | ✕ no cluster context | ✓ 11.3s avg
Root cause, not symptom dashboards | Eventually | ✕ shows metrics, not causes | Guesses | ✓ 99.7% on known types
Exact fix: env var, config, checkpoint | If you find it | | Unverified suggestions | ✓ prescriptive, every time
Remembers your cluster's failure history | Tribal knowledge | | | ✓ team knowledge base
Works at 2am with zero setup | ✕ you are the setup | ✕ needs instrumentation | | ✓ paste logs, done