The ML failure diagnosis platform
A deterministic pipeline that collects logs from all ranks, correlates failure events using clock-drift-corrected ordering, and outputs the exact root cause and fix in under 12 seconds.
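To make the idea of clock-drift-corrected ordering concrete, here is a minimal sketch, not Denpex's actual implementation. It assumes a per-rank clock offset has already been estimated somehow (for example from timestamps exchanged at a synchronization point); the `Event` type, the `offsets` mapping, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    rank: int
    local_ts: float  # seconds, as stamped by the rank's own clock
    message: str


def order_events(events, offsets):
    """Sort events by drift-corrected time.

    offsets[rank] is the assumed number of seconds by which that rank's
    clock runs ahead of a shared reference clock. Ties are broken by rank
    so the ordering stays deterministic.
    """
    return sorted(
        events,
        key=lambda e: (e.local_ts - offsets.get(e.rank, 0.0), e.rank),
    )


def first_failure(events, offsets):
    """Return the event that happened first after correction, or None."""
    ordered = order_events(events, offsets)
    return ordered[0] if ordered else None


# Hypothetical failure events collected from three ranks.
events = [
    Event(rank=0, local_ts=100.50, message="collective timeout"),
    Event(rank=42, local_ts=100.90, message="out of memory"),
    Event(rank=7, local_ts=100.70, message="collective timeout"),
]
# Hypothetical offsets: rank 42's clock runs 0.6 s ahead of the reference.
offsets = {0: 0.0, 7: 0.05, 42: 0.6}

first = first_failure(events, offsets)
```

By raw timestamps rank 0 appears to fail first. After subtracting the offsets, rank 42's event is the earliest, which is the kind of reordering that separates a root cause from the timeouts it triggers on other ranks.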
How Denpex diagnoses failures in under 12 seconds
Paste your logs. Get a diagnosis. Fix the problem.
Paste your logs
Copy the error output from your training job. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl. Paste it into the diagnosis box.
Get instant diagnosis
Denpex pattern-matches your logs against known failure types. For common issues like CUDA OOM, NCCL timeout, gradient explosion, and checkpoint corruption, you get an instant match with the root cause and fix.
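As a rough illustration of pattern matching against known failure types, the sketch below scans a log for a few well-known PyTorch error strings and returns a suggested fix. The rule table, the `Diagnosis` type, and the suggested fixes are assumptions made for this example; they are not Denpex's real catalog.

```python
import re
from typing import NamedTuple, Optional


class Diagnosis(NamedTuple):
    failure_type: str
    suggestion: str


# Illustrative rules only; a real catalog would be far larger.
RULES = [
    (
        re.compile(r"CUDA out of memory", re.IGNORECASE),
        Diagnosis(
            "cuda_oom",
            "Reduce the per-device batch size or enable gradient checkpointing.",
        ),
    ),
    (
        re.compile(r"Watchdog caught collective operation timeout"),
        Diagnosis(
            "nccl_timeout",
            "Find the rank that stalled first before raising the collective timeout.",
        ),
    ),
    (
        re.compile(r"device-side assert triggered"),
        Diagnosis(
            "device_assert",
            "Rerun with CUDA_LAUNCH_BLOCKING=1 to locate the failing kernel.",
        ),
    ),
]


def diagnose(log_text: str) -> Optional[Diagnosis]:
    """Return the diagnosis for the earliest matching log line, or None."""
    for line in log_text.splitlines():
        for pattern, diagnosis in RULES:
            if pattern.search(line):
                return diagnosis
    return None


# Abbreviated example log, for illustration only.
sample_log = """\
step 1200 loss 2.31
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
"""
result = diagnose(sample_log)
```

Scanning line by line means the earliest matching line wins, which matters when a single failure cascades into several different error messages later in the log.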
39 failure types covered
CUDA OOM, memory fragmentation, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, silent hangs, and more. Each with a prescriptive fix. 99.7% accuracy on known patterns.
Unknown failures get AI analysis
If your failure doesn't match a known pattern, Denpex uses AI to analyze and suggest what happened. Always get a next step, even for novel errors.
This error pattern doesn't match known failures. AI analysis suggests checking memory allocator configuration and batch size settings.
One line. No config. Works on your next failure.
import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"]  # optional
)

# The rest of your training code is unchanged
trainer.train()
Monitoring tells you the job died. Denpex tells you why.
W&B and Grafana are dashboards. ChatGPT has no idea what rank 42 was doing. Denpex is the only tool built specifically to diagnose distributed GPU training failures and hand you the fix.
Diagnose from paste-only logs
You don't need an agent or integration. Paste your error output and get a diagnosis. Works with any framework: PyTorch, DeepSpeed, Megatron, Axolotl, or whatever else you're using.
Prescriptive fixes, not just error codes
Denpex doesn't just tell you what broke. It tells you how to fix it. Every diagnosis ends in a specific env var to set, config to change, or checkpoint to resume from. No essays. Just the fix.
39 failure types, 99.7% accuracy
CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, stragglers, zombie processes, weight delta anomalies, and more, all with exact pattern matching and known fixes.
Instant diagnosis, no waiting
Pattern matching runs in seconds. No AI hallucination risk for known failures. Novel errors get AI analysis with confidence scores so you know how reliable the suggestion is.
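One way to picture the split between known and novel failures is a small router: try the deterministic matcher first, and only fall back to model-based analysis when nothing matches. This is a sketch under assumptions, not Denpex's code. The `Result` type, the callables passed to `route`, and the stub confidence value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Result:
    source: str                  # "pattern" or "ai"
    summary: str
    confidence: Optional[float]  # only set for AI-generated suggestions


def route(
    log_text: str,
    match_known: Callable[[str], Optional[str]],
    analyze_novel: Callable[[str], Tuple[str, float]],
) -> Result:
    """Prefer a deterministic pattern match; fall back to AI analysis."""
    known = match_known(log_text)
    if known is not None:
        return Result(source="pattern", summary=known, confidence=None)
    summary, confidence = analyze_novel(log_text)
    # Clamp so downstream code can rely on a 0.0 to 1.0 range.
    confidence = min(max(confidence, 0.0), 1.0)
    return Result(source="ai", summary=summary, confidence=confidence)


# Stub implementations, for illustration only.
def match_known(text: str) -> Optional[str]:
    return "CUDA OOM" if "CUDA out of memory" in text else None


def analyze_novel(text: str) -> Tuple[str, float]:
    # A real system would call a model here; 0.4 is a placeholder score.
    return ("Check memory allocator configuration and batch size settings.", 0.4)


known_result = route("RuntimeError: CUDA out of memory.", match_known, analyze_novel)
novel_result = route("unexpected allocator state", match_known, analyze_novel)
```

Keeping the source of each result explicit lets a caller treat a pattern match as authoritative and present an AI suggestion alongside its confidence score.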
| Capability | Manual debugging | W&B / Grafana | Generic LLM chat | Denpex |
|---|---|---|---|---|
| Tells you which rank failed first | 3-34 hrs of log diving | ✕ | ✕ no cluster context | ✓ 11.3s avg |
| Root cause, not symptom dashboards | Eventually | ✕ shows metrics, not causes | Guesses | ✓ 99.7% on known types |
| Exact fix: env var, config, checkpoint | If you find it | ✕ | Unverified suggestions | ✓ prescriptive, every time |
| Remembers your cluster's failure history | Tribal knowledge | ✕ | ✕ | ✓ team knowledge base |
| Works at 2am with zero setup | ✕ you are the setup | ✕ needs instrumentation | ✓ | ✓ paste logs, done |