Deterministic reliability for large-scale AI training
Denpex isolates hardware degradation, resolves collective-communication deadlocks, and prevents silent data corruption — automatically. Maximize cluster utilization and take your engineers out of log archaeology.
The console runs on simulated failures — nothing to install, no logs required.
$ denpex diagnose --last-run
parsing 64-rank log stream… done
SYMPTOM NCCL watchdog timeout (ALLREDUCE #184729)
CAUSE rank 42 · node-05 · Xid 48 double-bit ECC error
CLASS HARDWARE — route to infra, not ML
# fix: drain node, relaunch with elastic restarts
scontrol update nodename=node-05 state=drain reason="Xid 48 double-bit ECC error"
torchrun --max-restarts=3 train.py --resume latest
root cause in seconds · resume from step 84000 (verified)
39 + 1
failure classes, plus AI fallback for novel errors
Seconds
to root cause — not days of cross-rank grep
64 ranks
cascade-traced to the first failed GPU
Your VPC
on-prem agent: logs never leave your cluster
Your logs say "NCCL timeout." The timeline says rank 17.
Distributed timeline reconstruction orders every event across every rank — clock-drift corrected — so the cascade reads in causal order, not log order. The watchdog is the last thing that happened, never the first.
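To make the idea concrete, here is a minimal sketch of drift-corrected ordering, not Denpex's actual engine. It assumes each rank's clock offset is already known; the `Event` type, the `clock_offsets` mapping, and the sample messages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    rank: int
    local_ts: float  # seconds, as stamped by the rank's own clock
    message: str


def reconstruct_timeline(events, clock_offsets):
    """Sort events by drift-corrected time.

    clock_offsets maps rank -> seconds that rank's clock runs ahead of the
    reference clock (assumed input; measuring it is out of scope here).
    """
    def corrected(event):
        return event.local_ts - clock_offsets.get(event.rank, 0.0)

    return sorted(events, key=corrected)


def first_failed_rank(events, clock_offsets, is_failure):
    """Return the rank of the earliest failure event in corrected order."""
    for event in reconstruct_timeline(events, clock_offsets):
        if is_failure(event):
            return event.rank
    return None


# Illustrative data: rank 17's clock runs 2 s fast, so its error looks
# later than the watchdog timeouts in raw log order.
events = [
    Event(rank=3, local_ts=100.0, message="watchdog timeout"),
    Event(rank=17, local_ts=101.5, message="ECC error"),
    Event(rank=9, local_ts=100.4, message="watchdog timeout"),
]
offsets = {17: 2.0}
```

Sorted by raw timestamp, rank 3's timeout comes first. After correction, rank 17's error lands at 99.5 s and leads the cascade.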
The complete reliability surface
Diagnosis is the entry point. The platform covers the whole failure lifecycle — before launch, during training, and after the fix ships.
Thirty-nine failure classes diagnosed deterministically, with AI fallback analysis for anything novel. Every diagnosis ends in a prescriptive fix: the exact command or config change, not an essay.
Root cause analysis
The originating fault — never the symptom that woke you up
First-failed-rank detection
Which rank failed first, isolated across the whole fleet
Distributed rank correlation
Cross-rank telemetry stitched into one causal picture
Cascade failure analysis
How one bad GPU took 63 healthy ranks down with it
NCCL timeout diagnosis
The initiator behind the watchdog's generic timeout
CUDA OOM diagnosis + memory attribution
Not just "CUDA OOM": the exact tensor allocation that triggered it
Memory fragmentation diagnosis
Reserved-but-unallocated signatures, allocator-level fixes
Gradient explosion diagnosis
Norm spikes traced back to layer and step
NaN loss diagnosis
The propagation source, not just the first poisoned batch
Weight divergence diagnosis
Drift measured against your own healthy baselines
Silent hang diagnosis
Heartbeat detection for jobs that die without a stack trace
Device assert diagnosis
Device-side asserts mapped to the offending operation
Checkpoint corruption diagnosis
Torn writes and truncated shards caught before resume
Import error diagnosis
Environment faults separated from training faults
Version mismatch diagnosis
PyTorch × CUDA × cuDNN conflicts flagged precisely
Disk full diagnosis
Storage exhaustion before it masquerades as a framework crash
AI fallback analysis
Unknown failures get deep analysis on masked excerpts — never a shrug
Prescriptive fixes
Copy-paste resolution paths, verified against the failure class
Resume checkpoint recommendations
The last verified-good step to restart from
Hardware vs software classification
"Infra issue" or "ML issue" — instantly, so the right team moves
Architecture your security team can say yes to
No paste-your-logs surprises. The agent is transparent about what it touches, what it masks, and what — if anything — leaves your cluster.
On-prem agent
A single file, stdlib only, running in user space — no root, no kernel module. It wraps your training command, heartbeats every 30s, and ships the last 500 log lines on a non-zero exit. PII/PHI masking runs client-side, before anything leaves your VPC.
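The wrap, heartbeat, and ship-on-failure behavior described above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the shipped agent: `run_with_agent` and its two callbacks are hypothetical names, and masking is left out.

```python
import collections
import subprocess
import sys
import threading


def run_with_agent(cmd, on_heartbeat, on_failure, heartbeat_s=30.0, tail_lines=500):
    """Run cmd, heartbeat while it is alive, report the log tail on failure."""
    tail = collections.deque(maxlen=tail_lines)  # keeps only the newest lines
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    done = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once the job is done.
        while not done.wait(heartbeat_s):
            on_heartbeat(proc.pid)

    beater = threading.Thread(target=beat, daemon=True)
    beater.start()

    for line in proc.stdout:
        sys.stdout.write(line)  # pass the training output through unchanged
        tail.append(line.rstrip("\n"))

    code = proc.wait()
    done.set()
    beater.join()

    if code != 0:
        on_failure(code, list(tail))  # hypothetical hook: mask, then ship
    return code
```

A call such as `run_with_agent(["torchrun", "train.py"], on_heartbeat=..., on_failure=...)` would wrap a launch. In this sketch a silent hang would have to be detected on the receiving side, by noticing that heartbeats continue while log output has stopped.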
Deterministic engine
A pattern corpus over 39 failure classes plus clock-drift-corrected timeline reconstruction does the work — deterministically. The AI fallback only sees masked excerpts of novel failures, and your logs are never used to train anything.
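As a rough picture of what a deterministic pattern corpus might look like, the sketch below matches log lines against an ordered list of regular expressions. The three entries, their class names, and the owner labels are assumptions for illustration; the real corpus and its 39 classes are not shown here.

```python
import re

# Hypothetical three-entry corpus. Order encodes priority: root-cause
# patterns come before symptom patterns, so an ECC fault outranks the
# watchdog timeout it eventually triggers.
PATTERNS = [
    ("xid_48_ecc", "HARDWARE", re.compile(r"Xid \([^)]*\): 48\b")),
    ("cuda_oom", "ML", re.compile(r"CUDA out of memory")),
    ("nccl_timeout", "SYMPTOM_ONLY", re.compile(r"collective operation timeout")),
]


def classify(log_lines):
    """Return (failure_class, owner) for the highest-priority matching pattern."""
    for name, owner, pattern in PATTERNS:
        if any(pattern.search(line) for line in log_lines):
            return name, owner
    # Nothing matched: this is where a fallback analysis would take over.
    return "unknown", "FALLBACK"
```

Because the result depends only on the patterns and their order, the same log always yields the same diagnosis.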
Routed resolution
Ownership mapping sends one correlated incident — root cause, classification, and the exact fix — to the engineer who owns the job, on Slack, PagerDuty, SMS, iMessage or webhook. Hardware issues route to infra; ML issues route to research.
What do failures cost your cluster?
Cost intelligence translates every incident into GPU hours wasted, money wasted and engineering time lost — the same math, fleet-wide, lives in your dashboard. Run your numbers, not ours.
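A back-of-the-envelope version of that math might look like the sketch below. The function name and every rate in the example are placeholders, not Denpex's pricing model or measured data.

```python
def incident_cost(gpus, hours_lost, gpu_hourly_usd,
                  engineers=0, eng_hours=0.0, eng_hourly_usd=0.0):
    """Estimate what one failed run cost in compute and engineering time."""
    gpu_hours = gpus * hours_lost
    compute_usd = gpu_hours * gpu_hourly_usd
    engineering_usd = engineers * eng_hours * eng_hourly_usd
    return {
        "gpu_hours": gpu_hours,
        "compute_usd": compute_usd,
        "engineering_usd": engineering_usd,
        "total_usd": compute_usd + engineering_usd,
    }


# Placeholder inputs: 64 GPUs idle for 4 hours at $2.50 per GPU hour,
# plus one engineer spending 4 hours at $100 per hour.
example = incident_cost(64, 4, 2.50, engineers=1, eng_hours=4, eng_hourly_usd=100)
```

With those placeholder inputs the example comes to 256 GPU hours, $640 of compute, and $400 of engineering time.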
Priced against your GPU bill, not your seat count
A single failure on a 64-GPU cluster wastes ~$847 in compute and an afternoon of engineering time. Every plan starts free, no credit card needed.
Free
Prove it on your last failure. No credit card.
- ✓3 diagnoses per month
- ✓Paste logs in the web UI, nothing to install
- ✓15 most common failure types
- ✓Root cause + exact fix, not an essay
- ✓7-day history
- ✓1 seat
Your last NCCL timeout, diagnosed in 12 seconds.
Team
For teams training on up to 64 GPUs.
- Everything in Free, plus:
- ✓Unlimited diagnoses
- ✓All 39 failure types + AI fallback for novel errors
- ✓Up to 64 GPUs monitored
- ✓Multi-rank cascade analysis: which rank failed first
- ✓Slack, email + iMessage/SMS alerts
- ✓Cross-run comparison (last 5 runs)
- ✓Team knowledge base: every confirmed fix, shared
- ✓5 seats
- ✓90-day history
$847 avg recovered per incident. Pays for itself on the first failure.
Scale
For training infrastructure up to 512 GPUs.
- Everything in Team, plus:
- ✓On-premise agent: logs never leave your cluster
- ✓Silent data corruption (SDC) detection
- ✓Straggler + gray failure detection
- ✓Zombie process detection + auto-kill
- ✓Per-layer checkpoint weight delta analysis
- ✓Checkpoint integrity validation before you resume
- ✓PyTorch × CUDA × cuDNN compatibility database
- ✓PagerDuty, webhooks + custom alert routing
- ✓Cross-run comparison (unlimited history)
- ✓Unlimited seats · 1-year history
- ✓Pre-flight cluster validation: block doomed runs before launch
One caught SDC incident saves a multi-day training run.
Data Center
For GPU clouds and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓Unlimited GPUs, multi-tenant deployment
- ✓White-label / OEM: offer diagnosis to your customers
- ✓SLURM, Ray + Kubernetes scheduler integration
- ✓Predictive failure scoring (early access)
- ✓Auto-remediation engine (early access)
- ✓Configurable PII/PHI log masking
- ✓Custom knowledge base ingestion
- ✓99.9% uptime SLA with credits
- ✓SOC 2 Type II (in progress) · GDPR DPA · HIPAA BAA
- ✓Dedicated Customer Success Manager
- ✓Custom contracts, invoicing + procurement
Turn 'why did my job fail' tickets into self-service diagnoses.
Logs deleted after diagnosis · On-premise agent on Scale+ · Upgrade or cancel anytime
From diagnosis to autonomy
Diagnosis closes the loop on understanding. The autonomy layer closes the loop on recovery — available today in early access on the Data Center tier.
Predictive failure scoring
Telemetry models flag deteriorating GPUs and nodes before the crash, so jobs migrate instead of dying.
Automated node cordoning
A bad GPU is marked unhealthy and removed from the scheduler automatically — it never takes down a second run.
Automated checkpoint rollback
On failure: find the last good checkpoint, validate it, resume training. No human in the loop.
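One way to picture "find the last good checkpoint, validate it" is a manifest check, sketched below. It assumes a manifest of shard sizes and SHA-256 digests is written after each successful save; the `manifest.json` name and all function names are hypothetical, and this is not necessarily how Denpex validates checkpoints.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file, written last at save time


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(ckpt_dir):
    """Record the size and digest of every shard after a successful save."""
    ckpt_dir = Path(ckpt_dir)
    entries = {
        p.name: {"size": p.stat().st_size, "sha256": sha256_of(p)}
        for p in sorted(ckpt_dir.iterdir())
        if p.is_file() and p.name != MANIFEST
    }
    (ckpt_dir / MANIFEST).write_text(json.dumps(entries))


def is_valid(ckpt_dir):
    """True only if every shard listed in the manifest is present and intact."""
    ckpt_dir = Path(ckpt_dir)
    manifest_path = ckpt_dir / MANIFEST
    if not manifest_path.exists():
        return False  # the save never finished
    try:
        entries = json.loads(manifest_path.read_text())
    except json.JSONDecodeError:
        return False  # the manifest itself is torn
    for name, meta in entries.items():
        shard = ckpt_dir / name
        if not shard.is_file() or shard.stat().st_size != meta["size"]:
            return False  # missing or truncated shard
        if sha256_of(shard) != meta["sha256"]:
            return False  # same size, different bytes
    return True


def last_good_checkpoint(ckpt_dirs_newest_first):
    """Return the newest checkpoint directory that passes validation, or None."""
    for ckpt_dir in ckpt_dirs_newest_first:
        if is_valid(ckpt_dir):
            return Path(ckpt_dir)
    return None
```

The cheap size comparison runs before hashing, so truncated shards are rejected without reading them in full.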
Auto-remediation engine
Detect → cordon → roll back → resume. Autonomous recovery that closes the loop end to end.
Security & compliance posture
We label compliance honestly: SOC 2 Type II is in progress, not claimed. Ask for the current audit status in your architecture brief.
Your next failure is already scheduled.
The only question is whether it costs you twelve hours of grep — or twelve seconds of Denpex.