The ML failure diagnosis platform

A deterministic pipeline that collects logs from all ranks, correlates failure events using clock-drift corrected ordering, classifies them against 16,500+ documented patterns, and outputs the exact root cause and fix in under 12 seconds.

Try 3 diagnoses free · Read the quickstart

How the diagnosis pipeline works

Six sequential steps from process exit to prescriptive fix. Each step is optimized for latency. Total pipeline time is under 12 seconds for failures in the deterministic library.

Multi-rank log collection

The Denpex agent runs as the parent process of your training command. On failure, it simultaneously collects stderr and stdout from all GPU ranks, whether they ran on a single node or across 4,096 nodes. Collection completes in under 2 seconds.
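A minimal sketch of the parent-process idea, using only the Python standard library. It wraps a single command and keeps its stdout and stderr when the command exits; `run_and_capture` is a hypothetical helper, and how the Denpex agent gathers output from ranks on other nodes is not shown here.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run `command` as a child process and keep its output for diagnosis."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "failed": result.returncode != 0,
    }


# Stand-in for a training command that crashes on one rank.
fake_job = [
    sys.executable,
    "-c",
    "import sys; print('step 1'); sys.stderr.write('Rank 17: out of memory'); sys.exit(1)",
]
report = run_and_capture(fake_job)
print(report["exit_code"], report["stderr"])
```

Because the wrapper is the parent, it still holds the captured output after the child has died, which is the point of running the agent above the training command.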

Clock-drift correction

In distributed training, each node has its own system clock. Clock skew between nodes can exceed 500 milliseconds. That is enough to make the causal order of log events completely wrong. Denpex applies a distributed timestamp correction protocol to produce a single, accurate causal timeline across all ranks.
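To see why skew matters, here is a small sketch that assumes a per-node clock offset has already been estimated. `LogEvent`, `build_timeline`, and the offset values are illustrative, not Denpex's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    node: str
    local_ts: float  # seconds, as stamped by the node's own clock
    message: str


def build_timeline(events, offsets):
    """Shift each event by its node's clock offset, then sort by corrected time.

    offsets[node] is (node clock - reference clock) in seconds, so the
    corrected time is local_ts - offsets[node].
    """
    corrected = [(e.local_ts - offsets.get(e.node, 0.0), e) for e in events]
    corrected.sort(key=lambda pair: pair[0])
    return corrected


events = [
    LogEvent("node-a", 100.200, "NCCL timeout waiting on rank 17"),
    LogEvent("node-b", 100.600, "Rank 17: out of memory"),
]
# Hypothetical offsets: node-b's clock runs 0.5 s ahead of the reference.
offsets = {"node-a": 0.0, "node-b": 0.5}

timeline = build_timeline(events, offsets)
for ts, event in timeline:
    print(f"{ts:.3f} {event.node} {event.message}")
```

By raw timestamps the NCCL timeout appears to come first. After correction the out-of-memory event on node-b precedes it, which changes which event looks like the cause.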

Pattern matching against 16,500+ failure classes

The corrected log corpus is matched against Denpex's pattern library: 16,500+ documented failure classes covering NCCL, CUDA, PyTorch, DeepSpeed, FSDP, Megatron-LM, and GPU hardware faults. Each pattern includes the root cause, the affected systems, and the prescriptive fix. Matches are returned in under 12 seconds. JAX/Flax and XLA stack traces are accepted today via the AI fallback; native JAX pattern packs are on the roadmap.
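As a rough illustration of deterministic matching, the sketch below scans a causally ordered list of log lines against a tiny regex library. The two entries, their wording, and the `match_first` helper are made up for the example; they are not taken from the Denpex library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    failure_class: str
    regex: re.Pattern
    root_cause: str
    fix: str


# Two illustrative entries; a real library would hold many more.
LIBRARY = [
    Pattern("CUDA_OOM", re.compile(r"out of memory", re.IGNORECASE),
            "A rank exhausted GPU memory.",
            "Reduce the per-GPU batch size or enable activation checkpointing."),
    Pattern("NCCL_TIMEOUT", re.compile(r"NCCL.*timeout", re.IGNORECASE),
            "A collective did not complete before the timeout.",
            "Find the rank that stopped first and inspect its logs."),
]


def match_first(lines):
    """Return (line, pattern) for the earliest line that matches any pattern."""
    for line in lines:
        for pattern in LIBRARY:
            if pattern.regex.search(line):
                return line, pattern
    return None


# Lines are assumed to be in corrected causal order already.
ordered = ["Rank 17: out of memory", "[ERROR] NCCL timeout on ranks 0-63"]
hit = match_first(ordered)
print(hit[1].failure_class, "->", hit[1].fix)
```

Matching in causal order is what lets the earliest event, not the loudest one, decide the diagnosis.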

Failure classification

Every diagnosis returns a failure classification: hardware fault, ML/numerical, data pipeline, environment/configuration, or distributed communication. The classification determines alert routing: hardware faults go to the infrastructure team, ML failures go to the training team, data bugs go to the data team.
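The routing rule can be pictured as a lookup table. In this sketch the five classes and the first three routes follow the text above; the channel names and the default route for the remaining two classes are assumptions.

```python
from enum import Enum


class FailureClass(Enum):
    HARDWARE = "hardware fault"
    ML_NUMERICAL = "ML/numerical"
    DATA_PIPELINE = "data pipeline"
    ENVIRONMENT = "environment/configuration"
    DISTRIBUTED_COMM = "distributed communication"


# Team assignments follow the text; the channel names are hypothetical.
ROUTES = {
    FailureClass.HARDWARE: "#infrastructure",
    FailureClass.ML_NUMERICAL: "#training",
    FailureClass.DATA_PIPELINE: "#data",
}
# Assumption: the text names no owner for the other two classes.
DEFAULT_ROUTE = "#training-oncall"


def route(failure_class):
    """Pick the alert destination for a classified failure."""
    return ROUTES.get(failure_class, DEFAULT_ROUTE)


print(route(FailureClass.HARDWARE))
print(route(FailureClass.DISTRIBUTED_COMM))
```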

Prescriptive fix output

The diagnosis output is not just a failure class. It includes the specific remediation: the exact environment variable to change, the config option to update, the checkpoint to resume from, or the GPU to replace. The fix is written in the alert body so engineers can act immediately without reading a runbook.

LLM fallback for novel failures

When no deterministic pattern matches with sufficient confidence, Denpex routes the full log corpus, including the clock-corrected causal timeline and hardware telemetry, to an LLM-powered diagnosis path. The result arrives in under 60 seconds, and novel failures are flagged for addition to the deterministic library.
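The fallback decision can be sketched as a confidence gate. The threshold value, the `diagnose` function, and both stand-in callables are hypothetical; the LLM call is replaced by a canned string so the example runs offline.

```python
CONFIDENCE_THRESHOLD = 0.9  # hypothetical cutoff; the real threshold is not published


def diagnose(log_text, deterministic_match, llm_fallback):
    """Prefer a confident deterministic match; otherwise use the fallback."""
    match = deterministic_match(log_text)  # (failure_class, confidence) or None
    if match is not None and match[1] >= CONFIDENCE_THRESHOLD:
        return {"source": "pattern", "failure_class": match[0], "flag_for_library": False}
    # No confident match: hand off and flag the failure for later curation.
    return {"source": "llm", "suggestion": llm_fallback(log_text), "flag_for_library": True}


def toy_matcher(log_text):
    # Stand-in for the pattern engine.
    if "out of memory" in log_text:
        return ("CUDA_OOM", 0.99)
    return None


def toy_llm(log_text):
    # Stand-in for a model call.
    return "Novel failure: verify suggestions manually."


known = diagnose("Rank 17: out of memory", toy_matcher, toy_llm)
novel = diagnose("unrecognized fabric error 0x3f", toy_matcher, toy_llm)
print(known["source"], novel["source"])
```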

How Denpex diagnoses failures in under 12 seconds

Paste your logs. Get a diagnosis. Fix the problem.

Paste your logs

Copy the error output from your training job and paste it into the diagnosis box. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl.

Paste your training output:

[ERROR] NCCL timeout on ranks 0-63

Rank 17: out of memory

Checkpoint failed to save

✓ Diagnosis: 11 seconds

Get instant diagnosis

Denpex pattern-matches your logs against known failure types. For common issues like CUDA OOM, NCCL timeout, gradient explosion, and checkpoint corruption, you get an instant match with the root cause and fix.

Pattern match found:

Rank 17: OOM at step 8,432

Root cause: memory fragmentation

Fix: PYTORCH_CUDA_ALLOC_CONF
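For readers unfamiliar with that variable, this is one way it might be set from Python. The value shown is an assumption chosen for illustration: `expandable_segments:True` is one PyTorch allocator option aimed at fragmentation, and the right setting depends on your workload and PyTorch version.

```python
import os

# Set this before the first CUDA allocation, so do it before importing torch.
# setdefault keeps any value already exported by the launcher or job script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```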

21,200+ failure types covered

CUDA OOM, memory fragmentation, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, silent hangs, and more. 16,500+ are documented patterns that ship with a prescriptive fix; the rest are real-world incidents and their resolutions, reached by semantic recall.

CUDA_OOM · OOM_FRAGMENTATION · NCCL_TIMEOUT · GRADIENT_EXPLOSION · CHECKPOINT_CORRUPTION · IMPORT_ERROR · VERSION_MISMATCH · DEVICE_ASSERT · SILENT_HANG · NAN_LOSS · DISK_FULL · WEIGHT_DIVERGENCE

16,500+ documented failure types, each with a prescriptive fix

Unknown failures get AI analysis

If your failure doesn't match a known pattern, Denpex uses AI to analyze the logs and suggest what happened. You always get a next step, even for novel errors.

AI Analysis

Novel failure detected

This error pattern doesn't match known failures. AI analysis suggests checking memory allocator configuration and batch size settings.

Confidence: Lower, verify suggestions manually

One line. No config. Works on your next failure.

train.py

import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"]  # optional
)

# The rest of your training code is unchanged
trainer.train()

Monitoring tells you the job died. Denpex tells you why.

W&B and Grafana are dashboards. ChatGPT has no idea what rank 42 was doing. Denpex is the only tool built specifically to diagnose distributed GPU training failures and hand you the fix.

Diagnose from paste-only logs

You don't need an agent or integration. Paste your error output, get a diagnosis. Works with any framework: PyTorch, DeepSpeed, Megatron, Axolotl, JAX/Flax, or whatever else you're using.

Prescriptive fixes, not just error codes

Denpex doesn't just tell you what broke. It tells you how to fix it. Every diagnosis ends in a specific env var to set, config to change, or checkpoint to resume from. No essays. Just the fix.

16,500+ failure types, exact-match diagnosis

CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, stragglers, zombie processes, weight delta anomalies, and more, all with exact pattern matching and known fixes. Beyond exact matches, semantic recall reaches 21,200+ total failure signatures, including 4,700+ real-world incidents and their resolutions.

Instant diagnosis, no waiting

Pattern matching runs in seconds. No AI hallucination risk for known failures. Novel errors get AI analysis with confidence scores so you know how reliable the suggestion is.

Goodput and fleet reliability

Calculate tokens/sec, point-in-time MFU, achieved PFLOP/s, wasted GPU-equivalent compute, node risk scores, and RMA candidates from the same run and fleet evidence operators already collect.
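As a worked example of two of these metrics, the sketch below computes tokens/sec and MFU. It uses the common approximation of about 6 FLOPs per parameter per token for dense transformer training, which is an assumption swapped in for illustration and not necessarily how Denpex computes MFU. All numbers are hypothetical.

```python
def tokens_per_second(tokens_per_step, step_time_s):
    """Training throughput from one step's token count and wall time."""
    return tokens_per_step / step_time_s


def achieved_flops(n_params, tok_per_s):
    """Approximate training FLOP/s as 6 * parameters * tokens/sec."""
    return 6.0 * n_params * tok_per_s


def mfu(n_params, tok_per_s, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s over the fleet's peak FLOP/s."""
    return achieved_flops(n_params, tok_per_s) / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 1B parameters, 200k tokens per 2 s step, 2 GPUs at 1 PFLOP/s peak.
tps = tokens_per_second(200_000, 2.0)
pflops = achieved_flops(1e9, tps) / 1e15
utilization = mfu(1e9, tps, n_gpus=2, peak_flops_per_gpu=1e15)
print(f"{tps:.0f} tok/s, {pflops:.2f} PFLOP/s, MFU {utilization:.0%}")
```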

Preflight, scheduler actions, and RMA packets

Run NCCL/IB/NVLink/topology preflight checks, issue governed SLURM/Kubernetes remediation tokens, and generate vendor-ready RMA evidence packets with dead-vs-recoverable verdicts.

Capability	Manual debugging	W&B / Grafana	Generic LLM chat	Denpex
Tells you which rank failed first	Hours of log diving	✕	✕ no cluster context	✓ in seconds
Root cause, not symptom dashboards	Eventually	✕ shows metrics, not causes	Guesses	✓ exact match on known types
Exact fix: env var, config, checkpoint	If you find it	✕	Unverified suggestions	✓ prescriptive, every time
Remembers your cluster's failure history	Tribal knowledge	✕	✕	✓ team knowledge base
Works at 2am with zero setup	✕ you are the setup	✕ needs instrumentation	✓	✓ paste logs, done

Platform FAQ

How does Denpex handle clock skew between nodes?

Denpex applies a distributed timestamp correction protocol at log collection time. The agent on each node records a high-resolution timestamp at collection and exchanges reference timestamps with the coordinator. The correction is applied before the causal ordering step, producing a timeline accurate to within a few milliseconds across any cluster size.
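One standard way to estimate an offset from exchanged reference timestamps is the NTP-style symmetric-delay calculation, sketched below. It is offered as an illustration of the general technique; the answer above does not say which protocol Denpex uses, and the timestamps are invented.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate (node clock - coordinator clock) from one round trip.

    t0: coordinator sends the request     (coordinator clock)
    t1: node receives the request         (node clock)
    t2: node sends the reply              (node clock)
    t3: coordinator receives the reply    (coordinator clock)
    Assumes the network delay is the same in both directions.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


# Invented exchange: the node clock is 0.5 s ahead, with 10 ms each way.
offset, round_trip = estimate_offset(t0=10.000, t1=10.510, t2=10.511, t3=10.021)
print(f"offset={offset:.3f}s round_trip={round_trip:.3f}s")
```

The estimate is only as good as the symmetric-delay assumption, so the accuracy of any such correction depends on how even the network path is in each direction.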

What is the maximum cluster size Denpex supports?

Denpex is designed for and supports clusters up to 16,384 GPUs. Cascade analysis is exercised to 65,536 ranks against synthetic fixtures, fleet heartbeats shard per-tenant with multiples of headroom at 2,000 nodes, and mass-crash coalescing collapses a fabric-wide event (thousands of simultaneous node crashes) into a single diagnosis and a single page instead of thousands. Contact us for fleets beyond 16,384 GPUs; they're onboarded through a scoped pilot.
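Mass-crash coalescing can be pictured as time-window grouping. The sketch below is a simplified illustration; the `coalesce` helper and the 5-second window are assumptions, not Denpex's actual rule.

```python
def coalesce(crash_events, window_s=5.0):
    """Group crashes that occur within `window_s` of the previous crash.

    crash_events is an iterable of (timestamp_seconds, node_name) tuples.
    Each returned incident would become one diagnosis and one page.
    """
    incidents = []
    for ts, node in sorted(crash_events):
        if incidents and ts - incidents[-1]["last_ts"] <= window_s:
            incidents[-1]["nodes"].append(node)
            incidents[-1]["last_ts"] = ts
        else:
            incidents.append({"first_ts": ts, "last_ts": ts, "nodes": [node]})
    return incidents


# 1,000 nodes crash within two seconds, then one unrelated crash much later.
burst = [(100.0 + i * 0.002, f"node-{i}") for i in range(1000)]
incidents = coalesce(burst + [(200.0, "node-x")])
print(len(incidents), "incidents;", len(incidents[0]["nodes"]), "nodes in the first")
```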

Does Denpex require changes to my training code?

No. Denpex wraps your training command at the process level. You change one line in your launcher or sbatch script. Your training code, logging calls, and framework configuration are completely unchanged.

How does silent data corruption detection work?

Denpex monitors the per-layer weight-delta distribution across training steps. SDC events cause specific statistical anomalies in these distributions: weight updates that are too uniform, too sparse, or outside the expected magnitude range for the layer type and training step. These anomalies are invisible in the loss curve and produce no log output.
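The three anomaly types named above can be sketched as simple checks on one layer's weight deltas. The `delta_anomalies` helper, its thresholds, and the expected range are placeholders; real values would depend on layer type and training step.

```python
import statistics


def delta_anomalies(deltas, expected_abs_range, zero_tol=1e-12,
                    max_zero_fraction=0.5, min_rel_spread=1e-3):
    """Flag a layer's weight deltas as too sparse, too uniform, or out of range."""
    flags = []
    magnitudes = [abs(d) for d in deltas]

    # Too sparse: most weights did not move at all.
    zero_fraction = sum(m <= zero_tol for m in magnitudes) / len(magnitudes)
    if zero_fraction > max_zero_fraction:
        flags.append("too_sparse")

    # Too uniform: the update magnitudes barely vary relative to their mean.
    mean_mag = statistics.fmean(magnitudes)
    if mean_mag > 0 and statistics.pstdev(magnitudes) / mean_mag < min_rel_spread:
        flags.append("too_uniform")

    # Out of range: the mean magnitude falls outside the expected band.
    low, high = expected_abs_range
    if not (low <= mean_mag <= high):
        flags.append("out_of_range")
    return flags


expected = (1e-5, 1e-3)  # hypothetical band for this layer and step
print(delta_anomalies([1e-4, -2e-4, 3e-4, -1.5e-4, 2.5e-4], expected))
print(delta_anomalies([2e-4] * 5, expected))
print(delta_anomalies([0.0, 0.0, 0.0, 0.0, 5e-4], expected))
```

None of these checks look at the loss, which is why they can surface a problem the loss curve hides.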

Can Denpex run entirely within my VPC?

Yes, on the Scale+ plan. The in-VPC deployment runs the full pattern engine locally with no external API calls. Only anonymized diagnostic metadata (failure class, severity, rank ID) leaves the VPC; raw log content never does. Set DENPEX_LOCAL=1 for complete zero-egress operation.

How long does it take to install Denpex?

Typical installation takes under 30 minutes. Download the single-file agent (curl -O https://denpex.com/agent/denpex.py) or install the published SDK (pip install denpex), add your API key to the environment, wrap your training command with python denpex.py run --, and run a test diagnosis. The quickstart guide covers SLURM, Kubernetes, and bare-metal setups.