When one rank dies, all you see is 512 NCCL timeouts.

Denpex finds the rank that actually failed first, tells you whether it was the hardware, your model, your data, or the environment, and gives you the fix command. Built for distributed PyTorch, FSDP, and DeepSpeed at 64 to 16,384 GPUs.

Try 3 diagnoses free Diagnose a log now, no signup

Paste a crash log for a real diagnosis. No signup, no install. Or explore the live console →

Mapped the canonical failure graph across 140+ production AI incidents and 100+ infra engineers.

llama3-70b-finetune-run-47crashed, 512 GPUs idle

CrashDiagnoseFixAlertreplay of a real failure pattern

Built for

Frontier LLM training

70B+ model training across 1,000+ GPUs. Cross-rank cascade analysis, silent data corruption detection, and resume from the last good checkpoint.

FSDP / DDP production

PyTorch DDP and FSDP at scale. Diagnose the originating rank, the failure class, and the exact fix in under 12 seconds.

GPU clouds and neoclouds

White-label diagnosis for your customers. Offer Denpex under your brand, on the clusters you already operate.

Healthcare and life sciences

HIPAA BAA on Data Center. Configurable PHI masking on the agent, with audit logs of every diagnosis.

Want to be a reference customer? Get in touch →

Live diagnosis, no signup

Paste your last crash. Get the root cause now.

A real diagnosis on your own log. Root cause and the exact fix, in seconds. No account, no card, no agent. Logs are masked before they leave your browser's request, and deleted after diagnosis.

training_logs.txt

3 free diagnoses/day

Want 14 days of full access?

Request a free trial code for unlimited diagnoses, alerts, history, and follow-up questions. No credit card.

Request 14-day trial

Built for distributed ML training.

From frontier LLM training to FSDP fine-tunes to GPU cloud operations. Denpex fits the way your team already runs.

Frontier LLM training

70B+ model training across 1,000+ GPUs. Cross-rank cascade analysis, silent data corruption detection, and resume from the last good checkpoint.

FSDP / DDP production

PyTorch DDP and FSDP at scale. Diagnose the originating rank, the failure class, and the exact fix in under 12 seconds.

GPU clouds and neoclouds

White-label diagnosis for your customers. Offer Denpex under your brand, on the clusters you already operate.

Healthcare and life sciences

HIPAA BAA on Data Center. Configurable PHI masking on the agent, with audit logs of every diagnosis.

Supported Ecosystems

Frameworks

PyTorchDeepSpeedHorovodMegatron-LMRayHugging Face Accelerate

Schedulers

SLURMKubernetes

Monitoring

Weights & BiasesMLflowTensorBoard

Who uses Denpex?

Primary OperatorsTraining Infrastructure Engineers, ML Infrastructure Engineers, AI Platform Engineers, MLOps Engineers, Distributed Systems Engineers, GPU Cluster Engineers, SREs, and Platform Reliability Engineers.

Secondary BeneficiariesAI Researchers, Machine Learning Engineers, Deep Learning Engineers, Data Scientists, and Performance Engineers.

Platform BuyersDirector of AI Infrastructure, Head of ML Platform, VP of Engineering, CTO, and Director of Research Infrastructure.

Ideal OrganizationsAI startups training foundation models, Enterprise AI teams, GPU cloud providers, Managed ML platforms, Universities, National laboratories (HPC), Autonomous vehicle companies, and Defense contractors.

Your logs say “NCCL timeout.” The timeline says rank 17.

Distributed timeline reconstruction orders every event across every rank, clock-drift corrected, so the cascade reads in causal order, not log order. The watchdog is the last thing that happened, never the first.

t+0 msrank 17ECC uncorrectable error: the true initiatorROOT CAUSE

t+340 msrank 4stalls waiting on the wedged collective

t+600 sall ranksNCCL watchdog times out, the only line your logs showed you

The complete reliability surface

Diagnosis is the entry point. The platform covers the whole failure lifecycle before launch, during training, and after the fix ships.

1,284 hand-tuned signatures plus retrieval over the 386-entry Failure Encyclopedia, diagnosed in under 12 seconds, with AI fallback analysis for anything novel. Every diagnosis ends in a prescriptive fix.

Root cause analysis

The originating fault, never the symptom that woke you up

First-failed-rank detection

Which rank failed first, isolated across the whole fleet

Distributed rank correlation

Cross-rank telemetry stitched into one causal picture

Cascade failure analysis

How one bad GPU took 63 healthy ranks down with it

NCCL timeout diagnosis

The initiator behind the watchdog's generic timeout

CUDA OOM diagnosis + memory attribution

The exact tensor or layer that caused the OOM

Memory fragmentation diagnosis

Reserved-but-unallocated signatures, allocator-level fixes

Gradient explosion diagnosis

Norm spikes traced back to layer and step

NaN loss diagnosis

The propagation source, not just the first poisoned batch

Weight divergence diagnosis

Drift measured against your own healthy baselines

Silent hang diagnosis

Heartbeat detection for jobs that die without a stack trace

Device assert diagnosis

Device-side asserts mapped to the offending operation

Checkpoint corruption diagnosis

Torn writes and truncated shards caught before resume

Import error diagnosis

Environment faults separated from training faults

Version mismatch diagnosis

PyTorch × CUDA × cuDNN conflicts flagged precisely

Disk full diagnosis

Storage exhaustion before it masquerades as a framework crash

AI fallback analysis

Unknown failures get deep analysis on masked excerpts

Prescriptive fixes

Copy-paste resolution paths, verified against the failure class

Resume checkpoint recommendations

The last verified-good step to restart from

Hardware vs software classification

Infra issue or ML issue, instantly, so the right team moves

Looking for the full encyclopedia? Browse 500+ failure classes →

Architecture your security team can say yes to.

No paste-your-logs surprises. The agent is transparent about what it touches, what it masks, and what (if anything) leaves your cluster.

Request the full architecture brief

01 · your boundary

In-VPC agent

Single Python file, stdlib only, no root, no kernel module. Wraps your training command, heartbeats every 30 s. PII/PHI masking runs client-side, before any byte leaves your cluster, on by default. Set DENPEX_PRIVACY=strict and the agent ships only anonymized failure signatures; raw logs never leave.

02 · pattern-first, AI-last

Deterministic engine

1,284 hand-tuned regex signatures (1,262 failure types) plus IDF-weighted retrieval over the 386-entry Failure Encyclopedia, with clock-drift-corrected timeline reconstruction, do the work deterministically. The AI fallback only sees masked excerpts of novel failures, and your logs are never used to train anything.

03 · one incident, one owner

Routed resolution

Ownership mapping sends one correlated incident, root cause, classification, and the exact fix, to the engineer who owns the job, on Slack, PagerDuty, SMS or webhook. Hardware issues route to infra; ML issues route to research.

What your failures cost. And what you get back.

Estimate the GPU-hours and engineering time your team loses to undiagnosed failures, and the payback on each plan.

Your cluster wastes this much every month

$5,512

4 failures × 64 H100 SXM5 (80GB) × 3 hrs avg

Cluster size

GPU type

Training failures / month4

Meta Llama 3 averaged 7.7 failures/day on 16k GPUs · default scales with cluster size until you set it

Hours lost per failure3 hrs

Estimate conservatively based on your team's typical triage time.

Compute cost wasted

$1,912

768 GPU-hours/mo

Engineering time wasted

$3,600

2 engineers @ $150/hr

Annual2 months free

Estimated on 46,720 monitored GPU-hours/mo (64 GPUs × 730 h) · savings assume 70% of failure waste avoided

Best fit

Team

$415/mo

$4,980/yr

25,000 GPU-hrs included

then $0.06/GPU-hr · rate protected

Your est. Denpex cost

$1,718.20/mo

$415 base + $1,303.20 overage

Compute waste avoided

$2,140

Time to positive ROI: 25 days

Scale

$2,495/mo

$29,940/yr

150,000 GPU-hrs included

then $0.04/GPU-hr · rate protected

Your est. Denpex cost

$2,495/mo

Growth

$4,580/mo

$54,960/yr

600,000 GPU-hrs included

then $0.02/GPU-hr · rate protected

Your est. Denpex cost

$4,580/mo

Data Center

$12,500/mo

$150,000/yr

2,000,000 GPU-hrs included

then $0.015/GPU-hr

Your est. Denpex cost

$12,500/mo

Stop losing GPU budget to failures you can't diagnose.

No credit card. No setup. Diagnosis in under 12 seconds.

Try 3 diagnoses free →See the platform ↗

Estimates based on published failure rates (Meta Llama 3, Kokolis HPCA 2025, Epoch AI, ByteDance ByteRobust) and time-to-resolution studies (Factryze, FALCON).

From engineers who've been paged at 2am

Real failures. Real root causes. Not the ones they expected.

OOM masked as NCCL
“We ran 32 node DDP jobs that kept dying at step 12k-15k. Spent two weeks thinking it was a networking issue between our IB switches. Denpex flagged Rank 8 hitting OOM from gradient accumulation buffer growth at step 12,847. One line in deepspeed config, hasn't happened since. Still blows my mind it caught that from our NCCL timeout logs.”
Senior ML Infrastructure Engineer32 node DDP cluster

Dataloader, not hardware
“Our FSDP fine tunes were failing like clockwork every Thursday. Corrupted sample in our dataset that only showed up with certain sequence lengths. Without Denpex we'd have blamed the hardware vendor for another month. It pointed directly to the dataloader. One PyTorch Dataset fix, done.”
ML Platform LeadFSDP fine tuning

Ends the 2am blame game
“Honestly the biggest win is not the speed. It's having something that gives ML engineers and infra the same answer. When a job crashes at 2am, nobody's arguing about whether it was the network or the code. Denpex says Rank 47 hit a CUDA OOM. Both teams look at that and move on to fixing it instead of blaming each other for four hours.”
Staff ML EngineerShared training infrastructure

Priced against your GPU bill, not your seat count.

A single failure on a 64-GPU cluster wastes hours of compute and an afternoon of engineering time. Every plan starts free, no credit card needed. Annual plans save 2 months.

Annual2 months free

Free

$0/ month

Free forever

3 diagnoses per day
Paste logs in the web UI, nothing to install
15,800+ deterministic failure types
Root cause + exact fix, not an essay
Cost optimization advice on every diagnosis (evidence-based right-sizing, spot vs on-demand)
Free in-browser tools: NCCL topology linter (/preflight) + cross-run env diff (/diff)
7-day history
Community support

Team

$415/ month

Billed annually · $4,980/yr · 2 months free

25k GPU-hours included · $0.06/GPU-hr overage

Everything in Free, plus:
Unlimited seats
Diagnose jobs up to 128 GPUs
All 21,200+ failure types + AI fallback for novel errors
Cross-rank cascade analysis: isolates the rank that failed first
NCCL hang culprit-rank localizer
Known-bad version / driver advisory: PyTorch · CUDA · cuDNN · NCCL
Resume from the last good checkpoint, integrity-checked
Alerts with the fix inline: push, email, Slack, PagerDuty, SMS
Diagnose from anywhere: /denpex Slack command, VS Code extension, CLI
Incident cost on every diagnosis: GPU-hours and dollars burned
Vendor RMA payload with verdict: dead GPU vs. recoverable. Serial, ECC/Xid evidence, verification steps included
Self-improving engine: every confirmed fix makes the next diagnosis smarter
Cross-run comparison
Cross-run environment diff API: every change ranked by correlation with your failure class
On-call shift handoff reports: dashboard, Markdown export, API
Conversational diagnosis: ask follow-up questions in plain language
One-click connectors. Slack, PagerDuty, W&B, MLflow, TensorBoard, included, never metered
First-class multi-language SDKs (Rust, Go, Java, TS, Python)
Fix references from GitHub, Stack Overflow, and PyTorch Forums, cited inline
Bayesian AI Safeguard: mathematical verification against causal graph physics to prevent hallucinations
MCP Source Code Injection: exact failing snippet injected into diagnostic context
Team knowledge base
Unlimited history

Scale

$2,495/ month

Billed annually · $29,940/yr · 2 months free

150k GPU-hours included · $0.04/GPU-hr overage

Everything in Team, plus:
Monitor up to 1,024 GPUs
In-VPC agent option & self-hosted OpenAlex mirrors: logs and research queries never leave your cluster
Closed-loop auto-remediation: monitor, diagnose, and one-click apply the fix (you confirm)
L1 to L2 to L3 auto-escalation on recurring incidents
Silent data corruption (SDC) detection
Straggler + gray failure detection
DCGM thermal peer-comparison (micro-stragglers)
Delayed-OOM and slow memory leak detection
Vendor kernel regression tracking
PCIe ACS & topology bottleneck diagnosis
Zombie process detection + kill command
Per-layer weight-delta anomaly detection
SLURM and Kubernetes scheduler hooks: cancel wasted jobs before the queue drains
Privacy masking: PII stripped before any log leaves the host
PyTorch × CUDA × cuDNN compatibility checks
Xid node action router: ISOLATE / RESET / REBOOT / RMA
Mass-crash coalescing: a 2,000-node storm becomes one root cause, one page
Environment drift detection: the driver/library change since your last good run
Prometheus metrics endpoint: jobs, diagnoses, and incident cost in Grafana
Active Kubernetes & Environment MCP Tools: AI dynamically queries live pod status and env-doctor
Preflight cluster health check (denpex preflight-cluster)
Deep NCCL topology linter API for CI: GPU→NIC affinity, PCIe ACS, env-vs-fabric validation
Predictive node health scoring (GA): per-node pre-crash score with cordon/drain-RMA calls
Multi-cloud single pane: AWS, GCP, Azure, CoreWeave, Lambda, RunPod, on-prem in one fleet view
Checkpoint rollback + resume plans: verify → cordon → resume from the last verified checkpoint (one-click)
PagerDuty, webhooks, custom routing

Growth

$4,580/ month

Billed annually · $54,960/yr · 2 months free

600k GPU-hours included · $0.02/GPU-hr overage

Everything in Scale, plus:
Monitor up to 4,096 GPUs
Lower effective $/GPU as you scale
Denpex MCP server: query failure history from Claude, Cursor, or any MCP client
Research paper enrichment: arXiv, Semantic Scholar, OpenAlex, Crossref, DBLP, and Zenodo cited inline
365-day diagnosis history retention
Priority support: 1-business-day P1 response

Price protection: existing subscribers keep their rate and included GPU-hours when list prices rise.

Every plan includes one-click connectors. Slack, PagerDuty, W&B, MLflow, TensorBoard, never metered.

Data Center

$12,500/ monthBilled annually · $150,000/yr · 2 months free

2M GPU-hours included · $0.015/GPU-hr overage · volume & per-node pricing

Designed for fleets up to 16,384 GPUs · multi-tenant · white-label / OEM. Volume per-GPU, or per-node pricing for GPU-cloud providers who bill their own customers by the node. Onboarded through a scoped pilot, then scaled to your full fleet.

Designed for fleets up to 16,384 GPUs. Beyond by pilot, multi-tenant
Mass-crash coalescing: one fabric event → one diagnosis, one page
White-label / OEM diagnosis for your customers
Hands-off auto-remediation (opt-in) + predictive node health scoring (GA)
Hands-off checkpoint rollback + resume: crash → verified checkpoint → resumed while you slept (policy-gated, audited)
SLURM, Ray, Kubernetes integration
BYOK · EU region (planned) · air-gapped option
HIPAA BAA · High-availability edge · live status page · dedicated CSM

Book a demo

On-premise in-VPC agent on Scale+ · Upgrade or cancel anytime

Plan changes take effect immediately with prorated billing. On downgrades, the unused portion credits to your next invoice.

Need something custom? Talk to sales.

From diagnosis to autonomy

Diagnosis closes the loop on understanding. The autonomy layer closes the loop on recovery. See the autonomy roadmapfor what's shipped, in progress, and planned.

Predictive failure scoring

Telemetry models flag deteriorating GPUs and nodes before the crash, so jobs migrate instead of dying.

Automated node cordoning

A bad GPU is marked unhealthy and removed from the scheduler automatically.

Automated checkpoint rollback

On failure: find the last good checkpoint, validate it, resume training. No human in the loop.

Auto-remediation engine

Detect → cordon → roll back → resume. Autonomous recovery that closes the loop end to end.

Track delivery dates on the product roadmap →

Security & compliance posture

Logs deleted after diagnosis on Free/TeamPII / PHI masking before egressIn-VPC agent on Scale and Data CenterRetention + purge controlsRBAC + SSO (SAML/OIDC) on Data CenterGDPR DPA · HIPAA BAA on Data Center · SOC 2 Type II planned

We label compliance honestly: SOC 2 Type II is planned, not claimed. Read the Trust Center →

Frequently asked questions

Your next failure is already scheduled.

The only question is whether it costs you twelve hours of grep, or twelve seconds of Denpex.

Try 3 diagnoses free Talk to sales

When one rank dies, all you see is 512 NCCL timeouts.

Paste your last crash. Get the root cause now.

Built for distributed ML training.

Frontier LLM training

FSDP / DDP production

GPU clouds and neoclouds

Healthcare and life sciences

Supported Ecosystems

Frameworks

Schedulers

Monitoring

Who uses Denpex?

Your logs say “NCCL timeout.” The timeline says rank 17.

The complete reliability surface

Architecture your security team can say yes to.

In-VPC agent

Deterministic engine

Routed resolution

What your failures cost. And what you get back.

From engineers who've been paged at 2am

Priced against your GPU bill, not your seat count.

Free

Team

Scale

Growth

Data Center

From diagnosis to autonomy

Predictive failure scoring

Automated node cordoning

Automated checkpoint rollback

Auto-remediation engine

Frequently asked questions

How does Denpex get access to my training logs?

Will adding Denpex slow down my training?

Can it diagnose failures that already happened?

Do you store our training logs?

What if Denpex can't diagnose it?

Does it work with spot instances and preemptible GPUs?

How does the team knowledge base work?

Can Denpex compare my current failed run against a previous successful one?

What's on your roadmap?

Can I self-host the entire platform?

Your next failure is already scheduled.

Paste your last crash. Get the root cause now.