Priced against your GPU bill, not your seat count.

A single failure on a 64-GPU cluster wastes hours of compute and an afternoon of engineering time. Every plan starts free, no credit card needed. Annual billing includes 2 months free.

Free

$0/month

Free forever

3 diagnoses per day
Paste logs in the web UI, nothing to install
15,800+ deterministic failure types
Root cause + exact fix, not an essay
Cost optimization advice on every diagnosis (evidence-based right-sizing, spot vs on-demand)
Free in-browser tools: NCCL topology linter (/preflight) + cross-run env diff (/diff)
7-day history
Community support

Team

$415/month

Billed annually · $4,980/yr · 2 months free

25k GPU-hours included · $0.06/GPU-hr overage

Everything in Free, plus:
Unlimited seats
Diagnose jobs up to 128 GPUs
All 21,200+ failure types + AI fallback for novel errors
Cross-rank cascade analysis: isolates the rank that failed first
NCCL hang culprit-rank localizer
Known-bad version / driver advisory: PyTorch · CUDA · cuDNN · NCCL
Resume from the last good checkpoint, integrity-checked
Alerts with the fix inline: push, email, Slack, PagerDuty, SMS
Diagnose from anywhere: /denpex Slack command, VS Code extension, CLI
Incident cost on every diagnosis: GPU-hours and dollars burned
Vendor RMA payload with verdict: dead GPU vs. recoverable. Serial, ECC/Xid evidence, verification steps included
Self-improving engine: every confirmed fix makes the next diagnosis smarter
Cross-run comparison
Cross-run environment diff API: every change ranked by correlation with your failure class
On-call shift handoff reports: dashboard, Markdown export, API
Conversational diagnosis: ask follow-up questions in plain language
One-click connectors: Slack, PagerDuty, W&B, MLflow, TensorBoard. Included, never metered
First-class multi-language SDKs (Rust, Go, Java, TS, Python)
Fix references from GitHub, Stack Overflow, and PyTorch Forums, cited inline
Bayesian AI Safeguard: mathematical verification against causal graph physics to prevent hallucinations
MCP Source Code Injection: exact failing snippet injected into diagnostic context
Team knowledge base
Unlimited history

Scale

$2,495/month

Billed annually · $29,940/yr · 2 months free

150k GPU-hours included · $0.04/GPU-hr overage

Everything in Team, plus:
Monitor up to 1,024 GPUs
In-VPC agent option & self-hosted OpenAlex mirrors: logs and research queries never leave your cluster
Closed-loop auto-remediation: monitor, diagnose, and one-click apply the fix (you confirm)
L1 to L2 to L3 auto-escalation on recurring incidents
Silent data corruption (SDC) detection
Straggler + gray failure detection
DCGM thermal peer-comparison (micro-stragglers)
Delayed-OOM and slow memory leak detection
Vendor kernel regression tracking
PCIe ACS & topology bottleneck diagnosis
Zombie process detection + kill command
Per-layer weight-delta anomaly detection
SLURM and Kubernetes scheduler hooks: cancel wasted jobs before the queue drains
Privacy masking: PII stripped before any log leaves the host
PyTorch × CUDA × cuDNN compatibility checks
Xid node action router: ISOLATE / RESET / REBOOT / RMA
Mass-crash coalescing: a 2,000-node storm becomes one root cause, one page
Environment drift detection: the driver/library change since your last good run
Prometheus metrics endpoint: jobs, diagnoses, and incident cost in Grafana
Active Kubernetes & Environment MCP Tools: AI dynamically queries live pod status and env-doctor
Preflight cluster health check (denpex preflight-cluster)
Deep NCCL topology linter API for CI: GPU→NIC affinity, PCIe ACS, env-vs-fabric validation
Predictive node health scoring (GA): per-node pre-crash score with cordon/drain-RMA calls
Multi-cloud single pane: AWS, GCP, Azure, CoreWeave, Lambda, RunPod, on-prem in one fleet view
Checkpoint rollback + resume plans: verify → cordon → resume from the last verified checkpoint (one-click)
PagerDuty, webhooks, custom routing

Growth

$4,580/month

Billed annually · $54,960/yr · 2 months free

600k GPU-hours included · $0.02/GPU-hr overage

Everything in Scale, plus:
Monitor up to 4,096 GPUs
Lower effective $/GPU as you scale
Denpex MCP server: query failure history from Claude, Cursor, or any MCP client
Research paper enrichment: arXiv, Semantic Scholar, OpenAlex, Crossref, DBLP, and Zenodo cited inline
365-day diagnosis history retention
Priority support: 1-business-day P1 response

Price protection: existing subscribers keep their rate and included GPU-hours when list prices rise.

Every plan includes one-click connectors for Slack, PagerDuty, W&B, MLflow, and TensorBoard, never metered.

Data Center

$12,500/month

Billed annually · $150,000/yr · 2 months free

2M GPU-hours included · $0.015/GPU-hr overage · volume & per-node pricing

Designed for fleets up to 16,384 GPUs · multi-tenant · white-label / OEM. Volume per-GPU, or per-node pricing for GPU-cloud providers who bill their own customers by the node. Onboarded through a scoped pilot, then scaled to your full fleet.

Designed for fleets up to 16,384 GPUs (larger fleets by pilot), multi-tenant
Mass-crash coalescing: one fabric event → one diagnosis, one page
White-label / OEM diagnosis for your customers
Hands-off auto-remediation (opt-in) + predictive node health scoring (GA)
Hands-off checkpoint rollback + resume: crash → verified checkpoint → resumed while you slept (policy-gated, audited)
SLURM, Ray, Kubernetes integration
BYOK · EU region (planned) · air-gapped option
HIPAA BAA · High-availability edge · live status page · dedicated CSM

Book a demo

On-premise in-VPC agent on Scale+ · Upgrade or cancel anytime

Plan changes take effect immediately with prorated billing. On downgrades, the unused portion credits to your next invoice.
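To make the included-hours and overage figures concrete, here is a minimal sketch of how a monthly bill could be estimated from the list prices above. It assumes the included GPU-hours are a monthly allowance and that overage is billed linearly on hours beyond it; the page states neither explicitly. The names `Plan`, `estimate_monthly_bill`, and `cheapest_plan` are illustrative, not part of any Denpex API, and the sketch ignores each plan's GPU-count limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price: float     # USD per month on annual billing (list price)
    included_gpu_hours: int  # assumed to be a monthly allowance
    overage_rate: float      # USD per GPU-hour beyond the allowance


# Figures copied from the plan cards; the Free plan has no GPU-hour metering listed.
PLANS = [
    Plan("Team", 415.0, 25_000, 0.06),
    Plan("Scale", 2_495.0, 150_000, 0.04),
    Plan("Growth", 4_580.0, 600_000, 0.02),
    Plan("Data Center", 12_500.0, 2_000_000, 0.015),
]


def estimate_monthly_bill(plan: Plan, gpu_hours: int) -> float:
    """Base price plus linear overage (assumed billing model)."""
    overage_hours = max(0, gpu_hours - plan.included_gpu_hours)
    return round(plan.monthly_price + overage_hours * plan.overage_rate, 2)


def cheapest_plan(gpu_hours: int) -> Plan:
    """Plan with the lowest estimated bill for a given monthly usage."""
    return min(PLANS, key=lambda p: estimate_monthly_bill(p, gpu_hours))


if __name__ == "__main__":
    for hours in (20_000, 30_000, 100_000):
        best = cheapest_plan(hours)
        print(hours, best.name, estimate_monthly_bill(best, hours))
```

Under these assumptions, 30,000 GPU-hours on Team comes to $415 plus 5,000 overage hours at $0.06, or $715, while at 100,000 GPU-hours Scale is cheaper than paying Team overage.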

Need something custom? Talk to sales.

The Silent Cost of Thermal Stragglers

Calculate the hidden infrastructure waste from fail-slow incidents.

Cluster size (GPUs): 128

GPU hourly cost: $18.00

Straggler degradation: 30% wasted

Pipeline-parallel training runs at the speed of the slowest node. A 30% slower node wastes 30% of the entire cluster's compute. 100% represents a full stall or rollback.

Event duration: 72 min

Events per month: 10

Cost per event

$829

Wasted compute every time this straggler pattern hits the cluster.

Monthly wasted spend

$8,294
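The figures above can be reproduced with a short calculation. This is a sketch of the arithmetic the calculator appears to use, inferred from the displayed inputs and outputs rather than taken from Denpex's code. It assumes the $18.00 hourly cost is per GPU; the function name is illustrative.

```python
def straggler_cost_per_event(gpus: int, gpu_hourly_cost: float,
                             degradation: float, duration_min: float) -> float:
    """Wasted spend for one fail-slow event.

    degradation is the wasted fraction of cluster compute (0.30 = 30%),
    applied to the whole cluster for the duration of the event.
    """
    return gpus * gpu_hourly_cost * degradation * (duration_min / 60)


cost = straggler_cost_per_event(gpus=128, gpu_hourly_cost=18.00,
                                degradation=0.30, duration_min=72)
monthly = cost * 10  # 10 events per month

print(f"${cost:,.0f} per event, ${monthly:,.0f} per month")
# prints "$829 per event, $8,294 per month"
```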

The Denpex ROI

Denpex detects thermal and memory stragglers within 30 seconds of degradation, auto-fencing the node before the pipeline bubble expands.

What we do. And what we don't.

Concrete controls, not vague promises. We label compliance honestly: SOC 2 Type II is planned, not claimed.

PII / PHI masking

Client-side masking runs on the agent before any log is transmitted. Default patterns catch emails, SSNs, phone numbers, credit cards, and common PHI (MRN, NPI). Add your own patterns. Raw PII/PHI never leaves your cluster.
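As an illustration of pattern-based masking with user-supplied additions, here is a minimal sketch. The regular expressions, the `mask_line` function, and the `MRN-` identifier format are all hypothetical, chosen for the example; they are not Denpex's actual default patterns or agent API.

```python
import re

# Illustrative default patterns only (assumed, not Denpex's real defaults).
DEFAULT_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_line(line: str, extra_patterns=None) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    patterns = dict(DEFAULT_PATTERNS)
    if extra_patterns:
        patterns.update(extra_patterns)  # custom patterns are added to the defaults
    for label, pattern in patterns.items():
        line = pattern.sub(f"[{label}]", line)
    return line


# A custom pattern for a hypothetical medical record number format.
custom = {"MRN": re.compile(r"\bMRN-\d{6}\b")}

print(mask_line("user=alice@example.com ssn=123-45-6789"))
# prints "user=[EMAIL] ssn=[SSN]"
print(mask_line("patient MRN-004211 loaded", custom))
# prints "patient [MRN] loaded"
```

Masking of this kind would run on the host before a log line is transmitted, so only the placeholder leaves the cluster.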

Logs deleted after diagnosis on Free and Team

On Free and Team, raw logs are processed in memory and never written to durable storage. We retain anonymized failure signatures and resolution metadata only, never raw lines. An in-VPC agent (no log egress at all) ships on Scale and Data Center.

Encryption

TLS 1.3 in transit, AES-256 at rest. Encryption keys are customer-managed (BYOK) on Data Center via your KMS.

Compliance

SOC 2 Type II planned. GDPR DPA available, HIPAA BAA available on Data Center. Sub-processor list and data flow on the Trust Center.

Access controls

SSO via Google, Discord, GitHub, and Microsoft. Enterprise SAML/OIDC SSO is on our roadmap. Role-based access (owner, admin, member, viewer) on Team+. 30-day audit log of all billing and team changes.

Deployment options

Cloud (default), single-tenant on AWS / Azure / GCP, or fully air-gapped on your hardware on Data Center. White-label / OEM available for GPU clouds.

Frequently asked questions

How does Denpex get access to my training logs?

Will adding Denpex slow down my training?

Can it diagnose failures that already happened?

Do you store our training logs?

What if Denpex can't diagnose it?

Does it work with spot instances and preemptible GPUs?

How does the team knowledge base work?

Can Denpex compare my current failed run against a previous successful one?

What's on your roadmap?

Can I self-host the entire platform?
