Priced against your GPU bill, not your seat count

A single failure on a 64-GPU cluster wastes ~$847 in compute and an afternoon of engineering time. Every plan starts free, no credit card needed.

Free

$0/month

Prove it on your last failure. No credit card.

  • 3 diagnoses per month
  • Paste logs in the web UI, nothing to install
  • 15 most common failure types
  • Root cause + exact fix, not an essay
  • 7-day history
  • 1 seat

Your last NCCL timeout, diagnosed in 12 seconds.

Most popular

Team

$499/month

For teams training on up to 64 GPUs.

  • Everything in Free, plus:
  • Unlimited diagnoses
  • All 39 failure types + AI fallback for novel errors
  • Up to 64 GPUs monitored
  • Multi-rank cascade analysis: which rank failed first
  • Slack, email + iMessage/SMS alerts
  • Cross-run comparison (last 5 runs)
  • Team knowledge base: every confirmed fix, shared
  • 5 seats
  • 90-day history

$847 avg recovered per incident. Pays for itself on the first failure.

Scale

$2,499/month

For training infrastructure up to 512 GPUs.

  • Everything in Team, plus:
  • On-premise agent: logs never leave your cluster
  • Silent data corruption (SDC) detection
  • Straggler + gray failure detection
  • Zombie process detection + auto-kill
  • Per-layer checkpoint weight delta analysis
  • Checkpoint integrity validation before you resume
  • PyTorch × CUDA × cuDNN compatibility database
  • PagerDuty, webhooks + custom alert routing
  • Cross-run comparison (unlimited history)
  • Unlimited seats · 1-year history
  • Pre-flight cluster validation: block doomed runs before launch

One caught SDC incident saves a multi-day training run.

Data Center

$9,999/month

For GPU clouds and enterprise data centers. Custom contracts available.

  • Everything in Scale, plus:
  • Unlimited GPUs, multi-tenant deployment
  • White-label / OEM: offer diagnosis to your customers
  • SLURM, Ray + Kubernetes scheduler integration
  • Predictive failure scoring (early access)
  • Auto-remediation engine (early access)
  • Configurable PII/PHI log masking
  • Custom knowledge base ingestion
  • 99.9% uptime SLA with credits
  • SOC 2 Type II (in progress) · GDPR DPA · HIPAA BAA
  • Dedicated Customer Success Manager
  • Custom contracts, invoicing + procurement

Turn 'why did my job fail' tickets into self-service diagnoses.

Logs deleted after diagnosis · On-premise agent on Scale+ · Upgrade or cancel anytime

Your training logs contain your IP. We treat them that way.

PII/PHI Masking

Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
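As a rough illustration of what pattern-based masking looks like, here is a minimal Python sketch. It is not Denpex's actual engine: the patterns, the `mask_line` and `mask_log` helpers, and the medical record number format are all assumptions made for this example. Names are also left out, since detecting them reliably takes more than a regular expression.

```python
import re

# Hypothetical patterns for illustration only; not Denpex's actual rule set.
PATTERNS = [
    # Email addresses
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # US Social Security numbers written as 123-45-6789
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Assumed medical record number format: "MRN" followed by 6 to 10 digits
    re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.IGNORECASE),
]


def mask_line(line: str) -> str:
    """Replace every match of every pattern in one log line with [MASKED]."""
    for pattern in PATTERNS:
        line = pattern.sub("[MASKED]", line)
    return line


def mask_log(text: str) -> str:
    """Mask a multi-line log, one line at a time."""
    return "\n".join(mask_line(line) for line in text.splitlines())
```

In this sketch, a line such as `user alice@example.com ssn 123-45-6789` becomes `user [MASKED] ssn [MASKED]`, while an ordinary training log line with no matching pattern passes through unchanged.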

On-Premise Option

Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted, never raw logs.

Encryption

All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Data Center.

Compliance

Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on Data Center.

Frequently asked questions