Priced against your GPU bill, not your seat count
A single failure on a 64-GPU cluster wastes ~$847 in compute and an afternoon of engineering time. Every plan starts free, no credit card needed.
Free
Prove it on your last failure. No credit card.
- ✓3 diagnoses per month
- ✓Paste logs in the web UI, nothing to install
- ✓15 most common failure types
- ✓Root cause + exact fix, not an essay
- ✓7-day history
- ✓1 seat
Your last NCCL timeout, diagnosed in 12 seconds.
Team
For teams training on up to 64 GPUs.
- Everything in Free, plus:
- ✓Unlimited diagnoses
- ✓All 39 failure types + AI fallback for novel errors
- ✓Up to 64 GPUs monitored
- ✓Multi-rank cascade analysis: which rank failed first
- ✓Slack, email + iMessage/SMS alerts
- ✓Cross-run comparison (last 5 runs)
- ✓Team knowledge base: every confirmed fix, shared
- ✓5 seats
- ✓90-day history
$847 avg recovered per incident. Pays for itself on the first failure.
Scale
For training infrastructure up to 512 GPUs.
- Everything in Team, plus:
- ✓On-premise agent: logs never leave your cluster
- ✓Silent data corruption (SDC) detection
- ✓Straggler + gray failure detection
- ✓Zombie process detection + auto-kill
- ✓Per-layer checkpoint weight delta analysis
- ✓Checkpoint integrity validation before you resume
- ✓PyTorch × CUDA × cuDNN compatibility database
- ✓PagerDuty, webhooks + custom alert routing
- ✓Cross-run comparison (unlimited history)
- ✓Unlimited seats · 1-year history
- ✓Pre-flight cluster validation: block doomed runs before launch
One caught SDC incident saves a multi-day training run.
Data Center
For GPU clouds and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓Unlimited GPUs, multi-tenant deployment
- ✓White-label / OEM: offer diagnosis to your customers
- ✓SLURM, Ray + Kubernetes scheduler integration
- ✓Predictive failure scoring (early access)
- ✓Auto-remediation engine (early access)
- ✓Configurable PII/PHI log masking
- ✓Custom knowledge base ingestion
- ✓99.9% uptime SLA with credits
- ✓SOC 2 Type II (in progress) · GDPR DPA · HIPAA BAA
- ✓Dedicated Customer Success Manager
- ✓Custom contracts, invoicing + procurement
Turn 'why did my job fail' tickets into self-service diagnoses.
Logs deleted after diagnosis · On-premise agent on Scale+ · Upgrade or cancel anytime
Your training logs contain your IP. We treat them that way.
PII/PHI Masking
Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
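To make the masking step concrete, here is a minimal sketch of pattern-based redaction. It is not Denpex's actual engine: the two regular expressions, the `PATTERNS` table, and the `mask_line` / `mask_log` helpers are hypothetical, and real name or medical-record detection would need more than a regex.

```python
import re

# Hypothetical patterns for illustration only; the real rule set is not
# described on this page.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

MASK_TOKEN = "[MASKED]"


def mask_line(line: str) -> str:
    """Replace every match of every pattern in one log line with MASK_TOKEN."""
    for pattern in PATTERNS.values():
        line = pattern.sub(MASK_TOKEN, line)
    return line


def mask_log(text: str) -> str:
    """Mask a multi-line log before it leaves the local environment."""
    return "\n".join(mask_line(line) for line in text.splitlines())


if __name__ == "__main__":
    print(mask_line("user=jane.doe@example.com ssn=123-45-6789 rank=3"))
```

Running the masking on the client, before any network call, is what keeps matched content from leaving your environment; lines without a match pass through unchanged.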
On-Premise Option
Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted, never raw logs.
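One way to picture a "failure signature" is a hash of the error line with its run-specific values stripped out. The sketch below is an assumption about how such a signature could be built, not a description of the Denpex agent; `failure_signature` and its normalization rules are hypothetical.

```python
import hashlib
import re


def failure_signature(error_line: str) -> str:
    """Return a SHA-256 hex digest of an error line with volatile values removed.

    Hypothetical normalization: hex addresses and decimal numbers (ranks,
    timeouts, PIDs) become a placeholder, so the same kind of failure maps
    to the same signature across runs.
    """
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", error_line)
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = failure_signature("NCCL timeout on rank 3 after 1800s")
    b = failure_signature("NCCL timeout on rank 17 after 600s")
    print(a == b)  # same failure type, different run-specific values
```

Only the digest would be transmitted in this sketch; the raw line stays on the cluster. A hash alone is not a full anonymization guarantee, so treat this as an illustration of the idea.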
Encryption
All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Data Center.
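If you want your own tooling to refuse anything older than TLS 1.3 when talking to a remote endpoint, Python's standard `ssl` module can enforce that on the client side. This is a generic sketch, not Denpex client code; `make_tls13_context` is a hypothetical helper name.

```python
import ssl


def make_tls13_context() -> ssl.SSLContext:
    """Build a client-side context that only negotiates TLS 1.3 or newer."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Reject TLS 1.2 and older during the handshake.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


if __name__ == "__main__":
    ctx = make_tls13_context()
    print(ctx.minimum_version.name)
```

The context can be passed to standard-library clients such as `urllib.request.urlopen(url, context=ctx)`.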
Compliance
Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on Data Center.