
CASE STUDY: FROM SHUTDOWN TO 40TB DOWNLOADED

Wraith: Autonomous Extraction Protocol

A platform was sunsetting content I paid for. No export, no download button, encrypted streams behind Cloudflare. So I built an autonomous pipeline that authenticates, intercepts, and downloads - across multiple sites, VPNs, and machines.

John Pratt
April 15, 2026 · 7 min read
Personal project - active development
  • 71K+ Lines of Code
  • 41 Sites Supported
  • 51K+ Videos Downloaded
  • 3 Distributed Machines
  • 40+ Terabytes Transferred

Wraith demo

Performed by a trained professional on a closed system with permission from the platform owner, granted at my request before their planned sunset. Automated bypasses against systems you don't own can violate the CFAA, DMCA, and equivalent laws elsewhere - this is a case study, not a how-to. Check your jurisdiction and get authorization in writing first.

Background

When a video platform you're subscribed to announces they're sunsetting a content library, you've got two options: lose access to thousands of videos you've paid for, or build something to get them out before the lights go off. There was no bulk export, no download button, and the platform's DRM was purpose-built to make this impossible. We're talking encrypted HLS streams, Cloudflare bot detection, aggressive IP throttling, and authentication flows designed to stop exactly this kind of thing.

I needed a system that could authenticate like a human, navigate like a human, and download like a machine - across multiple sites, at scale, without getting banned.

The key goals:

  • Fully autonomous operation - login, navigate, extract, download, catalog - with a single command
  • Stealth browser automation that survives Cloudflare Turnstile and advanced fingerprinting
  • Multi-site support with per-site configuration, credentials, and download rules
  • VPN orchestration to rotate exit nodes and avoid IP-based throttling
  • Distributed execution across multiple machines sharing a central database
  • Graceful failure handling - rate limit backoff, partial download recovery, clean shutdown

Challenge

Every layer of the stack was actively fighting me:

Browser Detection. These sites use Cloudflare's Turnstile managed challenge pages with fingerprinting that detects standard Selenium immediately. It's not just navigator.webdriver - they check GPU renderers, hardware concurrency, screen dimensions, device memory, and behavioral timing patterns. A single inconsistency and the entire session is burned. The Turnstile widget itself lives inside a closed shadow root, so extracting the sitekey for third-party solving requires CDP injection before the page even loads.

Authentication Complexity. Each site has its own login flow - form-fill with email verification, age gates, cookie consent banners, intermediate "upgrade your plan" pages, and occasionally full MFA. Some sites silently ban accounts after too many automated attempts, and the only way back in is through an automated chatbot appeal flow that I also had to script.

Download Protection. Videos aren't served as MP4 files. They're HLS streams split into hundreds of encrypted .ts segments behind a master playlist. The m3u8 URL is only discoverable via network interception during browser playback - you can't just scrape it from the page source. Some CDNs start returning HTTP 429 (Too Many Requests) after a handful of downloads, requiring exponential backoff and server rotation.

IP Throttling at Scale. Downloading more than a few videos from the same IP triggers throttling or outright blocks. Each site needs its own dedicated VPN exit node, and those nodes can't be shared across concurrent sessions on different machines.

Solution

I built Wraith - a ~25,000-line Python system that orchestrates the full pipeline from VPN connection to file-on-disk.

Browser Stealth Layer

The foundation is SeleniumBase's UC (undetected ChromeDriver) mode, but stock UC mode wasn't cutting it. I built a hardware profile system on top of it that generates and persists a consistent browser fingerprint across sessions - screen resolution, CPU cores, device memory, GPU renderer - all randomizable but stable within a session. The browser presents as one specific machine to the site's fingerprinting, not a rotating set of bot signatures.

For Cloudflare Turnstile, I integrated 2captcha as a solver backend. The system uses Chrome DevTools Protocol to inject interception scripts via Page.addScriptToEvaluateOnNewDocument before the page loads, capturing Turnstile parameters as Cloudflare's own JavaScript initializes them - before the closed shadow root locks everything down.

VPN Orchestration

Each site has reserved Mullvad WireGuard servers defined in YAML config. A central coordination daemon running on a home server manages server allocation across machines using a lease-based system with automatic TTL expiry. When a scraping session starts, it queries the daemon for available servers, claims one, and the daemon blocks other sessions from using the same exit node. If a session crashes, the lease expires automatically after 30 minutes - no manual cleanup needed.
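
The lease idea is easy to show in isolation. What follows is a minimal sketch of in-memory lease tracking with TTL expiry, not Wraith's actual daemon: the class and method names, the server labels, and the injectable clock are all assumptions made for illustration, and the HTTP layer is left out entirely.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

LEASE_TTL_SECONDS = 30 * 60  # 30-minute expiry, matching the text above


@dataclass
class Lease:
    server: str
    holder: str
    expires_at: float


class LeaseTable:
    """In-memory lease tracking for a fixed pool of exit servers (hypothetical)."""

    def __init__(self, servers: List[str], ttl: float = LEASE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._servers = list(servers)
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._leases: Dict[str, Lease] = {}

    def _purge_expired(self) -> None:
        now = self._clock()
        expired = [s for s, lease in self._leases.items() if lease.expires_at <= now]
        for server in expired:
            del self._leases[server]

    def claim(self, holder: str) -> Optional[str]:
        """Lease the first free server to `holder`, or return None if all are taken."""
        self._purge_expired()
        for server in self._servers:
            if server not in self._leases:
                self._leases[server] = Lease(server, holder, self._clock() + self._ttl)
                return server
        return None

    def release(self, server: str, holder: str) -> bool:
        """Release a lease early; only the current holder may do so."""
        lease = self._leases.get(server)
        if lease is None or lease.holder != holder:
            return False
        del self._leases[server]
        return True
```

A crashed session never calls release(), so its server simply becomes claimable again once the TTL passes. A daemon serving several requests at once would also need a lock around claim() and release().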

Download Pipeline

Once authenticated and on a video page, the system monitors network traffic through Chrome's performance log, which carries CDP Network events, to intercept the m3u8 playlist URL in real time. It parses the master playlist for available bitrates, selects the highest resolution, then hands off to N_m3u8DL-RE for parallel segment downloading - 16 concurrent connections pulling .ts segments simultaneously, with ffmpeg remuxing on completion.
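
The variant-selection step on its own is plain HLS parsing. Here is a rough sketch, assuming a standard master playlist where each #EXT-X-STREAM-INF tag is followed by its variant URI; the function name and the sample URLs are made up for illustration, and a real parser would handle more attributes than RESOLUTION and BANDWIDTH.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urljoin

_RESOLUTION = re.compile(r"RESOLUTION=(\d+)x(\d+)")
# The lookbehind skips AVERAGE-BANDWIDTH, which would otherwise match too.
_BANDWIDTH = re.compile(r"(?<![A-Z-])BANDWIDTH=(\d+)")


def pick_best_variant(master_text: str, master_url: str) -> Optional[str]:
    """Return the absolute URL of the variant with the largest frame, then bandwidth."""
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    candidates: List[Tuple[int, int, str]] = []
    for i, line in enumerate(lines):
        if not line.startswith("#EXT-X-STREAM-INF:"):
            continue
        # The variant URI is the next line that is not a tag or comment.
        uri = next((ln for ln in lines[i + 1:] if not ln.startswith("#")), None)
        if uri is None:
            continue
        res = _RESOLUTION.search(line)
        bw = _BANDWIDTH.search(line)
        pixels = int(res.group(1)) * int(res.group(2)) if res else 0
        bandwidth = int(bw.group(1)) if bw else 0
        # Variant URIs may be relative to the master playlist's own URL.
        candidates.append((pixels, bandwidth, urljoin(master_url, uri)))
    if not candidates:
        return None
    return max(candidates)[2]
```

Sorting by pixel count first and bandwidth second is one reasonable reading of "highest resolution"; playlists that omit RESOLUTION fall back to bandwidth alone.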

If the CDN rate-limits mid-download, the system stores the m3u8 URL with an expiry timestamp in PostgreSQL and moves on. A background process retries these backoff URLs once the cooldown elapses - no manual intervention required.
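
The defer-and-retry logic can be sketched without a database. This is an illustration only: the delay constants, the DeferredUrl and BackoffQueue names, and the in-memory list standing in for the PostgreSQL table are all assumptions, and the real system may compute its cooldowns differently.

```python
from dataclasses import dataclass
from typing import List

BASE_DELAY = 60.0    # assumed first cooldown, in seconds
MAX_DELAY = 3600.0   # assumed ceiling on any single cooldown


def cooldown(attempt: int, base: float = BASE_DELAY, cap: float = MAX_DELAY) -> float:
    """Exponential backoff: base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


@dataclass
class DeferredUrl:
    url: str
    attempt: int
    retry_at: float  # timestamp after which the URL may be retried


class BackoffQueue:
    """In-memory stand-in for a table of rate-limited URLs (hypothetical)."""

    def __init__(self) -> None:
        self._items: List[DeferredUrl] = []

    def defer(self, url: str, attempt: int, now: float) -> DeferredUrl:
        item = DeferredUrl(url, attempt, now + cooldown(attempt))
        self._items.append(item)
        return item

    def pop_due(self, now: float) -> List[DeferredUrl]:
        """Remove and return every URL whose cooldown has elapsed."""
        due = [i for i in self._items if i.retry_at <= now]
        self._items = [i for i in self._items if i.retry_at > now]
        return due
```

A background loop would call pop_due() on a timer and re-defer anything that gets rate-limited again with attempt + 1. Adding random jitter to the cooldown is a common refinement when several workers back off at once.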

Distributed Execution

The system scales horizontally via a --distributed flag that activates an orchestrator. It SSHes into configured machines, each pulling from the same PostgreSQL database. URL-level locking prevents duplicate work, and each machine tracks downloads via a machine_name field. The orchestrator rotates work across machines based on configurable thresholds, distributing the IP footprint across independent VPN connections.
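
URL-level locking boils down to an atomic claim. The sketch below uses Python's built-in sqlite3 purely so it runs anywhere; the table layout, the status values, and the function names are assumptions, with only machine_name taken from the description above. On PostgreSQL, a common way to get the same effect under real concurrency is SELECT ... FOR UPDATE SKIP LOCKED inside a transaction.

```python
import sqlite3
from typing import Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical work table; the real schema is not shown in this post."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS urls (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               machine_name TEXT
           )"""
    )


def claim_next_url(conn: sqlite3.Connection, machine_name: str) -> Optional[str]:
    """Mark one pending URL as claimed by this machine, or return None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM urls WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        url_id, url = row
        # The status check in WHERE makes the claim a compare-and-set:
        # if another worker got there first, rowcount is 0.
        cur = conn.execute(
            "UPDATE urls SET status = 'claimed', machine_name = ? "
            "WHERE id = ? AND status = 'pending'",
            (machine_name, url_id),
        )
        return url if cur.rowcount == 1 else None
```

A caller that gets None because it lost the race (rather than because the table is empty) would simply try again.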

Configuration-Driven Design

Every site-specific behavior lives in a single site_configs.yaml - credentials, CSS selectors, download limits, VPN assignments, URL rewrite rules, branding templates, and CAPTCHA fallback strategies. Adding a new site means adding a YAML block, not writing code. Sites can inherit from parent configs and override only what differs, so an entire network of properties sharing a common player can be onboarded in minutes.
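
Parent/child inheritance of this kind usually comes down to a recursive merge. The sketch below works on plain dicts, as they might look after the YAML is loaded; the "inherits" key, the site names, and the field names are all hypothetical, since the real site_configs.yaml schema isn't shown here.

```python
import copy
from typing import Any, Dict, FrozenSet


def deep_merge(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Return parent overlaid with child; nested dicts merge key by key."""
    merged = copy.deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_site(configs: Dict[str, Dict[str, Any]], name: str,
                 _seen: FrozenSet[str] = frozenset()) -> Dict[str, Any]:
    """Resolve a site's effective config by walking its 'inherits' chain."""
    if name in _seen:
        raise ValueError(f"circular inheritance at {name!r}")
    site = configs[name]
    parent_name = site.get("inherits")
    own = {k: v for k, v in site.items() if k != "inherits"}
    if parent_name is None:
        return copy.deepcopy(own)
    parent = resolve_site(configs, parent_name, _seen | {name})
    return deep_merge(parent, own)


configs = {
    "network_base": {
        "selectors": {"player": "#player", "title": "h1"},
        "download_limit": 25,
    },
    "site_a": {"inherits": "network_base", "selectors": {"title": ".video-title"}},
}
resolved = resolve_site(configs, "site_a")
```

Here site_a overrides a single selector and picks up everything else from network_base, which is the "override only what differs" behavior described above.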

Results

Topic: Building Autonomous Systems That Operate in Adversarial Environments

Importance & Risks:

  • Bot detection is an arms race, not a solved problem. Cloudflare updates fingerprinting signatures regularly. Chrome patches the CDP tricks that UC mode relies on. Every bypass has a shelf life, and a rigid system breaks the moment one component gets detected.
  • Coordination failures are silent killers. Before the VPN daemon existed, I was manually tracking which server was in use on which machine. Collisions caused mysterious download failures that took hours to debug - the kind of bug that looks like a CDN issue but is actually two sessions fighting over the same exit IP.
  • Scale amplifies every flaw. A login flow that works 95% of the time fails on roughly one video in 20. At 50 videos a day across 3 machines, that's multiple manual interventions daily. Every edge case needs handling or the system can't run unattended.
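
The arithmetic behind the 95% point is worth spelling out. This back-of-the-envelope sketch assumes each video fails independently and treats 50 as the daily total across all machines; both are simplifications, and the function names are just for illustration.

```python
def expected_failures(videos_per_day: int, success_rate: float) -> float:
    """Average number of failed videos per day."""
    return videos_per_day * (1.0 - success_rate)


def prob_clean_day(videos_per_day: int, success_rate: float) -> float:
    """Chance that every video in a day succeeds, assuming independent attempts."""
    return success_rate ** videos_per_day


failures = expected_failures(50, 0.95)   # about 2.5 failures per day
clean = prob_clean_day(50, 0.95)         # under 8% chance of a failure-free day
```

Pushing the success rate to 99.9% under the same assumptions drops the expected failures to about one every 20 days, which is closer to what "unattended" needs.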

My Advice (from the trenches):

  • Make it modular or die maintaining it. The Turnstile solver, the fingerprinting layer, the network interception - they're all pluggable. When Cloudflare breaks one approach, the rest of the system keeps running while I swap in a fix. In my experience, the projects that survive long-term are the ones where you can replace any single component without touching the rest.
  • Put a coordinator in front of shared resources immediately. A simple HTTP daemon with in-memory lease tracking eliminated an entire class of VPN collision bugs. If multiple processes need to share anything - servers, database connections, file locks - centralize the coordination even if it feels like overkill for two machines. You'll be at four before you know it.
  • PostgreSQL is criminally underrated as an application database. I started with SQLite because it was simpler. The migration to Postgres paid for itself within a week. pg_trgm for fuzzy duplicate detection, connection pooling for distributed access, transactional URL locking to prevent race conditions - these aren't exotic features, they're table stakes for any system that coordinates across machines.
  • Build for observability from day one. The demo mode redaction system started as a one-off feature for recording a GIF, but it forced me to audit every log statement in the codebase. In my experience, the act of making logs safe to share is what makes them actually useful for debugging - you strip the noise and keep the signal.
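
On the observability point, log redaction fits naturally into Python's standard logging filters. A minimal sketch, assuming redaction by regular expression: the patterns, the placeholder strings, and the class name are illustrative and far less thorough than what a real demo mode would need.

```python
import logging
import re

# Illustrative patterns only; a real redactor would cover site names and more.
_PATTERNS = [
    (re.compile(r"https?://[^\s\"']+"), "[URL]"),
    (re.compile(r"(?i)\b(password|cookie|token)=[^\s&;]+"), r"\1=[REDACTED]"),
]


class RedactingFilter(logging.Filter):
    """Rewrite each record's message so sensitive values never reach a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # apply %-style args before redacting
        for pattern, replacement in _PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = None  # args are already folded into msg
        return True  # keep the record, just with a cleaned message
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) covers every record that handler emits, which is usually safer than relying on each call site to sanitize its own arguments.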

Conclusion

Wraith has processed thousands of videos across multiple platforms, fully autonomously. A single command connects the VPN, launches a stealth browser, authenticates, navigates to pending URLs, intercepts HLS streams, downloads in parallel, embeds metadata and thumbnails, catalogs everything in PostgreSQL, and cleans up on exit. Graceful shutdown on Ctrl+C releases VPN leases, removes partial downloads, and disconnects cleanly. The demo mode shown above redacts all sensitive data from logs - site names, URLs, credentials, cookies - so the system can be showcased without exposing anything.
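
The clean-shutdown behavior follows a familiar pattern: register cleanup steps up front, then let Ctrl+C unwind them. Here is a small sketch using contextlib.ExitStack; the run_session name and the cleanup callables are hypothetical, and the order shown is just ExitStack's last-in, first-out default, not necessarily Wraith's.

```python
from contextlib import ExitStack
from typing import Callable, List


def run_session(work: Callable[[], None],
                cleanups: List[Callable[[], None]]) -> str:
    """Run `work`, then always run cleanups in reverse registration order."""
    with ExitStack() as stack:
        for cleanup in cleanups:
            stack.callback(cleanup)
        try:
            work()
        except KeyboardInterrupt:
            # Ctrl+C raises KeyboardInterrupt in the main thread; leaving the
            # `with` block still triggers every registered callback.
            return "interrupted"
    return "completed"
```

Registering each cleanup right after its resource is acquired means a failure halfway through startup only unwinds what was actually set up.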

If you're building automation that has to operate in an environment that's actively trying to stop you - whether that's bot detection, rate limiting, or distributed coordination - the playbook is the same: make it modular, make it observable, and make it recover from failure without human intervention. That's the difference between a script and a fully autonomous system.

John Pratt

Founder, Pratt Solutions · Previously at Northern Trust, Duke Energy, Capital One

Built enterprise systems at Northern Trust, Duke Energy, and Capital One. Now freelancing and building tools that solve hard problems at scale.

© 2026 John Pratt. All rights reserved.