9 LTE Modems, 2 Raspberry Pis, and a Headless Chromium Clicking Reboot 4,000 Times

A production residential proxy pool — nine LTE modems, two Raspberry Pis, Playwright as the rotation engine, and the IP economics that make it work.

May 31, 2026

The fleet: nine TP-Link M7000/M7200 LTE modems across two Raspberry Pis (one Pi 4, one Pi 5), connected via a Terminus 7-port USB 2.0 hub. Each modem is physically labeled with its subnet identifier, which is the only way to keep modem identity stable across reboots, because the firmware reports an identical fake serial number for every device.

The setup, in one paragraph

Two Raspberry Pis run a pool of nine TP-Link MiFi LTE routers as a rotating residential-IP proxy fleet for my AI products. Each modem appears on its Pi as its own USB network interface; each modem has its own GOST HTTP-proxy instance bound to that interface; clients reach the pool over WireGuard. On a loop with a five-minute floor (ten to twelve minutes per modem in practice), a headless Chromium instance supervised by pm2 logs into each modem's web admin UI, clicks Reboot, watches it bounce, and verifies through ipify that the carrier handed it a new public IP. When something fails, an escalating recovery ladder takes over: nmcli down/up, then a USB power-cycle by toggling the device's sysfs authorized file, then a host reboot, if that step were enabled. It isn't. The fleet has been running for nearly four weeks straight with zero throttling and zero crashes, having completed roughly 4,000 reset cycles per node and pushed close to a terabyte through the WireGuard tunnel.

That's the system. The rest of this post explains the parts of it that I find interesting after running it in production: a few architectural decisions that turned out smaller than I expected they would be, and a few details that turned out larger.

What this solves

The setup is built around three properties: residential IPs that anti-bot systems can't reliably block, consumer-hardware cost economics, and high availability without a managed service or a complex cluster.

The first comes from cellular CGNAT. Mobile carriers hide individual subscribers behind dynamic pools of public IPs that anti-bot vendors are structurally reluctant to ban — too many real users live there. Run nine independent SIMs across two carriers and rotate their public addresses every twelve minutes or so, and you have an IP pool that anti-bot mitigation can't fit into its usual "one IP, one identity, one rate limit" model. In practice, individual modems get rate-limited constantly — that's what triggers the rotation in the first place — but the pool as a whole has never been blocked. Querying public sites, Google included, is functionally unblockable from this setup.

The second comes from refusing to pay residential proxy vendors per gigabyte. Two Raspberry Pis and nine consumer MiFi modems amortize against any comparable commercial residential proxy service in well under a year. The dominant ongoing cost is the SIMs themselves, which I run on flat-rate consumer plans.

The third is the cheapest property to get right: it falls out of running the pool across two Pis. If one node crashes, gets a kernel update, or needs hardware maintenance, the other carries the load with four or five modems instead of nine. It's high availability through redundancy, not through coordination — no cluster manager, no shared state, no failover orchestration. Each Pi is independently operational. The cost is one extra Raspberry Pi.

Why this exists

I build AI products. Two of them — Advanty and Discury — consume large amounts of public web data: ad creative inventories, product listings, social discussion threads. The pattern is familiar to anyone doing this kind of work in 2026: most of the high-value pages are behind Cloudflare, Datadome, or Akamai bot-mitigation layers that flag datacenter IPs immediately. You either pay one of the commercial residential proxy vendors at $X per gigabyte, or you find another way to put requests on IPs that look organically residential.

Cellular carriers, it turns out, are the other way. Mobile traffic in most countries hides individual subscribers behind carrier-grade NAT (CGNAT), and the pool of public IPs that CGNAT rotates over is exactly the kind of "residential" address space anti-bot vendors are reluctant to ban. They can't — too many real users sit on the same /16 for a few minutes at a time. From the bot's perspective, this is the cleanest IP space money can buy.

The economics flip if you can control your own LTE modems. A SIM with a flat-rate or sufficiently generous data plan costs in the low double-digit euros per month (roughly 300 to 500 CZK in the Czech market). A consumer MiFi modem is a one-time cost in the tens of euros. Once you can force the modem to reconnect to the carrier on demand, and the carrier hands you a new CGNAT-assigned IP each time, you have your own residential proxy node. Repeat nine times across two carriers, glue it together, and you have a pool.

That's the entire thesis. The interesting part isn't the idea; it's the operational details of making it actually work in production over weeks at a time, on hardware that wasn't really designed for it.

A few rough numbers for context. The commercial residential proxy market in 2026 prices traffic at roughly $3 to $15 per gigabyte depending on vendor, tier, and IP type — Bright Data, Oxylabs, Smartproxy and the rest of the established names cluster around that range. A workload pushing meaningful volume — say, 1 TB per month, which matches what the larger of my two Pi nodes alone handles — would invoice at four to five figures monthly through that channel.

The equivalent operating cost on this setup is dominated by nine consumer SIM plans. Czech flat-rate or high-FUP mobile data plans in the relevant tier run between roughly 300 and 500 CZK per SIM per month, and comparable plans in most European markets sit in the same ballpark. Total monthly operating cost for the SIMs is in the low thousands of CZK, or the low hundreds of euros at current rates. Measured against commercial per-gigabyte pricing at that volume, the hardware pays for itself in well under a year.

The savings number depends heavily on workload size and which commercial provider you'd otherwise pay. For volumes large enough to make the build worth the operational overhead (which is real; this isn't a zero-maintenance system), the gap at the 1 TB/month level is one to two orders of magnitude, not a percentage. For small workloads, the math doesn't favor building this; just pay one of the commercial vendors.
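The comparison can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the price ranges quoted above; the exchange rate (25 CZK per euro) and the decimal reading of 1 TB as 1,000 GB are assumptions made for illustration.

```python
# Rough monthly cost comparison from the figures quoted in the text.
# CZK_PER_EUR and GB_PER_TB are illustrative assumptions, not billing data.
VENDOR_USD_PER_GB = (3, 15)      # commercial residential proxy price range
SIM_CZK_PER_MONTH = (300, 500)   # per-SIM plan price range
SIM_COUNT = 9
GB_PER_TB = 1000
CZK_PER_EUR = 25.0


def vendor_cost_usd(volume_gb):
    """Return the (low, high) vendor invoice for a monthly volume in GB."""
    low, high = VENDOR_USD_PER_GB
    return volume_gb * low, volume_gb * high


def sim_cost_czk(sim_count=SIM_COUNT):
    """Return the (low, high) monthly SIM bill in CZK."""
    low, high = SIM_CZK_PER_MONTH
    return sim_count * low, sim_count * high


vendor_low, vendor_high = vendor_cost_usd(1 * GB_PER_TB)
sims_low, sims_high = sim_cost_czk()
print(f"vendor, 1 TB/month: ${vendor_low:,} to ${vendor_high:,}")
print(f"nine SIMs: {sims_low:,} to {sims_high:,} CZK "
      f"(about {sims_low / CZK_PER_EUR:.0f} to {sims_high / CZK_PER_EUR:.0f} EUR)")
```

Under those assumptions the SIM bill comes to 2,700 to 4,500 CZK against a vendor invoice of $3,000 to $15,000 for the same terabyte.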

A clarifying note on the carriers' side: this setup runs on consumer SIM plans, used for my own AI products' data ingestion. It is not a commercial resale operation, and the modems are not being rented or sublet to third parties. Czech operators treat that distinction carefully, as do operators in most jurisdictions, and the line between "personal/business use of multiple SIMs" and "commercial proxy resale" is the line where their tolerance ends. I stay on the right side of it.

The fleet

Two nodes, both running Debian/Raspbian Trixie on the latest kernel:

Node 1 (proxy1) — a Raspberry Pi 4 Model B with 4 GB RAM. Four TP-Link M7200 modems plugged into an internal powered USB hub (a VIA Labs four-port).
Node 2 (proxy2) — a Raspberry Pi 5 with 8 GB RAM. Five modems: one TP-Link M7000 plugged directly into a USB-3 port, plus four M7200s hanging off a seven-port USB 2.0 hub (a Terminus Technology FE 2.1).

Both Pis sit in Argon ONE cases with the fan controller daemon running. Power comes from the standard official USB-C supplies. The modems get all their power over USB, from the Pis — no external supplies anywhere in the fleet.

This last detail surprised me when I first built it out. Conventional wisdom says you cannot power nine LTE modems off two Raspberry Pis through their USB ports without running into undervolt or current-draw issues. In practice — after nearly four weeks of continuous operation under real production load — vcgencmd get_throttled returns 0x0 on both nodes, and CPU temperatures sit at 51°C and 45°C respectively. There's been zero throttling, zero undervolt warnings in dmesg, zero modem brownouts under the watchdog's recovery ladder.

I want to be careful here: this is not a recommendation that USB power for nine modems is fine in general. It's a report that for these specific modems (TP-Link M7000/M7200, which idle low and don't peak hard during transmit) on these specific PSUs (the official Pi supplies, which have generous headroom), through these specific hubs (one of them properly powered, the other passive but only ever serving full-speed USB), it's been working. Power delivery is the single most fragile assumption in a setup like this, and I'd watch it carefully if I were building from scratch today.
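Since power is the assumption to watch, it helps to decode the throttle flag rather than eyeball it. The sketch below parses the `throttled=0x...` line that vcgencmd get_throttled prints and names the set bits. The bit meanings follow the Raspberry Pi documentation as I understand it; the function name is made up for this example.

```python
import re

# Bit positions in the get_throttled value (per the Raspberry Pi docs).
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Turn 'throttled=0x50005' into a list of human-readable flags."""
    match = re.search(r"throttled=(0x[0-9a-fA-F]+)", output)
    if match is None:
        raise ValueError(f"unexpected vcgencmd output: {output!r}")
    value = int(match.group(1), 16)
    return [label for bit, label in THROTTLE_FLAGS.items() if value & (1 << bit)]
```

An empty list corresponds to the 0x0 reading reported above; anything in the "has occurred" range means a brownout happened at some point since boot, even if the current state looks clean.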

Carrier split

Seven of the nine SIMs run on O2 (the Czech mobile operator); the remaining two run on T-Mobile. The split isn't ideological — it's about diversifying CGNAT pools. If I were running nine modems on one carrier, I'd be drawing from a single public IP block and the rotation would feel less rotation-y in practice (more on this in a moment). Two carriers means two independent IP pools, which roughly doubles the effective IP variety you see in any given hour.

The choice of M7200 vs. M7000 is also carrier-driven. The M7000 is an older modem revision, and one specific O2 prepaid tariff plays better with that hardware than the M7200 firmware does. It's a minor compatibility quirk I worked around by keeping one M7000 in the fleet. If you're building a similar setup elsewhere, expect to discover one or two of these tariff/firmware idiosyncrasies during the first month — it's the kind of thing that has no documentation and only shows up in practice.

A second carrier-side detail worth mentioning: most flat-rate or high-FUP mobile plans have soft throttling triggers somewhere — sustained heavy hotspot/tethering use will eventually attract reduced speeds, especially on the cheaper plans. The rotation behavior helps here too: from the carrier's perspective, no single SIM is sustaining unusual throughput, because the rotation period is short enough that each SIM's sliding-window data usage stays within normal-looking bounds. I monitor per-SIM monthly data usage and rotate SIM-plan tiers if any single modem starts pushing toward its FUP ceiling. So far, this hasn't required intervention on the current plan mix.

Why two Pis instead of one

The honest reason: I started with one Pi and outgrew it. The accidental reason it's now a permanent design choice: redundancy. If either Pi goes down — for a kernel update, an SD card replacement, a moved cable — the other one continues serving requests with its share of the modems. Clients see reduced pool capacity, never a full outage. There is no orchestration here, no leader election, no shared state. Each Pi runs its own independent watchdog and its own copy of the rotation engine. The pool degrades gracefully because there's nothing to fail in the connection between the two nodes; they are independent service units that happen to be addressable as a single fleet from the WireGuard side.

This is the cheapest high-availability story I've ever shipped. The cost is one extra Raspberry Pi.

The architecture surprise: no policy routing

Most tutorials I read while building this setup do per-interface egress with Linux's policy routing: a separate routing table per modem, ip rule directives that match traffic by fwmark or source IP, and either a tc classifier or per-listener iptables rules to tag traffic correctly. It's a clean, well-understood pattern. It's also a stack of moving parts, each of which can break independently.

This setup uses none of it.

$ ip rule
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

That's a stock Debian routing configuration: no rules added, no custom tables defined. The main table just contains one default route per attached modem plus the Pi's own Ethernet default, distinguished only by their metric value:

default via 192.168.1.1   dev eth0     metric 10     ← the Pi's own default
default via 192.168.12.1  dev router1  metric 100
default via 192.168.14.1  dev router2  metric 200
default via 192.168.15.1  dev router3  metric 300
…
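For traffic that is not bound to a device, the kernel chooses among default routes by lowest metric. A small sketch that parses a listing like the one above and picks the winner; the sample text mirrors the routes shown, and the regular expression is an assumption about how `ip route` formats its output.

```python
import re
from typing import NamedTuple


class DefaultRoute(NamedTuple):
    gateway: str
    device: str
    metric: int


ROUTE_RE = re.compile(r"^default via (\S+)\s+dev (\S+).*?\bmetric (\d+)")


def parse_default_routes(text):
    """Extract every default route that carries an explicit metric."""
    routes = []
    for line in text.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            routes.append(DefaultRoute(match.group(1), match.group(2), int(match.group(3))))
    return routes


def preferred(routes):
    """The route unbound traffic takes: the one with the lowest metric."""
    return min(routes, key=lambda route: route.metric)


SAMPLE = """\
default via 192.168.1.1 dev eth0 metric 10
default via 192.168.12.1 dev router1 metric 100
default via 192.168.14.1 dev router2 metric 200
default via 192.168.15.1 dev router3 metric 300
"""

print(preferred(parse_default_routes(SAMPLE)).device)
```

The modem routes exist only so that each interface has a usable gateway; nothing ever wins the metric comparison against eth0.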

Outbound traffic from the Pi itself takes the lowest-metric default (eth0). So how does a request to TCP port 8081 actually leave through modem 2's LTE connection, and not just take the Ethernet default? The answer is that it doesn't rely on routing at all. The work happens one layer up, in GOST, which is the HTTP proxy server I'm running on each port:

# /etc/systemd/system/gost-router2.service
[Unit]
Description=GOST proxy for router2
After=network.target
 
[Service]
ExecStartPre=/bin/bash -c 'for attempt in {1..30}; do \
  ip addr show router2 | grep -q "inet " && break; sleep 2; done'
ExecStart=/usr/local/bin/gost -L=http://0.0.0.0:8081?interface=router2
Restart=always
RestartSec=10
User=hilgard
 
[Install]
WantedBy=multi-user.target

The magic is the ?interface=router2 query parameter on the listener URL. GOST interprets this as a SO_BINDTODEVICE instruction: any outbound socket the proxy opens to fulfill a request on port 8081 gets bound to the router2 network interface, which means egress is pinned to modem 2's LTE connection — regardless of what the routing table says.

SO_BINDTODEVICE is a fairly old Linux feature, but most HTTP proxy software doesn't expose it as a configuration option. GOST does, and that single design decision collapses an entire layer of complexity. There is no fwmark. There is no per-interface routing table. There is no source-IP rule. There is one line of systemd configuration per modem, and the egress goes where you tell it.
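To show what that socket option does in isolation, here is a minimal Python sketch of opening an outbound TCP connection pinned to one interface. It is not GOST's code, only an illustration of the same Linux mechanism; the helper names are invented, and setting the option typically needs elevated privileges (root or CAP_NET_RAW).

```python
import socket

IFNAMSIZ = 16  # Linux interface names are at most 15 bytes plus a NUL


def interface_option(ifname):
    """Encode an interface name as the NUL-terminated bytes the option expects."""
    raw = ifname.encode("ascii")
    if not 0 < len(raw) < IFNAMSIZ:
        raise ValueError(f"invalid interface name: {ifname!r}")
    return raw + b"\0"


def connect_via(ifname, host, port, timeout_s=10.0):
    """Open a TCP connection whose egress is pinned to one interface (Linux only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        interface_option(ifname))
        sock.settimeout(timeout_s)
        sock.connect((host, port))
    except OSError:
        sock.close()
        raise
    return sock
```

A call such as `connect_via("router2", "api.ipify.org", 80)` would leave through modem 2 even though eth0 holds the lowest-metric default route.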

The first time I got this working, I spent twenty minutes looking for the catch. There wasn't one.

How clients reach a specific modem

GOST listens on 0.0.0.0:808x (a different port per modem), and the Pi has its WireGuard interface (wg0) bound to 10.66.66.31/32 on proxy1 and 10.66.66.32/32 on proxy2. Both addresses live on the same 10.66.66.0/24 WireGuard overlay; the WireGuard hub routes traffic between client peers and the two proxy nodes.

So a client that wants to send a request out through, say, modem 3 on the Pi 4 opens an HTTP CONNECT to http://10.66.66.31:8082. The connection arrives at proxy1 over WireGuard, hits the GOST instance bound to router3 via SO_BINDTODEVICE, and the egress goes out through router3's LTE connection. The "address" of a specific modem from a client's perspective is just (WireGuard node, port); there's no separate routing layer above WireGuard's standard peer-to-peer routing. Adding a new modem means one new systemd unit and one new port; adding a new Pi means one new WireGuard peer, after which clients have another node:port range to choose from. The complexity stays flat as the fleet grows.
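From the client side, picking a modem is just picking a proxy URL. A minimal sketch, assuming the port numbering implied by the examples in this post (router2 on 8081, router3 on 8082, so router N on 8080 + N - 1); the node table and function names are illustrative.

```python
import urllib.request

# WireGuard addresses of the two nodes, as given in the text.
NODES = {"proxy1": "10.66.66.31", "proxy2": "10.66.66.32"}
BASE_PORT = 8080  # assumption: router1 -> 8080, router2 -> 8081, ...


def proxy_url(node, router):
    """Return the HTTP proxy URL for one modem on one node."""
    if router < 1:
        raise ValueError("router numbers start at 1")
    return f"http://{NODES[node]}:{BASE_PORT + router - 1}"


def opener_for(node, router):
    """Build a urllib opener that sends HTTP and HTTPS through that modem."""
    proxy = proxy_url(node, router)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

With the tunnel up, `opener_for("proxy1", 3).open("https://api.ipify.org", timeout=15).read()` should return the public IP currently held by modem 3 on the Pi 4.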

The ExecStartPre loop is there because modems take time to come up after a reboot, and binding to a not-yet-DHCP-ed interface would fail. The loop waits up to sixty seconds for router2 to acquire an IP before letting GOST start.

If I had to nominate the single architectural decision in this setup that I'd carry forward to any future project, it would be this one: when application-layer software can express what you need natively, you don't need to push the problem down to the network stack. The result is a setup that someone unfamiliar with policy routing can read and understand in five minutes, instead of an hour.

The rotation engine: Playwright clicks Reboot

This is the part of the setup that, when I describe it to other engineers, gets either a laugh or a "wait, what?" reaction. Both are appropriate.

To rotate the public IP behind a given modem, you need to force the modem to disconnect from the carrier and reconnect. The carrier will then assign a (likely) different CGNAT-mapped IP. The TP-Link M7200 and M7000 expose this functionality through their web admin UI as a "Reboot" button.

There is no API. There is no AT command interface accessible to userspace without driver work I didn't want to do. There is a web page at http://192.168.X.1/ with a login form, an admin password, and a Reboot button buried three menu levels deep. The button does what it says.

So the rotation engine is a Playwright script — running headless Chromium, supervised by pm2 — that logs into each modem's admin UI on a loop, clicks Reboot, and waits for the modem to come back up. The repository is around 4,000 lines of TypeScript and JavaScript at this point, most of which is failure-handling.

The six-step cycle

The core function, resetIPForRouter, runs the same six steps for every modem:

1. Login. Navigate to http://192.168.X.1, fill the password field, click the Login button. (The admin UI RSA-encrypts the password client-side before submission, which is why there's a node-rsa dependency in the project.) Up to five retries with backoff.
2. Get the current external IP. Send a request through the modem's own proxy port (http://localhost:808N → api.ipify.org). This is the "before" snapshot for the verification step at the end.
3. Click Reboot. Navigate to the Advanced → Device → Shutdown page, click a.shutDownBtn.reboot, confirm in the #shutdownOK dialog. The page redirects to a "rebooting" status screen.
4. Watch it bounce. Poll the admin URL until it stops responding (up to 30 seconds, this is the modem actually rebooting), then poll until it answers again (up to 90 seconds, this is the modem reconnecting to the carrier).
5. Re-login. Same login flow as step 1, but the modem can be slow to fully boot, so the retry budget here is generous.
6. Verify. Re-query api.ipify.org through the proxy port. Check that the new external IP differs from the one captured in step 2. Up to five attempts with ten seconds between each, because the modem can sometimes reconnect with a brief gap before its proxy is fully ready.
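The two polling steps (watch it bounce, then verify) can be sketched independently of the browser automation. This is a Python illustration rather than the TypeScript the real script is written in. The timeouts and retry counts are the ones listed above; the one-second polling interval and all function names are assumptions.

```python
import time
import urllib.error
import urllib.request


def admin_is_up(url, timeout_s=3.0):
    """True if the modem's admin page answers at all (any HTTP status counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout_s).close()
        return True
    except urllib.error.HTTPError:
        return True   # the web server answered, even if with an error status
    except OSError:
        return False  # refused, unreachable or timed out


def wait_for(condition, timeout_s, interval_s=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is true or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def watch_bounce(is_up, down_timeout_s=30.0, up_timeout_s=90.0, **timing):
    """Wait for the admin UI to go away, then for it to come back."""
    if not wait_for(lambda: not is_up(), down_timeout_s, **timing):
        return False  # the modem never went down: the reboot did not take
    return wait_for(is_up, up_timeout_s, **timing)


def verify_rotation(get_ip, old_ip, attempts=5, delay_s=10.0, sleep=time.sleep):
    """Return the new external IP once it differs from old_ip, else None."""
    for attempt in range(attempts):
        try:
            new_ip = get_ip()
        except OSError:
            new_ip = None  # proxy not ready yet
        if new_ip and new_ip != old_ip:
            return new_ip
        if attempt < attempts - 1:
            sleep(delay_s)
    return None
```

In use, `is_up` would be something like `lambda: admin_is_up("http://192.168.14.1/")` and `get_ip` a function that fetches api.ipify.org through the modem's proxy port. Passing `clock` and `sleep` in makes both loops testable without waiting two minutes per run.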

Every step emits a Socket.IO event to a live monitoring dashboard (more on that in a moment). The full cycle for one modem takes about two and a half minutes on a good day; if there's a recovery escalation, longer.

Live observability

Every action the rotation script takes — login attempt, reboot click, IP comparison, recovery escalation — streams as a Socket.IO event to a small Express + Vite dashboard running on each Pi. The dashboard is deliberately unfancy: per-modem reset progress widget, aggregated cycle summary, recent external IPs with old → new diffs, and the Pi's current temperature. Nothing exotic. The point is that when something goes wrong I can see exactly which modem failed at which step, in which cycle, without grepping logs.

Mid-cycle on proxy1: Router 2 in the login step (1/6) of its reset, while Router 1 has already completed a verified IP rotation (37.188.222.132 to 37.188.146.126). Routers 3 and 4 are queued. CPU temperature reads 49.1°C. Three modems serve traffic while one rotates, which is the natural-HA point in motion.

The dashboard runs as a separate pm2 process and stays up across the rotation script's restarts. That separation matters: the rotation script can crash, hang, or be killed by pm2's autorestart, and the dashboard keeps serving the last known state and continues receiving events from any other script instance that takes over. Observability and operation are decoupled, which is what observability is supposed to be.

The pm2 anti-pattern that works

Here's a detail that confused me when I first looked at pm2's status output: the rotation script shows ~4,000 restarts in 25 days, with an uptime of seconds to minutes. The dashboard script next to it shows zero restarts and 25 days of uptime. Naively, this would suggest the rotation script is crashing constantly and the dashboard is fine.

Only the second half is true. The rotation script isn't crashing; it is intentionally exiting at the end of each cycle. The main function calls process.exit() after completing a full sweep of the node's modems: exit code 0 if every modem reset cleanly, exit code 1 if any modem failed. pm2 sees the exit, respects its autorestart: true configuration, and immediately launches a fresh process to begin the next cycle.

This is not how pm2 is typically used. The conventional pattern is a long-running event loop where the script lives for days. Here, every cycle is a fresh Node process with a fresh Chromium instance and a clean memory state. The "restart count" is effectively a cycle counter. It works because:

Chromium is restarted cleanly every cycle, which prevents the slow memory growth I'd otherwise need to monitor.
A bug or stuck state in one cycle can't compound across cycles — the next cycle gets a fresh slate.
pm2's restart policy gives me a free supervisor for "did the cycle finish?" — if Node crashes or hangs, pm2 restarts it anyway.

I wouldn't reach for this pattern by default in greenfield code, but it composes well with what Playwright needs (a clean browser context) and what pm2 gives you for free (supervision + automatic recovery). The 4,000-restart count in the pm2 dashboard looks alarming and is actually the healthy signal.
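The shape of the pattern is small enough to show in a few lines. This is a Python rendering for illustration only (the real script is Node); `run_sweep` and `reset_one` are hypothetical names, and catching exceptions per modem is my assumption about how one failure is kept from aborting the sweep.

```python
import sys


def run_sweep(modems, reset_one):
    """Reset every modem once; return the process exit code for the cycle."""
    failed = []
    for modem in modems:
        try:
            ok = reset_one(modem)
        except Exception as exc:  # one bad modem must not abort the sweep
            print(f"{modem}: reset raised {exc!r}", file=sys.stderr)
            ok = False
        if not ok:
            failed.append(modem)
    return 1 if failed else 0


def main(modems, reset_one):
    """Run exactly one cycle, then exit and let the supervisor start the next."""
    sys.exit(run_sweep(modems, reset_one))
```

There is deliberately no loop around the sweep: the supervisor's restart policy is the loop.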

The dashboard, by contrast, is a long-running Express server with Socket.IO listeners — that one is supposed to stay up, and it has, for the entire 25-27 day window.

If you ever inherit this setup, please do not "fix" the restart counter.

The watchdog: why "just reboot" isn't enough

The web-UI-click-Reboot mechanism handles the happy path. Everything else needs the watchdog.

LTE modems are not server-grade hardware. They drop carrier connections without warning. Their internal firmware leaks memory and freezes. Their USB enumeration occasionally goes sideways for no reason that ever shows up in dmesg. NetworkManager sometimes loses track of a modem that was working five minutes ago. Over the course of weeks, every individual mode of failure happens to at least one modem in the fleet, and a setup that only knows how to reboot will eventually wedge itself.

Before each rotation cycle, the script runs checkAndRecoverUSBInterface for every modem. This is an escalating ladder; here is what it does in each branch:

The interface isn't enumerated at all. This means the kernel doesn't see the modem on the USB bus — it could be unplugged, dead, or in a state where the USB controller didn't enumerate it on boot. The script logs this, emits a system:reboot event to the dashboard, and stops there. A full host reboot would resolve most of these cases, and the original code did call sudo reboot here. I commented it out: the watchdog is reliable enough at the other recovery branches that a host reboot turned out to be unnecessary, and the small risk of a boot loop (if the underlying problem persists across reboots) wasn't worth the residual benefit. The setup runs continuously now without ever needing a full restart.

The interface is present but has no IP address. Most often this means NetworkManager hasn't completed DHCP, or the modem reconnected to the carrier but didn't re-up the local connection. The fix is mechanical: nmcli connection down "Router N", wait, nmcli connection up "Router N", wait again. If the connection comes back with an IP, restart that modem's GOST service (systemctl restart gost-routerN) so it picks up the fresh network state and we're done.

nmcli itself fails. Now the modem is in a state NetworkManager can't reason about. The escalation is a USB power-cycle: toggle the authorized sysfs file on the modem's USB device (echo 0 > .../authorized, wait three seconds, echo 1 > .../authorized), wait ten seconds for re-enumeration plus DHCP, then restart GOST. This is a hard reset at the USB layer, slightly more aggressive than a modem reboot, and it recovers a category of failures that a soft reboot cannot.

The actual command, against a modem at USB topology path 1-2.4:

echo 0 > /sys/bus/usb/devices/1-2.4/authorized \
  && sleep 3 \
  && echo 1 > /sys/bus/usb/devices/1-2.4/authorized

That's the entire power-cycle. The kernel sees the device drop off the bus, then re-enumerate from scratch a few seconds later — same as physically unplugging and reseating it. The script wraps this with a NetworkManager reconnect afterward, because a freshly re-enumerated modem starts without any active connection profile.
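The same toggle, written as a function that can be called from a recovery script. A minimal sketch: the path validation and the injectable `sysfs_root` and `sleep` parameters are additions for safety and testability, not something the original script is known to do, and writing to the real sysfs file requires root.

```python
import re
import time
from pathlib import Path

# USB topology paths look like "1-2.4": bus, dash, port chain.
USB_PATH_RE = re.compile(r"^\d+-\d+(\.\d+)*$")


def usb_power_cycle(usb_path, sysfs_root="/sys/bus/usb/devices",
                    off_s=3.0, sleep=time.sleep):
    """De-authorize a USB device, wait, then authorize it again."""
    if not USB_PATH_RE.match(usb_path):
        raise ValueError(f"not a USB topology path: {usb_path!r}")
    authorized = Path(sysfs_root) / usb_path / "authorized"
    authorized.write_text("0")   # device drops off the bus
    sleep(off_s)
    authorized.write_text("1")   # kernel re-enumerates it from scratch
```

Validating the path before writing matters more than it looks: this function runs as root against a directory where a typo can de-authorize a hub instead of a modem.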

The interface has an IP, but the gateway isn't reachable. This is the "frozen modem" state: from the Pi's perspective everything looks healthy, but ping -c1 -W3 192.168.X.1 against the modem's own admin IP fails. The modem firmware has wedged itself. Same recovery as above — USB power-cycle, then GOST restart.

The script tracks a persistent failure counter for each modem in /tmp/usb-recovery-failures.json (which survives the per-cycle script restarts), and after three consecutive failures on the same modem it escalates to the host-reboot branch — which, as above, is currently disabled. In practice the counter rarely climbs above one. The USB power-cycle branch is the workhorse, and it resolves more than it has any right to.
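A counter that has to outlive a process which exits every cycle is just a small file. Here is a sketch of how such a counter might look; the file path and the threshold of three come from the text, while the JSON layout (an object keyed by modem name) and the function names are assumptions.

```python
import json
from pathlib import Path

COUNTER_FILE = Path("/tmp/usb-recovery-failures.json")
MAX_RECOVERY_FAILURES = 3


def load_counts(path=COUNTER_FILE):
    """Read the per-modem failure counts; a missing or corrupt file means zero."""
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def record_result(modem, recovered, path=COUNTER_FILE):
    """Update one modem's counter; return True when escalation is due."""
    counts = load_counts(path)
    counts[modem] = 0 if recovered else counts.get(modem, 0) + 1
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(json.dumps(counts))
    tmp.replace(path)  # atomic rename, so a crash never leaves a half-written file
    return counts[modem] >= MAX_RECOVERY_FAILURES
```

Keeping the file in /tmp means the counters reset on a host reboot, which is the right behavior for a counter whose last resort is a host reboot.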

The decision tree, simplified:

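# Watchdog escalation, per modem, before each rotation cycle.
# Simplified: the helpers stand in for the real checks and commands.
MAX_RECOVERY_FAILURES = 3

def recover(modem):
    if not interface_present(modem):
        # Dead device. Would host-reboot, but disabled: emit event, skip.
        emit("system:reboot", modem)
        return

    if has_ip(modem) and pingable(modem.gateway_ip):
        # Healthy: clear the persistent counter, nothing to recover.
        consecutive_failures[modem] = 0
        return

    if not has_ip(modem):
        # NetworkManager lost the connection: bounce the profile.
        nmcli_down_up(modem)
        if has_ip(modem):
            restart_gost(modem)
            return

    # nmcli didn't bring the IP back, or the firmware is frozen
    # (IP present but gateway dead): hard reset at the USB layer.
    usb_power_cycle(modem.usb_path)  # the bash one-liner above
    sleep(10)
    restart_gost(modem)

    # Persistent failure escalation (counting simplified here).
    if not pingable(modem.gateway_ip):
        consecutive_failures[modem] = consecutive_failures.get(modem, 0) + 1
    if consecutive_failures.get(modem, 0) >= MAX_RECOVERY_FAILURES:
        # Would host-reboot here; disabled.
        emit("system:reboot", modem)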

The logic itself isn't novel — it's the standard "try increasingly aggressive recovery actions and stop when something works" pattern from any decent supervision system. What matters is that every branch is known to recover a real failure mode I have observed in production. Nothing in here is theoretical; every recovery action exists because at some point in the first month of operation, something failed in exactly that way and the previous branch wasn't enough.

What turns nine flaky consumer MiFi dongles into something that behaves like infrastructure is this ladder. The Playwright reboot is the headline feature. The watchdog is the unsung hero.

Production gotchas

These are the details that didn't appear in any tutorial I read, that I learned by losing time to them.

Every M7200 has the same fake USB serial

Run lsusb -v on an M7200 and you get vendor 37ad, product 0001, and a serial number reading 0123456789ABCDEF. Every M7200 in the fleet reports the identical serial. TP-Link's QA process apparently considered 0123456789ABCDEF a sufficiently unique identifier for the entire production run.

This means udev rules can't pin persistent names to specific modems based on serial number — every modem looks the same to udev. The trick is to key on USB topology path instead. Every modem has a stable bus path (1-2.4, 1-1.3, etc.) determined by which physical USB port it's plugged into; that path doesn't change unless I physically unplug the modem and move it. The usbPort field in the script's config maps a logical name (router4) to a path (1-2.6).

The fragility is exactly what you'd expect: if I unplug a modem and reseat it in a different port, it gets a new identity, and the proxy bound to its old name talks to whichever modem now lives in the old port. I learned this the first time I rearranged cabling and watched four modems' IP rotations swap into the wrong configs. The fix is to label every USB cable physically and never reorganize in haste.
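Because identity lives in the port rather than the device, a useful sanity check is asking sysfs which network interface currently sits behind a given topology path. A sketch, assuming the usual sysfs layout in which a USB network function appears as `<path>:<config>.<interface>/net/<ifname>`; the function name and the example mapping are illustrative.

```python
from pathlib import Path

# Logical name -> USB topology path, in the spirit of the usbPort config field.
USB_PORTS = {"router4": "1-2.6"}


def interfaces_on_port(usb_path, sysfs_root="/sys/bus/usb/devices"):
    """List network interface names exposed by the device on one USB port."""
    names = set()
    for net_dir in Path(sysfs_root).glob(f"{usb_path}:*/net"):
        names.update(child.name for child in net_dir.iterdir())
    return sorted(names)
```

Running this for every entry in the port map after touching any cable shows immediately whether a modem has landed in the wrong port: the name that comes back will not match the name in the config.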

Modems briefly impersonate Google during reboot

A modem mid-reboot enumerates as a different USB device. For about two seconds, the M7200 firmware exposes a USB descriptor that reads:

idVendor=18d1   (Google)
idProduct=d00d
Manufacturer: Google
Product: Android

18d1:d00d is the Google fastboot/recovery bootloader identifier. The TP-Link firmware appears to be Android-derived, and during its boot phase it announces itself as a generic Android device before transitioning to the regular TP-Link M7200 descriptor. You can watch this happen in dmesg every time the watchdog issues a USB power-cycle — there's a window where the kernel briefly reports a Google Android device on the bus, followed by the M7200 descriptor returning.

This is harmless. NetworkManager has nothing to do during the Google phase (no usable interface exposed), so it just waits for the real device to come back. But it's an entertaining detail to watch in production logs: whatever Linux-derived firmware TP-Link is shipping clearly has opinions about its own identity during boot.
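If a script ever needs to tell the boot phase apart from the normal state, the vendor and product IDs are readable from sysfs. A small sketch; the 18d1:d00d pair is the one quoted above, and the function names and the decision to treat an unreadable device as "nothing enumerated" are assumptions.

```python
from pathlib import Path

ANDROID_BOOT_ID = ("18d1", "d00d")  # descriptor seen briefly during modem boot


def usb_identity(usb_path, sysfs_root="/sys/bus/usb/devices"):
    """Return (idVendor, idProduct) for a port, or None if nothing is there."""
    dev = Path(sysfs_root) / usb_path
    try:
        vendor = (dev / "idVendor").read_text().strip().lower()
        product = (dev / "idProduct").read_text().strip().lower()
    except OSError:
        return None  # nothing enumerated on that port right now
    return vendor, product


def in_boot_phase(usb_path, sysfs_root="/sys/bus/usb/devices"):
    """True while the device on this port still shows the Android boot ID."""
    return usb_identity(usb_path, sysfs_root) == ANDROID_BOOT_ID
```

A recovery script could poll `in_boot_phase` after a power-cycle and hold off on the NetworkManager reconnect until it turns false.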

The 12 Mbit/s USB bottleneck

The Terminus Technology hub on proxy2 is a USB 2.0 device, but lsusb -t shows it negotiating as full-speed (12 Mbit/s) rather than high-speed (480 Mbit/s), even though both the hub and the M7200 modems support high-speed. The Linux kernel logs "not running at top speed; connect to a high speed hub" when each modem enumerates. The cause is likely a chain-of-fallbacks issue with the hub's internal transaction translator, and I haven't traced it deeper than that.

What this means in practice: four M7200 modems on proxy2 share a 12 Mbit/s aggregate USB ceiling. Per-modem LTE throughput in real usage tops out at a few Mbit/s on a good carrier connection, so I haven't been bottlenecked yet — but the headroom is much thinner than the modem capacity suggests. The one M7000, which is plugged directly into a USB 3 port, runs at full 480 Mbit/s and has no such ceiling. The Pi 4 on proxy1, with its different (powered) hub, runs all four modems at high-speed.
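The negotiated speed is also exposed per device in sysfs, which makes this easy to check from a script instead of reading lsusb -t by eye. A sketch under the assumption that each device directory has a `speed` file holding the rate in Mbit/s; the function names are made up, and the fair-share figure is raw signalling rate, so usable throughput will be lower.

```python
from pathlib import Path

SYSFS_USB = "/sys/bus/usb/devices"
HIGH_SPEED_MBIT = 480


def link_speed_mbit(usb_path, sysfs_root=SYSFS_USB):
    """Negotiated link speed of one device, in Mbit/s."""
    return float((Path(sysfs_root) / usb_path / "speed").read_text().strip())


def slow_devices(usb_paths, sysfs_root=SYSFS_USB):
    """Map each device running below high-speed to its negotiated speed."""
    slow = {}
    for usb_path in usb_paths:
        speed = link_speed_mbit(usb_path, sysfs_root)
        if speed < HIGH_SPEED_MBIT:
            slow[usb_path] = speed
    return slow


def fair_share_mbit(bus_mbit, devices):
    """Even split of a shared link among the devices behind it."""
    return bus_mbit / devices
```

For the hub described above, `fair_share_mbit(12, 4)` gives 3 Mbit/s of signalling rate per modem, which is why "a few Mbit/s" of real LTE throughput is uncomfortably close to the ceiling.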

If I were rebuilding proxy2 today I'd replace the hub with a properly powered USB 3 unit that exposes high-speed downstream ports. As a "you've been warned" detail for anyone trying to copy this setup, the hub choice matters more than the modem choice.

The admin login is RSA-encrypted client-side

The TP-Link M7200 web admin UI doesn't submit the admin password in plaintext. It RSA-encrypts the password client-side using a public key it serves from a JavaScript bundle, and only the encrypted blob ever goes over the wire. This is more security than I expected from consumer-grade hardware, and it forced an extra dependency (node-rsa) into the Playwright script because the script has to perform the same encryption to submit a working login.

The good news: the public key is consistent across modems and you can pull it out of the page's JS bundle once. The amusing news: this is the kind of detail that would be easy to gloss over and would silently break the entire setup. The login form looks like an ordinary password input until you sniff the network traffic.

Memory and thermal headroom

After 25 to 27 days of continuous operation, both Pis sit comfortably below thermal throttling thresholds (51°C on the Pi 4 doing the heavier work, 45°C on the Pi 5), have no undervolt warnings, and report throttled=0x0. RAM utilization is modest: even with a fresh Chromium instance every cycle, the rotation script's peak working set stays well under 1 GB. The Pi 4's 4 GB and the Pi 5's 8 GB are both massive overkill for the workload; the actual bottleneck across the system is the USB power topology, not compute or memory.

The production numbers

A snapshot of the fleet at roughly the four-week mark:

Metric                    proxy1 (Pi 4)   proxy2 (Pi 5)
Uptime                    25 days         27 days
Modems online             4               5
Cycles completed          ~4,170          ~3,600
Effective cycle period    ~10–12 min      ~10–12 min
WireGuard traffic out     720 GiB         247 GiB
WireGuard traffic in      47 GiB          22 GiB
CPU temperature           51 °C           45 °C
Thermal throttling        none            none
Host reboots required     0               0

The asymmetry in throughput between the two nodes — proxy1 has pushed about three times the egress traffic of proxy2 — reflects how my AI workloads are scheduled rather than any difference in capacity. The Pi 4 happens to carry the heavier shift, and the data plans on its modems are sized accordingly.

The "cycle period" deserves a footnote. The script's configured cycle floor is five minutes: the rotation loop never sleeps less than that. But each cycle reboots a Pi's four or five modems sequentially, with a sixty-second gap between modems and roughly seventy-five seconds of reboot/verify time per modem, so the effective cadence runs closer to ten or twelve minutes in practice. (Nine modems handled one after another at those timings would take closer to twenty minutes, which matches neither the table nor the cycle counts.) The five-minute number is the minimum interval between cycle starts, not the typical interval.

What this means for downstream consumers: any given modem's IP changes roughly every ten to twelve minutes. With nine modems on rotation, the pool offers a new IP every minute or two on average. That's plenty of churn for the kinds of long-form scraping and content-extraction jobs my AI products run; for synchronous user-facing requests it would not be enough, but that's not what this pool is for.
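The arithmetic behind those cadence numbers is easy to check. The sketch below is a rough model, not the rotation script: it assumes each Pi cycles only its own four or five modems (which is what the per-node cycle counts in the table suggest), and it ignores login and page-load overhead, so the results are lower bounds.

```python
GAP_S = 60            # pause between consecutive modems, from the text
REBOOT_VERIFY_S = 75  # approximate reboot + ipify check per modem, from the text
FLOOR_S = 5 * 60      # configured minimum interval between cycle starts


def cycle_period_s(modems):
    """Effective cycle length for one Pi that reboots its modems one after another."""
    busy = modems * REBOOT_VERIFY_S + max(0, modems - 1) * GAP_S
    return max(FLOOR_S, busy)


def pool_fresh_ip_interval_s(modems_per_pi):
    """Average seconds between new public IPs appearing anywhere in the pool."""
    rate = sum(n / cycle_period_s(n) for n in modems_per_pi)  # new IPs per second
    return 1 / rate
```

With these inputs, `cycle_period_s(4)` is 480 seconds and `cycle_period_s(5)` is 615 seconds, and `pool_fresh_ip_interval_s([4, 5])` comes out at about a minute. The observed figures (25 days over roughly 4,170 cycles, 27 days over roughly 3,600) sit a little above the model, as expected once real-world overhead is added.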

A note on getting blocked, in practice

The "unblockable" claim from the intro deserves the technical follow-up.

If a target site decides one of my IPs is annoying, the modem behind that IP cycles to a fresh public address within about twelve minutes. By the time the block is consequential, it's already irrelevant. If a site blocks more aggressively — banning a whole /24 in CGNAT space, say — the eight other modems on the rotation don't notice; some are on the other carrier and the rest are mapped through entirely different CGNAT pools at O2 anyway.

The closest equivalent to a full-pool outage was when O2 ran maintenance on a tower somewhere and one or two modems lost carrier; the pool ran with a slight limp for about ten minutes and then everything came back. That's the worst the system has done in nearly four weeks of continuous operation.

The underlying observation worth keeping: anti-bot mitigation is built around a model of one IP, one identity, one rate limit. A rotation pool sized to a small target site's full daily traffic, distributed across two carriers, simply doesn't fit that model cleanly. Most sites give up before they ever block all of you, because they can't tell that all of you is the same operator. Resilience as a side-effect of redundancy, not as a feature anyone explicitly built.

What I'd do differently, and what I won't change

The setup is comfortable enough to run unattended for weeks at a time, so the obvious question is: would I build it the same way again?

Mostly yes. The application-layer egress via GOST's SO_BINDTODEVICE binding is the design decision I'm most pleased with — I'd carry it into any future project that needs per-interface routing. The Playwright-clicks-Reboot rotation engine is silly-sounding but durable in practice, and the alternative (modifying TP-Link firmware to expose a CLI, or hot-patching the modem's HTTP API) would be a much larger maintenance burden.

The things I'd change on a rebuild:

A properly powered USB 3 hub for the second Pi. The 12 Mbit/s ceiling on the Terminus hub isn't biting yet, but it's the most obvious latent bottleneck in the system and is also the cheapest single thing to fix.
Physical USB cable labeling from day one. The combination of identical fake serials and topology-path naming means any cable reorganization is risky; physical labels prevent that risk entirely. I added them after the first time I scrambled the fleet's identity, and I should have done it before.
Native systemd ordering for proxy startup, rather than the bash ExecStartPre polling loop. The polling works, but the right way to express "wait for the interface to have an IP before starting" is probably a systemd device unit or a NetworkManager dispatcher script, not thirty rounds of sleep 2.
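To make the last item concrete, here is roughly what the NetworkManager dispatcher variant could look like. This is a sketch under assumptions, not something I run: NetworkManager invokes each executable in its dispatcher directory with the interface name and the action as the first two arguments, and everything else here (the `enx`/`usb` interface prefixes and the `gost-proxy@` template unit name) is hypothetical.

```python
#!/usr/bin/env python3
import subprocess
import sys

MODEM_PREFIXES = ("enx", "usb")       # hypothetical: names the modem interfaces get
UNIT_TEMPLATE = "gost-proxy@{}.service"  # hypothetical systemd template unit


def plan(interface, action):
    """Return the systemctl argv for this dispatcher event, or None to ignore it."""
    if not interface.startswith(MODEM_PREFIXES):
        return None
    verb = {"up": "start", "down": "stop"}.get(action)
    if verb is None:
        return None
    # --no-block returns immediately so the dispatcher is not held up by the proxy.
    return ["systemctl", "--no-block", verb, UNIT_TEMPLATE.format(interface)]


def main(argv):
    if len(argv) < 3:
        return  # not invoked by the dispatcher
    command = plan(argv[1], argv[2])
    if command is not None:
        subprocess.run(command, check=False)


if __name__ == "__main__":
    main(sys.argv)
```

The appeal over the polling loop is that the proxy starts in response to the "interface is up" event rather than guessing how long to wait for it.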

The things I considered changing and decided not to:

The 5-minute cycle floor. I experimented with shorter intervals; the carriers throttle aggressive reconnects, and you start seeing the same IP returned more often than you'd like below about four minutes. Five minutes is the floor where rotation is reliably rotation-y.
Enabling the host reboot branch. The watchdog's lower escalations are reliable enough that a host reboot has never proven necessary in production. The risk of an enabled host-reboot path is a boot loop if the underlying problem persists across reboots. The reward is recovering from a category of failure I haven't seen in four weeks. Net assessment: leave it off.
Migrating to fewer, more capable modems. A single M.2 industrial cellular modem can sometimes outperform several consumer MiFis on aggregate throughput, but you lose the IP diversity benefit (one modem = one IP pool at a time). Nine independent MiFis, each on its own SIM, gives me nine separate carrier sessions and nine independent CGNAT views. That's the whole point.
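The host-reboot decision above is easier to reason about when the recovery ladder is written down as data. The following is an illustrative sketch rather than the production watchdog: the rung names, the two-second settle delay, and the function signatures are assumptions, while the `nmcli` and sysfs `authorized` mechanisms are the ones described earlier in the post.

```python
import subprocess
import time
from pathlib import Path


def bounce_connection(connection):
    """Rung 1: take the NetworkManager connection down and bring it back up."""
    subprocess.run(["nmcli", "connection", "down", connection], check=False)
    subprocess.run(["nmcli", "connection", "up", connection], check=False)


def reauthorize_usb(device_path, settle_s=2.0):
    """Rung 2: toggle the sysfs 'authorized' flag so the kernel re-enumerates the modem."""
    authorized = Path("/sys/bus/usb/devices") / device_path / "authorized"
    authorized.write_text("0")
    time.sleep(settle_s)  # hypothetical settle delay
    authorized.write_text("1")


def run_ladder(is_healthy, rungs):
    """Try each enabled rung in order; return the name of the rung that fixed it, else None.

    `rungs` is a list of (name, enabled, action) tuples. Disabled rungs are skipped,
    which is how a host-reboot rung can exist in the ladder without ever firing.
    """
    for name, enabled, action in rungs:
        if not enabled:
            continue
        action()
        if is_healthy():
            return name
    return None
```

A ladder for one modem would list the `nmcli` rung, the USB rung, and a host-reboot rung with `enabled` set to `False`; flipping that one flag is the whole difference between the current behaviour and the boot-loop risk described above.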

Alternatives I evaluated before building this

For anyone considering the same problem, the obvious alternatives I evaluated:

Commercial residential proxies (Bright Data, Oxylabs, Smartproxy). At my volume the monthly invoice would have been five figures. Killed on economics for sustained workload, retained as overflow fallback in the application layer.
Tailscale instead of WireGuard. Easier mesh management and NAT traversal out of the box, but Tailscale's coordination plane adds an external dependency and a control-plane cost I didn't need. WireGuard hub-and-spoke is two .conf files and zero external dependencies. Killed on simplicity.
FlareSolverr / similar Cloudflare bypass services. Useful for one-off requests but doesn't solve the upstream IP-quality problem. The pool was always going to need to exist regardless of how the Cloudflare layer was solved.
AT command interface to the modems (instead of clicking the web UI). Possible in theory but requires driver-level work on the TP-Link firmware that I wasn't going to maintain. The Playwright path is silly-sounding but durable.

The alternatives are all reasonable answers to slightly different questions. The pool is the answer to "how do I run residential-IP scraping at meaningful sustained volume without invoicing a commercial vendor monthly."

Where this architecture stops making sense

The consumer-hardware design is deliberately scoped. It works for my volume — nine modems, low single-digit terabytes per month aggregate, one product family — because the price of entry stays low enough to be self-funded and the operational overhead stays small enough that one person handles it part-time. Beyond a certain threshold, this stops being the right architecture.

The threshold is somewhere around 50 modems, give or take. Past that, you run into problems that consumer hardware doesn't solve cleanly: RF interference between physically clustered LTE devices, USB host controller saturation that no off-the-shelf hub design fixes, power delivery requirements that need real electrical engineering rather than careful PSU selection, and physical real-estate problems that a Raspberry Pi cluster doesn't gracefully absorb. At that scale the architectural class shifts to rackmount industrial multi-SIM gateways, PCIe cellular cards in proper server hardware, IPMI management, and the operational discipline of a small datacenter rather than a closet.

I considered going that direction. At my volume the consumer setup amortizes in months while industrial gear amortizes in years, and the capex case for industrial doesn't close yet. Different problem, different answer. Worth flagging because the architectural decisions here don't scale linearly: a setup that's clearly correct at 9 modems would be clearly wrong at 90.

What it powers

This pool sits underneath the data-ingestion layer of three AI products I'm running in production. The workloads share a common shape: high-volume, asynchronous, residential-IP-required, and economically marginal at any commercial proxy vendor's pricing.

The clearest concrete example is querying Google search at scale. Google's Custom Search JSON API costs $5 per 1,000 queries above a small free tier, with a hard ceiling of 10,000 queries per day per project. For AI workloads that consume hundreds of thousands of search queries per month — AI Search Visibility analysis tracking AI Overview responses across thousands of keywords, or competitive intelligence monitoring SERP positioning across geographies and time — that pricing model is prohibitive. Worse, the API returns structured JSON that omits exactly the SERP features (AI Overviews, Featured Snippets, People Also Ask, knowledge panels) that you most want to see in 2026.
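To put numbers on "prohibitive", here is the pricing above as a small calculation. It is a sketch: the price and the daily cap are the figures quoted in the previous paragraph, while the free allowance of 100 queries per day is my assumption standing in for the "small free tier".

```python
PRICE_PER_1000_USD = 5.0  # paid rate quoted above
DAILY_CAP = 10_000        # hard per-project ceiling quoted above
FREE_PER_DAY = 100        # assumption standing in for the "small free tier"


def monthly_api_cost(queries_per_day, days=30):
    """Return (cost_usd, queries_left_unserved) for one project over `days` days."""
    served = min(queries_per_day, DAILY_CAP)
    billable = max(0, served - FREE_PER_DAY)
    cost = billable * days * PRICE_PER_1000_USD / 1000
    unserved = (queries_per_day - served) * days
    return cost, unserved
```

Running a single project flat out at the cap costs about $1,485 per month and tops out at 300,000 queries; `monthly_api_cost(20_000)` returns the same bill with another 300,000 queries simply not served. The ceiling, more than the price, is what rules the API out once volume grows.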

The proxy pool lets me query Google's actual web frontend directly, receive the full SERP HTML including all the rich features, and pay only for the SIM data the requests consume. The parsing is non-trivial — Google rotates class names release-to-release, AI components are JavaScript-hydrated, SERP layouts shift weekly, and anti-bot detection inspects timing patterns as much as content fingerprints. A robust SERP parser is months of initial work and a constant gardening job to keep working. But it amortizes across hundreds of thousands of queries at effectively zero marginal cost.

There's a deeper economic point worth surfacing here, specific to Google. Commercial residential proxies are tiered by IP exclusivity, and each tier has its own failure mode against Google's reputation scoring:

Shared rotating residential (low single-digit dollars per gigabyte) gets burned quickly because Google fingerprints these pools at the IP-block level — one customer's bad behavior poisons the IPs for every other customer sharing the pool. You're paying for IPs whose reputation you don't control.
Sticky residential (mid-tier pricing) extends individual IP sessions, but the underlying IPs are still part of a known commercial pool. The session duration helps with site-specific rate limits; it doesn't help with pool-level reputation scoring.
Dedicated non-shared residential (an order of magnitude more expensive than shared) gives you IPs no one else is burning. But the IPs are still hosted by a commercial residential proxy provider — fingerprintable as such — and dedicated IPs lose the rotation benefit, so a single dedicated IP burns faster against aggressive use.

The structural problem across all three tiers is that the IPs are commercial proxy infrastructure. Google's reputation systems eventually classify them as such, regardless of price tier. You can stretch the working lifetime of a clean dedicated IP with careful traffic shaping, but you're playing a losing game against scoring systems that classify IP patterns at population scale. And the moment you also need rotation — which you do, because dedicated IPs without rotation get burned and dedicated rotating IPs cost orders of magnitude more — the math falls apart fast.

Carrier CGNAT IPs sit on the other side of this game entirely. The IP behind each of my modems serves dozens or hundreds of real consumer mobile users at any given moment, with my scraping traffic mixed into the same stream. The IP's "reputation" — whatever score Google assigns it — is dominated by the human users on that CGNAT mapping, not by anything I do. Google cannot afford to mass-block these IP ranges without breaking legitimate consumer mobile access. The reputation problem that defines commercial-proxy economics simply doesn't apply, because I'm not paying for the IP — I'm renting a slot in a pool that the carrier is operating for an entirely different reason.

This is the kind of workload where the economics of the pool flip from "savings vs commercial alternative" to "enabling a workload that wouldn't otherwise be financially viable at all." Beyond Google, the same pattern applies to other gated-but-public data sources my products consume: product listings on large e-commerce platforms, social discussion threads, ad creative inventories, paid-search results across geographies. The pool is the substrate that makes these workloads affordable.

Honest limitation: the pool is not 100% available

Nine LTE modems on consumer hardware on consumer SIMs are not the same uptime guarantee as a commercial residential proxy service. Carriers run scheduled maintenance. Towers occasionally lose connectivity. SIM plans can throttle when usage patterns look unusual. A bad firmware day on a single modem can put it in a state the watchdog takes three cycles to recover. None of these have meant zero pool capacity — the natural-HA point applies — but pool capacity does fluctuate, sometimes by 30–40% for short periods.

For workloads where availability matters (anything time-sensitive, anything where a missed query has costs), I run the proxy pool as the primary path with a paid commercial residential proxy as a hot fallback. The application layer routes through the pool by default and retries through the commercial vendor when the pool returns errors or when available capacity drops below a configured threshold. The economics still work strongly in my favor because the commercial path handles only the overflow, typically a few percent of total volume, and the pool handles the rest. This layered architecture gets you both the cost advantage of the pool and the SLA of a commercial vendor.
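The routing rule itself is small enough to sketch here. This is a simplified illustration under my own assumptions, not the production routing layer: the `PoolStatus` type, the 0.6 capacity threshold, and the route names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    modems_total: int
    modems_online: int


def choose_route(status, min_online_fraction=0.6, previous_attempt_failed=False):
    """Return 'pool' or 'commercial' for the next request.

    The pool is the default; the commercial vendor is used for retries after a
    pool error and whenever online capacity falls below the threshold.
    """
    if previous_attempt_failed:
        return "commercial"
    if status.modems_total == 0:
        return "commercial"
    if status.modems_online / status.modems_total < min_online_fraction:
        return "commercial"
    return "pool"
```

With nine modems and a 0.6 threshold, the pool keeps serving traffic with six modems online and hands over to the vendor at five, which roughly matches the 30–40% capacity dips mentioned above.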

That second layer — the application-level routing logic, the failover thresholds, the cost-monitoring dashboards that decide when to spend on commercial fallback — is enough material for its own post. I'll write it up next.

Latency, cost, and architecture

A practical note on latency: LTE links have higher RTT and more jitter than datacenter networking — typical latencies are 30–80 ms to the carrier core, occasionally spiking during handovers or cell congestion. For my workloads — asynchronous scraping pipelines where the per-request latency budget is loose and overall throughput matters more than tail latency — this is fine. For synchronous user-facing requests where the user is waiting on the wire, it would not be. The pool is sized and tuned for batch work, not interactive paths.

The end-to-end cost picture is dominated by the SIM data plans, not the hardware. The amortized hardware cost across the modem fleet and the two Pis is small in monthly terms. Compared with the commercial residential proxy services I priced before building this, the operating cost is one or two orders of magnitude lower depending on workload, and the IP quality has been at least as good. Your mileage will depend on your local carrier landscape and how aggressively your target sites are filtering.

The code for the rotation engine and the dashboard isn't open source. The architecture — the topology, the GOST trick, the recovery ladder, the dashboard pattern, the paid-fallback layering — is hopefully useful to anyone trying to build something similar. If you do, the things to think hard about are USB power delivery (the boring problem), physical cable labeling (the embarrassing problem), and your carrier choice (the unpredictable problem). I expected the hard part to be the Playwright reliability and the network stack. It turned out to be the cables.

This is the first in a series on production infrastructure for AI products. Next up: the application-layer routing logic — how I run the pool as a primary path with a commercial residential proxy as a hot fallback, getting the cost economics of the pool with the SLA of a vendor. After that: the parsing layer — turning raw SERP HTML into LLM-ready signal.

