WORK

The full portfolio and technical depth

For anyone who wants to see more — past roles, products, and how far down the stack I go.

01 WHAT I BUILD

What I'm building now

B2B · PRODUCTION

Autonomous Data Feeds

Managed B2B data feeds with zero silent errors and zero client-side maintenance.

The pipeline combines fast, local deterministic parsers, real-time AI validation, and a custom physical LTE proxy pool in the Czech Republic. It repairs itself automatically when a source layout changes, and continuous output validity is backed by an SLA.

No dashboards, just clean structured data delivered directly to S3, SFTP, Webhooks, or REST API.
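A minimal sketch of the parse, validate, repair idea described above. This is illustrative only, not the production pipeline: the `Record` fields, both parsers, and the validation rules are hypothetical stand-ins, and a model-based validator would sit where `validate` is.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    title: str
    price: float


def validate(record: Optional[Record]) -> bool:
    """Cheap structural checks (hypothetical rules)."""
    return record is not None and bool(record.title.strip()) and record.price > 0


def parse_v1(html: str) -> Optional[Record]:
    """Fast deterministic parser tied to one known layout (assumed markup)."""
    m = re.search(
        r'<h1>(.*?)</h1>.*?<span class="price">(\d+(?:\.\d+)?)</span>', html, re.S
    )
    return Record(m.group(1), float(m.group(2))) if m else None


def parse_generic(html: str) -> Optional[Record]:
    """Slower, looser repair path used when the known layout no longer matches."""
    title = re.search(r"<(?:h1|h2)[^>]*>(.*?)</", html, re.S)
    price = re.search(r"(\d+(?:\.\d+)?)\s*(?:CZK|EUR)", html)
    if title and price:
        return Record(title.group(1), float(price.group(1)))
    return None


def extract(
    html: str,
    fast_parser: Callable[[str], Optional[Record]] = parse_v1,
    repair_parser: Callable[[str], Optional[Record]] = parse_generic,
) -> Record:
    """Try the fast path, then the repair path; never emit an invalid record."""
    for parser in (fast_parser, repair_parser):
        record = parser(html)
        if validate(record):
            return record
    raise ValueError("no valid record; refusing to emit silently wrong data")
```

The point of the final `raise` is that a failed extraction surfaces as an explicit error instead of a silently wrong row in the feed.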

Live Demo & Tech Specs ↗
SOLO · R&D

Surfaced

AI Search Visibility scanner + GEO content methodology.

Measures where a brand is missing from Google's AI Overview and generates content to fill the gap.

Stack: early R&D, not finalized.

Solo · R&D phase.

More info: jan.hilgard@gmail.com

CTO of production AI products (Advanty, Margly, Discury)

Three production AI products where I built and led the technical architecture and AI stack. Past role — not products of mine today.

Advanty

AI-powered competitive intelligence for marketing agencies.

Agents auto-tag ads, extract hooks, classify CTAs, and tag creatives — all as reliable structured outputs.

Stack: Qwen 3.6 on vllm-mlx (Apple Silicon M3 Ultra). A batch-friendly workload with reliable structured outputs — owned inference made sense economically and operationally.

As CTO, I led the project's AI stack and agentic flows.

Margly

Shoptet e-commerce analytics for online merchants.

AI agents identify margin leaks, recommend pricing changes, auto-tag transactions, and run multi-step orchestration over orders, shipping, ad costs, and returns.

Stack: Google AI (Gemini). Chosen deliberately: Margly needed low-latency responses for multi-step tool calling and autonomous orchestration. Owned on-premises inference wouldn't have been economically viable for this task class.

As CTO, I led the project's AI stack and agentic flows.

Discury

Customer intelligence — mines Reddit, Hacker News, and Product Hunt for pain points, trends, and market gaps.

Discovery and classification agents surface signals at high volume; summarization agents distill the nuance worth acting on.

Stack: hybrid orchestration. Discovery and classification agents on Qwen 3.6 / vllm-mlx (high-volume, batch-tolerant); final summarization and nuance-heavy reasoning on Google AI where the per-token premium was justified by output quality. Routing decided per agent task.

As CTO, I led the project's AI stack and agentic flows.
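Per-task routing of the kind described for Discury can be sketched as a small lookup table. The task names, the `Backend` enum, and the rule that unknown tasks default to the cloud are assumptions made for illustration, not the actual configuration.

```python
from enum import Enum


class Backend(Enum):
    LOCAL = "local"   # owned inference, e.g. a local Qwen deployment
    CLOUD = "cloud"   # hosted API, used where output quality justifies the cost


# Hypothetical task-to-backend table: high-volume, batch-tolerant work stays local.
ROUTES = {
    "discovery": Backend.LOCAL,
    "classification": Backend.LOCAL,
    "summarization": Backend.CLOUD,
}


def route(task: str, local_available: bool = True) -> Backend:
    """Pick a backend per agent task; fall back to cloud if local is down."""
    backend = ROUTES.get(task, Backend.CLOUD)  # assumption: unknown tasks go to cloud
    if backend is Backend.LOCAL and not local_available:
        return Backend.CLOUD
    return backend
```

Keeping the decision in a table rather than in prompt logic makes it easy to move a task class between backends when measured cost or quality changes.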

07 PAST PROJECTS

What I co-founded and keep developing

CO-FOUNDED · ACTIVELY DEVELOPED

Lobot.chat

AI chatbot for automating customer support in e-commerce. Resolves up to 98% of inquiries without a human, recommends products, and actively helps close sales. I designed and fully led the technical architecture, including integrations into Shopify, WooCommerce, Magento, PrestaShop, and OpenCart. As co-founder I continue to actively contribute to its technical development and operations.

lobot.chat
CO-FOUNDED · ONGOING CONSULTING

GuruWatch

Comprehensive B2B monitoring dashboard for manufacturers and distributors. GuruWatch tracks partner stock levels and pricing across e-shops in real time, watches price trends, and sends immediate alerts. Clients include brands like Lenovo, Niceboy, and Infinix. I designed and fully built the entire data and scraping pipeline that runs reliably at scale. Day-to-day operations have been handed over to a new owner, but I continue to provide strategic technical support and consulting for GuruWatch.

www.guruwatch.cz
03 HOW I BUILD

Most “AI agents” are demos. Production systems need the whole stack underneath.

I build across the whole stack — the agentic flows that run a product, and the inference, proxy, and data infrastructure underneath them. Here's what that looks like at the orchestration layer; the infrastructure proof is below.

  1. Multi-model orchestration

    Routing requests between local models (Qwen, Gemma) and cloud APIs (Claude, GPT) based on task complexity and cost. Production cost savings of 60–80% vs. a pure cloud setup.

    Built into Discury's high-volume agent tasks — owned Qwen 3.6, with frontier cloud models only where task quality demanded the premium.

  2. Hermes-style tool calling

    No brittle prompt chains. Agent receives a tool set, decides on its own. Requires a strong reasoning model + correct tool granularity. Lessons learned from production deployment.

    Built into Margly for autonomous multi-step orchestration over merchant order, cost, and ad data.

  3. MCP-native architecture

    Model Context Protocol as foundation for tool integration. Practical patterns for context management, error recovery, and debugging multi-step agents.

  4. Production failure modes

    Tool calling loops, hallucinated calls, context window poisoning, infinite retry loops. What I've seen in production and how to fix it.

    Patterns from building and running multiple production AI-powered web apps.

  5. Token economics of agentic systems

    Prefill vs generation cost. KV cache reuse. Speculative decoding for agent loops. Practical ROI analyses.

    Why Advanty and Discury were built on owned inference: measured ROI on M3 Ultra vs. cloud API per task class. When local inference was unavailable, requests automatically failed over to a public cloud API.
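Three of the failure modes listed above (tool calling loops, hallucinated calls, unbounded retries) can be contained with a small guard around tool execution. The sketch below is one possible shape, assuming the model's proposed calls arrive as `(name, args)` pairs; `run_tool_calls`, `AgentLoopError`, and the default limits are hypothetical names and values.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

ToolCall = Tuple[str, dict]


class AgentLoopError(RuntimeError):
    """Raised when the agent exceeds a budget or calls something it should not."""


def run_tool_calls(
    calls: Iterable[ToolCall],
    tools: Dict[str, Callable[..., object]],
    max_steps: int = 8,
    max_repeats: int = 2,
) -> List[object]:
    seen: Dict[str, int] = {}
    results: List[object] = []
    for step, (name, args) in enumerate(calls):
        if step >= max_steps:
            raise AgentLoopError("step budget exhausted")
        if name not in tools:
            # Hallucinated call: the model named a tool that was never offered.
            raise AgentLoopError(f"unknown tool: {name}")
        # Identical name + arguments repeated too often signals a loop.
        key = name + ":" + json.dumps(args, sort_keys=True)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            raise AgentLoopError(f"repeated call: {name}")
        results.append(tools[name](**args))
    return results
```

In a real agent loop the raised error would typically be fed back to the model or escalated, rather than retried blindly.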

I write about this regularly. If you have a production agentic workflow that's bleeding tokens or has failure mode issues, get in touch →

04 PROOF OF DEPTH

How far down the stack I go — when the economics demand it.

When the unit economics demand it, I go all the way down, to the inference layer and to the data/access layer. vllm-mlx (79 merged PRs) and a self-built LTE proxy pool are the two ends of that story: owned inference that made products affordable to run, and a residential-IP scraping stack that makes gated public data reachable. I work on MLX because efficiency requires it, not as a research specialty.

  • vllm-mlx core contributor

    79 merged PRs to open-source LLM inference for Apple Silicon (581+ stars). Primary implementor of the Anthropic Messages API endpoint (/v1/messages), the compatibility layer that makes vllm-mlx work with Claude Code and OpenCode.

    Main areas of work:

    • KV cache quantization: QuaRot live inference, asymmetric K/V bit quantization for prefix cache, TurboQuant R1 Hadamard rotation for outlier-free MoE weight quantization
    • Constrained decoding: JSON schema enforcement, thinking suppression, preamble handling, array-of-objects fixes
    • MLLM infrastructure: logits processor context, token duplication fixes, tools/tool_choice in chat templates
    • Production reliability: client disconnect detection, in-flight token credit on request abort, generation_tps batch stats
    • Streaming: UTF-8-safe incremental decode, tool calls with reasoning parser, leak fixes for Anthropic streaming
    github.com/waybarrios/vllm-mlx
  • Data-access infrastructure (the other end)

    A self-built LTE proxy pool — Raspberry Pis plus consumer MiFi modems on rotating CGNAT residential IPs — that puts scraping traffic on organically residential addresses, with a commercial proxy as hot fallback. Anti-detect scraping across Cloudflare / DataDome / Akamai. Real throughput from production pipelines (10k+ requests/day).

    Read: the LTE proxy pool
  • Production batch inference

    Apple M3 Ultra 256GB as primary inference machine. Workloads with 9:1 prefill/generation ratio (image classification, content tagging, structured extraction). 274 tok/s sustained throughput on Gemma 4 26B-A4B at concurrency 8.

  • Hardware economics

    Real ROI analyses: M3 Ultra vs RTX PRO 6000 Blackwell for different workload types. Cost-per-token calculations across cloud providers vs. owned infrastructure. Payback period modeling for hardware investments.

  • Local LLM deployment patterns

    vLLM, SGLang, llama.cpp, MLX. When to use which stack. Quantization tradeoffs. Multi-model serving. Auto-scaling on bare metal vs Kubernetes.
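The payback-period modeling mentioned under hardware economics reduces to simple arithmetic. The sketch below shows one way to set it up; the function names are hypothetical, and every price, cost, and utilization figure is a placeholder input to be replaced with measured values, not a number from the analyses above.

```python
SECONDS_PER_DAY = 86_400


def monthly_tokens(tok_per_s: float, utilization: float, days: int = 30) -> float:
    """Tokens processed per month at a sustained throughput and utilization (0..1)."""
    return tok_per_s * utilization * SECONDS_PER_DAY * days


def payback_months(
    hardware_cost: float,
    cloud_price_per_mtok: float,
    tokens_per_month: float,
    monthly_opex: float,
) -> float:
    """Months until owned hardware pays for itself versus a cloud API.

    cloud_price_per_mtok: cloud price per million tokens (placeholder input).
    monthly_opex: power, hosting, and maintenance for the owned machine.
    """
    cloud_cost = tokens_per_month / 1e6 * cloud_price_per_mtok
    saving = cloud_cost - monthly_opex
    if saving <= 0:
        return float("inf")  # owned hardware never pays back at this volume
    return hardware_cost / saving
```

For example, `payback_months(10_000, 1.0, monthly_tokens(274, 0.5), 100)` plugs the 274 tok/s throughput quoted above into otherwise invented inputs. The result is highly sensitive to utilization, which is why batch-friendly workloads favor owned inference.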

06 TIMELINE

The journey from the start

  1. TODAY

    Current focus

    vllm-mlx core contributor. Building Surfaced (R&D). Open to Fractional CTO and technical advisory engagements.

  2. EARLY 2025

    Launch of Advanty, Margly & Discury + vllm-mlx

    Deployment and launch of three end-to-end AI products as CTO and AI architect: Margly (profitability analytics for online merchants), Discury (customer intelligence platform), and Advanty (AI competitive intelligence). The architectures combine local inference with frontier cloud APIs. In parallel, active role as core contributor to the open-source framework vllm-mlx (LLM inference on Apple Silicon, 79 merged PRs, author of the compatible Messages API layer for tools like Claude Code).

  3. 2024

    R&D and AI stack development

    Intensive development, testing of local models, and design of hybrid infrastructure for the upcoming AI products. Work on reliability of agentic flow orchestration and on token cost optimisation (inference economics).

  4. 2023

    Development and co-founding of Lobot.chat

    Development of an advanced e-commerce chatbot with integrations into platforms like Shopify, WooCommerce, and Magento. As co-founder I continue to actively contribute technically.

  5. 2022

    Full shift to the AI/ML stack

    Deep dive into local LLMs, agentic workflows, and inference optimisation. Bridging the gap between academic research and infrastructure a solo founder can actually run in production.

  6. 2021

    Co-founding of GuruWatch

    Development and launch of a robust scraping and data pipeline for tracking pricing trends across hundreds of e-shops (brands like Lenovo, Infinix, Niceboy). I still act as the project's technical consultant.

  7. SEPTEMBER 2020

    Successful Hosting90 exit

    Full sale of Hosting90 systems s.r.o. to the international WY Group (operator of the Ignum brand). The transaction was publicly announced.

    hostingy.net
  8. 2002

    Founding of Hosting90

    Founding of a hosting and infrastructure company and 18 years of building it from scratch — from a garage to a 25-person team running its own servers.

Advisory / fractional CTO

For technical teams — inference economics, agentic architecture, technical due diligence.

Consult my project for free