AGENTIC AI PRODUCTS · OWNED & CLOUD INFERENCE · BOOTSTRAP

Jan Hilgard

I build AI products where AI is the engine.
Agentic flows, tool calling, autonomous decisions.

I choose owned inference (vllm-mlx) or frontier cloud models per workload; that judgment is where the value lives. 79 merged PRs to vllm-mlx. Exited Hosting90 (2020).

01 ABOUT

20+ years building tech companies and products

I founded Hosting90 in 2002. Over eighteen years it grew from a garage operation to a 25-person team and an international exit to WY Group in 2020.

Then I took a year off and went deeper into the AI/ML stack: local LLMs, agentic workflows, inference infrastructure. I realized there's a massive gap between what AI labs publish and what a solo founder can actually run on their own infrastructure.

That gap is what interests me most right now.

So that's what I build now: AI products where the AI is the engine, not a feature. Three are live (Margly, Discury, and Advanty), and each runs agentic flows that make decisions, call tools, and carry out multi-step work on its own.

I choose the inference stack per workload. Advanty's batch-friendly work runs fully on owned inference (Qwen 3.6 on vllm-mlx, Apple Silicon); Margly's more complex agent orchestration runs on frontier cloud (Google AI) for the reliability it needs; Discury orchestrates both. The 79 PRs I've merged into vllm-mlx give me the depth to make that call: owned or cloud, and for which workload.

When I'm not building my own products, I do audits of inference economics and agentic workflows for AI startups and tech companies.

An open question I'm working through: I'm building Surfaced to apply GEO (Generative Engine Optimization) — getting cited in Google's AI Overview — but I don't yet know how reproducible that is across different niches. Until I have my own proven case studies, it stays a project in development; a content offering without its own track record is just selling promises.

Based near Prague. I work in Czech and English (written). I publish about LLM economics and infrastructure patterns.

02 HOW MY PRODUCTS WORK

Most “AI agents” are demos. Production systems require something different.

I actively work on failure modes, token economics, and orchestration patterns for multi-step agentic workflows in production.

  1. Multi-model orchestration

    Routing requests between local models (Qwen, Gemma) and cloud APIs (Claude, GPT) based on task complexity and cost. Production cost savings of 60–80% vs. a pure-cloud setup.

    In production at Discury: high-volume agent tasks run on owned Qwen 3.6, with frontier cloud models only where task quality justifies the premium.

  2. Hermes-style tool calling

    No brittle prompt chains. The agent receives a tool set and decides on its own which tools to call. This requires a strong reasoning model and the right tool granularity. Lessons learned from production deployment.

    Used in Margly for autonomous multi-step orchestration over merchant order, cost, and ad data.

  3. MCP-native architecture

    Model Context Protocol as the foundation for tool integration. Practical patterns for context management, error recovery, and debugging multi-step agents.

  4. Production failure modes

    Tool-calling loops, hallucinated tool calls, context-window poisoning, infinite retry loops: what I've seen in production and how to fix each.

    Patterns derived from running three AI products in production.

  5. Token economics of agentic systems

    Prefill vs generation cost. KV cache reuse. Speculative decoding for agent loops. Practical ROI analyses.

    Why Advanty and Discury's high-volume agents run on owned inference: measured ROI on M3 Ultra vs. cloud API per task class. If local inference is unavailable, requests automatically fail over to public cloud.
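The tool-calling and failure-mode items above come down to two mechanics: reading the model's tool calls reliably, and refusing to execute calls that signal a loop or a hallucination. The sketch below is illustrative only. It assumes the Hermes convention of a JSON object with `name` and `arguments` wrapped in `<tool_call>` tags (the format Qwen chat templates also emit); `LoopGuard`, `dispatch`, the `get_margin` tool and the repeat/step limits are hypothetical, not code taken from Margly or vllm-mlx.

```python
from __future__ import annotations

import json
import re
from collections import Counter
from typing import Callable

# Hermes-style output: each call is a JSON object {"name": ..., "arguments": {...}}
# wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls; malformed ones are kept as errors instead of silently dropped."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            calls.append({"name": None, "error": "malformed tool call JSON"})
            continue
        args = payload.get("arguments", {}) if isinstance(payload, dict) else None
        if not isinstance(args, dict):
            calls.append({"name": None, "error": "arguments must be a JSON object"})
            continue
        calls.append({"name": payload.get("name"), "arguments": args})
    return calls


class LoopGuard:
    """Stops an agent that repeats an identical call or exceeds its step budget (hypothetical limits)."""

    def __init__(self, max_repeats: int = 2, max_steps: int = 20) -> None:
        self.max_repeats = max_repeats
        self.max_steps = max_steps
        self.steps = 0
        self.seen: Counter = Counter()

    def check(self, call: dict) -> str | None:
        """Return a stop reason, or None if the call may run."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return f"repeated identical call to {call['name']}"
        return None


def dispatch(calls: list[dict], tools: dict[str, Callable], guard: LoopGuard) -> list[dict]:
    """Run known tools; report hallucinated or looping calls back to the model as errors."""
    results = []
    for call in calls:
        if call["name"] not in tools:
            # Hallucinated or malformed call: return an error message, don't crash.
            results.append({"name": call["name"], "error": call.get("error", "unknown tool")})
            continue
        stop = guard.check(call)
        if stop:
            results.append({"name": call["name"], "error": stop})
            break  # end this turn instead of retrying forever
        results.append({"name": call["name"], "result": tools[call["name"]](**call["arguments"])})
    return results


# Usage with a hypothetical tool:
tools = {"get_margin": lambda sku: {"sku": sku, "margin": 0.18}}
output = 'Checking. <tool_call>{"name": "get_margin", "arguments": {"sku": "A1"}}</tool_call>'
print(dispatch(parse_tool_calls(output), tools, LoopGuard()))
# [{'name': 'get_margin', 'result': {'sku': 'A1', 'margin': 0.18}}]
```

Feeding the error entries back into the conversation, rather than raising, gives the model a chance to correct itself while the guard still caps how long it can keep trying.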

I write about this regularly. If you have a production agentic workflow that's bleeding tokens or has failure-mode issues, get in touch.

03 CURRENT PRODUCTS

What I'm currently building

SOLO · LIVE

Advanty

AI-powered competitive intelligence for marketing agencies.

Agents auto-tag ads, extract hooks, classify CTAs, and tag creatives — all as reliable structured outputs.

Stack: Qwen 3.6 on vllm-mlx (Apple Silicon M3 Ultra). The workload (auto-tagging, hook extraction, CTA classification, creative tagging) is batch-friendly and needs reliable structured outputs, so owned inference makes sense both economically and operationally.
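For batch tagging like this, a common pattern is to give the model an explicit JSON schema and still validate every result before it is stored, so a bad row can be retried instead of saved. A minimal stdlib sketch, assuming hypothetical field names (`hook`, `cta_class`, `creative_tags`) and CTA classes; it is not Advanty's actual schema, and how the schema is passed to vllm-mlx's constrained decoding is deliberately left out rather than guessed.

```python
from __future__ import annotations

import json

# Hypothetical tag schema for one ad, written in JSON Schema form so the same dict
# could also be handed to a constrained decoder. Field names are illustrative.
AD_TAG_SCHEMA = {
    "type": "object",
    "properties": {
        "hook": {"type": "string"},
        "cta_class": {"type": "string", "enum": ["buy_now", "learn_more", "sign_up", "other"]},
        "creative_tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["hook", "cta_class", "creative_tags"],
    "additionalProperties": False,
}


def validate_ad_tags(raw: str) -> tuple[dict | None, list[str]]:
    """Check one model output against AD_TAG_SCHEMA; return (data or None, errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    props = AD_TAG_SCHEMA["properties"]
    errors = [f"missing field: {k}" for k in AD_TAG_SCHEMA["required"] if k not in data]
    errors += [f"unexpected field: {k}" for k in data if k not in props]
    if "hook" in data and not isinstance(data["hook"], str):
        errors.append("hook must be a string")
    if "cta_class" in data and data["cta_class"] not in props["cta_class"]["enum"]:
        errors.append(f"unknown cta_class: {data['cta_class']!r}")
    tags = data.get("creative_tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        errors.append("creative_tags must be a list of strings")
    return (None if errors else data), errors


_, errs = validate_ad_tags('{"hook": "50% off today", "cta_class": "buy_now", "creative_tags": ["ugc"]}')
print(errs)  # []
_, errs = validate_ad_tags('{"hook": "Try it", "cta_class": "subscribe"}')
print(errs)  # ['missing field: creative_tags', "unknown cta_class: 'subscribe'"]
```

Rows that fail validation can be queued for a retry, which suits a batch workload where latency matters less than clean data.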

Solo-built and run. Live.

www.advanty.io
MIRANDA MEDIA · LIVE

Margly

Shoptet e-commerce analytics for online merchants.

AI agents identify margin leaks, recommend pricing changes, auto-tag transactions, and run multi-step orchestration over orders, shipping, ad costs, and returns.

Stack: Google AI (Gemini). Chosen deliberately: Margly's complex multi-step tool calling and autonomous orchestration require frontier-model reliability that current open-weights models don't yet match for this task class.

Built and led the technical architecture. Live.

margly.io
MIRANDA MEDIA · LIVE

Discury

Customer intelligence — mines Reddit, Hacker News, and Product Hunt for pain points, trends, and market gaps.

Discovery and classification agents surface signals at high volume; summarization agents distill the nuance worth acting on.

Stack: hybrid orchestration. Discovery and classification agents run on Qwen 3.6 / vllm-mlx (high-volume, batch-tolerant). Final summarization and nuance-heavy reasoning run on Google AI, where output quality justifies the per-token premium. Routing is decided per agent task.
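Per-task routing of this kind, including a fallback to cloud when owned inference is unavailable, can be sketched in a few lines. The task names mirror the agents described above, but the `ROUTES` table, `BackendUnavailable` and the plain `prompt -> text` callables are hypothetical stand-ins for real vllm-mlx and Google AI clients. The sketch also assumes failover is triggered only when the owned backend cannot serve a request, not when it answers poorly.

```python
from __future__ import annotations

from typing import Callable

Backend = Callable[[str], str]  # prompt -> completion; real API clients are omitted

# Hypothetical routing table: agent task -> preferred backend.
ROUTES = {
    "discovery": "local",       # high-volume, batch-tolerant: owned Qwen on vllm-mlx
    "classification": "local",
    "summarization": "cloud",   # nuance-heavy: frontier cloud model
}


class BackendUnavailable(Exception):
    """Raised by a backend that cannot serve the request (host down, timeout)."""


def run_agent_task(task: str, prompt: str, backends: dict[str, Backend]) -> tuple[str, str]:
    """Route a task to its preferred backend, failing over from local to cloud.

    Returns (backend_name, output).
    """
    preferred = ROUTES.get(task, "cloud")  # unknown tasks default to the premium tier
    order = [preferred] if preferred == "cloud" else [preferred, "cloud"]
    last_error: Exception | None = None
    for name in order:
        try:
            return name, backends[name](prompt)
        except BackendUnavailable as exc:
            last_error = exc
    raise RuntimeError(f"no backend available for task {task!r}") from last_error


# Usage: the owned host is down, so a classification task falls over to cloud.
def local_down(prompt: str) -> str:
    raise BackendUnavailable("vllm-mlx host unreachable")


backends = {"local": local_down, "cloud": lambda p: f"cloud:{p}"}
print(run_agent_task("classification", "tag this thread", backends))
# ('cloud', 'cloud:tag this thread')
```

Keeping the routing decision in a plain table makes the cost trade-off explicit per agent, and returning the backend name alongside the output makes it easy to log how often failover actually happens.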

Built and led the technical architecture. Live.

discury.io
SOLO · R&D PHASE

Surfaced

AI Search Visibility scanner + GEO content methodology.

Measures where a brand is missing from Google's AI Overview and generates content to fill the gap.

Stack: early R&D, not finalized.

Solo · R&D phase.

More info: jan.hilgard@gmail.com
04 THE INFERENCE MOAT

The moat is owning the inference layer.

Why my products work economically as bootstrapped businesses: I build the inference layer underneath them. vllm-mlx (79 merged PRs) isn't a hobby; it's the operational substrate that makes Advanty and Discury affordable to run, and the depth that lets me decide when cloud is the right call for products like Margly. Owning the stack means I'm not paying retail for inference.

  • vllm-mlx core contributor

    79 merged PRs to vllm-mlx, open-source LLM inference for Apple Silicon (581+ stars). Primary implementer of the Anthropic Messages API (/v1/messages), the compatibility layer that makes vllm-mlx work with Claude Code and OpenCode.

    Main areas of work:

    • KV cache quantization: QuaRot live inference, asymmetric K/V bit quantization for prefix cache, TurboQuant R1 Hadamard rotation for outlier-free MoE weight quantization
    • Constrained decoding: JSON schema enforcement, thinking suppression, preamble handling, array-of-objects fixes
    • MLLM infrastructure: logits processor context, token duplication fixes, tools/tool_choice in chat templates
    • Production reliability: client disconnect detection, in-flight token credit on request abort, generation_tps batch stats
    • Streaming: UTF-8-safe incremental decode, tool calls with reasoning parser, leak fixes for Anthropic streaming
    github.com/waybarrios/vllm-mlx
  • Production batch inference

    Apple M3 Ultra (256 GB) as the primary inference machine. Workloads with a 9:1 prefill/generation ratio (image classification, content tagging, structured extraction). 274 tok/s sustained throughput on Gemma 4 26B-A4B at concurrency 8.

  • Hardware economics

    Real ROI analyses: M3 Ultra vs RTX PRO 6000 Blackwell for different workload types. Cost-per-token calculations across cloud providers vs. owned infrastructure. Payback period modeling for hardware investments.

  • Local LLM deployment patterns

    vLLM, SGLang, llama.cpp, MLX. When to use which stack. Quantization tradeoffs. Multi-model serving. Auto-scaling on bare metal vs Kubernetes.

  • Scraping & data infrastructure

    Anti-detect web scraping. LTE modem pools with CGNAT rotation. Anonymizing proxy stack. Real throughput data from production pipelines (10k+ requests/day).
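The throughput and hardware-economics items above connect through simple arithmetic. A sketch, treating the 274 tok/s figure as generated tokens per second across all streams and the 9:1 ratio as input tokens per generated token (both assumptions about how those numbers were measured). The hardware cost, cloud prices, power cost and utilization in the example call are placeholders, not measurements or price quotes.

```python
def per_stream_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average generation speed each concurrent request sees."""
    return aggregate_tps / concurrency


def payback_days(
    hardware_cost: float,       # one-off purchase price
    gen_tps: float,             # sustained generated tokens/s across all streams
    prefill_ratio: float,       # input tokens per generated token
    utilization: float,         # fraction of the day the machine is actually busy
    cloud_in_per_m: float,      # cloud price per 1M input tokens
    cloud_out_per_m: float,     # cloud price per 1M output tokens
    owned_opex_per_day: float,  # power and hosting for the owned machine
) -> float:
    """Days until avoided cloud spend covers the hardware; inf if it never does."""
    out_tokens = gen_tps * 86_400 * utilization
    in_tokens = out_tokens * prefill_ratio
    cloud_per_day = (in_tokens * cloud_in_per_m + out_tokens * cloud_out_per_m) / 1_000_000
    daily_saving = cloud_per_day - owned_opex_per_day
    if daily_saving <= 0:
        return float("inf")
    return hardware_cost / daily_saving


print(per_stream_tps(274, 8))  # 34.25
# Placeholder inputs: 8,000 hardware, 50% utilization, 0.10 / 0.40 per 1M tokens, 1.44/day opex.
days = payback_days(8_000, 274, 9, 0.5, 0.10, 0.40, 1.44)
print(f"payback in about {days:.0f} days")
```

Under these placeholder prices, roughly two thirds of the avoided cloud spend is input tokens, which is why a prefill-heavy 9:1 profile weighs so much in the owned-vs-cloud comparison; utilization is the other input that moves the result most.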

05 HOW I THINK

A few principles I work by

  1. Cost arbitrage as strategy

    Cost arbitrage is strategy, not preference. Whoever owns the inference stack competes on different terms than whoever pays the OpenAI bill. It's an engineering decision with P&L impact.

  2. Production > novelty

    Trends are expensive. Working production systems = long-term moat. Six months with one provider > three months chasing every new release.

  3. Bridge between tech and business

    I spent 18 years running a tech company — where the CEO chair meant understanding code and cash flow at the same time. Today, when I solve architecture, I see P&L consequences. When I talk to investors, I talk about KV cache too. This combination is rare and that's where the value lives.

  4. Bootstrap by choice, not by default

    I've had an exit. I know what the VC track looks like. I consciously choose to bootstrap because, for AI infrastructure tooling, profit beats scale. Not dogma: a context-aware decision.

  5. Outcomes > activity

    20 years taught me shipping features ≠ creating value. I measure myself and projects by real outcomes (retention, margin, ARR), not activities (PRs, posts, meetings). This perspective only comes after several building/selling cycles.

If this resonates, we might be on the same wavelength.

06 TIMELINE

The journey from the start

  1. TODAY

    Current focus

    Core contributor to vllm-mlx. CTO at Miranda Media Group (Margly, Discury). Building Surfaced — AI Search Visibility scanner for CZ/SK SaaS.

  2. 2025

    vllm-mlx core contributor

    79 merged PRs to vllm-mlx (open-source LLM inference for Apple Silicon, 581+ stars). Authored the Anthropic Messages API compatibility layer that makes vllm-mlx work with Claude Code. Main focus: KV cache quantization (QuaRot, asymmetric, TurboQuant), constrained decoding, MLLM infrastructure, production reliability.

  3. 2025

    Launched Advanty

    AI-powered competitive intelligence for marketing agencies.

  4. 2024

    Margly + Discury

    Launched Margly (e-commerce profitability analytics for Shoptet) and Discury (customer intelligence platform) at Miranda Media.

  5. 2023

    Co-founded Lobot.chat

    AI customer-support chatbot for e-commerce. Live today, handed over to the operating team.

  6. 2022

    Shift to AI

    Started working with local LLM models and inference infrastructure.

  7. 2021

    Co-founded GuruWatch

    B2B monitoring dashboard for manufacturers and distributors tracking partner stock and pricing across e-shops. Live today, handed over.

  8. SEPTEMBER 2020

    Hosting90 exit

    Sale of Hosting90 systems s.r.o. to WY Group (operator of Ignum brand). Transaction publicly announced.

    hostingy.net
  9. 2002

    Founded Hosting90

    Start of entrepreneurial journey in hosting and web services. Operated as Hosting90 systems s.r.o. (Company ID 28545711).

07 PAST PROJECTS

What I shipped before

CO-FOUNDED · HANDED OVER

Lobot.chat

AI customer-support chatbot for e-commerce — resolves up to 98% of inquiries without a human, recommends products and closes sales. Drops into Shopify, WooCommerce, Magento, PrestaShop or OpenCart via a JS snippet. Co-founded; I owned the technical build. Live today, now run by the team.

lobot.chat
CO-FOUNDED · HANDED OVER

GuruWatch

B2B monitoring dashboard for manufacturers and distributors — tracks partner stock levels and pricing across e-shops, with real-time alerts and historical price trends. Customers include Lenovo, Niceboy and Infinix. Co-founded; owned the data pipeline and infrastructure. Live today, handed over.

www.guruwatch.cz
FAQ

Frequently asked questions

Who is Jan Hilgard?
Jan Hilgard is an AI product builder based near Prague, Czech Republic. He builds AI products where the AI is the engine — agentic flows, tool calling, and autonomous decisions — and is a core contributor to vllm-mlx with 79 merged PRs. He founded Hosting90 in 2002 and exited it to WY Group in 2020.
What does Jan Hilgard build?
AI products where the AI is the engine, not a feature: Margly (Shoptet e-commerce analytics), Discury (customer intelligence), and Advanty (competitive intelligence for marketing agencies). He chooses owned or cloud inference per workload.
What is vllm-mlx and what is his role in it?
vllm-mlx is open-source LLM inference for Apple Silicon — a vLLM fork with an MLX backend. Jan has merged 79 PRs, including the Anthropic Messages API compatibility layer that makes it work with Claude Code, plus KV cache quantization and constrained decoding.
What inference stack do his products run on?
It depends on the workload. Advanty runs fully on owned inference (Qwen 3.6 on vllm-mlx, Apple Silicon); Margly runs on frontier cloud (Google AI) for agent-orchestration reliability; Discury orchestrates both.
Is Jan Hilgard available for work?
He's open to fractional CTO engagements, advisory on inference economics or agentic architecture, short-term technical due diligence, and speaking or podcasts. The best contact is jan.hilgard@gmail.com.

Let's work together

Email or LinkedIn — written communication in Czech or English, same speed.
For calls, I'm strongest in Czech; English calls work best when scheduled with a clear agenda. I usually reply same day.

I'm open to

  • Fractional CTO engagements for AI / infrastructure startups
  • Advisory work where inference economics or agentic architecture decisions are in play
  • Short-term technical due diligence — AI products, inference stacks, scraping infrastructure
  • Speaking and podcasts on production AI infrastructure, owned-inference economics, or the Hosting90 → AI transition

Not currently looking for

  • Full-time relocation roles outside the Czech Republic
  • Projects requiring more than ~20 hours per week