Services

ai / running

LM Studio

Canonical local LLM runtime on the Zenbook/Jarvis host, serving an OpenAI-compatible API to Home Assistant, Open WebUI, n8n, and read-only agent health checks.

What it is

LM Studio is a desktop GUI for running local LLMs. You browse a model catalogue, click download, slide the GPU offload bar, and start chatting. The part that matters for the rest of my homelab is that it exposes a local OpenAI-compatible HTTP server, so anything that speaks the OpenAI API speaks to it.

The host is now agent-accessible as Zenbook/Jarvis. That means future automation can check the local AI runtime over SSH instead of waiting for someone to remote into the laptop GUI.

Why I run it

I wanted real local LLMs for two specific workloads: Home Assistant's Sleep Lab briefings (a 2 AM and a wake-up automation that summarise overnight sensor data and the night's Withings score) and Open WebUI as a daily chat client. Both work fine with hosted APIs, but the latency, privacy, and pay-per-token math all point at local for these particular jobs.

The catch was the hardware. The machine doing this work is a small laptop with an AMD Ryzen AI 9 HX 370 — a Strix Point chip with a Radeon 890M iGPU. The 890M is RDNA 3.5 (gfx1150), which isn't on the supported-GPU list for ROCm, which is what most "real" local-LLM stacks (Ollama, llama.cpp's ROCm backend) want. Ollama would run, but silently fall back to CPU.

LM Studio uses Vulkan as its compute backend, which the 890M is happy to drive. It was the only stack in my bake-off that auto-detected the iGPU on first launch — no HSA_OVERRIDE_GFX_VERSION hack, no driver wrangling, no rebuilds. That on its own was the deciding factor.

How I use it

Three clients hit the server, all using the same OpenAI-compatible endpoint:

Different jobs get different models, swapped via LM Studio's model picker:

| Model | Job | |---|---| | qwen2.5-7b-instruct (Q4_K_M) | HA daily-driver — clean instruct output, no chain-of-thought leakage in notifications. | | qwen2.5-coder-7b-instruct (Q4_K_M) | n8n code-gen workflows. | | deepseek-r1-distill-qwen-14b (Q3_K_S) | Reasoning-heavy n8n. Specifically not used for HA — emits <think> blocks that leak into phone push notifications. | | google/gemma-4-e4b (Q4_K_M) | General fallback, multimodal. |

Observed performance on Q4_K_M 7B models: roughly 16 tok/s sustained generation, 80–84% GPU compute utilization, GPU sits around 65 °C under load. The CPU is free during all of this. For a chip nobody marketed as an "AI" laptop the day I bought it, this is a fine outcome.

Setup notes

Runbook