What it is
LM Studio is a desktop GUI for running local LLMs. You browse a model catalogue, click download, slide the GPU offload bar, and start chatting. The part that matters for the rest of my homelab is that it exposes a local OpenAI-compatible HTTP server, so anything that speaks the OpenAI API speaks to it.
The host is now agent-accessible as Zenbook/Jarvis. That means future automation can check the local AI runtime over SSH instead of waiting for someone to remote into the laptop GUI.
Why I run it
I wanted real local LLMs for two specific workloads: Home Assistant's Sleep Lab briefings (a 2 AM and a wake-up automation that summarise overnight sensor data and the night's Withings score) and Open WebUI as a daily chat client. Both work fine with hosted APIs, but the latency, privacy, and pay-per-token math all point at local for these particular jobs.
The catch was the hardware. The machine doing this work is a small laptop with an AMD Ryzen AI 9 HX 370 — a Strix Point chip with a Radeon 890M iGPU. The 890M is RDNA 3.5 (gfx1150), which isn't on the supported-GPU list for ROCm, which is what most "real" local-LLM stacks (Ollama, llama.cpp's ROCm backend) want. Ollama would run, but silently fall back to CPU.
LM Studio uses Vulkan as its compute backend, which the 890M is happy to drive. It was the only stack in my bake-off that auto-detected the iGPU on first launch — no HSA_OVERRIDE_GFX_VERSION hack, no driver wrangling, no rebuilds. That on its own was the deciding factor.
How I use it
Three clients hit the server, all using the same OpenAI-compatible endpoint:
- Home Assistant, via the HACS Extended OpenAI Conversation integration (the native HA OpenAI integration doesn't let you change the base URL). One conversation entity, configured to use a 7B instruct model, used by the Sleep Lab automations.
- Open WebUI, configured as a second "OpenAI API" connection alongside the legacy Ollama one. Daily-driver chat.
- n8n, for workflows that need to reason over text or generate code in a Code node.
Different jobs get different models, swapped via LM Studio's model picker:
| Model | Job |
|---|---|
| qwen2.5-7b-instruct (Q4_K_M) | HA daily-driver — clean instruct output, no chain-of-thought leakage in notifications. |
| qwen2.5-coder-7b-instruct (Q4_K_M) | n8n code-gen workflows. |
| deepseek-r1-distill-qwen-14b (Q3_K_S) | Reasoning-heavy n8n. Specifically not used for HA — emits <think> blocks that leak into phone push notifications. |
| google/gemma-4-e4b (Q4_K_M) | General fallback, multimodal. |
Observed performance on Q4_K_M 7B models: roughly 16 tok/s sustained generation, 80–84% GPU compute utilization, GPU sits around 65 °C under load. The CPU is free during all of this. For a chip nobody marketed as an "AI" laptop the day I bought it, this is a fine outcome.
Setup notes
- Host: a separate Windows laptop, not on Proxmox — LM Studio is a desktop app and the iGPU needs the Windows driver stack.
- Agent access: SSH is available for headless health checks; GUI access is the fallback, not the first move.
- Server: "Serve on Local Network" mode (not just
127.0.0.1), port 1234. JIT loading on, KV cache offload to GPU on, hardware guardrail set to Balanced. - VRAM: 8 GB (the BIOS default for Strix Point). A 6.3 GB Q4_K_M model fits cleanly with a bit of headroom for context. Bumping the BIOS allocation to 16 GB would unlock larger models — parked until I have a workload that actually needs them.
- Reverse proxy: Open WebUI is the proxied internal browser surface, not LM Studio itself. The LM Studio port stays local-network-only.
- Update cadence: manual. The Vulkan backend has been quietly improving with every release, so I check monthly.
Runbook
- Healthy looks like: HA's conversation entity is
available, Open WebUI shows the LM Studio models in the picker, a curl to/v1/modelsreturns the loaded set. - Agent check: SSH to the Zenbook, then query the OpenAI-compatible
/v1/modelsendpoint. - HA entity flips to
unavailablebetween scheduled fires: this is the gotcha I lost the most time to. LM Studio's default is to auto-unload models after 60 minutes idle. Between the 2 AM briefing and the morning one there's a five-hour gap, so the model would unload, the HA entity would go unavailable, and the automation would silently fall back to Gemini. Fix: disable auto-unload in the server settings, or set the idle TTL to 1440+ minutes. With a single daily-use model that fits in VRAM, keeping it loaded forever has no real cost. - Open WebUI can't reach LM Studio: Open WebUI runs in Docker, so the base URL has to be
http://host.docker.internal:1234/v1, notlocalhost:1234/v1.localhostfrom inside the container resolves to the container, not the host. - HA push notifications contain
<think>...</think>blocks: you accidentally pointed HA at a reasoning model. Switch to a plain instruct model for HA; keep the reasoning model for n8n where the wrapper isn't a problem. - Where logs live: LM Studio's own log panel for inference details, the Extended OpenAI Conversation integration logs in HA for the request/response side.