The moment that pushed me over the edge wasn't dramatic. It was routine.
I asked Alexa to turn off my office lights. For the thousandth time, my voice left my house, traveled to an Amazon data center, got processed somewhere I couldn't see, and came back to control a smart plug three feet away from me.
It worked.
But it felt... wrong.
That small friction turned into a bigger question: Why does something this simple require the cloud at all?
That question became Project Athena — a fully local AI voice assistant that now runs 30+ services, processes every request entirely on hardware in my home, and responds in under five seconds.
No cloud. No subscriptions. No one listening.
This is what I built, what broke along the way, and why I think the future of AI assistants is local-first.
The Problem Isn't Just Privacy — It's Architecture
We all know the trade-offs by now.
Latency. Random failures. "Sorry, I can't help with that." The quiet understanding that your voice is being processed somewhere else.
But the real issue isn't just privacy.
It's that these systems are built on a centralized model:
- Your voice leaves your home
- It's processed remotely
- A response comes back — if everything works
That model introduces friction everywhere: latency, dependency, fragility.
I wanted something different:
- 2–5 second responses
- 100% local processing
- Real contextual understanding
- Full-home coverage across 10 zones
So I built it.
This Isn't an App. It's a Distributed System.
Project Athena isn't a single service. It's a system.
At the center is an 11,000+ line LangGraph-based orchestrator that acts as the brain. Around it are dozens of specialized microservices — each responsible for a single domain.
When I say:
"Hey Jarvis, what's the weather and turn off the living room lights"
Here's what actually happens:
1. Wake word detection triggers locally
2. Speech-to-text runs on-device
3. An OpenAI-compatible gateway receives the request, with rate limiting and circuit breaking
4. The orchestrator classifies intent and scores complexity
5. Domain-specific RAG services fetch real-time data
6. A local LLM generates the response
7. A validation pipeline checks for hallucinations
8. Text is normalized for natural speech
9. Audio streams back to the correct room
End-to-end latency: under five seconds
The key isn't just that it works. It's that every step is intentional.
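The flow above can be sketched as a chain of stages. Everything here is a stand-in (the real system wires these steps to a wake-word engine, local STT, LangGraph, and actual RAG services), but the shape of the pipeline is the point:

```python
# Toy end-to-end flow; every function is a stand-in for a real service.

def transcribe(audio: str) -> str:
    """Stand-in for on-device speech-to-text."""
    return audio.lower()

def classify(text: str) -> str:
    """Stand-in for the orchestrator's intent classifier."""
    return "weather" if "weather" in text else "chat"

def fetch_context(intent: str) -> dict:
    """Stand-in for a domain-specific RAG service."""
    return {"weather": {"temp_f": 61}}.get(intent, {})

def generate(intent: str, ctx: dict) -> str:
    """Stand-in for the local LLM, validation, and normalization steps."""
    if intent == "weather" and ctx:
        return f"It's {ctx['temp_f']} degrees out."
    return "Sorry, I didn't catch that."

def handle(audio: str) -> str:
    text = transcribe(audio)
    intent = classify(text)
    return generate(intent, fetch_context(intent))
```

Each stage has one job and a narrow interface, which is what makes the individual pieces swappable and testable on their own.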
Most Queries Never Touch an LLM
One of the biggest performance wins came from a simple realization:
LLMs are expensive. Most queries don't need them.
Before any model is invoked, every request goes through a six-layer deterministic pipeline:
- STT error correction — fixes predictable transcription issues
- Slang + typo normalization — yes, it understands "no cap"
- Analytical query detection — routes long-form tasks immediately
- False memory detection — prevents hallucinated context
- Emotional classification — distinguishes intent beyond keywords
- Pattern matching — 21 intent categories via regex
Only if none of those layers resolves the request does the system fall through to an LLM.
In practice, most queries never need one.
That's where the real latency savings come from.
Intelligence Comes From Specialization
Instead of building one massive "AI brain," I broke knowledge into domains.
There are currently 30+ independent RAG services:
- Weather
- Sports
- Dining
- Flights
- Stocks
- Recipes
- Directions
- Tesla vehicle data
- Local events
- And more
Each service runs independently, caches and aggregates real-time data, registers with a central service registry, and pulls encrypted API keys at startup.
When you ask about a Ravens game, the system doesn't hit a single API — it queries multiple sources simultaneously and synthesizes the best answer.
When you ask for a restaurant, you don't get raw data. You get a recommendation.
That separation — between data retrieval and response synthesis — is what makes the system feel intelligent.
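The registration-and-lookup pattern can be sketched as a simple registry. This one is in-memory (the real services presumably register over the network and fetch their encrypted keys separately), and the weather handler is a stand-in:

```python
from typing import Callable

# Minimal service registry: each RAG service registers a handler for
# its domain; the orchestrator looks services up by domain name.

class ServiceRegistry:
    def __init__(self) -> None:
        self._services: dict[str, Callable[[str], dict]] = {}

    def register(self, domain: str, handler: Callable[[str], dict]) -> None:
        self._services[domain] = handler

    def query(self, domain: str, question: str) -> dict:
        handler = self._services.get(domain)
        if handler is None:
            raise KeyError(f"no RAG service registered for '{domain}'")
        return handler(question)

registry = ServiceRegistry()
# A stand-in weather service registering itself at startup.
registry.register("weather", lambda q: {"temp_f": 61, "sky": "clear"})
```

Because each domain owns its own retrieval and caching, adding a new capability means registering one more service, not touching the orchestrator.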
Not Every Question Deserves the Same Model
Not all queries are equal.
So instead of using one model for everything, I built a regex-based complexity scoring system that evaluates comparisons, conditionals, aggregations, time references, and multi-entity queries.
That score determines which model gets used:
- Simple → Qwen3 4B (quantized)
- Complex → Qwen 2.5 14B
- Super complex → Qwen 2.5 32B
Every query gets the minimum viable intelligence required. That's what keeps the system both fast and capable.
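A toy version of that scoring, with invented signal patterns and thresholds (the model names follow the tiers above):

```python
import re

# Regex-based complexity scoring: count structural signals in the
# query, then map the score to a model tier. Patterns and thresholds
# here are illustrative, not the production values.

SIGNALS = [
    re.compile(r"\b(compare|versus|vs\.?)\b", re.I),          # comparisons
    re.compile(r"\b(if|unless)\b", re.I),                     # conditionals
    re.compile(r"\b(total|average|sum)\b", re.I),             # aggregations
    re.compile(r"\b(yesterday|last week|tomorrow)\b", re.I),  # time references
    re.compile(r"\band\b.*\band\b", re.I),                    # multi-entity
]

def pick_model(query: str) -> str:
    score = sum(1 for sig in SIGNALS if sig.search(query))
    if score <= 1:
        return "qwen3-4b"      # simple
    if score <= 3:
        return "qwen2.5-14b"   # complex
    return "qwen2.5-32b"       # super complex
```

The scorer itself costs microseconds, so routing adds essentially no latency while keeping the big models out of the hot path.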
Hallucination Isn't a Prompt Problem — It's a System Problem
Telling an LLM "don't hallucinate" doesn't work.
So I treated trust as a system-level concern.
Every response goes through a four-stage validation pipeline:
- Bounds checking — length, structure
- Pattern detection — dates, prices, phone numbers
- Data support validation — does the source data actually contain these claims?
- Secondary LLM verification — a separate model evaluates factual consistency
If a response includes a phone number, address, or time that wasn't present in the data — it gets rejected.
If necessary, the system falls back and regenerates.
Slower in edge cases. But correct.
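The third stage, checking that concrete claims actually appear in the retrieved data, can be sketched like this (the claim patterns are illustrative; the real pipeline also covers dates and prices):

```python
import re

# Data-support validation: extract specific claims (phone numbers,
# times) from the response and reject it if any claim is absent from
# the source data the RAG layer returned.

CLAIM_PATTERNS = [
    re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-.]?\d{4}"),      # phone numbers
    re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b", re.I),   # clock times
]

def supported(response: str, source: str) -> bool:
    """Return False if the response asserts a specific the data lacks."""
    for pattern in CLAIM_PATTERNS:
        for claim in pattern.findall(response):
            if claim not in source:
                return False  # hallucinated specific: reject
    return True
```

A rejected response triggers regeneration, which is exactly the "slower in edge cases, but correct" trade described above.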
Smart Home Control Needed to Feel Human
The smart home layer ended up being far more complex than expected.
It's not just "Turn off the lights." It's:
- "Did I lock the doors?"
- "Lock everything down for the night"
- "Is anyone home?"
- "Set the lights to Ravens colors"
The system handles 70+ intent patterns, occupancy awareness, exclusion rules (don't kill ambient lighting), and randomized responses (no robotic repetition).
It behaves less like a command parser — and more like an assistant.
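One exclusion rule, in miniature: a blanket "lights off" skips devices tagged as ambient. The device names and tagging scheme are invented for illustration:

```python
# Toy device table: a blanket lights-off command should not kill
# ambient lighting, and should leave non-light devices alone.

DEVICES = {
    "living room lamp": {"type": "light", "ambient": False, "on": True},
    "hallway glow":     {"type": "light", "ambient": True,  "on": True},
    "front door lock":  {"type": "lock",  "on": True},
}

def lights_off(devices: dict) -> list[str]:
    """Turn off every non-ambient light; return what was switched."""
    turned_off = []
    for name, dev in devices.items():
        if dev["type"] == "light" and not dev.get("ambient"):
            dev["on"] = False
            turned_off.append(name)
    return turned_off
```

Rules like this are what separate "executed the command" from "did what the person actually meant."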
The Hardest Problem Was Text-to-Speech
Not orchestration. Not models. Not infrastructure.
Speech normalization.
Turning "123 N Main St, Baltimore, MD at 10:30 AM" into something that sounds natural required address parsing, directional context, time formatting, and abbreviation handling.
That layer alone is nearly 800 lines of logic.
And it matters more than almost anything else — because it's what users actually hear.
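To give a flavor of what that layer does, here is a tiny abbreviation-expansion pass, a handful of made-up rules standing in for those ~800 lines:

```python
import re

# Expand address abbreviations and directionals so the TTS engine
# reads them naturally. This table is a small, illustrative subset.

ABBREV = {"St": "Street", "Rd": "Road", "Ave": "Avenue",
          "N": "North", "S": "South", "MD": "Maryland"}

def speakable(text: str) -> str:
    for abbr, full in ABBREV.items():
        # Match the abbreviation as a whole word, with an optional period.
        text = re.sub(rf"\b{abbr}\b\.?", full, text)
    return text
```

Even this toy version shows why the layer balloons: directionals like "N" are single letters, so every rule needs word-boundary discipline to avoid mangling ordinary words.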
This Only Works Because It's Manageable
At a certain point, building the system wasn't the hard part. Managing it was.
So I built a full admin platform: model routing configuration, RAG service control, encrypted API key management, feature flags, real-time pipeline tracing, and service orchestration + restart control.
Without it, this would be a demo. With it, it's something I actually use every day.
What I Learned
The biggest lessons weren't about AI. They were about systems.
- Deterministic preprocessing beats model-based classification
- Local models are production-ready — if you design for them
- Hallucination is a validation problem, not a prompting problem
- Microservices work because domains behave differently
- Most failures are boring — and you need to design for them
Why This Matters
We're at an inflection point.
Cloud AI is getting more powerful. But local AI is getting good enough.
And "good enough" — when it's faster, private, and fully under your control — is a fundamentally different value proposition.
This isn't the final form of AI assistants. It's my attempt at bridging the gap.
If this resonates — or even if you think parts of it are wrong — I'd genuinely love feedback, ideas, or contributions.
Because even if this isn't the answer... it's a step in the right direction.
Project Athena (open source): github.com/jstuart0/project-athena-oss