Adding Voice Chat Context Awareness to Firstly Academy
How I built conversation history and contextual feedback into the AI speaking tutor, reducing repetitive questions and improving student experience.
Students were asking why the AI kept forgetting what they just said. Made sense—the system had no memory between turns. Each conversation started from zero.
The Setup
Initial version was fully stateless. Voice input → transcribe → assess → respond. Simple, but every API call was independent. The AI had no idea what happened 30 seconds ago, what task the student was working on, or what mistakes they kept making.
Works fine for one-off queries. Pretty bad for anything that feels like an actual conversation.
Building Context Management
Needed a way to track conversation state without making the whole system slow or complex. Ended up with a three-part approach:
Session Layer
Added a context management layer that sits between the voice interface and the LLM. Tracks:
- Last 10 turns of conversation
- Current task type (the FCE speaking exam has 4 parts)
- Recurring error patterns
- Whether student is improving or struggling
This layer is reusable across the ecosystem. WhatsEnglish and Teacher’s Scribe can use the same pattern.
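Roughly what that session state could look like as a data structure. Sketch only; the names (SessionContext, Turn, TaskType) are made up for illustration, not the production code:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    # The four FCE speaking parts (hypothetical labels)
    INTERVIEW = 1
    LONG_TURN = 2
    COLLABORATIVE_TASK = 3
    DISCUSSION = 4


@dataclass
class Turn:
    student_text: str
    tutor_text: str
    errors: list[str] = field(default_factory=list)  # e.g. ["missing article", "past simple misuse"]


@dataclass
class SessionContext:
    session_id: str
    task: TaskType
    turns: list[Turn] = field(default_factory=list)       # capped at the last 10
    error_counts: dict[str, int] = field(default_factory=dict)
    trend: str = "steady"                                  # "improving" | "struggling" | "steady"

    def add_turn(self, turn: Turn) -> None:
        self.turns = (self.turns + [turn])[-10:]           # keep only the last 10 turns
        for err in turn.errors:
            self.error_counts[err] = self.error_counts.get(err, 0) + 1
```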
Storage Strategy
Went with a hybrid approach:
PostgreSQL for permanent storage—full conversation history, lets me analyze patterns later, track progress over time.
Redis for active sessions—keeps current context in memory for fast reads. Auto-expires after an hour. Drops latency from 200-300ms down to sub-50ms.
Trade-off is added complexity, but the performance difference is noticeable. Voice chat needs to feel real-time.
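The Redis half of that, sketched with redis-py. The key format and the connection details are assumptions; only the one-hour TTL comes from the description above:

```python
import json

import redis  # redis-py

SESSION_TTL_SECONDS = 60 * 60  # active sessions auto-expire after an hour

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def load_context(session_id: str) -> dict | None:
    """Fast path: read the active session straight from memory."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None


def save_context(session_id: str, context: dict) -> None:
    """Write back after every turn and refresh the TTL."""
    r.set(f"session:{session_id}", json.dumps(context), ex=SESSION_TTL_SECONDS)
```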
Context Window Management
Can’t just dump the entire conversation history into the LLM. Llama 3.3 has a 32k token limit and you hit it faster than you’d think.
Solution was to weight recent context higher than old context. The most recent turn gets full weight, the previous turn 0.8x, the one before that 0.64x, and so on: a 0.8 exponential decay per turn. Older stuff is still there but has less influence.
Keeps the total context under 3k tokens—about 5 recent turns plus a summary of error patterns. Prevents the AI from looping on old references while still maintaining continuity.
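One way to turn that decay into a selection rule. Sketch under two assumptions: a crude 4-characters-per-token estimate instead of the real tokenizer, and a weight cutoff that happens to land on about 5 turns; anything below the cutoff only survives through the error-pattern summary:

```python
DECAY = 0.8          # each older turn counts 0.8x as much as the one after it
TOKEN_BUDGET = 3000  # hard cap on the context block sent to the LLM
MIN_WEIGHT = 0.4     # 1.0, 0.8, 0.64, 0.512, 0.41 -> roughly 5 turns make the cut


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough approximation, good enough for budgeting


def select_turns(turns: list[str]) -> list[str]:
    """Walk backwards from the newest turn; stop when the weight or token budget runs out."""
    selected: list[str] = []
    used = 0
    weight = 1.0
    for turn in reversed(turns):                      # newest first
        cost = estimate_tokens(turn)
        if weight < MIN_WEIGHT or used + cost > TOKEN_BUDGET:
            break
        selected.append(turn)
        used += cost
        weight *= DECAY
    return list(reversed(selected))                   # back to chronological order
```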
How It Works
User speaks
↓
Deepgram transcription
↓
Session Manager checks Redis for context
↓
Context Builder assembles last 5 turns + error patterns
↓
Prompt includes conversation history
↓
Groq/Llama 3.3 generates response
↓
Azure TTS speaks response
↓
Update Redis cache + async write to PostgreSQL
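Put together as code, one turn of that loop looks roughly like this. The external calls are injected as placeholders (the real Deepgram, Groq, Azure TTS, and PostgreSQL clients sit behind them), and load_context/save_context are the Redis helpers sketched earlier:

```python
import asyncio
from typing import Awaitable, Callable


async def handle_turn(
    session_id: str,
    audio: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],    # Deepgram behind this
    generate: Callable[[str], Awaitable[str]],        # Groq / Llama 3.3 behind this
    synthesize: Callable[[str], Awaitable[bytes]],    # Azure TTS behind this
    persist: Callable[[str, dict], Awaitable[None]],  # async PostgreSQL write behind this
) -> bytes:
    student_text = await transcribe(audio)

    # Session Manager: pull the active context from Redis
    context = load_context(session_id) or {"turns": [], "errors": []}

    # Context Builder: last 5 turns verbatim plus the error-pattern summary
    history = "\n".join(context["turns"][-5:])
    errors = "; ".join(context["errors"]) or "none yet"
    prompt = (
        f"Conversation so far:\n{history}\n"
        f"Recurring errors: {errors}\n"
        f"Student: {student_text}"
    )

    tutor_text = await generate(prompt)
    speech = await synthesize(tutor_text)

    # Update the cache now; persist to PostgreSQL off the hot path
    context["turns"].append(f"Student: {student_text}\nTutor: {tutor_text}")
    save_context(session_id, context)
    asyncio.create_task(persist(session_id, context))

    return speech
```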
Results
The numbers moved pretty significantly:
- Repeat questions dropped from 32% to 4%
- Average session length went from 3.2 to 7.8 minutes
- 7-day retention jumped from 38% to 61%
More importantly, the feedback changed. Students said it felt like talking to someone who actually remembered the conversation. That retention jump suggests they’re right—people come back when they feel heard.
Technical Stuff I Ran Into
Token Budget
Full conversation history eats tokens fast. Had to be selective about what goes into the context: the last 5 turns go in verbatim, error patterns go in as a short summary, and metadata like timestamps and IDs gets dropped entirely. Keeps it manageable.
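In code terms, the context block boils down to something like this (reusing the Turn dataclass from the session-layer sketch; the exact format is illustrative):

```python
def build_context_block(turns: list[Turn], error_counts: dict[str, int]) -> str:
    """Last 5 turns verbatim, recurring errors as one summary line, no timestamps or IDs."""
    lines = []
    for t in turns[-5:]:
        lines.append(f"Student: {t.student_text}")
        lines.append(f"Tutor: {t.tutor_text}")
    recurring = [err for err, count in error_counts.items() if count >= 2]
    if recurring:
        lines.append("Recurring error patterns: " + "; ".join(recurring))
    return "\n".join(lines)
```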
Latency Requirements
Voice chat breaks down if the response takes too long. Target was under 200ms total. Database queries were killing that.
Redis cache fixed it—load context once when the session starts, update it after each turn, sync back to PostgreSQL every few turns asynchronously. If Redis somehow loses it, rebuild from the database.
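The cache-aside pattern behind that, sketched with the Redis helpers from earlier. rebuild_from_db and persist_to_db are stand-ins for the real PostgreSQL queries, and the background thread stands in for whatever async write the production code actually uses:

```python
import threading


def get_context(session_id: str, rebuild_from_db) -> dict:
    """Redis first; if the session expired or Redis restarted, rebuild from PostgreSQL."""
    cached = load_context(session_id)        # fast path, sub-50ms
    if cached is not None:
        return cached
    context = rebuild_from_db(session_id)    # slow path, only on a cache miss
    save_context(session_id, context)        # re-warm the cache for the next turn
    return context


def after_turn(session_id: str, context: dict, turn_number: int, persist_to_db) -> None:
    """Redis gets updated every turn; PostgreSQL every few turns, off the hot path."""
    save_context(session_id, context)
    if turn_number % 3 == 0:
        threading.Thread(
            target=persist_to_db, args=(session_id, context), daemon=True
        ).start()
```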
Context Drift
Early versions had this problem where the AI would over-reference past turns and get stuck in loops. Weighted context helped, but also had to add explicit instructions in the prompt: “focus on the last 2 turns, don’t repeat questions you already asked.”
Combination of architecture and prompting fixed it.
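The prompting half is just a few extra lines in the system prompt. Wording approximate, reconstructed around the instruction quoted above:

```python
SYSTEM_PROMPT = """You are an FCE speaking tutor.
Use the conversation history for continuity, but focus on the student's last 2 turns.
Don't repeat questions you already asked in this session.
Don't quote or re-explain earlier turns unless the student brings them up."""
```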
What This Unlocked
State management ended up being the interesting part of building conversational AI. The LLM itself is straightforward—you send a prompt, you get a response. Managing what goes into that prompt over multiple turns is where the actual work is.
This pattern works for any app in the ecosystem. WhatsEnglish can use the same context layer, Teacher’s Scribe can track lesson conversations the same way.
Also learned that latency matters way more than clean code. The Redis cache adds complexity, but users can’t perceive 50ms. They absolutely notice 300ms. Optimize for what people actually experience.
Next Steps
Current system resets between sessions. Next version needs cross-session memory.
Planning to add a vector database (probably Pinecone or Qdrant) to track student mistakes over weeks, not just within a single session. Eventually want the AI to “know” the student across all apps in the ecosystem—errors in Firstly inform practice suggestions in WhatsEnglish, lesson transcripts from Teacher’s Scribe feed personalized tutoring.
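None of this exists yet, but the write path could look roughly like this with Qdrant. Collection name, payload fields, and vector size are all hypothetical, and the embeddings would come from whatever model ends up generating them:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="student_errors",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def record_error(student_id: str, app: str, error_text: str, embedding: list[float]) -> None:
    """Store one mistake with enough payload to filter by student and by app later."""
    client.upsert(
        collection_name="student_errors",
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={"student_id": student_id, "app": app, "error": error_text},
            )
        ],
    )
```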
The goal is for students to feel like the apps actually know them. Shared context across the whole ecosystem.
Try it: firstly.academy
Related:
- Multi-App Database Architecture for the ESL Ecosystem (coming soon)
- Building Voice-First EdTech with Sub-200ms Latency (coming soon)