Adding Voice Chat Context Awareness to Firstly Academy
How I built conversation history and contextual feedback into the AI speaking tutor, reducing repetitive questions and improving student experience.
Students were asking why the AI kept forgetting what they just said. Made sense—the system had no memory between turns. Each conversation started from zero.
The Setup
Initial version was fully stateless. Voice input → transcribe → assess → respond. Simple, but every API call was independent. The AI had no idea what happened 30 seconds ago, what task the student was working on, or what mistakes they kept making.
Works fine for one-off queries. Pretty bad for anything that feels like an actual conversation.
Building Context Management
Needed a way to track conversation state without making the whole system slow or complex. Ended up with a three-part approach:
Session Layer
Added a context management layer that sits between the voice interface and the LLM. Tracks:
- Last 10 turns of conversation
- Current task type (the FCE speaking exam has 4 parts)
- Recurring error patterns
- Whether student is improving or struggling
This layer is reusable across the ecosystem. WhatsEnglish and Teacher’s Scribe can use the same pattern.
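Roughly what that session state could look like as a data structure. Sketch only; the names (SessionContext, Turn, TaskType) are made up for illustration, not the production code:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    # The four FCE speaking parts (hypothetical labels)
    INTERVIEW = 1
    LONG_TURN = 2
    COLLABORATIVE_TASK = 3
    DISCUSSION = 4


@dataclass
class Turn:
    student_text: str
    tutor_text: str
    errors: list[str] = field(default_factory=list)  # e.g. ["missing article", "past simple misuse"]


@dataclass
class SessionContext:
    session_id: str
    task: TaskType
    turns: list[Turn] = field(default_factory=list)       # capped at the last 10
    error_counts: dict[str, int] = field(default_factory=dict)
    trend: str = "steady"                                  # "improving" | "struggling" | "steady"

    def add_turn(self, turn: Turn) -> None:
        self.turns = (self.turns + [turn])[-10:]           # keep only the last 10 turns
        for err in turn.errors:
            self.error_counts[err] = self.error_counts.get(err, 0) + 1
```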
Storage Strategy
Went with a hybrid approach:
PostgreSQL for permanent storage—full conversation history, lets me analyze patterns later, track progress over time.
Redis for active sessions—keeps current context in memory for fast reads. Auto-expires after an hour. Drops latency from 200-300ms down to sub-50ms.
Trade-off is added complexity, but the performance difference is noticeable. Voice chat needs to feel real-time.
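The Redis half of that, sketched with redis-py. The key format and the connection details are assumptions; only the one-hour TTL comes from the description above:

```python
import json

import redis  # redis-py

SESSION_TTL_SECONDS = 60 * 60  # active sessions auto-expire after an hour

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def load_context(session_id: str) -> dict | None:
    """Fast path: read the active session straight from memory."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None


def save_context(session_id: str, context: dict) -> None:
    """Write back after every turn and refresh the TTL."""
    r.set(f"session:{session_id}", json.dumps(context), ex=SESSION_TTL_SECONDS)
```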
Context Window Management
Can’t just dump the entire conversation history into the LLM. Llama 3.3 has a 32k token limit and you hit it faster than you’d think.
Solution was to weight recent context higher than old context. The most recent turn gets full weight, the previous turn 0.8x, the one before that 0.64x, and so on: a 0.8 exponential decay per turn. Older stuff is still there but has less influence.
Keeps the total context under 3k tokens—about 5 recent turns plus a summary of error patterns. Prevents the AI from looping on old references while still maintaining continuity.
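One way to turn that decay into a selection rule. Sketch under two assumptions: a crude 4-characters-per-token estimate instead of the real tokenizer, and a weight cutoff that happens to land on about 5 turns; anything below the cutoff only survives through the error-pattern summary:

```python
DECAY = 0.8          # each older turn counts 0.8x as much as the one after it
TOKEN_BUDGET = 3000  # hard cap on the context block sent to the LLM
MIN_WEIGHT = 0.4     # 1.0, 0.8, 0.64, 0.512, 0.41 -> roughly 5 turns make the cut


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough approximation, good enough for budgeting


def select_turns(turns: list[str]) -> list[str]:
    """Walk backwards from the newest turn; stop when the weight or token budget runs out."""
    selected: list[str] = []
    used = 0
    weight = 1.0
    for turn in reversed(turns):                      # newest first
        cost = estimate_tokens(turn)
        if weight < MIN_WEIGHT or used + cost > TOKEN_BUDGET:
            break
        selected.append(turn)
        used += cost
        weight *= DECAY
    return list(reversed(selected))                   # back to chronological order
```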
How It Works
User speaks
↓
Deepgram transcription
↓
Session Manager checks Redis for context
↓
Context Builder assembles last 5 turns + error patterns
↓
Prompt includes conversation history
↓
Groq/Llama 3.3 generates response
↓
Azure TTS speaks response
↓
Update Redis cache + async write to PostgreSQL
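Put together as code, one turn of that loop looks roughly like this. The external calls are injected as placeholders (the real Deepgram, Groq, Azure TTS, and PostgreSQL clients sit behind them), and load_context/save_context are the Redis helpers sketched earlier:

```python
import asyncio
from typing import Awaitable, Callable


async def handle_turn(
    session_id: str,
    audio: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],    # Deepgram behind this
    generate: Callable[[str], Awaitable[str]],        # Groq / Llama 3.3 behind this
    synthesize: Callable[[str], Awaitable[bytes]],    # Azure TTS behind this
    persist: Callable[[str, dict], Awaitable[None]],  # async PostgreSQL write behind this
) -> bytes:
    student_text = await transcribe(audio)

    # Session Manager: pull the active context from Redis
    context = load_context(session_id) or {"turns": [], "errors": []}

    # Context Builder: last 5 turns verbatim plus the error-pattern summary
    history = "\n".join(context["turns"][-5:])
    errors = "; ".join(context["errors"]) or "none yet"
    prompt = (
        f"Conversation so far:\n{history}\n"
        f"Recurring errors: {errors}\n"
        f"Student: {student_text}"
    )

    tutor_text = await generate(prompt)
    speech = await synthesize(tutor_text)

    # Update the cache now; persist to PostgreSQL off the hot path
    context["turns"].append(f"Student: {student_text}\nTutor: {tutor_text}")
    save_context(session_id, context)
    asyncio.create_task(persist(session_id, context))

    return speech
```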
Results
The numbers moved pretty significantly:
- Repeat questions dropped from 32% to 4%
- Average session length went from 3.2 to 7.8 minutes
- 7-day retention jumped from 38% to 61%
More importantly, the feedback changed. Students said it felt like talking to someone who actually remembered the conversation. That retention jump suggests they’re right—people come back when they feel heard.
Technical Stuff I Ran Into
Token Budget
Full conversation history eats tokens fast. Had to be selective about what goes into the context: the last 5 turns go in verbatim, error patterns go in as a short summary, and metadata like timestamps and IDs gets dropped entirely. Keeps it manageable.
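In code terms, the context block boils down to something like this (reusing the Turn dataclass from the session-layer sketch; the exact format is illustrative):

```python
def build_context_block(turns: list[Turn], error_counts: dict[str, int]) -> str:
    """Last 5 turns verbatim, recurring errors as one summary line, no timestamps or IDs."""
    lines = []
    for t in turns[-5:]:
        lines.append(f"Student: {t.student_text}")
        lines.append(f"Tutor: {t.tutor_text}")
    recurring = [err for err, count in error_counts.items() if count >= 2]
    if recurring:
        lines.append("Recurring error patterns: " + "; ".join(recurring))
    return "\n".join(lines)
```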
Latency Requirements
Voice chat breaks down if the response takes too long. Target was under 200ms total. Database queries were killing that.
Redis cache fixed it—load context once when the session starts, update it after each turn, sync back to PostgreSQL every few turns asynchronously. If Redis somehow loses it, rebuild from the database.
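The cache-aside pattern behind that, sketched with the Redis helpers from earlier. rebuild_from_db and persist_to_db are stand-ins for the real PostgreSQL queries, and the background thread stands in for whatever async write the production code actually uses:

```python
import threading


def get_context(session_id: str, rebuild_from_db) -> dict:
    """Redis first; if the session expired or Redis restarted, rebuild from PostgreSQL."""
    cached = load_context(session_id)        # fast path, sub-50ms
    if cached is not None:
        return cached
    context = rebuild_from_db(session_id)    # slow path, only on a cache miss
    save_context(session_id, context)        # re-warm the cache for the next turn
    return context


def after_turn(session_id: str, context: dict, turn_number: int, persist_to_db) -> None:
    """Redis gets updated every turn; PostgreSQL every few turns, off the hot path."""
    save_context(session_id, context)
    if turn_number % 3 == 0:
        threading.Thread(
            target=persist_to_db, args=(session_id, context), daemon=True
        ).start()
```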
Context Drift
Early versions had this problem where the AI would over-reference past turns and get stuck in loops. Weighted context helped, but also had to add explicit instructions in the prompt: “focus on the last 2 turns, don’t repeat questions you already asked.”
Combination of architecture and prompting fixed it.
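The prompting half is just a few extra lines in the system prompt. Wording approximate, reconstructed around the instruction quoted above:

```python
SYSTEM_PROMPT = """You are an FCE speaking tutor.
Use the conversation history for continuity, but focus on the student's last 2 turns.
Don't repeat questions you already asked in this session.
Don't quote or re-explain earlier turns unless the student brings them up."""
```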
What This Unlocked
State management ended up being the interesting part of building conversational AI. The LLM itself is straightforward—you send a prompt, you get a response. Managing what goes into that prompt over multiple turns is where the actual work is.
This pattern works for any app in the ecosystem. WhatsEnglish can use the same context layer, Teacher’s Scribe can track lesson conversations the same way.
Also learned that latency matters way more than clean code. The Redis cache adds complexity, but users can’t perceive 50ms. They absolutely notice 300ms. Optimize for what people actually experience.
Next Steps
Current system resets between sessions. Next version needs cross-session memory.
Planning to add a vector database (probably Pinecone or Qdrant) to track student mistakes over weeks, not just within a single session. Eventually want the AI to “know” the student across all apps in the ecosystem—errors in Firstly inform practice suggestions in WhatsEnglish, lesson transcripts from Teacher’s Scribe feed personalized tutoring.
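None of this exists yet, but the write path could look roughly like this with Qdrant. Collection name, payload fields, and vector size are all hypothetical, and the embeddings would come from whatever model ends up generating them:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="student_errors",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def record_error(student_id: str, app: str, error_text: str, embedding: list[float]) -> None:
    """Store one mistake with enough payload to filter by student and by app later."""
    client.upsert(
        collection_name="student_errors",
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={"student_id": student_id, "app": app, "error": error_text},
            )
        ],
    )
```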
The goal is for students to feel like the apps actually know them. Shared context across the whole ecosystem.
Try it: firstly.academy
Related:
- Multi-App Database Architecture for the ESL Ecosystem (coming soon)
- Building Voice-First EdTech with Sub-200ms Latency (coming soon)