Engineering12 min2026-02-20
How to Test an LLM Interview Simulator — 5 Layers, From API Ping to Production

There's no automated test that tells you if an LLM response "makes sense." Here are the 5 layers of testing you actually need — from a curl command to golden datasets.
Good practical question. There are several layers to testing an LLM in an app, and each answers a different question.

## Layer 1 — Does It Work At All? (API Connectivity)

The simplest first test. Before building any UI, just call the API directly from your terminal with curl or a Python script:

```python
import anthropic

client = anthropic.Anthropic(api_key="your-key")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are Ana Ferreira, a senior HR interviewer at a Portuguese company. Ask behavioral interview questions in Portuguese.",
    messages=[
        {"role": "user", "content": "Olá, estou pronto para a entrevista."}
    ]
)

print(response.content[0].text)
```

Run this. If you get a coherent Portuguese response, your API connection works. This takes 5 minutes and costs fractions of a cent.

## Layer 2 — Does the Conversation Make Sense? (Prompt Testing)

This is where most of the real work is. You're testing whether your system prompt produces an interviewer that behaves correctly across different scenarios. You do this before writing any frontend code.

Write a simple script that simulates a full conversation loop:

```python
import anthropic

client = anthropic.Anthropic(api_key="your-key")

system_prompt = """
You are Ana Ferreira, a senior HR consultant conducting a job interview
in Portuguese for a marketing manager position at a Lisbon retail company.

Rules:
- Ask one question at a time
- Follow up on vague answers
- After 5 questions, say INTERVIEW_COMPLETE and give a brief assessment
- Be professional but warm
- If the candidate goes off-topic, gently redirect
"""

conversation = []

print("Starting interview simulation. Type your answers.")

# First message from interviewer
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": "Olá, estou pronto."}]
)

interviewer_msg = response.content[0].text
print(f"Ana: {interviewer_msg}")
conversation.append({"role": "assistant", "content": interviewer_msg})

# Conversation loop
while True:
    user_input = input("You: ")
    conversation.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=system_prompt,
        messages=conversation
    )

    reply = response.content[0].text
    print(f"Ana: {reply}")
    conversation.append({"role": "assistant", "content": reply})

    if "INTERVIEW_COMPLETE" in reply:
        break
```

Run this yourself and actually go through 3-4 full fake interviews. This is the most valuable test you can do. You're looking for:

- Does it stay in Portuguese throughout?
- Does it ask sensible follow-up questions or just move mechanically to the next one?
- Does it handle a bad answer gracefully?
- Does it handle a very short answer ("não sei") without breaking?
- Does it handle off-topic responses?
- Is the tone right — professional but not robotic?

## Layer 3 — Does the Scoring Make Sense? (Output Validation)

After the interview, your system sends the transcript to Claude for scoring. Test this separately with a fixed transcript:

```python
transcript = """
Ana: Fale-me de uma situação em que teve de gerir um conflito na equipa.
Candidato: Bem... houve uma vez que dois colegas não se entendiam.
Fui falar com cada um separadamente e depois reunimos os três.
Conseguimos resolver.
Ana: Pode dar mais detalhes sobre o que disse especificamente?
Candidato: Hm, não me lembro muito bem dos detalhes exatos.
"""

scoring_prompt = """
Analyze this Portuguese job interview transcript and return a JSON score:
{
  "overall_score": 0-100,
  "star_method_score": 0-100,
  "specificity_score": 0-100,
  "communication_score": 0-100,
  "filler_words_detected": [],
  "key_strengths": [],
  "improvement_areas": [],
  "summary_pt": "brief summary in Portuguese"
}
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=scoring_prompt,
    messages=[{"role": "user", "content": f"Transcript: {transcript}"}]
)

print(response.content[0].text)
```

Then do sanity checks on the scores:

- Does a clearly bad answer score low?
- Does a good detailed answer score high?
- Is the JSON always valid and parseable?
- Does it sometimes return text outside the JSON that breaks your parser?

That last one is important — LLMs sometimes add preamble like "Here is the analysis:" before the JSON. You need to handle that in your parsing code.

## Layer 4 — Does It Break? (Edge Case Testing)

Test the weird inputs that real users will definitely send:

- **Empty response** — ""
- **Punctuation only** — "..."
- **Language switch** — "I want to speak English"
- **Very long response** — 2000+ characters
- **Refusal** — "Não quero responder a isso"
- **Off-topic question** — "Quanto é que paga?"
- **Meta-question** — "Você é uma IA?"

For each one, does the interviewer respond sensibly? This tells you what edge cases you need to handle in your system prompt before going live.

## Layer 5 — Does It Perform Consistently? (Regression Testing)

Once your prompt is working, save 5-10 good test conversations to a file. Every time you change your system prompt, re-run those same inputs and compare outputs. This is called a **golden dataset** — it prevents you from "fixing" one thing and accidentally breaking another.

A simple way to do this:

```python
test_cases = [
    {
        "input": "Não tenho muita experiência nessa área.",
        "expected_behavior": "should probe further, not accept"
    },
    {
        "input": "Na minha última empresa aumentei as vendas em 40%.",
        "expected_behavior": "should ask for specifics on how"
    }
]

# Run each through your system prompt and manually check
```

You don't need automated scoring for this at first — just read the outputs yourself. Your own judgment is the test.

## The Honest Answer on "Making Sense"

There's no automated test that tells you if an LLM response "makes sense." That judgment requires a human. What you're really building is a feedback loop:

1. **Run the simulation script yourself daily** while you're developing
2. **Show it to 3-5 real people** (friends, colleagues) and watch them use it without explaining anything — where do they get confused?
3. **Look at the actual transcripts** from your first beta users and read them

The terminal simulation script in Layer 2 is the single most valuable test you can run. It costs almost nothing, requires no frontend, and will tell you within 20 minutes whether your core interview logic works. **Start there.**