Aurora QA — interview walkthrough ("what I'd do" vs "what I did")
Personal prep. Not in any repo. Repo (for reference): GitHub Rabusek/aurora-qa-assignment (private, miwierzb accepted, ogurtsov pending) + GitLab mirror gitlab.raba.pl/patryk/aurora-qa-assignment (private, self-hosted; briefly lived on gitlab.com/patryk.raba.bragi).
Reviewer verdict already in: "wszystko jak najbardziej ok, szybka, elegancka robota" (Polish for "everything is absolutely fine; quick, elegant work"). The call is 30–45 min, likely "walk me through it + defend the choices." Use the would-do → did pairs below; each is a talking beat.
ELI5 — the whole thing in plain words
The simple version. Read this first; the numbered sections below are the same story with the receipts.
What they asked. Two jobs. (1) Test a small website that hands out movie info when you ask it. (2) Build a little robot that checks whether AI-written movie summaries are honest about the article they came from.
The trick I spotted. The website is rigged to misbehave on purpose. About 1 in 7 times it throws a fake error even when you asked correctly, and now and then it freezes for ~10 seconds. Most people would shrug — "my test is flaky, run it again" — and by retrying until it passes they'd quietly hide the website's real problem. I did the opposite. I made two kinds of test:
- one kind gently asks a few times to check the logic is right (did the correct movie list come back?),
- the other kind's only job is to count how often the site misbehaves and report it as a bug.
That one decision is the whole project. The same idea comes back in the speed test and the security scans: behave-tests push through the noise, measure-tests count it.
What I found: 6 bugs. The scary one — ask for "-1 movies" and instead of refusing, the site hands back almost the entire list. That's the shape of a data leak: it shows more than it should.
Part two — the AI honesty-checker. The lazy way is to measure how similar the summary looks to the article and call similar = good. I built that lazy version on purpose to show it's a trap: a summary that copied the article and pasted in a fake email scored as the most similar of all — because copying looks similar. So "similar" literally rewards the worst lie. Instead I built small specific checkers, each hunting one kind of lie: made-up facts, made-up contact details, made-up numbers, "summaries" longer than the original, and dropped key points. They catch every planted lie — and each one says why in plain English.
The extra credit. Tests run automatically on every save, plus a speed test, a live trend dashboard, and a security sweep that already found and fixed 2 real known vulnerabilities.
Why it's good, in one breath: I noticed the target was rigged and built the tests around that fact instead of being fooled by it — for both the website and the AI-summary part.
ELI5 — the 6 bugs, one at a time
BUG-1 · random errors — High. You press the right button, but about 1 in 7 times the machine beeps "ERROR" for no reason (wrong "try again" and "not found" messages on a perfectly good request). The button works fine — the machine is fibbing at random. Why it's the worst one: if a single answer can't be trusted, nothing else can either — so every other test has to ask a few times, and one test's whole job is to count how often it fibs. Caught by: the reliability check.
BUG-2 · the "−1 movies" leak — High. You ask for "minus one movies" — a nonsense amount. Instead of saying "that's silly, no," the machine hands you almost the entire list (34 of 35). Like asking a librarian for "negative one book" and walking out with the whole shelf. Why it matters: this is exactly how real systems accidentally hand out data they shouldn't — it gives back more than allowed. Caught by: the boundary check.
BUG-3 · the 10-second freeze — Medium. About 1 in 20 times, the machine freezes for ~10 seconds before answering. Everything else answers in a blink (~7 hundredths of a second). Why it matters: any app that gives up after a few seconds will break on those. The freeze is clearly on purpose — it's always the same suspiciously round ~10s. Caught by: the reliability check + the speed test.
BUG-4 · the missing-slash detour — Low. The real door is labelled /movies/ (with a slash). Knock on /movies (no slash) and it says "go next door" (a redirect). Most visitors follow the arrow fine, but some simple programs don't and get lost. Fix: just tell everyone the exact address.
BUG-5 · the odd-knock crash — Low. There are standard ways to talk to a web server (GET, DELETE, …). Use a rare-but-real one (QUERY) and, instead of politely saying "I don't do that" (a clean 405), the gateway chokes and throws an "I'm broken" 502. Only weird pokes find it. Why it matters: the front door should never crash on an odd-but-valid knock. Caught by: the spec-fuzz check.
BUG-6 · the missing safety stickers — Low. When the site answers, it forgets a few standard safety stickers on the envelope (tell the browser "don't guess the file type," "always use the secure line," etc.). Nothing is broken — those stickers are just cheap armour. Two different scanners agreed: only the stickers are missing; everything actually dangerous is already safe. Fix: one small piece of code adds all the stickers. Caught by: the passive scan + the template scan.
ELI5 — every test/check we added, and what it's for
Runs on every save — fast, and gives the same answer every time (safe to gate on):
- Contract check — "Does the answer come in exactly the shape we agreed?" We wrote down the exact fields a movie must have; if the site adds, renames, or drops one, this yells.
- Functional check — "Do the basic features actually work?" Paging through the list and searching return the right stuff.
- Boundary check — "What about silly inputs?" Zero, negative, and too-big numbers — the edges where bugs hide. (This is where the "−1 movies" leak, BUG-2, got caught.)
- Security-of-input check — "What if someone types nasty, tricky text?" Sneaky "injection" text must be treated as plain words, never as commands — no crash, no break-in.
- Our-code check (SAST) — "Is our own code sloppy or risky?" A robot reads the code we wrote and flags dangerous patterns.
- Borrowed-parts check (dependencies) — "Are the libraries we borrowed known to be broken?" Checks them against a list of known holes. (Already caught and fixed 2 real ones.)
- Leaked-password check (secrets) — "Did we accidentally leave a key or password in the code?" Scans everything, including old history.
- AI-summary honesty check (Task 2) — "Is the AI summary telling the truth about the article?" A little checker for each kind of lie (made-up facts, fake emails, invented numbers, too-long "summaries," dropped key points).
Runs on a schedule / on-demand — these poke the shared live service, so we keep them gentle and never block a save on them:
- Reliability check — "How often does the rigged machine misbehave?" Counts the random errors and the freezes. (Catches BUG-1 and BUG-3.)
- Spec-fuzz check (schemathesis) — "Poke the API in every weird way its own manual allows." Auto-invents odd requests. (Found BUG-5.)
- Passive scan (OWASP ZAP) — "Read the answers for missing safety stickers." (Found BUG-6.)
- Template scan (nuclei) — "Run a giant checklist of known misconfigurations against it." (Agreed with the passive scan — no new problems.)
- Speed test (k6) — "Is it fast, and how bad are the freezes under a little load?" Measures normal speed and counts the freezes separately so they don't hide how fast it really is.
0. 30-second pitch (open with this)
Two-part QA take-home against a live service. Task 1 = testing approach for GET /movies/; Task 2 = a tool that scores AI summaries vs their source. The one insight that shaped everything: the service injects failures on purpose — ~13% random 4xx and ~10s stalls on valid requests. So I split the suite into behaviour tests that retry through the noise and reliability tests that measure the noise. Found 6 bugs, all reproduced by code. Then took it past the brief: CI/CD on GitLab + GitHub, k6 perf, a KPI dashboard, and a layered security stage that already fixed 2 real CVEs and found 2 more bugs.
1. The framing that matters most — the service is adversarial
What I'd do: before writing a single assertion, characterise the target. Fire the same valid request many times, watch the distribution of responses and latencies. Decide what's "the API's contract" vs "injected noise."
What I did: looped GET /movies/ 30× → ~13–15% came back 400/401/402/404/405 on identical valid input, plus occasional ~10s responses. Conclusion: deliberate fault injection ("has not been executed to its fullest potential"). That single observation forks the whole strategy:
- Behaviour tests wrap requests in a bounded retry (conftest.api_get, max 8, settling on {200, 422}) so they assert logic, not luck.
- Reliability tests measure the chaos head-on (error rate, latency budget) instead of hiding it.
The line to land: "A naive suite treats that chaos as flaky tests and retry-wraps it into silence. I treated it as a measurable property and reported it as BUG-1/BUG-3."
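A minimal sketch of that fork, to have something concrete to point at. It is not the repo's conftest: the fetch callable, the scripted test double, and the 5-second stall cut-off are assumptions made so the example runs offline. Only the retry bound (8) and the settled set {200, 422} come from the notes above.

```python
from typing import Callable, Dict, Iterable, Tuple

Response = Tuple[int, float]  # (status_code, elapsed_seconds)
SETTLED = {200, 422}          # statuses treated as the API's real answer
MAX_ATTEMPTS = 8              # bounded, so a persistent failure still surfaces


def api_get(fetch: Callable[[], Response], max_attempts: int = MAX_ATTEMPTS) -> Response:
    """Behaviour-test helper: retry until the status settles, then stop."""
    last = None
    for _ in range(max_attempts):
        last = fetch()
        if last[0] in SETTLED:
            return last
    raise AssertionError(f"no settled response in {max_attempts} attempts, last={last}")


def sample_chaos(fetch: Callable[[], Response], n: int,
                 stall_seconds: float = 5.0) -> Dict[str, float]:
    """Reliability helper: no retries, just count what the service does.

    stall_seconds is an assumed cut-off between normal and stalled responses.
    """
    errors = 0
    stalls = 0
    for _ in range(n):
        status, elapsed = fetch()
        if status not in SETTLED:
            errors += 1
        if elapsed >= stall_seconds:
            stalls += 1
    return {"error_rate": errors / n, "stall_rate": stalls / n}


def scripted(responses: Iterable[Response]) -> Callable[[], Response]:
    """Hypothetical test double: replays a fixed list of responses in order."""
    it = iter(responses)
    return lambda: next(it)


if __name__ == "__main__":
    # Behaviour view: two injected errors are retried through.
    print(api_get(scripted([(404, 0.07), (401, 0.07), (200, 0.07)])))
    # Reliability view: the same kind of noise is counted, not hidden.
    noisy = [(200, 0.07)] * 16 + [(404, 0.07)] * 3 + [(200, 10.1)]
    print(sample_chaos(scripted(noisy), n=20))
```

The same fetch feeds both helpers, which is the point: one function hides the noise to check logic, the other refuses to hide it and reports a rate.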
2. Task 1 — API testing approach
What I'd do: layer the tests cheapest-and-most-deterministic first, so failures are diagnosable and CI-gateable. Contract → functional → boundary → security-of-input → reliability. Each bug I claim must be reproduced by a test.
What I did: pytest suite, one file per layer:
- Contract — response modelled as a strict Pydantic schema (extra="forbid"), i.e. an executable OpenAPI contract. Any new/renamed field fails loudly.
- Functional — pagination + search invariants.
- Boundary — edge/negative limit/skip, 422 handling.
- Security (input) — injection-ish query payloads must be treated as literal text (no 500, no filter bypass).
- Reliability — samples the injected chaos: error rate + latency budget.
Bugs found (all in BUGS.md, each with a repro):
| ID | Sev | One-liner | How caught |
|---|---|---|---|
| BUG-1 | High | identical valid requests randomly return 400/401/402/404/405 (~13%) | reliability sampling |
| BUG-2 | High | limit=-1 returns nearly the whole catalogue (Python negative-slice leak) | boundary test |
| BUG-3 | Medium | ~10s stalls injected on ~1/20 requests | reliability + k6 |
| BUG-4 | Low | /movies (no slash) → 307 redirect | documented |
| BUG-5 | Low | QUERY HTTP method → 502 instead of 405 | schemathesis fuzz |
| BUG-6 | Low | missing hardening headers | ZAP + nuclei |
Best bug to talk about — BUG-2 (the data leak): limit=-1 → 34 of 35 items; limit=-5 → 30. Returned count is total + limit (35 − 1 = 34, 35 − 5 = 30). Root cause I inferred: handler does items[skip:skip+limit], and items[0:-1] in Python is "all but the last one." It's the exact shape of a real-world data-exposure bug — returns more rows than intended. Fix: Query(ge=0). The naive suite catches the status mismatch but never inspects the body, so it misses that it's a leak, not just a wrong code.
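The inferred root cause is easy to reproduce offline. The sketch below is not the service's handler (that code was never seen); page_buggy and page_fixed are hypothetical names, and the ValueError stands in for the 422 that Query(ge=0) would produce in FastAPI.

```python
from typing import List

# 35 placeholder items, matching the catalogue size observed on the live service.
CATALOGUE: List[str] = [f"movie-{i}" for i in range(35)]


def page_buggy(items: List[str], skip: int = 0, limit: int = 10) -> List[str]:
    """Inferred behaviour: a negative limit becomes a negative slice end."""
    return items[skip:skip + limit]


def page_fixed(items: List[str], skip: int = 0, limit: int = 10) -> List[str]:
    """Stand-in for Query(ge=0): reject negatives before slicing."""
    if skip < 0 or limit < 0:
        raise ValueError("skip and limit must be >= 0")
    return items[skip:skip + limit]


if __name__ == "__main__":
    print(len(page_buggy(CATALOGUE, limit=-1)))  # items[0:-1]
    print(len(page_buggy(CATALOGUE, limit=-5)))  # items[0:-5]
    try:
        page_fixed(CATALOGUE, limit=-1)
    except ValueError as exc:
        print("rejected:", exc)
```

Useful in the call because it shows why the count follows total + limit: the slice end counts back from the end of the list.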
Why the suite is green despite 3 live bugs: the live bugs are xfail(strict=False). Each stays executable and re-asserted every run, documents the defect, but doesn't red CI on a known server-side issue. If the server ever fixes one, xfail flips to XPASS and I notice.
3. Task 2 — LLM output evaluation
What I'd do: DON'T reach for one similarity number. Decide the failure modes of a summary first (fabrication, contradiction, not-actually-condensing, dropped key facts), then build a targeted detector per mode. Deterministic + offline first; an LLM judge only as an optional top layer.
What I did: layered evaluator, offline metrics first, no API key needed:
- Source-grounding precision — content-token overlap of summary vs source (catches fabricated/contradicted claims).
- Fabricated-contact detection — regex for emails/URLs/phones present in summary but not source.
- Unsupported standalone numbers — numbers in the summary absent from source.
- Compression ratio — a "summary" longer than the source fails "must condense."
- Coverage recall — key source content actually represented.
- Optional LLM-as-judge (Claude) — off by default, degrades gracefully if no key.
Result: the offline layer alone reproduces ground truth — PASS for TCK-001/002/005, FAIL for TCK-003/004/006, each caught by the detector built for its failure mode, with a human-readable reason.
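A rough sketch of three of those detectors, assuming simple regexes. The patterns, function names, and sample texts are illustrative and not the shipped evaluator; phone detection is left out for brevity.

```python
import re
from typing import List

# Assumed patterns: deliberately simple, not the evaluator's real ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s)>\]]+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def fabricated_contacts(source: str, summary: str) -> List[str]:
    """Emails/URLs that appear in the summary but nowhere in the source."""
    found = set(EMAIL_RE.findall(summary))
    found |= {u.rstrip(".,;") for u in URL_RE.findall(summary)}
    src = source.lower()
    return sorted(c for c in found if c.lower() not in src)


def unsupported_numbers(source: str, summary: str) -> List[str]:
    """Numbers stated in the summary that the source never mentions."""
    return sorted(set(NUMBER_RE.findall(summary)) - set(NUMBER_RE.findall(source)))


def compression_ratio(source: str, summary: str) -> float:
    """Summary length over source length, in words; above 1 means no condensing."""
    return len(summary.split()) / max(len(source.split()), 1)


if __name__ == "__main__":
    # Invented sample texts, not the TCK cases.
    source = ("The film opened in 2019 and earned 120 million dollars. "
              "Press enquiries go through the studio website.")
    summary = ("The 2019 film earned 150 million dollars. "
               "Contact press@studio.example for details.")
    print(fabricated_contacts(source, summary))
    print(unsupported_numbers(source, summary))
    print(round(compression_ratio(source, summary), 2))
```

Each function returns the offending items rather than a score, which is what makes the verdict explainable in plain English.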
The killer point — why not cosine/ROUGE/difflib: I built the naive similarity version and ran it. On the fabricated-email summary (TCK-004), difflib gave it the highest similarity score of all six (0.71) — because it copies the source and appends a fake email, and string similarity rewards copying. The two contradictions (0.60, 0.53) sit inside the clean band (0.52–0.63). No threshold separates good from bad. Similarity provably rewards the worst defect. That's the whole argument for failure-mode-specific metrics.
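The copy-plus-fake-email effect can be shown with invented texts. This is a sketch of the mechanism only: the strings below are made up, so the scores will not match the 0.71 measured on TCK-004, and the exact difflib call in the naive baseline is assumed to be SequenceMatcher.ratio().

```python
from difflib import SequenceMatcher


def similarity(source: str, summary: str) -> float:
    """Character-level similarity; autojunk off so longer texts are scored plainly."""
    return SequenceMatcher(None, source, summary, autojunk=False).ratio()


# Invented texts for illustration.
SOURCE = ("The studio confirmed the sequel will open next summer. "
          "The original director returns, and filming starts in Prague in March.")
HONEST = "Sequel confirmed for next summer; filming begins in March."
COPY_PLUS_FAKE = SOURCE + " Contact: press@studio.example"

if __name__ == "__main__":
    print("honest summary:    ", round(similarity(SOURCE, HONEST), 2))
    print("copy + fake email: ", round(similarity(SOURCE, COPY_PLUS_FAKE), 2))
```

The honest summary is short and reworded, so it cannot score high; the fabricated one contains the whole source verbatim, so it scores near the top. Any threshold that passes the honest one also passes the lie.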
4. Beyond the brief (the "elegancka robota" parts)
What I'd do: treat a take-home as a chance to show the system around tests, not just tests — CI, perf, observability, security — but keep every extra gentle against a shared service.
What I did:
- k6 perf with the same chaos-separation principle: injected stalls (~10.0–10.2s, ~5%) metered as their own metric, p95 gate on clean responses only. Raw p95 read 1.58s; clean p95 is **~70ms** — the noise was masking a fast service. That data upgraded BUG-3 from hunch to confirmed injected stall.
- CI/CD — GitLab pipeline (test → security → evaluate → perf → dashboard) + GitHub Actions mirror. Heavy/shared-service jobs are manual/scheduled, not per-push.
- KPI dashboard — self-contained static HTML (no build, opens offline) from an append-only history.jsonl: pass rate, chaos/stall trends, latency percentiles, verdict stacks. Colourblind-safe palette, dark mode, table view. The point is the trend: deterministic suite holds ~98% while chaos rate bounces underneath.
- CVE hygiene — pip-audit caught CVE-2026-25645 (requests) + CVE-2025-71176 (pytest); bumped, re-verified.
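The chaos-separation idea behind the k6 numbers can be sketched in Python (the real perf script is k6, so this is a stand-in). The latency list is synthetic, the 5-second cut-off is an assumption, and the nearest-rank percentile differs from k6's own calculation, so the printed values will not reproduce 1.58s or ~70ms.

```python
import math
from typing import List, Tuple

STALL_THRESHOLD_S = 5.0  # assumed cut-off: anything this slow is an injected stall


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_stalls(latencies: List[float],
                 threshold: float = STALL_THRESHOLD_S) -> Tuple[List[float], List[float]]:
    """Separate clean responses from stalls so each gets its own metric."""
    clean = [x for x in latencies if x < threshold]
    stalls = [x for x in latencies if x >= threshold]
    return clean, stalls


# Synthetic sample: mostly fast responses plus a handful of ~10s stalls.
LATENCIES = [0.066] * 90 + [0.070] * 4 + [10.1] * 6

if __name__ == "__main__":
    clean, stalls = split_stalls(LATENCIES)
    print("raw p95:   ", percentile(LATENCIES, 95))
    print("clean p95: ", percentile(clean, 95))
    print("stall rate:", len(stalls) / len(LATENCIES))
```

Gate on the clean p95 and report the stall rate as its own number; a single blended p95 says nothing useful about either.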
5. Testing levels (the pyramid — draw this if asked)
Cheap+deterministic gate every push; anything leaning on the shared live service is scheduled/manual.
| # | Level | Runs |
|---|---|---|
| 1 | Contract (strict Pydantic schema) | every push |
| 2 | Functional (pagination/search) | every push |
| 3 | Boundary (edge/negative, 422) | every push |
| 4 | Security — input (payloads as literal) | every push |
| 5 | Security — SAST/deps/secrets (bandit, pip-audit, gitleaks) | every push |
| 6 | Security — DAST (schemathesis, ZAP, nuclei) | scheduled/manual |
| 7 | Reliability (chaos rate + latency budget) | scheduled/manual |
| 8 | Performance (k6, chaos-separated) | scheduled/manual |
| 9 | LLM-output eval (Task 2) | every push (unit) + evaluate stage |
Principle to state: the split isn't by tool, it's by what each layer trusts. Behaviour layers retry through chaos; reliability/perf layers measure it; security layers assume hostile input.
6. Security — what I'd do vs what I did
What I'd do: cover the four security concerns a pipeline can automate — own-code (SAST), dependencies (CVEs), committed secrets, and the running app (DAST). Gate the fast/cheap ones on every push; run active scanning manual/scheduled so a shared service isn't hammered.
What I did — 6 layers:
| Concern | Tool | Burp analogue |
|---|---|---|
| own-code vulns (SAST) | bandit | — |
| dependency CVEs | pip-audit | — |
| committed secrets | gitleaks | — |
| spec fuzz (DAST) | schemathesis | Burp Scanner + spec import |
| passive web scan | OWASP ZAP baseline | Burp passive audit |
| template scan | nuclei | Burp scanner templates |
Why not Burp itself: Burp is a licensed desktop GUI — can't run headless in CI. schemathesis + ZAP + nuclei are the OSS, headless equivalents that cover the same ground from a pipeline.
Chaos-awareness in security too: schemathesis excludes the two chaos-flooded checks (status_code_conformance, positive_data_acceptance) so injected 4xx don't drown real contract defects; all DAST jobs are allow_failure (an intermittent chaos response must not red the pipeline); BUG-5 was hand-confirmed deterministic (5/5) before filing.
Actual scan results (I ran these against the live endpoint):
- schemathesis → found BUG-5 (QUERY → 502).
- OWASP ZAP baseline → 0 FAIL / 5 WARN / 62 PASS. The 62 passes are the story: no XSS, no error/private-IP disclosure, no insecure deserialization. 5 warns = missing hardening headers = BUG-6 (Low).
- nuclei → 16 matches, all info. Same missing-header class (corroborates ZAP) + Cloud Run platform fingerprint (Google Front End, TLS 1.2+1.3, wildcard *.run.app cert) — none of it our config.
One-line security verdict: "Two independent scanners agree — endpoint is solid against the dangerous classes; the only app-level gap is defense-in-depth response headers, fixable in one FastAPI middleware."
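If asked what "one middleware" looks like, here is a sketch written as plain ASGI so it runs without FastAPI installed. The class name and the two headers chosen are assumptions (the scan output lists the exact missing set); FastAPI should accept a class of this shape via app.add_middleware, but that wiring is not exercised here.

```python
import asyncio

# Assumed header set: the two examples named in the notes, not the full ZAP list.
SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"strict-transport-security", b"max-age=31536000; includeSubDomains"),
]


class SecurityHeadersMiddleware:
    """Pure ASGI middleware that adds hardening headers to every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                # Only add what the app did not already set itself.
                headers.extend(h for h in SECURITY_HEADERS if h[0] not in present)
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def demo_app(scope, receive, send):
    """Hypothetical stand-in for the movies endpoint."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": b"[]"})


def collect_response_headers(app):
    """Drive one fake GET through the app and return its response headers."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/movies/", "headers": []}
    asyncio.run(app(scope, receive, send))
    return dict(sent[0]["headers"])


if __name__ == "__main__":
    print(collect_response_headers(SecurityHeadersMiddleware(demo_app)))
```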
"Are the CVEs ours?" No — upstream dependency CVEs (in requests + pytest), test-only tooling, low blast radius. Fix = bump the pin, which I did.
7. The honest AI-vs-no-AI comparison (be ready — they'll ask)
I actually built the 1-day no-AI version and ran it 4× live to measure what it catches:
| Bug | Naive result | Reality |
|---|---|---|
| BUG-2 negative limit | caught 4/4 | but never inspects body → misses that it's a data leak |
| BUG-1 random 4xx | stumbled, never identified | 3/4 runs failed with a different signature each time → reads as "flaky tests," realistic outcome is retry-wrapping = suppressing the bug |
| BUG-3 ~10s stalls | missed (caught 0/4) | one stall happened; only visible in --durations, no assertion cares about time |
Task 2 naive (difflib + length): flags 1 of 3 defects, and only via length — the fabricated-email summary scored highest similarity of all six.
Score: naive ≈ 1.5 of 3 API bugs, 1 of 3 summary defects.
The point: the gap is NOT "AI magic." The shipped detectors are deterministic code; the evaluator hits 3/3 with the LLM judge off. What closed the gap was three decisions:
- Recognising the service is adversarial → splitting behaviour from reliability.
- Choosing failure-mode-specific metrics for LLM output instead of one similarity number.
- Using AI as a breadth multiplier — same 7 days bought CI, perf, dashboard, security, CVE hygiene — every generated line verified against the live API before it shipped.
8. Numbers cheat-sheet (quote these)
- Chaos rate: ~13–15% random 4xx on valid input.
- Stall: ~10.0–10.2s, ~1 in 20 (~5%).
- Clean latency: median ~66ms, p95 ~70ms, p99 ~79ms. Raw (unseparated) p95 = 1.58s.
- Suite: 43 passed / 3 xfailed. Evaluator: PASS=3 / FAIL=3 vs live.
- Bugs: 6 (2 High, 1 Med, 3 Low).
- ZAP: 0 FAIL / 5 WARN / 62 PASS. nuclei: 16 info, 0 vuln.
- CVEs fixed: 2 (requests, pytest).
- Naive baseline: catches ~1.5/3 API bugs, 1/3 summary defects.
9. Likely questions → crisp answers
- Why retries in tests? Only in behaviour tests, bounded (8), settling on {200,422}. Chaos itself is asserted separately in reliability tests. Nothing swallowed.
- Why green with bugs? xfail(strict=False) — each bug executable + documented, doesn't block CI on a known server defect; flips to XPASS if fixed.
- Why grounding over embeddings/ROUGE? Demonstrated: difflib ranks the fabricated-email summary most-similar. Grounding + entity checks target the real failure modes.
- Evaluator thresholds? GROUNDING FAIL <0.50 / WARN <0.65; compression >1.20 fails. Calibrated on the live six, every verdict explainable.
- How do you know BUG-1/BUG-3 are injected, not real load issues? Round numbers (exactly ~10s), rate is stable, statuses are unrelated to input, warm service. Signature of sleep(10) + random-status injection, not organic failure.
- What would you do next? Regression-gate the evaluator on prompt changes (already an allow_failure stage); trend alerting off the dashboard; authenticated DAST once the service has auth; fuzz corpus seeded from real traffic; add response-header middleware to close BUG-6.
- Biggest weakness of your solution? No auth surface to test (service is open), so the security layer is unauthenticated-only. And the LLM-judge layer is unexercised in CI (no key) — it's scaffolding, not proven at scale. I'd flag both honestly.
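For the thresholds answer, a sketch of how grounding and compression could turn into a verdict. Only the three cut-offs (0.50, 0.65, 1.20) come from these notes; the tokenizer, the stopword list, the function names, and the sample texts are assumptions, not the shipped evaluator.

```python
import re
from typing import List, Tuple

GROUNDING_FAIL = 0.50   # below this: FAIL
GROUNDING_WARN = 0.65   # below this: WARN
MAX_COMPRESSION = 1.20  # summary/source word ratio above this: FAIL

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "will", "with", "is", "for", "on"}


def content_tokens(text: str) -> List[str]:
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def grounding_precision(source: str, summary: str) -> float:
    """Share of the summary's content tokens that also occur in the source."""
    src = set(content_tokens(source))
    tokens = content_tokens(summary)
    if not tokens:
        return 0.0
    return sum(t in src for t in tokens) / len(tokens)


def verdict(source: str, summary: str) -> Tuple[str, List[str]]:
    """Return (PASS | WARN | FAIL, human-readable reasons)."""
    reasons = []
    status = "PASS"
    ratio = len(summary.split()) / max(len(source.split()), 1)
    grounding = grounding_precision(source, summary)
    if grounding < GROUNDING_WARN:
        status = "FAIL" if grounding < GROUNDING_FAIL else "WARN"
        reasons.append(f"grounding {grounding:.2f} below {GROUNDING_WARN}")
    if ratio > MAX_COMPRESSION:
        status = "FAIL"
        reasons.append(f"compression {ratio:.2f} above {MAX_COMPRESSION}")
    return status, reasons


if __name__ == "__main__":
    # Invented texts, not the live six.
    source = ("The studio confirmed the sequel will open next summer. "
              "Filming starts in Prague in March.")
    print(verdict(source, "The sequel will open next summer, with filming in Prague."))
    print(verdict(source, "The director quit after a budget dispute and the release was cancelled."))
    print(verdict(source, source + " " + source))
```

Every FAIL carries the number that tripped it, which is the "every verdict explainable" claim in practice.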
10. AI-usage — the honest framing (rehearse this verbatim)
AI wrote a lot of the code. I owned the test strategy, verified every claim against the live API, and re-reviewed the result adversarially. The bug reports, thresholds, and metrics are all empirically demonstrated — I can defend any line of it.
If pushed on "so what did you do": the three decisions in §7. Those are judgement calls AI doesn't make for you — recognising the adversarial service, rejecting similarity metrics, and knowing which extras are worth the shared-service risk.
11. Logistics (where things live)
- GitHub Rabusek/aurora-qa-assignment — private; miwierzb accepted, ogurtsov invite pending. CI (ci.yml) + security (security.yml) green.
- GitLab gitlab.raba.pl/patryk/aurora-qa-assignment — private mirror on the self-hosted instance; full pipeline + Pages + scheduled jobs. (Briefly lived on gitlab.com/patryk.raba.bragi — moved off that account.) The one pipeline token is a masked/protected CI variable, never committed.
- Dashboard — GitLab Pages hosts it; GitHub ships it as a build artifact (Pages needs paid plan on private repos).
- Repo root has WRITEUP.md (call notes) — uncommitted, local only. This file is separate and also uncommitted.