Aurora QA — interview walkthrough ("what I'd do" vs "what I did")

Personal prep. Not in any repo. Repo (for reference): GitHub Rabusek/aurora-qa-assignment (private, miwierzb accepted, ogurtsov pending) + GitLab mirror gitlab.com/patryk.raba.bragi/aurora-qa-assignment (private).

Reviewer verdict already in: "wszystko jak najbardziej ok, szybka, elegancka robota" ("everything is absolutely fine; quick, elegant work"). Call is 30–45 min, likely "walk me through it + defend the choices." Use the would-do → did pairs below; each is a talking beat.


ELI5 — the whole thing in plain words

The simple version. Read this first; the numbered sections below are the same story with the receipts.

What they asked. Two jobs. (1) Test a small website that hands out movie info when you ask it. (2) Build a little robot that checks whether AI-written movie summaries are honest about the article they came from.

The trick I spotted. The website is rigged to misbehave on purpose. About 1 in 7 times it throws a fake error even when you asked correctly, and now and then it freezes for ~10 seconds. Most people would shrug ("my test is flaky, run it again") and, by retrying until it passes, quietly hide the website's real problem. I did the opposite. I made two kinds of test: behave-tests that retry through the noise to check what the site is supposed to do, and measure-tests that count the noise itself.

That one decision is the whole project. The same idea comes back in the speed test and the security scans: behave-tests push through the noise, measure-tests count it.

What I found: 6 bugs. The scary one — ask for "-1 movies" and instead of refusing, the site hands back almost the entire list. That's the shape of a data leak: it shows more than it should.

Part two — the AI honesty-checker. The lazy way is to measure how similar the summary looks to the article and call similar = good. I built that lazy version on purpose to show it's a trap: a summary that copied the article and pasted in a fake email scored as the most similar of all — because copying looks similar. So "similar" literally rewards the worst lie. Instead I built small specific checkers, each hunting one kind of lie: made-up facts, made-up contact details, made-up numbers, "summaries" longer than the original, and dropped key points. They catch every planted lie — and each one says why in plain English.

The extra credit. Tests run automatically on every save, plus a speed test, a live trend dashboard, and a security sweep that already found and fixed 2 real known vulnerabilities.

Why it's good, in one breath: I noticed the target was rigged and built the tests around that fact instead of being fooled by it — for both the website and the AI-summary part.

ELI5 — the 6 bugs, one at a time

BUG-1 · random errors — High. You press the right button, but about 1 in 7 times the machine beeps "ERROR" for no reason (wrong "try again" and "not found" messages on a perfectly good request). The button works fine — the machine is fibbing at random. Why it's the worst one: if a single answer can't be trusted, nothing else can either — so every other test has to ask a few times, and one test's whole job is to count how often it fibs. Caught by: the reliability check.

BUG-2 · the "−1 movies" leak — High. You ask for "minus one movies" — a nonsense amount. Instead of saying "that's silly, no," the machine hands you almost the entire list (34 of 35). Like asking a librarian for "negative one book" and walking out with the whole shelf. Why it matters: this is exactly how real systems accidentally hand out data they shouldn't — it gives back more than allowed. Caught by: the boundary check.

BUG-3 · the 10-second freeze — Medium. About 1 in 20 times, the machine freezes for ~10 seconds before answering. Everything else answers in a blink (~7 hundredths of a second). Why it matters: any app that gives up after a few seconds will break on those. The freeze is clearly on purpose — it's always the same suspiciously round ~10s. Caught by: the reliability check + the speed test.

BUG-4 · the missing-slash detour — Low. The real door is labelled /movies/ (with a slash). Knock on /movies (no slash) and it says "go next door" (a redirect). Most visitors follow the arrow fine, but some simple programs don't and get lost. Fix: just tell everyone the exact address.

BUG-5 · the odd-knock crash — Low. There are standard ways to talk to a web server (GET, DELETE, …). Use a rare-but-real one (QUERY) and, instead of politely saying "I don't do that" (a clean 405), the gateway chokes and throws an "I'm broken" 502. Only weird pokes find it. Why it matters: the front door should never crash on an odd-but-valid knock. Caught by: the spec-fuzz check.

BUG-6 · the missing safety stickers — Low. When the site answers, it forgets a few standard safety stickers on the envelope (tell the browser "don't guess the file type," "always use the secure line," etc.). Nothing is broken — those stickers are just cheap armour. Two different scanners agreed: only the stickers are missing; everything actually dangerous is already safe. Fix: one small piece of code adds all the stickers. Caught by: the passive scan + the template scan.

ELI5 — every test/check we added, and what it's for

Runs on every save — fast, and gives the same answer every time (safe to gate on):

Runs on a schedule / on-demand — these poke the shared live service, so we keep them gentle and never block a save on them:


0. 30-second pitch (open with this)

Two-part QA take-home against a live service. Task 1 = testing approach for GET /movies/; Task 2 = a tool that scores AI summaries vs their source. The one insight that shaped everything: the service injects failures on purpose (~13% random 4xx and ~10s stalls on valid requests). So I split the suite into behaviour tests that retry through the noise and reliability tests that measure the noise. Found 6 bugs, all reproduced by code. Then took it past the brief: CI/CD on GitLab + GitHub, k6 perf, a KPI dashboard, and a layered security stage that already fixed 2 real CVEs and surfaced 2 of the 6 bugs (BUG-5, BUG-6).


1. The framing that matters most — the service is adversarial

What I'd do: before writing a single assertion, characterise the target. Fire the same valid request many times, watch the distribution of responses and latencies. Decide what's "the API's contract" vs "injected noise."

What I did: looped GET /movies/ 30× → ~13–15% came back 400/401/402/404/405 on identical valid input, plus occasional ~10s responses. Conclusion: deliberate fault injection ("has not been executed to its fullest potential"). That single observation forks the whole strategy: behaviour tests retry through the noise, reliability tests measure it.

The line to land: "A naive suite treats that chaos as flaky tests and retry-wraps it into silence. I treated it as a measurable property and reported it as BUG-1/BUG-3."
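The characterisation step can be sketched offline as below. This is a minimal sketch, not the shipped code: `fetch` is a hypothetical callable that performs one GET /movies/ and returns `(status_code, seconds)`, the 5-second stall cut-off is an illustrative assumption, and the scripted responses are made-up stand-ins for the live service.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def characterise(
    fetch: Callable[[], Tuple[int, float]],
    samples: int = 30,
    stall_after_s: float = 5.0,  # assumed cut-off for "this was a stall"
) -> Dict[str, object]:
    """Send the same valid request `samples` times and summarise what came back."""
    statuses: Counter = Counter()
    stalls = 0
    for _ in range(samples):
        status, elapsed = fetch()
        statuses[status] += 1
        if elapsed >= stall_after_s:
            stalls += 1
    return {
        "statuses": dict(statuses),
        "error_rate": (samples - statuses[200]) / samples,
        "stall_rate": stalls / samples,
    }


def scripted_fetch():
    """Deterministic stand-in for the live service (invented values)."""
    script = [(200, 0.07)] * 25
    script += [(404, 0.07), (401, 0.07), (400, 0.07), (405, 0.07)]
    script += [(200, 10.0)]  # one slow but successful response
    responses = iter(script)
    return lambda: next(responses)


report = characterise(scripted_fetch(), samples=30)
print(report)
```

Injecting `fetch` keeps the measuring logic testable without touching the shared service; a real run would pass a function that wraps the HTTP client.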


2. Task 1 — API testing approach

What I'd do: layer the tests cheapest-and-most-deterministic first, so failures are diagnosable and CI-gateable. Contract → functional → boundary → security-of-input → reliability. Each bug I claim must be reproduced by a test.

What I did: pytest suite, one file per layer:

Bugs found (all in BUGS.md, each with a repro):

| ID | Sev | One-liner | How caught |
|---|---|---|---|
| BUG-1 | High | identical valid requests randomly return 400/401/402/404/405 (~13%) | reliability sampling |
| BUG-2 | High | limit=-1 returns nearly the whole catalogue (Python negative-slice leak) | boundary test |
| BUG-3 | Medium | ~10s stalls injected on ~1/20 requests | reliability + k6 |
| BUG-4 | Low | /movies (no slash) → 307 redirect | documented |
| BUG-5 | Low | QUERY HTTP method → 502 instead of 405 | schemathesis fuzz |
| BUG-6 | Low | missing hardening headers | ZAP + nuclei |

Best bug to talk about — BUG-2 (the data leak): limit=-1 → 34 of 35 items; limit=-5 → 30. Count is total + limit. Root cause I inferred: handler does items[skip:skip+limit], and items[0:-1] in Python is "all but the last one." It's the exact shape of a real-world data-exposure bug — returns more rows than intended. Fix: Query(ge=0). The naive suite catches the status mismatch but never inspects the body, so it misses that it's a leak, not just a wrong code.
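The inferred root cause is easy to reproduce in plain Python. The sketch below is an illustration of the negative-slice behaviour only: `leaky_page` and `guarded_page` are hypothetical names, the default `limit` of 10 is an assumption, and the real handler was never seen (the slice is an inference from the observed counts).

```python
from typing import List

# Stand-in catalogue with the same size as the one observed (35 items).
CATALOGUE: List[int] = list(range(35))


def leaky_page(items: List[int], skip: int = 0, limit: int = 10) -> List[int]:
    """Inferred shape of the bug: a negative limit becomes a negative slice end."""
    return items[skip:skip + limit]


def guarded_page(items: List[int], skip: int = 0, limit: int = 10) -> List[int]:
    """Same slice, but rejects negative paging values first."""
    if skip < 0 or limit < 0:
        raise ValueError("skip and limit must be >= 0")
    return items[skip:skip + limit]


print(len(leaky_page(CATALOGUE, limit=-1)))  # 34, i.e. total + limit
print(len(leaky_page(CATALOGUE, limit=-5)))  # 30
```

In a FastAPI handler the same guard would normally live in the parameter declaration (the `Query(ge=0)` fix), so the request is rejected with a 422 before the slice ever runs.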

Why the suite is green despite 3 live bugs: the live bugs are xfail(strict=False). Each stays executable and re-asserted every run, documents the defect, but doesn't red CI on a known server-side issue. If the server ever fixes one, xfail flips to XPASS and I notice.


3. Task 2 — LLM output evaluation

What I'd do: DON'T reach for one similarity number. Decide the failure modes of a summary first (fabrication, contradiction, not-actually-condensing, dropped key facts), then build a targeted detector per mode. Deterministic + offline first; an LLM judge only as an optional top layer.

What I did: layered evaluator, offline metrics first, no API key needed:

Result: the offline layer alone reproduces ground truth — PASS for TCK-001/002/005, FAIL for TCK-003/004/006, each caught by the detector built for its failure mode, with a human-readable reason.
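To make "one detector per failure mode" concrete, here is a minimal sketch of three such detectors: made-up contact details, made-up numbers, and a summary that does not condense. The function names, regexes, and example texts are all illustrative assumptions, not the shipped evaluator.

```python
import re
from typing import List

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def fabricated_contacts(source: str, summary: str) -> List[str]:
    """Emails that appear in the summary but nowhere in the source."""
    known = {e.lower() for e in EMAIL_RE.findall(source)}
    return [e for e in EMAIL_RE.findall(summary) if e.lower() not in known]


def fabricated_numbers(source: str, summary: str) -> List[str]:
    """Numbers that appear in the summary but nowhere in the source."""
    known = set(NUMBER_RE.findall(source))
    return [n for n in NUMBER_RE.findall(summary) if n not in known]


def evaluate(source: str, summary: str) -> List[str]:
    """Return human-readable failure reasons; an empty list means PASS."""
    reasons = []
    for email in fabricated_contacts(source, summary):
        reasons.append(f"contact detail not in source: {email}")
    for number in fabricated_numbers(source, summary):
        reasons.append(f"number not in source: {number}")
    if len(summary) >= len(source):
        reasons.append("summary is not shorter than the source")
    return reasons


# Invented example texts, for illustration only.
SOURCE = ("Order 1042 arrived damaged. "
          "The customer asked for a refund and was told it takes 5 days.")
CLEAN = "Order 1042 arrived damaged; refund takes 5 days."
BAD = "Order 1042 arrived damaged; refund takes 3 days. Email refunds@example.com."

print(evaluate(SOURCE, CLEAN))
print(evaluate(SOURCE, BAD))
```

Each detector returns the offending value, which is what lets the verdict carry a plain-English reason instead of a bare score.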

The killer point — why not cosine/ROUGE/difflib: I built the naive similarity version and ran it. On the fabricated-email summary (TCK-004), difflib gave it the highest similarity score of all six (0.71) — because it copies the source and appends a fake email, and string similarity rewards copying. The two contradictions (0.60, 0.53) sit inside the clean band (0.52–0.63). No threshold separates good from bad. Similarity provably rewards the worst defect. That's the whole argument for failure-mode-specific metrics.
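The trap can be shown in a few lines with `difflib.SequenceMatcher`. This is a sketch on invented texts, not the TCK data, so the scores will differ from the ones quoted above; `autojunk=False` is set here only to keep the ratio predictable on longer strings.

```python
from difflib import SequenceMatcher


def similarity(source: str, summary: str) -> float:
    """Character-level similarity in [0, 1]; higher means 'looks more alike'."""
    return SequenceMatcher(None, source, summary, autojunk=False).ratio()


# Invented example texts, for illustration only.
SOURCE = (
    "The customer reported that the mobile app crashes on login after the latest update. "
    "Support confirmed the issue and a fix is planned for the next release."
)
HONEST = "App crashes on login since the update; a fix is planned."
COPY_WITH_FAKE_EMAIL = SOURCE + " Contact fake.support@example.com for details."

honest_score = similarity(SOURCE, HONEST)
fabricated_score = similarity(SOURCE, COPY_WITH_FAKE_EMAIL)

# The fabricated summary wins, because copying the source maximises overlap.
print(f"honest={honest_score:.2f} fabricated={fabricated_score:.2f}")
```

Any threshold that passes the honest summary here also passes the fabricated one, which is the same shape as the TCK-004 result.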


4. Beyond the brief (the "elegancka robota", i.e. "elegant work", parts)

What I'd do: treat a take-home as a chance to show the system around tests, not just tests — CI, perf, observability, security — but keep every extra gentle against a shared service.

What I did:


5. Testing levels (the pyramid — draw this if asked)

Cheap+deterministic gate every push; anything leaning on the shared live service is scheduled/manual.

| # | Level | Runs |
|---|---|---|
| 1 | Contract (strict Pydantic schema) | every push |
| 2 | Functional (pagination/search) | every push |
| 3 | Boundary (edge/negative, 422) | every push |
| 4 | Security: input (payloads as literal) | every push |
| 5 | Security: SAST/deps/secrets (bandit, pip-audit, gitleaks) | every push |
| 6 | Security: DAST (schemathesis, ZAP, nuclei) | scheduled/manual |
| 7 | Reliability (chaos rate + latency budget) | scheduled/manual |
| 8 | Performance (k6, chaos-separated) | scheduled/manual |
| 9 | LLM-output eval (Task 2) | every push (unit) + evaluate stage |

Principle to state: the split isn't by tool, it's by what each layer trusts. Behaviour layers retry through chaos; reliability/perf layers measure it; security layers assume hostile input.
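"Behaviour layers retry through chaos" could look like the helper below. It is a sketch under assumptions: `fetch` is a hypothetical callable returning `(status_code, body)`, the attempt count is illustrative, and the chaos set is just the statuses observed on valid requests. It is only safe for requests that are expected to succeed; a test that legitimately expects one of these statuses must not use it.

```python
import time
from typing import Any, Callable, FrozenSet, List, Tuple

# Statuses seen injected on identical valid requests.
CHAOS_STATUSES: FrozenSet[int] = frozenset({400, 401, 402, 404, 405})


def get_through_chaos(
    fetch: Callable[[], Tuple[int, Any]],
    attempts: int = 5,
    pause_s: float = 0.0,
) -> Tuple[int, Any]:
    """Retry a known-valid request until a non-chaos status comes back."""
    seen: List[int] = []
    for _ in range(attempts):
        status, body = fetch()
        seen.append(status)
        if status not in CHAOS_STATUSES:
            return status, body
        time.sleep(pause_s)  # stay gentle on the shared service
    raise AssertionError(f"no stable answer after {attempts} attempts: {seen}")


def scripted(script):
    """Deterministic stand-in for the live service (invented values)."""
    responses = iter(script)
    return lambda: next(responses)


status, body = get_through_chaos(
    scripted([(404, None), (401, None), (200, ["movie-1"])])
)
print(status, body)
```

The reliability layer does the opposite with the same raw responses: it never retries, it counts them.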


6. Security — what I'd do vs what I did

What I'd do: cover the four security concerns a pipeline can automate — own-code (SAST), dependencies (CVEs), committed secrets, and the running app (DAST). Gate the fast/cheap ones on every push; run active scanning manual/scheduled so a shared service isn't hammered.

What I did (6 layers):

| Concern | Tool | Burp analogue |
|---|---|---|
| own-code vulns (SAST) | bandit | - |
| dependency CVEs | pip-audit | - |
| committed secrets | gitleaks | - |
| spec fuzz (DAST) | schemathesis | Burp Scanner + spec import |
| passive web scan | OWASP ZAP baseline | Burp passive audit |
| template scan | nuclei | Burp scanner templates |

Why not Burp itself: Burp Suite Professional is a licensed, desktop-first tool, and scanning from a pipeline generally means the separately licensed Enterprise edition. schemathesis + ZAP + nuclei are OSS, headless alternatives that cover similar ground from CI at no licence cost.

Chaos-awareness in security too: schemathesis excludes the two chaos-flooded checks (status_code_conformance, positive_data_acceptance) so injected 4xx don't drown real contract defects; all DAST jobs are allow_failure (an intermittent chaos response must not red the pipeline); BUG-5 was hand-confirmed deterministic (5/5) before filing.

Actual scan results (I ran these against the live endpoint):

One-line security verdict: "Two independent scanners agree — endpoint is solid against the dangerous classes; the only app-level gap is defense-in-depth response headers, fixable in one FastAPI middleware."
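The "one middleware" fix might look like the sketch below. It is written as a plain ASGI middleware so it runs without any framework installed; with FastAPI it would typically be registered via `app.add_middleware(HardeningHeadersMiddleware)`. The class name and the header values are assumptions: only the two headers hinted at earlier (no content-type sniffing, force HTTPS) are included, and the exact set reported by the scanners is not reproduced here.

```python
import asyncio

# Assumed header set; extend it to match the scanner findings.
HARDENING_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"strict-transport-security", b"max-age=31536000; includeSubDomains"),
]


class HardeningHeadersMiddleware:
    """Add defence-in-depth headers to every HTTP response that lacks them."""

    def __init__(self, app, headers=HARDENING_HEADERS):
        self.app = app
        self.headers = list(headers)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                existing = list(message.get("headers", []))
                present = {name.lower() for name, _ in existing}
                existing += [(n, v) for n, v in self.headers if n not in present]
                message = {**message, "headers": existing}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def demo_app(scope, receive, send):
    """Tiny stand-in ASGI app that answers every request with an empty list."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": b"[]"})


def run_demo():
    """Drive the wrapped app once and return the response headers as a dict."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(HardeningHeadersMiddleware(demo_app)({"type": "http"}, receive, send))
    return dict(sent[0]["headers"])


print(run_demo())
```

Headers the app already sets are left alone, so the middleware cannot overwrite a deliberate per-route value.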

"Are the CVEs ours?" No — upstream dependency CVEs (in requests + pytest), test-only tooling, low blast radius. Fix = bump the pin, which I did.


7. The honest AI-vs-no-AI comparison (be ready — they'll ask)

I actually built the 1-day no-AI version and ran it 4× live to measure what it catches:

| Bug | Naive result | Reality |
|---|---|---|
| BUG-2 negative limit | caught 4/4 | but never inspects body → misses that it's a data leak |
| BUG-1 random 4xx | stumbled, never identified | 3/4 runs failed with a different signature each time → reads as "flaky tests," realistic outcome is retry-wrapping = suppressing the bug |
| BUG-3 ~10s stalls | missed 0/4 | one stall happened; only visible in --durations, no assertion cares about time |

Task 2 naive (difflib + length): flags 1 of 3 defects, and only via length — the fabricated-email summary scored highest similarity of all six.

Score: naive ≈ 1.5 of 3 API bugs, 1 of 3 summary defects.

The point: the gap is NOT "AI magic." The shipped detectors are deterministic code; the evaluator hits 3/3 with the LLM judge off. What closed the gap was three decisions:

  1. Recognising the service is adversarial → splitting behaviour from reliability.
  2. Choosing failure-mode-specific metrics for LLM output instead of one similarity number.
  3. Using AI as a breadth multiplier — same 7 days bought CI, perf, dashboard, security, CVE hygiene — every generated line verified against the live API before it shipped.

8. Numbers cheat-sheet (quote these)


9. Likely questions → crisp answers


10. AI-usage — the honest framing (rehearse this verbatim)

AI wrote a lot of the code. I owned the test strategy, verified every claim against the live API, and re-reviewed the result adversarially. The bug reports, thresholds, and metrics are all empirically demonstrated — I can defend any line of it.

If pushed on "so what did you do": the three decisions in §7. Those are judgement calls AI doesn't make for you — recognising the adversarial service, rejecting similarity metrics, and knowing which extras are worth the shared-service risk.


11. Logistics (where things live)