How we grade 10,000 candidate-job matches a day with an LLM, and why nobody waits for it.
A few thousand times a day, a Wynisco consultant clicks on a candidate and asks the platform: is this person right for this job? We can't make them wait three seconds for that answer. We can't make them wait one.
But the answer is being graded by an LLM. That's the whole game.
Two requirements that looked like a knife fight:
Sub-second response on every candidate's match list.
LLM-quality grading on every single match.
This post is how we resolved it. Short version: don't put the LLM on the critical path. Longer version has Cerebras, FastAPI background tasks, a Pydantic schema, and one production bug that cost us half a Tuesday.
Two stages, two latency budgets
Stage 1 — embeddings, inside the request.
The consultant hits /match/candidate-to-jobs. The resume runs through wynisco-matcher-v1 (a 384-dim sentence-transformer we fine-tuned on our own placement outcomes) and ANN-searches the live job pool. Hard filters on visa, location, and comp band cull the universe first. Top 20 comes back in 200–400ms. The consultant sees the list. They start clicking.
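To make the order of operations concrete, here is a minimal sketch of Stage 1, assuming brute-force cosine similarity in place of a real ANN index. The `Job` fields (`sponsors_visa`, `location`, `comp_band`) and the `top_matches` function are hypothetical names for illustration, not the production code.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    embedding: list[float]  # 384-dim in production; tiny vectors work for a demo
    sponsors_visa: bool     # hypothetical filter fields
    location: str
    comp_band: int


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(resume_vec, jobs, *, needs_visa, locations, min_band, k=20):
    # Hard filters first: they shrink the pool before any similarity work.
    pool = [
        j for j in jobs
        if (j.sponsors_visa or not needs_visa)
        and j.location in locations
        and j.comp_band >= min_band
    ]
    # Brute-force ranking stands in for the ANN search.
    ranked = sorted(pool, key=lambda j: cosine(resume_vec, j.embedding), reverse=True)
    return [j.job_id for j in ranked[:k]]
```

Filtering before ranking means the similarity step only ever sees jobs the candidate could actually take, which keeps the ranked list free of matches that would be discarded anyway.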
Stage 2 — the A-G rubric, after the response.
While the consultant is reading, we kick off an LLM evaluation in the background against the top 50 semantic matches. Each rubric returns structured JSON: archetype, seniority, requirement-by-requirement evidence with quoted resume snippets, gaps, sponsorship verdict, a global score from 1.0 to 5.0, and a recommend_apply boolean.
We call it the A-G rubric because internally we draft the prompt like a teacher's grade sheet: blocks A through G, each block answering one question. A is the role summary. C is the sponsorship check. G is legitimacy signals. The whole thing flows back into the UI rubric-by-rubric. By the time the consultant has read the third match card, the rubric for the first has rendered.
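The fields listed above could be modeled roughly as follows. This sketch uses stdlib dataclasses so it runs without dependencies; the production schema is a Pydantic model whose exact shape is not shown in this post. The field names `archetype`, `seniority`, `gaps`, `sponsorship_verdict`, and `global_score` are assumptions, as is the helper `rubric_from_dict`.

```python
from dataclasses import dataclass, field


@dataclass
class RequirementCheck:
    requirement: str
    resume_evidence: str  # a quoted resume snippet, or "NOT FOUND"
    strength: str         # e.g. "strong" or "gap"; the full value set is assumed


@dataclass
class Rubric:
    archetype: str
    seniority: str
    requirements: list[RequirementCheck]
    sponsorship_verdict: str
    global_score: float
    recommend_apply: bool
    gaps: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The stated score range is 1.0 to 5.0; reject anything outside it.
        if not 1.0 <= self.global_score <= 5.0:
            raise ValueError(f"global_score out of range: {self.global_score}")
        if not isinstance(self.recommend_apply, bool):
            raise ValueError("recommend_apply must be a boolean")


def rubric_from_dict(data: dict) -> Rubric:
    """Build a Rubric from parsed JSON, raising on missing or invalid fields."""
    reqs = [RequirementCheck(**r) for r in data["requirements"]]
    return Rubric(
        archetype=data["archetype"],
        seniority=data["seniority"],
        requirements=reqs,
        sponsorship_verdict=data["sponsorship_verdict"],
        global_score=float(data["global_score"]),
        recommend_apply=data["recommend_apply"],
        gaps=list(data.get("gaps", [])),
    )
```

Rejecting out-of-range scores at construction time means a malformed rubric fails loudly before it can reach storage or the UI.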
"After the response" is four lines
People overcomplicate this. The whole mechanism:
# auto-matching-service/app/routes.py
if settings.ENABLE_AG_RUBRIC and response.matches:
    job_ids = [m.job_id for m in response.matches]
    background_tasks.add_task(
        run_ag_rubric_for_candidate, req.candidate_id, job_ids
    )
background_tasks is FastAPI's built-in BackgroundTasks dependency. Once the route returns, FastAPI runs registered async tasks on the same event loop, but only after the HTTP response has been sent. The user already has their 200 OK and the JSON. Whatever happens next is invisible to them.
One gotcha that took us longer than it should have: the request-scoped DB session is dead by the time the background task runs. You cannot reuse it. run_ag_rubric_for_candidate opens its own session via the same factory and closes it when done. Forget that and you get psycopg.OperationalError: connection is closed in your logs and a long Slack thread. Don't ask.
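The pattern looks roughly like this. It is a sketch under stated assumptions: `FakeSession` stands in for a real database session, and `session_scope` and `run_rubric_task` are hypothetical names, not the service's actual helpers.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for a real DB session; it only tracks whether it is open."""

    def __init__(self) -> None:
        self.closed = False
        self.saved: list[tuple[str, str]] = []

    def save(self, candidate_id: str, job_id: str) -> None:
        if self.closed:
            raise RuntimeError("connection is closed")
        self.saved.append((candidate_id, job_id))

    def close(self) -> None:
        self.closed = True


@contextmanager
def session_scope(factory=FakeSession):
    # Open a fresh session and guarantee it is closed, even on error.
    session = factory()
    try:
        yield session
    finally:
        session.close()


def run_rubric_task(candidate_id: str, job_ids: list[str]) -> FakeSession:
    # The task never receives the request's session; it opens its own.
    with session_scope() as session:
        for job_id in job_ids:
            session.save(candidate_id, job_id)
    return session
```

The task takes only plain IDs as arguments, so nothing tied to the request's lifetime can leak into it by accident.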
Why Cerebras
We tried gpt-4o first. Quality excellent. p99 around 9 seconds.
We tried Claude Sonnet next. Also excellent. p99 around 6 seconds.
Neither was wrong. Both were being paid to be thorough.
But the rubric isn't a creative task. It's structured-output with a tight prompt and a fixed JSON schema. We don't need o1-class reasoning. We need a competent open model that can grind through 50 prompts every time a consultant opens a candidate's page.
Cerebras's gpt-oss-120b runs, frankly, very fast. Same prompt, same JSON, end-to-end 1.5–4 seconds including network. With six concurrent requests per candidate (AG_RUBRIC_CONCURRENCY=6), 50 jobs finish in about 15 seconds wall clock. Cheaper too, by a margin we don't talk about in public.
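One common way to cap fan-out like this is an asyncio semaphore. The sketch below assumes that approach; the post does not show how the limit is actually enforced. `grade_all` and `grade_one` are hypothetical names.

```python
import asyncio

AG_RUBRIC_CONCURRENCY = 6  # value quoted in the post


async def grade_all(job_ids, grade_one, limit=AG_RUBRIC_CONCURRENCY):
    """Grade every job, never running more than `limit` calls at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job_id):
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return job_id, await grade_one(job_id)

    pairs = await asyncio.gather(*(guarded(j) for j in job_ids))
    return dict(pairs)
```

With 50 jobs and a limit of 6, each slot handles eight or nine calls in sequence, so a wall-clock time near 15 seconds implies that typical calls sit toward the fast end of the 1.5 to 4 second range.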
async def callcerebras(client, prompt):
    resp = await client.post(
        CEREBRAS_URL,
        headers={"Authorization": f"Bearer {settings.CEREBRAS_API_KEY}"},
        json={
            "model": settings.AG_RUBRIC_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
            "max_completion_tokens": 6000,
            "response_format": {"type": "json_object"},
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
temperature=0.2, response_format=json_object. That's it. No function calls, no agent loop. We ask for JSON, validate it with Pydantic on the way back, retry once if parsing fails. If it fails twice, the rubric for that job is null and the UI quietly omits the card. No deadlock, no halt. The 60-second timeout exists because of one outlier we caught when a Cerebras region was warming up. It hasn't been hit since.
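The parse, validate, retry-once flow can be sketched as below. Plain `json` and a key check stand in for the Pydantic validation, and `REQUIRED_KEYS`, `parse_rubric`, and `rubric_or_none` are hypothetical names chosen for illustration.

```python
import asyncio
import json

REQUIRED_KEYS = {"requirements", "recommend_apply"}  # illustrative subset


def parse_rubric(raw: str) -> dict:
    """Parse model output and check its minimum shape; raise ValueError if bad."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        raise ValueError("rubric is missing required keys")
    return data


async def rubric_or_none(call_model, prompt: str, max_attempts: int = 2):
    """One call plus one retry on bad output; None after the second failure."""
    for _ in range(max_attempts):
        # Transport errors raised by call_model propagate; they are not retried here.
        raw = await call_model(prompt)
        try:
            return parse_rubric(raw)
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising lets the caller store a null rubric for that one job and carry on with the rest of the batch.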
The verifier (the part I'm proudest of)
Here's a thing nobody warns you about with structured-output LLMs: the model will happily make up evidence.
Early on, we'd see things like this:
{
    "requirement": "8+ years building data pipelines on AWS",
    "resume_evidence": "Led a team of 5 engineers managing AWS data pipelines for 8 years",
    "strength": "strong"
}
The actual resume said four years, mid-size analytics firm, mostly on-prem. The model had pattern-matched the JD's requirements onto the resume and synthesized a quote that sounded right. Strength: strong. recommend_apply: true. We were one merge away from telling a consultant to push that candidate at a senior AWS role.
So we added a verifier. Eighty lines of Python. After every Cerebras response, before anything is saved, we walk every resume_evidence claim and check it against the actual resume text:
for req in ag_parsed["requirements"]:
    evidence = req.get("resume_evidence", "")
    if evidence and evidence.strip() != "NOT FOUND":
        verified, ratio = isverified(evidence, resume_norm, lines_norm)
        if not verified:
            req["resume_evidence"] = f"NOT FOUND (unverified: {evidence})"
            req["strength"] = "gap"
The check is a normalized substring match plus a difflib fuzzy match with a 0.85 ratio threshold. If the quote isn't in the resume, the requirement gets demoted to gap, the global score drops 0.15 per demotion, and if the adjusted score falls below 3.5 we flip recommend_apply to false.
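A compact sketch of that check and the score adjustment follows. It assumes the substring test runs first and the fuzzy match compares the quote against each resume line; the real normalization and matching order are not shown in this post. `normalize`, `is_verified_sketch`, `apply_verifier`, and the `global_score` key are hypothetical names.

```python
import difflib
import re

FUZZY_THRESHOLD = 0.85  # from the post
PENALTY = 0.15          # score drop per demotion, from the post
APPLY_FLOOR = 3.5       # below this, recommend_apply becomes False


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace; the real normalization is not shown.
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_verified_sketch(evidence: str, resume_norm: str, lines_norm: list[str]):
    """Return (verified, best_ratio) for one quoted evidence string."""
    ev = normalize(evidence)
    if ev and ev in resume_norm:
        return True, 1.0
    best = max(
        (difflib.SequenceMatcher(None, ev, line).ratio() for line in lines_norm),
        default=0.0,
    )
    return best >= FUZZY_THRESHOLD, best


def apply_verifier(rubric: dict, resume: str) -> dict:
    resume_norm = normalize(resume)
    lines_norm = [normalize(line) for line in resume.splitlines() if line.strip()]
    demotions = 0
    for req in rubric["requirements"]:
        evidence = req.get("resume_evidence", "")
        if evidence and evidence.strip() != "NOT FOUND":
            verified, _ = is_verified_sketch(evidence, resume_norm, lines_norm)
            if not verified:
                req["resume_evidence"] = f"NOT FOUND (unverified: {evidence})"
                req["strength"] = "gap"
                demotions += 1
    rubric["global_score"] = round(rubric["global_score"] - PENALTY * demotions, 2)
    if rubric["global_score"] < APPLY_FLOOR:
        rubric["recommend_apply"] = False
    return rubric
```

For example, a rubric scored 3.6 with one fabricated quote drops to 3.45, which is under the 3.5 floor, so the recommendation flips to false.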
Every record we write also stores verifier metadata:
"_verifier": {
    "version": "1.0",
    "total_claims": 8,
    "unverified": 0,
    "demotions": 0,
    "score_adjustment": 0.0
}
When unverified > 0 starts trending up over a week, we know it's time to re-tune the prompt. It's our hallucination canary.
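As an illustration of how that canary could be computed from stored records, here is a small sketch. The functions `unverified_rate` and `is_trending_up`, and the three-day window, are assumptions for the example rather than the monitoring actually in place.

```python
def unverified_rate(records: list[dict]) -> float:
    """Share of evidence claims the verifier rejected across a batch of records."""
    total = sum(r["_verifier"]["total_claims"] for r in records)
    bad = sum(r["_verifier"]["unverified"] for r in records)
    return bad / total if total else 0.0


def is_trending_up(daily_rates: list[float], min_days: int = 3) -> bool:
    """True when the last `min_days` daily rates are strictly increasing."""
    recent = daily_rates[-min_days:]
    return len(recent) == min_days and all(a < b for a, b in zip(recent, recent[1:]))
```

Tracking a rate rather than a raw count keeps the signal comparable between busy and quiet days.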
What we got wrong
In chronological order of pain.
We stored the rubric as a string column first. The plan was to "parse it later." Never say "we'll parse it later." Migrated to JSONB the next morning.
We retried five times on Cerebras failures. During a 40-minute Cerebras incident, every candidate page-load fanned out into 250 doomed retries. Background tasks piled up. Memory climbed. We now retry exactly once, and only on JSON/validation errors — never on HTTP 5xx. If their service is down, ours doesn't pile on.
We blocked the response on the rubric "just for the top job." Sounded reasonable in standup. Added 4 seconds to p50. The whole point of moving the LLM behind the response is that nothing in the response waits for it. Deleted the optimization the same day.
Where this lands
Every candidate page-load on the platform now ships top 20 matches in around 300ms and trickles in A-G rubrics over the next 15-ish seconds. The consultant never waits. The LLM does the work behind their back. The verifier catches what the LLM lies about.
It's not glamorous architecture. No agent framework, no vector DB du jour, no MCP server. A FastAPI route, a BackgroundTasks call, an httpx.AsyncClient, a Pydantic model, and eighty lines of fuzzy-match Python.
The whole system is the lesson, really: when an LLM is too slow for your latency budget, the answer is rarely a faster LLM. It's usually a different position in the request lifecycle.
If you're building something like this and want to compare notes — mohammad@wynisco.com. We're hiring engineers who like systems like this one.
Written by
Mohammad Tauqeer
