Large language models (LLMs), the technology behind tools like ChatGPT, are very good at sounding smart. That is the exact thing that makes them a poor fit for assessment.
Here is what happens when an LLM scores a candidate. It produces a number, and then it writes an explanation of why the candidate got that number. The explanation reads well and sounds reasonable. But the explanation is written after the score; it is not the actual reason the score was assigned. You are reading a story, not the math or logic behind the output.
Worse, LLMs are not guaranteed to give the same answer twice. The same candidate, with the same answers, can score differently on Monday and Friday. Your team has probably already experienced this without knowing it: two candidates with nearly identical responses landing far apart in the ranking.
For TA leaders running staffing, retail, logistics, care or public sector hiring, this is not a small problem. If two equally qualified candidates can land in different parts of the shortlist for the same answer, you do not have a screening tool. You have a lottery.
Hubert was founded before the ChatGPT boom. We spent years building our own scoring models, grounded in the academic science of structured interviewing. Every candidate answer is broken down using the STAR framework (Situation, Task, Action, Result), and each piece is weighted on a defined scale. That is the scoring layer. It is deterministic, which means one simple thing: Same answer, same score. Every time.
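To make "same answer, same score" concrete, here is a minimal sketch of what a deterministic weighted score can look like. It is an illustration, not Hubert's actual model: the weights, the 0 to 4 rating scale, and the names `STAR_WEIGHTS` and `score_answer` are all assumptions made for this example, and the step that turns a candidate's words into per-component ratings is assumed to have already happened.

```python
# Hypothetical rubric: these weights and the 0-4 scale are illustrative
# assumptions, not values taken from any real scoring model.
STAR_WEIGHTS = {"situation": 2, "task": 2, "action": 4, "result": 2}
MAX_RATING = 4


def score_answer(ratings: dict[str, int]) -> float:
    """Return a 0-100 score from per-component ratings (0..MAX_RATING)."""
    weighted_sum = 0
    for component, weight in STAR_WEIGHTS.items():
        rating = ratings[component]
        if not 0 <= rating <= MAX_RATING:
            raise ValueError(f"{component} rating out of range: {rating}")
        weighted_sum += weight * rating
    # Normalise against the best possible weighted sum.
    best_possible = sum(STAR_WEIGHTS.values()) * MAX_RATING
    return 100 * weighted_sum / best_possible


answer = {"situation": 3, "task": 2, "action": 4, "result": 3}
print(score_answer(answer))  # 80.0
print(score_answer(answer))  # 80.0 again: no sampling, no randomness
```

Nothing in the function depends on the time of day, a random seed, or what was scored before it. The output is a pure function of the ratings and the rubric.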
There are two real consequences for your team.
First, every candidate in a cohort is judged on identical criteria. The model is locked when your hiring round opens, which means it does not drift: nobody is rated against a moving target.
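One way to picture a "locked" model is a fingerprint of the rubric, recorded when the round opens and checked before every score. The sketch below is an assumption about how such a check could work, not a description of Hubert's implementation; `rubric_fingerprint`, `assert_unchanged`, and the rubric contents are hypothetical names and values.

```python
import hashlib
import json


def rubric_fingerprint(rubric: dict) -> str:
    """Hash a canonical JSON form of the rubric so key order does not matter."""
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(rubric: dict, locked_fingerprint: str) -> None:
    """Refuse to score if the rubric differs from the one locked at round open."""
    if rubric_fingerprint(rubric) != locked_fingerprint:
        raise RuntimeError("Rubric changed after the hiring round opened")


# Hypothetical rubric, fingerprinted once when the round opens.
rubric = {"weights": {"situation": 2, "task": 2, "action": 4, "result": 2},
          "max_rating": 4}
locked = rubric_fingerprint(rubric)

assert_unchanged(rubric, locked)  # passes: same criteria for every candidate
```

If anyone edits a weight mid-round, the fingerprint no longer matches and the check fails loudly instead of silently rating later candidates on different criteria.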
Second, when you have to explain why a candidate moved forward, the explanation is not a story. It is the actual logic that produced the score. That is what "legally defensible by design" really means.
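Here is a rough sketch of what "the explanation is the logic" can mean in practice: the score is returned together with the per-criterion contributions that add up to it. As before, the weights, the rating scale, and the names `explain_score` and `Contribution` are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical rubric values, for illustration only.
WEIGHTS = {"situation": 2, "task": 2, "action": 4, "result": 2}
MAX_RATING = 4


@dataclass(frozen=True)
class Contribution:
    component: str
    rating: int
    weight: int
    points: float  # share of the 0-100 score earned by this component


def explain_score(ratings: dict[str, int]) -> tuple[float, list[Contribution]]:
    """Return the total score and the exact contributions that produced it."""
    best_possible = sum(WEIGHTS.values()) * MAX_RATING
    trace = [
        Contribution(name, ratings[name], weight,
                     100 * weight * ratings[name] / best_possible)
        for name, weight in WEIGHTS.items()
    ]
    return sum(c.points for c in trace), trace


score, trace = explain_score(
    {"situation": 3, "task": 2, "action": 4, "result": 3}
)
for c in trace:
    print(f"{c.component:<10} rating {c.rating}/{MAX_RATING} "
          f"x weight {c.weight} = {c.points:.1f} points")
print(f"{'total':<10} {score:.1f}")
```

The trace is not written after the fact. The total is literally the sum of the lines above it, so the explanation and the score cannot disagree.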
To be clear, LLMs are not the enemy. They have a job. Just not the scoring job.
Hubert uses AI for the conversation itself. When a candidate is taking a Hubert interview, the warmth, the natural follow-ups, the feeling of being heard: all of that comes from large language models. It is why 96% of candidates finish their interview and rate the experience 9 out of 10.
But the moment a number is assigned to that candidate, the technology switches. Scoring runs on our deterministic models. The conversation feels human while the assessment is mathematical. Your candidates get a great experience, and you get a shortlist you can stand behind. You do not have to pick one or the other.
The EU AI Act classifies AI used in hiring as "high-risk." Article 13 requires high-risk systems to be transparent enough that the organizations deploying them can interpret the system's output and use it appropriately. NYC's Local Law 144 already requires bias audits for automated hiring tools. The UK, Canada, and most of the OECD are not far behind.
The TA leaders who win this decade will not be the ones bolting compliance reports onto a black-box system after the fact. They will be the ones whose scoring was auditable from the start.
That is why our customers, including ManpowerGroup, Securitas, Coop Östra, Malmö Stad, Hemfrid, and Teleperformance, chose Hubert. Not because we use "AI." Every vendor does. They chose us because, when a hiring manager has to justify a decision, the answer is a specific candidate response tied to a specific scoring criterion. That is what 5x greater accuracy than traditional methods looks like in practice. That is what 80% faster time-to-hire looks like when speed does not cost defensibility.
If you take one thing from this piece, take this. The next time a vendor pitches you "AI-powered interviews," ask:
What technology scores my candidates, and can you show me exactly how each score is produced?
If the answer involves an LLM doing the scoring, or a vague reference to "our AI," keep asking. You deserve a model that gives the same answer twice. So do your candidates.