8 comments

  • scottmu 1 day ago
    I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it's choosing that token out of all possible tokens. Every possible output token has a probability assigned by the model (internally the model produces logits, which a softmax turns into probabilities). Those probabilities form a distribution (they sum to 1). Entropy is a measure of uncertainty: it quantifies whether a token probability distribution is certain (one token has a 99.9% probability, and the rest share the leftover 0.1%) or uncertain (every token has roughly the same probability, so it's pretty much random which token is selected). Low entropy is the former case, and high entropy is the latter.
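
    A quick sketch of that calculation in plain Python (nothing model-specific here, just the entropy formula applied to the two cases above):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Certain: one token takes 99.9%, nine others share the leftover 0.1%
confident = [0.999] + [0.001 / 9] * 9
# Uncertain: ten tokens, all equally likely
uniform = [0.1] * 10

print(entropy(confident))  # close to 0 bits (low entropy)
print(entropy(uniform))    # log2(10) ≈ 3.32 bits (maximum for 10 tokens)
```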

    There is interesting research in the correlation of entropy with accuracy and hallucinations:

    - https://www.nature.com/articles/s41586-024-07421-0

    - https://arxiv.org/abs/2405.19648

    - https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)

    - https://arxiv.org/abs/2603.18940

    - tons more, happy to chat about if interested

    • philipodonnell 49 minutes ago
      Is the difficulty that in high entropy situations, you can’t really tell whether it’s because the model is uncertain, or because the options are so semantically similar that it doesn’t matter which one you choose? Like pure synonyms.
    • mememememememo 13 hours ago
      Wow, if it's that easy to detect hallucinations, are the big models or rigs (agentic scaffolds) building in any self-correcting behaviour? Or possibly switching to an "I don't know" mode so the model can ask the human for help understanding?

      Maybe this insight is why I feel hallucinations have become much rarer in the last 12 months on top models. Are they being detected before they get sent out?

      • scottmu 2 hours ago
        I wouldn't say it's easy to detect hallucinations. Understanding output token probability distributions is only part of a solution, and we still aren't perfect. Just better than individual models.

        Hallucinations may seem rarer for a few reasons. First, models are more accurate with certain prompts. Second, models are more convincing when they do hallucinate. They may get an overall idea, but hallucinate the details. Hallucinations are still a major problem and are fundamental to the way modern LLMs work.

    • stephantul 3 hours ago
      Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
      • scottmu 1 hour ago
        You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.
        • stephantul 46 minutes ago
          Ah ok sorry, it just looked like you were speaking on his behalf.
  • Tomjosetj31 2 hours ago
    Impressive result on HLE if the methodology holds up. One thing I'd want to understand better: how much of the gain comes from the entropy weighting specifically vs. simply having more compute via parallel inference? Would be curious to see an ablation — same models, same budget, but with naive majority voting instead. That would isolate the actual contribution of your confidence-weighting approach.
    • scottmu 1 hour ago
      Great question. What I can say is we experimented a _ton_. If you take a basic approach and simply ask the same prompt of a bunch of LLMs and then ask another LLM to combine the results, you'll get a pretty poor answer. At best, you'll get a response that is the average of the ensemble, which by definition is going to be worse than the best model in the ensemble. At worst, you'll regurgitate the worst model in the ensemble. And you'll have the added expense and potential latency, too. So you need a mechanism to choose and combine the ensemble effectively; the naive approach is not a good solution at all.

      We didn't experiment with different ensemble mechanisms rigorously enough for a research paper. We will, though.

      Majority voting was actually how we started, and we came up with great mechanisms for stopping early, saving token costs and time, along with other interesting things we could do with that simple mechanism. The issue we had was that the orchestration could already choose a model beforehand almost as good (according to simpler benchmarks than HLE we ran at the time) as majority voting could pick after the responses were complete. And we tried many voting mechanisms, such as all models in the ensemble voting on all others.
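
      To illustrate the early-stopping part (a hypothetical sketch I'm making up here, not our actual implementation): with plain majority voting you can stop as soon as the current leader can no longer be overtaken by the models still running.

```python
from collections import Counter

def majority_vote_early_stop(answers_iter, n_models):
    """Tally answers as they stream in; stop once the leader is uncatchable.

    answers_iter yields each model's final answer as it completes.
    Returns (winning_answer, responses_consumed).
    """
    counts = Counter()
    seen = 0
    for ans in answers_iter:
        counts[ans] += 1
        seen += 1
        leader, lead = counts.most_common(1)[0]
        runner_up = max((c for a, c in counts.items() if a != leader), default=0)
        remaining = n_models - seen
        if lead > runner_up + remaining:
            return leader, seen  # uncatchable: cancel the remaining models
    return counts.most_common(1)[0][0], seen

# With 5 models, three matching answers in a row already decide the vote:
print(majority_vote_early_stop(iter(["A", "A", "A", "B", "A"]), 5))  # ('A', 3)
```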

      An ablation study would be great to do now, with many other ideas we've played with. We have better benchmarks than we did just a few months ago, and it would be great to understand the tradeoffs of different approaches so that there could be alternative options for different use cases.

  • siliconc0w 3 hours ago
    Do you have data for other benchmarks? +7% for HLE isn't nothing but it'd be more compelling if you could show you're consistently doing better with your method across more domains (especially coding, which seems like the primary use-case these days).
    • kenmu 1 hour ago
      As of right now, we do not. I'm working on these other benchmarks, but unfortunately they cost quite a bit of money to run, which I'm hoping will come from many people using Sup :)
  • hello12343214 17 hours ago
    I use gemini and cursor for enterprise software implementation, but they often suggest incorrect solutions to edge cases and unique config requirements. An AI that has a higher likelihood of being accurate is very appealing. I'll give Sup AI a try over the next few days at work.

    Also, discovering HLE was great... scrolling through some of the questions brings back memories of college organic chem.

    • scottmu 15 hours ago
      I've felt your pain. Models aren't always trained well enough on edge cases and configs.

      Would love to hear how Sup works out for you.

  • algolint 1 day ago
    Ensembling usually hits a wall at latency and cost. Running these in parallel is table stakes, but how are you handling the orchestration layer overhead when one provider (e.g., Vertex or Bedrock) spikes in P99 latency? If you're waiting for the slowest model to get entropy stats, the DX falls off a cliff. Are you using speculative execution or a timeout/fallback strategy to maintain a responsive ttft?
    • supai 1 day ago
      A few things:

      - Similar to OpenRouter, we measure the latency of the different providers to ensure we always get the fastest results

      - Users can cancel a single model stream if it's taking too long

      - The orchestrator is pretty good at choosing which models for which task. The actual confidence scoring and synthesis at the end is the difficult part that you cannot do naively; however, the orchestrator plays the biggest part in optimizing cost + speed. I've made sure that we don't exceed 25% extra in cost or time in the vast majority of queries, compared to equivalent prompts in ChatGPT/Gemini/etc.

      IMO the reason this is viable is that you can run multiple less-intelligent models with lower thinking effort and beat a single more-intelligent model with a large thinking effort. Reducing the thinking effort speeds up each prompt dramatically.

      The sequential steps are then:

      1. Ensemble RAG

      2. Orchestrator

      3. Models in parallel

      4. Synthesizer

      And retries for low-confidence (although that's pretty optimized with selective retries of portions of the answer).
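
      For illustration, the sequential steps could be wired up roughly like this; every function below is a stubbed placeholder for the sketch, not the real implementation:

```python
import asyncio

# Stubbed placeholders for each stage; names and behavior are illustrative only.

async def retrieve(prompt):                 # 1. ensemble RAG
    return "retrieved context"

def pick_models(prompt, context):           # 2. orchestrator picks the ensemble
    return ["model-a", "model-b", "model-c"]

async def call_model(model, prompt, context):
    return f"{model} answer"                # stand-in for a provider call

def synthesize(prompt, responses):          # 4. synthesizer combines responses
    return " | ".join(responses)

def confidence(result):                     # stand-in confidence score
    return 0.9

async def answer(prompt, threshold=0.7, max_retries=1):
    context = await retrieve(prompt)
    models = pick_models(prompt, context)
    for _ in range(max_retries + 1):
        responses = await asyncio.gather(   # 3. models run in parallel
            *(call_model(m, prompt, context) for m in models))
        result = synthesize(prompt, responses)
        if confidence(result) >= threshold:
            return result
        # otherwise retry (the real system retries only low-confidence portions)
    return result

print(asyncio.run(answer("example prompt")))
```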

    • mememememememo 13 hours ago
      You could timeout. You could trade them off dynamically.

      I.e. you get 3 replies at 80% confidence. You decide that at 80% you're fairly good, but happy to wait 5 seconds for completion / 500ms for time to first token. If either breaches, you give the current answer.

      But if you're at 5%, you wait 60s total / 2s for a token, since the upside of that still-pending model is much higher.

      Basically wagering time for quality in a dynamic prediction market in front of the LLM.
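
      A sketch of that wager in code, interpolating linearly between my two example points (the linear schedule and the clamp range are arbitrary choices):

```python
def wait_budget(confidence):
    """How long to keep waiting for stragglers, given current confidence.

    Interpolates between the two examples above: at 80% confidence wait
    5 s total / 0.5 s for first token; at 5% wait 60 s total / 2 s.
    """
    c = min(max(confidence, 0.05), 0.8)   # clamp to the anchor range
    t = (0.8 - c) / (0.8 - 0.05)          # 0 when confident, 1 when not
    total_s = 5.0 + t * (60.0 - 5.0)
    ttft_s = 0.5 + t * (2.0 - 0.5)
    return total_s, ttft_s

print(wait_budget(0.8))   # (5.0, 0.5): confident, don't wait long
print(wait_budget(0.05))  # (60.0, 2.0): unsure, the straggler has upside
```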

      • kenmu 1 hour ago
        Love your idea. We have timeout mechanisms and we originally would be pretty aggressive with timeouts based on both time and response length to balance accuracy and speed. There’s research that longer responses tend to be less accurate (when compared to other responses to the same prompt). So we came up with an algorithm that optimized this very effectively. However, we eventually removed this mechanism to avoid losing any accuracy or comprehensiveness. We have other systems, including confidence scoring, that are pretty effective at judging long responses and weighting them accordingly.

        We may reintroduce some of the above with user-configurable levers.

      • all2 3 hours ago
        If we treat LLM output like a manufacturing process: if you have three 80% probabilities, you actually have something like 0.8 × 0.8 × 0.8 -> 0.512, or 51%.
        • scottmu 1 hour ago
          Yes, there's a wide variety of use cases that require different ratios of accuracy/speed. If you require 3 responses to be accurate, you have to multiply all 3 response accuracy probabilities, and as you've shown, this can reduce overall accuracy quite a bit. Of course, this does make the assumption that those 3 responses are independent of one another.
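
          Under that same independence assumption, the contrast with majority voting is worth spelling out: requiring all three to be right multiplies accuracy down, while accepting the answer at least two agree on pushes it above any single model.

```python
from math import comb

p = 0.8  # per-model accuracy, responses assumed independent

all_three = p ** 3                     # all 3 must be right
majority = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))

print(round(all_three, 3))  # 0.512: stricter than any single model
print(round(majority, 3))   # 0.896: better than any single model
```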
          • all2 27 minutes ago
            One thing I considered some months ago that was very similar to what you guys have done, but at a higher abstraction layer:

            1. Consult many models (or a single model with higher temp) with the same prompt

            2. Intelligently chunk the outputs (by entity, concept, subject, etc.)

            3. Put each chunk into a semantic bucket (similar chunks live in the same bucket)

            4. Select winning buckets for each chunk.

            4a. Optionally push the undervoted chunks back into the model contexts for followup: is this a good idea, does it fit with what you recommended, etc.

            4b. do the whole chunk/vote thing again

            5. Fuse outputs. Mention outliers.

            Token spend is heavy here, where we rely on LLMs to make decisions instead of the underlying math you guys went with. IMO, the solution y'all have reached is far more elegant than my idea.
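
            Roughly what I had in mind, sketched in code (difflib string similarity stands in for real semantic bucketing, and the threshold and names are made up):

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    # Stand-in for semantic similarity; real chunks would be embedded.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def bucket_and_vote(chunked_outputs):
    """chunked_outputs: one list of chunks per model (steps 2-3).

    Groups similar chunks into buckets, then votes (step 4): buckets backed
    by a majority of models win; the rest are outliers worth flagging.
    """
    buckets = []
    for chunks in chunked_outputs:
        for chunk in chunks:
            for bucket in buckets:
                if similar(chunk, bucket[0]):
                    bucket.append(chunk)
                    break
            else:
                buckets.append([chunk])
    n_models = len(chunked_outputs)
    winners = [b[0] for b in buckets if len(b) > n_models / 2]
    outliers = [b[0] for b in buckets if len(b) <= n_models / 2]
    return winners, outliers

winners, outliers = bucket_and_vote([
    ["use postgres", "add an index"],
    ["use postgres", "add an index"],
    ["use mysql", "add an index"],
])
print(winners)   # majority-backed chunks
print(outliers)  # minority chunks to mention as outliers (step 5)
```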

  • wavemode 3 hours ago
    Is 7 extra percent on HLE benchmark really worth the cost of running an entire ensemble of models?
    • kenmu 1 hour ago
      I mentioned in another comment that I make sure the cost/time is within 1.25x of the next best single-model run. So it's not perfect, but I think that aspect will only get better with time.

      Of course I'm biased, but using Sup has been great for me personally. Even disregarding the HLE score, having many different perspectives in the answers, and most importantly the combined answer, has been very helpful in feedback for architectural decisions I make for Sup, and many other questions I would normally ask ChatGPT/Gemini/Claude/Grok individually.

    • kelseyfrog 2 hours ago
      Depends on the use-case and requirements.