I wonder if a custom UI that auto-layouts different files in a 2D grid and connects them to each other in a graph in an intuitive way would help a lot with review velocity.
As in, if you have a large screen, a particularly-trained/prompted AI can organize the code changes in a "flowchart" with floating windows you can easily follow
Maybe in this UI, each code piece also comes with a summary from an agent that has already auto-reviewed the whole PR and creates a basic summary (instructed to be neutral but surface issues if it finds any).
Same experience here - I built a similar tool, for reviewing both plans and code - https://crit.md (shameless plug), browser based as opposed to TUI.
Having said that, I don't review the code until going through a few iterations of reviews from Claude. Each round it does find some "obvious" issue, so as long as I'm not close to maxing my subscription for the week I let it run an audit -> validator checks the claims -> fix issues before I get to it.
I've been using https://github.com/choplin/code-review.nvim, which looks like a similar UI, but in the NeoVim interface. `<leader>rc` to comment on a line/selection, then `<leader>ry` to yank all comments into the clipboard to be pasted into a chat.
It leaves the comments as markdown files in ./.code-review, so I also have my `/review` agent set to output in the same format, so an LLM can be reviewing the same code I am, I can edit or dismiss the LLM's reviews, then send the whole thing back to the first agent to fix.
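For anyone scripting around that layout, here's a minimal sketch (mine, not part of code-review.nvim) that assumes the comments sit as individual markdown files under ./.code-review and just concatenates them into something you can paste into a chat or pipe to another agent:

```python
# Hypothetical helper, not part of code-review.nvim: gather the markdown
# comment files under ./.code-review into one prompt-ready blob.
from pathlib import Path

def collect_review_comments(root: Path = Path(".code-review")) -> str:
    parts = []
    for md in sorted(root.glob("**/*.md")):
        # Prefix each comment with its filename so the agent can trace it back.
        parts.append(f"## {md.name}\n{md.read_text()}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(collect_review_comments())
```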
I was thinking for years about doing something like this. Thank you for linking it. It would be nice if it allowed you to "reject" a change or mark it to fix later, but honestly then it would need to be linked to some tracking tool, and that would be overkill.
The video actually convinced me that this might be an interesting tool. I'm going to try it myself for a small one-shot project and see how well it performs.
TUI-based reviews on their own are already interesting. I had never considered it, I guess.
Curious how the multi-agent setup handles disagreement — when reviewer A says merge and reviewer B says block, does the human get both sides, or does one win the tie-breaker?
Why not just use an eval harness to prove this catches more real bugs? Benchmarks on actual bug classes would be far more convincing than comparing against /review.
That's a great idea. I had trouble finding anything like this, a benchmark made for (AI) code reviewers.
I had expected to find something like an eval harness available on GitHub, but couldn't find it.
Any suggestions? Or maybe we/I/someone should build something like this?
I suppose one challenge is that if it's going to be publicly available, it would also be easy to cheat, but still seems it would be useful if people agreed it's a good benchmark and could easily re-test tools themselves.
Based on how the same models' rankings fluctuate week to week, all I can conclude is that either no frontier model is statistically better than the others, or it's so task-dependent that the results can't converge.
On small PRs (small features / changes ~hundreds to thousands of lines), I'd say around 500,000 total tokens.
On large PRs (new feature sets for apps ~10,000-30,000 lines), around 2-3 million total tokens.
By the way, I should have mentioned in my original post, adamsreview counts tokens used by sub-agents across the stages, and tells you at the end of each stage the total used so far.
Curious what kinds of bugs the multi-agent setup catches that single-pass review misses in practice. Is it more about coverage (different agents looking at different aspects) or about getting a second opinion on the same aspect? The README has examples but the mechanism by which the parallelism actually helps isn't obvious to me from them.
I was thinking about building a GitHub repo made for evaluating Code Reviews. Something like a complex app (or perhaps a few branches with different options), and then PRs on each branch with varying types and degrees of bugs for a Code Review to find.
I suppose this would not be a 'real' benchmark because it would be public and so you couldn't necessarily trust scores people share about how their own tool did, but it would at least allow anyone to try out code review tools on their own and report relative effectiveness and characteristics.
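Scoring could stay dead simple: each PR branch ships a manifest of the bugs that were seeded, and a reviewer's findings get matched against it. A rough sketch (the manifest format and the exact-line matching are just illustrative; real scoring would need fuzzier matching):

```python
# Illustrative scoring sketch: compare a reviewer's findings against the
# seeded-bug manifest for one PR branch. Manifest format is hypothetical.
import json
from pathlib import Path

def score_review(manifest_path: Path, findings: list[dict]) -> dict:
    # manifest: [{"id": "sql-injection-1", "file": "db.py", "line": 42}, ...]
    seeded = json.loads(manifest_path.read_text())
    reported = {(f["file"], f.get("line")) for f in findings}

    caught = [b for b in seeded if (b["file"], b["line"]) in reported]
    missed = [b["id"] for b in seeded if b not in caught]
    return {
        "seeded": len(seeded),
        "caught": len(caught),
        "missed": missed,
        "recall": len(caught) / len(seeded) if seeded else 1.0,
    }
```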
I'll post again if I end up finding or building something like that. I couldn't find anything when I looked previously.
I'll also keep in mind your question as I continue testing this, because you are right that it would be useful to be able to describe what is different, not just the magnitude of bugs found.
Yes, being comprehensive, so early or blatantly cheap findings don't distract from the others. That's important for the base results. Splitting by both file and task is (currently) important.
Additionally, we run in a loop until it stops finding things, and as part of that, do test amplification when it does find any. We regularly see 3-8 rounds yielding valid results.
IMO half the value is customization to your repo, so copying these and specializing them to your repo is super quick and pays off almost immediately: how to find style guides, how to run tests, what dimensions of correctness to look for, etc.
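A loop of that shape doesn't need much scaffolding. A simplified sketch, wired to the claude CLI's print mode just for concreteness (the prompt, the NO_FINDINGS convention, and the round cap are illustrative, not our exact setup):

```python
# Simplified sketch of a review-until-convergence loop. Assumptions: the
# reviewer is invoked via `claude -p` (print mode), and it is instructed to
# answer exactly NO_FINDINGS when a round turns up nothing new.
import subprocess

REVIEW_PROMPT = (
    "Review the current branch's diff against main for correctness, security, "
    "and test gaps. If you find nothing new, reply with exactly NO_FINDINGS."
)

def review_until_converged(max_rounds: int = 8) -> list[str]:
    reports = []
    for _ in range(max_rounds):
        out = subprocess.run(
            ["claude", "-p", REVIEW_PROMPT],
            capture_output=True, text=True, check=True,
        )
        report = out.stdout.strip()
        if "NO_FINDINGS" in report:
            break  # converged: a round with nothing new
        reports.append(report)
        # ...fix (or have an agent fix) the findings, amplify tests, then loop again...
    return reports
```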
This kind of thing makes me question how important Mythos is for security bug finding - doing a high-effort loop with a frontier model in code reviews until convergence has already outperformed human review for us. (It doesn't replace it, but it does find things we miss, and catches many we do see earlier.)
That's the main issue I've found from running loops like this. Each loop has ~7 agents, say, looking through different lenses (security, UX, performance, etc.). Each one notes a few issues, each issue gets fixed, you do 5 to 8 loops, as you say. Each individual item that gets fixed looks minor but when you add it all up at the end you've increased PR size and scope significantly.
I recently opened a PR against this AI personal finance tool Ray https://github.com/cdinnison/ray-finance/pull/8 to add an Apple Card import feature, since Apple Card is not supported by Plaid.
I built the manual import feature, opened the PR, and then ran a code review.
What I hadn't thought about when I built the feature was the myriad ways the implications of importing data from Apple would have to be considered and integrated into the rest of the app for the manual import to be a first-class feature, not "just a manual import" of data.
I ended up running adamsreview against it like 5-10 times before considering it complete, as I learned that there was much more to the integration than I realized.
Now is that necessarily a problem? Maybe not. I should have realized from the start that the import feature was going to be much more than just a small feature. But at least, thanks to the review loop, I got it completely right before the PR was merged.
- one wave is code reduction via DRY removals and architectural fixes, and another is adversarial to get rid of false additions, so this helps counter AI bloat either way
- as the other comment says, underspecification is a problem, so this ends up finding when the implementation, tests, docs, quality guide, and spec are out of sync, and whichever is to blame.
- Usable, well-designed, secure, and well-typed code ends up being bigger, so this helps cut to the chase. Ultimately, either you get there or you don't, and this helps cut review burden so you can do your part of it faster and at a higher level.
Funnily enough, I'm now playing with gardening agents whose job is to reduce code. But I wouldn't want to slow PRs down on that, so I view those as separate PRs.
I would say my workflow for any meaningful amount of work is (all in Claude Code):
- PRD: I discuss and brainstorm with Claude Code using something like the Grill Me skill https://github.com/mattpocock/skills/tree/main/skills/produc... but that I've modified a bit for my own style, until I have a good PRD (what the goals / design decisions are for what I'm building)
--- I run this PRD through multiple AI reviews (sometimes ChatGPT Pro for really important PRDs, because it seems to have some of the best critical feedback)
--- I read the PRD myself in detail before finalizing.
- PLAN: I have Claude Code develop the plan for implementing the PRD. Again, I have this reviewed several times by CC and sometimes by other tools for effectiveness, consistency with the PRD, consistency with the codebase, and internal consistency.
- EXECUTE: I have an orchestration command I made that has CC execute the PLAN and use a build journal, using sub-agents whenever possible to save context, so that it can operate for up to several hours.
- QUICK REVIEWS: I have these commands, /review-fix-loop and /quick-dual-review, which loop around running Claude and Codex sub-agent reviews and then fixing anything critical (deferring items needing human judgment)
- CODE REVIEWS: This is when I run between one and several of the adamsreview reviews, starting with /review --ensemble, then /walkthrough, then /fix; until I am satisfied.
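(A rough sketch of scripting that last stage, assuming the slash commands can be invoked through Claude Code's print mode; in practice I run them interactively and stop when I'm satisfied.)

```python
# Rough sketch of chaining the review commands non-interactively.
# Assumes the custom slash commands are installed and work in print mode.
import subprocess

REVIEW_STAGES = ["/review --ensemble", "/walkthrough", "/fix"]

def run_review_pass() -> None:
    for command in REVIEW_STAGES:
        print(f"running {command} ...")
        subprocess.run(["claude", "-p", command], check=True)

if __name__ == "__main__":
    # One pass; repeat (or loop with a convergence check) until satisfied.
    run_review_pass()
```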
Would it be useful if I packaged all this stuff into a GH repo to share with you and others?
I think there's no harm in that. But I will say, it sounds very similar to my own system, and probably a ton of other people's.
Yours might very well be better than most, but the thing that's missing from all of these is evals. You, me, everyone else, we're all vibe coding up these loops, we're getting work done and feeling excited about it, excited enough that we want to share, but nobody is doing real testing or benchmarks.
That is such a great point. We do need evals for this - and not just ones that the model companies use themselves. They have to be public and sharable and easy to use ourselves.
And in terms of sharing, I agree. On one hand, so many of us are already doing this ourselves. On the other hand, when I was first learning CC and agentic engineering (vibe coding at the time :) ), I did find some of these random people's templates useful.
Is there a good way of adding in your own rules to the review? I'm always in the market for better review tools, but I also need to check against internal coding standards and expectations.
adamsreview is mostly English language, since it is instructing your CC agents. While there are a number of Python scripts and JSON storage, you - and your agent - would easily be able to add your own rules to it. It also respects your Claude.md, and one of the lenses it already uses is checking for Claude.md compliance, etc.
But with human judgment inserted into the right steps, it's really LLMs leveraging human thought at key stages and then, to your point, LLMs fighting LLMs fighting LLMs all the way down until...
Someone like me who has loved software his whole life and never been able to build anything more than a front-end website himself is building entire applications. So maybe it's worth the complexity!
Great project! I've built something similar, not very clean and polished, but focused around deterministic orchestration of multiple agents via TypeScript, because a coordinating agent was notoriously bad at things such as fetching relevant tickets and other context. One thing I struggle with so far, though, is the actual instructions for the reviews themselves. They are either too vague, leading to superficial or overly broad reviews, or too specific and thus not applicable to different kinds of PRs…
That's awesome to hear and I'd love to see it when you're ready.
I actually think having something like adamsreview orchestrated by deterministic code - instead of simply having AI agents use deterministic code occasionally as this app does - could be even better!
The problem I ran into is that if you build a deterministic app that happens to use LLMs instead of the other way around, I don't think there's any way to get it to use your Claude Code subscription credits. It has to use API. And something like adamsreview would end up being so expensive if not subsidized by Anthropic along with the rest of our CC usage.
It's possible to use subscriptions! I run them in containers. For Claude, I use `claude setup-token`, put the token into a local auth.json, and mount that. For Codex, I run the CLI in my working dir prefixed with `CODEX_HOME=./codex-home codex` and mount that whole `codex-home` directory - done :)
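If it helps, the shape of it is roughly this (sketched via the Docker CLI from Python; the image name and in-container paths are placeholders, adjust to wherever your claude install expects its credentials):

```python
# Rough sketch of the container setup described above. The image name and the
# in-container credential path are placeholders/assumptions, not exact values.
import subprocess
from pathlib import Path

def run_claude_review(workdir: Path) -> None:
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{workdir.resolve()}:/work",  # repo under review
            # token produced by `claude setup-token`, mounted at an assumed path
            "-v", f"{Path('auth.json').resolve()}:/root/.claude/auth.json",
            "-w", "/work",
            "my-claude-image",          # placeholder image with the claude CLI installed
            "claude", "-p", "/review",  # print mode: no interactive TTY needed
        ],
        check=True,
    )

# Codex isolates the same way: mount a per-project directory and point
# CODEX_HOME at it, e.g. add `-e CODEX_HOME=/work/codex-home` and run `codex`.
```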
The irony of multi-agent code review is that the people who would use it are already the ones who care about code quality. The real problem is everyone else just hitting accept on whatever Claude spits out without even reading the diff. Tooling for review keeps getting better while the average review effort keeps going down.
But it's extremely useful and effective compared to everything else out there that I've tried, if you're looking for an AI code review. Let me know if you try it - or find anything else that might work too without the bazillion prompts :)
I'll try to find a good public PR to review sometime soon so I can share that and add it to the Readme. This is really good feedback. I should have had something like this ready before posting to Show HN.
It runs locally, YOU review all the code locally, and feed that back to Claude.
Agents reviewing AI code always felt dirty to me, especially when working on production (non-disposable) code.
I'll run this against your PR for you with my CC credits as a sort-of benchmark! Send me your PR link :)
I'm going to create one on one of my other repos meanwhile and add a link to the review when it's ready.
https://codereview.withmartian.com/
You're right, but evals are actually fairly tricky to write and maintain.
How expensive is it to run in your experience? In $ or tokens?
https://www.greptile.com/benchmarks
We do a similar loop here: https://github.com/graphistry/pygraphistry/blob/master/agent...
I am more curious about your AI workflow, as I stay away from others' tools because I don't trust vibe-code-related tools.
What is the workflow difference between `fragments/` and `plans/`? They seem logically the same but appear to have been used for different purposes.
Is this something it did on its own or is this something you prompted it to do?
I wish I was kidding…
It sometimes feels silly to me to have AI reviewing AI reviewing AI all the way down - see my above comment https://news.ycombinator.com/item?id=48095831
Curious to hear about your experience.
Seems like it would create a lot of friction and burn a lot of tokens.
Friction - maybe? Depending on what you mean.
Have we all just given up?
Here's a comment from adamsreview, but even this was 3 weeks ago, and I've worked on it a lot since then: https://github.com/cdinnison/ray-finance/pull/8#issuecomment...