What both tools are really competing on
The useful comparison is not whether the tool sounds smart in chat. It is whether the agent can inspect the repository, make a limited plan, edit files, run the right commands, explain the diff, and leave the project in a state a human can review.
- Give each tool the same issue, same repository, same time limit, and same allowed commands.
- Judge the final diff, the test evidence, and the amount of cleanup a human reviewer still needs.
- Track failure modes: missed files, broken tests, dependency churn, formatting noise, and risky shell commands.
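One way to keep such a comparison honest is to record every run in the same shape. The sketch below is only an illustration, not a standard harness: `TrialSetup`, `TrialResult`, `comparable`, and `summarize` are hypothetical names, and `cleanup_minutes` is an assumed proxy for how much work the reviewer still has to do.

```python
from dataclasses import dataclass, field

# Failure modes from the checklist above, as fixed labels.
FAILURE_MODES = (
    "missed_files",
    "broken_tests",
    "dependency_churn",
    "formatting_noise",
    "risky_shell_commands",
)


@dataclass(frozen=True)
class TrialSetup:
    """Conditions that must be identical for every tool (hypothetical type)."""
    issue: str
    repository: str
    time_limit_minutes: int
    allowed_commands: tuple


@dataclass
class TrialResult:
    """Outcome of one tool's run under a given setup (hypothetical type)."""
    tool: str
    setup: TrialSetup
    tests_passed: bool
    cleanup_minutes: int  # assumed proxy for remaining reviewer effort
    failures: set = field(default_factory=set)

    def __post_init__(self):
        # Reject labels outside the agreed list so results stay comparable.
        unknown = set(self.failures) - set(FAILURE_MODES)
        if unknown:
            raise ValueError(f"unknown failure modes: {sorted(unknown)}")


def comparable(a: TrialResult, b: TrialResult) -> bool:
    """Two results are only comparable if they ran under the same setup."""
    return a.setup == b.setup


def summarize(result: TrialResult) -> dict:
    """Flatten a result into a row suitable for a side-by-side table."""
    return {
        "tool": result.tool,
        "tests_passed": result.tests_passed,
        "cleanup_minutes": result.cleanup_minutes,
        "failure_count": len(result.failures),
        "failures": sorted(result.failures),
    }
```

Building both results from a single `TrialSetup` instance and checking `comparable(a, b)` before reading the summaries guards against the most common mistake, which is comparing runs that had different issues, time limits, or allowed commands.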