# Every AI Benchmark Tests Coding. We Built One That Tests Infrastructure Work.
2026-03-05
Every AI coding tool benchmark tests the same things. Autocomplete accuracy. Code generation. Refactoring. Test generation. Maybe a LeetCode problem for good measure. Not a single one tests the work infrastructure engineers actually do.

I spend my days writing Terraform modules, debugging Kubernetes incidents, migrating CI/CD pipelines, and reviewing Helm charts for security issues. When I evaluated AI coding agents for my team, every vendor benchmark told me how well their tool could write a React component. None of them told me whether it could generate a production EKS module with networking, IAM, and logging that actually plans without errors.

So we built the benchmark that should have existed.

## 20 tasks across 5 categories

We designed 20 infrastructure tasks grouped into the five categories that eat most of a platform engineer's time:

1. **Terraform module generation** -- Generate complete, standards-compliant modules from organizational patterns. The test: does the output pass `terraform validate` without manual patching? Does it follow naming conventions? Are there hardcoded environment values hiding in the examples?
2. **Infrastructure composition** -- Convert natural-language intent ("2 web apps, 1 key vault, private endpoints, dev/staging/prod") into dependency-aware IaC scaffolding. This is harder than it sounds.
The agent needs to understand that the web apps depend on the service plan, that the key vault needs RBAC before the apps can reference it, and that the environments need separate state.
3. **Kubernetes incident triage** -- Diagnose real failure patterns: CrashLoopBackOff, OOMKilled, ImagePullBackOff, configuration drift. The test isn't "what's wrong?" -- it's whether the agent follows a coherent triage sequence: read logs, check resource limits, verify probe configuration, compare running state to declared state, suggest a safe rollback path.
4. **CI/CD workflow engineering** -- Create pipelines and migrate existing ones. The hardest task in this category: Jenkins-to-GitHub Actions migration with security gates and approval workflows preserved. Most agents could generate a new workflow. Fewer could preserve the approval semantics of an existing Jenkinsfile.
5. **Security and reliability review** -- Evaluate Helm charts and pipeline configurations for privilege escalation, secrets exposure, blast radius, and missing policy guardrails. The test: identify real risks with practical remediations that don't require destructive changes without approval.

Each category contained four tasks of increasing complexity. The Terraform category, for example, started with a single-resource module and scaled to a multi-module composition with cross-resource dependencies, environment separation, and state management. The point was to test not just whether the tool could generate infrastructure code, but whether it could reason about infrastructure systems.

## Three architectures, three philosophies

We ran all 20 tasks across three tools: GitHub Copilot, Claude Code, and Amazon Q Developer. The most interesting finding wasn't which tool scored highest. It was how fundamentally different their architectures are -- and how those architectures determine what they're good at.

## GitHub Copilot: IDE-first with workflow guardrails

Copilot's architecture centers on the IDE. It runs asynchronously in GitHub Actions-backed environments with explicit pull request handoffs.
When Copilot's coding agent generates infrastructure code, the workflow runs are not executed automatically -- they require approval in Actions. This is a deliberate governance choice. Copilot treats every generated change as a PR that goes through your existing review process. For organizations that want AI-assisted infrastructure changes to flow through the same gates as human changes, this is the right model.

The tradeoff: IDE autocomplete doesn't translate directly to infrastructure reasoning. One task asked for a production-grade EKS Terraform module with VPC networking, IAM roles, node groups, and CloudWatch logging. The output required modifications before it would even plan successfully. The individual resource blocks were syntactically correct, but the cross-resource dependencies -- the parts that make infrastructure code actually work -- needed manual wiring.

Copilot's strength showed on tasks where the scope was contained: a single module, a focused pipeline, a specific Helm chart review. Where it struggled was on multi-step tasks requiring an understanding of how 15+ resources interact.

## Claude Code: terminal-first and tool-centric

Claude Code operates as a command-line tool that executes commands under the user's permissions. It's terminal-native -- designed to run alongside your existing shell workflow rather than inside an IDE.

The architecture difference shows immediately on infrastructure tasks. When debugging a CrashLoopBackOff, the triage sequence involves reading logs, checking resource limits, verifying probe configs, and comparing running state to declared state. In a terminal context, these are sequential commands with the full output flowing into the agent's context window. Claude Code's 200K-token context window (expandable to 1M) means it doesn't lose earlier diagnostic information while working through later steps.

The numbers are hard to ignore: Claude Code currently accounts for 4% of all GitHub commits -- roughly 134,000 per day -- and is projected to hit 20% by year-end.
Anthropic reports a $2.5B run rate, with users doubling since January 1st, 2026. Git worktree isolation means the agent can work on infrastructure changes in a separate branch without touching your working directory.

Where it struggled: tasks requiring deep integration with specific cloud services. It generated correct Terraform and Kubernetes manifests, but it didn't have the same awareness of AWS-specific best practices that a cloud-native tool would.

## Amazon Q Developer: deeply AWS-integrated

Amazon Q's architecture is built around AWS service integration. Cedar-based policy enforcement, 13 quality evaluators, and deep CloudWatch integration give it capabilities that generic tools can't match on AWS-specific tasks.

On our AWS-focused tasks, Q Developer was impressive. CloudWatch log analysis, IAM policy review, and AWS-specific Terraform patterns all benefited from Q's native understanding of AWS service relationships.

The tradeoff is the inverse of its strength: that depth vanishes the moment you work across clouds. Tasks involving Azure resources, multi-cloud networking, or cloud-agnostic Kubernetes patterns didn't benefit from Q's AWS integration. For teams running 100% AWS, this isn't a limitation. For teams running hybrid or multi-cloud -- which is most enterprise infrastructure teams -- it's a significant constraint.

## Two corrections to common claims

While researching this benchmark, I found two claims circulating widely that needed correction.

**GitHub's "Agentic Workflows" launch.** Multiple sources describe this as a general product launch on February 17, 2026. The actual documentation from GitHub Next describes it as a research demonstrator, not a GA product release. The distinction matters if you're evaluating Copilot for production infrastructure workflows.

**Amazon Q's Cedar policies and 13 evaluators.** These are frequently attributed to Amazon Q Developer directly. They're actually capabilities of AWS AgentCore -- a separate service for building and governing AI agents.
Q Developer can leverage AgentCore, but the policy framework isn't a core Q Developer feature. If you're expecting Cedar-based governance out of the box with Q Developer, check whether your plan includes AgentCore access.

Neither correction diminishes the tools. But infrastructure teams making purchasing decisions need accurate capability mapping, not marketing summaries.

## What the results actually mean

Three patterns emerged across all 20 tasks.

## 1. IDE-first tools struggle with infrastructure context

Infrastructure code is fundamentally different from application code. A React component is self-contained -- you can evaluate it in isolation. A production EKS Terraform module with VPC networking, IAM roles, node groups, and logging touches 15+ resources with complex dependency chains. Understanding whether `module.eks.cluster_endpoint` is correctly referenced three modules away requires the kind of cross-file context that IDE autocomplete wasn't designed for.

This doesn't mean Copilot can't do infrastructure work. It means the IDE-first interaction model -- suggest completions in the current file based on nearby context -- is a less natural fit for infrastructure than for application code. The tasks where Copilot performed best were scoped to a single file or module.

## 2. Terminal-native agents handle multi-step infrastructure work better

Debugging a CrashLoopBackOff isn't a single-prompt task. It's a sequence: `kubectl logs`, then `kubectl describe pod`, then check resource requests vs. limits, then verify liveness/readiness probes, then compare the running manifest to the declared spec, then check recent deployments for configuration drift. Each step's output informs the next.

Terminal-native tools that maintain context across sequential commands handled these triage sequences more coherently than tools designed around single-prompt interactions. The agent didn't lose the output from step 2 when it reached step 5. The same pattern held for CI/CD migration tasks.
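The CrashLoopBackOff triage sequence described above can be captured as a short shell script. This is a sketch only: the pod name, namespace, deployment name, and manifest path are placeholders, and the commands assume a reachable cluster.

```shell
#!/usr/bin/env sh
# CrashLoopBackOff triage sketch. POD, NS, deployment/web, and
# deployment.yaml are placeholders for your own resources.
POD=web-7d9f8-abcde
NS=prod

if command -v kubectl >/dev/null 2>&1; then
  kubectl -n "$NS" logs "$POD" --previous --tail=100        # 1. logs from the crashed run
  kubectl -n "$NS" describe pod "$POD"                      # 2. events, restart count, last state
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.spec.containers[*].resources}'           # 3. requests vs limits
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.spec.containers[*].livenessProbe}'       # 4. probe configuration
  kubectl -n "$NS" diff -f deployment.yaml                  # 5. running vs declared state
  kubectl -n "$NS" rollout history deployment/web           # 6. safe rollback candidates
else
  echo "kubectl not found -- commands listed for illustration only"
fi
```

Each numbered step feeds the next, which is exactly the context a single-prompt tool tends to drop partway through.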
Converting a Jenkinsfile to GitHub Actions requires reading the existing pipeline, understanding the stage semantics, mapping plugins to Actions equivalents, and preserving approval gates -- a multi-step reasoning process that benefits from persistent context.

## 3. Cloud-native integration cuts both ways

Amazon Q's AWS depth is genuinely useful. On AWS-specific tasks, it surfaced best practices and service relationships that generic tools missed. If your infrastructure is 100% AWS, this integration is a clear advantage.

But most enterprise infrastructure teams don't live in a single cloud. The moment a task involved Azure resources, cloud-agnostic Kubernetes patterns, or multi-cloud networking, Q's advantage disappeared. And because the deep integration creates an expectation of quality, the gap between "AWS task" and "non-AWS task" is more jarring than with a tool that makes no cloud-specific promises.

The lesson: match the tool's architecture to your infrastructure reality, not to any single cloud provider's pitch. Before signing an enterprise agreement, count how many of your infrastructure tasks touch only one cloud. If the answer is "most of them," cloud-native tools are a fit. If the answer is "about half," you need a tool that works everywhere, even if it's less deep on any single platform.

## Build your own benchmark in 15 minutes

This is the most actionable part. Vendor benchmarks test vendor strengths. Your tasks test your reality. Here's a 5-task starter framework using your own infrastructure patterns. Each task has specific validation criteria so you're not just vibes-testing.

**Task 1: Generate a Terraform module from your standards.**
Pick a resource type you deploy frequently. Ask each tool to generate a complete module following your naming conventions. Validate: does `terraform validate` pass without edits? Are there hardcoded environment values? Does it include the variables and outputs your team expects?

**Task 2: Compose a multi-resource stack from intent.**
Describe a real deployment in plain language: "2 web apps, 1 key vault, private endpoints, dev/staging/prod." Validate: is the dependency graph correct? Are environments explicitly separated? Is the state layout reviewable?

**Task 3: Triage a real incident from last month.**
Pick a Kubernetes or infrastructure incident your team debugged recently. Give the tool the same starting information your on-call engineer had. Validate: does it follow ordered triage steps with concrete signals (probe failures, OOM events, config mismatches)? Does it suggest safe rollback paths?

**Task 4: Migrate a real pipeline.**
Take an existing Jenkins, GitLab CI, or Azure DevOps pipeline and ask the tool to convert it to GitHub Actions (or vice versa). Validate: are stages and approval gates preserved? Does it handle secrets correctly on the target platform? Are existing rollback procedures maintained?

**Task 5: Review a real Helm chart for security.**
Give the tool a Helm chart your team actually deploys. Validate: does it identify privilege-escalation risks? Secret exposure? Does it suggest practical remediations that don't require destructive changes without approval?

Run all five tasks on each tool you're evaluating. Score pass/fail against the validation criteria. The results will tell you more in 15 minutes than any vendor demo.

## The honest take

No tool won across all 20 tasks. That's the point.

**GitHub Copilot** is the right choice for teams that want AI-assisted infrastructure changes flowing through existing PR review processes with explicit approval gates. If your governance model is "nothing ships without a reviewed PR," Copilot's architecture enforces that naturally.

**Claude Code** is the right choice for infrastructure engineers who live in the terminal and need multi-step reasoning across complex dependency chains. If your typical task involves reading state from three sources and synthesizing a plan, terminal-native context management matters.

**Amazon Q Developer** is the right choice for teams running deep on AWS who want cloud-native intelligence that understands service relationships. If your infrastructure is 90%+ AWS and staying there, Q's integration is a genuine advantage.

The wrong choice is evaluating any of these tools on vendor benchmarks that test React components when your team writes Terraform and debugs Kubernetes. The second wrong choice is evaluating on a single task. Infrastructure engineers don't do one thing -- they context-switch between writing IaC, debugging incidents, reviewing security posture, and migrating pipelines in a single day. Your evaluation needs to cover that range.

Build the 5-task benchmark with your own infrastructure patterns. It takes 15 minutes. It will tell you which tool fits your actual work better than any blog post -- including this one.
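As a concrete starting point, the Task 1 gate can be scripted. This is a minimal sketch under stated assumptions: `MODULE_DIR` and the grep pattern for hardcoded environment values are illustrative choices, not part of the benchmark, and a missing module directory is simply treated as a failed run.

```shell
#!/usr/bin/env sh
# Sketch of the Task 1 gate for one generated module.
# MODULE_DIR is a placeholder path; adjust to your repo layout.
MODULE_DIR=./modules/generated
status=PASS

if [ ! -d "$MODULE_DIR" ]; then
  # Nothing was generated -- treat the run as failed.
  status=FAIL
elif ! terraform -chdir="$MODULE_DIR" validate >/dev/null 2>&1; then
  # Module does not validate without edits.
  status=FAIL
elif grep -rqE '"(dev|staging|prod)"' "$MODULE_DIR"; then
  # Crude check for hardcoded environment values in the source.
  status=FAIL
fi

echo "terraform_module_generation: $status"
```

Run it once per tool and record the line in your scoring file; the other four tasks get analogous checks.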
The full benchmark deep dive with all 20 tasks and sources is here: Infrastructure AI Agent Benchmark: Copilot vs Claude Code vs Amazon Q

The companion playbook for building an agentic Terraform factory with MCP governance is here: Agentic Terraform Factory Playbook

I write about AI for infrastructure teams at talk-nerdy-to-me.com. If you've run your own infrastructure benchmark across AI tools, I'd like to hear what you found. Drop a comment.
```yaml
# Scoring template
benchmark_tasks:
  - name: terraform_module_generation
    validate:
      - terraform_validate_passes: true
      - no_hardcoded_env_values: true
      - required_variables_present: true
      - required_outputs_present: true
    result: pass | fail
  - name: multi_resource_composition
    validate:
      - dependency_graph_correct: true
      - environment_boundaries_explicit: true
      - state_layout_reviewable: true
    result: pass | fail
  - name: incident_triage
    validate:
      - ordered_triage_steps: true
      - concrete_signals_referenced: true
      - safe_rollback_suggested: true
    result: pass | fail
  - name: pipeline_migration
    validate:
      - stages_preserved: true
      - approval_gates_preserved: true
      - secrets_handled_correctly: true
      - rollback_procedures_maintained: true
    result: pass | fail
  - name: helm_security_review
    validate:
      - privilege_escalation_identified: true
      - secret_exposure_identified: true
      - remediations_non_destructive: true
    result: pass | fail
```
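Once each tool's run is recorded in a copy of this template, a plain grep tally is enough to score it, since every task carries a `result:` field. The `results.yaml` filename and the inline three-task sample below are illustrative, not from the benchmark.

```shell
#!/usr/bin/env sh
# Tally pass/fail from a filled-in copy of the scoring template.
# The heredoc writes a small sample file standing in for a real run.
cat > results.yaml <<'EOF'
benchmark_tasks:
  - name: terraform_module_generation
    result: pass
  - name: multi_resource_composition
    result: fail
  - name: incident_triage
    result: pass
EOF

# Count result lines per outcome.
pass=$(grep -c 'result: pass' results.yaml)
fail=$(grep -c 'result: fail' results.yaml)
echo "passed: $pass / failed: $fail"   # prints: passed: 2 / failed: 1
```

Run the same tally against one results file per tool and you have a comparable scoreboard.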