Protocol-Driven Evaluation of LLM Agents

Agent Evaluation · Research Infrastructure · Reproducible Workflows

Master internship research on evaluating heterogeneous LLM-based agents through shared protocols, rubrics, and reproducible workflows.

Proof: Master Internship · First-author Paper · NeurIPS LLM Evaluation Workshop

Context

LLM-based agents are difficult to compare because they often use different interfaces, tools, task structures, and evaluation assumptions.

My Role

I worked on the research direction during my master internship, covering the literature review, benchmark analysis, evaluation design, and the platform concept.

Contribution

The project explored a protocol-driven approach to evaluating heterogeneous LLM agents: agents are run on shared tasks, scored against structured rubrics, and executed through reproducible workflows.
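
To make the idea concrete, here is a minimal Python sketch of what a shared protocol with rubric-based scoring could look like. It is an illustration under stated assumptions, not the project's actual design: every name in it (Agent, Criterion, Task, score, evaluate, the two toy agents, and the sample task) is hypothetical, and the rubric checks are simple deterministic string tests standing in for richer criteria.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Agent(Protocol):
    """Hypothetical shared interface: anything that maps a prompt to a text answer."""
    name: str

    def run(self, prompt: str) -> str: ...


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the answer satisfies this criterion


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    rubric: tuple[Criterion, ...]


def score(answer: str, rubric: tuple[Criterion, ...]) -> float:
    """Weighted fraction of rubric criteria the answer satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(answer))
    return earned / total if total else 0.0


def evaluate(agents: list[Agent], tasks: list[Task]) -> dict[str, dict[str, float]]:
    """Run every agent on every task and score each answer with that task's rubric."""
    return {
        agent.name: {t.task_id: score(agent.run(t.prompt), t.rubric) for t in tasks}
        for agent in agents
    }


# Two toy agents with different internals but the same interface (assumed examples).
class EchoAgent:
    name = "echo"

    def run(self, prompt: str) -> str:
        return prompt  # simply repeats the prompt


class FixedAgent:
    name = "fixed"

    def run(self, prompt: str) -> str:
        return "The answer is 4."


# A single illustrative task with a two-criterion rubric.
TASKS = [
    Task(
        task_id="arith-1",
        prompt="What is 2 + 2?",
        rubric=(
            Criterion("mentions_4", 0.75, lambda a: "4" in a),
            Criterion("ends_with_period", 0.25, lambda a: a.strip().endswith(".")),
        ),
    )
]

RESULTS = evaluate([EchoAgent(), FixedAgent()], TASKS)
print(RESULTS)
```

Because the agents only have to satisfy the same small interface, and the tasks and rubrics are plain data, the same run can be repeated on any new agent and produce scores that are directly comparable.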

Outcome

The work formed part of my master internship and was later developed into a first-author paper accepted at the NeurIPS LLM Evaluation Workshop.

Media

Poster · Pipeline diagram · Paper title screenshot · Internship report cover · Research timeline