
Agent Evaluation

Protocol-Driven Evaluation of LLM Agents

Agent Evaluation · Research Infrastructure · Reproducible Workflows

Master's internship research on evaluating heterogeneous LLM-based agents through shared protocols, rubrics, and reproducible workflows.

Proof: Master's Internship · First-author Paper · NeurIPS LLM Evaluation Workshop

Context

LLM-based agents are difficult to compare because they often use different interfaces, tools, task structures, and evaluation assumptions.

My Role

During my master's internship I worked on the research direction: literature review, benchmark analysis, evaluation design, and the platform concept.

Contribution

The project explored a protocol-driven approach for evaluating heterogeneous LLM agents through shared tasks, structured rubrics, and reproducible workflows.
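
As a rough illustration only, and not the project's actual implementation, the idea can be sketched in a few lines of Python: agents with different internals expose one shared interface, every agent runs the same tasks, and a structured rubric scores each output in a deterministic way. All names here (Agent, Task, Criterion, score, evaluate, and the two toy agents) are hypothetical and chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Agent(Protocol):
    """Hypothetical shared interface: anything that maps a prompt to text."""

    def run(self, prompt: str) -> str: ...


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the output satisfies the criterion


def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the output satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def evaluate(agents: dict[str, Agent], tasks: list[Task],
             rubric: list[Criterion]) -> dict[str, float]:
    """Run every agent on the same tasks and rubric; return mean score per agent."""
    results = {}
    for name, agent in agents.items():
        scores = [score(agent.run(t.prompt), rubric) for t in tasks]
        results[name] = sum(scores) / len(scores)
    return results


# Two toy agents (illustrative only) with different internals
# but the same run() signature.
class EchoAgent:
    def run(self, prompt: str) -> str:
        return prompt


class UpperAgent:
    def run(self, prompt: str) -> str:
        return prompt.upper() + "!"


tasks = [Task("t1", "say hello"), Task("t2", "say goodbye")]
rubric = [
    Criterion("non_empty", 1.0, lambda out: len(out) > 0),
    Criterion("ends_with_bang", 1.0, lambda out: out.endswith("!")),
]
report = evaluate({"echo": EchoAgent(), "upper": UpperAgent()}, tasks, rubric)
print(report)  # {'echo': 0.5, 'upper': 1.0}
```

Because the tasks and rubric are plain data and the checks are deterministic, rerunning the sketch gives the same report. That is the sense of "reproducible" assumed here; evaluating real LLM agents would also need fixed seeds, pinned model versions, and logged outputs.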

Outcome

The work formed part of my master's internship and was later developed into a first-author paper accepted at the NeurIPS LLM Evaluation Workshop.

Media

Poster · Pipeline diagram · Paper title screenshot · Internship report cover · Research timeline