Chen Wu
4 indexed papers
Research Timeline
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of Large Language Models (LLMs).
The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural scaffolding and error-survival confounds.
SkillSmith is a synergy-aware framework that co-evolves skills and tools, significantly improving self-improving agent systems by modeling skill-tool interactions and diagnosing failures.
LongAttnComp is a two-stage fine-tuning framework for context compression that significantly improves long-context reasoning performance, matching or exceeding full-context accuracy on demanding tasks such as code debugging.
Papers
SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
Yangbo Wei, Zhen Huang, Shaoqiang Lu, Junhong Qian +3 more