I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by the initial results. The experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the steps but also writes out each action as executable Python code. On a subset¹ of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5 percentage points, from 53.1% to 65.6% correct, and other models by an even larger margin.
The experiment followed the model usage guidelines from the DeepSeek-R1 paper.
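For concreteness, here is a minimal sketch of what such a single-agent, code-action setup can look like. It assumes the Hugging Face smolagents library and an OpenAI-compatible DeepSeek-R1 endpoint; the endpoint URL, model id, tool choice, example task, and sampling temperature below are illustrative assumptions, not the exact configuration used in the experiment.

```python
# Minimal sketch of a single-agent, code-action setup (assumptions: the
# smolagents library, an OpenAI-compatible DeepSeek-R1 endpoint, and a
# temperature in the range recommended by the DeepSeek-R1 model card).
from smolagents import CodeAgent, DuckDuckGoSearchTool, OpenAIServerModel

# Model client pointed at a DeepSeek-R1 endpoint (URL and key are placeholders).
model = OpenAIServerModel(
    model_id="deepseek-reasoner",
    api_base="https://api.deepseek.com/v1",
    api_key="YOUR_API_KEY",
    temperature=0.6,  # assumed from the model card's recommended range
)

# The CodeAgent asks the model to express each action as a Python snippet,
# executes that snippet, and feeds the observation back into the next step.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # a simple web-search tool for GAIA-style tasks
    model=model,
)

print(agent.run("What is the population of the capital of Australia, rounded to the nearest hundred thousand?"))
```

Because DeepSeek-R1 does not support tool calling natively, the code-action format sidesteps that limitation: the model's "tool calls" are ordinary Python function calls inside the generated snippet, which the agent's executor runs and whose output becomes the next observation.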
1
Exploring DeepSeek R1's Agentic Capabilities Through Code Actions