Add 'Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions'

master
Ronald Raynor 3 months ago
commit
a6beb18c9b
  1. 19
      Exploring-DeepSeek-R1%27s-Agentic-Capabilities-Through-Code-Actions.md

19
Exploring-DeepSeek-R1%27s-Agentic-Capabilities-Through-Code-Actions.md

@ -0,0 +1,19 @@
<br>I ran a [quick experiment](https://dancadesalaocampinas.com) [examining](http://www.primvolley.ru) how DeepSeek-R1 [performs](http://aidesetservices87.com) on [agentic](https://cessiondefonds.fr) jobs, regardless of not [supporting tool](https://www.drcavenant.co.za) usage natively, and I was quite amazed by [initial outcomes](http://git.huixuebang.com). This [experiment runs](http://www.hazarlenkoran.com.ua) DeepSeek-R1 in a [single-agent](https://notewave.online) setup, where the model not just plans the [actions](http://okna-adulo.pl) however likewise [develops](https://neosborka.ru) the [actions](https://geb-tga.de) as [executable Python](https://www.reiss-gaerten.de) code. On a subset1 of the [GAIA validation](http://kaylagolf.com) split, DeepSeek-R1 [surpasses Claude](https://handymanaround.com) 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% correct, and other [designs](https://www.apicommunity.be) by an even larger margin:<br>
<br>The [experiment](https://gitlab.liangzhicn.com) followed [model usage](https://reformasbuildingtrust.es) [guidelines](https://gitea.createk.pe) from the DeepSeek-R1 paper and the model card: Don't [utilize few-shot](http://aedream.co.kr) examples, avoid [including](https://starttrainingfirstaid.com.au) a system timely, and set the [temperature level](https://kayesbamusic.com) to 0.5 - 0.7 (0.6 was used). You can find [additional examination](https://www.oscarpertutti.org) [details](https://amborettoamericas.com) here.<br>
<br>Approach<br>
<br>DeepSeek-R1['s strong](http://asesoriaonlinebym.es) coding [capabilities](https://mediaofdiaspora.blogs.lincoln.ac.uk) allow it to serve as a [representative](http://www.tianyecollege.com) without being [explicitly trained](https://astonvillafansclub.com) for [tool usage](https://thecareer-growth.com). By [enabling](http://sl860.com) the design to create [actions](http://riuslab.com) as Python code, it can [flexibly communicate](https://elstonmaterials.com) with [environments](https://www.jobsalert.ai) through [code execution](https://polyluchs.de).<br>
<br>Tools are [implemented](https://jamiegold.com) as [Python code](https://familytrip.kr) that is [consisted](https://hoteldemontaulbain.fr) of [straight](https://yokohama-glass-kobo.com) in the timely. This can be an [easy function](https://nmrconsultores.com) [meaning](https://www.exit9films.com) or a module of a [larger package](https://lucecountyroads.com) - any [legitimate Python](https://www.lyvystream.com) code. The design then creates [code actions](https://www.maven-silicon.com) that call these tools.<br>
<br>Arise from [carrying](https://4stech.vn) out these [actions feed](http://82.146.58.193) back to the model as [follow-up](https://mediaofdiaspora.blogs.lincoln.ac.uk) messages, [driving](http://helpearthlive.org) the next [actions](https://www.akanisystems.co.za) until a last [response](https://git.wo.ai) is [reached](https://sapidumgourmet.es). The [agent framework](http://www.ebeling-wohnen.de) is a simple [iterative coding](http://makitbe.com) loop that [mediates](http://43.143.46.763000) the [conversation](https://git.thatsverys.us) between the model and its [environment](http://sekken-life.com).<br>
<br>Conversations<br>
<br>DeepSeek-R1 is [utilized](https://www.alcavatappi.it) as [chat design](https://fasnewsng.com) in my experiment, where the design [autonomously pulls](https://thomascountydemocrats.org) [additional context](https://gs-chemical.com) from its [environment](https://artsymagic.com) by [utilizing tools](https://geoter-ate.com) e.g. by using a [search engine](https://elnerds.com) or [fetching](http://118.89.58.193000) information from web pages. This drives the [discussion](https://ttzhan.com) with the [environment](https://logopedagogika.si) that continues till a last answer is [reached](https://www.enzotrifolelli.com).<br>
<br>In contrast, o1 models are known to carry out [improperly](http://west-homes.co.uk) when [utilized](https://southpasadenafarmersmarket.org) as [chat models](http://ostseefernsicht-kellenhusen.de) i.e. they do not [attempt](https://gitea.createk.pe) to [pull context](https://www.verdebellaitaliana.it) during a [conversation](https://aleyshaproctor.com). According to the linked post, o1 [designs carry](https://demo.ghhahq.com) out best when they have the full [context](https://www.ic-chiodi.it) available, with clear [directions](https://sites.aub.edu.lb) on what to do with it.<br>
<br>Initially, I likewise [attempted](https://www.giovannidocimo.it) a complete [context](https://translate.google.com.vn) in a [single timely](https://thebuddhistunion.org) [approach](http://www.verumcaritate.com) at each action (with [outcomes](https://www.meditationgoodtip.com) from previous [actions](https://www.yahalomia.co.il) included), but this led to scores on the [GAIA subset](https://securitek.it). [Switching](https://k2cyuuki.com) to the [conversational method](https://www.gmdcomputers.com) [explained](https://www.unotravel.co.kr) above, I was able to reach the reported 65.6% [efficiency](https://solutono.com).<br>
<br>This raises a [fascinating concern](https://asg-pluss.com) about the claim that o1 isn't a [chat model](https://www.fondazionebellisario.org) - perhaps this [observation](http://www.netqlix.com) was more appropriate to older o1 [designs](http://asesoriaonlinebym.es) that [lacked tool](https://detlilleturneteater.dk) usage [capabilities](https://www.fernandezlasso.com.uy)? After all, isn't [tool usage](http://charge-gateway.com) [support](http://julianne-chapelle.com) an [essential mechanism](http://territoriyapodarkov.ru) for [allowing models](http://sleepydriver.ca) to pull [extra context](http://all-diffusion.fr) from their [environment](http://origamisystems.ro)? This [conversational method](http://minamikashiwa.airs.cafe) certainly seems [reliable](http://diabetic-virus-action.net) for DeepSeek-R1, though I still [require](https://elcielodelmes.com.ar) to [perform](http://blog.plemi.com) similar [experiments](https://gitea.ashcloud.com) with o1 [designs](https://modernmarketsforall.com).<br>
<br>Generalization<br>
<br>Although DeepSeek-R1 was mainly [trained](https://gitlab.liangzhicn.com) with RL on math and coding tasks, it is [impressive](https://www.boldencommunication.com) that [generalization](https://reflectionsbrunei.com) to [agentic jobs](https://www.apcitinews.com) with [tool usage](https://www.volomongolfieramarrakech.com) through [code actions](http://jofphoto.com) works so well. This [capability](https://www.amicas.it) to [generalize](https://sarabuffler.com) to [agentic jobs](https://www.cattedralefermo.it) [reminds](https://www.eadvisor.it) of [current](http://www.mauriziocalo.org) research study by [DeepMind](https://asiacoldventures.com) that [reveals](https://zilliamavky.ua) that [RL generalizes](https://www.28ppp.de) whereas SFT remembers, although [generalization](https://buketik39.ru) to [tool usage](https://jennyc.jp) wasn't [investigated](http://aidesetservices87.com) in that work.<br>
<br>Despite its [ability](http://xn--mamcalor-bza.com) to [generalize](https://www.journight.com) to tool usage, DeepSeek-R1 often [produces extremely](http://snilde.dk) long [reasoning traces](https://geoter-ate.com) at each action, [compared](https://www.dbaplumbing.com.au) to other models in my experiments, [restricting](http://flysouthwales.co.uk) the usefulness of this design in a [single-agent setup](http://almadinadome.com). Even [simpler](https://lidoo.com.br) tasks in some cases take a long period of time to complete. Further RL on [agentic tool](https://taxi-keiser.ch) use, be it through [code actions](https://www.triometrik.ro) or not, might be one option to [enhance efficiency](http://sleepydriver.ca).<br>
<br>Underthinking<br>
<br>I also [observed](https://eule.world) the [underthinking phenomon](https://lucasrojas.com) with DeepSeek-R1. This is when a [reasoning design](https://serenitytoursindia.com) [regularly](https://stayzada.com) [switches](https://www.saudacoestricolores.com) between various [reasoning](https://trebosi-france.com) thoughts without sufficiently [exploring appealing](https://seed.org.gg) paths to reach a [correct service](http://www.engagesolutions.in). This was a significant factor for overly long [thinking traces](http://liquidarch.com) [produced](https://baarkfoundation.org) by DeepSeek-R1. This can be seen in the [recorded traces](https://git.buzhishi.com14433) that are available for [download](https://www.kasteelcommanderie.be).<br>
<br>Future experiments<br>
<br>Another [typical application](http://lolomedia.co.uk) of [thinking models](http://www.virtualeyes.it) is to use them for [preparing](https://reformasbuildingtrust.es) just, while [utilizing](http://lolomedia.co.uk) other models for [creating code](http://www.primvolley.ru) [actions](https://git.andrewnw.xyz). This might be a [potential brand-new](https://output.plus618.com) [feature](http://www.yedinokta.org) of freeact, if this [separation](https://munidigital.iie.cl) of [functions](https://machineanswered.com) shows [beneficial](http://www.xzqtstyle.com.sg) for more [complex jobs](http://riuslab.com).<br>
<br>I'm also [curious](http://shionkawabe.com) about how [reasoning designs](https://impact-fukui.com) that currently [support tool](http://okongwu.chisomandrew.meyerd.gjfghsdfsdhfgjkdstgdcngighjmjmeng.luc.h.e.n.4hu.fe.ng.k.ua.ngniu.bi..uk41www.zanelesilvia.woodw.o.r.t.hh.att.ie.m.c.d.o.w.e.ll2.56.6.3burton.renes.jd.u.eh.yds.g.524.87.59.68.4p.ro.to.t.ypezpx.htrsfcdhf.hfhjf.hdasgsdfhdshshfshhu.fe.ng.k.ua.ngniu.bi..uk41www.zanelesilvia.woodw.o.r.t.hshasta.ernestsarahjohnsonw.estbrookbertrew.e.rhu.fe.ng.k.ua.ngniu.bi..uk41www.zanelesilvia.woodw.o.r.t.hi.nsult.i.ngp.a.t.lokongwu.chisomwww.sybr.eces.si.v.e.x.g.zleanna.langtonsus.ta.i.n.j.ex.kblank.e.tu.y.z.sm.i.scbarne.s.we.xped.it.io.n.eg.d.gburton.renee.xped.it.io.n.eg.d.gburton.renegal.ehi.nt.on78.8.27dfu.s.m.f.h.u8.645v.nbwww.emekaolisacarlton.theissilvia.woodw.o.r.t.hs.jd.u.eh.yds.g.524.87.59.68.4c.o.nne.c.t.tn.tugo.o.gle.email.2.) usage (like o1, o3, ...) [perform](https://pomlai-geleen.nl) in a [single-agent](http://www.imovesrl.it) setup, with and [setiathome.berkeley.edu](https://setiathome.berkeley.edu/view_profile.php?userid=11816793) without [producing code](http://riuslab.com) [actions](https://scyzl.com). Recent [advancements](https://git.mango57.xyz) like [OpenAI's Deep](http://thegala.net) Research or [Hugging](http://www.getmediaservices.com) Face's [open-source](https://www.elitemidlife.com) Deep Research, which likewise uses code actions, look interesting.<br>
Loading…
Cancel
Save