NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weakness in vision-language models (VLMs). These models still struggle to judge where objects are, how they relate, and how they move in 3D.
SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues the interface is the bottleneck. Their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy. It outperforms the recent spatial agent SpaceTools by 11.2 points.
What is SpatialClaw
SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is pre-loaded with input frames and a set of primitives. Perception tools are plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python variables.
The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image into the agent’s next context. vlm dispatches queries to a separate VLM session. ReturnAnswer() submits the final answer.
Two perception tools are central. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework adds lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
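To make those reconstruction outputs concrete, the sketch below shows how a depth map, pinhole intrinsics, and a camera pose combine into a dense point map. It is a generic NumPy back-projection under stated assumptions (depth measured along the optical axis, a 4x4 camera-to-world pose). It is not the tools.Reconstruct implementation, and the name depth_to_points is hypothetical.

```python
import numpy as np

def depth_to_points(depth, K, cam_to_world):
    """Back-project an (H, W) z-depth map into an (H, W, 3) world-space point map.

    Assumptions (not taken from SpatialClaw): pinhole intrinsics K (3x3),
    depth measured along the camera z axis, and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v runs along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Scale normalized image coordinates by depth to get camera-space x and y.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1)  # (H, W, 3)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    # Rotate into the world frame, then translate by the camera position.
    return pts_cam @ R.T + t

# Toy inputs: a flat wall 2 units away, camera shifted 1 unit along world x.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]

points = depth_to_points(depth, K, pose)
print(points.shape)
print(points[1, 2])  # the pixel at the principal point
```

Once every pixel has a 3D position, a segmentation mask can index the point map directly, which is what lets masks and geometry compose as ordinary arrays.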
It is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.

Why the Action Interface Matters
The research team studied three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.
- Single-pass code writes one complete program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates straight to the answer.
- Structured tool-call invokes named tools through a fixed JSON schema. It cannot freely combine outputs with NumPy or SciPy to express test-time computations. The closest-point operation has no pre-registered tool, so the result is wrong.
- SpatialClaw composes tools in code, inspects results, then revises. It first computes a centroid distance, then notices that the centroid is computed as a median, which is not the closest point. The agent switches to scipy.spatial.KDTree to find the true closest point. It submits 0.9439 m against a 0.9 m ground truth.
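The gap between a centroid distance and a closest-point distance can be reproduced on synthetic data. The sketch below uses two flat grids of points as stand-ins for the heater and the door. The helper names and the toy geometry are assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy.spatial import KDTree

def centroid_distance(a, b):
    """Distance between the per-axis median centroids of two (N, 3) point sets."""
    return float(np.linalg.norm(np.median(a, axis=0) - np.median(b, axis=0)))

def closest_distance(a, b):
    """Smallest distance from any point in a to its nearest neighbour in b."""
    dists, _ = KDTree(b).query(a, k=1)
    return float(dists.min())

def slab(x0, x1, n=11):
    """Hypothetical stand-in for a masked object: an n x n grid on the z = 0 plane."""
    xs, ys = np.meshgrid(np.linspace(x0, x1, n), np.linspace(0.0, 1.0, n))
    return np.stack([xs.ravel(), ys.ravel(), np.zeros(n * n)], axis=1)

heater = slab(0.0, 1.0)  # spans x in [0, 1]
door = slab(1.5, 3.5)    # spans x in [1.5, 3.5]

d_centroid = centroid_distance(heater, door)
d_closest = closest_distance(heater, door)
print(f"centroid: {d_centroid:.3f}  closest: {d_closest:.3f}")
```

In this toy scene the centroids sit about 2.0 units apart while the nearest edges are only about 0.5 apart, so a centroid-only tool overstates the gap by a factor of four.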
Benchmark
SpatialClaw was tested on 20 benchmarks across five categories: single-image; multi-view; general; video and 4D; and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.
A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.
| Action interface | Avg. (20 bench.) | Δ vs no-tool |
|---|---|---|
| No-tool baseline | 53.4 | – |
| Single-pass code | 55.2 | +1.8 |
| Structured tool-call | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
Gemma4-31B backbone, 20-benchmark average.
Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.
| Method | Interface | Avg. | Δ vs SpatialClaw |
|---|---|---|---|
| VADAR | Single-pass | 40.5* | −19.4 |
| pySpatial | Single-pass | 47.8 | −12.1 |
| SpaceTools-Toolshed | Structured tool-call | 48.7 | −11.2 |
| SpatialClaw | Code as action | 59.9 | best |
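The delta columns in both tables follow directly from the averages. The short check below recomputes them, using only numbers copied from the tables above (the asterisk on the VADAR figure is dropped for the arithmetic).

```python
# Averages copied from the tables above (Gemma4-31B, 20-benchmark average).
no_tool = 53.4
interfaces = {
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw (code as action)": 59.9,
}
# Gain of each action interface over the no-tool baseline.
gains = {name: round(avg - no_tool, 1) for name, avg in interfaces.items()}

spatialclaw = 59.9
prior_agents = {"VADAR": 40.5, "pySpatial": 47.8, "SpaceTools-Toolshed": 48.7}
# Gap of each prior agent relative to SpatialClaw (negative means behind).
gaps = {name: round(avg - spatialclaw, 1) for name, avg in prior_agents.items()}

for name, delta in {**gains, **gaps}.items():
    print(f"{name}: {delta:+.1f}")
```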
The largest gains land on dynamic and multi-view tasks. On Gemma4-31B, DSI-Bench rose by 17.6 points and MindCube rose by 15.3 points. These categories need chained geometric computation across frames and viewpoints.
An LLM-as-judge attribution explains the wins over structured tool-call. Code composition accounts for 52.2% of them. Control flow accounts for 19.5%, and the remaining 28.3% are interface-neutral.
Inside the Five-Stage Loop
Each sample runs a five-stage loop: planning, code generation, code execution, feedback assembly, and answer submission. A planner drafts a strategy without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps pass.
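The execution half of that loop can be sketched with the standard library alone. The version below is a minimal illustration, assuming a scripted stand-in for the agent, an in-process exec namespace in place of the Jupyter kernel, and a made-up blocklist. The names check_cell, run_loop, BLOCKED_MODULES, and BLOCKED_CALLS are hypothetical and not taken from the SpatialClaw repo.

```python
import ast

MAX_STEPS = 30  # step cap stated in the article
BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}  # illustrative only
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}      # illustrative only

class AnswerSubmitted(Exception):
    """Raised by ReturnAnswer to end the loop with a final value."""
    def __init__(self, value):
        super().__init__(value)
        self.value = value

def return_answer(value):
    raise AnswerSubmitted(value)

def check_cell(source):
    """Static AST check: return a rejection reason, or None if the cell passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return f"syntax error: {err.msg}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            root = module.split(".")[0]
            if root in BLOCKED_MODULES:
                return f"blocked import: {root}"
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            return f"blocked call: {node.func.id}"
    return None

def run_loop(next_cell, max_steps=MAX_STEPS):
    """Run cells in one persistent namespace until an answer or the step cap."""
    namespace = {"ReturnAnswer": return_answer}  # variables persist across cells
    feedback, log = "start", []
    for step in range(1, max_steps + 1):
        source = next_cell(step, feedback)  # stand-in for code generation
        reason = check_cell(source)
        if reason is not None:
            feedback = f"Rejected: {reason}"
        else:
            try:
                exec(compile(source, f"<cell {step}>", "exec"), namespace)
                feedback = "Status: Success"
            except AnswerSubmitted as done:
                return done.value, log
            except Exception as err:
                feedback = f"Error: {type(err).__name__}: {err}"
        log.append(feedback)  # feedback is what the agent would see next
    return None, log

# Scripted "agent": one rejected cell, two state-building cells, then the answer.
cells = ["import os", "x = 3", "y = x * 2", "ReturnAnswer(y + 1)"]
answer, log = run_loop(lambda step, feedback: cells[step - 1])
print(answer, log)
```

The point of the sketch is the shared namespace: y in the third cell can read x from the second, which is the property that separates this loop from single-pass execution.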
The official repo runs on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served through vLLM. Perception runs behind a FastAPI GPU service. A single quickstart runs one benchmark on one machine:
git clone --recursive https://github.com/NVlabs/SpatialClaw.git
cd SpatialClaw
bash spatial_agent/scripts/setup.sh
cp .env.example .env # add API keys, or self-host vLLM
python -m spatial_agent.entrypoints.run \
  --dataset spatial_agent/config/dataset/erqa.json \
  --model spatial_agent/config/model/gemini-3-pro.json \
  --concurrency 4
A representative agent cell composes perception with geometry, then revises:
import scipy.spatial  # harmless if the kernel already pre-imports SciPy
# Reconstruct the scene, then segment both objects in one video pass
recon = tools.Reconstruct.Reconstruct(InputImages)
seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
show(seg.visualize(1)) # inspect the masks first
# Closest-point distance via KD-tree, not centroids
pts_h = seg.get_masked_points(recon, frame=1, object=0) # object 0 = heater
pts_d = seg.get_masked_points(recon, frame=2, object=1) # object 1 = door
dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)
ReturnAnswer(float(dists.min()))
The agent picks primitives from the question itself. Distance questions invoke KD-tree search and vector norms. Direction questions rely on dot products. No category-specific routing was applied.
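As a rough illustration of those two primitive families, the sketch below answers a direction question with dot products and a distance question with a vector norm. It assumes unit-length camera axes expressed in world coordinates. The function name and the example scene are hypothetical.

```python
import numpy as np

def relative_direction(cam_pos, cam_forward, cam_right, target):
    """Classify where target lies relative to the camera using dot products.

    Assumes cam_forward and cam_right are unit vectors in world coordinates.
    """
    offset = np.asarray(target, dtype=float) - np.asarray(cam_pos, dtype=float)
    ahead = float(np.dot(offset, cam_forward))  # > 0 means in front of the camera
    side = float(np.dot(offset, cam_right))     # > 0 means to the camera's right
    depth_label = "front" if ahead >= 0 else "behind"
    side_label = "right" if side >= 0 else "left"
    return depth_label, side_label

# Hypothetical scene: camera at the origin looking along +z, with +x to its right.
cam_pos = np.zeros(3)
forward = np.array([0.0, 0.0, 1.0])
right = np.array([1.0, 0.0, 0.0])
door = np.array([-2.0, 0.0, 3.0])

labels = relative_direction(cam_pos, forward, right, door)
distance = float(np.linalg.norm(door - cam_pos))  # distance question: vector norm
print(labels, round(distance, 3))
```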
Use Cases
The design fits problems that need step-by-step geometric reasoning. Concrete examples include:
- Robotics and embodied agents that measure metric distances between objects before acting.
- Multi-view inspection, where an object’s facing direction is recovered from several camera angles.
- Video and 4D analysis that tracks object or camera motion across frames.
- Indoor scene question answering, such as “where is the door relative to the sink?”
Because it is training-free, teams can extend a deployed VLM without new data or fine-tuning.
Interactive Explainer
The walkthrough follows the heater-and-door example from Figure 2 of the SpatialClaw paper; the interface logic is illustrative. Each interface produces a different trace:
- Single-pass code: one complete program is committed before any mask, depth map, or error is seen. It submits 1.638, which is wrong because the validity of the mask was never checked.
- Structured tool-call: the agent calls Reconstruct, SAM3 for the heater, SAM3 for the door, and a predefined compute_dist tool, each binding one named result. It submits the centroid distance of 6.5, which is wrong because the schema has no tool for deriving the closest point.
- SpatialClaw: the agent segments both objects, reconstructs the scene, and calls show() to inspect the mask overlay. It computes a centroid distance of 1.4807, notices that tools.get_centroid uses the median, switches to scipy.spatial.KDTree, and submits the minimum distance of 0.9439 m, which is correct.
Key Takeaways
- Code as the action interface: SpatialClaw lets a VLM write one Python cell per step into a persistent kernel, composing and revising perception outputs instead of committing to a fixed plan.
- State of the art, training-free: 59.9% average across 20 spatial benchmarks, +11.2 points over the prior agent SpaceTools, with no benchmark- or model-specific tuning.
- The interface is the lever: swapping only the action interface on Gemma4-31B moves accuracy from 56.7 (structured tool-call) to 59.9, and 52.2% of wins trace to code composition.
- Biggest gains where geometry chains: dynamic 4D and multi-view tasks lead the lifts (DSI-Bench +17.6, MindCube +15.3), where steps must compose across frames and viewpoints.
- Perception is the ceiling: gains transfer across six backbones (26B–397B), but the remaining bottleneck is perception quality. The release is also licensed for non-commercial use only.
The post NVIDIA AI Introduce SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning appeared first on MarkTechPost.