The 3D understanding stack is being rewritten from geometry to physics to language

Getting 3D structure from 2D images is the problem that keeps giving. We have NeRFs, Gaussian splats, pointmaps, MVS pipelines, monocular depth networks, and a dozen other approaches, each solving part of the problem while leaving gaps elsewhere. Three papers released this week attack three different gaps in the stack: the geometry representation itself, the physics layer on top of that geometry, and the integration of depth perception into vision-language models.

What connects them is a shared dissatisfaction with the status quo. Explicit point clouds are redundant and discontinuous. Visual reconstructions without physics can't drive simulation. VLMs that don't understand geometry can't reason about space. Each paper proposes a fix, and the fixes point in the same general direction: toward continuous, physically grounded, foundation-model-native 3D understanding.

Explicit pointmaps hit a ceiling

The dominant approach in visual geometry foundation models (think MASt3R, DUSt3R) is to regress pixel-aligned pointmaps: for every pixel in an image, predict a 3D point. This works well enough for sparse matching and rough reconstruction, but it has structural problems. Pointmaps are explicit, meaning you predict a fixed set of 3D points tied to input pixels. This creates redundancy when multiple views observe the same surface, and worse, it provides no geometric continuity between those predictions. You get a cloud of points, not a surface.

IVGT (Implicit Visual Geometry Transformer) takes a different route. Instead of predicting explicit 3D coordinates per pixel, IVGT learns a continuous neural scene representation in a canonical coordinate system. You can query any 3D position in space, and the model retrieves local features and passes them through lightweight decoders to predict a signed distance value and a color. This is the classic implicit representation idea (SDFs, as in NeuS or VolSDF), but applied at foundation-model scale with multi-dataset joint training.

The key architectural move: IVGT processes unposed multi-view images through a transformer backbone that builds a canonical scene representation. Spatial queries at arbitrary 3D positions retrieve features from this representation via interpolation or attention, and the decoders map those features to SDF values and RGB colors. Because the representation is continuous, you can extract meshes at arbitrary resolution using marching cubes, render depth and normal maps from any viewpoint, and estimate camera poses, all from a single trained model.
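To make the "query anywhere" property concrete, here is a minimal sketch. It is not IVGT's implementation: an analytic sphere SDF stands in for the learned representation and decoder, and depth along a camera ray is found by sphere tracing, which is one common way to render an SDF and only an assumption about how one might render it here. The names `sdf_sphere` and `sphere_trace` are hypothetical.

```python
import math

def sdf_sphere(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside.
    Stand-in for a learned SDF decoder (assumption, not IVGT's network)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-6, t_max=100.0):
    """March along a unit-length ray, stepping by the SDF value each time.
    Returns the distance to the first surface hit, or None on a miss."""
    t = 0.0
    for _ in range(max_steps):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(point)
        if dist < eps:
            return t          # close enough to the surface: this is the ray depth
        t += dist             # safe step: no surface is nearer than dist
        if t > t_max:
            return None       # ray left the scene bounds
    return None

# A ray straight down the optical axis hits the sphere; a sideways ray misses.
depth = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sdf_sphere)
miss = sphere_trace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), sdf_sphere)
```

Nothing in the tracer depends on an input pixel grid. Any ray from any viewpoint can be queried, which is the property that per-pixel pointmaps lack.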

Training uses 2D supervision (rendered RGB, depth, normals) combined with 3D geometric regularization, across multiple datasets. The multi-dataset aspect matters: IVGT generalizes across scenes rather than requiring per-scene optimization, as traditional NeRF variants do.

On tasks including mesh and point cloud reconstruction, novel view synthesis, depth and normal estimation, and pose estimation, IVGT demonstrates strong generalization. The continuity of the SDF representation means extracted surfaces are coherent rather than noisy point clouds stitched from per-pixel predictions. For anyone who has tried to clean up MASt3R output into a watertight mesh, the appeal is obvious.

Physics turns reconstruction into simulation

Reconstruction gives you geometry and appearance. Simulation requires more. You need to know how things move, deform, and interact. This gap is especially acute in robot-assisted minimally invasive surgery, where surgeons train on simulators that need to model how tissue deforms when prodded, cut, or retracted.

EndoGSim addresses this by building a physics-aware layer on top of 4D Gaussian Splatting. The pipeline starts with 4DGS to represent deformable tissues and surgical tools, using pre-trained segmentation and depth estimation to decompose the scene. But 4DGS alone gives you visual reconstruction: it can render what the scene looks like over time, but it cannot tell you what happens if you push on a tissue or apply a force to an instrument.

The missing piece is material properties. Different tissues have different stiffness, elasticity, and density. EndoGSim introduces an object-wise material field that assigns physical parameters to each segmented object in the scene. The clever part is how these parameters get initialized: a Multi-modal Large Language Model (MLLM) infers initial material properties from visual and semantic information. Soft tissue gets different parameters than rigid tools. The MLLM provides a reasonable starting point, and then a differentiable Material Point Method (MPM) refines those parameters through physics simulation.

MPM is a continuum mechanics simulation method well-suited for deformable bodies. By making it differentiable, EndoGSim can backpropagate from simulation errors (measured as differences between rendered simulated states and observed states) to update material parameters. The joint supervision comes from rendered images and optical flow, giving both appearance and motion constraints.
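The idea of backpropagating through a simulator to recover a material parameter can be shown on a toy problem. The sketch below is not MPM and not EndoGSim's pipeline: it assumes a single 1D unit-mass spring whose stiffness `k` stands in for a material parameter, it compares simulated positions directly to observed positions rather than rendered images and optical flow, and it carries the derivative of the state with respect to `k` through every step by hand (forward-mode differentiation). All names and constants are hypothetical.

```python
def simulate(k, steps=100, dt=0.01, x0=1.0, v0=0.0):
    """Semi-implicit Euler for a unit-mass spring, x'' = -k * x.
    Also propagates dx/dk and dv/dk so the rollout is differentiable in k."""
    x, v = x0, v0
    dx_dk, dv_dk = 0.0, 0.0
    xs, dxs = [], []
    for _ in range(steps):
        # Derivative of v_new = v - k * x * dt with respect to k (uses the old x).
        dv_dk = dv_dk - (x + k * dx_dk) * dt
        v = v - k * x * dt
        # Derivative of x_new = x + v_new * dt with respect to k.
        dx_dk = dx_dk + dv_dk * dt
        x = x + v * dt
        xs.append(x)
        dxs.append(dx_dk)
    return xs, dxs

def loss_and_grad(k, observed):
    """Mean squared position error and its exact gradient with respect to k."""
    xs, dxs = simulate(k)
    n = len(observed)
    loss = sum((x - o) ** 2 for x, o in zip(xs, observed)) / n
    grad = sum(2.0 * (x - o) * dx for x, o, dx in zip(xs, observed, dxs)) / n
    return loss, grad

# "Observed" motion generated with a ground-truth stiffness of 4.0.
observed, _ = simulate(4.0)

# Start from a rough initial guess (the role the MLLM plays in the text)
# and refine it by gradient descent through the simulator.
k_est = 1.0
for _ in range(400):
    _, grad = loss_and_grad(k_est, observed)
    k_est -= 3.0 * grad
```

The initial guess only needs to sit in the same basin as the true value. Over this short horizon the loss has a single minimum for plausible stiffness values, so gradient descent walks `k_est` toward the ground truth.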

The result is a framework that takes endoscopic video as input and produces not just a visual reconstruction but a physically grounded simulation. You can interact with the reconstructed scene: apply forces, make incisions, retract tissue, and the simulation responds with physically plausible deformations. Validated on open-source and in-house surgical datasets, EndoGSim shows superior simulation fidelity and physical accuracy compared to prior methods that treat reconstruction and simulation as separate problems.

This is a pattern I expect to see more of: using language models to bootstrap physical understanding from visual and semantic cues, then refining with differentiable simulation. The MLLM does not need to be precise about material parameters. It just needs to be in the right ballpark, and the differentiable physics solver takes care of the rest.

VLMs should understand depth natively

Vision-Language Models are good at 2D tasks: grounding, captioning, visual question answering. They are bad at 3D. The reason is straightforward: text-only supervision under-constrains fine-grained visual perception. When you train a model to describe "a cat sitting on a table," the loss function does not care whether the model thinks the table is 0.75 meters away or 7.5 meters away. The language signal is too coarse to recover dense geometry.

Prior approaches to fixing this fall into two camps. The first distills geometry from external vision models (like Depth Anything or DPT) into the VLM, which introduces error accumulation from the teacher model. The second enables direct depth prediction but uses inefficient per-pixel queries or produces coarse token-level outputs that lack spatial resolution.

DepthVLM proposes a simpler solution: attach a lightweight depth head directly to the LLM backbone and train it with a unified vision-text supervision paradigm. The depth head takes features from the LLM's internal representations and decodes them into full-resolution metric depth maps. This happens in a single forward pass alongside language generation, so you get both text outputs and dense depth without running a separate model.

Training uses a two-stage schedule. The stage details matter for implementation, but the principle is simple: establish multimodal alignment first, then add depth supervision, so that the depth losses do not destabilize the language capabilities. This is a common pattern when augmenting foundation models with new output modalities.

DepthVLM also introduces a unified indoor-outdoor metric depth benchmark in a VLM-compatible format, which addresses a real evaluation gap. Existing depth benchmarks are designed for pure vision models and do not test whether depth prediction coexists with language understanding.

The results are striking. DepthVLM outperforms existing VLMs on depth tasks with higher inference efficiency (single forward pass vs. running a separate depth model), surpasses leading pure vision models on metric depth estimation, and improves complex 3D spatial reasoning. The spatial reasoning improvement is the most interesting part: when a VLM actually understands metric depth, it can answer questions like "which object is closer to the camera" or "how far apart are these two objects" with genuine geometric grounding rather than learned priors about typical object arrangements.
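To see why metric depth makes those two questions answerable by geometry, consider the sketch below. It is not DepthVLM's code: it assumes a pinhole camera, depth measured in meters along the optical axis, and made-up intrinsics and pixel coordinates. The function `unproject` and the object names are hypothetical.

```python
import math

def unproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric z-depth into camera-frame XYZ (meters).
    Assumes an ideal pinhole camera with no lens distortion."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

# Hypothetical object centers and their predicted metric depths.
cup = unproject(320, 240, 2.0, fx, fy, cx, cy)
chair = unproject(570, 240, 4.0, fx, fy, cx, cy)

# "Which object is closer to the camera?" compares depth along the optical axis.
closer = "cup" if cup[2] < chair[2] else "chair"

# "How far apart are these two objects?" is a Euclidean distance in meters.
separation_m = math.dist(cup, chair)
```

With only relative or ordinal depth, the first question is still answerable but the second is not, because the distance in meters depends on the absolute scale.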

Why implicit over explicit matters in practice

The shift from explicit to implicit geometry representation (IVGT) is not just an academic preference. It has practical consequences for downstream tasks.

Explicit pointmaps produce a fixed-resolution output. If you need finer detail, you need more input views or higher-resolution images. Implicit representations decouple query resolution from input resolution. You can extract a coarse mesh quickly or a fine mesh slowly from the same trained model, depending on your needs. For robotics applications where compute budgets vary (real-time obstacle avoidance vs. offline map building), this flexibility matters.

Geometric continuity is another practical win. Point clouds from explicit methods have gaps and noise. Any downstream task that needs surface normals (grasp planning, collision detection, lighting estimation) requires additional processing to smooth and complete the surface. IVGT's SDF representation gives you continuous normals by construction: the gradient of the SDF is the surface normal.
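The normal-from-gradient property is easy to check numerically. The sketch below uses an analytic unit-sphere SDF as a stand-in for a learned one and central finite differences in place of the automatic differentiation a real model would likely use; both substitutions are assumptions made for illustration, and `sdf_normal` is a hypothetical helper.

```python
import math

def sdf_sphere(p, radius=1.0):
    """Signed distance to a sphere centered at the origin."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def sdf_normal(sdf, p, h=1e-5):
    """Estimate the surface normal as the normalized SDF gradient,
    using central finite differences along each axis."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return tuple(g / norm for g in grad)

# On a unit sphere the outward normal at a surface point equals the point itself.
normal = sdf_normal(sdf_sphere, (0.0, 0.6, 0.8))
```

No neighborhood search or plane fitting is involved, which is the extra processing a raw point cloud would need before it could supply normals.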

The tradeoff is inference speed. Extracting a surface from an implicit representation requires evaluating the network at many 3D positions, which is slower than a single forward pass that produces a pointmap. The IVGT paper does not report timing comparisons, and this is a real gap. For applications that need 3D output at 30 Hz, explicit prediction may still win despite its geometric limitations.

The physics bottleneck in surgical and robotic simulation

EndoGSim's approach to material property inference highlights a general problem in simulation-from-reconstruction. You can reconstruct geometry and appearance from images. You cannot directly observe material properties from images. A video of tissue deforming gives you constraints on stiffness and elasticity, but recovering those properties is an inverse problem that requires simulation.

Making the simulation differentiable is the key technical enabler. Non-differentiable physics simulators are a dead end for gradient-based optimization: you can run the simulation forward and compare the result to observations, but you cannot backpropagate through the simulation to update material parameters. Differentiable MPM opens this path.

The MLLM initialization is more than a convenience. Inverse problems in physics simulation are typically ill-posed: many different material parameter combinations can produce similar deformation patterns. The MLLM provides a regularizing prior based on semantic understanding of the scene. It knows that liver tissue is soft, that surgical steel is rigid, that fat has different properties than muscle. This prior constrains the optimization to physically plausible regions of the parameter space.
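A toy example shows how different parameter combinations can produce the same motion. The sketch below assumes a 1D mass-spring system rather than MPM, and the prior value is a made-up stand-in for what a language model might suggest. Free oscillation depends only on the ratio of stiffness to mass, so the observed trajectory cannot separate the two.

```python
def simulate(k, m, steps=200, dt=0.01, x0=1.0, v0=0.0):
    """Semi-implicit Euler for a free mass-spring system, m * x'' = -k * x."""
    x, v = x0, v0
    xs = []
    for _ in range(steps):
        v -= (k / m) * x * dt   # only the ratio k / m enters the dynamics
        x += v * dt
        xs.append(x)
    return xs

# Two different materials: soft and light versus stiff and heavy.
soft_light = simulate(k=4.0, m=1.0)
stiff_heavy = simulate(k=8.0, m=2.0)
max_gap = max(abs(a - b) for a, b in zip(soft_light, stiff_heavy))

def pick_with_prior(candidates, prior_mass):
    """Among (k, m) pairs that fit the video equally well, prefer the one
    whose mass is closest to a semantic prior (hypothetical tie-breaker)."""
    return min(candidates, key=lambda km: abs(km[1] - prior_mass))

# Hypothetical prior: the object "looks like" something with a mass near 1.9.
chosen = pick_with_prior([(4.0, 1.0), (8.0, 2.0)], prior_mass=1.9)
```

Here `max_gap` is zero, so no amount of video of this motion could tell the two materials apart. The prior is what selects one of them.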

The limitation is that the MLLM's priors come from internet-scale training data, not from surgical physics textbooks. The initial material estimates are approximate, and the differentiable refinement can only do so much if the video evidence is ambiguous (short sequences, limited deformation). EndoGSim validates on datasets with sufficient motion, but real surgical footage often has limited tissue deformation, which could make material parameter identification underdetermined.

Depth inside the language model vs. depth beside it

DepthVLM's architectural choice deserves scrutiny. Putting the depth head on the LLM backbone rather than the vision encoder means depth prediction benefits from the multimodal reasoning layers. The LLM has already integrated visual and textual information, so the depth head can leverage semantic understanding (this is a wall, that is a person) to inform geometric prediction.

The alternative would be to predict depth from the vision encoder features, which is what most prior work does. The problem is that vision encoder features are trained for contrastive or reconstructive objectives that do not necessarily encode metric depth. By the time features reach the LLM backbone, they have been processed through attention layers that integrate spatial and semantic context, making them richer for depth prediction.

The single forward pass claim is important for deployment. Current pipelines that need both language understanding and depth estimation run two separate models: a VLM for text and a depth network for geometry. This adds a second model's inference cost and creates alignment problems between the two outputs. DepthVLM produces both from the same computation graph.

The two-stage training schedule is worth noting for anyone implementing this. Adding depth supervision from the start can destabilize language training, because depth losses have different scale and gradient characteristics than language modeling losses. Stage one establishes stable multimodal representations. Stage two adds depth with appropriate loss weighting. This is a practical detail that matters for reproducibility.
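One way to express such a schedule is a step-dependent weight on the depth loss. The sketch below is an assumption, not DepthVLM's published recipe: the step counts, the linear ramp, the maximum weight, and the function names are all hypothetical, and the paper may weight or stage its losses differently.

```python
def depth_loss_weight(step, stage_one_steps=1000, ramp_steps=500, max_weight=0.5):
    """Weight applied to the depth loss at a given training step.
    Stage one (alignment only): weight is zero.
    Stage two: weight ramps linearly up to max_weight, then stays there."""
    if step < stage_one_steps:
        return 0.0
    progress = (step - stage_one_steps) / ramp_steps
    return max_weight * min(1.0, progress)

def total_loss(lm_loss, depth_loss, step):
    """Combined objective: language modeling loss plus weighted depth loss."""
    return lm_loss + depth_loss_weight(step) * depth_loss

# Weights at a few points in training: stage one, mid-ramp, and after the ramp.
weights = [depth_loss_weight(s) for s in (0, 999, 1250, 5000)]
combined = total_loss(lm_loss=2.0, depth_loss=4.0, step=1250)
```

The ramp is a design choice added here so that the depth gradient does not arrive at full strength at the stage boundary. A hard switch to a fixed weight would also fit the description in the text.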

Where these directions converge

Read together, these three papers outline a trajectory for 3D vision.

IVGT shows that implicit continuous geometry can be learned at foundation model scale with cross-scene generalization, replacing the explicit pointmap paradigm. EndoGSim shows that visual reconstruction is not the endpoint: physics-aware simulation built on top of 4D representations is achievable when you combine language model priors with differentiable physics. DepthVLM shows that 3D understanding does not need to live in a separate model: it can be a native output of vision-language systems.

The convergence point is a model that takes images (and possibly language) as input and produces continuous geometry, material properties, and language understanding in a unified framework. You could query such a model for depth at any point, simulate physical interactions, and ask questions about the scene, all from the same representation. We are not there yet, but the pieces are taking shape.

The remaining gaps are real. IVGT needs speed benchmarks and comparison against the latest explicit methods on standard datasets. EndoGSim needs validation on more diverse surgical procedures and longer sequences with ambiguous deformation. DepthVLM needs testing on outdoor scenes at scale and on scenes with complex occlusion. But the direction is clear: 3D understanding is moving from specialized pipelines into general-purpose, continuous, physics-aware, language-grounded systems.
