Research Reading Notes

Oct 14, 2025

I really got into research this year and have spent the past 6-7 months reading a wide range of papers. This is a growing log of the top ones I thought were worth taking notes on, each with a few short takeaways. The domains covered include: how models learn, how robots generalize across modalities, and how humans acquire concepts.


Robotics + Generalizable Learning

Generalizable Robotic Insertion with World Models (Hansen et al. 2024)

  • A single model performs 90+ insertion tasks (see the planning sketch below).
  • Visual + proprioception for robust manipulation.
  • Strong zero-shot generalization to unseen objects.
  • Thought: What about deformables like cables?
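
To make the world-model idea concrete, here is a minimal sketch of planning in a learned latent space: encode the observation, roll sampled action sequences through a learned dynamics model, and execute the first action of the best-scoring sequence. This is my own toy version, not the paper's architecture; all module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Observation -> latent state (placeholder MLP; the paper uses richer encoders)."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):
    """(latent, action) -> predicted next latent."""
    def __init__(self, latent_dim=32, act_dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

class RewardHead(nn.Module):
    """Latent -> predicted task reward (e.g. progress toward insertion)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Linear(latent_dim, 1)
    def forward(self, z):
        return self.net(z).squeeze(-1)

@torch.no_grad()
def plan(obs, enc, dyn, rew, horizon=10, n_candidates=256, act_dim=6):
    """Random-shooting MPC in latent space: score sampled action sequences by the
    sum of predicted rewards and return the first action of the best sequence."""
    z = enc(obs).expand(n_candidates, -1)
    actions = torch.randn(n_candidates, horizon, act_dim)
    total = torch.zeros(n_candidates)
    for t in range(horizon):
        z = dyn(z, actions[:, t])
        total += rew(z)
    return actions[total.argmax(), 0]

# Untrained stand-ins, just to show the control flow; in practice these would be
# trained on logged insertion data.
action = plan(torch.zeros(1, 64), Encoder(), Dynamics(), RewardHead())
```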

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLMs (Ye et al. 2025)

  • Builds a unified vision–audio–text model that learns all modalities in a shared latent space, instead of stitching them together with adapters.
  • Introduces architectural pieces like OmniAlignNet and Temporal Embedding Grouping to strengthen cross-modal alignment (a generic alignment sketch follows below).
  • Achieves state-of-the-art omni-modal reasoning while using only 0.2T tokens, nearly 6× fewer than Qwen2.5-Omni.
  • Thought: Makes me wonder if robots could use this to interpret scenes through both tone and motion in real time.
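
I don't know the exact form of OmniAlignNet, so purely as a mental model, here is the generic symmetric contrastive (CLIP-style) objective that is typically used to pull paired modality embeddings into one shared space; the paper's alignment module is presumably more sophisticated than this.

```python
import torch
import torch.nn.functional as F

def symmetric_alignment_loss(vision_emb, audio_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired clips: matched
    vision/audio pairs (the diagonal) are pulled together in the shared space,
    mismatched pairs are pushed apart. Generic sketch, not OmniAlignNet itself."""
    v = F.normalize(vision_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: 8 paired clips already projected into a 512-d shared space.
loss = symmetric_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```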

AnySkin : Plug-and-Play Tactile Skin (Bhirangi et al. 2024)

  • Replaceable tactile “skin” easily attaches to different robots.
  • Low-cost sensorization → scalable touch feedback.
  • Enables rapid prototyping of contact-rich tasks.
  • Applied: Relevant to Vital tactile policy training.

DynaMo : In-domain dynamics pretraining (Cui et al. 2024)

  • Learns latent dynamics from small real datasets (pretraining sketch below).
  • Better than MAE-style pretraining for control.
  • Helps robots adapt actions based on state changes.
  • Thought: Perfect match for sample-limited robotics.
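
Below is a stripped-down sketch of the core idea as I understand it: pretrain the encoder so consecutive latents are predictable from one another, instead of reconstructing pixels as MAE does. If I remember right, the real DynaMo objective also includes an inverse-dynamics term and other safeguards against degenerate solutions; the names, dimensions, and stop-gradient choice here are my own simplifications.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 64))
forward_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))

def dynamics_pretrain_loss(frames):
    """frames: (B, T, 3, 64, 64), a short clip of consecutive observations.
    Encode every frame, then train a forward model to predict the next latent.
    The encoder is shaped by predictable dynamics rather than pixel reconstruction."""
    B, T = frames.shape[:2]
    z = encoder(frames.reshape(B * T, -1)).reshape(B, T, -1)
    pred_next = forward_model(z[:, :-1])      # predict z_{t+1} from z_t
    target = z[:, 1:].detach()                # stop-gradient on the target (one common choice)
    return ((pred_next - target) ** 2).mean()

loss = dynamics_pretrain_loss(torch.randn(4, 8, 3, 64, 64))
```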

FISH : Fast imitation from humans (Haldar et al. 2023)

  • 1–3 demos + online RL → rapid skill learning.
  • OT-based reward shaping removes the need for labeled rewards (Sinkhorn sketch below).
  • Works across many robot morphologies.
  • Applied: Reinforcement component of my Vital pipeline.
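
The OT reward idea, as I understand it: match the agent's observation trajectory to the demonstration with optimal transport and use the negative transported cost at each agent step as its reward, so no hand-labeled reward is needed. A self-contained numpy sketch with plain Sinkhorn iterations (my simplification, not the paper's exact recipe):

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropically regularized OT between two uniform empirical measures."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # transport plan, shape (n, m)

def ot_rewards(agent_obs, demo_obs):
    """Per-step imitation reward: how cheaply each agent observation can be
    transported onto the demonstration trajectory (higher is better)."""
    cost = np.linalg.norm(agent_obs[:, None] - demo_obs[None, :], axis=-1)
    cost = cost / (cost.max() + 1e-8)           # scale-normalize for numerical stability
    plan = sinkhorn(cost)
    return -(plan * cost).sum(axis=1)           # one scalar reward per agent step

rewards = ot_rewards(np.random.randn(50, 16), np.random.randn(40, 16))
```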

Robot Utility Models : Zero-shot deployment policies (Etukuru et al. 2024)

  • General-purpose policies deploy to real homes without finetuning.
  • Uses an mLLM retry loop to self-correct (loop sketch below).
  • Real step toward “everywhere robots.”
  • Thought: What’s the failure boundary?
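
The retry loop is the part I find most reusable. Here is roughly the control flow as I picture it; `run_policy` and `ask_vlm_if_succeeded` are hypothetical stubs standing in for the deployed policy and the multimodal-LLM success check, not the paper's actual API.

```python
import random

def run_policy(task_prompt, hint=None):
    """Hypothetical stub for the deployed manipulation policy; returns a final frame."""
    return f"<final image for '{task_prompt}'>"

def ask_vlm_if_succeeded(task_prompt, final_image):
    """Hypothetical stub for the multimodal-LLM check: did the task succeed?"""
    ok = random.random() > 0.5
    return {"success": ok, "explanation": None if ok else "object was not grasped"}

def execute_with_retries(task_prompt, max_attempts=3):
    """Self-correction loop: act, ask the mLLM to judge the outcome, and retry
    with the failure explanation folded back in until success or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        final_image = run_policy(task_prompt, hint=feedback)
        verdict = ask_vlm_if_succeeded(task_prompt, final_image)
        if verdict["success"]:
            return {"success": True, "attempts": attempt}
        feedback = verdict["explanation"]
    return {"success": False, "attempts": max_attempts}

print(execute_with_retries("open the bottom drawer"))
```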

Learning human-to-robot handovers from point clouds (Christen et al. 2023)

  • Vision-only handovers that adjust to human motion in real time.
  • Great sim-to-real performance.
  • Safety-focused trajectory planning.
  • Thought: Could this adapt for rehab robots?

EgoZero : Robot learning from smart glasses (Liu et al. 2025)

  • Zero robot data: training from human videos only.
  • Compact representation works across robot shapes.
  • Robots learn by watching people work.
  • Thought: Can robots learn from my lab footage?

ControlNet : controlled diffusion for structured changes (Zhang et al. 2023)

  • Adds conditioning like edges/segmentation to guide diffusion (usage sketch below).
  • Useful for generating robot training data.
  • Lets you alter environments safely + cheaply.
  • Applied: Socket augmentation for insertion tasks.
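
For reference, this is roughly how edge-conditioned generation is usually done with the Hugging Face diffusers ControlNet pipeline; the checkpoints, file paths, and prompt here are illustrative, not taken from my socket-augmentation setup.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract a Canny edge map from a real photo; the edges become the structural
# constraint the diffusion model must respect (geometry stays, appearance changes).
source = cv2.imread("socket.jpg")                       # illustrative path
edges = cv2.Canny(source, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Same socket geometry, new appearance: cheap domain randomization for insertion data.
augmented = pipe(
    "an electrical socket on a cluttered workshop bench, harsh lighting",
    image=edge_image,
    num_inference_steps=30,
).images[0]
augmented.save("socket_augmented.png")
```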

RoboMaster : Collaborative-trajectory video generation (Fu et al. 2025)

  • Generates realistic manipulation videos with physics-aware control.
  • Decomposes interaction into logical phases for realism.
  • Supports diverse robot skills in simulation.
  • Thought: Synthetic demos at scale, yes please.

RoboPearls : Editable simulation via 3D Gaussian Splatting (Tao et al. 2025)

  • Builds editable environments directly from video.
  • LLM agents help generate training data automatically.
  • Improves robustness via targeted domain randomization.
  • Thought: Fully automated synthetic data seems close.

Learning Video Generation for Robotic Manipulation (Fu et al. 2025)

  • Trajectory-controlled video generation for manipulation tasks.
  • Encourages plausible object-robot interaction modeling.
  • Better for training perception and planning jointly.
  • Thought: Could be paired with VLMs for closed-loop policy learning?

Computer Vision + Compression

End-to-End Optimized Image Compression (Ballé et al. 2017)

  • Neural codecs trained directly on the rate–distortion tradeoff (loss sketch below).
  • Follow-up hyperprior models capture entropy better → improved compression quality.
  • Foundation of learned compression systems today.
  • Applied: My Apple work on ML-powered media compression.
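
The training signal is the Lagrangian rate-distortion objective, roughly L = R + λ·D: estimated bits for the quantized latents under the learned entropy model, plus λ times the reconstruction distortion. A PyTorch-style sketch of just the loss, assuming the encoder/decoder/entropy model already exist (details differ between the 2017 paper and the later hyperprior work):

```python
import torch

def rate_distortion_loss(x, x_hat, likelihoods, lam=0.01):
    """L = R + lambda * D.
    R: bits per pixel, estimated as the sum of -log2 likelihoods that the learned
       entropy model assigns to the quantized latents.
    D: mean squared reconstruction error. `lam` trades bitrate against quality."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(likelihoods).sum() / num_pixels
    distortion = torch.mean((x - x_hat) ** 2)
    return rate_bpp + lam * distortion

# Shapes only, to show what plugs in where: a batch of images, reconstructions,
# and the entropy model's per-element likelihoods of the quantized latent.
x = torch.rand(2, 3, 64, 64)
loss = rate_distortion_loss(x, torch.rand_like(x),
                            torch.rand(2, 192, 4, 4).clamp(min=1e-9))
```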

Good, Cheap & Fast : Overfitted image compression w/ Wasserstein distortion (Ballé Lab 2024)

  • “Overfit the image” strategy with perceptual metrics (toy overfitting sketch below).
  • OT-based distortion preserves details humans notice.
  • Great for single-image compression use cases.
  • Thought: Quality > generality when the target is known.
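
I have not worked through the Wasserstein-distortion details, but the “overfit the image” recipe itself is easy to sketch: fit a tiny coordinate network to a single image (COIN-style), and the quantized weights become the bitstream. This is my own toy version with plain MSE standing in for the paper's perceptual distortion.

```python
import torch
import torch.nn as nn

def overfit_single_image(image, steps=2000, lr=1e-3):
    """image: (3, H, W) tensor in [0, 1]. Fit a small coordinate MLP to this one
    image; its (quantized) weights then act as the compressed representation."""
    _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2) pixel coordinates
    target = image.permute(1, 2, 0).reshape(-1, 3)          # (H*W, 3) RGB targets

    net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 3), nn.Sigmoid())
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(coords) - target) ** 2).mean()         # MSE as a stand-in distortion
        loss.backward()
        opt.step()
    return net            # decode with net(coords).reshape(H, W, 3).permute(2, 0, 1)

tiny_codec = overfit_single_image(torch.rand(3, 32, 32), steps=200)
```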

Cognitive + Language + Generalization

An explainable transformer circuit for compositional generalization (Tang et al. 2025)

  • Pinpoints the exact transformer circuit enabling rule-like compositional generalization.
  • Shows how modifying activations can causally change internal reasoning (activation-patching sketch below).
  • Makes transformers feel explainable, not magical.
  • Question: Could controlling circuits reduce hallucinations?
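
The causal part is essentially activation patching. A minimal toy version (a small MLP instead of a transformer; layer choice and names are mine): cache a layer's activation on a "clean" input, splice it into a run on a "corrupted" input via a forward hook, and check how much of the clean behavior comes back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
                      nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 2))
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)
layer = model[2]   # the component whose causal role we want to test

# 1) Cache the layer's activation on the clean input.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
clean_logits = model(clean)
handle.remove()

# 2) Re-run the corrupted input, but replace that layer's output with the cached
#    clean activation (a forward hook that returns a value overrides the output).
handle = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched_logits = model(corrupted)
handle.remove()

corrupted_logits = model(corrupted)
# If patching this layer moves the corrupted output back toward the clean one,
# the layer is causally implicated in the behavior under study.
print(clean_logits, corrupted_logits, patched_logits)
```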

Do large language models reason causally like us? Even better? (Dettki et al. 2025)

  • Some LLMs perform correct causal inference beyond memorized patterns.
  • Others rely on shortcuts and are vulnerable to misleading cues.
  • Highlights when “intelligence” is fragile.
  • Thought: Where do humans still outperform machines?

Learnability from single child linguistic input (Qin et al. 2024)

  • Models can learn grammar + semantics from only one child’s environment.
  • Strong proof of data-efficient language learning.
  • Perhaps human-like learning isn’t so mysterious.
  • Thought: What is still missing: curiosity? Grounding?

gSCAN benchmark for grounded compositional generalization (Ruis et al. 2020)

  • Tests whether agents follow new instructions correctly in new contexts.
  • Most models struggle with simple concept recombinations.
  • Reveals the gap between real understanding and memorization.
  • Question: Why are children so much better at this?

Grounded language learning from child egocentric video (Vong et al. 2024)

  • Learns word meanings from video + speech captured from a single child.
  • Minimal supervision → surprisingly rich language grounding.
  • Shows how environments shape vocabulary.
  • Thought: This feels like the blueprint for human-aligned learning.

Infant cognition-inspired benchmark for agency & intention (Wenjie et al. 2024)

  • Tests AI on social concepts babies understand: goals, help/hinder, beliefs.
  • Transformers still fail on intuitive social reasoning.
  • Machines lack the “common sense” we’re born with.
  • Thought: Maybe robots need new social priors.

Rapid word learning via Meta In-Context Learning (Minnow) (Wang et al. 2025)

  • Teaches models to learn new words instantly from just a few examples (episode sketch below).
  • Small models close the gap with LLMs using better training recipes.
  • Huge for robotics vocabulary grounding.
  • Thought: Could robots learn from your voice in real time?
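
To ground what "learning a word from a few examples" looks like as data, here is a toy episode builder in the spirit of meta in-context learning: a few support sentences use the new word (replaced by a nonce token so nothing can be memorized), then a query forces the model to use it. The format is my own, not the paper's.

```python
import random

def make_word_learning_episode(word_usages, query, nonce="<dax>"):
    """Build one episode: support sentences introducing a new word, then a query
    that can only be answered by inferring what the word means from those uses."""
    support = [usage.replace("{word}", nonce) for usage in word_usages]
    random.shuffle(support)                      # order should not matter
    prompt = "Examples with a new word:\n" + "\n".join(f"- {s}" for s in support)
    prompt += "\n\nQuestion: " + query.replace("{word}", nonce)
    return prompt

episode = make_word_learning_episode(
    ["she picked up the {word} and poured water from it",
     "the {word} shattered when it slipped off the table",
     "he rinsed the {word} in the sink"],
    query="is a {word} more likely made of glass or of smoke?")
print(episode)
```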
Abha Wadjikar