I really got into research this year and spent the past 6-7 months reading a variety of papers. This is a growing log of the ones I found most worth taking notes on, with short takeaways for each. The domains covered include how models learn, how robots generalize across modalities, and how humans acquire concepts.
Robotics + Generalizable Learning
Generalizable Robotic Insertion with World Models (Hansen et al. 2024)
- A single model performs 90+ insertion tasks.
- Visual + proprioception for robust manipulation.
- Strong zero-shot generalization to unseen objects.
- Thought: What about deformables like cables?
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLMs (Ye et al. 2025)
- Builds a unified vision–audio–text model that learns all modalities in a shared latent space, instead of stitching them together with adapters.
- Introduces architectural pieces like OmniAlignNet and Temporal Embedding Grouping to strengthen cross-modal alignment (a generic alignment loss is sketched below).
- Achieves state-of-the-art omni-modal reasoning while using only 0.2T tokens — nearly 6× less than Qwen2.5-Omni.
- Thought: Makes me wonder if robots could use this to interpret scenes through both tone and motion in real time.
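To make the alignment idea concrete, here is a generic cross-modal contrastive loss in the spirit of shared-latent-space training. This is my own stand-in illustration, not the paper's OmniAlignNet; the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vision_emb, audio_emb, temperature=0.07):
    # Normalize both modalities into the same unit-sphere latent space.
    v = F.normalize(vision_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    # Similarity of every vision-audio pair in the batch.
    logits = v @ a.T / temperature
    # Matching pairs sit on the diagonal; score alignment as classification in both directions.
    targets = torch.arange(v.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```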
AnySkin: Plug-and-Play Tactile Skin (Bhirangi et al. 2024)
- Replaceable tactile “skin” easily attaches to different robots.
- Low-cost sensorization → scalable touch feedback.
- Enables rapid prototyping of contact-rich tasks.
- Applied: Relevant to Vital tactile policy training.
DynaMo: In-domain dynamics pretraining (Cui et al. 2024)
- Learns latent dynamics from small real datasets (sketched below).
- Better than MAE-style pretraining for control.
- Helps robots adapt actions based on state change.
- Thought: Perfect match for sample-limited robotics.
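A minimal sketch of what I mean by in-domain latent dynamics pretraining, assuming flattened observation vectors and a simple forward model; this is my simplification, not the DynaMo architecture.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Encode consecutive observations and predict the next latent from the current one."""
    def __init__(self, obs_dim=512, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def loss(self, obs_t, obs_t1):
        z_t, z_t1 = self.encoder(obs_t), self.encoder(obs_t1)
        pred = self.dynamics(z_t)
        # Regress the predicted latent onto the (detached) next-step latent.
        return ((pred - z_t1.detach()) ** 2).mean()
```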
FISH: Fast imitation from humans (Haldar et al. 2023)
- 1–3 demos + online RL → rapid skill learning.
- OT-based reward shaping removes the need for labeled rewards (sketched below).
- Works across many robot morphologies.
- Applied: Reinforcement component of my Vital pipeline.
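Roughly how I understand the OT reward idea: score each step of the agent's rollout by how cheaply it transports onto the demonstration. This is a from-scratch Sinkhorn sketch under my own assumptions, not the FISH implementation.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iters=100):
    # Entropic-regularized optimal transport with uniform marginals (plain Sinkhorn iterations).
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan

def ot_rewards(agent_obs, demo_obs):
    # Cost = pairwise Euclidean distance between agent and demo observation features.
    cost = np.linalg.norm(agent_obs[:, None] - demo_obs[None, :], axis=-1)
    plan = sinkhorn_plan(cost)
    # Per-step reward: negative transport cost assigned to that agent step.
    return -(plan * cost).sum(axis=1)
```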
Robot Utility Models: Zero-shot deployment policies (Etukuru et al. 2024)
- General-purpose policies deploy to real homes without finetuning.
- Uses an mLLM retry loop to self-correct (sketched below).
- Real step toward “everywhere robots.”
- Thought: What’s the failure boundary?
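The retry loop, as I picture it. This is a hedged sketch with hypothetical `policy`, `env`, and `verify_with_mllm` interfaces, not the Robot Utility Models code.

```python
def run_with_retries(policy, env, verify_with_mllm, task, max_attempts=3):
    """Run the utility policy, ask a multimodal LLM whether the task looks done, retry if not."""
    for attempt in range(max_attempts):
        obs = env.get_observation()          # hypothetical: current camera + proprio state
        policy.rollout(env, obs)             # hypothetical: execute the policy once on the scene
        image = env.capture_image()          # hypothetical: snapshot for the verifier
        if verify_with_mllm(image, task):    # e.g. "is the drawer closed in this image?"
            return True                      # the mLLM judged the task complete
    return False                             # give up after max_attempts
```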
Learning human-to-robot handovers from point clouds (Christen et al. 2023)
- Vision-only handovers that adapt to human motion in real time.
- Great sim-to-real performance.
- Safety-focused trajectory planning.
- Thought: Could this adapt for rehab robots?
EgoZero: Robot learning from smart glasses (Liu et al. 2025)
- Zero robot data: training from human videos only.
- Compact representation works across robot shapes.
- Robots learn by watching people work.
- Thought: Can robots learn from my lab footage?
ControlNet: controlled diffusion for structured changes (Zhang et al. 2023)
- Adds conditioning like edges/segmentation to guide diffusion (sketched below).
- Useful for generating robot training data.
- Lets you alter environments safely + cheaply.
- Applied: Socket augmentation for insertion tasks.
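My mental model of the mechanism, as a hedged PyTorch sketch (simplified; the real ControlNet copies the full encoder and uses zero convolutions throughout, and I assume the block preserves channel count and that `cond` is already encoded to the block's input shape): a trainable copy of a frozen block processes the condition and is merged back through a zero-initialized conv, so training starts exactly at the original model's behavior.

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, frozen_block, channels):
        super().__init__()
        self.frozen = frozen_block                         # original diffusion block, weights frozen
        self.control = copy.deepcopy(frozen_block)         # trainable copy for the condition path
        self.zero_conv = nn.Conv2d(channels, channels, 1)  # zero-initialized merge layer
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, cond):
        # At init the zero conv outputs zeros, so the block behaves like the frozen original.
        out = self.frozen(x)
        return out + self.zero_conv(self.control(x + cond))
```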
RoboMaster: Collaborative-trajectory video generation (Fu et al. 2025)
- Generates realistic manipulation videos with physics-aware control.
- Decomposes interaction into logical phases for realism.
- Supports diverse robot skills in simulation.
- Thought: Synthetic demos at scale? Yes, please.
RoboPearls: Editable simulation via 3D Gaussian Splatting (Tao et al. 2025)
- Builds editable environments directly from video.
- LLM agents help generate training data automatically.
- Improves robustness via targeted domain randomization.
- Thought: Fully automated synthetic data seems close.
Learning Video Generation for Robotic Manipulation (Fu et al. 2025)
- Trajectory-controlled video generation for manipulation tasks.
- Encourages plausible object-robot interaction modeling.
- Better for training perception and planning jointly.
- Thought: Could be paired with VLMs for closed-loop policy learning?
Computer Vision + Compression
End-to-End Optimized Image Compression (Ballé et al. 2017)
- Neural codecs trained directly on the rate–distortion tradeoff (loss sketched below).
- Hyperpriors (from the follow-up work) model entropy better → improved compression quality.
- Foundation of learned compression systems today.
- Applied: My Apple work on ML-powered media compression.
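The training objective, as I'd sketch it: a hedged PyTorch stand-in that assumes an entropy model producing per-element likelihoods of the quantized latents, not Ballé et al.'s actual code.

```python
import torch

def rate_distortion_loss(x, x_hat, likelihoods, lam=0.01):
    # D: distortion between the original image and its reconstruction (MSE here for simplicity).
    distortion = torch.mean((x - x_hat) ** 2)
    # R: estimated bits from the entropy model's likelihoods, normalized per input element.
    rate = -torch.log2(likelihoods).sum() / x.numel()
    # Lagrangian tradeoff: lambda picks the operating point on the rate-distortion curve.
    return rate + lam * distortion
```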
Good, Cheap & Fast: Overfitted image compression w/ Wasserstein distortion (Ballé Lab 2024)
- “Overfit the image” strategy with perceptual metrics (sketched below).
- OT-based distortion preserves details humans notice.
- Great for single-image compression use cases.
- Thought: Quality > generality when the target is known.
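A hedged sketch of the "overfit one image" idea as I understand it, heavily simplified: I use MSE as a placeholder where the paper uses a Wasserstein-based perceptual distortion, a crude rate proxy instead of a real entropy model, and assume image dimensions divisible by 16.

```python
import torch
import torch.nn as nn

def overfit_single_image(image, steps=2000, lam=0.01):
    # image: (1, 3, H, W) tensor; the latent grid is 16x smaller than the image in each dimension.
    latent = nn.Parameter(torch.randn(1, 8, image.shape[2] // 16, image.shape[3] // 16))
    decoder = nn.Sequential(
        nn.ConvTranspose2d(8, 32, 4, stride=4), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=4),
    )
    opt = torch.optim.Adam([latent, *decoder.parameters()], lr=1e-3)
    for _ in range(steps):
        noisy = latent + torch.rand_like(latent) - 0.5   # additive-noise proxy for quantization
        recon = decoder(noisy)
        rate_proxy = latent.abs().mean()                 # crude stand-in for an entropy model
        distortion = ((recon - image) ** 2).mean()       # placeholder for Wasserstein distortion
        loss = distortion + lam * rate_proxy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach(), decoder                      # the per-image "code" and decoder
```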
Cognitive + Language + Generalization
An explainable transformer circuit for compositional generalization (Tang et al. 2025)
- Pinpoints the exact transformer circuit enabling rule-like compositional generalization.
- Shows how modifying activations can change internal reasoning causally.
- Makes transformers feel explainable, not magical.
- Question: Could controlling circuits reduce hallucinations?
Do large language models reason causally like us? Even better? (Dettki et al. 2025)
- Some LLMs perform correct causal inference beyond memorized patterns.
- Others rely on shortcuts and are vulnerable to misleading cues.
- Highlights when “intelligence” is fragile.
- Thought: Where do humans still outperform machines?
Learnability from single child linguistic input (Qin et al. 2024)
- Models can learn grammar + semantics from only one child’s environment.
- Strong proof of data-efficient language learning.
- Perhaps human-like learning isn’t so mysterious.
- Thought: What is still missing: curiosity? Grounding?
gSCAN benchmark for grounded compositional generalization (Ruis et al. 2020)
- Tests whether agents follow new instructions correctly in new contexts.
- Most models struggle with simple concept recombinations.
- Reveals the gap between real understanding and memorization.
- Question: Why are children so much better at this?
Grounded language learning from child egocentric video (Vong et al. 2024)
- Learns word meanings from video + speech captured from a single child.
- Minimal supervision → surprisingly rich language grounding.
- Shows how environments shape vocabulary.
- Thought: This feels like the blueprint for human-aligned learning.
Infant cognition-inspired benchmark for agency & intention (Wenjie et al. 2024)
- Tests AI on social concepts babies understand: goals, help/hinder, beliefs.
- Transformers still fail on intuitive social reasoning.
- Machines lack the “common sense” we’re born with.
- Thought: Maybe robots need new social priors.
Rapid word learning via Meta In-Context Learning (Minnow) (Wang et al. 2025)
- Teaches models to learn new words instantly from just a few examples (episode format sketched below).
- Small models close the gap with LLMs using better training recipes.
- Huge for robotics vocabulary grounding.
- Thought: Could robots learn from your voice in real time?
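To make the setup concrete, here's how I imagine an episode looks. This is my illustration of the general meta in-context learning recipe, not the paper's dataset code: a few study sentences introduce a placeholder for a novel word, and a held-out sentence tests whether the model can use it.

```python
import random

def make_episode(word, sentences, n_study=3, placeholder="<new_word>"):
    # Collect sentences that use the target word, split into study examples and one query.
    uses = [s for s in sentences if word in s.split()]
    random.shuffle(uses)
    study, query = uses[:n_study], uses[n_study]
    mask = lambda s: " ".join(placeholder if t == word else t for t in s.split())
    # The model sees the masked study sentences in context, then must handle the query.
    return "\n".join(mask(s) for s in study) + "\n" + mask(query)

sentences = [
    "the dax is sleeping on the mat",
    "i fed the dax this morning",
    "a small dax ran across the yard",
    "my dax likes to chase string",
]
print(make_episode("dax", sentences))
```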