I really got into research this year and spent the past 6-7 months reading a variety of papers. This is a growing log of the ones I found most worth taking notes on, with short takeaways for each. The domains covered include how models learn, how robots generalize across modalities, and how humans acquire concepts.
Robotics + Generalizable Learning
Generalizable Robotic Insertion with World Models (Hansen et al. 2024)
- A single model performs 90+ insertion tasks.
- Visual + proprioception for robust manipulation.
- Strong zero-shot generalization to unseen objects.
- Thought: What about deformables like cables?
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLMs (Ye et al. 2025)
- Builds a unified vision–audio–text model that learns all modalities in a shared latent space, instead of stitching them together with adapters.
- Introduces architectural pieces like OmniAlignNet and Temporal Embedding Grouping to strengthen cross-modal alignment (a generic alignment loss is sketched below).
- Achieves state-of-the-art omni-modal reasoning while using only 0.2T tokens — nearly 6× less than Qwen2.5-Omni.
- Thought: Makes me wonder if robots could use this to interpret scenes through both tone and motion in real time.
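To make the alignment idea concrete, here is a generic cross-modal contrastive loss in the spirit of shared-latent-space training. This is my own stand-in illustration, not the paper's OmniAlignNet; the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vision_emb, audio_emb, temperature=0.07):
    # Normalize both modalities into the same unit-sphere latent space.
    v = F.normalize(vision_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    # Similarity of every vision-audio pair in the batch.
    logits = v @ a.T / temperature
    # Matching pairs sit on the diagonal; score alignment as classification in both directions.
    targets = torch.arange(v.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```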
AnySkin: Plug-and-Play Tactile Skin (Bhirangi et al. 2024)
- Replaceable tactile “skin” easily attaches to different robots.
- Low-cost sensorization → scalable touch feedback.
- Enables rapid prototyping of contact-rich tasks.
- Applied: Relevant to Vital tactile policy training.
DynaMo: In-domain dynamics pretraining (Cui et al. 2024)
- Learns latent dynamics from small real datasets (sketched below).
- Better than MAE-style pretraining for control.
- Helps robots adapt actions based on state change.
- Thought: Perfect match for sample-limited robotics.
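A minimal sketch of what I mean by in-domain latent dynamics pretraining, assuming flattened observation vectors and a simple forward model; this is my simplification, not the DynaMo architecture.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Encode consecutive observations and predict the next latent from the current one."""
    def __init__(self, obs_dim=512, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def loss(self, obs_t, obs_t1):
        z_t, z_t1 = self.encoder(obs_t), self.encoder(obs_t1)
        pred = self.dynamics(z_t)
        # Regress the predicted latent onto the (detached) next-step latent.
        return ((pred - z_t1.detach()) ** 2).mean()
```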
FISH: Fast imitation from humans (Haldar et al. 2023)
- 1–3 demos + online RL → rapid skill learning.
- OT-based reward shaping removes the need for labeled rewards (sketched below).
- Works across many robot morphologies.
- Applied: Reinforcement component of my Vital pipeline.
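Roughly how I understand the OT reward idea: score each step of the agent's rollout by how cheaply it transports onto the demonstration. This is a from-scratch Sinkhorn sketch under my own assumptions, not the FISH implementation.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iters=100):
    # Entropic-regularized optimal transport with uniform marginals (plain Sinkhorn iterations).
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan

def ot_rewards(agent_obs, demo_obs):
    # Cost = pairwise Euclidean distance between agent and demo observation features.
    cost = np.linalg.norm(agent_obs[:, None] - demo_obs[None, :], axis=-1)
    plan = sinkhorn_plan(cost)
    # Per-step reward: negative transport cost assigned to that agent step.
    return -(plan * cost).sum(axis=1)
```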
Robot Utility Models: Zero-shot deployment policies (Etukuru et al. 2024)
- General-purpose policies deploy to real homes without finetuning.
- Uses an mLLM retry loop to self-correct (sketched below).
- Real step toward “everywhere robots.”
- Thought: What’s the failure boundary?
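The retry loop, as I picture it. This is a hedged sketch with hypothetical `policy`, `env`, and `verify_with_mllm` interfaces, not the Robot Utility Models code.

```python
def run_with_retries(policy, env, verify_with_mllm, task, max_attempts=3):
    """Run the utility policy, ask a multimodal LLM whether the task looks done, retry if not."""
    for attempt in range(max_attempts):
        obs = env.get_observation()          # hypothetical: current camera + proprio state
        policy.rollout(env, obs)             # hypothetical: execute the policy once on the scene
        image = env.capture_image()          # hypothetical: snapshot for the verifier
        if verify_with_mllm(image, task):    # e.g. "is the drawer closed in this image?"
            return True                      # the mLLM judged the task complete
    return False                             # give up after max_attempts
```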
Learning human-to-robot handovers from point clouds (Christen et al. 2023)
- Vision-only handovers that adapt to human motion in real time.
- Great sim-to-real performance.
- Safety-focused trajectory planning.
- Thought: Could this adapt for rehab robots?
EgoZero: Robot learning from smart glasses (Liu et al. 2025)
- Zero robot data: training from human videos only.
- Compact representation works across robot shapes.
- Robots learn by watching people work.
- Thought: Can robots learn from my lab footage?
ControlNet: controlled diffusion for structured changes (Zhang et al. 2023)
- Adds conditioning like edges/segmentation to guide diffusion (sketched below).
- Useful for generating robot training data.
- Lets you alter environments safely + cheaply.
- Applied: Socket augmentation for insertion tasks.
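My mental model of the mechanism, as a hedged PyTorch sketch (simplified; the real ControlNet copies the full encoder and uses zero convolutions throughout, and I assume the block preserves channel count and that `cond` is already encoded to the block's input shape): a trainable copy of a frozen block processes the condition and is merged back through a zero-initialized conv, so training starts exactly at the original model's behavior.

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, frozen_block, channels):
        super().__init__()
        self.frozen = frozen_block                         # original diffusion block, weights frozen
        self.control = copy.deepcopy(frozen_block)         # trainable copy for the condition path
        self.zero_conv = nn.Conv2d(channels, channels, 1)  # zero-initialized merge layer
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, cond):
        # At init the zero conv outputs zeros, so the block behaves like the frozen original.
        out = self.frozen(x)
        return out + self.zero_conv(self.control(x + cond))
```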
RoboMaster: Collaborative-trajectory video generation (Fu et al. 2025)
- Generates realistic manipulation videos with physics-aware control.
- Decomposes interaction into logical phases for realism.
- Supports diverse robot skills in simulation.
- Thought: Synthetic demos at scale? Yes, please.
RoboPearls: Editable simulation via 3D Gaussian Splatting (Tao et al. 2025)
- Builds editable environments directly from video.
- LLM agents help generate training data automatically.
- Improves robustness via targeted domain randomization.
- Thought: Fully automated synthetic data seems close.
Learning Video Generation for Robotic Manipulation (Fu et al. 2025)
- Trajectory-controlled video generation for manipulation tasks.
- Encourages plausible object-robot interaction modeling.
- Better for training perception and planning jointly.
- Thought: Could be paired with VLMs for closed-loop policy learning?
Computer Vision + Compression
End-to-End Optimized Image Compression (Ballé et al. 2017)
- Neural codecs trained directly on the rate–distortion tradeoff (loss sketched below).
- Hyperpriors (from the follow-up work) model entropy better → improved compression quality.
- Foundation of learned compression systems today.
- Applied: My Apple work on ML-powered media compression.
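The training objective, as I'd sketch it: a hedged PyTorch stand-in that assumes an entropy model producing per-element likelihoods of the quantized latents, not Ballé et al.'s actual code.

```python
import torch

def rate_distortion_loss(x, x_hat, likelihoods, lam=0.01):
    # D: distortion between the original image and its reconstruction (MSE here for simplicity).
    distortion = torch.mean((x - x_hat) ** 2)
    # R: estimated bits from the entropy model's likelihoods, normalized per input element.
    rate = -torch.log2(likelihoods).sum() / x.numel()
    # Lagrangian tradeoff: lambda picks the operating point on the rate-distortion curve.
    return rate + lam * distortion
```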
Good, Cheap & Fast: Overfitted image compression w/ Wasserstein distortion (Ballé Lab 2024)
- “Overfit the image” strategy with perceptual metrics (sketched below).
- OT-based distortion preserves details humans notice.
- Great for single-image compression use cases.
- Thought: Quality > generality when the target is known.
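A hedged sketch of the "overfit one image" idea as I understand it, heavily simplified: I use MSE as a placeholder where the paper uses a Wasserstein-based perceptual distortion, a crude rate proxy instead of a real entropy model, and assume image dimensions divisible by 16.

```python
import torch
import torch.nn as nn

def overfit_single_image(image, steps=2000, lam=0.01):
    # image: (1, 3, H, W) tensor; the latent grid is 16x smaller than the image in each dimension.
    latent = nn.Parameter(torch.randn(1, 8, image.shape[2] // 16, image.shape[3] // 16))
    decoder = nn.Sequential(
        nn.ConvTranspose2d(8, 32, 4, stride=4), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=4),
    )
    opt = torch.optim.Adam([latent, *decoder.parameters()], lr=1e-3)
    for _ in range(steps):
        noisy = latent + torch.rand_like(latent) - 0.5   # additive-noise proxy for quantization
        recon = decoder(noisy)
        rate_proxy = latent.abs().mean()                 # crude stand-in for an entropy model
        distortion = ((recon - image) ** 2).mean()       # placeholder for Wasserstein distortion
        loss = distortion + lam * rate_proxy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach(), decoder                      # the per-image "code" and decoder
```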
Cognitive + Language + Generalization
An explainable transformer circuit for compositional generalization (Tang et al. 2025)
- Pinpoints the exact transformer circuit enabling rule-like compositional generalization.
- Shows how modifying activations can change internal reasoning causally.
- Makes transformers feel explainable, not magical.
- Question: Could controlling circuits reduce hallucinations?
Do large language models reason causally like us? Even better? (Dettki et al. 2025)
- Some LLMs perform correct causal inference beyond memorized patterns.
- Others rely on shortcuts and are vulnerable to misleading cues.
- Highlights when “intelligence” is fragile.
- Thought: Where do humans still outperform machines?
Learnability from single child linguistic input (Qin et al. 2024)
- Models can learn grammar + semantics from only one child’s environment.
- Strong proof of data-efficient language learning.
- Perhaps human-like learning isn’t so mysterious.
- Thought: What is still missing: curiosity? Grounding?
gSCAN benchmark for grounded compositional generalization (Ruis et al. 2020)
- Tests whether agents follow new instructions correctly in new contexts.
- Most models struggle with simple concept recombinations.
- Reveals the gap between real understanding and memorization.
- Question: Why are children so much better at this?
Grounded language learning from child egocentric video (Vong et al. 2024)
- Learns word meanings from video + speech captured from a single child.
- Minimal supervision → surprisingly rich language grounding.
- Shows how environments shape vocabulary.
- Thought: This feels like the blueprint for human-aligned learning.
Infant cognition-inspired benchmark for agency & intention (Wenjie et al. 2024)
- Tests AI on social concepts babies understand: goals, help/hinder, beliefs.
- Transformers still fail on intuitive social reasoning.
- Machines lack the “common sense” we’re born with.
- Thought: Maybe robots need new social priors.
Rapid word learning via Meta In-Context Learning (Minnow) (Wang et al. 2025)
- Teaches models to learn new words instantly from just a few examples (episode format sketched below).
- Small models close the gap with LLMs using better training recipes.
- Huge for robotics vocabulary grounding.
- Thought: Could robots learn from your voice in real time?
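To make the setup concrete, here's how I imagine an episode looks. This is my illustration of the general meta in-context learning recipe, not the paper's dataset code: a few study sentences introduce a placeholder for a novel word, and a held-out sentence tests whether the model can use it.

```python
import random

def make_episode(word, sentences, n_study=3, placeholder="<new_word>"):
    # Collect sentences that use the target word, split into study examples and one query.
    uses = [s for s in sentences if word in s.split()]
    random.shuffle(uses)
    study, query = uses[:n_study], uses[n_study]
    mask = lambda s: " ".join(placeholder if t == word else t for t in s.split())
    # The model sees the masked study sentences in context, then must handle the query.
    return "\n".join(mask(s) for s in study) + "\n" + mask(query)

sentences = [
    "the dax is sleeping on the mat",
    "i fed the dax this morning",
    "a small dax ran across the yard",
    "my dax likes to chase string",
]
print(make_episode("dax", sentences))
```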