Simulation and human videos are an exciting data well for robot world models given how difficult real-world robot data is to collect, but Sergey Levine (UC Berkeley professor and co-founder of Physical Intelligence) just published an incredibly sobering take on the danger of venturing too deeply in this direction. Some researchers have become so excited about these alternative data sources that they treat them as a replacement for the real thing.

But just as LLMs need lots of text data and VLMs need text-image pairs, VLA (vision-language-action) models in robotics need lots of data of robots performing real-world tasks. Instead of treating simulation or human video (e.g. FPV videos posted online) as a complete replacement, we should treat it the same way we treat internet data in LLM and VLM pre-training: less directly relevant to the model's ultimate goals, but still valuable for the world knowledge it provides.

At Sieve, we're excited to be contributing to this problem area through our early work with robotics labs making use of human videos for VLA pre-training. If you're interested in learning more about Sergey's take or our human video offering, check out the links in the comments.