On June 25, 2026, Chinese outlet 36Kr published an in-depth report on the emerging supply chain for embodied AI data collection in China, profiling workers who wear sensor gear or teleoperate robots to generate motion and interaction data. Executives estimate that achieving GPT‑3.5‑level embodied capabilities could require on the order of 100 million hours of such data, far above today’s available datasets.
This article aggregates reporting from 1 news source. The TL;DR is AI-generated from original reporting. Race to AGI's analysis provides editorial context on implications for AGI development.
Embodied AI has a data problem: robots don’t yet have the equivalent of the trillion-token internet scrape that fed LLMs. This 36Kr investigation makes that challenge visceral by focusing on the new class of ‘data workers’ who wear sensor suits or teleoperate robots so labs can record motion trajectories, object interactions, and first‑person video. The scale estimate, on the order of 100 million hours to reach GPT‑3.5‑like embodied competence, implies years of sustained data labor if current collection methods continue.
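To give a rough sense of what "years" could mean here, the back-of-envelope sketch below divides the 100-million-hour estimate by a workforce's annual recording capacity. Only the 100 million hours comes from the reporting. The workforce sizes and the 2,000 recorded hours per collector per year are illustrative assumptions, not figures from 36Kr, and the helper name `years_to_collect` is hypothetical.

```python
def years_to_collect(target_hours: float,
                     collectors: int,
                     hours_per_collector_per_year: float) -> float:
    """Years a fixed workforce needs to record `target_hours` of data."""
    if collectors <= 0 or hours_per_collector_per_year <= 0:
        raise ValueError("collectors and hours per year must be positive")
    return target_hours / (collectors * hours_per_collector_per_year)


# From the article: executives' order-of-magnitude estimate.
TARGET_HOURS = 100_000_000

# Assumption (not from the article): roughly a full-time year of usable
# recording per collector, ignoring setup time and discarded takes.
HOURS_PER_COLLECTOR_PER_YEAR = 2_000

# Assumption (not from the article): hypothetical workforce sizes.
for collectors in (1_000, 10_000, 100_000):
    years = years_to_collect(TARGET_HOURS, collectors,
                             HOURS_PER_COLLECTOR_PER_YEAR)
    print(f"{collectors:>7,} collectors -> {years:g} years")
```

Under these assumptions, a workforce of 10,000 full-time collectors would need about five years. The result scales inversely with headcount and with the share of recorded hours that turn out to be usable, so the real figure could differ substantially.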
Strategically, this suggests that whoever controls large, proprietary embodied datasets will enjoy a moat similar to that of today’s frontier LLM labs, but for physical intelligence. It also raises uncomfortable questions about labor conditions, compensation, and safety for the workers whose bodies and time become raw material for training robots. The analogy to crowdworkers labeling text and images is clear, but the stakes may be higher when physical risk and repetitive motion are involved.
For the AGI race, the article is a reminder that “general intelligence” in the real world is not just a matter of stacking more GPUs; it’s constrained by how fast we can collect rich, diverse interaction data. If that process is slow, dangerous, or socially contested, it could become a bottleneck and a flashpoint in AI politics, especially in countries where robotics is a national priority.

