We propose a method for activity recognition from video and accelerometer data. Accelerometers are localized and tracked visually to establish cross-modal relations. RETLETS encode the relative motion between tracked objects and local visual features. Recognition with various feature combinations is evaluated on the 50 Salads dataset, where the method using RETLETS outperforms the state of the art.