The existing speculation mechanism has fundamental flaws in mitigating intra-node stragglers and completed-task stragglers, both of which are often caused by node failures. When failures occur, these flaws slow small jobs by more than an order of magnitude and seriously degrade the performance of large jobs. A hybrid solution combines a speculation mechanism that addresses these issues with a scheduling policy that improves failure awareness and recovery. An implementation of the solution shows a striking performance improvement in MapReduce failure recovery.