We describe a novel model for executing distributed-memory parallel programs using uncoordinated tasks.
We then present several offline optimizations for the proposed model.
We examine the effects of these optimizations on modern processors with wide vector units.
Increasing the level of task coalescence can improve throughput.
We observe performance gains in both single-node and multi-node experiments.