The exascale computing roadmap has highlighted efficient locality oriented scheduling in runtime systems as one of the most important challenges (”Concurrency and Locality” Challenge [Kogge et al.]). Massively parallel many core architectures have NUMA characteristics in memory behavior, with a large gap between the local and the remote memory
latency. Further, future single nodes could have hundreds to thousands of cores with deep cache hierarchy and shared caches. Unless efficiently exploited, this complicated system architecture could lead to non-scalability and performance issues. Languages such as X10 [Charles et al.], Chapel [Chamberlain et al.] and Fortress [Allan et al.] are based on partitioned global address space (PGAS [Yellick et al.]) paradigm. They have been designed and implemented as part of DARPA HPCS program (www.highproductivity.org) for higher productivity and performance on many-core massively parallel platforms. These languages have in-built support for initial placement of threads (also referred as activities) and data structures in the parallel program. Therefore, locality comes implicitly with the program. The run-time systems of these languages need to provide efficient algorithmic scheduling of parallel computations.
Further, movement of massive amounts (Terabytes to Petabytes) of data is very expensive, which necessitates affinity driven computations. Therefore, distributed scheduling of parallel computations on multiple places needs to optimize multiple performance objectives: follow affinity maximally and ensure efficient space, time and message complexity. Further, achieving good load balancing can be contradictory to ensuring affinity which leads to challenging trade-offs in distributed scheduling. In addition, parallel computations have data dependent execution patterns which requires online scheduling to effectively optimize the computation orchestration as it unfolds. With continuous demand of processing larger and larger data volumes (from petabytes to exabytes and beyond), one needs to ensure data scalability along with scalability with the respect to number of compute nodes and cores in the target system. Thus, the scheduling framework needs to consider IO bottlenecks along with compute and memory bandwidth bottlenecks in the system to enable strong scalability and performance. Simultaneous consideration of these objectives makes distributed scheduling a particularly challenging problem.
Distributed Scheduling for parallel computations is a well studied problem in the shared memory context starting from the pioneering research by Blumofe and Leiserson [Blumofe et al.] on Cilk scheduling, followed by later work including [Arora et al.] [Acar et al.] [Blumofe, Lisiecki] [Guo et al.] amongst many others. These efforts are primarily focused on work-stealing efficiency improvement in shared-memory architectures without considering explicit affinity annotations by the programmer. Alongside, Parallel Depth First (PDF) based scheduling approaches [Blleloch et al.] [Chen et al.] for CMP architectures have been proven to have lower cache miss rates than work-stealing approaches and enable constructive cache sharing for general computation DAGs. Further, strategies such as Controlled-PDF [Blelloch, Chowdhury et al.] deliver higher performance for divide-conquer type algorithms with theoretical bounds on the number of cache misses for both L1 and L2 caches in the cache hierarchy. These approaches and insights could serve well for a single node in a large multi-core cluster.
With the advent of distributed memory architectures, lot of recent research on distributed scheduling looks at multi-core and many-core clusters [Saraswat et al.] [Sun et al.]. Min et al. provide a dynamic tasking library (HotSLAW) for many-core clusters that uses topology-aware hierarchical work stealing strategy for both NUMA and distributed memory systems. All these recent efforts primarily achieve load balancing using (locality-aware) work stealing across the nodes in the system. Although this strategy works well for slightly irregular computation such as UTS for geometric tree, it could result in parallel inefficiencies when the computation is highly irregular (binomial tree for UTS) or when there are complicated trade-offs between affinity and load-balance as in sparse matrix benchmark such as Conjugate Gradient benchmark. Certain other approaches such as [Oliver et al.] consider limited control and no data-dependencies in the parallel computation, which limits the scope of applicability of the scheduling framework.
Further, for massive scale systems power consumption is one of the biggest concerns. Power reduction and energy conservation are important in these systems for two major reasons: operating cost, and reliability. While Petaflop systems might require around 100 MW for peak performance, they also could sustain one hardware failure per twenty-four hour period. Moreover, using Arrhenius Law, component life expectancy decreases 50% for every 10° C (18° F) temperature increase. Reducing a component's operating temperature the same amount (consuming less energy) doubles the life expectancy. This necessitates distributed scheduling strategies that consider optimal trade-offs between performance and power using techniques such as DVS (dynamic voltage scaling) for both external (application unaware) and internal (application aware) scheduling. Additionally, low component life expectancy, leads to fault tolerance requirements on the scheduling framework. Approaches for fault tolerance such as those used in the widely popular Hadoop Map Reduce framework need to be considered to increase the usability and reliability of petascale and exascale systems.
In this talk, we present distributed scheduling algorithm (LDS) for multi-place parallel computations, that uses a unique combination of remote (inter-place) spawns and remote work steals to reduce the overheads in the scheduler, which helps to dynamically maintain load balance across the compute nodes of the system, while ensuring affinity maximally. Our design was implemented using GASNet API and POSIX threads. On affinity and load-balance oriented benchmarks such as CG (Conjugate Gradient) and Kmeans clustering, we demonstrate strong performance and scalability on 2048 node BG/P. Using benchmarks such as UTS we show that LDS has lower space requirement than hierarchical work-stealing based approaches such HotSLAW and better performance than Charm++.