Multicore and Extreme Scale Computing: Programming and Software Challenges

RK Shyamasundar

At the beginning of the 2000s, it became clear that sequential performance was falling behind Moore's law. In spite of pipelining, complex out-of-order execution, hierarchical caching and similar mechanisms, sequential throughput fell below expectations due to three walls: power efficiency, the limits of instruction-level parallelism, and memory latency. Around 2002, the industry aggressively pushed forward to address the problem. The natural places to look for a solution were new computer architectures, new programming languages, and new compiler techniques and tools. The architectural pursuit resulted in multicore processors. The new basic architecture can be broadly seen as a switched model for multiple cores on the silicon: a single chip with distinct processing units and multiple independent threads of control. Multicore architecture is now the norm of the day, with offerings from Intel, Sun, and IBM (the Cell processor). With classical sequential processors, programmers could remain oblivious of the processor because high-level languages abstracted it away. With multicore processors, the challenge is to provide a comparable paradigm of programmability. In other words, programming multicore processors so as to exploit the full capability of the machine, while keeping the architecture hidden from the programmer, is a serious challenge.

The DARPA HPCS initiative was aimed at the development of highly productive computer systems, tackling the programming-productivity problem by developing system software that can reduce the burden on the programmer. Under this initiative, languages such as X10 from IBM Research, Chapel from Cray and Fortress from Sun have been under development and experimentation. These languages have constructs for dynamic, fine-grained parallelism that are particularly useful for expressing applications with irregular, data-dependent parallelism.
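As an illustrative sketch (not drawn from the talk itself), the async/finish style of dynamic task spawning found in X10 can be approximated in plain Java with the fork/join framework; a naive Fibonacci computation, whose recursion tree is irregular and data-dependent, makes the runtime's dynamic load balancing visible:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fine-grained, dynamically spawned tasks over an irregular recursion tree.
// fork() plays the role of X10's async; join() plays the role of finish.
public class FibTask extends RecursiveTask<Long> {
    private final int n;

    FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;        // base case: no further splitting
        FibTask left = new FibTask(n - 1);
        left.fork();                       // spawn a child task; the runtime decides where it runs
        long right = new FibTask(n - 2).compute(); // keep one branch on the current worker
        return left.join() + right;        // wait for the spawned child
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(); // work-stealing pool balances load at runtime
        System.out.println(pool.invoke(new FibTask(20))); // prints 6765
    }
}
```

Note that the programmer never says which core runs which task or when; the pool's scheduler makes those decisions, which is exactly the division of labour the HPCS languages aim for.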
Such applications include physical simulations on non-uniform inputs, data classifiers, sparse-matrix operations, computational geometry codes, and a large number of divide-and-conquer algorithms with irregular, data-dependent recursion trees. A programmer expresses this parallelism by specifying a partial execution order over snippets of code, and the compiler partitions these snippets into totally ordered sequences called threads. The programmer need not specify which processors of a parallel computer execute which threads, or exactly when each thread should be executed. Because static, compile-time partitioning of such programs is generally not possible, parallel tasks are scheduled onto the processors at runtime; consequently, the performance of a program depends significantly on the implementation of the runtime system. Runtime systems need to balance the load efficiently, provide good data locality, reduce the memory requirements of the parallel computation, and improve the performance of the parallel program in terms of time and message complexity. Our research initiative addresses these issues and, in particular, advances research in runtime systems by designing and developing online algorithms for affinity-driven distributed scheduling of multi-threaded computations on heterogeneous many-core architectures. In this talk, we first present the problems and issues discussed above, and then present our initial experimental results to show the efficacy of our approach.
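To make the runtime-scheduling discussion concrete, here is a deliberately simplified, hypothetical work-stealing sketch (not the affinity-driven algorithm of the talk): each worker owns a deque, pops work from its own tail for locality, and steals from the head of another worker's deque when it runs dry.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Toy work-stealing scheduler over a fixed pool of integer "tasks".
// Owners pop LIFO (pollLast) for locality; thieves steal FIFO (pollFirst).
public class WorkStealingDemo {
    static final int WORKERS = 4;

    @SuppressWarnings("unchecked")
    public static long run(int nTasks) throws InterruptedException {
        ConcurrentLinkedDeque<Integer>[] deques = new ConcurrentLinkedDeque[WORKERS];
        for (int i = 0; i < WORKERS; i++) deques[i] = new ConcurrentLinkedDeque<>();
        for (int t = 0; t < nTasks; t++) deques[t % WORKERS].addLast(t); // initial distribution

        AtomicInteger remaining = new AtomicInteger(nTasks);
        AtomicLong sum = new AtomicLong();
        Thread[] threads = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                while (remaining.get() > 0) {
                    Integer task = deques[id].pollLast();      // local work first
                    for (int v = 1; task == null && v < WORKERS; v++) {
                        task = deques[(id + v) % WORKERS].pollFirst(); // steal from a victim
                    }
                    if (task != null) {
                        sum.addAndGet(task);                   // "execute" the task
                        remaining.decrementAndGet();
                    }
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        return sum.get(); // sum of 0 .. nTasks-1, regardless of who ran what
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(1000)); // prints 499500
    }
}
```

A production scheduler would add, among other things, the affinity awareness this research targets: preferring to place a task near the data it touches rather than stealing from an arbitrary victim.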
