Power/performance modeling of Single-ISA heterogeneous chip multiprocessor

Abstract

Over the last decade, homogeneous multicore processors have emerged as the prime approach for offering high parallelism, high performance, and scalability across a wide range of applications. Heterogeneous architectures, which comprise specialized cores for specific workloads, are now becoming popular because they can provide high performance and power efficiency at the same time. Exploring heterogeneous architectures is challenging because it requires reconsidering important system parameters, including architecture options, operating system implications, and application development. In this project we explore single-ISA heterogeneous multiprocessors. The project has two objectives. The first is to create new heterogeneous architectures by varying certain processor parameters for the x86 architecture. The second is to perform power and performance simulations of these heterogeneous multiprocessors. The results of the heterogeneous multiprocessor simulations are compared with those of homogeneous multiprocessors in order to understand the power/performance trade-off of heterogeneous chip multiprocessors.

Introduction

Power and performance are two of the key parameters driving research in processor architecture. Multicore processors have become very attractive because they provide higher instruction throughput at lower power, and they give the designer more flexibility to meet specific power/performance goals. As more cores are integrated on the die, commercial operating systems are evolving to support the parallelism provided by multicore processors efficiently. Because different types of cores are now available, the architectural options for designing a multicore processor have also widened. This introduces the possibility of heterogeneous architectures that mix and match big and small cores on the same die to provide a range of power/performance capabilities. In addition to big and small cores, on-die integration of domain-specific accelerators for special-purpose functionality such as graphics and media processing has become widespread.

Previous studies have shown that single-ISA heterogeneous multicore architectures provide better power/performance than homogeneous processors. A single-ISA heterogeneous architecture consists of multiple cores, each representing a different point in the power/performance space. This allows an application to be mapped to the core that best suits its power/performance requirements, which results in higher overall computational efficiency than conventional homogeneous chip multiprocessors.

Most of the previous work assumed a given architecture. Those architectures were composed of existing cores, either different generations of the same processor family or voltage- or frequency-scaled editions of a single processor. While these architectures gave better results than similar homogeneous designs, they failed to reach the full potential of heterogeneity. First, pre-existing cores offer little flexibility in the choice of cores. Second, those core choices generally maintain a monotonic relationship, in that the bigger core of the family is always the more powerful one for every application. In addition, the cores are treated as performing uniformly across all workloads.

In this project we explored single-ISA heterogeneous multicore designs using the x86 architecture. We ran power and performance simulations of these processors over a large design space by varying microarchitectural parameters for different workloads. The results for the heterogeneous multicore processors are compared with those for homogeneous multicore processors. Figure 1 shows a pictorial representation of a single-ISA heterogeneous multicore processor.

Background

Kumar et al. [1] studied heterogeneous architectures built from Alpha cores, with the goal of achieving the highest area or power efficiency. The study covered varying degrees of thread-level parallelism and different area and power budgets. It concluded that the most efficient heterogeneous multiprocessor is not constructed from cores that make good general-purpose uniprocessor cores, or even from cores that would appear in a good homogeneous multiprocessor architecture.
Kumar et al. [2] also evaluated single-ISA heterogeneous multicore architectures as a mechanism to reduce power dissipation. The study incorporated heterogeneous Alpha cores representing different points in the power/performance design space: EV4 (Alpha 21064), EV5 (Alpha 21164), EV6 (Alpha 21264), and a single-threaded version of EV8 (Alpha 21464). Both studies used the SMTSIM simulator, which was developed by one of the co-authors.
The major contributions of these papers were in the following areas:

The best way to design a heterogeneous CMP is to tune each individual core for a class of applications with common characteristics. Customizing cores to subsets of workloads results in processors that are typically non-monotonic (i.e., there is no strict gradation among cores in terms of overall performance or complexity).
The performance advantages of heterogeneous, and even non-monotonic, multiprocessors continue to hold even for a collection of completely homogeneous workloads. In those cases, such processors exploit the diversity across different workloads.

This prior work deals only with multiprogrammed benchmarks and therefore assumes that all properties are additive: it simulates the performance and power of single cores and then adds them for the multiprogram analysis. Our analysis targets multi-threaded execution and therefore requires simulating the multicore processors with multi-threaded programs. Effects such as cache coherence traffic and memory bandwidth contention arise in our study, so the power and performance results of the multicore combinations can differ considerably from those in the papers above.

Chitlur et al. [3] explored various types of heterogeneous architectures, three instances of which are the following:

Core + IP integration: This integrates homogeneous cores with accelerators (also referred to as intellectual property (IP) blocks). In this type of architecture, the IP block provides low-power, high-performance processing for specific domains.
Asymmetric core integration: This integrates big cores (out-of-order, wide-issue) and small cores (in-order, narrow-issue) of the same ISA family, targeted at providing performance or power efficiency when required.
Asymmetric + specialization: This combines asymmetric cores with specialized cores and accelerators.

The main research focus of the paper was designing an architecture that provides the same QoS by deciding which parts of an application run on the big core, which run on the small core, and which need to be accelerated. These architectural decisions can be made by profiling the applications. The paper also addresses OS scheduling on heterogeneous cores. Because current operating systems are unaware of the heterogeneity of the cores, they do not distribute the workload properly across them. The paper derives these correlations by analyzing heuristics of long-running workloads on the heterogeneous platform. Figure 2 shows a pictorial view of the same.

Simulators

SniperSim: Sniper is a next-generation parallel, high-speed, and accurate x86 simulator. This multicore simulator is based on the interval core model and the Graphite simulation infrastructure. It allows fast and accurate simulation and lets the user trade simulation speed for accuracy, which gives a range of flexible simulation options when exploring different homogeneous and heterogeneous multicore architectures. Sniper can perform timing simulations of both multi-program workloads and multi-threaded, shared-memory applications with tens to more than 100 cores, at a high speed compared to existing simulators. Its main feature is its core model, which is based on interval simulation, a fast mechanistic core model. Interval simulation raises the level of abstraction in architectural simulation, which allows for faster simulator development and evaluation times. It does so by 'jumping' between miss events; the stretches of execution between those events are called intervals. Sniper has been validated against multi-socket Intel Core2 and Nehalem systems and provides average performance prediction errors within 25% at a simulation speed of up to several MIPS. The simulator and its interval core model are useful for uncore and system-level studies that require more detail than typical one-IPC models, but for which cycle-accurate simulators are too slow to simulate workloads of meaningful size. As an added benefit, the interval core model allows the generation of CPI stacks, which show the number of cycles lost to different components of the system, such as the cache hierarchy or the branch predictor, and lead to a better understanding of each component's effect on total system performance. This extends the use of Sniper to application characterization and hardware/software co-design.
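
The idea behind the interval model and the CPI stack can be illustrated with a toy first-order calculation. This is a sketch under simplifying assumptions, not Sniper's implementation: the function name `interval_cpi_stack`, the event names, and all numeric values are hypothetical, and miss penalties are assumed to be non-overlapping and simply additive.

```python
def interval_cpi_stack(instructions, dispatch_width, miss_events):
    """Toy first-order interval model.

    miss_events maps a (hypothetical) event name to (count, penalty_cycles).
    Returns a CPI stack: cycles per instruction attributed to each component.
    """
    if instructions <= 0 or dispatch_width <= 0:
        raise ValueError("instructions and dispatch_width must be positive")

    # Between miss events the core is assumed to dispatch at full width.
    stack = {"base": (instructions / dispatch_width) / instructions}

    # Each miss event adds its penalty; overlap between events is ignored.
    for name, (count, penalty) in miss_events.items():
        stack[name] = (count * penalty) / instructions

    return stack


# Illustrative numbers only; they are not measurements from this project.
example_stack = interval_cpi_stack(
    instructions=1_000_000,
    dispatch_width=4,
    miss_events={
        "branch_mispredict": (5_000, 10),
        "l1d_miss": (20_000, 8),
        "llc_miss": (1_000, 200),
    },
)
total_cpi = sum(example_stack.values())
for component, cpi in example_stack.items():
    print(f"{component:18s} {cpi:.3f} CPI ({100 * cpi / total_cpi:.1f}%)")
```

Each entry of the returned dictionary corresponds to one layer of a CPI stack, so the relative size of an entry indicates how much that component contributes to total execution time in this simplified model.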

McPAT: McPAT is the first integrated power, area, and timing modeling framework for multithreaded and multicore/manycore processors. It is designed to work with a variety of processor performance simulators (and thermal simulators, etc.) over a large range of technology generations. McPAT allows a user to specify low-level configuration details, and it provides default values when the user decides to specify only high-level architectural parameters.

Rather than being hardwired to a particular simulator, McPAT uses an XML-based interface to the performance simulator, and an XML parser reads the large XML interface file. This interface allows both the specification of static microarchitecture configuration parameters and the passing of dynamic activity statistics generated by the performance simulator. McPAT can also send runtime power dissipation results back to the performance simulator through the XML-based interface, so that the performance simulator can react to power or even temperature data. This approach makes McPAT very flexible and easy to port to other performance simulators. Because McPAT provides complete hierarchical models from the architecture level to the technology level, the XML interface also contains circuit implementation style and technology parameters that are specific to a particular target processor. Examples are array types, crossbar types, and CMOS technology generations with associated voltage and device types.

The key components of McPAT are (1) the hierarchical power, area, and timing models, (2) the optimizer for determining circuit-level implementations, and (3) the internal chip representation that drives the analysis of power, area, and timing. Most of the parameters in the internal chip representation, such as cache capacity and core issue width, are set directly by the input parameters. McPAT's hierarchical structure allows it to model structures at a low level, including the underlying device technology, while still allowing an architect to focus on a high-level architectural configuration. The optimizer determines missing parameters in the internal chip representation and focuses on two major regular structures: interconnects and arrays. For example, the user can specify the frequency and bisection bandwidth of on-chip interconnects, or the capacity, associativity, and number of cache banks, while letting the tool determine implementation details such as the choice of metal planes, the effective signal wiring pitch for the interconnect, or the length of the wordlines and bitlines of the cache bank. These optimizations lessen the burden on the architect to figure out every detail and significantly lower the learning curve of the tool. Users always have the flexibility to turn off these features and set the circuit-level implementation parameters themselves.
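
To make the XML-based interface more concrete, the sketch below builds a very small interface fragment in Python. It assumes the nested `component` / `param` / `stat` layout used by McPAT's template files; the helper name `build_interface` is hypothetical, the parameter and statistic names and values are illustrative only, and a real input file contains many more entries that should be taken from the templates shipped with McPAT.

```python
import xml.etree.ElementTree as ET


def build_interface(core_params, core_stats):
    """Build a small McPAT-style XML tree from two dictionaries (sketch)."""
    root = ET.Element("component", id="root", name="root")
    system = ET.SubElement(root, "component", id="system", name="system")
    core = ET.SubElement(system, "component", id="system.core0", name="core0")

    # Static microarchitecture configuration goes into <param> elements.
    for name, value in core_params.items():
        ET.SubElement(core, "param", name=name, value=str(value))

    # Dynamic activity counts from the performance simulator go into <stat>.
    for name, value in core_stats.items():
        ET.SubElement(core, "stat", name=name, value=str(value))

    return root


# Hypothetical values for illustration only.
tree_root = build_interface(
    core_params={"clock_rate": 2660, "issue_width": 4},
    core_stats={"total_instructions": 1_000_000, "total_cycles": 660_000},
)
xml_text = ET.tostring(tree_root, encoding="unicode")
print(xml_text)
```

Keeping configuration (`param`) and activity (`stat`) in the same tree is what lets a performance simulator regenerate the file after each run without McPAT being tied to that simulator.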

Methodology

In this project, we developed a methodology for finding the best single-ISA heterogeneous multicore processor. We considered the x86 architecture with a Nehalem base microarchitecture and varied the D-cache, I-cache, dispatch width, and window size. The parameters and the values considered are shown in Figure 4. Varying these parameters yielded 144 single-core configurations, which can be combined into about 17 million heterogeneous 4-core processors (every choice of four distinct cores out of 144). Simulating that many processors would take a very long time, so we reduced the design space by examining the power/performance plots of the 144 single-core simulations. These plots showed that some of the cores had nearly identical power/performance characteristics. We therefore grouped the cores into three buckets and chose 10 cores that cover the whole design space. This resulted in 210 heterogeneous 4-core processors, which were simulated and compared with the corresponding homogeneous processors. Figure 5 shows the methodology used for the project.
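
The design-space sizes quoted above can be checked with a short calculation. The sketch assumes that a heterogeneous processor is an unordered choice of four distinct core configurations, which is the interpretation that reproduces the stated counts; the labels `c0` to `c9` are hypothetical stand-ins for the ten selected cores.

```python
from itertools import combinations
from math import comb

SINGLE_CORE_CONFIGS = 144   # single-core configurations simulated
SELECTED_CORES = 10         # cores kept after bucketing
CORES_PER_CHIP = 4          # cores per multiprocessor

# Number of ways to pick four distinct cores, ignoring order.
full_space = comb(SINGLE_CORE_CONFIGS, CORES_PER_CHIP)
reduced_space = comb(SELECTED_CORES, CORES_PER_CHIP)

# Hypothetical labels c0..c9 stand in for the ten selected cores.
core_ids = [f"c{i}" for i in range(SELECTED_CORES)]
heterogeneous_chips = list(combinations(core_ids, CORES_PER_CHIP))
homogeneous_chips = [(core,) * CORES_PER_CHIP for core in core_ids]

print(full_space)       # size of the unreduced heterogeneous design space
print(reduced_space)    # size after reducing to ten cores
print(len(heterogeneous_chips), len(homogeneous_chips))
```

The unreduced space contains 17,178,876 combinations, which is why the reduction to ten representative cores (210 combinations, plus 10 homogeneous baselines) was needed to keep the simulation time manageable.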

Analysis

We took the highest-performing homogeneous processor and the corresponding highest-performing heterogeneous processor and compared them for all the benchmarks. Figure 17 and Figure 18 show that the highest-performing heterogeneous processors deliver better performance at lower power than the corresponding homogeneous processors, except for the LU benchmark. Similarly, we compared the lowest-power homogeneous and heterogeneous processors for all the benchmarks. Figure 19 and Figure 20 show that the heterogeneous processors deliver better performance at slightly higher power than the homogeneous processors, except for the LU benchmark, which is an outlier.
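
The selection and comparison step described above can be sketched as follows. This is an illustrative outline rather than the scripts used in the project: the `ChipResult` record, the helper names, and every number are hypothetical placeholders and are not results from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipResult:
    name: str          # hypothetical configuration label
    kind: str          # "homogeneous" or "heterogeneous"
    runtime_s: float   # simulated execution time (lower is faster)
    power_w: float     # average power reported by the power model


def pick(results, kind, key):
    """Return the result of the given kind that minimizes `key`."""
    candidates = [r for r in results if r.kind == kind]
    if not candidates:
        raise ValueError(f"no results of kind {kind!r}")
    return min(candidates, key=key)


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Placeholder numbers for one benchmark; NOT results from this study.
results = [
    ChipResult("hom_a", "homogeneous", 1.0, 40.0),
    ChipResult("hom_b", "homogeneous", 1.4, 25.0),
    ChipResult("het_a", "heterogeneous", 0.9, 38.0),
    ChipResult("het_b", "heterogeneous", 1.2, 27.0),
]

# Highest-performing processor of each kind (minimum runtime).
fast_hom = pick(results, "homogeneous", key=lambda r: r.runtime_s)
fast_het = pick(results, "heterogeneous", key=lambda r: r.runtime_s)

# Lowest-power processor of each kind.
low_hom = pick(results, "homogeneous", key=lambda r: r.power_w)
low_het = pick(results, "heterogeneous", key=lambda r: r.power_w)

print(fast_het.name, percent_change(fast_het.runtime_s, fast_hom.runtime_s))
print(low_het.name, percent_change(low_het.power_w, low_hom.power_w))
```

Repeating this per benchmark gives one pair of bars per benchmark for each of the four comparison figures.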

Milestone Updates

In Milestone 1, we started with a project proposal to simulate the power and performance characteristics of heterogeneous-ISA multicore processors. Because few simulators support heterogeneous-ISA simulation, it was suggested that we shift our focus to single-ISA heterogeneous multicore processors. We first tried to simulate single-ISA heterogeneous processors on gem5, but we faced several difficulties. The major one was that little documentation is available for simulating multicore processors on gem5. In addition, because gem5 is a cycle-accurate simulator, simulating a large number of processors would take a very long time. Finally, no McPAT integration was available for gem5, which made power analysis of the processors hard. We therefore decided to use Sniper, which has McPAT integrated and is well documented.

References

[1] Rakesh Kumar, Dean M. Tullsen, and Norman P. Jouppi. 2006. Core architecture optimization for heterogeneous chip multiprocessors. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT ’06). ACM, New York, NY, USA, 23-32. DOI: http://dx.doi.org/10.1145/1152154.1152162

[2] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. 2004. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. SIGARCH Comput. Archit. News, vol. 32, no. 2, pp. 64–, March 2004.

[3] N. Chitlur, G. Srinivasa, S. Hahn, P. K. Gupta, D. Reddy, D. Koufaty, P. Brett, A. Prabhakaran, Li Zhao, N. Ijih, S. Subhaschandra, S. Grover, Xiaowei Jiang, and R. Iyer. 2012. QuickIA: Exploring heterogeneous architectures on real prototypes. In 2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pp. 1-8, June 2012. DOI: 10.1109/HPCA.2012.6169046

[4] Sheng Li, Jung Ho Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), ISSN 1072-4451, pp. 469-480, 2009.

[5] R. Kumar, K. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. 2003. Processor power reduction via single-ISA heterogeneous multi-core architectures. Computer Architecture Letters, vol. 2, no. 1, pp. 2-2, January-December 2003. DOI: 10.1109/L-CA.2003.6

Figure 1

Processor Topology

Figure 2

Heterogeneous Processors

Benchmarks

The SPLASH-2 suite is one of the most widely used collections of multithreaded workloads. It is composed of eleven workloads, three of which come in two implementations that feature different optimizations. The majority of the workloads belong to the high-performance computing domain. The individual benchmarks used for the simulations are described in Figure 3.

Results

Figure 6 and Figure 7 show the simulated performance and power of the single cores built from the configurations considered. We simulated 144 single cores with the SPLASH-2 benchmarks. As can be seen, three distinct buckets exist in the single-core power/performance results. We chose 10 cores from these three buckets. Figure 8 shows the configurations of those 10 cores.

From the 10 chosen cores, we created 10 homogeneous 4-core processors and 210 heterogeneous 4-core processors. Figure 9 and Figure 10 show the performance and power characteristics of the homogeneous processors. Figure 11 and Figure 12 show the performance and power characteristics of the heterogeneous processors for the SPLASH-2 benchmarks. Figure 13 and Figure 14 show the percentage energy distribution across the components of the highest-performing processor for the Cholesky benchmark: Figure 13 for the homogeneous processor and Figure 14 for the heterogeneous processor. Figure 15 and Figure 16 show the CPI stack of the highest-performing processor for the Cholesky benchmark: Figure 15 for the homogeneous processor and Figure 16 for the heterogeneous processor.

Figure 3

Benchmarks used for the Simulation

Figure 4

Parameters and possible values considered

Figure 5

Methodology

Figure 6

Single Core Performance

Figure 7

Single Core Power

Figure 8

Parameters of cores selected from 144 cores representing different points in Power/Performance space

Figure 9

Homogeneous Performance

Figure 10

Homogeneous Power

Figure 11

Heterogeneous Performance

Figure 12

Heterogeneous Power

Figure 13

Power Stack for Homogeneous Processor

Figure 14

Power Stack for Heterogeneous Processor

Figure 15

CPI Stack for Homogeneous Processor

Figure 16

CPI stack for Heterogeneous Processor

Figure 17

Performance Comparison for highest performing processors

Figure 18

Power Comparison for highest performing processors

Figure 19

Performance Comparison for lowest power processors

Figure 20

Power Comparison for lowest power processors

Future Work

We can implement better scheduling policies to maximize the utilization of each core. We can also use dynamic voltage and frequency scaling to reduce power consumption according to the different phases of a program. The design space can also be enlarged by varying more microarchitectural parameters. This will in turn increase the number of heterogeneous processors, and a more exhaustive search can then be used to find the best-performing processor.

Samarthm, Gauravsr

Energy Aware Computing
