Levioso: Efficient Compiler-Informed Secure Speculation

Ali Hajiabadi, Archit Agarwal, Andreas Diavastos, and Trevor E. Carlson
Conference In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC '24). Association for Computing Machinery, New York, NY, USA, Article 8, 1–6, 2024

Abstract

Spectre-type attacks have exposed a major class of vulnerabilities arising from speculative execution of instructions, the main performance enabler of modern CPUs. These attacks speculatively leak secrets that have been either speculatively loaded (seen in sandboxed programs) or non-speculatively loaded (seen in constant-time programs). Various hardware-only defenses have been proposed to mitigate both speculative and non-speculative secrets via all potential transmission channels. However, limited program knowledge is exposed to the hardware, and these solutions conservatively restrict the execution of all instructions that can potentially leak. In this work, we show that not all instructions depend on older unresolved branches, and that such instructions can safely execute without leaking speculative information. We present Levioso, a novel hardware/software co-design that provides comprehensive secure speculation guarantees while reducing performance overhead compared to existing defenses. Levioso informs the hardware about true branch dependencies and applies restrictions only when necessary. Our evaluations demonstrate that Levioso significantly reduces the performance overhead compared to two prior defenses, from 51% and 43% to just 23%.
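
As a rough illustration of the idea (not Levioso's actual analysis or hardware interface), the Python sketch below restricts an instruction only while a branch it truly depends on is still unresolved, instead of stalling on any older unresolved branch; the toy dependence sets are hypothetical.

```python
# Illustrative sketch only (not Levioso's compiler pass or ISA interface):
# given per-instruction sets of branches each instruction truly depends on,
# restrict execution only while one of *those* branches is still unresolved.

# Hypothetical toy program: instruction id -> set of branch ids it depends on.
TRUE_BRANCH_DEPS = {
    "i1": set(),          # independent of all branches: always safe to execute
    "i2": {"b0"},         # must wait only for branch b0
    "i3": {"b0", "b1"},   # must wait for both b0 and b1
}

def safe_to_execute(instr, unresolved_branches):
    """An instruction may execute safely when none of the branches it truly
    depends on is still unresolved."""
    return TRUE_BRANCH_DEPS[instr].isdisjoint(unresolved_branches)

if __name__ == "__main__":
    unresolved = {"b1"}   # b0 resolved, b1 still pending
    for instr in TRUE_BRANCH_DEPS:
        print(instr, "executes" if safe_to_execute(instr, unresolved) else "restricted")
```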

PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance

Yun Chen, Ali Hajiabadi, Romain Poussier, Yaswanth Tavva, Andreas Diavastos, Shivam Bhasin, and Trevor E. Carlson
Journal ACM Transactions on Architecture and Code Optimization, 2024

Abstract

Power side-channel attacks exploit the correlation of power consumption with the instructions and data being processed to extract secrets from a device (e.g., cryptographic keys). Prior work primarily focused on protecting small embedded micro-controllers and in-order processors rather than high-performance, out-of-order desktop and server CPUs. In this paper, we present Paradise, a general-purpose out-of-order processor with always-on protection, that implements a novel dynamic instruction scheduler to provide obfuscated execution and mitigate power analysis attacks. To achieve this, we exploit the time between operand availability of critical instructions (slack) and create high-performance random schedules. Further, we highlight the dangers of using incorrect adversarial assumptions, which can often lead to a false sense of security. Therefore, we perform an extended security analysis on AES-128 using different levels of adversaries, from basic to advanced, including a CNN-based attack. Our advanced security evaluation assumes a strong adversary with full knowledge of the countermeasure and demonstrates a significant security improvement of 556× when combined with Boolean Masking over a baseline only protected by masking, and 62,500× over an unprotected baseline. The resulting overhead in performance, power and area of Paradise is, and respectively.
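
The following Python sketch conveys the general flavour of scheduling within slack (it is not the Paradise scheduler): each instruction issues at a random cycle within its slack window, so zero-slack (critical) instructions are never delayed. The instruction stream and slack values are hypothetical.

```python
import random

# Illustrative sketch only (not the Paradise scheduler): each instruction has an
# operand-ready cycle and a slack (cycles it can be delayed without stretching
# the critical path). Issuing at a random cycle inside that window obfuscates
# the power trace while never delaying a zero-slack (critical) instruction.

# Hypothetical instruction stream: (name, ready_cycle, slack_cycles)
STREAM = [("load_key", 0, 0), ("xor_rk", 1, 3), ("sbox", 2, 2), ("store_ct", 5, 0)]

def randomized_issue_schedule(stream, rng=random):
    schedule = {}
    for name, ready, slack in stream:
        schedule[name] = ready + rng.randint(0, slack)  # critical ops keep slack == 0
    return schedule

if __name__ == "__main__":
    random.seed()  # a different schedule on every run
    print(randomized_issue_schedule(STREAM))
```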

Secure Run-Time Hardware Trojan Detection Using Lightweight Analytical Models

B. Amornpaisannon, Andreas Diavastos, L. S. Peh and T. E. Carlson
Journal IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 43, no. 2, Feb. 2024

Abstract

Hardware Trojans, malicious components that attempt to prevent a chip from operating as expected, are carefully crafted to circumvent detection during the predeployment silicon design and verification stages. They are an emerging threat being investigated by academia, the military, and industry. Therefore, run-time hardware Trojan detection is critically needed as the final layer of defense during chip deployment, and in this work, we focus on hardware Trojans that target the processor’s performance. Current state-of-the-art detectors watch hardware counters for anomalies using complex machine-learning models, which require a dedicated off-chip processor and must be trained extensively for each target processor. In this work, we propose a lightweight solution that uses data from a single reference run to accurately determine whether a Trojan is slowing processor performance, across CPU configurations, without the need for new profiles. To accomplish this, we use an analytical model based on the application’s inherent microarchitecturally independent characteristics. Such models determine the expected microarchitectural events across different processor configurations without requiring reference values for each application-hardware configuration pair. By comparing predicted values to actual hardware events, one can quickly check for unexpected application slowdowns that are the key signatures of many hardware Trojans. The proposed methodology achieves a higher true positive rate (TPR) compared to prior works while having no false positives. The proposed detector incurs no run-time performance penalty and only adds a negligible power overhead of 0.005%.
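
A minimal sketch of the detection check, assuming a single model-predicted event count and a fixed tolerance (the model, counter values, and threshold below are hypothetical, not the paper's):

```python
# Illustrative sketch only (not the paper's analytical model): flag a possible
# performance-degrading Trojan when the measured cycle count exceeds the model's
# prediction by more than an allowed modeling error.

def trojan_suspected(predicted_cycles, measured_cycles, tolerance=0.10):
    """True if the observed slowdown exceeds the allowed modeling error."""
    return measured_cycles > predicted_cycles * (1.0 + tolerance)

if __name__ == "__main__":
    predicted = 1_000_000          # cycles the analytical model expects for this run
    for measured in (1_050_000, 1_400_000):
        verdict = "suspicious" if trojan_suspected(predicted, measured) else "clean"
        print(f"measured={measured}: {verdict}")
```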

3DRA: Dynamic Data-Driven Reconfigurable Architecture

J. Lee, B. Amornpaisannon, Andreas Diavastos, and T. E. Carlson
Journal IEEE Access, 2023

Abstract

Specialized accelerators are becoming a standard way to achieve both high-performance and efficient computation. We see this trend extending to all areas of computing, from low-power edge-computing systems to high-performance processors in datacenters. Reconfigurable architectures, such as Coarse-Grained Reconfigurable Arrays (CGRAs), attempt to find a balance between performance and energy efficiency by trading off dynamism, flexibility, and programmability. Our goal in this work is to find a new solution that provides the flexibility of traditional CPUs, with the parallelism of a CGRA, to improve overall performance and energy efficiency. Our design, the Dynamic Data-Driven Reconfigurable Architecture (3DRA), is unique in that it targets both low-latency and high-throughput workloads. This architecture implements a dynamic dataflow execution model that resolves data dependencies at run-time and utilizes non-blocking broadcast communication that reduces transmission latency to a single cycle to achieve high performance and energy efficiency. By employing a dynamic model, 3DRA eliminates costly mapping algorithms during compilation and improves the flexibility and compilation time of traditional CGRAs. The 3DRA architecture achieves up to 731 MIPS/mW, and it improves performance by up to 4.43x compared to the current state-of-the-art CGRA-based accelerators.

XFeatur: Hardware Feature Extraction for DNN Auto-tuning

J. S. Acosta, Andreas Diavastos, and Antonio Gonzalez
Conference IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2022

Abstract

In this work, we extend the auto-tuning process of the state-of-the-art TVM framework with XFeatur, a tool that extracts new meaningful hardware-related features that improve the quality of the representation of the search space and consequently improve the accuracy of its prediction algorithm. These new features provide information about the amount of thread-level parallelism, shared memory usage, register usage, dynamic instruction count, and memory access dependencies. Optimizing ResNet-18 with the proposed features improves the quality of the search space representation by 63% on average and a maximum of 2× for certain tasks, while it reduces the tuning time by 9% (approximately 1.1 hours) and produces configurations that have equal or better performance (up to 92.7%) than the baseline.

Efficient Instruction Scheduling Using Real-time Load Delay Tracking

Andreas Diavastos and Trevor E. Carlson
Journal ACM Transactions on Computer Systems (TOCS), 2022

Abstract

Issue time prediction processors use dataflow dependencies and predefined instruction latencies to predict issue times of repeated instructions. In this work, we make two key observations: (1) memory accesses often take longer to complete than the static, predefined access latency used to describe these systems, due to contention in the memory hierarchy and variability in DRAM access times, and (2) these memory access delays often repeat across iterations of the same code. We propose a new processor microarchitecture that replaces a complex reservation-station-based scheduler with an efficient, scalable alternative. Our scheduling technique tracks real-time delays of loads to accurately predict instruction issue times and uses a reordering mechanism to prioritize instructions based on that prediction. To accomplish this in an energy-efficient manner we introduce (1) an instruction delay learning mechanism that monitors repeated load instructions and learns their latest delay, (2) an issue time predictor that uses learned delays and dataflow dependencies to predict instruction issue times, and (3) priority queues that reorder instructions based on their issue time prediction. Our processor achieves 86.2% of the performance of a traditional out-of-order processor, higher than previous efficient scheduler proposals, while consuming 30% less power.
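
The sketch below illustrates the three mechanisms in a simplified software form: a learned-delay table keyed by load PC, an issue-time predictor driven by dataflow dependencies, and a priority queue that releases instructions in predicted-issue-time order. The instruction format, latencies, and delays are hypothetical, not the paper's microarchitecture.

```python
import heapq

# Illustrative sketch only (not the proposed hardware): learned load delays,
# dataflow-based issue-time prediction, and reordering by predicted issue time.

learned_load_delay = {}            # load PC -> last observed extra delay (cycles)
BASE_LATENCY = {"load": 4, "add": 1, "mul": 3}

def predict_issue_time(instr, ready_time_of):
    """Predicted issue time = when all source operands are predicted ready."""
    return max((ready_time_of[src] for src in instr["srcs"]), default=0)

def predict_ready_time(instr, issue_time):
    lat = BASE_LATENCY[instr["op"]]
    if instr["op"] == "load":
        lat += learned_load_delay.get(instr["pc"], 0)   # add the learned delay
    return issue_time + lat

if __name__ == "__main__":
    learned_load_delay[0x40] = 12                        # this load was slow before
    program = [
        {"pc": 0x40, "op": "load", "dst": "r1", "srcs": []},
        {"pc": 0x44, "op": "add",  "dst": "r2", "srcs": ["r1"]},
        {"pc": 0x48, "op": "mul",  "dst": "r3", "srcs": []},
    ]
    ready, queue = {}, []
    for i, instr in enumerate(program):
        t_issue = predict_issue_time(instr, ready)
        ready[instr["dst"]] = predict_ready_time(instr, t_issue)
        heapq.heappush(queue, (t_issue, i, instr["pc"]))  # reorder by prediction
    while queue:
        t, _, pc = heapq.heappop(queue)
        print(f"issue pc={pc:#x} at predicted cycle {t}")
```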

A fast supervised density-based discretization algorithm for classification tasks in the medical domain

Aristos Aristodimou, Andreas Diavastos, and Constantinos S. Pattichis
Journal Health Informatics Journal (January 2022)

Abstract

Discretization is a preprocessing technique used for converting continuous features into categorical ones. This step is essential for processing algorithms that cannot handle continuous data as input. In addition, in the big data era, it is important for a discretizer to be able to discretize data efficiently. In this paper, a new supervised density-based discretization (DBAD) algorithm is proposed, which satisfies these requirements. For the evaluation of the algorithm, 11 datasets that cover a wide range of datasets in the medical domain were used. The proposed algorithm was tested against three state-of-the-art discretizers using three classifiers with different characteristics. A parallel version of the algorithm was evaluated using two synthetic big datasets. In the majority of the performed tests, the algorithm was found to perform statistically similarly to, or better than, the three discretization algorithms it was compared against. Additionally, the algorithm was faster than the other discretizers in all of the performed tests. Finally, the parallel version of DBAD shows almost linear speedup for a Message Passing Interface (MPI) implementation (9.64× for 10 nodes), while a hybrid MPI/OpenMP implementation improves execution time by 35.3× for 10 nodes and 6 threads per node.

NOREBA: a compiler-informed non-speculative out-of-order commit processor

Ali Hajiabadi, Andreas Diavastos, and Trevor E. Carlson
Conference In the Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (April 2021)

Abstract

Modern superscalar processors execute instructions out-of-order, but commit them in program order to provide precise exception handling and safe instruction retirement. However, in-order instruction commit is highly conservative and holds on to critical resources far longer than necessary, severely limiting the reach of general-purpose processors, ultimately reducing performance. Solutions that allow for efficient, early reclamation of these critical resources could seize the opportunity to improve performance. One such solution is out-of-order commit, which has traditionally been challenging due to inefficient, complex hardware used to guarantee safe instruction retirement and provide precise exception handling. In this work, we present NOREBA, a processor for Non-speculative Out-of-order Retirement via Branch Reconvergence Analysis. In NOREBA, we enable non-speculative out-of-order commit and resource reclamation in a light-weight manner, improving performance and efficiency. We accomplish this through a combination of (1) automatic compiler annotation of true branch dependencies, and (2) an efficient re-design of the reorder buffer from traditional processors. By exploiting compiler branch dependency information, this system achieves 95% of the performance of aggressive, speculative solutions, without any additional speculation, and while maintaining energy efficiency.

Laser Attack Benchmark Suite

Burin Amornpaisannon, Andreas Diavastos, Li-Shiuan Peh and Trevor E. Carlson
Conference In the Proceedings of the 39th IEEE/ACM International Conference on Computer Aided Design (ICCAD) San Diego, CA, USA, 2020

Abstract

Laser fault injection in integrated circuits is a powerful information leakage technique due to its high precision, timing accuracy and repeatability. Countermeasures to these attacks have been studied extensively. However, with most current design flows, security tests against these attacks can only be realized after chip fabrication. Restarting the complete silicon design cycle in order to address these vulnerabilities is thus both time-consuming and costly. To overcome these limitations, this paper proposes an open-source benchmark suite that allows chip designers to simulate laser attacks and evaluate the security of their designs, both hardware-based and software-based, against laser fault injection early on during design time. The proposed benchmark suite consists of a tool that automatically integrates hardware-based spatial, temporal and hybrid redundancy techniques into a target design. With the tools used in this work, we demonstrate how the attacks can be simulated on a Verilog simulator and run on an FPGA with a design equipped with hardware-based redundancy techniques, without manual modifications. This work consists of four attacks and four hardware-based redundancy techniques. Together, the attacks and defenses that the benchmark suite provides automate the entire early design evaluation flow against laser fault injection attacks.
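
As a rough software analogue of one of the hardware-based defenses the suite integrates (spatial redundancy with majority voting), the Python sketch below masks a single injected fault; it is illustrative only and not part of the benchmark suite.

```python
from collections import Counter

# Illustrative sketch only (not code from the benchmark suite): spatial redundancy
# replicates a computation three times and a majority voter masks a single
# laser-induced fault. The "fault" here is a hypothetical bit flip in one replica.

def majority_vote(values):
    """Return the value produced by the majority of the replicas."""
    return Counter(values).most_common(1)[0][0]

def compute(x):
    return (x * 3 + 1) & 0xFF       # placeholder for the protected logic

if __name__ == "__main__":
    x = 42
    replicas = [compute(x), compute(x), compute(x)]
    replicas[1] ^= 0x10             # inject a fault into one replica
    print("voted result:", majority_vote(replicas), "golden:", compute(x))
```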

SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores

Andreas Diavastos and Pedro Trancoso
Conference HiPEAC 2018 (ACM TACO 14, 3, Article 31) (January 2018)

Abstract

SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop-tasks can be increased to favor data locality, even when there are dependences across different loops. SWITCHES introduces explicit task resource allocation mechanisms for efficient allocation of resources and adopts the latest OpenMP Application Programming Interface (API) so as to maintain high levels of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Performance on an Intel Xeon Phi shows good scalability and surpasses OpenMP by an average of 32%.
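
A minimal Python sketch of the triggering idea (not the SWITCHES runtime itself): each task carries a counter of unresolved producers, and finishing a task decrements its consumers' counters until they become ready. The toy task graph below is hypothetical.

```python
from collections import deque

# Illustrative sketch only (not the SWITCHES runtime): distributed triggering via
# per-task counters of unresolved producers; a task becomes ready at zero.

GRAPH = {          # task -> tasks that consume its output
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}
pending = {t: 0 for t in GRAPH}
for producer, consumers in GRAPH.items():
    for c in consumers:
        pending[c] += 1                      # count unresolved dependences

ready = deque(t for t, n in pending.items() if n == 0)
while ready:
    task = ready.popleft()
    print("executing", task)                 # placeholder for the task body
    for consumer in GRAPH[task]:             # trigger consumers
        pending[consumer] -= 1
        if pending[consumer] == 0:
            ready.append(consumer)
```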

An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits

G. Karakonstantis, K. Tovletoglou, L. Mukhanov, H. Vandierendonck, D. S. Nikolopoulos, P. Lawthers, P. Koutsovasilis, M. Maroudas, C. Antonopoulos, C. Kalogirou, N. Bellas, S. Lalis, S. Venugopal, A. Prat-Perez, A. Lampropulos, M. Kleanthous, A. Diavastos, Z. Hadjilambrou, P. Nikolaou, Y. Sazeides, P. Trancoso, G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, D. Gizopoulos, and S. Das
Conference Proceedings of the Design, Automation and Test in Europe (DATE) 2018, Dresden, Germany, March 2018

Abstract

The explosive growth of Internet-connected devices will soon result in a flood of generated data, which will increase the demand for network bandwidth as well as compute power to process the generated data. Consequently, there is a need for more energy-efficient servers to empower traditional centralized Cloud data-centers as well as emerging decentralized data-centers at the Edges of the Cloud. In this paper, we present our approach, which aims at developing a new class of micro-servers - the UniServer - that exceed the conservative energy and performance scaling boundaries by introducing novel mechanisms at all layers of the design stack. The main idea lies in recognizing the intrinsic hardware heterogeneity and developing mechanisms that automatically expose the unique, varying capabilities of each hardware component. Low-overhead schemes are employed to monitor and predict the hardware behavior and report it to the system software. The system software, including a virtualization and resource management layer, is responsible for optimizing the system operation in terms of energy or performance, while guaranteeing non-disruptive operation under the extended operating points. Our characterization results on a 64-bit ARMv8 micro-server in a 28nm process reveal large voltage margins in terms of Vmin variation among the 8 cores of the CPU chip, among three different sigma chips, and among different benchmarks, with the potential to obtain up to 38.8% energy savings. Similarly, DRAM characterizations show that refresh rate and voltage can be relaxed by 35x and 5%, respectively, leading to 23.2% power savings on average.

Task Data-flow Execution on Many-core Systems

Andreas Diavastos
PhD Thesis Department of Computer Science, University of Cyprus, Nicosia, Cyprus, November 2017.

Abstract

Power-performance efficiency in High Performance Computing (HPC) systems is currently achieved by exploiting parallel processing. Consequently, in order to exploit application parallelism and optimize energy efficiency, the trend is to include more cores in the processors. From the hardware perspective, many small and simple cores will be added to processor architectures, leading towards many-core chips with hundreds of cores. Nevertheless, scaling the number of cores alone does not result in improved application performance. From the software perspective, this creates new challenges, as we need a framework that can efficiently exploit application parallelism on the available hardware resources. Overall, performance scalability in future many-core systems will be affected by the following factors: the degree of parallelism, programmability, low-overhead runtime systems, locality-aware execution, efficient use of the available resources and scalable architecture designs. In this thesis, the Task model is used as an implementation of the Data-flow paradigm. Data-flow is the most appropriate model for exploiting large amounts of software parallelism, while a task-based implementation reduces runtime overheads and easily adapts to different applications. In this work, Transactional Memory is integrated into Data-flow to reduce the strictness of the latter by exploiting speculative parallelism when task dependences are too complex to express or not applicable at all. This thesis presents the first many-core implementation of the Data-Driven Multi-threading (DDM) model, a task-based implementation of Data-flow. DDM is redesigned to support the first single-chip many-core processor from Intel (the Single-chip Cloud Computer), which provides a single address space with no hardware support for cache coherence. The results from this work led to the design and development of a new, more efficient, lightweight runtime system that is able to scale to larger many-core processors. The proposed runtime system (called SWITCHES) is included in a complete programming and execution framework. SWITCHES is a software implementation of the task-based Data-flow model for many-core processors. It requires a global address space but not necessarily a hardware cache-coherence mechanism, which could limit the scalability of the architecture. SWITCHES implements a lightweight distributed triggering system for runtime task dependence resolution and uses static scheduling and compile-time assignment policies to reduce overheads. It supports explicit task resource allocation mechanisms and incorporates machine-learning techniques within the framework to efficiently utilize the underlying resources. To maintain high levels of programming productivity, the framework implements the latest API standard from OpenMP (v4.5) and extends it to support variable-granularity loop-tasks with dependences across different loops, so as to favor data locality in loops with inter-dependences. It provides a source-to-source tool that automatically produces thread-based code that can be compiled by any off-the-shelf C/C++ compiler, applying all existing optimizations. Performance evaluation of applications with different characteristics on an Intel Xeon Phi system shows good scalability that surpasses the state-of-the-art by an average of 32%, while resource utilization is improved, with maximum performance achieved using 30% fewer cores for applications with complex dependences.

Unified Data-Flow Platform for General Purpose Many-core Systems

Andreas Diavastos and Pedro Trancoso
Technical Report Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-2, September 2017.

Abstract

The current trend in high performance processor design is to increase the number of cores so as to achieve the desired performance. While having a large number of cores on a chip seems to be feasible in terms of the hardware, the development of the software that is able to exploit that parallelism is one of the biggest challenges. In this work we propose a Data-Flow based system that can be used to exploit the parallelism in large-scale many-core processors in an efficient way. We propose to use a Data-Flow system that can be adapted to the two major many-core implementations: clustered and shared memory. In particular, we test the original TFlux Data-Driven Multithreading platform on the 61-core Intel Xeon Phi processor and extend the platform for execution on the 48-core Intel Single-chip Cloud Computer (SCC) processor. Given the modularity of the TFlux platform, changes to support the new architecture are made at the back-end level and thus the application does not need to be re-written in order to execute on different systems. The experiments are performed on real off-the-shelf systems executing a Data-Flow software runtime. While different approaches and optimizations are chosen to exploit the different characteristics of each architecture, the evaluation shows good scalability for a common set of applications on both architectures.

Data-Driven Multithreading Programming Tool-chain

Andreas Diavastos, George Matheou, Paraskevas Evripidou and Pedro Trancoso
Technical Report Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-3, September 2017.

Abstract

The increasing parallelism offered by the parallel architectures introduced by processor vendors, coupled with the need to extract more parallelism out of applications, has led the community to examine more efficient programming and execution models. The Dataflow Multithreading model is known to exploit the most parallelism out of a wide range of applications. The model is not new to the community, and one of the main reasons it was not widely adopted is the programming effort required to exploit the maximum parallelism out of a Dataflow implementation. The widespread availability of parallel machines today, however, has led many researchers to re-examine its applicability. In this work we explore the programmability of the Data-Driven Multithreading (DDM) model, a non-blocking execution model that applies the principles of Dataflow execution at a coarser granularity, that of sequences of instructions. In this paper we present a tool-chain that eases the effort needed by programmers to write efficient DDM applications that exploit the Dataflow execution model, targeting in particular two implementations of the DDM model: the TFlux architecture and the Data-Driven Multithreading Virtual Machine (DDM-VM). The tool-chain includes a set of programming directives developed under the DDM C Preprocessor project, and an Eclipse plug-in suite that provides content assist when integrating directives into applications, together with a side panel listing the available directives, their arguments, and an explanation of each. Finally, the preprocessing procedure is integrated into the Eclipse editor to enable easy DDM application development regardless of the platform used.

Auto-tuning Static Schedules for Task Data-flow Applications

Andreas Diavastos and Pedro Trancoso
Workshop Proceedings of the 1st ACM Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (ANDARE), Portland, USA, September 2017.

Abstract

Scheduling task-based parallel applications on many-core processors is becoming more challenging and has received significant attention recently. The main challenge is to efficiently map the tasks to the underlying hardware topology, using application characteristics such as the dependences between tasks, in order to satisfy the application's requirements. To achieve this, each application must be studied exhaustively to determine how its data are used by the different tasks, knowledge that allows tasks sharing the same data to be mapped close to each other. In addition, different hardware topologies will require different mappings of the same application to produce the best performance. In this work we use the synchronization graph of a task-based parallel application, produced during compilation, to automatically tune the scheduling policy for any underlying hardware using heuristic-based Genetic Algorithm techniques. This tool is integrated into an actual task-based parallel programming platform called SWITCHES and is evaluated using real applications from the SWITCHES benchmark suite. We compare our results with the execution times of predefined schedules within SWITCHES and observe that the tool can converge close to an optimal solution with no effort from the user and using fewer resources.
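
The sketch below shows the flavour of such a heuristic search (it is not the paper's tool): a small genetic algorithm evolves task-to-core assignments over a hypothetical synchronization graph, scoring candidates with a toy cost that penalizes load imbalance and cross-core dependence edges.

```python
import random

# Illustrative sketch only (not the paper's auto-tuner): GA over task-to-core
# assignments. Task costs, dependence edges, and the fitness weights are hypothetical.

TASK_COST = {"t0": 4, "t1": 2, "t2": 2, "t3": 4, "t4": 1}
EDGES = [("t0", "t1"), ("t0", "t2"), ("t1", "t3"), ("t2", "t3"), ("t3", "t4")]
TASKS, CORES = list(TASK_COST), 2

def fitness(assign):                      # lower is better
    loads = [0.0] * CORES
    for t, c in assign.items():
        loads[c] += TASK_COST[t]
    imbalance = max(loads) - min(loads)
    comm = sum(1 for a, b in EDGES if assign[a] != assign[b])
    return imbalance + comm

def random_assignment():
    return {t: random.randrange(CORES) for t in TASKS}

def crossover(a, b):
    return {t: (a[t] if random.random() < 0.5 else b[t]) for t in TASKS}

def mutate(assign, rate=0.1):
    return {t: (random.randrange(CORES) if random.random() < rate else c)
            for t, c in assign.items()}

def evolve(generations=50, pop_size=20):
    pop = [random_assignment() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]               # simple truncation selection
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print("best schedule:", best, "cost:", fitness(best))
```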

An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits

K. Tovletoglou, C. Chalios, G. Karakonstantis, L. Mukhanov, H. Vandierendonck, D. S. Nikolopoulos, P. Koutsovasilis, M. Maroudas, C. Antonopoulos, C. Kalogirou, N. Bellas, S. Lalis, M. M. Rafique, S. Venugopal, A. Prat-Perez, A. Diavastos, Z. Hadjilambrou, P. Nikolaou, Y. Sazeides, P. Trancoso, G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, and D. Gizopoulos
Workshop Proceedings of the Energy-efficient Servers for Cloud and Edge Computing Workshop (ENeSCE 2017) co-located with HiPEAC, Stockholm, Sweden, January 2017.

Abstract

The explosive growth of Internet-connected devices will soon result in a flood of generated data, which will increase the demand for network bandwidth as well as compute power to process the generated data. Consequently, there is a need for more energy-efficient servers to empower traditional centralized Cloud data-centers as well as emerging decentralized data-centers at the Edges of the Cloud. In this paper, we present our approach, which aims at developing a new class of micro-servers - the UniServer - that exceed the conservative energy and performance scaling boundaries by introducing novel mechanisms at all layers of the design stack. The main idea lies in recognizing the intrinsic hardware heterogeneity and developing mechanisms that automatically expose the unique, varying capabilities of each hardware component. Low-overhead schemes are employed to monitor and predict the hardware behavior and report it to the system software. The system software, including a virtualization and resource management layer, is responsible for optimizing the system operation in terms of energy or performance, while guaranteeing non-disruptive operation under the extended operating points. Our characterization results on a 64-bit ARMv8 micro-server in a 28nm process reveal large voltage margins in terms of Vmin variation among the 8 cores of the CPU chip, among three different sigma chips, and among different benchmarks, with the potential to obtain up to 38.8% energy savings. Similarly, DRAM characterizations show that refresh rate and voltage can be relaxed by 35x and 5%, respectively, leading to 23.2% power savings on average.

SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores

Andreas Diavastos and Pedro Trancoso
Journal ACM Transactions on Architecture and Code Optimization (TACO) 14, 3, Article 31 (September 2017)

Abstract

SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop-tasks can be increased to favor data locality, even when there are dependences across different loops. SWITCHES introduces explicit task resource allocation mechanisms for efficient allocation of resources and adopts the latest OpenMP Application Programming Interface (API) so as to maintain high levels of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Performance on an Intel Xeon Phi shows good scalability and surpasses OpenMP by an average of 32%.

Exploiting Very-Wide Vectors on Intel Xeon Phi with Lattice-QCD kernels

Andreas Diavastos, Giannos Stylianou and Giannis Koutsou
Conference Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Heraklion Crete, Greece, February 2016

Abstract

Our target in this work is to study ways of exploiting the parallelism offered by vectorization on accelerators with very wide vector units. To this end, we implemented two kernels derived from the Wilson Dslash operator and investigated several data-layout techniques for increasing the scalability of lattice QCD scientific kernels suitable for the Intel Xeon Phi. In parts of the application where real numbers are used for computation, we see a 6.6x increase in bandwidth compared to scalar code, thanks to the compiler's auto-vectorization. In other kernels, where arithmetic operations on complex numbers dominate, our hand-vectorized code outperforms the auto-vectorization of the compiler. In this paper we find that our proposed Hopping Vector-friendly Ordering allows for more efficient vectorization of complex arithmetic floating-point operations. Using this data layout, we manage to increase the sustained bandwidth by approximately 1.8x.
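
As a rough illustration of why a vector-friendly layout helps (this is not the paper's Hopping Vector-friendly Ordering), the NumPy sketch below keeps real and imaginary parts in separate contiguous arrays so that a complex multiply becomes plain, easily vectorized real arithmetic.

```python
import numpy as np

# Illustrative sketch only (not the paper's data layout): a structure-of-arrays
# layout with split real/imaginary parts turns complex multiplication into real
# arithmetic over long contiguous vectors, which vectorizes straightforwardly.

def complex_mul_soa(ar, ai, br, bi):
    """(ar + i*ai) * (br + i*bi), element-wise, on split real/imag arrays."""
    return ar * br - ai * bi, ar * bi + ai * br

if __name__ == "__main__":
    n = 1 << 16
    rng = np.random.default_rng(0)
    a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    cr, ci = complex_mul_soa(a.real.copy(), a.imag.copy(),
                             b.real.copy(), b.imag.copy())
    assert np.allclose(cr + 1j * ci, a * b)   # same result as the interleaved layout
    print("SoA complex multiply matches the reference")
```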

Exploiting Very-Wide Vector Processing for Scientific Applications

Andreas Diavastos, Giannos Stylianou and Giannis Koutsou
Journal Computing in Science & Engineering (CiSE), Vol. 17, no. 6, Nov/Dec 2015

Abstract

Exploiting the recently introduced very-wide vector units of the Xeon Phi coprocessor can potentially increase the scalability of scientific applications. Using lattice QCD compute kernels, the authors find that the performance achieved using the Xeon Phi coprocessor's wide vector units is similar to GPGPU performance after appropriate code refactoring, requiring moderate programming effort.

Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform

Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson
Journal International Journal of Parallel Programming, 2015

Abstract

The introduction of multi-core processors has renewed interest in programming models which can efficiently exploit general-purpose parallelism. Data-Flow is one such model, which has demonstrated significant potential in the past. However, it is generally associated with functional styles of programming which do not deal well with shared mutable state. There have been a number of attempts to introduce state into Data-Flow models and functional languages, but none have proved able to maintain the simplicity and efficiency of pure Data-Flow parallelism. Transactional memory is a concurrency control mechanism that simplifies sharing data when developing parallel applications while at the same time promising to deliver affordable performance. In this paper we report our experience of integrating Transactional Memory and Data-Flow within the TFlux Platform. The ability of the Data-Flow model to expose large amounts of parallelism is maintained while Transactional Memory provides simplified sharing of mutable data in those circumstances where it is important to the expression of the program. The isolation property of transactions ensures that the exploitation of Data-Flow parallelism is not compromised. In this study we extend the TFlux platform, a Data-Driven Multi-threading implementation, to support transactions. We achieve this by proposing new pragmas that allow the programmer to specify transactions. In addition we extend the runtime functionality by integrating a software transactional memory library with TFlux. To test the proposed system, we ported two applications that require transactional memory: Random Counter, and Labyrinth, an implementation of Lee's parallel routing algorithm. Our results show good opportunities for scaling when using the integration of the two models.

TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow

Andreas Diavastos, Giannos Stylianou and Pedro Trancoso
Conference Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Turku, Finland, March 2015.

Abstract

The current trend in processor design is to increase the number of cores so as to achieve a desired performance. While having a large number of cores on a chip seems to be feasible in terms of the hardware, the development of the software that is able to exploit that parallelism is one of the biggest challenges. In this paper we propose a Data-Flow based system that can be used to exploit the parallelism in large-scale many-core processors in an effective and efficient way. Our proposed system - TFluxSCC - is an extension of the TFlux Data-Driven Multithreading (DDM) platform, which evolved to exploit the parallelism of the 48-core Intel Single-chip Cloud Computer (SCC) processor. With TFluxSCC we achieve scalable performance using a global address space without the need for cache-coherency support. Our scalability study shows that application performance can scale, with speedup results reaching up to 48x for 48 cores. The findings of this work provide insight into what a Data-Flow implementation requires, and does not require, from a many-core architecture in order to scale performance.

Scalability and Efficiency of Database Queries on Future Many-core Systems

Panayiotis Petrides, Andreas Diavastos, Constantinos Christofi and Pedro Trancoso
Conference Proceedings of the 21st Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Belfast, Northern Ireland, 2013.

Abstract

Decision Support System (DSS) workloads are known to be among the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we exploit the benefits of using future many-core architectures, more specifically on-chip clustered many-core architectures. To achieve this goal we propose different representative data-parallel versions of the original database scan and join algorithms. We also study the impact on performance when on-chip memory, shared among all cores, is used as a prefetching buffer. For our experiments we study the behaviour of three queries from the standard DSS benchmark TPC-H executing on the Intel Single-chip Cloud Computer (SCC) experimental processor. Our results show that parallelism can be well exploited by such architectures and how important it is to have a balance between computation and data intensity. Moreover, our experimental results show performance improvements of 5x and 10x over the corresponding query implementations without data prefetching. Finally, we show how the system can be used to achieve high power-performance efficiency with the proposed prefetching buffer.

LDPC Decoding on the Intel SCC

Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao and Pedro Trancoso
Conference Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Garching, Germany, February 2012.

Abstract

Low-Density Parity-Check (LDPC) codes are powerful error-correcting codes used today in communication standards such as DVB-S2 and WiMAX to transmit data over noisy channels with high error probability. LDPC decoding is computationally demanding and requires irregular accesses to memory, which makes it suitable for parallelization. The recent introduction of the many-core Single-chip Cloud Computer (SCC) from Intel Labs has created new opportunities, and also new challenges, for programmers that wish to conveniently exploit the high level of parallelism available in the architecture. In this paper we propose three different implementations of LDPC decoding algorithms (a distributed-memory, a shared-memory, and a multi-codeword implementation) that explore the scaling opportunities of the Intel SCC. From the experimental results we observed that the distributed memory model could not scale due to the large number of messages exchanged by the parallel kernels, while the shared memory model had limited scaling due to the overhead added by the uncacheable shared memory. On the other hand, the multi-codeword implementation scales almost linearly, achieving a relative throughput of 28 for 32 cores.
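
A minimal Python sketch of the multi-codeword idea (not the paper's SCC implementations): whole codewords are assigned to different worker processes, so the workers exchange no messages during decoding. The decoder body is only a placeholder.

```python
from multiprocessing import Pool

# Illustrative sketch only (not the paper's SCC code): multi-codeword parallelism
# assigns whole codewords to different cores, avoiding inter-core messages.

def decode(codeword):
    # Placeholder: a real LDPC decoder would run iterative belief propagation here.
    return [bit ^ 0 for bit in codeword]          # identity "decode" for illustration

if __name__ == "__main__":
    codewords = [[i % 2] * 64 for i in range(128)]    # hypothetical received codewords
    with Pool(processes=4) as pool:                   # one worker per core
        decoded = pool.map(decode, codewords)         # embarrassingly parallel
    print(f"decoded {len(decoded)} codewords independently")
```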

Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform

Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson
Workshop Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM), 2011

Abstract

Multi-core processors have renewed interest in programming models which can efficiently exploit general-purpose parallelism. Data-Flow is one such model, which has demonstrated significant potential in the past. However, it is generally associated with functional styles of programming which do not deal well with shared mutable state. There have been a number of attempts to introduce state into Data-Flow models and functional languages, but none have proved able to maintain the simplicity and efficiency of pure Data-Flow parallelism. Transactional memory is a concurrency control mechanism that simplifies sharing data when developing parallel applications while at the same time promising to deliver affordable performance. In this paper we report our experience of integrating Transactional Memory and Data-Flow. The ability of the Data-Flow model to expose large amounts of parallelism is maintained while Transactional Memory provides simplified sharing of mutable data in those circumstances where it is important to the expression of the program. The isolation property of transactions ensures that the exploitation of Data-Flow parallelism is not compromised. In this study we extend the TFlux platform, a Data-Driven Multi-threading implementation, to support transactions. We achieve this by proposing new pragmas that allow the programmer to specify transactions. In addition we extend the runtime functionality by integrating a software transactional memory library with TFlux. To test the proposed system, we ported two applications that require transactional memory: Random Counter, and Labyrinth, an implementation of Lee's parallel routing algorithm. Our results show good opportunities for scaling when using the integration of the two models.

Reducing Data Transfer Costs on the Cell/BE Processor

Andreas Diavastos
BSc Thesis Department of Computer Science, University of Cyprus, Nicosia, Cyprus, 2010

Abstract

The Cell Broadband Engine is a processor with a distributed memory architecture, on which the programmer must partition and distribute the data across its constituent cores. This operation is analogous to data distribution in systems built from clusters of computers, where communication between them is achieved over interconnection networks. This kind of design and architecture in a multi-core processor comes at a cost, and that cost is the distribution and transfer of data from main memory to the data-processing cores. For many years, communication and interconnection networks have used data compression to reduce the volume of transferred data and, consequently, the data transfer time. In this work we apply the same technique in an effort to reduce the volume of data transferred from main memory to the processor's processing units, the cores, with the ultimate goal of reducing the cost of data transfers within the chip. Using data compression we reduce the volume of data during transfer, and we then apply data decompression on the processing cores so as to recover the original data without any loss and run our application. Naturally, we had to use applications that process large volumes of data in order to observe significant changes in the transfer cost. The applications used in this work are Matrix Multiplication and, primarily, database applications.