Designing cloud servers for lower carbon

J Wang, DS Berger, F Kazhamiaka… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
To mitigate climate change, we must reduce carbon emissions from hyperscale cloud
computing. We find that cloud compute servers cause the majority of emissions in a general …

Open-source IP cores for space: A processor-level perspective on soft errors in the RISC-V era

S Di Mascio, A Menicucci, E Gill, G Furano… - Computer Science …, 2021 - Elsevier
This paper discusses principles and techniques to evaluate processors for dependable
computing in space applications. The focus is on soft errors, which dominate the failure rate …

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org
Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Structural coding: A low-cost scheme to protect CNNs from large-granularity memory faults

A Asgari Khoshouyeh, F Geissler, S Qutub… - Proceedings of the …, 2023 - dl.acm.org
The advent of High-Performance Computing has led to the adoption of Convolutional Neural
Networks (CNNs) in safety-critical applications such as autonomous vehicles. However …

A case for self-managing DRAM chips: Improving performance, efficiency, reliability, and security via autonomous in-DRAM maintenance operations

H Hassan, A Olgun, AG Yaglikci, H Luo, O Mutlu - arXiv, 2022 - research-collection.ethz.ch
The memory controller is in charge of managing DRAM maintenance operations (e.g.,
refresh, RowHammer protection, memory scrubbing) in current DRAM chips. Implementing …

Characterizing and understanding HPC job failures over the 2K-day life of IBM BlueGene/Q system

S Di, H Guo, E Pershey, M Snir… - 2019 49th Annual IEEE …, 2019 - ieeexplore.ieee.org
An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical
to the large-scale system maintenance and improvement of the service quality for users. In …

DRAMScope: Uncovering DRAM microarchitecture and characteristics by issuing memory commands

H Nam, S Baek, M Wi, MJ Kim, J Park… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
The demand for precise information on DRAM microarchitectures and error characteristics
has surged, driven by the need to explore processing in memory, enhance reliability, and …

Self-Managing DRAM: A Low-Cost Framework for Enabling Autonomous and Efficient DRAM Maintenance Operations

H Hassan, A Olgun, AG Yağlıkçı, H Luo… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
The memory controller is in charge of managing DRAM maintenance operations (e.g.,
refresh, RowHammer protection, memory scrubbing) to reliably operate modern DRAM …

Predicting future-system reliability with a component-level DRAM fault model

J Jung, M Erez - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
We introduce a new fault model for recent and future DRAM systems that uses empirical
analysis to derive DRAM internal-component level fault models. This modeling level offers …

Exploring properties and correlations of fatal events in a large-scale HPC system

S Di, H Guo, R Gupta, ER Pershey… - … on Parallel and …, 2018 - ieeexplore.ieee.org
In this paper, we explore potential correlations of fatal system events for one of the most
powerful supercomputers, IBM Blue Gene/Q Mira, which is deployed at Argonne National …