Software fault tolerance in real-time systems: Identifying the future research questions

F Reghenzani, Z Guo, W Fornaciari - ACM Computing Surveys, 2023 - dl.acm.org
Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …

Serving DNNs like clockwork: Performance predictability from the bottom up

A Gujarati, R Karimi, S Alzayat, W Hao… - … USENIX Symposium on …, 2020 - usenix.org
Machine learning inference is becoming a core building block for interactive web
applications. As a result, the underlying model serving systems on which these applications …

Cores that don't count

PH Hochschild, P Turner, JC Mogul… - Proceedings of the …, 2021 - dl.acm.org
We are accustomed to thinking of computers as fail-stop, especially the cores that execute
instructions, and most system software implicitly relies on that assumption. During most of …

Taming performance variability

A Maricq, D Duplyakin, I Jimenez, C Maltzahn… - … USENIX Symposium on …, 2018 - usenix.org
The performance of compute hardware varies: software run repeatedly on the same server
(or a different server with supposedly identical parts) can produce performance results that …

Understanding silent data corruptions in a large production CPU population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Don't be a blockhead: zoned namespaces make work on conventional SSDs obsolete

T Stavrinos, DS Berger, E Katz-Bassett… - Proceedings of the …, 2021 - dl.acm.org
Research on flash devices almost exclusively focuses on conventional SSDs, which expose
a block interface. Industry, however, has standardized and is adopting Zoned Namespaces …

Analog-to-digital conversion of information archived in display holograms: I. discussion

EV Rabosh, NS Balbekin, NV Petrov - JOSA A, 2023 - opg.optica.org
This discussion paper highlights the potential of display holograms in the storage of
information about objects' shape. The images recorded and reconstructed from holograms …

Aggregathor: Byzantine machine learning via robust gradient aggregation

G Damaskinos, EM El-Mhamdi… - Proceedings of …, 2019 - proceedings.mlsys.org
We present AGGREGATHOR, a framework that implements state-of-the-art robust
(Byzantine-resilient) distributed stochastic gradient descent. Following the standard …

Perseus: A Fail-Slow detection framework for cloud storage systems

R Lu, E Xu, Y Zhang, F Zhu, Z Zhu, M Wang… - … USENIX Conference on …, 2023 - usenix.org
The newly-emerging "fail-slow" failures plague both software and hardware where the victim
components are still functioning yet with degraded performance. To address this problem …

Unicorn: Reasoning about configurable system performance through the lens of causality

MS Iqbal, R Krishna, MA Javidian, B Ray… - Proceedings of the …, 2022 - dl.acm.org
Modern computer systems are highly configurable, with the total variability space sometimes
larger than the number of atoms in the universe. Understanding and reasoning about the …