Design challenges of multi-UAV systems in cyber-physical applications: A comprehensive survey and future directions

R Shakeri, MA Al-Garadi, A Badawy… - … Surveys & Tutorials, 2019 - ieeexplore.ieee.org
Unmanned aerial vehicles (UAVs) have recently rapidly grown to facilitate a wide range of
innovative applications that can fundamentally change the way cyber-physical systems …

Mojim: A reliable and highly-available non-volatile memory system

Y Zhang, J Yang, A Memaripour… - Proceedings of the …, 2015 - dl.acm.org
Next-generation non-volatile memories (NVMs) promise DRAM-like performance,
persistence, and high density. They can attach directly to processors to form non-volatile …

Fault-tolerant and transactional stateful serverless workflows

H Zhang, A Cardoza, PB Chen, S Angel… - 14th USENIX Symposium …, 2020 - usenix.org
This paper introduces Beldi, a library and runtime system for writing and composing fault-
tolerant and transactional stateful serverless functions. Beldi runs on existing providers and …

All about Eve: Execute-Verify replication for Multi-Core servers

M Kapritsos, Y Wang, V Quema, A Clement… - … USENIX Symposium on …, 2012 - usenix.org
This paper presents Eve, a new Execute-Verify architecture that allows state machine
replication to scale to multi-core servers. Eve departs from the traditional agree-execute …

Gray failure: The Achilles' heel of cloud-scale systems

P Huang, C Guo, L Zhou, JR Lorch, Y Dang… - Proceedings of the 16th …, 2017 - dl.acm.org
Cloud scale provides the vast resources necessary to replace failed components, but this is
useful only if those failures can be detected. For this reason, the major availability …

NetBouncer: Active device and link failure localization in data center networks

C Tan, Z **, C Guo, T Zhang, H Wu, K Deng… - … USENIX Symposium on …, 2019 - usenix.org
The availability of data center services is jeopardized by various network incidents. One of
the biggest challenges for network incident handling is to accurately localize the failures …

What bugs live in the cloud? A study of 3000+ issues in cloud systems

HS Gunawi, M Hao, T Leesatapornwongsa… - Proceedings of the …, 2014 - dl.acm.org
We conduct a comprehensive study of development and deployment issues of six popular
and important cloud systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper …

Microsecond consensus for microsecond applications

MK Aguilera, N Ben-David, R Guerraoui… - … USENIX Symposium on …, 2020 - usenix.org
We consider the problem of making apps fault-tolerant through replication, when apps
operate at the microsecond scale, as in finance, embedded computing, and microservices …

Understanding and detecting software upgrade failures in distributed systems

Y Zhang, J Yang, Z **, U Sethi, K Rodrigues… - Proceedings of the …, 2021 - dl.acm.org
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine
the availability of distributed systems. Any failure during an upgrade is catastrophic, as it …

Perseus: A Fail-Slow detection framework for cloud storage systems

R Lu, E Xu, Y Zhang, F Zhu, Z Zhu, M Wang… - … USENIX Conference on …, 2023 - usenix.org
The newly-emerging "fail-slow" failures plague both software and hardware where the victim
components are still functioning yet with degraded performance. To address this problem …