A large-scale study of failures in high-performance computing systems

B Schroeder, GA Gibson - IEEE transactions on Dependable …, 2009 - ieeexplore.ieee.org
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …

WorkflowSim: A toolkit for simulating scientific workflows in distributed environments

W Chen, E Deelman - … IEEE 8th international conference on E …, 2012 - ieeexplore.ieee.org
Simulation is one of the most popular evaluation methods in scientific workflow studies.
However, existing workflow simulators fail to provide a framework that takes into …

On the performance variability of production cloud services

A Iosup, N Yigitbasi, D Epema - 2011 11th IEEE/ACM …, 2011 - ieeexplore.ieee.org
Cloud computing is an emerging infrastructure paradigm that promises to eliminate the need
for companies to maintain expensive computing hardware. Through the use of virtualization …

Failure prediction in IBM BlueGene/L event logs

Y Liang, Y Zhang, H Xiong… - … Conference on Data …, 2007 - ieeexplore.ieee.org
Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

PETS 2016: Dataset and challenge

L Patino, T Cane, A Vallee… - Proceedings of the IEEE …, 2016 - cv-foundation.org
This paper describes the datasets and computer vision challenges that form part of the PETS
2016 workshop. PETS 2016 addresses the application of on-board multi sensor surveillance …

On cloud service reliability enhancement with optimal resource usage

A Zhou, S Wang, Z Zheng, CH Hsu… - IEEE Transactions on …, 2014 - ieeexplore.ieee.org
An increasing number of companies are beginning to deploy services/applications in the
cloud computing environment. Enhancing the reliability of cloud service has become a …

Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Fault-aware, utility-based job scheduling on Blue Gene/P systems

W Tang, Z Lan, N Desai… - 2009 IEEE International …, 2009 - ieeexplore.ieee.org
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …

Mining frequent itemsets in a stream

T Calders, N Dexters, JJM Gillis, B Goethals - Information Systems, 2014 - Elsevier
Mining frequent itemsets in a datastream proves to be a difficult problem, as itemsets arrive
in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it …