A large-scale study of failures in high-performance computing systems
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …
WorkflowSim: A toolkit for simulating scientific workflows in distributed environments
Simulation is one of the most popular evaluation methods in scientific workflow studies.
However, existing workflow simulators fail to provide a framework that takes into …
On the performance variability of production cloud services
Cloud computing is an emerging infrastructure paradigm that promises to eliminate the need
for companies to maintain expensive computing hardware. Through the use of virtualization …
Failure prediction in IBM BlueGene/L event logs
Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …
BlueGene/L failure analysis and prediction models
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …
PETS 2016: Dataset and challenge
This paper describes the datasets and computer vision challenges that form part of the PETS
2016 workshop. PETS 2016 addresses the application of on-board multi sensor surveillance …
On cloud service reliability enhancement with optimal resource usage
An increasing number of companies are beginning to deploy services/applications in the
cloud computing environment. Enhancing the reliability of cloud service has become a …
Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …
Fault-aware, utility-based job scheduling on Blue Gene/P systems
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …
Mining frequent itemsets in a stream
Mining frequent itemsets in a datastream proves to be a difficult problem, as itemsets arrive
in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it …