Printed from https://ideas.repec.org/a/sae/risrel/v234y2020i4p636-648.html

Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure

Author

Listed:
  • Bentolhoda Jafary
  • Lance Fiondella
  • Ping-Chen Chang

Abstract

Checkpointing is a technique to back up work at periodic intervals so that, if computation fails, it need not restart from the beginning but can instead resume from the latest checkpoint. Performing checkpointing operations requires time. Therefore, it is necessary to consider the tradeoff between the time spent on checkpointing operations and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints.
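
The tradeoff described in the abstract can be illustrated with a minimal sketch. This is not the article's Markov model and it ignores correlation entirely: it assumes independent, exponentially distributed failures (so the failure distribution is memoryless), a fixed checkpoint cost paid at the end of every segment including the last, and a fixed restart cost during which no failure occurs. Under those assumptions the expected time to finish one segment of length t is (1/λ + R)(e^{λt} − 1). All function and parameter names below are hypothetical and chosen for illustration only.

```python
import math


def expected_total_time(work, n_segments, checkpoint_cost, restart_cost, failure_rate):
    """Expected completion time with n equal segments (illustrative baseline).

    Assumes independent exponential failures with rate `failure_rate`,
    a checkpoint of length `checkpoint_cost` after every segment, and a
    failure-free restart of length `restart_cost` after each failure.
    """
    # Each attempt must survive the segment's work plus its checkpoint.
    segment = work / n_segments + checkpoint_cost
    # Closed form for the expected time to complete one segment with retries.
    per_segment = (1.0 / failure_rate + restart_cost) * math.expm1(failure_rate * segment)
    return n_segments * per_segment


def best_segment_count(work, checkpoint_cost, restart_cost, failure_rate, max_segments=200):
    """Brute-force search for the segment count with the smallest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_total_time(
            work, n, checkpoint_cost, restart_cost, failure_rate
        ),
    )


if __name__ == "__main__":
    # Hypothetical numbers: 100 time units of work, checkpoint cost 1, no restart cost.
    n_star = best_segment_count(100.0, 1.0, 0.0, 0.01)
    print(n_star, expected_total_time(100.0, n_star, 1.0, 0.0, 0.01))
```

With too few segments, each failure discards a large amount of work; with too many, checkpoint overhead dominates, so the expected time has an interior minimum. The abstract reports that correlated failures shift this balance toward more frequent checkpointing, an effect this independent-failure baseline does not capture.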

Suggested Citation

  • Bentolhoda Jafary & Lance Fiondella & Ping-Chen Chang, 2020. "Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure," Journal of Risk and Reliability, vol. 234(4), pages 636-648, August.
  • Handle: RePEc:sae:risrel:v:234:y:2020:i:4:p:636-648
    DOI: 10.1177/1748006X19893569

    Download full text from publisher

    File URL: https://journals.sagepub.com/doi/10.1177/1748006X19893569
    Download Restriction: no


    References listed on IDEAS

    1. Catello Di Martino & Zbigniew Kalbarczyk & Ravishankar Iyer, 2016. "Measuring the Resiliency of Extreme-Scale Computing Environments," Springer Series in Reliability Engineering, in: Lance Fiondella & Antonio Puliafito (ed.), Principles of Performance and Reliability Modeling and Evaluation, pages 609-655, Springer.
    2. Jafary, Bentolhoda & Fiondella, Lance, 2016. "A universal generating function-based multi-state system performance model subject to correlated failures," Reliability Engineering and System Safety, Elsevier, vol. 152(C), pages 16-27.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Park, Jae-Hyun, 2017. "Time-dependent reliability of wireless networks with dependent failures," Reliability Engineering and System Safety, Elsevier, vol. 165(C), pages 47-61.
    2. Tian, Tianzi & Yang, Jun & Li, Lei & Wang, Ning, 2023. "Reliability assessment of performance-based balanced systems with rebalancing mechanisms," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    3. Yan-Feng Li & Hong-Zhong Huang & Jinhua Mi & Weiwen Peng & Xiaomeng Han, 2022. "Reliability analysis of multi-state systems with common cause failures based on Bayesian network and fuzzy probability," Annals of Operations Research, Springer, vol. 311(1), pages 195-209, April.
    4. Peng, Rui & Xiao, Hui & Liu, Hanlin, 2017. "Reliability of multi-state systems with a performance sharing group of limited size," Reliability Engineering and System Safety, Elsevier, vol. 166(C), pages 164-170.
    5. Yi-Kuei Lin & Lance Fiondella & Ping-Chen Chang, 2022. "Reliability of time-constrained multi-state network susceptible to correlated component faults," Annals of Operations Research, Springer, vol. 311(1), pages 239-254, April.
    6. Li, Jian & Dueñas-Osorio, Leonardo & Chen, Changkun & Shi, Congling, 2016. "Connectivity reliability and topological controllability of infrastructure networks: A comparative assessment," Reliability Engineering and System Safety, Elsevier, vol. 156(C), pages 24-33.
    7. Xiaoyu Cui & Shaoping Wang & Tongyang Li & Jian Shi, 2019. "System Reliability Assessment Based on Energy Dissipation: Modeling and Application in Electro-Hydrostatic Actuation System," Energies, MDPI, vol. 12(18), pages 1-22, September.
    8. Akshay Kumar & Subhi Tyagi & Mangey Ram, 0. "Signature of bridge structure using universal generating function," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 0, pages 1-5.
    9. Wu, Di & Chi, Yuanying & Peng, Rui & Sun, Mengyao, 2019. "Reliability of capacitated systems with performance sharing mechanism," Reliability Engineering and System Safety, Elsevier, vol. 189(C), pages 335-344.
    10. Zhou, Xiaojun & Shi, Kailong, 2019. "Capacity failure rate based opportunistic maintenance modeling for series-parallel multi-station manufacturing systems," Reliability Engineering and System Safety, Elsevier, vol. 181(C), pages 46-53.
    11. Zhang, Yongjin & Zhao, Ming & Zhang, Yanjun & Pan, Ruilin & Cai, Jing, 2020. "Dynamic and steady-state performance analysis for multi-state repairable reconfigurable manufacturing systems with buffers," European Journal of Operational Research, Elsevier, vol. 283(2), pages 491-510.
    12. Li, Jingkui & Lu, Yuze & Liu, Xiaona & Jiang, Xiuhong, 2023. "Reliability analysis of cold-standby phased-mission system based on GO-FLOW methodology and the universal generating function," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    13. Shahraki, Ameneh Forouzandeh & Yadav, Om Prakash & Vogiatzis, Chrysafis, 2020. "Selective maintenance optimization for multi-state systems considering stochastically dependent components and stochastic imperfect maintenance actions," Reliability Engineering and System Safety, Elsevier, vol. 196(C).
    14. Akshay Kumar & Subhi Tyagi & Mangey Ram, 2021. "Signature of bridge structure using universal generating function," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 12(1), pages 53-57, February.
    15. Gregory Levitin & Heping Jia & Yi Ding & Yonghua Song, 2017. "1-out-of-N multi-state standby systems with state-dependent random replacement times," Journal of Risk and Reliability, vol. 231(6), pages 750-760, December.
    16. Gao, Guibing & Wang, Junshen & Yue, Wenhui & Ou, Wenchu, 2020. "Structural-vulnerability assessment of reconfigurable manufacturing system based on universal generating function," Reliability Engineering and System Safety, Elsevier, vol. 203(C).
