IDEAS home Printed from https://ideas.repec.org/a/eee/reensy/v169y2018icp127-136.html

Heterogeneous 1-out-of-N warm standby systems with online checkpointing

Author

Listed:
  • Levitin, Gregory
  • Xing, Liudong
  • Dai, Yuanshun

Abstract

As a common practice in computing-related applications, checkpointing is used to facilitate effective system recovery when failures occur. Checkpoints save the data associated with the completed portion of a mission task. After a failure, the system can roll back, retrieve the saved data, and resume the mission task from the last successful checkpoint instead of from the very beginning of the mission, saving time and cost. This paper models and optimizes 1-out-of-N: G warm standby systems subject to uneven online checkpointing, where checkpoints can be performed in parallel with execution of the primary mission task to improve the efficiency of computing elements. Both data checkpointing and data retrieval take a dynamic amount of time that depends on the amount of work completed. System elements can be heterogeneous in their time-to-failure distribution, performance, and level of readiness to take over the mission task during the warm standby mode. A numerical method is first suggested to evaluate mission performance indices, including mission success probability, expected mission completion time, and expected mission operation cost. Examples are provided to demonstrate the influence of the mission deadline and the element resource sharing parameter (i.e., the CPU time distribution between the checkpointing procedure and the primary mission task) on the mission performance metrics. The optimal checkpoint distribution and optimal element activation sequencing problems are considered for different combinations of optimization objectives and constraints. A co-optimization problem is further addressed, which aims to find the optimal combination of checkpoint distribution and element activation sequence. Example optimization solutions illustrate the tradeoff among the three mission requirements (reliability, completion time, operation cost) for warm standby systems with online checkpoints.
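
The rollback idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's numerical method: it assumes deterministic element lifetimes measured from activation, ignores failures in standby mode, and treats checkpointing and data retrieval as taking no time. The names `last_checkpoint`, `mission_time`, `speeds`, and `lifetimes` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def last_checkpoint(checkpoints: Sequence[float], progress: float) -> float:
    """Return the largest checkpoint position <= progress (0.0 if none).

    `checkpoints` must be sorted in increasing order; positions are in
    units of completed work and may be unevenly spaced.
    """
    idx = bisect_right(checkpoints, progress)
    return checkpoints[idx - 1] if idx > 0 else 0.0


def mission_time(
    work: float,
    checkpoints: Sequence[float],
    speeds: Sequence[float],
    lifetimes: Sequence[float],
) -> Optional[float]:
    """Completion time of a 1-out-of-N mission, or None if all elements fail.

    Elements are activated in the given order. Element i processes
    speeds[i] work units per unit time and fails lifetimes[i] time units
    after activation (a simplifying assumption, not the paper's model).
    """
    saved = 0.0    # work preserved by the last successful checkpoint
    elapsed = 0.0  # total mission time so far
    for speed, life in zip(speeds, lifetimes):
        needed = (work - saved) / speed
        if life >= needed:
            return elapsed + needed
        # The element fails before finishing: roll back to the last
        # checkpoint reached and hand over to the next element.
        elapsed += life
        saved = last_checkpoint(checkpoints, saved + speed * life)
    return None


# Two elements, uneven checkpoints at 20, 50 and 90 work units.
with_cp = mission_time(100.0, [20.0, 50.0, 90.0], [1.0, 2.0], [60.0, 100.0])
without_cp = mission_time(100.0, [], [1.0, 2.0], [60.0, 100.0])
print(with_cp, without_cp)  # 85.0 110.0
```

In this toy case the first element fails after completing 60 work units, so the second element resumes from the checkpoint at 50 and the mission finishes at time 85; without checkpoints the second element restarts from zero and the mission finishes at time 110.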

Suggested Citation

  • Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2018. "Heterogeneous 1-out-of-N warm standby systems with online checkpointing," Reliability Engineering and System Safety, Elsevier, vol. 169(C), pages 127-136.
  • Handle: RePEc:eee:reensy:v:169:y:2018:i:c:p:127-136
    DOI: 10.1016/j.ress.2017.08.011

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0951832017301904
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Durga Rao, K. & Gopika, V. & Sanyasi Rao, V.V.S. & Kushwaha, H.S. & Verma, A.K. & Srividya, A., 2009. "Dynamic fault tree analysis using Monte Carlo simulation in probabilistic safety assessment," Reliability Engineering and System Safety, Elsevier, vol. 94(4), pages 872-883.
    2. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2014. "Optimal component loading in 1-out-of-N cold standby systems," Reliability Engineering and System Safety, Elsevier, vol. 127(C), pages 58-64.
    3. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2013. "Cold-standby sequencing optimization considering mission cost," Reliability Engineering and System Safety, Elsevier, vol. 118(C), pages 28-34.
    4. Qingqing Zhai & Rui Peng & Liudong Xing & Jun Yang, 2013. "Binary decision diagram-based reliability evaluation of k-out-of-(n + k) warm standby systems subject to fault-level coverage," Journal of Risk and Reliability, vol. 227(5), pages 540-548, October.
    5. Eryilmaz, Serkan, 2011. "The behavior of warm standby components with respect to a coherent system," Statistics & Probability Letters, Elsevier, vol. 81(8), pages 1319-1325, August.
    6. Ola Tannous & Liudong Xing & Rui Peng & Min Xie, 2014. "Reliability of warm-standby systems subject to imperfect fault coverage," Journal of Risk and Reliability, vol. 228(6), pages 606-620, December.

    Citations

    Cited by:

    1. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2022. "Optimal sequencing of elements activation in 1-out-of-n warm standby system with storage," Reliability Engineering and System Safety, Elsevier, vol. 221(C).
    2. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2024. "Allocation and activation of resource constrained shock-exposed components in heterogeneous 1-out-of-n standby system," Reliability Engineering and System Safety, Elsevier, vol. 241(C).
    3. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2018. "Co-residence based data vulnerability vs. security in cloud computing system with random server assignment," European Journal of Operational Research, Elsevier, vol. 267(2), pages 676-686.
    4. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2023. "Co-optimizing component allocation and activation sequence in heterogeneous 1-out-of-n standby system exposed to shocks," Reliability Engineering and System Safety, Elsevier, vol. 230(C).
    5. Wu, Hui & Li, Yan-Fu & Bérenguer, Christophe, 2020. "Optimal inspection and maintenance for a repairable k-out-of-n: G warm standby system," Reliability Engineering and System Safety, Elsevier, vol. 193(C).
    6. Levitin, Gregory & Xing, Liudong & Xiang, Yanping, 2020. "Optimizing software rejuvenation policy for tasks with periodic inspections and time limitation," Reliability Engineering and System Safety, Elsevier, vol. 197(C).
    7. Liu, Baoliang & Wen, Yanqing & Qiu, Qingan & Shi, Haiyan & Chen, Jianhui, 2022. "Reliability analysis for multi-state systems under K-mixed redundancy strategy considering switching failure," Reliability Engineering and System Safety, Elsevier, vol. 228(C).
    8. Levitin, Gregory & Xing, Liudong & Haim, Hanoch Ben & Dai, Yuanshun, 2019. "Optimal structure of series system with 1-out-of-n warm standby subsystems performing operation and rescue functions," Reliability Engineering and System Safety, Elsevier, vol. 188(C), pages 523-531.
    9. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2022. "Heterogeneous 1-out-of-n standby systems with limited unit operation time," Reliability Engineering and System Safety, Elsevier, vol. 224(C).
    10. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2023. "Predetermined standby mode transfers in 1-out-of-N systems with resource-constrained elements," Reliability Engineering and System Safety, Elsevier, vol. 229(C).
    11. Levitin, Gregory & Xing, Liudong & Ben-Haim, Hanoch, 2018. "Optimizing software rejuvenation policy for real time tasks," Reliability Engineering and System Safety, Elsevier, vol. 176(C), pages 202-208.
    12. Jia, Heping & Peng, Rui & Yang, Li & Wu, Tianyi & Liu, Dunnan & Li, Yanbin, 2022. "Reliability evaluation of demand-based warm standby systems with capacity storage," Reliability Engineering and System Safety, Elsevier, vol. 218(PA).
    13. Ma, Xiaoyang & Liu, Bin & Yang, Li & Peng, Rui & Zhang, Xiaodong, 2020. "Reliability analysis and condition-based maintenance optimization for a warm standby cooling system," Reliability Engineering and System Safety, Elsevier, vol. 193(C).
    14. Levitin, Gregory & Xing, Liudong & Luo, Liang, 2019. "Joint optimal checkpointing and rejuvenation policy for real-time computing tasks," Reliability Engineering and System Safety, Elsevier, vol. 182(C), pages 63-72.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kim, Heungseob, 2018. "Maximization of system reliability with the consideration of component sequencing," Reliability Engineering and System Safety, Elsevier, vol. 170(C), pages 64-72.
    2. Levitin, Gregory & Xing, Liudong & Haim, Hanoch Ben & Dai, Yuanshun, 2019. "Optimal structure of series system with 1-out-of-n warm standby subsystems performing operation and rescue functions," Reliability Engineering and System Safety, Elsevier, vol. 188(C), pages 523-531.
    3. Ning Wang & Hailun Zhang & Ruoning Lv & Yangming Guo & Peican Zhu, 2022. "An investigation of reliability optimization in standby systems," Journal of Risk and Reliability, vol. 236(2), pages 237-247, April.
    4. Ardakan, Mostafa Abouei & Amini, Hanieh & Juybari, Mohammad N., 2022. "Prescheduled switching time: A new strategy for systems with standby components," Reliability Engineering and System Safety, Elsevier, vol. 218(PB).
    5. Eryilmaz, Serkan, 2017. "The effectiveness of adding cold standby redundancy to a coherent system at system and component levels," Reliability Engineering and System Safety, Elsevier, vol. 165(C), pages 331-335.
    6. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2018. "Co-optimization of state dependent loading and mission abort policy in heterogeneous warm standby systems," Reliability Engineering and System Safety, Elsevier, vol. 172(C), pages 151-158.
    7. Levitin, Gregory & Finkelstein, Maxim, 2017. "Optimal backup in heterogeneous standby systems exposed to shocks," Reliability Engineering and System Safety, Elsevier, vol. 165(C), pages 336-344.
    8. Mansour Shrahili & Mohamed Kayid, 2023. "Stochastic Orderings of the Idle Time of Inactive Standby Systems," Mathematics, MDPI, vol. 11(20), pages 1-21, October.
    9. Amirhossain Chambari & Javad Sadeghi & Fakhri Bakhtiari & Reza Jahangard, 2016. "A note on a reliability redundancy allocation problem using a tuned parameter genetic algorithm," OPSEARCH, Springer;Operational Research Society of India, vol. 53(2), pages 426-442, June.
    10. Yan-Feng Li & Jinhua Mi & Yu Liu & Yuan-Jian Yang & Hong-Zhong Huang, 2015. "Dynamic fault tree analysis based on continuous-time Bayesian networks under fuzzy numbers," Journal of Risk and Reliability, vol. 229(6), pages 530-541, December.
    11. Gayathri, P. & Umesh, K. & Ganguli, R., 2010. "Effect of matrix cracking and material uncertainty on composite plates," Reliability Engineering and System Safety, Elsevier, vol. 95(7), pages 716-728.
    12. Rodríguez, Joanna & Lillo, Rosa E. & Ramírez-Cobo, Pepa, 2015. "Failure modeling of an electrical N-component framework by the non-stationary Markovian arrival process," Reliability Engineering and System Safety, Elsevier, vol. 134(C), pages 126-133.
    13. Janssen, Hans, 2013. "Monte-Carlo based uncertainty analysis: Sampling efficiency and sampling convergence," Reliability Engineering and System Safety, Elsevier, vol. 109(C), pages 123-132.
    14. Jia, Heping & Ding, Yi & Peng, Rui & Liu, Hanlin & Song, Yonghua, 2020. "Reliability assessment and activation sequence optimization of non-repairable multi-state generation systems considering warm standby," Reliability Engineering and System Safety, Elsevier, vol. 195(C).
    15. Levitin, Gregory & Xing, Liudong & Dai, Yuanshun, 2023. "Optimizing uploading and downloading pace distribution in system with two non-identical storage units," Reliability Engineering and System Safety, Elsevier, vol. 231(C).
    16. Xing, Liudong & Shrestha, Akhilesh & Dai, Yuanshun, 2011. "Exact combinatorial reliability analysis of dynamic systems with sequence-dependent failures," Reliability Engineering and System Safety, Elsevier, vol. 96(10), pages 1375-1385.
    17. Chen, Wu-Lin & Wang, Kuo-Hsiung, 2018. "Reliability analysis of a retrial machine repair problem with warm standbys and a single server with N-policy," Reliability Engineering and System Safety, Elsevier, vol. 180(C), pages 476-486.
    18. Bibartiu, Otto & Dürr, Frank & Rothermel, Kurt & Ottenwälder, Beate & Grau, Andreas, 2021. "Scalable k-out-of-n models for dependability analysis with Bayesian networks," Reliability Engineering and System Safety, Elsevier, vol. 210(C).
    19. Hu, Bin & Seiler, Peter, 2015. "Pivotal decomposition for reliability analysis of fault tolerant control systems on unmanned aerial vehicles," Reliability Engineering and System Safety, Elsevier, vol. 140(C), pages 130-141.
    20. Lindhe, Andreas & Norberg, Tommy & Rosén, Lars, 2012. "Approximate dynamic fault tree calculations for modelling water supply risks," Reliability Engineering and System Safety, Elsevier, vol. 106(C), pages 61-71.
