Best Practices

📄️ Periodic Data Saving During Training

During training, issues such as a GPU dropping off the bus, GPU failure, network fluctuations, excessive traffic load, network disconnection, hardware failure, a machine crash, or the training process being terminated by the system at batch N due to an out-of-memory (OOM) error can interrupt a run. If training progress is not saved when such an issue occurs, previous results may be lost and training has to restart from scratch. This wastes valuable time and computing resources and increases the research and development workload.
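The page itself does not prescribe a specific saving mechanism; as an illustration only, a minimal sketch of periodic checkpointing in a PyTorch-style training loop might look like the following, where `model`, `optimizer`, `dataloader`, and the `SAVE_EVERY` interval are hypothetical placeholders:

```python
# Minimal sketch: save a checkpoint every SAVE_EVERY steps so training
# can resume after a crash instead of restarting from scratch.
import os
import torch

def save_checkpoint(model, optimizer, step, path="checkpoints"):
    os.makedirs(path, exist_ok=True)
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        os.path.join(path, f"step_{step}.pt"),
    )

SAVE_EVERY = 500  # hypothetical interval; tune to run length and storage budget

for step, batch in enumerate(dataloader):
    loss = model(batch).mean()   # placeholder forward pass and loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % SAVE_EVERY == 0:
        save_checkpoint(model, optimizer, step)
```

On restart, the latest checkpoint can be loaded with `torch.load` to restore the model and optimizer states and continue from the saved step.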