Running Tasks in the Background
Normally, when you start a machine learning training or inference task with a command such as python train.py, the process runs in the foreground of your terminal session. If you are connected to a remote instance over SSH and the connection drops, for example because of network instability, the processes attached to that SSH session (including your training task) receive a hangup signal (SIGHUP) and are normally terminated, so you lose all unsaved training progress.
To keep training processes from being killed by a dropped connection, SSH into the instance and start long-running tasks with a tool such as tmux, screen, or nohup. These tools keep training, inference, and other long-running processes alive, so they continue to run even after the SSH connection is closed.
With these tools, you can ensure that even if a network issue occurs, your training tasks will not be interrupted, and you can continue to monitor the process status once the connection is restored.
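Before choosing one of these tools, it can help to check which of them are installed on the instance. The following is a minimal sketch that relies only on the standard command -v lookup; the function name check_tools is a hypothetical helper, not part of any of the tools.

```shell
# Hypothetical helper: report which of the three tools are on the PATH.
check_tools() {
    for tool in tmux screen nohup; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: available"
        else
            echo "$tool: not installed"
        fi
    done
}

check_tools
```

nohup is part of the standard utilities on most Linux systems, while tmux and screen may need to be installed through the system package manager.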
Tmux
tmux is a powerful terminal multiplexer that allows you to run multiple terminal sessions in one window and keep them running even after disconnection. Below are the basic usage instructions for tmux and how to use it to run training processes in the background:
1. Basic Tmux Usage
1. Start a new tmux session
tmux new -s session_name
Here session_name is the name you assign to the new session.
2. Detach from a tmux session
In a tmux session, press Ctrl + b, then d, to detach from the session and let it continue running in the background.
3. List all tmux sessions
tmux ls
This command will display all running tmux sessions.
4. Reattach to an existing tmux session
tmux attach -t session_name
tmux a -t session_name
Use the session name you created earlier to reconnect. The two commands are equivalent; a is a shorter form of attach.
5. Close the current tmux session
In a tmux session, you can type exit or press Ctrl + d to exit; when the last window is closed, the tmux session will also end.
2. Example: Run Training with Tmux
1. Start a new tmux session
tmux new -s training
This creates a new session named training.
2. Start your training process in the tmux session
python train.py
This will run your training script, and logs will be displayed within this tmux session.
3. Once training starts, detach from the session with Ctrl + b, then d, to let the process continue in the background.
4. Close your SSH connection or terminal, and your training process will continue running in the tmux session.
5. Later, when you want to check the status, reattach to the tmux session
tmux attach -t training
This way, even if the SSH connection is interrupted, your training process remains unaffected; with tmux, you can always reconnect to monitor progress or view outputs.
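The steps above can also be done without attaching to the session at all, which is convenient in scripts. The sketch below assumes tmux is installed; start_training_tmux and peek_tmux are hypothetical helper names. It starts the command in a detached session with tmux new-session -d and prints the visible pane contents with tmux capture-pane -p.

```shell
# Hypothetical helper: start a command in a detached tmux session.
# Refuses to start if a session with that name already exists.
start_training_tmux() {
    session="${1:-training}"
    command_line="${2:-python train.py}"
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "session '$session' already exists"
        return 1
    fi
    # -d creates the session without attaching the current terminal to it.
    tmux new-session -d -s "$session" "$command_line"
}

# Hypothetical helper: print what is currently visible in the session's pane.
peek_tmux() {
    tmux capture-pane -p -t "${1:-training}"
}
```

For example, start_training_tmux training "python train.py" starts the script, peek_tmux training shows its latest visible output, and tmux attach -t training reattaches interactively. Note that a session started this way ends as soon as the command exits.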
SCREEN
screen is a powerful tool that lets you run multiple virtual terminal sessions in a single window and keep them running after disconnection. Below is the basic usage and an example of running training tasks in the background:
1. Basic Screen Usage
1. Start a new screen session
screen -S session_name
Here session_name is the name you assign to the new session.
2. Detach from a screen session
In a screen session, press Ctrl + a, then d, to detach from it and let it run in the background.
3. List all screen sessions
screen -ls
4. Exit a screen session
In a screen session, you can type exit to quit; when the last window is closed, the screen session ends.
2. Example: Run Training with Screen
1. Start a new screen session
screen -S training
This creates a new session named training.
2. Start your training process in the screen session
python train.py
This starts your training script.
3. Once training starts, detach from the session with Ctrl + a, then d, to let it run in the background.
4. Close your SSH connection or terminal, and your training process will continue running in the screen session.
5. Later, when you want to check the status, reattach to the screen session
screen -r training
With screen, even if the SSH connection is interrupted, your training task remains unaffected. You can always reconnect to view status or outputs.
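As with tmux, the session can be created already detached, so no key presses are needed. This is a minimal sketch assuming screen is installed; start_training_screen and stop_screen_session are hypothetical helper names. It uses screen -dmS to start a named session in detached mode and screen -S name -X quit to end it.

```shell
# Hypothetical helper: start a command in a detached, named screen session.
# Usage: start_training_screen SESSION COMMAND [ARGS...]
start_training_screen() {
    session="$1"
    shift
    # -d -m starts the session detached; -S gives it a name.
    screen -dmS "$session" "$@"
}

# Hypothetical helper: end a named screen session and everything running in it.
stop_screen_session() {
    screen -S "$1" -X quit
}
```

For example, start_training_screen training python train.py starts the script, and screen -r training reattaches to it later. A session started this way ends when the command exits.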
nohup
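nohup is a command-line tool that lets a command keep running after you log out or close the terminal by making it ignore the hangup signal (SIGHUP). It does not move the command to the background by itself, which is why the command line ends with &. If standard output is a terminal and you do not redirect it, nohup appends the output to a file named nohup.out; redirecting to a file of your choice keeps the logs of the background process where you expect them. Here's how to use nohup to run training tasks in the background and log outputs: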
1. Example: Run Training with nohup
1. Run training with nohup and redirect output
nohup python train.py > train.log 2>&1 &
This starts the train.py script in the background, redirecting both standard output (stdout) and standard error (stderr) to the file train.log.
2. Check training logs
The training output is written to train.log. You can follow it as it grows with:
tail -f train.log
This shows the latest log content with real-time updates.
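Because the process will have to be found again later, one option is to record its PID at launch time. The sketch below wraps the nohup command shown above in a hypothetical helper, run_in_background, and uses the shell variable $! (the PID of the most recent background job) to write the PID to a file. The file names train.log and train.pid in the usage note are illustrative.

```shell
# Hypothetical helper: run a command under nohup, log to a file, and save its PID.
# Usage: run_in_background LOG_FILE PID_FILE COMMAND [ARGS...]
run_in_background() {
    log_file="$1"
    pid_file="$2"
    shift 2
    # stdout and stderr both go to the log file; & puts the job in the background.
    nohup "$@" > "$log_file" 2>&1 &
    # $! holds the PID of the background job that was just started.
    echo "$!" > "$pid_file"
}
```

For example, run_in_background train.log train.pid python train.py starts training, and cat train.pid prints the PID later without searching the process list.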
2. Terminate Processes Started with nohup
To end a process started with nohup, first find its process ID (PID), then kill it.
1. Find the process ID
ps aux | grep train.py
This lists all processes whose command line contains train.py. The PID is in the second column of the output. The grep command itself may also appear in the list; ignore that entry.
2. Kill the process
Use the PID to terminate it:
# Assume PID is 63832
kill 63832
# If the process is still running, use -9 to force terminate
kill -9 63832
Note: By default, the kill command sends a graceful termination signal (SIGTERM), which gives the process a chance to clean up safely. If the process does not respond, use kill -9 PID to send SIGKILL and force termination; a process killed this way cannot run any cleanup code.
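The two-step procedure above (SIGTERM first, SIGKILL only if needed) can be sketched as a small function. stop_process is a hypothetical helper, and the default grace period of 10 seconds is an arbitrary assumption; kill -0 sends no signal and only checks whether the process still exists.

```shell
# Hypothetical helper: stop a process gracefully, then force it if necessary.
# Usage: stop_process PID [TIMEOUT_SECONDS]
stop_process() {
    pid="$1"
    timeout="${2:-10}"
    # Send SIGTERM; if this fails, the process is already gone.
    kill "$pid" 2>/dev/null || return 0
    # Poll once per second until the process exits or the timeout runs out.
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    # Still running after the grace period: force termination with SIGKILL.
    kill -9 "$pid" 2>/dev/null
}
```

For example, stop_process 63832 gives the process up to 10 seconds to exit, and stop_process 63832 30 extends the grace period to 30 seconds.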