Running Tasks in the Background

Background Processes

Normally, when you start a machine learning training or inference task with a command such as python train.py, the process runs in the foreground of your terminal session. If you are connected to a remote instance via SSH and the connection drops (for example, because of an unstable network), the processes attached to that session typically receive a hangup signal (SIGHUP). Your training task is terminated along with them, and any progress that has not been saved to disk is lost.
To prevent this, SSH into the instance and start long-running tasks with a tool such as tmux, screen, or nohup. These tools decouple the task from the SSH session, so training, inference, and other long-running processes keep running after the connection is closed.

With these tools, a network problem no longer interrupts your training task, and you can reconnect to check on it once the connection is restored.
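The effect of a hangup can be reproduced locally. The following is a minimal sketch, assuming a POSIX shell and using sleep as a stand-in for a training job (all variable names are illustrative). It sends SIGHUP by hand to one plain background process and to one started under nohup, then checks which of them is still alive.

```shell
# Stand-in for a training job: a plain background process.
sleep 5 &
plain_pid=$!

# The same stand-in, started under nohup so that it ignores SIGHUP.
nohup sleep 5 > /dev/null 2>&1 &
nohup_pid=$!

# Give both processes a moment to start.
sleep 1

# Simulate what happens when the SSH session goes away.
kill -HUP "$plain_pid" "$nohup_pid"

# wait reports 128 + signal number when a signal ended the process (SIGHUP is 1).
plain_status=0
wait "$plain_pid" || plain_status=$?
echo "plain process exit status: $plain_status"

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$nohup_pid" 2>/dev/null; then
    nohup_alive=yes
else
    nohup_alive=no
fi
echo "nohup process still alive: $nohup_alive"

# Clean up the surviving stand-in.
kill "$nohup_pid" 2>/dev/null || true
```

A non-zero status for the plain process together with a surviving nohup process is the behavior the tools in this guide rely on.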

Tmux

tmux is a powerful terminal multiplexer that allows you to run multiple terminal sessions in one window and keep them running even after disconnection. Below are the basic usage instructions for tmux and how to use it to run training processes in the background:

1. Basic Tmux Usage

1. Start a new tmux session

tmux new -s session_name

Here session_name is the name you assign to the new session.

2. Detach from a tmux session
In a tmux session, press Ctrl + b then d to detach the session and let it continue running in the background.

3. List all tmux sessions

tmux ls

This command will display all running tmux sessions.

4. Reattach to an existing tmux session

tmux attach -t session_name
tmux a -t session_name

Use the session name you created earlier to reconnect. The two commands are equivalent; the second is a shorthand for the first.

5. Close the current tmux session
In a tmux session, you can type exit or press Ctrl + d to exit; when the last window is closed, the tmux session will also end.
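The same session operations can also be issued non-interactively, which is convenient in scripts. Below is a minimal sketch, assuming tmux is installed; the helper name session_exists and the session name demo are illustrative and not part of tmux. It relies on new-session -d (create a session without attaching), has-session (exit status 0 when the session exists), and kill-session (end a session from outside).

```shell
# Hypothetical helper: succeeds only if tmux reports a session with this name.
session_exists() {
    tmux has-session -t "$1" 2>/dev/null
}

SESSION="demo"   # illustrative session name

if ! command -v tmux >/dev/null 2>&1; then
    echo "tmux is not installed on this machine"
elif session_exists "$SESSION"; then
    # Leave a session that someone else created untouched.
    echo "session $SESSION already exists; attach with: tmux attach -t $SESSION"
else
    # -d creates the session in the background, so no terminal is taken over.
    tmux new-session -d -s "$SESSION"
    tmux ls
    # End the demo session from outside instead of typing exit inside it.
    tmux kill-session -t "$SESSION"
fi
```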

2. Example: Run Training with Tmux

1. Start a new tmux session

tmux new -s training

This creates a new session named training.

2. Start your training process in the tmux session

python train.py

This will run your training script, and logs will be displayed within this tmux session.

3. Once training starts, detach from the session with Ctrl + b then d to let the process continue in the background.

4. Close your SSH connection or terminal, and your training process will continue running in the tmux session.

5. Later, when you want to check the status, reattach to the tmux session

tmux attach -t training

This way, even if the SSH connection is interrupted, your training process remains unaffected; with tmux, you can always reconnect to monitor progress or view outputs.
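If you would rather not attach at all, the launch can be done in one step. This is a sketch under assumptions, not the only way to do it: run_in_tmux is a hypothetical helper name and train.log an illustrative log file. It pipes the output through tee so the log survives after the session ends, because tmux closes a session once the command in its last window exits.

```shell
# Hypothetical helper: run a command in a new detached tmux session and
# copy everything it prints into a log file.
#   $1 = session name, $2 = command line, $3 = log file (no spaces in the path)
run_in_tmux() {
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed" >&2
        return 127
    fi
    # tmux hands the string to a shell, so the pipe is evaluated inside the session.
    tmux new-session -d -s "$1" "$2 2>&1 | tee $3"
}

# Example (not executed here):
#   run_in_tmux training "python train.py" train.log
#   tail -f train.log                               # follow the log without attaching
#   tmux capture-pane -p -t training | tail -n 20   # print what the pane currently shows
```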


Screen

screen is a powerful tool that lets you run multiple virtual terminal sessions in a single window and keep them running after disconnection. Below is the basic usage and an example of running training tasks in the background:

1. Basic Screen Usage

1. Start a new screen session

screen -S session_name

Here session_name is the name you assign to the new session.

2. Detach from a screen session
In a screen session, press Ctrl + a then d to detach it and let it run in the background.

3. List all screen sessions

screen -ls

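4. Reattach to an existing screen session

screen -r session_name

Use the session name you created earlier to reconnect.

5. Exit a screen session
In a screen session, you can type exit to quit; when the last window is closed, the screen session ends.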

2. Example: Run Training with Screen

1. Start a new screen session

screen -S training

This creates a new session named training.

2. Start your training process in the screen session

python train.py

This starts your training script.

3. Once training starts, detach from the session with Ctrl + a then d to let it run in the background.

4. Close your SSH connection or terminal, and your training process will continue running in the screen session.

5. Later, when you want to check the status, reattach to the screen session

screen -r training

With screen, even if the SSH connection is interrupted, your training task remains unaffected. You can always reconnect to view status or outputs.
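screen can also start a session that is detached from the beginning. A minimal sketch, assuming GNU screen is installed; run_in_screen is a hypothetical helper name and train.log an illustrative log file. The -d -m flags start the session detached, and -X quit ends a session from outside.

```shell
# Hypothetical helper: run a command in a new detached screen session and
# copy everything it prints into a log file.
#   $1 = session name, $2 = command line, $3 = log file (no spaces in the path)
run_in_screen() {
    if ! command -v screen >/dev/null 2>&1; then
        echo "screen is not installed" >&2
        return 127
    fi
    # -d -m: start detached; -S: name the session; sh -c evaluates the pipe.
    screen -dmS "$1" sh -c "$2 2>&1 | tee $3"
}

# Example (not executed here):
#   run_in_screen training "python train.py" train.log
#   screen -r training             # attach to watch the output
#   screen -S training -X quit     # end the session from outside
```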


nohup

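nohup is a command-line tool that keeps a command running after you log out or close the terminal by making it ignore the hangup signal (SIGHUP). Unlike tmux and screen, it gives you no session to reattach to, so redirect the output to a file to preserve the logs of the background process; if the output would otherwise go to the terminal, nohup appends it to a file named nohup.out. Here’s how to use nohup to run training tasks in the background and log outputs: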

1. Example: Run Training with nohup

1. Run training with nohup and redirect output

nohup python train.py > train.log 2>&1 &

This starts the train.py script in the background, redirecting both standard output (stdout) and standard error (stderr) to the file train.log.

2. Check training logs
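The training output is written to train.log. You can follow it as it is written with:

tail -f train.log

This prints the end of the log and keeps printing new lines as they are appended. Press Ctrl + c to stop following; this does not stop the training process.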
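Because nohup offers no session to come back to, it helps to record the process ID at launch. The following is a sketch, assuming a POSIX shell; train.pid is a hypothetical file name, and sleep stands in for python train.py so the snippet runs anywhere.

```shell
LOG_FILE="train.log"
PID_FILE="train.pid"   # hypothetical file name

# Replace "sleep 30" with your real command, for example: python train.py
nohup sleep 30 > "$LOG_FILE" 2>&1 &

# $! holds the PID of the most recent background job; save it for later.
echo "$!" > "$PID_FILE"

echo "started with PID $(cat "$PID_FILE"), logging to $LOG_FILE"
```

Later, kill "$(cat train.pid)" stops the job without searching through the process list. If train.log seems to lag behind the actual progress, Python may be buffering its output; running python -u train.py turns that buffering off.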

2. Terminate Processes Started with nohup

To end a process started with nohup, first find its process ID (PID), then kill it.

1. Find the process ID

ps aux | grep train.py

This lists all processes whose command line contains train.py. The output also includes the grep command itself, which you can ignore. The PID is in the second column.

2. Kill the process
Use the PID to terminate it:

# Assume PID is 63832
kill 63832

# If the process is still running, use -9 to force terminate
kill -9 63832

Note: By default, the kill command sends a graceful termination signal (SIGTERM), which gives the process a chance to clean up before it exits. Use kill -9 PID (SIGKILL) only if the process does not respond: SIGKILL cannot be caught, so the process is stopped immediately and gets no chance to clean up.
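The terminate-then-force sequence can be wrapped in a small script. This is a sketch under assumptions: is_running and stop_process are hypothetical helper names, the grace period is arbitrary, and sleep stands in for the training job. It sends SIGTERM first and falls back to SIGKILL only if the process is still running after the grace period.

```shell
# Hypothetical helper: true while the PID exists and is not a zombie.
is_running() {
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    case "$state" in
        ""|Z*) return 1 ;;   # no such process, or already dead and awaiting cleanup
        *)     return 0 ;;
    esac
}

# Hypothetical helper: ask a process to exit, then force it after a grace period.
#   $1 = PID, $2 = grace period in seconds (default 10)
stop_process() {
    pid=$1
    grace=${2:-10}
    kill "$pid" 2>/dev/null || return 0   # SIGTERM; nothing to do if it is already gone
    while [ "$grace" -gt 0 ] && is_running "$pid"; do
        sleep 1
        grace=$((grace - 1))
    done
    if is_running "$pid"; then
        echo "PID $pid did not exit after SIGTERM, sending SIGKILL" >&2
        kill -9 "$pid"
    fi
}

# Demo: sleep stands in for a training job started earlier.
sleep 300 &
demo_pid=$!
stop_process "$demo_pid" 5

# For a real job, take the PID from the ps output above. Where pgrep is
# installed, "pgrep -f train.py" prints matching PIDs without listing itself.
```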