Robin-hpc
Hardware and network configuration
The robin-hpc is a shared resource for Robin's researchers and master's students. The strength of the machine is its large number of CPU cores and amount of RAM. Unfortunately, there is no GPU available in this service.
Specs
Node | CPU | RAM | OS |
---|---|---|---|
login node | 2 cores/4 vCPU | 16GB | CentOS |
worker node | 120 cores/240 vCPU | 460GB | CentOS |
Storage
The storage on the nodes consists of a single 1TB disk, of which 100GB is reserved for software and 900GB for the users of the node. Each user has a soft quota of 20GB and a hard quota of 100GB with a grace period of 14 days.
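To keep an eye on your usage, you can check the total size of your home directory from a shell on robin-hpc; a minimal sketch (quota-reporting tools may also be available, but are not covered here):

# Show how much space your home directory currently uses
du -sh ~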
It's important to note that there is no backup of the disk, so do not use the robin-hpc as a cloud storage service. We suggest using rsync or scp to copy important data to your home area on login.ifi.uio.no. E.g.:

rsync --progress <file>.tar.gz <username>@login.ifi.uio.no:~/<path>/<to>/<wherever>

It is also possible to mount your UiO home directory on the robin-hpc.
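One possible way to do this is with sshfs, assuming it is available on robin-hpc; a minimal sketch, where the mount point and username are placeholders:

# Create a mount point and mount your UiO home directory over SSH
mkdir -p ~/uio-home
sshfs <username>@login.ifi.uio.no: ~/uio-home
# Unmount when you are done
fusermount -u ~/uio-home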
Network
For security reasons, the robin-hpc is only accessible from Ifi's networks. If you use a desktop machine at Ifi, you can access the robin-hpc directly. However, from your own laptop you are required to log in to Ifi's login cluster, login.ifi.uio.no, before you can access hpc.robin.uiocloud.no.
The address of the robin-hpc login node is hpc.robin.uiocloud.no.
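From your own laptop you can jump through the login cluster in a single command instead of logging in twice; a minimal sketch with <username> as a placeholder (the next section shows a persistent SSH configuration for the same setup):

# Connect to robin-hpc via login.ifi.uio.no in one step
ssh -J <username>@login.ifi.uio.no <username>@hpc.robin.uiocloud.no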
SSH from outside UiO's networks
To access the machine from outside UiO's networks, see Configure SSH (https://robin.wiki.ifi.uio.no/Deep_Learning_Workstations#Configure_SSH). To make SSH slightly more pleasant to work with, we can create a configuration file at ~/.ssh/config which can contain additional information about remote machines. The following is a possible configuration for the robin-hpc (see https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts#Passing_Through_One_or_More_Gateways_Using_ProxyJump for more info). Individual key paths can be added with IdentityFile ~/.ssh/<key_on_jump_host>.
Host uio-login
    User <my-username>
    Hostname <my-uio-hostname>

Host hpc-robin
    HostName hpc.robin.uiocloud.no
    ProxyJump uio-login
    User <my-username>
Now, ssh hpc-robin should directly establish a connection to the server (commands like scp should also work in the same way).
Access
Apply for access using this link: https://nettskjema.no/a/robin-hpc. Further information about your access will then be sent to you via e-mail.
SLURM
Robin-hpc uses SLURM for job scheduling.
To start a job on robin-hpc you will need a job script specifying the resources you want to use. The job is submitted with the following command:
sbatch nameofthejobscript.sh
(Note: Please do not run your program on the login node; test your implementation before copying the files to robin-hpc. You can easily copy the files needed to run your program to your home area on robin-hpc with scp.)
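For example, a hypothetical copy of a project directory from your own machine (the directory name is a placeholder):

# Copy your project directory to your home area on robin-hpc
scp -r myproject/ <username>@hpc.robin.uiocloud.no:~/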
Below you will find a template for the job script, followed by some convenient commands.
For more information about SLURM, please visit https://slurm.schedmd.com/quickstart.html.
SLURM job script template
#!/bin/bash

# This is a template for a SLURM job script.
# To start a job on robin-hpc please use the command "sbatch nameofthisscript.sh".

# Job name
#SBATCH --job-name=nameofmyjob

# Wall clock time limit (hh:mm:ss).
# (Note: The program will be killed when the time limit is reached.)
#SBATCH --time=01:00:00

# Number of tasks to start in parallel from this script.
# (i.e. myprogram.py below will be started ntasks times)
#SBATCH --ntasks=1

# CPUs allocated per task
#SBATCH --cpus-per-task=16

# Memory allocated per cpu
#SBATCH --mem-per-cpu=1G

# Set exit on errors
set -o errexit
set -o nounset

# Load your environment
source myenv/bin/activate

# Run your program with "srun yourcommand"
# stdout and stderr will be written to a file "slurm-jobid.out".
# (warning: all tasks will write to the same slurm.out file)
srun python3 myprogram.py
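Once the script is saved, a typical workflow is to submit it and follow the output file; a short sketch using the commands described on this page (replace <jobid> with the ID printed by sbatch):

# Submit the job script
sbatch nameofthejobscript.sh
# Check that the job is queued or running
squeue -u yourusername
# Follow the output while the job runs
tail -f slurm-<jobid>.out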
Additional options
To run many similar tasks you can also use the --array option of the sbatch command by adding the following line to the job script. Inside each task, the array index is available via the environment variable SLURM_ARRAY_TASK_ID (see the sketch below). The following example would run 900 jobs:
#SBATCH --array=0-899
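A minimal sketch of how the index could be passed on to your program inside the job script (myprogram.py and its --index flag are hypothetical):

# Each array task gets its own value of SLURM_ARRAY_TASK_ID (0-899 in this example)
srun python3 myprogram.py --index ${SLURM_ARRAY_TASK_ID}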
Convenient commands
View the status of all jobs:
squeue
View the status of your own jobs:
squeue -u yourusername
Cancel all your jobs:
scancel -u yourusername
Cancel all jobs with name "job_name":
scancel -n job_name
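If you need more detail about a single job than squeue shows, scontrol can also be convenient (<jobid> is a placeholder):

# Show detailed information about one job
scontrol show job <jobid>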
Software
MATLAB
Setting up the SLURM job script
#SBATCH --job-name=matlab_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task 16

srun matlab -batch "addpath(genpath('/path/to/your/matlab/folder'));run('myScript.m')"
Running MATLAB in batch mode is the safest option for running MATLAB on an HPC. From the MathWorks documentation [1]:
-batch statement:
- Starts without the desktop
- Does not display the splash screen
- Executes statement
- Disables changes to preferences
- Disables toolbox caching
- Logs text to stdout and stderr
- Does not display modal dialog boxes
- Exits automatically with exit code 0 if script executes successfully. Otherwise, MATLAB terminates with a non-zero exit code.
The addpath(genpath('/path/to/your/matlab/folder')) part adds all files in the specified directory to the MATLAB search path. Afterwards, we run the main script of your program with run('myScript.m').
Utilizing parallel computing in your MATLAB script
When the SLURM worker node sets up your job, a number of environment variables are set. We can use the environment variable SLURM_CPUS_ON_NODE to get the number of CPU cores available in our MATLAB script. In fact, we can use that variable to dynamically select the number of workers in the MATLAB parallel pool, so that your script works both on your own computer and on the HPC.
SLURM_CPUS_STR = getenv('SLURM_CPUS_ON_NODE');

% Delete parallel pool from earlier runs
delete(gcp('nocreate'));

if isempty(SLURM_CPUS_STR)
    % Run on personal computer (with however many cores your CPU has)
    parpool(6);
else
    % Run on SLURM-scheduled HPC
    SLURM_CPUS_NUM = str2num(SLURM_CPUS_STR);
    parpool(SLURM_CPUS_NUM);
end
Anaconda
After initializing Anaconda (see https://robin.wiki.ifi.uio.no/Deep_Learning_Workstations#Anaconda), it is possible to start a job on robin-hpc by using the sbatch command described above and the following template:
SLURM Anaconda template
#!/usr/bin/bash

#SBATCH --job-name=<test_name>
#SBATCH --output=log.txt
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=100

# Load anaconda and your environment
source ~/.bashrc
conda activate <your_environment>

# Run your program with "srun your_command"
srun python3 <script_name>.py
The output is written to "log.txt" and can be followed with "tail -f log.txt".
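If you have not yet created an environment, a minimal sketch of setting one up before submitting the job (the environment name and package list are placeholders):

# Create a conda environment with the packages your program needs
conda create -n <your_environment> python=3.10 numpy
# Submit the Anaconda job script
sbatch nameofthejobscript.sh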
Running containers
singularity is installed on robin-hpc. Read more about singularity at https://sylabs.io/guides/3.0/user-guide/index.html. Alternatively, you can use podman:
alias docker=podman
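As an illustration, a minimal sketch of pulling and running an image with singularity (the image and command are only examples):

# Pull a container image from Docker Hub and convert it to a Singularity image (.sif)
singularity pull docker://python:3.10
# Run a command inside the container
singularity exec python_3.10.sif python3 --version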