Deep Learning Workstations

From Robin

== Deep Learning Workstations (DLWS) ==
We have shared workstations for projects that require GPU and CPU power while retaining physical access to a computer. However, the local Deep Learning Workstations might not be the best solution for your project; see the [https://robin.wiki.ifi.uio.no/High_performance_computing High Performance Computing] article for more information.

Most interesting are the four Dell Alienware computers from January 2018; their details and the staff responsible for ongoing record keeping are listed below.

We divide the available machines into two categories, namely <code>supported</code> and <code>stand-alone</code>. <code>supported</code> means that the client runs a UiO-supported operating system, typically the latest LTS version of Red Hat Enterprise Linux. The other category, <code>stand-alone</code>, is set up with a third-party OS and custom packages, meaning that the researchers themselves maintain the client.

For master's projects we prefer that students use the <code>supported</code> category.

* Alienware Aurora R7 - Intel i7 8700K (6-core), Nvidia RTX3090 (UiO: 113616). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dancer''
** URL: ''dancer.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:84:B7
** ETH: 9C:30:5B:13:AF:33
* Alienware Aurora R7 - Intel i7 8700K (6-core), 2x Nvidia RTX2080ti (UiO: 113615). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dasher''
** URL: ''dasher.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:7E:D1
** ETH: 9C:30:5B:13:C5:69
* Alienware Aurora R7 - Intel i7 8700K (6-core), Nvidia RTX3090 (UiO: 113617). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dunder''
** URL: ''dunder.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:46:08
** ETH: 9E:30:5B:13:C5:8B
* Alienware Area 51 R3 - AMD Threadripper 1950x (16-core), 3x Nvidia GTX1080ti (UiO: 113614). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''rudolph''
** URL: ''rudolph.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: 9C:30:5B:13:C5:71
** ETH1: 30:9C:23:2A:EB:39
** ETH2: 30:9C:23:2A:EB:38

We also have older workstations:
* Deep Thinker (2016): (Fractal Design enclosure) Intel i7 6700K (4-core), Nvidia RTX2060. [https://mattermost.uio.no/ifi-robin/messages/@yngveha @yngveha]
** hostname: ''deepthinker''
** URL: ''N/A''
** Location: Master's lab
** Category: <code>stand-alone</code>
* Mac Pro (2010): Intel Xeon (8-core), 1x Nvidia GTX1060 (+ ATI 5770 1GB, not useful for DL). Justas etc.
** hostname: ''skrotten''
** URL: ''N/A''
** Location: MoCap lab
** Category: <code>stand-alone</code>

== Getting access to the workstations ==
To get access to the workstations you either need to be a master's student at ROBIN or have an advisor at ROBIN who can vouch for you.

To access the <code>supported</code> clients, all you need to do is ssh into the machine with your UiO credentials. For details, see [https://robin.wiki.ifi.uio.no/Deep_Learning_Workstations#Configure_SSH Configure SSH].

=== Configure SSH ===
To make SSH more pleasant to work with, we can create a configuration file at <code>~/.ssh/config</code> containing additional information about remote machines. The following is a possible configuration for the Rudolph workstation:

 Host rudolph
     User <my-username>
     HostName rudolph.ifi.uio.no
     IdentityFile ~/.ssh/rudolph

This allows us to connect using the command <code>ssh rudolph</code> without any further configuration.

=== SSH from outside UiO's networks ===
To access the machines from outside UiO's networks, you need to set up the following in the <code>~/.ssh/config</code> file (here we use ''dancer'' as an example):

 Host ifi-login
       Hostname login.ifi.uio.no
       User <my-username>
 
 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyCommand ssh -q ifi-login nc dancer.ifi.uio.no 22

On Windows machines you need to give the full path to <code>ssh.exe</code> in the <code>ProxyCommand</code> to make it work:

 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyCommand C:\Windows\System32\OpenSSH\ssh.exe -q ifi-login nc dancer.ifi.uio.no 22

Now you should be able to run <code>ssh dancer</code>.

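On OpenSSH 7.3 and newer, you can usually replace the <code>ProxyCommand</code>/<code>nc</code> combination with the simpler <code>ProxyJump</code> directive. A possible equivalent (using the same hypothetical key file as above):

 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyJump ifi-login
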
=== SSH key ===
We recommend using key pairs for secure login. See https://www.ssh.com/ssh/keygen/ for more information.

'''NOTE:''' If you use key pairs, make sure to never share your private key.

=== Mosh ===
We also recommend installing [https://mosh.org/ '''mosh'''] for a better '''ssh''' experience. Usage is exactly the same as with '''ssh''', except that '''mosh''' stays connected even when roaming and through laptop hibernation. Mosh should be installed on all workstations.

'''NOTE:''' mosh does not support X forwarding.

== Using the <code>supported</code> workstations ==

=== Queuing up jobs ===
Use <code>batch</code> to queue jobs that start automatically when resources are available.

 $ batch
 at> <sh-command>

Press <code>Ctrl+D</code> to add the job to the queue. This queues a job with the specified command. To see your currently queued jobs, use <code>atq</code>; note that <code>atq</code> only shows your own jobs, not other users'. For a more detailed explanation see [https://linux.die.net/man/1/batch man batch]. It is also possible to queue several scripts/commands in one job.

==== Example of a Python job ====

 $ batch
 at> python3 <name_of_script>.py
 at> python3 <name_of_second_script>.py
 at> echo "Job done at $HOSTNAME"

This will run <code><name_of_script></code> and then <code><name_of_second_script></code> sequentially as one job. Output and errors that would go to the terminal are sent to the user's student/work email. Tip: to get notified when the job is finished, add <code>echo "Job done at $HOSTNAME"</code> at the end.

=== Anaconda ===
The most frequently used machine learning tools are available through Anaconda. If you are not familiar with Anaconda, take a look at [https://docs.anaconda.com/anaconda/navigator/tutorials/ these] tutorials. Another tip is to download the [https://docs.anaconda.com/_downloads/9ee215ff15fde24bf01791d719084950/Anaconda-Starter-Guide.pdf Anaconda cheat sheet].

To see how you can install machine learning packages, see [https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/ Working with GPU packages].

'''NOTE:''' environments are installed in your home directory, e.g. <code>~/.conda/envs/<env_name></code>. Make sure that you have enough storage.

=== Initialization ===
To use Anaconda on the DL workstations, you must do some first-time configuration. Log in to an IFI client, e.g. <code>login.ifi.uio.no</code>, and run the following:

 $ export PATH="/opt/ifi/anaconda3/bin:$PATH"
 $ conda init

These commands add initialization code to your <code>~/.bashrc</code>. You should now see that you are working in the <code>base</code> environment.

=== Example usage ===
There is a huge community using Anaconda and deep learning tools, so we encourage you to investigate how best to use them for your own project. The simple example below will get you started.

Let's say we want to use TensorFlow in our project. First we need to create an environment:

 $ conda create --name ml_project tensorflow-gpu

When the command is executed, you will get a list of all the packages required to create the environment and the size of the installation. Once you accept, the installation begins; this might take a while depending on the size of the packages.

On completion you should be able to run this command to see your new environment:

 $ conda env list

This will output the environments you can activate:

 # conda environments:
 #
 base                  *  /opt/ifi/anaconda3
 ml_project               /uio/kant/ifi-ansatt-u03/vegardds/.conda/envs/ml_project

The star (*) indicates which environment you're currently using.

To change to the new environment, run:

 $ conda activate ml_project

The environment should now be ready to go. If you need another package, use:

 $ conda install <package_name>

You can find the most widely used [https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/ ML packages here].

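If you need to recreate the same environment later, for example on another workstation, you can export it to a file. These are standard conda subcommands, but the file name is only a suggestion:

 $ conda env export --name ml_project > environment.yml
 $ conda env create -f environment.yml
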
== Running DL-activities on the <code>supported</code> workstations ==

Please keep in mind that these workstations are a shared resource. They have no queue system, so we encourage you to share the resources as best you can. We have some tools to help see which processes are running:

=== Benchmarks ===

We've performed some [[benchmarks on the DLWSs]] to better understand the capabilities of the different hardware.

=== Some commands you need to get familiar with ===
* <code>nvidia-smi</code>: see current GPU usage
* <code>htop</code>: see current CPU/RAM usage
* <code>w</code>: see currently logged-in users
* Please use the [https://mattermost.uio.no/login Mattermost] ROBIN workspace to coordinate resource usage.

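For example, a quick pre-flight check before starting a long run (the <code>nvidia-smi</code> query flags are standard, but the output will vary per machine):

 $ w -h
 $ nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
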
=== Before running DL-activities ===

# Make sure you are on one of the <code>supported</code> workstations, e.g. <code>dunder</code>.
# Check whether someone is already running DL-activities, using the commands above.
# While running experiments, be available for DMs on Mattermost and messages in the <code><hostname></code> channel.

== Using the <code>stand-alone</code> workstations ==
To get access to one of the local machines at ROBIN, send an email to robin-engineer@ifi.uio.no including the following information:

 myusername=<username>
 supervisor=<username>
 masters_delivery_deadline=DD-MM-YYYY
 software_requirements=e.g. Python3, TensorFlow, Caffe
 ssh_access=Y/n (if Y, add a public key as an appendix in the e-mail)

When you get access to the computer, you will also be added to a [https://mattermost.uio.no/login Mattermost] channel with the same name as your host computer. There is no job scheduler on these machines, so you are required to coordinate with the other users when conducting computational activities. Make sure you are familiar with the <code>nvidia-smi</code> command so you can check the current status.

{{note| These machines do not have any backup. Make sure to sync your files (e.g. to your home directory at UiO) when you've finished an experiment. }}

To ssh to the host, you need to be on eduroam or on VPN. Be aware that some maintenance on these machines is required, which will lead to occasional downtime.

== Current usage of the <code>supported</code> clients ==
<div><iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://api.robin.uiocloud.no/ws/cpuhistory" height="525" width="80%"></iframe></div>

<div><iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://api.robin.uiocloud.no/ws/gpuhistory" height="525" width="80%"></iframe></div>

== Got questions? ==
If you have any questions, feel free to use [https://mattermost.uio.no/login Mattermost]. If you have feature requests or technical questions, send them to robin-engineer@ifi.uio.no.


Current revision as of 13:27, 24 January 2024
