Deep Learning Workstations

From Robin

== Deep Learning Workstations (DLWS) ==
We have shared workstations for projects that require GPU and CPU power while retaining physical access to a computer. However, the local Deep Learning Workstations might not be the best solution for your project; see the [https://robin.wiki.ifi.uio.no/High_performance_computing High Performance Computing] article for more information.

Most interesting are the four Dell Alienware computers from January 2018; their details and the staff responsible for ongoing record keeping are listed below.

We divide the available machines into two categories, namely <code>supported</code> and <code>stand-alone</code>. <code>supported</code> means that the client runs a UiO-supported operating system, typically the latest LTS version of Red Hat Enterprise Linux. The other category, <code>stand-alone</code>, is set up with a third-party OS and custom packages, meaning that the researchers themselves maintain the client.

For master's projects we prefer that students use the <code>supported</code> category.

* Alienware Aurora R7 - Intel i7 8700K (6-core), Nvidia RTX3090 (UiO: 113616). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dancer''
** URL: ''dancer.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:84:B7
** ETH: 9C:30:5B:13:AF:33
* Alienware Aurora R7 - Intel i7 8700K (6-core), 2x Nvidia RTX2080ti (UiO: 113615). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dasher''
** URL: ''dasher.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:7E:D1
** ETH: 9C:30:5B:13:C5:69
* Alienware Aurora R7 - Intel i7 8700K (6-core), Nvidia RTX3090 (UiO: 113617). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''dunder''
** URL: ''dunder.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: D8:9E:F3:7A:46:08
** ETH: 9E:30:5B:13:C5:8B
* Alienware Area 51 R3 - AMD Threadripper 1950x (16-core), 3x Nvidia GTX1080ti (UiO: 113614). [https://mattermost.uio.no/ifi-robin/messages/@vegardds @vegardds]/shared resource
** hostname: ''rudolph''
** URL: ''rudolph.ifi.uio.no''
** Category: <code>supported</code>
** WLAN: 9C:30:5B:13:C5:71
** ETH1: 30:9C:23:2A:EB:39
** ETH2: 30:9C:23:2A:EB:38

We also have older workstations:
* Deep Thinker (2016): (Fractal Design enclosure) Intel i7 6700K (4-core), Nvidia RTX2060. [https://mattermost.uio.no/ifi-robin/messages/@yngveha @yngveha]
** hostname: ''deepthinker''
** URL: ''N/A''
** Location: Master's lab
** Category: <code>stand-alone</code>
* Mac Pro (2010): Intel Xeon (8-core), 1x Nvidia GTX1060 (+ ATI 5770 1GB, not useful for DL). Justas etc.
** hostname: ''skrotten''
** URL: ''N/A''
** Location: MoCap lab
** Category: <code>stand-alone</code>

== Getting access to the workstations ==
To get access to the workstations you either need to be a master's student at ROBIN or have an advisor at ROBIN who can vouch for you.

To access the <code>supported</code> clients, all you need to do is ssh into the machine with your UiO credentials. For details, see [https://robin.wiki.ifi.uio.no/Deep_Learning_Workstations#Configure_SSH Configure SSH].

=== Configure SSH ===
To make SSH more pleasant to work with, we can create a configuration file at <code>~/.ssh/config</code> containing additional information about remote machines. The following is a possible configuration for the Rudolph workstation:

 Host rudolph
     User <my-username>
     HostName rudolph.ifi.uio.no
     IdentityFile ~/.ssh/rudolph

This allows us to connect using the command <code>ssh rudolph</code> without any further configuration.

=== SSH from outside UiO's networks ===
To access the machines from outside UiO's networks, you need to set up the following in the <code>~/.ssh/config</code> file (here we use ''dancer'' as an example):

 Host ifi-login
       Hostname login.ifi.uio.no
       User <my-username>
 
 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyCommand ssh -q ifi-login nc dancer.ifi.uio.no 22

On Windows machines you need to give the full path to <code>ssh.exe</code> in the <code>ProxyCommand</code> to make it work:

 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyCommand C:\Windows\System32\OpenSSH\ssh.exe -q ifi-login nc dancer.ifi.uio.no 22

Now you should be able to run <code>ssh dancer</code>.

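On OpenSSH 7.3 and newer, you can usually replace the <code>ProxyCommand</code>/<code>nc</code> combination with the simpler <code>ProxyJump</code> directive. A possible equivalent (using the same hypothetical key file as above):

 Host dancer
       User <my-username>
       IdentityFile ~/.ssh/<key_to_dancer>
       ProxyJump ifi-login
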
=== SSH key ===
We recommend using key pairs for secure login. See https://www.ssh.com/ssh/keygen/ for more information.

'''NOTE:''' If you use key pairs, make sure to never share your private key.

=== Mosh ===
We also recommend installing [https://mosh.org/ '''mosh'''] for a better '''ssh''' experience. Usage is exactly the same as with '''ssh''', except that '''mosh''' stays connected even when roaming and through laptop hibernation. Mosh should be installed on all workstations.

'''NOTE:''' mosh does not support X forwarding.

== Using the <code>supported</code> workstations ==

=== Queuing up jobs ===
Use <code>batch</code> to queue jobs that start automatically when resources are available.

 $ batch
 at> <sh-command>

Press <code>Ctrl+D</code> to add the job to the queue. This queues a job with the specified command. To see your currently queued jobs, use <code>atq</code>; note that <code>atq</code> only shows your own jobs, not other users'. For a more detailed explanation see [https://linux.die.net/man/1/batch man batch]. It is also possible to queue several scripts/commands in one job.

==== Example of a Python job ====

 $ batch
 at> python3 <name_of_script>.py
 at> python3 <name_of_second_script>.py
 at> echo "Job done at $HOSTNAME"

This will run <code><name_of_script></code> and then <code><name_of_second_script></code> sequentially as one job. Output and errors that would go to the terminal are sent to the user's student/work email. Tip: to get notified when the job is finished, add <code>echo "Job done at $HOSTNAME"</code> at the end.

=== Anaconda ===
The most frequently used machine learning tools are available through Anaconda. If you are not familiar with Anaconda, take a look at [https://docs.anaconda.com/anaconda/navigator/tutorials/ these] tutorials. Another tip is to download the [https://docs.anaconda.com/_downloads/9ee215ff15fde24bf01791d719084950/Anaconda-Starter-Guide.pdf Anaconda cheat sheet].

To see how you can install machine learning packages, see [https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/ Working with GPU packages].

'''NOTE:''' environments are installed in your home directory, e.g. <code>~/.conda/envs/<env_name></code>. Make sure that you have enough storage.

=== Initialization ===
To use Anaconda on the DL workstations, you must do some first-time configuration. Log in to an IFI client, e.g. <code>login.ifi.uio.no</code>, and run the following:

 $ export PATH="/opt/ifi/anaconda3/bin:$PATH"
 $ conda init

These commands add initialization code to your <code>~/.bashrc</code>. You should now see that you are working in the <code>base</code> environment.

=== Example usage ===
There is a huge community using Anaconda and deep learning tools, so we encourage you to investigate how best to use them for your own project. The simple example below will get you started.

Let's say we want to use TensorFlow in our project. First we need to create an environment:

 $ conda create --name ml_project tensorflow-gpu

When the command is executed, you will get a list of all the packages required to create the environment and the size of the installation. Once you accept, the installation begins; this might take a while depending on the size of the packages.

On completion you should be able to run this command to see your new environment:

 $ conda env list

This will output the environments you can activate:

 # conda environments:
 #
 base                  *  /opt/ifi/anaconda3
 ml_project               /uio/kant/ifi-ansatt-u03/vegardds/.conda/envs/ml_project

The star (*) indicates which environment you're currently using.

To change to the new environment, run:

 $ conda activate ml_project

The environment should now be ready to go. If you need another package, use:

 $ conda install <package_name>

You can find the most widely used [https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/ ML packages here].

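If you need to recreate the same environment later, for example on another workstation, you can export it to a file. These are standard conda subcommands, but the file name is only a suggestion:

 $ conda env export --name ml_project > environment.yml
 $ conda env create -f environment.yml
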
== Running DL-activities on the <code>supported</code> workstations ==

Please keep in mind that these workstations are a shared resource. They have no queue system, so we encourage you to share the resources as best you can. We have some tools to help see which processes are running:

=== Benchmarks ===

We've performed some [[benchmarks on the DLWSs]] to better understand the capabilities of the different hardware.

=== Some commands you need to get familiar with ===
* <code>nvidia-smi</code>: see current GPU usage
* <code>htop</code>: see current CPU/RAM usage
* <code>w</code>: see currently logged-in users
* Please use the [https://mattermost.uio.no/login Mattermost] ROBIN workspace to coordinate resource usage.

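For example, a quick pre-flight check before starting a long run (the <code>nvidia-smi</code> query flags are standard, but the output will vary per machine):

 $ w -h
 $ nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
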
=== Before running DL-activities ===

# Make sure you are on one of the <code>supported</code> workstations, e.g. <code>dunder</code>.
# Check whether someone is already running DL-activities, using the commands above.
# While running experiments, be available for DMs on Mattermost and messages in the <code><hostname></code> channel.

== Using the <code>stand-alone</code> workstations ==
To get access to one of the local machines at ROBIN, send an email to robin-engineer@ifi.uio.no including the following information:

 myusername=<username>
 supervisor=<username>
 masters_delivery_deadline=DD-MM-YYYY
 software_requirements=e.g. Python3, TensorFlow, Caffe
 ssh_access=Y/n (if Y, add a public key as an appendix in the e-mail)

When you get access to the computer, you will also be added to a [https://mattermost.uio.no/login Mattermost] channel with the same name as your host computer. There is no job scheduler on these machines, so you are required to coordinate with the other users when conducting computational activities. Make sure you are familiar with the <code>nvidia-smi</code> command so you can check the current status.

{{note| These machines do not have any backup. Make sure to sync your files (e.g. to your home directory at UiO) when you've finished an experiment. }}

To ssh to the host, you need to be on eduroam or on VPN. Be aware that some maintenance on these machines is required, which will lead to occasional downtime.

== Current usage of the <code>supported</code> clients ==
<div><iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://api.robin.uiocloud.no/ws/cpuhistory" height="525" width="80%"></iframe></div>

<div><iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://api.robin.uiocloud.no/ws/gpuhistory" height="525" width="80%"></iframe></div>

== Got questions? ==
If you have any questions, feel free to use [https://mattermost.uio.no/login Mattermost]. If you have feature requests or technical questions, send them to robin-engineer@ifi.uio.no.


Current revision as of 13:27, 24 January 2024
