Setting up Ubuntu on GPU EC2 in AWS for ML with PyTorch – part 1

In this post, we will show you how to configure an Ubuntu GPU EC2 instance in Amazon Web Services from start to finish so that you can perform Deep Learning tasks on it. At NubiSoft, we usually run machine learning workloads on less popular cloud providers, because AWS is not the cheapest option: training mostly demands raw computing power, while system availability and the geographic location of the computation matter far less at that stage. These factors only become important once we have a trained model to deploy to production (i.e. to carry out inference). However, a customer who uses AWS exclusively asked us to deploy their Deep Learning stack there. So here we go!

First, a new EC2 instance has to be provisioned. This can be done with awscli, CloudFormation, Terraform, and other tools, but the most illustrative way is to use the AWS console's Launch Instance option, where we choose the plain Ubuntu 18.04 LTS AMI (a scripted awscli equivalent is sketched after the console walkthrough below):

Choosing Ubuntu Server 18.04 LTS 64-bit (x86)

This tutorial is based on the P2 GPU instance family, so choose it as follows (if you prefer, you can pick a larger P2 instance with more computing capacity):

Choosing the p2.xlarge instance type

Adjust the disk volume to the needs of your computation profile; for example, Deep Learning on image data will probably require a lot of free disk space:

Adding storage

Finally, create the EC2 instance by choosing the SSH access key and pressing the Launch Instances button:

Choosing the SSH access key pair
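If you prefer scripting to clicking, here is a minimal awscli sketch of the same provisioning step. The AMI ID, key pair name, and security group ID below are placeholders; the actual Ubuntu 18.04 AMI ID depends on your region:

# Placeholder values: substitute the AMI ID, key name, and security group
# that are valid for your account and region.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type p2.xlarge \
  --key-name my-ssh-key \
  --security-group-ids sg-0123456789abcdef0 \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'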

Having SSH access to the newly provisioned EC2 instance, let's start with the prayer of every Ubuntu lover 😉

sudo apt update
sudo apt upgrade

To install the Tesla device driver, we will also need gcc:

sudo apt install gcc

Next, make sure that the kernel headers and development packages for the currently running kernel are up-to-date:

sudo apt install linux-headers-$(uname -r)

Ensure that packages from the CUDA network repository have priority over those from the standard Canonical repository:

# $distribution resolves to e.g. "ubuntu1804" on Ubuntu 18.04
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
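If you are curious what this pin actually does, inspect the file; it should contain roughly the following, assigning the NVIDIA repository a priority of 600 (higher than the default of 500 used for regular repositories):

cat /etc/apt/preferences.d/cuda-repository-pin-600

# Expected content, roughly:
# Package: *
# Pin: release l=NVIDIA CUDA
# Pin-Priority: 600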

Install the CUDA repository public GPG key:

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub

Set up the CUDA network repository:

echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list

Update the APT repository cache and install the driver using the cuda-drivers meta-package. The --no-install-recommends option can be used for a lean driver install without any dependencies on X packages, which is particularly useful for headless installations on cloud instances with modest disk resources (see the variant after the commands below).

sudo apt update
sudo apt -y install cuda-drivers
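
For the lean, X-free variant mentioned above, the install command becomes:

sudo apt -y install --no-install-recommends cuda-drivers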

... and five minutes later we are ready to start the post-installation steps. First, some paths need to be added; we can do that permanently by extending the ~/.bashrc file with two lines:

nano ~/.bashrc

# Adjust cuda-11.2 below to the CUDA toolkit version you actually install
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Save the file and load the new paths into the current shell session using the source command:

source ~/.bashrc

Verify that the driver is loaded and check its version:

cat /proc/driver/nvidia/version
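
As an additional sanity check, the nvidia-smi utility installed along with the driver should now list the instance's GPU (a single Tesla K80 on a p2.xlarge):

nvidia-smi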

Now, we are ready to deploy the rest of the PyTorch deep learning stack – we will cover it in the subsequent post.
