Seven Story Rabbit Hole

Sometimes awesome things happen in deep rabbit holes. Or not.


CoreOS With Nvidia CUDA GPU Drivers

This will walk you through installing the Nvidia GPU kernel module and CUDA drivers in a docker container running on CoreOS.

[architecture diagram]

Launch CoreOS on an AWS GPU instance

  • Launch a new EC2 instance

  • Under “Community AMIs”, search for ami-f669f29e (CoreOS stable 494.4.0 (HVM))

  • Select the GPU instance type: g2.2xlarge

  • Increase the root EBS volume from 8 GB to 20 GB to give yourself some breathing room

ssh into CoreOS instance

Find the public IP of the EC2 instance launched above, and ssh into it:

$ ssh -A core@ec2-54-80-24-46.compute-1.amazonaws.com

Run an Ubuntu 14.04 docker container in privileged mode

$ sudo docker run --privileged=true -i -t ubuntu:14.04 /bin/bash

After the above command, you should be inside a root shell in your docker container. The rest of the steps will assume this.
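
If you want to double-check that you are inside the Ubuntu container rather than on the CoreOS host, /etc/os-release is a quick sanity check (this check is my addition, not part of the original steps):

# grep PRETTY_NAME /etc/os-release

The output should mention Ubuntu 14.04; the same command on the CoreOS host mentions CoreOS instead.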

Install build tools + other required packages

These packages are needed in order to match the version of gcc that was used to build the CoreOS kernel (gcc 4.7).

# apt-get update
# apt-get install gcc-4.7 g++-4.7 wget git make dpkg-dev
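
If you want to confirm which gcc version the running CoreOS kernel was built with, /proc/version records it (a quick check, assuming /proc is visible from the container, which it is when running with --privileged):

# cat /proc/version

The gcc version shown should be 4.7.x, matching the compiler installed above.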

Set gcc 4.7 as default

# update-alternatives --remove gcc /usr/bin/gcc-4.8
# update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.7 60 --slave /usr/bin/g++ g++ /usr/bin/g++-4.7
# update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.8

Verify

# update-alternatives --config gcc

It should list gcc 4.7 with an asterisk next to it:

* 0            /usr/bin/gcc-4.7   60        auto mode

Prepare CoreOS kernel source

Clone CoreOS kernel repository

# mkdir -p /usr/src/kernels
# cd /usr/src/kernels
# git clone https://github.com/coreos/linux.git

Find CoreOS kernel version

# uname -a
Linux ip-10-11-167-200.ec2.internal 3.17.2+ #2 SMP Tue Nov 4 04:15:48 UTC 2014 x86_64 Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz GenuineIntel GNU/Linux

The CoreOS kernel version is 3.17.2
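
If you are scripting these steps, the version can be captured programmatically; a minimal sketch (note that uname -r reports 3.17.2+ with a trailing +, which is stripped here so it matches the branch name used in the next step):

# KERNEL_VERSION=$(uname -r | sed 's/+$//')
# echo $KERNEL_VERSION
3.17.2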

Switch to the correct branch for this kernel version

# cd linux
# git checkout remotes/origin/coreos/v3.17.2

Create kernel configuration file

# zcat /proc/config.gz > /usr/src/kernels/linux/.config

Prepare kernel source for building modules

# make modules_prepare

Now you should be ready to install the nvidia driver.

Hack the kernel version

In order to avoid nvidia: version magic errors when loading the module, the following hack is required. (The running kernel identifies itself as 3.17.2+, as shown by uname above, while the prepared source tree generates 3.17.2, so a module built against it would be rejected.)

# sed -i -e 's/3.17.2/3.17.2+/' include/generated/utsrelease.h

I’ve posted to the CoreOS Group to ask why this hack is needed.
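
To confirm the substitution took effect, you can inspect the generated header (a quick check, not in the original write-up):

# grep UTS_RELEASE include/generated/utsrelease.h
#define UTS_RELEASE "3.17.2+"

The quoted string must now match the version reported by uname -r exactly.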

Install nvidia driver

Download

# mkdir -p /opt/nvidia
# cd /opt/nvidia
# wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

Unpack

# chmod +x cuda_6.5.14_linux_64.run
# mkdir nvidia_installers
# ./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers
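
After extraction, the nvidia_installers directory should contain the three self-extracting installers used in the remaining steps:

# ls nvidia_installers
NVIDIA-Linux-x86_64-340.29.run  cuda-linux64-rel-6.5.14-18749181.run  cuda-samples-linux-6.5.14-18745345.run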

Install

# cd nvidia_installers
# ./NVIDIA-Linux-x86_64-340.29.run --kernel-source-path=/usr/src/kernels/linux/

Installer Questions

  • Install NVidia’s 32-bit compatibility libraries? YES
  • Would you like to run nvidia-xconfig? NO
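
If you would rather script this step than answer the prompts, the NVIDIA .run installer can usually be run unattended; a hedged sketch (the -a and -s flags accept the license and suppress the questions on the installers I have used, but check ./NVIDIA-Linux-x86_64-340.29.run --help for the options your version supports):

# ./NVIDIA-Linux-x86_64-340.29.run -a -s --kernel-source-path=/usr/src/kernels/linux/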

If everything worked, you should see:

[nvidia drivers installed]

Your /var/log/nvidia-installer.log should look something like this.

Load nvidia kernel module

# modprobe nvidia

No errors should be returned. Verify it’s loaded by running:

# lsmod | grep -i nvidia

and you should see:

nvidia              10533711  0
i2c_core               41189  2 nvidia,i2c_piix4

Install CUDA

In order to fully verify that the kernel module is working correctly, install the CUDA drivers + library and run a device query.

To install CUDA:

# ./cuda-linux64-rel-6.5.14-18749181.run
# ./cuda-samples-linux-6.5.14-18745345.run

Verify CUDA

# cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# make
# ./deviceQuery   

You should see the following output:

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GRID K520
Result = PASS

Congratulations! You now have a docker container running under CoreOS that can access the GPU.

Appendix: Expose GPU to other docker containers

If you need other docker containers on this CoreOS instance to be able to access the GPU, follow these steps.

Exit docker container

# exit

You should be back to your CoreOS shell.

Add nvidia device nodes

$ wget https://gist.githubusercontent.com/tleyden/74f593a0beea300de08c/raw/95ed93c5751a989e58153db6f88c35515b7af120/nvidia_devices.sh
$ chmod +x nvidia_devices.sh
$ sudo ./nvidia_devices.sh
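
If you prefer not to download the gist, the script essentially creates the NVIDIA character device nodes by hand; a minimal sketch, assuming the driver's fixed major number 195 for the GPU devices and a dynamically assigned major for nvidia-uvm (the actual gist may differ in details):

#!/bin/bash
# Control device and first GPU: the nvidia driver uses the fixed major number 195.
sudo mknod -m 666 /dev/nvidiactl c 195 255
sudo mknod -m 666 /dev/nvidia0 c 195 0
# nvidia-uvm registers a dynamic major number; read it from /proc/devices
# (this assumes the nvidia-uvm module has already been loaded).
UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
sudo mknod -m 666 /dev/nvidia-uvm c "$UVM_MAJOR" 0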

Verify device nodes

$ ls -alh /dev | grep -i nvidia
crw-rw-rw-  1 root root  251,   0 Nov  5 16:37 nvidia-uvm
crw-rw-rw-  1 root root  195,   0 Nov  5 16:37 nvidia0
crw-rw-rw-  1 root root  195, 255 Nov  5 16:37 nvidiactl

Launch docker containers

When you launch other docker containers on the same CoreOS instance, you will need to add the following --device arguments so that they can access the GPU:

$ sudo docker run -ti --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm tleyden5iwx/ubuntu-cuda /bin/bash
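
Once inside the new container, a quick way to confirm the devices came through is to list them (running the deviceQuery sample as above is the fuller test):

# ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm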
