Setup OS Environment
====================

Backend.AI and its associated components share common requirements and
configurations for proper operation. This section explains how to configure
the OS environment.

.. note:: This section assumes installation on Ubuntu 20.04 LTS.

Create a user account for operation
-----------------------------------

We will create a user account ``bai`` to install and operate Backend.AI
services. Set the ``UID`` and ``GID`` to ``1100`` to prevent conflicts with
other users or groups. ``sudo`` privilege is required, so add ``bai`` to the
``sudo`` group.

.. code-block:: console

   $ username="bai"
   $ password="secure-password"
   $ sudo adduser --disabled-password --uid 1100 --gecos "" $username
   $ echo "$username:$password" | sudo chpasswd
   $ sudo usermod -aG sudo bai

If you do not want to expose your password in the shell history, remove the
``--disabled-password`` option and enter your password interactively.

Log in as the ``bai`` user and continue the installation.

Install Docker engine
---------------------

Backend.AI requires Docker Engine to create compute sessions with the Docker
container backend. Some service components are also deployed as containers,
so `installing Docker Engine `_ is required. Ensure that
``docker-compose-plugin`` is installed as well, so that the ``docker compose``
command is available.

After the installation, add the ``bai`` user to the ``docker`` group so that
you do not have to prefix every Docker command with ``sudo``.

.. code-block:: console

   $ sudo usermod -aG docker bai

Log out and log in again to apply the group membership change.

Optimize sysctl/ulimit parameters
---------------------------------

This step is not essential, but it is recommended to improve the performance
and stability of Backend.AI operation. Refer to the
`guide of the Manager repository `_ for the details of the kernel parameters
and the ulimit settings. Depending on the Backend.AI services you install,
the optimal values may vary.
Each service installation section provides the values, if needed.

.. note:: Modern systems may have already set the optimal parameters.
   In that case, you can skip this step.

To cleanly separate the configurations, you may follow the steps below.

- Save the resource limit parameters in
  ``/etc/security/limits.d/99-backendai.conf``.

  .. code-block:: bash

     root hard nofile 512000
     root soft nofile 512000
     root hard nproc 65536
     root soft nproc 65536
     bai hard nofile 512000
     bai soft nofile 512000
     bai hard nproc 65536
     bai soft nproc 65536

- Log out and log in again to apply the resource limit changes.
- Save the kernel parameters in ``/etc/sysctl.d/99-backendai.conf``.

  .. code-block:: bash

     fs.file-max=2048000
     net.core.somaxconn=1024
     net.ipv4.tcp_max_syn_backlog=1024
     net.ipv4.tcp_slow_start_after_idle=0
     net.ipv4.tcp_fin_timeout=10
     net.ipv4.tcp_window_scaling=1
     net.ipv4.tcp_tw_reuse=1
     net.ipv4.tcp_early_retrans=1
     net.ipv4.ip_local_port_range="10000 65000"
     net.core.rmem_max=16777216
     net.core.wmem_max=16777216
     net.ipv4.tcp_rmem=4096 12582912 16777216
     net.ipv4.tcp_wmem=4096 12582912 16777216
     vm.overcommit_memory=1

- Apply the kernel parameters with
  ``sudo sysctl -p /etc/sysctl.d/99-backendai.conf``.

.. _prepare_python_and_venv:

Prepare required Python versions and virtual environments
---------------------------------------------------------

Prepare a Python distribution whose version meets the requirements of the
target package. Backend.AI 22.09, for example, requires Python 3.10. The
latest information on Python version compatibility can be found `here `_.

There are several ways to prepare a specific Python version. Here, we will
use a standalone, statically built Python.

Use a standalone static built Python (Recommended)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Obtain a distribution of `a standalone static built Python `_ that matches the
required Python version, the target machine architecture, and so on. Then
extract the distribution to a directory of your choice.

.. code-block:: console

   $ curl -L "https://github.com/astral-sh/python-build-standalone/releases/download/${PYTHON_RELEASE_DATE}/cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz" > cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz
   $ tar -xf "cpython-${PYTHON_VERSION}+${PYTHON_RELEASE_DATE}-${TARGET_MACHINE_ARCHITECTURE}-${ARCHIVE_FLAVOR}.tar.gz"
   $ mkdir -p "/home/${USERNAME}/.static-python/versions"
   $ mv python "/home/${USERNAME}/.static-python/versions/${PYTHON_VERSION}"

For example,

.. code-block:: console

   $ curl -L "https://github.com/astral-sh/python-build-standalone/releases/download/20250409/cpython-3.13.3+20250409-x86_64-unknown-linux-gnu-install_only.tar.gz" > cpython-3.13.3+20250409-x86_64-unknown-linux-gnu-install_only.tar.gz
   $ tar -xf "cpython-3.13.3+20250409-x86_64-unknown-linux-gnu-install_only.tar.gz"
   $ mkdir -p "/home/bai/.static-python/versions"
   $ mv python "/home/bai/.static-python/versions/3.13.3"

Then, you can create multiple virtual environments per service. To create a
virtual environment for Backend.AI Manager and activate it, for example, you
may run:

.. code-block:: console

   $ mkdir "${HOME}/manager"
   $ cd "${HOME}/manager"
   $ ~/.static-python/versions/3.13.3/bin/python3 -m venv .venv
   $ source .venv/bin/activate
   $ pip install -U pip setuptools wheel

You also need to make ``pip`` available to the Python installation with the
latest ``wheel`` and ``setuptools`` packages, so that any non-binary extension
packages can be compiled and installed on your system.

(Alternative) Use pyenv to manually build and select a specific Python version
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you prefer, you can use pyenv and pyenv-virtualenv instead.

Install `pyenv `_ and `pyenv-virtualenv `_. Then, install the Python version
that you need:

.. code-block:: console

   $ pyenv install "${YOUR_PYTHON_VERSION}"

.. note:: You may need to install the `suggested build environment `_ to
   build Python with pyenv.

Then, you can create multiple virtual environments per service. To create a
virtual environment for Backend.AI Manager 22.09.x and automatically activate
it, for example, you may run:

.. code-block:: console

   $ mkdir "${HOME}/manager"
   $ cd "${HOME}/manager"
   $ pyenv virtualenv "${YOUR_PYTHON_VERSION}" bai-22.09-manager
   $ pyenv local bai-22.09-manager
   $ pip install -U pip setuptools wheel

You also need to make ``pip`` available to the Python installation with the
latest ``wheel`` and ``setuptools`` packages, so that any non-binary extension
packages can be compiled and installed on your system.

Configure network aliases
-------------------------

Although not required, using network aliases instead of IP addresses can make
setup and operation easier. Edit the ``/etc/hosts`` file on each node and
append entries like the example below to access each server by its alias.

.. code-block:: bash

   ##### BEGIN for Backend.AI services #####
   10.20.30.10 bai-m1   # management node 01
   10.20.30.20 bai-a01  # agent node 01 (GPU 01)
   10.20.30.22 bai-a02  # agent node 02 (GPU 02)
   ##### END for Backend.AI services #####

Note that the IP addresses must be accessible from the other nodes if you are
installing on multiple servers.

Mount a shared storage
----------------------

Having a shared storage volume makes it easy to save and manage data inside a
Backend.AI compute environment. If you have a dedicated storage, mount it with
the name of your choice under the ``/vfroot/`` directory on each server. You
must mount it at the same path on all management and compute nodes.

Detailed mount procedures may vary depending on the storage type or vendor.
For a typical NFS volume, adding the configuration to ``/etc/fstab`` and
executing ``sudo mount -a`` will do the job.

.. note:: It is recommended to unify the UID and GID of the Storage Proxy
   service, all of the Agent services across nodes, container UID and GID
   (configurable in ``agent.toml``), and the NFS volume.

If you do not have a dedicated storage or are installing on a single server,
you can use a local directory. Just create a directory ``/vfroot/local`` and
give its ownership to the current user.

.. code-block:: console

   $ sudo mkdir -p /vfroot/local
   $ sudo chown -R "$(id -u):$(id -g)" /vfroot

Setup accelerators
------------------

If there are accelerators (e.g., GPU) on the server, you have to install the
vendor-specific drivers and libraries to make sure the accelerators are
properly set up and working. Please refer to the vendor documentation for the
details.

- To integrate NVIDIA GPUs,

  - Install the NVIDIA driver and CUDA toolkit.
  - Install the NVIDIA container toolkit (nvidia-docker2).

Pull container images
---------------------

For compute nodes, you need to pull some container images that are required
for creating a compute session. Lablup provides a set of open container images
and you may pull the following starter images:

.. code-block:: bash

   docker pull cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04
   docker pull cr.backend.ai/stable/python:3.9-ubuntu20.04
   docker pull cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3
   docker pull cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3
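
When several compute nodes need the same images, it may be convenient to wrap
the pulls in a small script. The following is only a minimal sketch, not part
of Backend.AI: the ``pull_images`` function, the ``starter_images`` array, and
the ``DRY_RUN`` variable are hypothetical names chosen for illustration. By
default it only prints the commands it would run.

.. code-block:: bash

   #!/usr/bin/env bash
   # Sketch: pull the starter images one by one and report failures.
   # DRY_RUN, starter_images, and pull_images are illustrative names,
   # not part of Backend.AI or Docker.

   DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really pull

   starter_images=(
     "cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04"
     "cr.backend.ai/stable/python:3.9-ubuntu20.04"
     "cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3"
     "cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3"
   )

   pull_images() {
     local image failed=0
     for image in "$@"; do
       if [ "$DRY_RUN" = "1" ]; then
         # Dry run: show the command instead of executing it.
         echo "docker pull $image"
       elif ! docker pull "$image"; then
         echo "failed to pull: $image" >&2
         failed=$((failed + 1))
       fi
     done
     # Succeed only if every pull succeeded.
     [ "$failed" -eq 0 ]
   }

   pull_images "${starter_images[@]}"

Run the script with ``DRY_RUN=0`` to actually pull the images. Because it
keeps going after a failed pull and returns a non-zero status at the end, a
single unavailable image does not hide problems with the others.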