About this document

Purpose

This document presents the reference architecture and the bootstrap and installation procedures of an HPC system called Scibian HPC.

The main goal is to provide exhaustive information regarding the configuration and system settings based on the needs expressed by users. This information may be useful to business and technical stakeholders, as well as to all members of the scientific computing community at EDF.

Structure

This document is divided into five chapters:

  1. About this document: refers to the present chapter.

  2. Reference architecture: gives an overview of the software and hardware architecture of a Scibian HPC system. It also includes a detailed description of the boot sequence of the HPC System and some other advanced topics.

  3. Installation procedures: describes how to install the Puppet-HPC software stack used to configure the administration and generic nodes of the HPC system. This chapter also explains how to use Ceph for sharing the configuration files across all the nodes and how to handle the virtual machines providing all the services needed to operate the HPC system.

  4. Bootstrap procedures: contains all the procedures to bootstrap all the crucial services for the Scibian HPC system: LDAP, Ceph, MariaDB with Galera, SlurmDBD, etc.

  5. Production procedures: contains all the technical procedures to follow for regular operations occurring during the production phase of the supercomputer. This notably includes changing any encryption or authentication key, changing passwords, reinstalling nodes, etc.

Typographic conventions

The following typographic conventions are used in this document:

  • File or directory names are written in italics: /admin/restricted/config-puppet.

  • Hostnames are written in bold: genbatch1.

  • Groups of hostnames are written using the nodeset syntax from clustershell. For example, genbatch[1-2] refers to the servers genbatch1 and genbatch2.

  • Commands, configuration files contents or source code files are set off visually from the surrounding text as shown below:

    $ cp /etc/default/rcS /tmp

Build dependencies

On a Debian Jessie system, these packages must be installed to build this documentation:

  • asciidoctor >= 0.1.4

  • asciidoctor-edf-tpl-latex >= 2.0

  • inkscape

  • rubber

  • texlive-latex-extra

License

Copyright © 2014-2017 EDF S.A.

This document is governed by the CeCILL license under French law and
abiding by the rules of distribution of free software.  You can use,
modify and/ or redistribute the document under the terms of the
CeCILL license as circulated by CEA, CNRS and INRIA at the following
URL "http://www.cecill.info".

As a counterpart to the access to the source code and  rights to copy,
modify and redistribute granted by the license, users are provided only
with a limited warranty  and the document's author,  the holder of the
economic rights,  and the successive licensors  have only  limited
liability.

In this respect, the user's attention is drawn to the risks associated
with loading,  using,  modifying and/or developing or reproducing the
document by the user in light of its specific status of free software,
that may mean  that it is complicated to manipulate,  and  that  also
therefore means  that it is reserved for developers  and  experienced
professionals having in-depth computer knowledge. Users are therefore
encouraged to load and test the document's suitability as regards their
requirements in conditions enabling the security of their systems and/or
data to be ensured and,  more generally, to use and operate it in the
same conditions as regards security.

The fact that you are presently reading this means that you have had
knowledge of the CeCILL license and that you accept its terms.

Full license terms and conditions can be found at http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html.

Authors

In alphabetical order:

  • Benoit Boccard

  • Ana Guerrero López

  • Thomas Hamel

  • Camille Mange

  • Rémi Palancher

  • Cécile Yoshikawa

Reference architecture

This chapter gives an overview of the software and hardware architecture of a Scibian HPC system. It also includes a detailed description of the boot sequence of the HPC System and some other advanced topics.

Hardware architecture

The following diagram represents the reference high-level hardware architecture of Scibian HPC clusters:

arch hw
Figure 1. Scibian HPC cluster hardware reference architecture

Networks

The cluster is connected to three physically separated networks:

  • The WAN network, an Ethernet based network with L3 network routers which connect the IP networks of the HPC cluster to the organization network.

  • The low-latency network, for both I/O to the storage system and distributed computing communications (typically MPI messages) between compute nodes. The hardware technologies of this network may vary depending on performance requirements, but it generally involves high-bandwidth (10Gb/s and above), low-latency technologies such as InfiniBand, Omni-Path or 10Gb/s Ethernet.

  • The administration network, used for virtually all other internal network communications: deployment, services, administrator operations, etc. It must be an L2 Ethernet network with dedicated switches.

It is recommended to split the administration Ethernet network with a VLAN dedicated to all management devices (BMC [1], CMC [2], etc). This has significant advantages:

  • It significantly reduces the size of the Ethernet broadcast domains, which notably increases DHCP reliability and reduces the load on the Ethernet switches.

  • It slightly increases security since the IP access to the management devices can be restricted to nodes accessing the VLAN or by a firewall on an IP router.

Administration cluster

The administration cluster is composed of two types of nodes: the admin node and the generic service nodes.

The admin node is the access node for administrators and the central point of administrative operations. All common administrative actions are performed on this node. It does not run any intensive workloads, just simple short-lived programs, so it does not need to be very powerful. It does not store sensitive data nor run critical services, so it does not need to be very reliable either. Example of hardware specifications:

CPU: 1 x 4 cores

RAM: 8GB ECC

Network:

  • 1 x 1Gb/s bonding on administration network

  • 1 x 1Gb/s bonding on WAN network

  • 1 link on low-latency network

Storage: 2 x 300GB SATA hard disks (RAID1)

PSU: non-redundant

The generic service nodes run all critical infrastructure services (within service virtual machines) and manage all production administrative data. Scibian HPC requires a pool of 3 (minimum) to 5 (recommended) generic service nodes. The pool works in active cluster mode: the load is balanced across the nodes with automatic fail-over. All generic service nodes of a cluster must be fairly identical for efficient load-balancing.

The generic service nodes store the production data in a distributed object-storage system. It is highly recommended that these nodes have a dedicated block storage device for this purpose. The workload is mostly proportional to the number of compute nodes, but the generic service nodes must be powerful enough to comfortably handle the load peaks that occur during some operations (e.g. a full cluster reboot). Also, since the services run in virtual machines, a fairly large amount of RAM is required. Services can generate a lot of traffic on the administration network, so high-bandwidth network adapters are relevant. Even though high-availability is ensured at the software level with automatic fail-over between generic service nodes, hardware redundancy on most devices of the generic service nodes is still recommended, to avoid risky service migrations as much as possible. Example of hardware specifications:

CPU: 2 x 16 cores

RAM: 64GB ECC

Network:

  • 2 x 10Gb/s bonding on administration network

  • 2 x 1Gb/s bonding on WAN network

  • 1 link on low-latency network

Storage:

  • 2 x 300GB SATA hard disks (RAID1) for the host

  • 2 x 1TB SSD (SAS or NVMe PCIe) for the object-storage system

PSU: redundant

All physical nodes must be connected to all three physical networks. On the generic service nodes, virtual bridges on the host are connected to the WAN, administration (and optionally management) networks. The service virtual machines are connected to these virtual bridges depending on the requirements of the services they host.

User-space cluster

The user-space cluster is composed of frontend nodes and compute nodes.

The nodes of the user-space cluster are deployed with a diskless live system stored in RAM. It implies that, technically speaking, the nodes do not necessarily need to have local block storage devices.

The frontend nodes are the access hosts for users, so they must be connected to all three physical networks. It is possible to have multiple frontend nodes in active cluster mode for load-balancing and automatic fail-over. The exact hardware specifications of the frontend nodes mostly depend on user needs and expectations. Users may need to transfer large amounts of data to the cluster, so it is recommended to provide high-bandwidth network adapters for the WAN network. These nodes can also be designed to compile computational codes; in this case, they must be powerful in terms of CPU, RAM and local storage I/O.

The compute nodes run the jobs so they must provide high performances. Their exact hardware specifications totally depend on user needs. They must be connected to both the administration and the low-latency networks.

Storage system

The storage system is designed to host user data. It provides one or several shared POSIX filesystems. The storage technologies involved depend on user needs, ranging from a simple NFS NAS to a complex distributed filesystem such as Lustre or GPFS with many SAN and I/O servers.

External services

A Scibian HPC cluster is designed to be mostly self-contained and to continue running jobs even if it is cut off from the rest of the organization network. There are some limits to this, though, and some external services are needed. Critical external services are replicated inside the cluster to avoid losing cluster availability if the connection to an external service is cut.

Base services

LDAP

The reference cluster architecture provides a highly available LDAP service, but it is only meant as a replica of an external LDAP service. The organization must provide an LDAP service with suitable replica credentials.

Only the LDAP servers (Proxy virtual machines) connect to these servers.

NTP

The generic service nodes provide NTP servers for the whole cluster. These servers must be synchronized on an external NTP source, which can be an organization NTP server or a public one (e.g. pool.ntp.org).

Only the NTP servers (Generic Service nodes) connect to these servers.

Package repositories

The normal way for a Scibian HPC Cluster to handle package repositories (APT) is to provide a proxy cache to organization or public distribution repositories. Alternatively, it is possible to mirror external repositories on the cluster (with clara and Ceph/S3).

A proxy cache needs less maintenance and is the preferred solution. Local mirrors can be used when the connection to external repositories is unreliable.

Only the Proxy Cache servers (Generic Service nodes) connect to these servers. In the mirror mode, only the admin node uses them.

DNS

An external DNS service is not strictly necessary, but it is hard to do without if the cluster must use organization or public services (license servers, NAS…).

The external DNS servers are configured as forwarders (for recursion) in the local DNS server configuration.

Only the DNS servers (Generic Service nodes) connect to these servers.

Optional services

NAS

It is frequent to mount (at least on the frontend nodes) an external NAS space to copy data in and out of the cluster.

Graphite

In the reference architecture, all system metrics collected on the cluster (by collectd) are pushed to an external Graphite server. This is usually relayed by the proxy virtual machines.

InfluxDB

In the reference architecture, all job metrics collected on the cluster are pushed to an external InfluxDB server. This is usually relayed by the proxy virtual machines.

HPCStats

HPCStats is a tool that frequently connects to the frontend as a normal user to launch jobs. It also connects to the SlurmDBD database to get batch job statistics. The database connection needs a special NAT configuration on the Proxy virtual machines.

Slurm-Web Dashboard

The Slurm-Web Dashboard aggregates data coming from multiple clusters in the same web interface. To get this data, the clients connect to an HTTP REST API hosted on the Proxy virtual machines.

Software architecture

Overview

image

Functions

The software configuration of the cluster aims to deliver a set of functions. Functions can rely on each other, for example, the disk installer uses the configuration management to finish the post-install process.

The main functions provided by a Scibian HPC cluster are:

  • Configuration Management, to distribute and apply the configuration to the nodes

  • Disk Installer, to install an OS from scratch on the node disks through the network

  • Diskless Boot, to boot a node with a live diskless OS through the network

  • Administrator Tools, tools and services used by the system administrator to operate the cluster

  • User Tools, tools and services used by end users

A Scibian HPC cluster uses a set of services to deliver each particular function. A cluster that provides Configuration Management and a Disk Installer is able to operate, even if it cannot yet do anything useful for its users. These two core functions create a self-sufficient cluster on top of which the other functions are provided.

Services

The software services of the cluster are sorted into two broad categories:

  • Base Services, necessary to provide core functions: install and configure a physical or virtual machine

  • Additional Services, to boot a diskless (live) machine, provide all end user services (batch, user directory, licenses…), and system services not mandatory to install a machine (monitoring, metrics…)

The Base Services run on a set of physical machines that are almost identical; those hosts are called Service Nodes. The services are set up to work reliably even if some of the service nodes are down. This means that a service node can be re-installed by the other active service nodes.

The Additional Services can be installed on a set of other hosts that can be either physical or virtual. VMs (Virtual Machines) are usually used because those services do not need a lot of raw power, and the agility provided by virtual machines (like live host migration) is often an advantage.

If the cluster is using virtualized machines for the Additional Services, the service nodes must also provide a consistent virtualization platform (storage and hosts). In the reference architecture, this is provided with Ceph RBD and Libvirtd running on service nodes.

A particular service runs on service nodes even if it is not mandatory for Disk Installer or Config Management: the low-latency network manager (Subnet Manager for InfiniBand, Fabric Manager for Intel Omni-Path). This exception is due to the fact that this particular service needs raw access to the low-latency network.

In the Puppet configuration, services are usually associated with profiles. For example, the puppet configuration configures the DNS Server service with the profile: profiles::dns::server.

Base Services

Infrastructure

Infrastructure-related services provide basic network operations:

  • DHCP and TFTP for PXE Boot

  • DNS servers, with forwarding for external zones

  • NTP servers, synchronized on external servers

These services are configured the same way and run on every service node.

Consul

Consul is a service that permits the discovery of available services in the cluster. Clients query a special DNS entry (xxx.service.virtual) and the DNS server integrated with Consul returns the IP address of an available instance.
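
For illustration, a lookup from any node could look like the following console session; the service name and the returned address are purely hypothetical:

    $ host slurmdbd.service.virtual
    slurmdbd.service.virtual has address 10.1.0.12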

Ceph

Ceph provides a highly available storage system for all system needs. Ceph has the advantage of working with the internal storage of the service nodes: it does not require a storage system shared between servers (NAS or SAN).

Ceph provides:

  • A Rados Block Device (RBD) that is used to store Virtual Machines disk images

  • A Rados GateWay that provides storage for configuration management, with an Amazon S3 compatible REST API for write operations and plain HTTP for reads.

  • A Ceph FS that can provide a POSIX filesystem used for Slurm Controller state save location

image

A Ceph cluster is made of four kinds of daemons. All generic service nodes run the following daemons:

  • OSD, Object Storage Daemons actually holding the content of the ceph cluster

  • RGW, Rados GateWay (sometimes shortened radosgw) exposing an HTTP API like S3 to store and retrieve data in Ceph

Two other kinds of daemons run on only three of the generic service nodes:

  • MON, Monitor daemons, the orchestrators of the Ceph cluster. A quorum of two active MON daemons (out of three) must be maintained for the cluster to be available

  • MDS, MetaData Server, only used by CephFS (the POSIX filesystem on top of Ceph). At least one must always be active.

With this configuration, any single server can be unavailable. As long as at least two of the servers holding the critical services (MON and MDS) remain available, the cluster can even survive the loss of an additional non-critical server.

Libvirt/KVM

Service nodes are also the physical hosts for the Virtual Machines of the cluster. Libvirt is used in combination with QEMU/KVM to configure the VMs. A Ceph RBD pool is used to store the image of the VMs. With this configuration, the only state on a service node is the VM definition.

image
Figure 2. How the service machines, ceph and vm interact

Integration with Clara makes it easy to move VMs between nodes.
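
Since the VM disk images are stored in the shared Ceph RBD pool, moving a VM does not require copying any storage. As an illustration of the underlying mechanism with plain libvirt tools (Clara wraps this kind of operation), where the VM name proxy1 and the destination host service2 are hypothetical:

    $ virsh migrate --live proxy1 qemu+ssh://service2/system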

HTTP secret and boot

The process to boot a node needs a configuration obtained through HTTP and computed by a CGI (in Python). This is hosted on the service nodes and served by Apache. This is also used to serve files like the kernel, initrd and pre-seeded configuration.

A special Virtual Host in the Apache configuration is used to serve secrets (Hiera-Eyaml keys). This VHost is configured to serve these files only on a specific port. This port is only accessible if the client connects from a source port below 1024 (i.e. is root); this is enforced by a Shorewall rule.
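
As an illustration, such a restriction could be expressed in a Shorewall rules file roughly as follows; the zone name, port number and exact layout are assumptions, not the actual Scibian HPC configuration:

    # accept connections to the secrets VHost port only from privileged source ports
    #ACTION   SOURCE   DEST   PROTO   DPORT   SPORT
    ACCEPT    adm      $FW    tcp     936     0:1023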

APT proxy

There is no full repository mirror on the cluster. APT is configured to use a proxy that fetches data from external repositories and caches it. This permits always having up-to-date packages without overloading external repositories and without having to maintain mirror synchronization (internally and externally).
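
On the client side, this typically boils down to a single APT option, as in the hedged example below; the file name, proxy host and port are illustrative (3142 is the usual apt-cacher-ng port), not mandated Scibian HPC values:

    # /etc/apt/apt.conf.d/01proxy
    Acquire::http::Proxy "http://aptproxy.service.virtual:3142/";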

Logs

Logs from all nodes are forwarded to a virtual IP address held by the service nodes. The local rsyslog daemon centralizes those logs and optionally forwards them to an external location.
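
On the nodes, this forwarding amounts to a classic rsyslog rule of the following kind, where the virtual IP address is illustrative:

    # forward all logs over TCP to the log service virtual IP address
    *.* @@10.1.0.100:514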

Low-latency network manager

The Low-latency network manager (InfiniBand Subnet Manager or Intel Omni-Path Fabric Manager) is not mandatory to achieve the feature set of Base Services (Configuration Management and Disk Installation) but it must run on a physical machine, so it is grouped with the Base Services to run on the service nodes.

NFS HA Service

An NFS HA service can serve two purposes:

  • Shared state for services that use POSIX file access to share their state (like SlurmCtld), when CephFS does not provide sufficient performance

  • Shared storage for the users if a distributed filesystem like GPFS or Lustre is not used (only suitable for smaller cluster sizes)

The NFS HA Service is provided with a Keepalived setup.

Additional Services

LDAP

There are no standalone LDAP servers configured. The servers are replicas of an external directory. This means that the replicas are configured independently of each other and are accessed only for read operations.

If the organization uses Kerberos, all Kerberos requests and password checks are done directly by the external Kerberos server.

Bittorrent

Diskless image files are downloaded by the nodes with the BitTorrent protocol. The cluster provides a redundant tracker service with OpenTracker and two server machines are configured to always seed the images.

An Apache server is used to serve the torrent files for the diskless images (HTTP Live).

Slurm

Slurm provides the job management service for the cluster. The controller service (SlurmCtld) runs in an Active/Passive configuration on a pair of servers (batch nodes). The state is shared between the controller nodes. This can be achieved with a CephFS mount or with an NFS HA server. CephFS cannot yet sustain a large number (thousands) of jobs.
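
In slurm.conf, this Active/Passive pair and its shared state directory are typically declared as in the sketch below; the hostnames and the path are illustrative, the actual values depend on the cluster configuration:

    ControlMachine=genbatch1
    BackupController=genbatch2
    StateSaveLocation=/var/spool/slurmctld/state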

The SlurmDBD service also runs on these two servers.

MariaDB/Galera

SlurmDBD uses a MySQL-like database to store accounting information and limits. On Scibian HPC clusters this is provided by a MariaDB/Galera cluster, which provides an Active/Active SQL server compatible with MySQL.

This cluster is usually co-located with SlurmDBD service and Slurm Controllers (batch nodes).

Relays

The Additional Services include a set of relay services to the outside of the cluster for:

  • Email (Postfix Relay)

  • Network (NAT configured by Shorewall)

  • Metrics (Carbon C Relay)

Monitoring

Cluster monitoring is done by Icinga2; the cluster is integrated into an organization-wide Icinga infrastructure. The cluster hosts a redundant pair of monitoring satellites that check the nodes. The monitoring master is external to the cluster.

High-Availability

All services running on the cluster should be highly available (HA). Some services that are not critical for normal cluster operation may not be highly available, but this should be avoided if possible.

The following section lists the different techniques used to achieve high-availability of the cluster services.

Stateless

Stateless services are configured the same way on all servers and will give the same answer to all requests. These services include:

  • DHCP

  • TFTP

  • NTP

  • DNS

  • LDAP Replica

  • HTTP Secret

  • HTTP Boot

  • HTTP Live

  • Ceph RadosGW

  • APT Proxy

  • Carbon Relay

  • Bittorrent Tracker

  • Bittorrent Seeder

  • SMTP Relay

Clients can provide a list of potential servers that will be tried in turn. If the client does not automatically accept multiple servers, it is possible to use the Consul service to get a DNS entry (xxx.service.virtual) that always points to an available instance of the service.

As a last resort, and for services that do not need Active/Active (load-balancing) capabilities, it is possible to use a Virtual IP address (VIP). HTTP Live and Carbon Relay use this technique.

Native Active/Active

Some services have native internal mechanisms to share states between the servers. Contacting any server will have the same effect on the state of the service, or the service has an internal mechanism to get the right server. These services behave this way:

  • Ceph Rados

  • MariaDB/Galera

  • Consul

Native Active/Passive

Some services have only one active server at any time, but the mechanism to select the active server is internal to the service. This means all servers are launched the same way and not by an external agent like Keepalived or Pacemaker/Corosync. Services using this technique are:

  • Ceph MDS (Posix CephFS server)

  • Slurm Controller

  • Omni-Path Fabric Manager or InfiniBand Subnet Manager

Controlled Active/Passive

The service can only have one active server at any one time and this failover must be controlled by an external service. In the current configuration, the only service requiring this setup is:

  • NFS HA Server

Conventions

In order to limit the complexity of the configuration of a Scibian HPC cluster, some naming and architecture conventions have been defined. Multiple components of the software stack expect these conventions to be followed in order to operate properly. These conventions are rather close to common HPC cluster practice, so they should not feel very constraining.

  • The operating system short hostname of the nodes must have the following format: <prefix><role><id>. This is required by the association logic used in Puppet-HPC to map a node to its unique Puppet role. This point is fully explained in the role section of Puppet-HPC reference documentation.

  • The FQDN[3] hostnames of the nodes must be similar to their network names on the administration network. In other words, the IP address resolution on the cluster of the FQDN hostname of a node must return the IP address of this node on the administration network.

Advanced Topics

Boot sequence

Initial common steps

The servers of the cluster can boot on their hard disks or via the network, using the PXE protocol. In normal operations, all service nodes are installed on hard disks, and all nodes of the userspace (compute and frontend nodes) use the network method to boot the diskless image. A service node can use the PXE method when it is being installed. The boot sequence between the power on event on the node and the boot of the initrd is identical regardless of the system booted (installer or diskless image).

The steps of the boot sequence are described on the diagram below:

image

When a node boots on its network device, after a few (but generally time-consuming) internal checks, it loads and runs the PXE ROM stored inside the Ethernet adapter. This ROM first sends a DHCP request to get an IP address and other network parameters. The DHCP server gives it an IP address alongside the filename parameter. This filename is the file the PXE ROM downloads using the TFTP protocol. This protocol, which is rather limited and unreliable, is used here because the PXE ROMs commonly available in Ethernet adapters only support this network protocol.

The file to download depends on the type of node and its role. On Scibian HPC clusters using the Puppet-HPC software stack, the filename required for the current node is set in the bootsystem profile, and its value is therefore usually specified in Hiera in the profiles::bootsystem::boot_params hash. It is set to launch the open source iPXE bootloader, because it delivers many powerful features such as HTTP protocol support. This way, iPXE is used as a workaround to the hardware PXE ROM limitations.

The virtual machines boot like any other node, except QEMU uses iPXE as the PXE implementation for its virtual network adapters. This means that the virtual machines go directly to this step.

The iPXE bootloader must perform another DHCP request since the IP settings are lost when the bootloader is loaded. The DHCP server is able to recognize this request originates from an iPXE ROM. In this case, it sets the filename parameter with an HTTP URL to a CGI written in Python: bootmenu.py. If the DHCP server already knows the originating node and its MAC address (with a statically assigned IP address), it also sends the hostname in the answer. Otherwise, the DHCP request is not honored.

Then, the iPXE bootloader sends a GET HTTP request to this URL. In this request, it also adds its hostname, as given by the DHCP server, to the request parameters.

On the HTTP server side, the Python script bootmenu.py is run as a CGI program. This script parses its configuration file /etc/hpc-config/bootmenu.yaml to get the parameters to properly boot the node: serial console and Ethernet device to use, default boot mode (diskless, installer, etc.). Then it generates an iPXE profile with a menu containing all possible boot entries. Finally, a timeout parameter is added to the iPXE profile.

The iPXE bootloader downloads and loads this dynamically generated profile. Without any action from the administrator, iPXE waits for the timeout to expire and loads the default entry set by the Python script.
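
As an illustration, a generated profile might look like the sketch below. The URLs, kernel parameters, labels and timeout are hypothetical; only the overall structure (a menu, a default entry and a timeout) reflects the description above:

    #!ipxe
    menu Boot menu
    item diskless    Diskless live system
    item installer   Debian installer
    choose --timeout 10000 --default diskless target && goto ${target}

    :diskless
    kernel http://bootserver/boot/vmlinuz boot=live console=ttyS0,115200 fetch=http://bootserver/boot/image.torrent
    initrd http://bootserver/boot/initrd.img
    boot

    :installer
    kernel http://bootserver/boot/linux auto url=http://bootserver/boot/preseed.cfg console=ttyS0,115200
    initrd http://bootserver/boot/initrd.gz
    boot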

Note
If the hostname parameter is empty, or if the node cannot be found in the /etc/hpc-config/bootmenu.yaml file, then the default choices from the configuration file are used.

Disk installation

Here is the sequence diagram of a Scibian server installation on disk, right after the PXE boot common steps:

image

The iPXE ROM downloads the Linux kernel and the initrd archive associated with the boot menu entry. The kernel is then run with all the parameters given in the menu entry, notably with the HTTP url to the preseed file.

The initrd archive contains the Debian Installer program. This program starts by sending a new DHCP request to get an IP address. Then, it downloads the preseed file located at the URL found in the url kernel parameter. This preseed file contains all the answers to the questions asked by the Debian Installer program. This way, the installation process is totally automated and does not require any interaction from the administrator.

During the installation, many Debian packages are retrieved from Debian repositories.

At the end of the installation, Debian Installer runs the commands set in the late_command parameter of the preseed file. On Scibian HPC clusters, this parameter is used to run the following steps:

  • Download through HTTP the hpc-config-apply script,

  • Run hpc-config-apply inside the chroot environment of the newly installed system.
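
A hedged sketch of what such a late_command entry could look like in the preseed file; the download URL and the installation path are hypothetical:

    d-i preseed/late_command string \
        wget http://bootserver/hpc-config-apply -O /target/usr/local/sbin/hpc-config-apply ; \
        chmod +x /target/usr/local/sbin/hpc-config-apply ; \
        in-target /usr/local/sbin/hpc-config-apply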

The detailed functioning of the hpc-config-apply script is not described here, but it involves:

  • downloading and installing additional Debian packages depending on the node role,

  • executing various types of software

  • and writing various configuration files on the installed system.

Please refer to hpc-config-apply(1) man page for a full documentation on how to use this script.

Finally, when the execution of the commands is over, the server reboots.

Once the servers are installed, they are configured through IPMI with Clara to boot on their disk devices first. Please refer to Clara documentation for further details.

Diskless boot

Here is the sequence diagram of the boot process for diskless nodes, right after the PXE boot common steps:

image

The iPXE bootloader downloads the Linux kernel and the initrd image defined within the default boot menu entry and runs them with the provided parameters. Among these parameters, there are notably:

  • fetch whose value is an HTTP URL to a torrent file available on the HTTP server of the supercomputer,

  • cowsize whose value is the size of the ramfs filesystem mounted on /lib/live/mount/overlay,

  • disk_format whose presence indicates that the device it designates must be formatted on node boot,

  • disk_raid whose presence indicates that a software RAID array must be created on node boot with the given parameters.

Within the initrd images, there are several specific scripts that come from live-boot, live-torrent and specific Scibian Debian packages. Please refer to the following sub-section Advanced Topics, Generating diskless initrd for all explanations about how these scripts have been added to the initramfs image.

These scripts download the torrent file at the URL specified in the fetch parameter, then they launch the ctorrent BitTorrent client. This client extracts from the torrent file the IP addresses of the BitTorrent trackers and the names of the files to download using the BitTorrent protocol. There is actually only one file to download, the SquashFS image, which the client downloads in P2P mode by gathering small chunks from several other nodes. Once the file has been fully retrieved, the image is mounted, after executing some preliminary tasks such as formatting the disk or setting up a RAID array if indicated in the kernel options passed by the boot menu. Then, the real init system is started and it launches all the system services. One of these services is hpc-config-apply.service, which runs the hpc-config-apply script.

As for the part regarding the installation with a disk, how the hpc-config-apply script works is not described here. Please refer to hpc-config-apply(1) man page for a full documentation on this topic.

Finally, the node is ready for production.

Frontend nodes: SSH load-balancing and high-availability

The frontend nodes offer a virtual IP address on the WAN network that features both a highly-available and load-balanced SSH service for users to access the HPC cluster. The load-balancing feature automatically distributes users across all available frontend nodes. This load-balancing is operated with persistence, so that users (based on their source IP address) are always redirected to the same frontend node within a given time frame. Behind the virtual IP address, the high-availability of the SSH service is also ensured in case of an outage on a frontend node. These load-balancing and high-availability features are provided by the Keepalived software.

For security reasons, a firewall is also set up on the frontend nodes to control outgoing network traffic. This firewall service is managed by Shorewall, a high-level configuration tool for Linux netfilter. Because of all the various network flows involved in Keepalived, it must be tightly integrated with the firewall rules. The following diagram illustrates both the network principles behind the high-availability/load-balancing mechanisms and the integration with the software components of the firewall:

sshd frontend ha ll
Figure 3. sshd load-balancing HA mechanism with firewall integration

The Keepalived software checks all the frontend nodes using the VRRP[4] protocol on the WAN network interfaces (purple arrow in the diagram). This protocol must be allowed in the OUTPUT chain of the firewall so that Keepalived can work properly.

On the master frontend node, the HA virtual IP address is set on the network interface attached to the WAN network. The Keepalived software configures the IPVS[5] Linux kernel load-balancer to redirect new TCP connections with a Round-Robin algorithm. Therefore, a part of the TCP connections is redirected to the sshd daemon of other frontend nodes (orange arrow in the diagram). An exception must be specified in the OUTPUT chain of the firewall to allow these redirected connections.

To perform such redirections, IPVS simply changes the destination MAC address, to set the address of the real destination frontend, in the Ethernet layer of the first packet of the TCP connection. However, the destination IP address does not change: it is still the virtual IP address.

On the slave frontend nodes, the HA virtual IP address is set on the loopback interface. This is required to make the kernel accept the redirected packets from the master frontend node addressed to the virtual IP address. In order to avoid endless loops, the IPVS redirection rules are disabled on slave frontend nodes or else, packets would be redirected endlessly.

By default, the Linux kernel answers the ARP requests coming from any network device for any IP address attached to any network device. For example, on a system with two network devices: eth0 with ip0 and eth1 with ip1, if an ARP request is received for ip1 on eth0, the kernel positively responds to it, with the MAC address of eth0. Though it is convenient in many cases, this feature is annoying on the frontend nodes, since the virtual IP address is set on all of them. Consequently all frontend nodes answer the ARP requests coming from the WAN default gateway. In order to avoid this behaviour, the net.ipv4.conf.<netif>.arp_ignore and net.ipv4.conf.<netif>.arp_announce sysctl Linux kernel parameters, where <netif> is the network interface connected to the WAN network, are respectively set to 1 and 2. Please refer to the Linux documentation for more details on these parameters and their values: http://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
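
For instance, if the network interface connected to the WAN network is eth1 (an illustrative name), the corresponding settings are:

    net.ipv4.conf.eth1.arp_ignore = 1
    net.ipv4.conf.eth1.arp_announce = 2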

The Keepalived software also checks periodically that the sshd service is still available on all frontend nodes, by trying to perform a TCP connection to their real IP addresses on the TCP/22 port (green arrow in the diagram). An exception must be present in the OUTPUT chain of the firewall to allow these connections.
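
Putting these pieces together, the IPVS part of the Keepalived configuration could be sketched as below. The addresses and timings are illustrative, and a separate vrrp_instance block (not shown) manages the virtual IP address itself:

    virtual_server 192.168.100.10 22 {
        lb_algo rr               # round-robin between frontend nodes
        lb_kind DR               # redirection by rewriting the destination MAC address
        persistence_timeout 600  # a given source IP address sticks to the same frontend
        protocol TCP

        real_server 192.168.100.11 22 {
            TCP_CHECK {          # periodic check of the sshd daemon
                connect_timeout 3
            }
        }
        real_server 192.168.100.12 22 {
            TCP_CHECK {
                connect_timeout 3
            }
        }
    }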

There is an unexplained behaviour in the Linux kernel where the Netfilter conntrack module considers that new TCP connections redirected by IPVS to the local sshd daemon have an INVALID conntrack state. This point can be verified with well-placed iptables rules using the LOG target. This causes the TCP SYN/ACK answer from sshd to be blocked by the OUTPUT chain, since it considers the connection as new and not related to any incoming connection. To work around this annoying behaviour, an exception has been added in the OUTPUT chain of the firewall to accept connections with a source port of TCP/22 and a source IP address that is the virtual IP address. This is not totally satisfying in terms of security, but there is no known easy or obvious way to exploit this exception from a user perspective for other purposes.

If a slave frontend node becomes unavailable, Keepalived detects it either with VRRP checks, or with TCP checks in case only the sshd daemon is crashed. The IPVS rules are changed dynamically to avoid redirecting new TCP connections to this failing node.

If the master frontend node becomes unavailable, the Keepalived software selects a new master node within the other frontend nodes. Then, on this new master node, Keepalived restores the IPVS redirection rules (since they were previously disabled to avoid loops) and moves the virtual IP address from the loopback interface to the WAN network interface.

If a frontend node is scheduled to be turned off, it is possible to drain it.

Service nodes: DNS load-balancing and high-availability

This diagram gives an overview of the load-balancing and high-availability mechanisms involved in the DNS service of the Scibian HPC clusters:

arch dns ha ll
Figure 4. DNS service load-balancing and high-availability

On Linux systems, when an application needs to resolve a network hostname, it usually calls the gethostbyname*() and getaddrinfo() functions of the libc. With a common configuration of the Name Service Switch (in the file /etc/nsswitch.conf), the libc searches for the IP address in the file /etc/hosts and then falls back to a DNS resolution. The DNS resolver gathers the IP address by sending a request to the DNS nameservers specified in the file /etc/resolv.conf. If this file contains multiple nameservers, the resolver sends the request to the first nameserver. If it does not get the answer before the timeout, it sends the request to the second nameserver, and so on. If the application needs another DNS resolution, the resolver follows the same logic, always trying the first nameserver first. This implies that, with this default configuration, as long as the first nameserver answers the requests before the timeout, the other nameservers are never requested and the load is not balanced.

This behavior can be slightly altered with additional options in the file /etc/resolv.conf:

  • options rotate: this option tells the libc DNS resolver to send requests to all the nameservers for successive DNS requests of a process. The DNS resolver is stateless and loaded locally by the processes as a library, either as a shared library or statically in the binary. Therefore, the rotation status is local to a process. The first DNS request of a process is always sent to the first nameserver; the rotation only starts with the second DNS request of a process. Notably, this means that a program which sends one DNS request during its lifetime, launched n times, will send n DNS requests to the first nameserver only. While useful for programs with a long lifetime, this option cannot be considered an efficient and sufficient load-balancing technique on its own.

  • options timeout:1: this option reduces the request timeout from the default value (60 seconds) to 1 second. This is useful when a nameserver has an outage, since many processes are literally stuck waiting for this timeout when it occurs, which causes many latency issues. With this option, the libc DNS resolver quickly tries the other nameservers and the side effects of the outage are significantly reduced.

On Scibian HPC clusters, Puppet manages the file /etc/resolv.conf and ensures these two options are present. It also randomizes the list of nameservers with the fqdn_rotate() function of the Puppet stdlib community module. This function randomizes the order of the elements of an array but uses the fqdn fact to ensure the order stays the same for a node with a given FQDN. That is, each node gets a different random rotation from this function, but a given node's result is the same every time unless its hostname changes. This prevents the file content from changing with every Puppet run. With this function, all the DNS nameservers are equivalently balanced across the nodes. Combined with options rotate, it forms an efficient load-balancing mechanism.
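
On a given node, the resulting file is therefore along these lines; the addresses are illustrative and their order differs from one node to another:

    search hpc.example.org
    nameserver 10.2.0.101
    nameserver 10.2.0.103
    nameserver 10.2.0.102
    options rotate timeout:1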

The DNS servers are managed with the bind daemon on the generic service nodes. Each generic service node has a virtual IP address managed by a keepalived daemon and balanced between all the generic service nodes. The IP addresses of the nameservers mentioned in the file /etc/resolv.conf on the nodes are these virtual IP addresses. If a generic service node fails, its virtual IP address is automatically routed to another generic service node. In combination with options timeout:1, this constitutes a reliable failover mechanism and ensures the high-availability of the DNS service.

Consul and DNS integration

This diagram illustrates how Consul and the DNS servers integrate to provide load-balanced and horizontally scaled network services with high-availability: