About this document
Purpose
This document presents the reference architecture and the bootstrap and installation procedures of an HPC system called Scibian HPC.
The main goal is to provide exhaustive information on the configuration and system settings, based on the needs expressed by users. This information may be useful to business and technical stakeholders, as well as to all members of the scientific computing community at EDF.
Structure
This document is divided into five chapters:
- About this document: refers to the present chapter.
- Reference architecture: gives an overview of the software and hardware architecture of a Scibian HPC system. It also includes a detailed description of the boot sequence of the HPC system and some other advanced topics.
- Installation procedures: describes how to install the Puppet-HPC software stack used to configure the administration and generic nodes of the HPC system. This chapter also explains how to use Ceph for sharing the configuration files across all the nodes and how to handle the virtual machines providing all the services needed to operate the HPC system.
- Bootstrap procedures: contains all the procedures to bootstrap all the crucial services for the Scibian HPC system: LDAP, Ceph, MariaDB with Galera, SlurmDBD, etc.
- Production procedures: contains all the technical procedures to follow for regular operations occurring during the production phase of the supercomputer. This notably includes changing any encryption or authentication key, changing passwords, reinstalling nodes, etc.
Typographic conventions
The following typographic conventions are used in this document:
- File and directory names are written in italics: /admin/restricted/config-puppet.
- Hostnames are written in bold: genbatch1.
- Groups of hostnames are written using the nodeset syntax from clustershell. For example, genbatch[1-2] refers to the servers genbatch1 and genbatch2.
- Commands, configuration file contents or source code files are set off visually from the surrounding text as shown below:
$ cp /etc/default/rcS /tmp
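As an illustration of the nodeset folding syntax mentioned above, the following sketch expands a simple folded pattern into individual hostnames. This is only a toy illustration: the real clustershell library also supports comma-separated ranges, zero-padding and multiple brackets.

```python
import re

def expand_nodeset(pattern):
    # Illustrative only: handles a single "name[a-b]" range, nothing more.
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", pattern)
    if not m:
        return [pattern]  # already a plain hostname
    prefix, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(start, end + 1)]

print(expand_nodeset("genbatch[1-2]"))  # ['genbatch1', 'genbatch2']
```

In practice the `nodeset` command shipped with clustershell performs this expansion (and its inverse, folding) on the command line.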
Build dependencies
On a Debian Jessie system, these packages must be installed to build this documentation:
- asciidoctor >= 0.1.4
- asciidoctor-edf-tpl-latex >= 2.0
- inkscape
- rubber
- texlive-latex-extra
License
Copyright © 2014-2017 EDF S.A.
CCN-HPC <dsp-cspito-ccn-hpc@edf.fr>
This document is governed by the CeCILL license under French law and abiding by the rules of distribution of free software. You can use, modify and/or redistribute the document under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL "http://www.cecill.info". As a counterpart to the access to the source code and rights to copy, modify and redistribute granted by the license, users are provided only with a limited warranty and the document's author, the holder of the economic rights, and the successive licensors have only limited liability. In this respect, the user's attention is drawn to the risks associated with loading, using, modifying and/or developing or reproducing the document by the user in light of its specific status of free software, that may mean that it is complicated to manipulate, and that also therefore means that it is reserved for developers and experienced professionals having in-depth computer knowledge. Users are therefore encouraged to load and test the document's suitability as regards their requirements in conditions enabling the security of their systems and/or data to be ensured and, more generally, to use and operate it in the same conditions as regards security. The fact that you are presently reading this means that you have had knowledge of the CeCILL license and that you accept its terms.
Full license terms and conditions can be found at http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html.
Authors
In alphabetical order:
- Benoit Boccard
- Ana Guerrero López
- Thomas Hamel
- Camille Mange
- Rémi Palancher
- Cécile Yoshikawa
Reference architecture
This chapter gives an overview of the software and hardware architecture of a Scibian HPC system. It also includes a detailed description of the boot sequence of the HPC System and some other advanced topics.
Hardware architecture
The following diagram represents the reference high-level hardware architecture of Scibian HPC clusters:
Networks
The cluster is connected to three physically separated networks:
- The WAN network, an Ethernet based network with L3 network routers which connect the IP networks of the HPC cluster to the organization network.
- The low-latency network, used both for I/O to the storage system and for distributed computing communications (typically MPI messages) between compute nodes. The hardware technology of this network may vary with performance requirements, but it generally involves high-bandwidth (10+Gb/s), low-latency technologies such as InfiniBand, Omni-Path or 10Gb Ethernet.
- The administration network, used for basically all other internal network communications: deployment, services, administrator operations, etc. It must be an L2 Ethernet network with dedicated switches.
It is recommended to split the administration Ethernet network with a VLAN dedicated to all management devices (BMC [1], CMC [2], etc.). This has significant advantages:
- It significantly reduces the size of Ethernet broadcast domains, which notably increases DHCP reliability and reduces the load on Ethernet switches.
- It slightly increases security, since IP access to the management devices can be restricted to nodes with access to the VLAN, or filtered by a firewall on an IP router.
Administration cluster
The administration cluster is composed of two types of nodes: the admin node and the generic service nodes.
The admin node is the access node for administrators and the central point of administrative operations. All common administrative actions are performed on this node. It does not run any intensive workloads, only simple short-lived programs, so it does not need to be very powerful. It does not store sensitive data nor run critical services, so it does not need to be very reliable either. Example of hardware specifications:
CPU | 1 x 4 cores
RAM | 8GB ECC
Network |
Storage | 2 x 300GB RAID1 SATA hard disks
PSU | Non-redundant
The generic service nodes run all critical infrastructure services (within service virtual machines) and manage all production administrative data. Scibian HPC requires a pool of 3 (minimum) to 5 (recommended) generic service nodes. The pool works in active cluster mode: the load is balanced across the nodes with automatic fail-over. All generic service nodes of a cluster must be nearly identical for efficient load-balancing.
The generic service nodes store the production data in a distributed object-storage system. It is highly recommended that the nodes have a dedicated block storage device for this purpose. The workload is mostly proportional to the number of compute nodes, but the generic service nodes must be quite powerful to comfortably handle the load peaks that happen during some operations (e.g. a full cluster reboot). Also, since the services run in virtual machines, a fairly large amount of RAM is required. Services can generate a lot of traffic on the administration network, so a network adapter with high bandwidth is advisable. Even though high availability is ensured at the software level with automatic fail-over between generic service nodes, it is nevertheless recommended to have hardware redundancy on most devices of the generic service nodes, to avoid risky and hazardous service migrations as much as possible. Example of hardware specifications:
CPU | 2 x 16 cores
RAM | 64GB ECC
Network |
Storage |
PSU | Redundant
All physical nodes must be connected to all three physical networks. On the generic service nodes, virtual bridges on the host are connected to the WAN, administration (and possibly management) networks. The service virtual machines are connected to these virtual bridges according to the requirements of their hosted services.
User-space cluster
The user-space cluster is composed of frontend nodes and compute nodes.
The nodes of the user-space cluster are deployed with a diskless live system stored in RAM. It implies that, technically speaking, the nodes do not necessarily need to have local block storage devices.
The frontend nodes are the access hosts for users, so they must be connected to all three physical networks. It is possible to have multiple frontend nodes in active cluster mode for load-balancing and automatic fail-over. The exact hardware specifications of the frontend nodes mostly depend on user needs and expectations. Users may need to transfer large amounts of data to the cluster, so it is recommended to provide high-bandwidth network adapters for the WAN network. These nodes can also be designed to compile computational codes and, in this case, they must be powerful in terms of CPU, RAM and local storage I/O.
The compute nodes run the jobs so they must provide high performances. Their exact hardware specifications totally depend on user needs. They must be connected to both the administration and the low-latency networks.
Storage system
The storage system is designed to host user data. It provides one or several shared POSIX filesystems. The storage technologies involved depend on user needs, ranging from a simple NFS NAS to a complex distributed filesystem such as Lustre or GPFS with many SAN and I/O servers.
External services
A Scibian HPC cluster is designed to be mainly self-contained and to continue running jobs even if it is cut off from the rest of the organization network. There are some limits to this though, and some external services are needed. Critical external services are replicated inside the cluster, to avoid losing availability of the cluster if the connection to an external service is lost.
Base services
LDAP
The reference cluster architecture provides a highly available LDAP service, but it is only meant as a replica of an external LDAP service. The organization must provide an LDAP service with suitable replica credentials.
Only the LDAP servers (Proxy virtual machines) connect to these servers.
NTP
The generic service nodes provide NTP servers for the whole cluster. Those servers must be synchronized on an external NTP source. This can be an organization NTP server or a public one (e.g. pool.ntp.org).
Only the NTP servers (Generic Service nodes) connect to these servers.
Package repositories
The normal way for a Scibian HPC cluster to handle package repositories (APT) is to provide a proxy cache to organization or public distribution repositories. Alternatively, it is possible to mirror external repositories on the cluster (with clara and Ceph/S3).
A proxy cache needs less maintenance and is the preferred solution. Local mirrors can be used when the connection to external repositories is unreliable.
Only the Proxy Cache servers (Generic Service nodes) connect to these servers. In mirror mode, only the admin node uses them.
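On client nodes, pointing APT at such a proxy cache amounts to a single setting. A minimal sketch, where the file path, proxy hostname and port are assumptions for illustration (the actual values depend on the cluster configuration):

```
# /etc/apt/apt.conf.d/01proxy (hypothetical path and proxy address)
Acquire::http::Proxy "http://aptproxy.service.virtual:3142";
```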
DNS
An external DNS service is not strictly necessary, but it is hard to do without it if the cluster must use organization or public services (license servers, NAS…).
The external DNS servers are configured as recursive in the local DNS server configuration.
Only the DNS servers (Generic Service nodes) connect to these servers.
Optional services
NAS
It is frequent to mount (at least on the frontend nodes) an external NAS space to copy data in and out of the cluster.
Graphite
In the reference architecture all system metrics collected on the cluster (by collectd) are pushed to an external graphite server. This is usually relayed by the proxy virtual machines.
InfluxDB
In the reference architecture all jobs metrics collected on the cluster are pushed to an external InfluxDB server. This is usually relayed by the proxy virtual machines.
HPCStats
HPCStats is a tool that frequently connects to the frontend as a normal user to launch jobs. It also connects to the SlurmDBD database to get batch job statistics. The database connection needs a special NAT configuration on the Proxy virtual machines.
Slurm-Web Dashboard
The Slurm-Web Dashboard aggregates data coming from multiple clusters in the same web interface. To get those data, clients connect to an HTTP REST API that is hosted on the Proxy virtual machines.
Software architecture
Overview
Functions
The software configuration of the cluster aims to deliver a set of functions. Functions can rely on each other, for example, the disk installer uses the configuration management to finish the post-install process.
The main functions provided by a Scibian HPC cluster are:
- Configuration Management, to distribute and apply the configuration to the nodes
- Disk Installer, to install an OS from scratch on the node disks through the network
- Diskless Boot, to boot a node with a live diskless OS through the network
- Administrator Tools, tools and services used by the system administrators to operate the cluster
- User Tools, tools and services used by end users
A Scibian HPC cluster uses a set of services to deliver each particular function. If a cluster can provide Configuration Management and a Disk Installer, it is able to operate, even if it cannot yet do anything useful for the users. These two core functions create a self-sufficient cluster that is then used to provide the other functions.
Services
The software services of the cluster are sorted into two broad categories:
- Base Services, necessary to provide the core functions: install and configure a physical or virtual machine
- Additional Services, to boot a diskless (live) machine, provide all end user services (batch, user directory, licenses…), and system services not mandatory to install a machine (monitoring, metrics…)
The Base Services run on a set of physical machines that are almost identical, those hosts are called Service Nodes. The services are setup to work reliably even if some of the service nodes are down. This means that a service node can be re-installed by other active service nodes.
The Additional Services can be installed on a set of other hosts that can be either physical or virtual. VMs (Virtual Machines) are usually used because those services do not need a lot of raw power, and the agility provided by virtual machines (like live host migration) is often an advantage.
If the cluster is using virtualized machines for the Additional Services, the service nodes must also provide a consistent virtualization platform (storage and hosts). In the reference architecture, this is provided with Ceph RBD and Libvirtd running on service nodes.
A particular service runs on service nodes even if it is not mandatory for Disk Installer or Config Management: the low-latency network manager (Subnet Manager for InfiniBand, Fabric Manager for Intel Omni-Path). This exception is due to the fact that this particular service needs raw access to the low-latency network.
In the Puppet configuration, services are usually associated with profiles. For example, the Puppet configuration configures the DNS Server service with the profile profiles::dns::server.
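As an illustration, assigning such profiles to a node role in Hiera could look like the following hypothetical excerpt. The file path and all profile names except profiles::dns::server are invented; the actual keys and layout depend on the Puppet-HPC configuration.

```yaml
# hieradata/roles/service.yaml -- hypothetical path and keys
profiles:
  - profiles::dns::server
  - profiles::ntp::server
  - profiles::dhcp::server
```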
Base Services
Infrastructure
Infrastructure-related services provide basic network operations:
- DHCP and TFTP for PXE boot
- DNS servers, with forwarding for external zones
- NTP servers, synchronized on external servers
These services are configured the same way and run on each service node.
Consul
Consul is a service that allows discovering the available services in the cluster. Clients query a special DNS entry (xxx.service.virtual) and the DNS server integrated with Consul returns the IP address of an available instance.
Ceph
Ceph provides a highly available storage system for all system needs. Ceph has the advantage of working with internal storage on the service nodes. It does not require a storage system shared between servers (NAS or SAN).
Ceph provides:
- A RADOS Block Device (RBD) that is used to store virtual machine disk images
- A RADOS Gateway to provide storage for configuration management, with an Amazon S3 compatible REST API for write operations and plain HTTP for reads
- A CephFS filesystem that can provide a POSIX filesystem, used for the Slurm controller state save location
A Ceph cluster is made of four kinds of daemons. All generic service nodes run the following daemons:
- OSD, Object Storage Daemons, actually holding the content of the Ceph cluster
- RGW, RADOS Gateway (sometimes shortened to radosgw), exposing an S3-like HTTP API to store and retrieve data in Ceph
Two other kinds of daemons are only available on three of the generic service nodes:
- MON, Monitor nodes, the orchestrators of the Ceph cluster. A quorum of two active MON nodes must be maintained for the cluster to be available
- MDS, MetaData Servers, only used by CephFS (the POSIX implementation on top of Ceph). At least one must always be active
With this configuration, any one server can be unavailable. As long as at least two servers holding the critical services remain available, the cluster may even survive losing another, non-critical, server.
Libvirt/KVM
Service nodes are also the physical hosts for the Virtual Machines of the cluster. Libvirt is used in combination with QEMU/KVM to configure the VMs. A Ceph RBD pool is used to store the image of the VMs. With this configuration, the only state on a service node is the VM definition.
Integration with Clara makes it easy to move VMs between nodes.
HTTP secret and boot
The process to boot a node needs a configuration obtained through HTTP and computed by a CGI (in Python). This is hosted on the service nodes and served by Apache. This is also used to serve files like the kernel, initrd and pre-seeded configuration.
A special Virtual Host in the Apache configuration is used to serve secrets (Hiera-Eyaml keys). This VHost is configured to only serve the files on a specific port. This port is only accessible if the client connects from a port below 1024 (i.e. is root); this is enforced by a Shorewall rule.
APT proxy
There is no full repository mirror on the cluster. APT is configured to use a proxy that fetches data from external repositories and caches it. This permits always having up-to-date packages without overloading the external repositories and without having to maintain mirror synchronization (internally and externally).
Logs
Logs from all nodes are forwarded to a Virtual IP address running on the service nodes. The local rsyslog daemon centralizes those logs and optionally forwards the result to an external location.
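On the client side, this forwarding typically amounts to a single rsyslog rule. A minimal sketch, assuming (hypothetically) that the virtual IP address resolves as syslog.virtual:

```
# Forward all logs to the VIP over TCP (@@ means TCP, a single @ would be UDP).
*.* @@syslog.virtual:514
```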
Low-latency network manager
The Low-latency network manager (InfiniBand Subnet Manager or Intel Omni-Path Fabric Manager) is not mandatory to achieve the feature set of Base Services (Configuration Management and Disk Installation) but it must run on a physical machine, so it is grouped with the Base Services to run on the service nodes.
NFS HA Service
An NFS HA service can serve two purposes:
- Shared state for services using POSIX to share their state (like SlurmCtld) when CephFS does not provide sufficient performance
- Shared storage for the users if a distributed file system like GPFS or Lustre is not used (only works for smaller cluster sizes)
The NFS HA service is provided with a Keepalived setup.
Additional Services
LDAP
There are no standalone LDAP servers configured. The servers are replicas of an external directory. This means that both are configured independently and are accessed only for read operations.
If the organization uses Kerberos, all Kerberos requests and password checks are done directly by the external Kerberos server.
Bittorrent
Diskless image files are downloaded by the nodes with the BitTorrent protocol. The cluster provides a redundant tracker service with OpenTracker and two server machines are configured to always seed the images.
An Apache server is used to serve the torrent files for the diskless images (HTTP Live).
Slurm
Slurm provides the job management service for the cluster. The controller service (SlurmCtld) runs in an Active/Passive configuration on a pair of servers (batch nodes). The state is shared between the controller nodes. This can be achieved with a CephFS mount or with an NFS HA server. CephFS cannot yet support a large number (thousands) of jobs.
The SlurmDBD service also runs on these two servers.
MariaDB/Galera
SlurmDBD uses a MySQL like database to store accounting information and limits. On Scibian HPC Clusters this is provided by a MariaDB/Galera cluster which provides an Active/Active SQL server compatible with MySQL.
This cluster is usually co-located with SlurmDBD service and Slurm Controllers (batch nodes).
Relays
The Additional Services include a set of relay services to the outside of the cluster for:
- Email (Postfix relay)
- Network (NAT configured by Shorewall)
- Metrics (Carbon C Relay)
Monitoring
Cluster monitoring is done by Icinga2; the cluster is integrated inside an organization Icinga infrastructure. The cluster hosts a redundant pair of monitoring satellites that check the nodes. The monitoring master is external to the cluster.
High-Availability
All services running on the cluster should be highly available (HA). Some services that are not critical for normal cluster operation may not be highly available, but this should be avoided if possible.
The following section lists the different techniques used to achieve high-availability of the cluster services.
Stateless
Stateless services are configured the same way on all servers and give the same answer to all requests. These services include:
- DHCP
- TFTP
- NTP
- DNS
- LDAP Replica
- HTTP Secret
- HTTP Boot
- HTTP Live
- Ceph RadosGW
- APT Proxy
- Carbon Relay
- Bittorrent Tracker
- Bittorrent Seeder
- SMTP Relay
Clients can be given a list of potential servers that are tried in turn. If a client does not automatically accept multiple servers, it is possible to use the Consul service to get a DNS entry (xxx.service.virtual) that always points to an available instance of the service.
As a last resort, and for services that do not need Active/Active (load-balancing) capabilities, it is possible to use a Virtual IP address (VIP). HTTP Live and Carbon Relay use this technique.
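The try-each-server-in-turn pattern described above can be sketched as follows. The connect() probe here is a hypothetical stand-in for whatever call a real client library would make; only the failover loop is the point.

```python
def first_available(servers, connect):
    # Try each configured server in turn; return the first answer.
    last_error = None
    for server in servers:
        try:
            return connect(server)
        except ConnectionError as err:
            last_error = err  # remember the failure, try the next server
    raise last_error or ConnectionError("no server configured")

def connect(server):
    # Hypothetical probe, for illustration only.
    if server == "service1":
        raise ConnectionError("service1 is down")
    return f"connected to {server}"

print(first_available(["service1", "service2"], connect))
# connected to service2
```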
Native Active/Active
Some services have native internal mechanisms to share states between the servers. Contacting any server will have the same effect on the state of the service, or the service has an internal mechanism to get the right server. These services behave this way:
- Ceph RADOS
- MariaDB/Galera
- Consul
Native Active/Passive
These services have only one active server at any time, but the mechanism to select the active server is internal to the service. This means all servers are launched in the same way, and not by an external agent like Keepalived or Pacemaker/Corosync. Services using this technique are:
- Ceph MDS (POSIX CephFS server)
- Slurm Controller
- Omni-Path Fabric Manager or InfiniBand Subnet Manager
Controlled Active/Passive
The service can only have one active server at any one time, and this fail-over must be controlled by an external service. In the current configuration, the only service requiring this setup is:
- NFS HA Server
Conventions
In order to restrain the complexity of the configuration of a Scibian HPC cluster, some naming and architecture conventions have been defined. Multiple components of the software stack expect these conventions to be followed in order to operate properly. These conventions are actually rather close to HPC cluster standards, so they should not seem very constraining.
- The operating system short hostname of the nodes must have the following format: <prefix><role><id>. This is required by the association logic used in Puppet-HPC to map a node to its unique Puppet role. This point is fully explained in the role section of the Puppet-HPC reference documentation.
- The FQDN[3] hostnames of the nodes must be similar to their network names on the administration network. In other words, the IP address resolution on the cluster of the FQDN hostname of a node must return the IP address of this node on the administration network.
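The <prefix><role><id> convention can be checked with a simple pattern. The sketch below uses an assumed prefix "gen" and an assumed lowercase-letters-only role name; the real association logic lives in Puppet-HPC and may differ.

```python
import re

def split_hostname(hostname, prefix="gen"):
    # Split a short hostname into its <prefix><role><id> components.
    m = re.fullmatch(rf"({prefix})([a-z]+)(\d+)", hostname)
    if not m:
        raise ValueError(f"{hostname} does not follow <prefix><role><id>")
    return {"prefix": m.group(1), "role": m.group(2), "id": int(m.group(3))}

print(split_hostname("genbatch1"))
# {'prefix': 'gen', 'role': 'batch', 'id': 1}
```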
Advanced Topics
Boot sequence
Initial common steps
The servers of the cluster can boot either from their hard disks or from the network, using the PXE protocol. In normal operations, all service nodes are installed on hard disks, and all nodes of the userspace (compute and frontend nodes) use the network method to boot the diskless image. A service node can use the PXE method when it is being installed.
The boot sequence between the power-on event on the node and the boot of the initrd is identical regardless of the system booted (installer or diskless image).
The steps of the boot sequence are described on the diagram below:
When a node boots from its network device, after a few (but generally time-consuming) internal checks, it loads and runs the PXE ROM stored in the Ethernet adapter. This ROM first sends a DHCP request to get an IP address and other network parameters. The DHCP server gives it an IP address along with the filename parameter. This filename is the file the PXE ROM downloads using the TFTP protocol. This protocol, which is rather limited and unreliable, is used here because the PXE ROMs commonly available in Ethernet adapters only support this network protocol.
The file to download depends on the type of node or its role. On Scibian HPC clusters using the Puppet-HPC software stack, the required filename for the current node is set in the bootsystem profile, and therefore its value is usually specified in Hiera in the profiles::bootsystem::boot_params hash. It is set to launch the open source iPXE software, because it delivers many powerful features such as HTTP protocol support. This way, it is used as a workaround to hardware PXE ROM limitations.
The virtual machines boot like any other node, except QEMU uses iPXE as the PXE implementation for its virtual network adapters. This means that the virtual machines go directly to this step.
The iPXE bootloader must perform another DHCP request, since the IP settings are lost when the bootloader is loaded. The DHCP server is able to recognize that this request originates from an iPXE ROM. In this case, it sets the filename parameter to an HTTP URL pointing to a CGI written in Python: bootmenu.py. If the DHCP server already knows the originating node and its MAC address (with a statically assigned IP address), it also sends the hostname in the answer. Otherwise, the DHCP request is not honored.
Then, the iPXE bootloader sends a GET HTTP request to this URL, adding its hostname, as given by the DHCP server, to the request parameters.
On the HTTP server side, the Python script bootmenu.py is run as a CGI program. This script parses its configuration file /etc/hpc-config/bootmenu.yaml to get the parameters to properly boot the node: serial console and Ethernet device to use, default boot mode (diskless, installer, etc.). Then it generates an iPXE profile with a menu containing all possible boot entries. Finally, a timeout parameter is added to the iPXE profile.
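The shape of such a generated iPXE profile can be sketched as follows. This is not the actual bootmenu.py code: the entry names, kernel paths and parameters below are invented for illustration, only the iPXE menu/choose structure is the point.

```python
def make_ipxe_menu(default, timeout_ms, entries):
    # Build an iPXE script: a menu of entries, a timed default choice,
    # and one labelled block of commands per entry.
    lines = ["#!ipxe", "menu Boot menu"]
    for name in entries:
        lines.append(f"item {name} {name}")
    lines.append(
        f"choose --default {default} --timeout {timeout_ms} target && goto ${{target}}"
    )
    for name, commands in entries.items():
        lines.append(f":{name}")
        lines.extend(commands)
    return "\n".join(lines)

menu = make_ipxe_menu("diskless", 5000, {
    "diskless": ["kernel http://server/vmlinuz fetch=http://server/image.torrent",
                 "boot"],
    "installer": ["kernel http://server/installer/vmlinuz url=http://server/preseed.cfg",
                  "boot"],
})
print(menu.splitlines()[0])  # #!ipxe
```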
The iPXE bootloader downloads and loads this dynamically generated profile. Without any action from the administrator, iPXE waits for the timeout and loads the default entry set by the Python script.
Note: If the hostname parameter is empty, or if the node cannot be found in the /etc/hpc-config/bootmenu.yaml file, the default choices from the config file are used.
Disk installation
Here is the sequence diagram of a Scibian server installation on disk, right after the PXE boot common steps:
The iPXE ROM downloads the Linux kernel and the initrd archive associated with the boot menu entry. The kernel is then run with all the parameters given in the menu entry, notably the HTTP URL to the preseed file.
The initrd archive contains the Debian Installer program. This program starts by sending a new DHCP request to get an IP address. Then, it downloads the preseed file located at the URL found in the url kernel parameter. This preseed file contains all the answers to the questions asked by the Debian Installer program. This way, the installation process is totally automated and does not require any interaction from the administrator.
During the installation, many Debian packages are retrieved from Debian repositories.
At the end of the installation, Debian Installer runs the commands set in the late_command parameter of the preseed file. On Scibian HPC clusters, this parameter is used to run the following steps:
- Download the hpc-config-apply script through HTTP,
- Run hpc-config-apply inside the chroot environment of the newly installed system.
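These two steps can be sketched as a preseed excerpt. The URL, target path and shell invocation below are hypothetical and differ from the real Scibian HPC configuration; only the late_command structure is the point.

```
# Hypothetical preseed excerpt (d-i preseed syntax).
d-i preseed/late_command string \
    wget -q http://install-server/hpc-config-apply \
         -O /target/usr/local/sbin/hpc-config-apply ; \
    in-target /bin/sh /usr/local/sbin/hpc-config-apply
```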
The detailed functioning of the hpc-config-apply script is not described here, but it involves:
- downloading and installing additional Debian packages depending on the node role,
- executing various types of software,
- and writing various configuration files on the installed system.
Please refer to the hpc-config-apply(1) man page for full documentation on how to use this script.
Finally, when the execution of the commands is over, the server reboots.
Once the servers are installed, they are configured through IPMI with Clara to boot on their disk devices first. Please refer to Clara documentation for further details.
Diskless boot
Here is the sequence diagram of the boot process for diskless nodes, right after the PXE boot common steps:
The iPXE bootloader downloads the Linux kernel and the initrd image defined within the default boot menu entry and runs them with the provided parameters. Among these parameters, there are notably:
- fetch, whose value is an HTTP URL to a torrent file available on the HTTP server of the supercomputer,
- cowsize, whose value is the size of the ramfs filesystem mounted on /lib/live/mount/overlay,
- disk_format: if this parameter is present, the indicated device is formatted on node boot,
- disk_raid: if this parameter is present, a software RAID is created on node boot with the indicated parameters.
Within the initrd image, there are several specific scripts that come from the live-boot, live-torrent and specific Scibian Debian packages. Please refer to the following sub-section, Advanced Topics, Generating diskless initrd, for all explanations about how these scripts have been added to the initramfs image.
These scripts download the torrent file at the URL specified in the fetch parameter, then they launch the ctorrent BitTorrent client. This client extracts from the torrent file the IP addresses of the BitTorrent trackers and the names of the files to download using the BitTorrent protocol.
There is actually one file to download, the SquashFS image, which the client downloads in P2P mode by gathering small chunks from several other nodes. Then, once the file has been fully retrieved, the image is mounted, after executing some preliminary tasks like formatting the disk or setting up a RAID array if indicated in the kernel options passed by the boot menu.
Then, the real init system is started and it launches all the system services. One of these services is hpc-config-apply.service, which runs the hpc-config-apply script.
As for the installation with a disk, how the hpc-config-apply script works is not described here. Please refer to the hpc-config-apply(1) man page for full documentation on this topic.
Finally, the node is ready for production.
Frontend nodes: SSH load-balancing and high-availability
The frontend nodes offer a virtual IP address on the WAN network that features both a highly available and load-balanced SSH service for users to access the HPC cluster. The load-balancing feature automatically distributes users on all available frontend nodes. This load-balancing is operated with persistence, so that users (based on their source IP address) are always redirected to the same frontend node within a time frame. Behind the virtual IP address, the high availability of the SSH service is also ensured in case of outage on a frontend node. These load-balancing and high-availability features are ensured by the Keepalived software.
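The Keepalived setup described above can be sketched as a configuration excerpt. The VIP, interface name and real server addresses below are placeholders (192.0.2.0/24 is a documentation range), not the actual cluster values:

```
# Hypothetical Keepalived excerpt for the SSH virtual server.
vrrp_instance VI_SSH {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.0.2.10
    }
}

virtual_server 192.0.2.10 22 {
    lb_algo rr               # Round-Robin between frontend nodes
    lb_kind DR               # direct routing: only the MAC is rewritten
    persistence_timeout 600  # pin a source IP to one frontend for a while
    real_server 192.0.2.11 22 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
    real_server 192.0.2.12 22 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
```

The lb_kind DR setting matches the MAC-rewriting redirection mechanism described later in this section.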
For security reasons, a firewall is also set up on the frontend nodes to control outgoing network traffic. This firewall service is managed by Shorewall, a high-level configuration tool for Linux netfilter. Because of all the various network flows involved in Keepalived, it must be tightly integrated with the firewall rules. The following diagram illustrates both the network principles behind the high-availability/load-balancing mechanisms and the integration with the software components of the firewall:
The Keepalived software checks all the frontend nodes using the VRRP[4] protocol on the WAN network interfaces (purple arrow in the diagram). This protocol must be allowed in the OUTPUT chain of the firewall so that Keepalived can work properly.
On the master frontend node, the HA virtual IP address is set on the network
interface attached to the WAN network. The Keepalived software configures the
IPVS[5] Linux kernel load-balancer to redirect new TCP
connections with a Round-Robin algorithm. Therefore, a part of the TCP connections is redirected to the sshd
daemon of other frontend nodes (orange arrow in
the diagram). An exception must be specified in the OUTPUT chain of the
firewall to allow these redirected connections.
To perform such redirections, IPVS operates in direct routing mode: in the Ethernet layer of the first packet of the TCP connection, it rewrites the destination MAC address to the address of the real destination frontend. The destination IP address does not change: it is still the virtual IP address.
On the slave frontend nodes, the HA virtual IP address is set on the loopback interface. This is required to make the kernel accept the redirected packets coming from the master frontend node and addressed to the virtual IP address. To avoid endless redirection loops, the IPVS redirection rules are disabled on the slave frontend nodes.
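The mechanisms described above can be sketched with a minimal Keepalived virtual server block. This is an illustrative example, not the actual cluster configuration: the virtual IP 10.3.0.10 and the real server addresses are hypothetical.

```
virtual_server 10.3.0.10 22 {
    delay_loop 6              # interval between service checks, in seconds
    lb_algo rr                # Round-Robin scheduling of new TCP connections
    lb_kind DR                # direct routing: only the MAC address is rewritten
    persistence_timeout 600   # same source IP -> same frontend within the time frame
    protocol TCP

    real_server 10.3.0.1 22 {
        TCP_CHECK {           # green arrow: TCP check on the real IP, port 22
            connect_timeout 3
        }
    }
    real_server 10.3.0.2 22 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
```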
By default, the Linux kernel answers ARP requests coming in on any network
device for any IP address attached to any network device. For example, on a
system with two network devices, eth0 with ip0 and eth1 with ip1, if an ARP
request for ip1 is received on eth0, the kernel positively responds to it with
the MAC address of eth0. Though convenient in many cases, this behaviour is
problematic on the frontend nodes, since the virtual IP address is set on all
of them. Consequently, all frontend nodes would answer the ARP requests coming
from the WAN default gateway. In order to avoid this behaviour, the
net.ipv4.conf.<netif>.arp_ignore
and net.ipv4.conf.<netif>.arp_announce
sysctl Linux kernel parameters, where <netif>
is the network interface
connected to the WAN network, are respectively set to 1 and 2. Please refer to
the Linux documentation for more details on these parameters and their values:
http://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
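For instance, assuming the WAN interface is eth0 (the interface name and file name below are assumptions, not taken from the actual configuration), the corresponding sysctl settings would look like:

```
# /etc/sysctl.d/arp-vip.conf (hypothetical file name)
# Reply to ARP requests only if the target IP is configured on the incoming interface
net.ipv4.conf.eth0.arp_ignore = 1
# Always use the best local address in ARP announcements
net.ipv4.conf.eth0.arp_announce = 2
```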
The Keepalived software also checks periodically if the sshd
service is still
available on all frontend nodes by trying to perform a TCP connection to their
real IP addresses on the TCP/22 port (green arrow in the diagram). An exception
must be present in the OUTPUT chain of the firewall to allow these connections.
There is an unexplained behaviour in the Linux kernel where the Netfilter
conntrack module considers that new TCP connections redirected by IPVS to the
local sshd
daemon have an INVALID connection state. This can be verified with well-placed iptables rules using the LOG target. As a consequence, the TCP SYN/ACK
answer from sshd
is blocked by the OUTPUT chain, since the firewall considers the
connection new and not related to any incoming connection. To work around
this annoying behaviour, an exception has been added in the OUTPUT chain of the
firewall to accept connections with a source port of TCP/22 and a source IP address that is the virtual IP address. This is not totally satisfying in
terms of security, but there is no known easy or obvious way to exploit this
security exception from a user perspective for other purposes.
If a slave frontend node becomes unavailable, Keepalived detects it either with
VRRP checks, or with TCP checks in case only the sshd daemon has crashed. The
IPVS rules are changed dynamically to avoid redirecting new TCP connections to
this failing node.
If the master frontend node becomes unavailable, the Keepalived software selects a new master node within the other frontend nodes. Then, on this new master node, Keepalived restores the IPVS redirection rules (since they were previously disabled to avoid loops) and moves the virtual IP address from the loopback interface to the WAN network interface.
If a frontend node is scheduled to be turned off, it is possible to drain it.
Service nodes: DNS load-balancing and high-availability
This diagram gives an overview of the load-balancing and high-availability mechanisms involved in the DNS service of the Scibian HPC clusters:
On Linux systems, when an application needs to resolve a network hostname, it
usually calls the gethostbyname*()
and getaddrinfo()
functions of the libc.
With a common configuration of the Name Service Switch (in the file
/etc/nsswitch.conf), the libc searches for the IP address in the file
/etc/hosts and then falls back to a DNS resolution. The DNS resolver obtains
the IP address by sending a request to the DNS nameservers specified in the
file /etc/resolv.conf. If this file contains multiple nameservers, the
resolver sends the request to the first nameserver. If it does not get an answer
before the timeout, it sends the request to the second nameserver, and so on.
If the application needs another DNS resolution, the resolver follows the
same logic, always trying the first nameserver in priority. This implies that,
with this default configuration, as long as the first nameserver answers the
requests before the timeout, the other nameservers are never requested and
the load is not balanced.
This behavior can be altered with additional options in the file /etc/resolv.conf:
-
options rotate
: this option tells the libc DNS resolver to spread successive DNS requests of a process over all the configured nameservers. The DNS resolver is stateless and loaded locally by the processes as a library, either as a shared library or statically in the binary. Therefore, the rotation status is local to a process: the first DNS request of a process is always sent to the first nameserver, and the rotation only starts with the second request. Notably, this means that a program which sends one DNS request during its lifetime, launched n times, sends n DNS requests to the first nameserver only. While useful for long-lived programs, this option cannot be considered an efficient and sufficient load-balancing technique on its own. -
options timeout:1
: this option reduces the request timeout from its default value to 1 second. This is useful when a nameserver has an outage, since many processes are otherwise stuck waiting for this timeout, causing significant latency issues. With this option, the libc DNS resolver quickly tries the other nameservers and the side effects of the outage are significantly reduced.
On Scibian HPC clusters, Puppet manages the file /etc/resolv.conf and
ensures these two options are present. It also randomizes the list of
nameservers with the fqdn_rotate()
function of the Puppet stdlib community
module. This function randomizes the order of the elements of an array but uses
the fqdn
fact to ensure the order stays the same for a node with a given FQDN.
That is, each node will get a different random rotation from this function,
but a given node’s result will be the same every time unless its hostname
changes. This prevents the file content from changing with every Puppet run.
With this function, all the DNS nameservers are evenly balanced across the
nodes. Combined with the rotate
option, it forms an efficient load-balancing
mechanism.
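As an illustration, the fqdn_rotate() function from the stdlib module can be applied to the nameserver list as in the following sketch. This is not the actual Puppet-HPC code, and the addresses are example values:

```puppet
# Deterministic per-node shuffle: fqdn_rotate() is seeded with the node's
# fqdn fact, so the resulting order is stable across Puppet runs on a
# given node but differs between nodes.
$nameservers = fqdn_rotate(['10.1.0.101', '10.1.0.102', '10.1.0.103', '10.1.0.104'])
```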
The DNS servers are managed with the bind
daemon on the generic service nodes.
Each generic service node has a virtual IP address managed by a
keepalived
daemon and balanced between all the generic service nodes. The IP
addresses of the nameservers mentioned in the file /etc/resolv.conf on the
nodes are these virtual IP addresses. If a generic service node fails, its
virtual IP address is automatically routed to another generic service node.
In combination with options timeout:1
, this constitutes a reliable failover
mechanism and ensures the high-availability of the DNS service.
Consul and DNS integration
This diagram illustrates how Consul and the DNS servers integrate to provide load-balanced and horizontally scaled network services with high-availability:
The Consul agent daemon can run in two modes: server and client. The cluster of Consul servers maintains the state of the cluster using the Raft protocol. The clients communicate with the servers to detect failures using the gossip protocol. Both agents expose the data of the Consul cluster through an HTTP REST API. On Scibian HPC clusters, the Consul servers run on the generic service nodes while the admin node runs a client agent.
As explained in the Software architecture section, Consul discovers network services on a pool of nodes. The services discovered by Consul on Scibian HPC clusters are hosted on the generic service nodes. Each Consul server is responsible for checking its locally running services, such as an HTTP server for example. The state being constantly shared by all Consul agents, every agent is actually able to tell where the services are available. Consul notably provides a DNS interface. Given a particular virtual hostname referring to a service, Consul can give the IP addresses of the servers currently running this service.
Consul is not designed to operate as a full DNS server. It listens for incoming
requests on an alternative UDP port for a particular sub-domain
virtual.<domain>
, where <domain>
is configurable and depends on the cluster.
On the nodes, the clients are configured to connect to services in this
particular sub-domain, for example http.virtual.<domain>
for the HTTP service.
The DNS requests sent by the clients are received by the bind
daemon through
the virtual IP addresses of the generic service nodes, as explained in
DNS Load-balancing and High-availability section. The
DNS bind
daemon is configured to forward the requests on the virtual
sub-domain to the local Consul agent. The Consul agent answers the DNS request
with the static IP address of the generic service nodes running this service, in
random order.
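This forwarding can be expressed in the bind configuration with a forward zone pointing to the local Consul agent, which serves DNS on its default port 8600. The domain name below is an example value:

```
zone "virtual.foobar.hpc.example.org" {
    type forward;
    forward only;
    forwarders {
        127.0.0.1 port 8600;   // DNS interface of the local Consul agent
    };
};
```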
In this architecture, both the DNS requests to the Consul servers and the services (eg. HTTP) requests are load-balanced on all the generic service nodes in high-availability mode. The same mechanism also applies to APT proxies, Ceph RADOS gateways, and so on.
The Consult
utility is installed on the
admin node to request the current status of the Consul cluster. It connects to
the REST API of the Consul client running locally and prints the status on the
standard output.
Scibian diskless initrd
The scibian-diskless-initramfs-config package
This package contains the necessary configuration in order to build an initramfs disk suitable for Scibian diskless nodes. It depends on the following packages:
-
initramfs-tools
-
live-torrent
-
live-boot
-
mdadm
-
parted
initramfs-tools
Initramfs-tools
is a Debian package that provides tools to create a bootable
initramfs for Linux kernel packages. The initramfs is a compressed cpio
archive. At boot time, the kernel unpacks that archive into RAM, mounts and
uses it as the initial root file system. The mounting of the real root file
system occurs in early user space.
live-boot
The live-boot
package provides support for live systems. It depends on the
live-boot-initramfs-config
package, which is a backend for live-boot in
initramfs config. In particular, it provides the "live" script in
/usr/share/initramfs-tools/scripts/live. This script is copied in the
generated initramfs and can download and unpack live system images used as the
root filesystem for diskless nodes.
live-torrent
The live-torrent
package provides support for BitTorrent downloading for live
systems. It depends on the live-torrent-initramfs-tools
package, which
provides the ctorrent
binary (a BitTorrent client) in the initramfs.
Generating the initramfs
With the packages described above installed on a Scibian system, it is possible to generate an initramfs able to download the root live system image via the BitTorrent protocol.
On a Scibian HPC cluster, it is recommended to use the Clara
tool to generate
the root live system image, and to generate the corresponding initramfs.
It is possible to specify in the Clara configuration file which packages are
mandatory in the image before generating the initramfs.
Here is an example of the "images" section of the Clara configuration file:
[images]
files_to_remove=/etc/udev/rules.d/70-persistent-net.rules,/root/.bash_history,/etc/hostname
etc_hosts=10.0.0.1:service,10.0.0.2:admin1
extra_packages_image=hpc-config-apply,scibian-hpc-compute
packages_initrd=scibian-diskless-initramfs-config
With this configuration, Clara
follows these steps to generate the initramfs:
-
Uncompress the squashfs image
-
Chroot in the directory created
-
Install the packages defined by the
packages_initrd
key in the Clara config file -
Generate the initramfs
-
Do not re-compress the squashfs image
This method guarantees consistency with the kernel in the squashfs image. It also makes it possible, for example, to generate an initramfs for an image based on Scibian 9 from a machine installed with Scibian 8.
Installation procedure
This chapter describes how to install the Scibian HPC cluster software stack on a hardware infrastructure compliant with the reference architecture. The first section gives a quick overview of the main steps of the installation process. There are a few requirements before starting the installation; they are listed in the following sections. Then, the successive steps are described in detail. Finally, the chapter ends with the installation documentation of various optional features.
Overview
The installation process of a Scibian HPC cluster starts with the administration cluster of the reference architecture. The administration cluster is composed of the admin node and a pool of generic service nodes. The generic service nodes run the base services required by all nodes; they are therefore the entry point of the installation procedure.
The first generic service node takes the role of temporary installation node in order to install all the other generic service nodes. When the generic service nodes are fully operational with the base software services stack, the admin node is installed. Then, the process continues with the installation of the service virtual machines and the set of additional services.
Finally, the frontend and compute nodes of the userspace cluster are deployed and all the additional services are set up to make the Scibian HPC cluster fully operational.
Requirements
There are a few requirements before starting up the cluster installation. This section aims to inventory all of these requirements, with example values.
Note
|
For the sake of simplicity, the example values are used all along the rest of the installation procedure documentation in various commands or code excerpts. These example values must be replaced with values corresponding to your environment where appropriate. |
Description | Example |
---|---|
Cluster name |
|
Cluster prefix |
|
Network domain name |
|
Remote Git internal configuration repository (cf. following note) |
|
DNS servers |
|
NTP servers |
|
SMTP servers |
|
LDAP server |
|
Groups of users in LDAP directory |
|
IP networks and adressing plan |
4 IP networks:
|
All MAC adresses |
|
Network interfaces configuration of all the nodes and equipments |
Please refer to the following diagram for an example of generic service network configuration. |
Local block storage configuration of all the nodes |
For generic services nodes:
* |
Note
|
The deployment of a Scibian HPC cluster is mainly based on Puppet-HPC. As explained in the Software Architecture chapter of the Puppet-HPC documentation, it works in combination with an internal configuration repository containing all configuration settings and data specific to your organization. This Git repository does not have to be populated to proceed with the installation. If it is empty, the Internal configuration repository section of this chapter explains how to initialize it from scratch for Puppet-HPC. |
This diagram represents an example network interfaces configuration for the generic service nodes of a Scibian HPC cluster:
Temporary installation node
The first step of the installation process is to install the first generic service node. This node will assume the role of temporary installation node for the other generic service nodes. Before the admin node is installed, all operations (unless explicitly stated) are performed on this temporary installation node.
Base installation
Install the Debian 8 Jessie base system using any of the official Debian
installation media (CD, DVD, USB key, PXE server, etc.) at your convenience.
Configure the network interfaces with static IP addresses in compliance with
the cluster IP addressing plan. Set the hostname following the
architecture conventions, for example: fbservice1
.
Once the node has rebooted on the freshly installed system, add the Scibian 8 APT repositories to the configuration:
# cat <<EOF >/etc/apt/sources.list.d/scibian8.list
deb http://scibian.org/repo/ scibian8 main
EOF
Download and enable Scibian repository keyring:
# wget https://scibian.org/repo/scibian-auto-keyring.pub -O- | apt-key add -
Update the packages repositories local database:
# apt-get update
Install the following Scibian HPC administration node meta-package:
# apt-get install scibian-hpc-admin
Administration environment
All the files manipulated during the installation process will be placed into a
dedicated working directory. The location of this directory is arbitrary, for
example: ~root/install
. This directory will be designated as $ADMIN
in the
following sections of the installation procedure documentation.
# export ADMIN=~root/install
# mkdir $ADMIN && cd $ADMIN
Clone both Puppet-HPC and internal configuration repositories into this dedicated working directory:
# git clone https://github.com/edf-hpc/puppet-hpc.git
# git clone ssh://forge/hpc-privatedata.git
At this stage, the internal repository can be populated with all files and data initially required to install the cluster.
Internal configuration repository
The internal configuration repository required by Puppet-HPC is designed to be shared among multiple clusters in an organization. Therefore, it has to be initialized only for the first cluster installation. Its structure and content are fully explained in the Software Architecture chapter of the Puppet-HPC documentation.
This section provides example configuration snippets to quickstart this internal configuration repository from scratch.
Base directories
If the internal configuration repository is completely empty and initialized from scratch, a few base directories must be created under its root.
Important
|
This step must not be realized if the internal configuration repository is not empty, typically if it has already been initialized for another cluster. |
# cd $ADMIN/hpc-privatedata
# mkdir files hieradata puppet-config
Organization settings
Some settings are common to all HPC clusters of an organization, in particular settings regarding the external services. To avoid duplication of these settings in all HPC cluster configurations, they are defined once in the organization layer of the hiera repository shared by all HPC clusters.
Important
|
This step must be done only once for the organization. It can be skipped safely if the organization layer YAML file already exists. |
Initialize the file $ADMIN/hpc-privatedata/hieradata/org.yaml
with the
following content:
##### Common #####
org: 'company' # lower-case name of the organization
##### DNS #####
domain: "%{hiera('cluster_name')}.hpc.example.org"
profiles::dns::client::search: "%{hiera('domain')} hpc.example.org"
profiles::dns::server::config_options:
forwarders:
- '1.1.1.1'
- '2.2.2.2'
##### NTP #####
profiles::ntp::server::site_servers:
- "ntp1.example.org"
- "ntp2.example.org"
##### APT #####
profiles::cluster::apt_sources:
'jessie':
location: "http://%{hiera('debian_mirror_server')}/%{hiera('debian_mirror_dir')}"
release: 'jessie'
repos: 'main contrib non-free'
required_packages: 'debian-archive-keyring'
pin: '500'
include_src: false
architecture: 'amd64,i386'
'jessie-updates':
location: "http://%{hiera('debian_mirror_server')}/%{hiera('debian_mirror_dir')}"
release: 'jessie-updates'
repos: 'main contrib non-free'
required_packages: 'debian-archive-keyring'
pin: '500'
include_src: false
architecture: 'amd64,i386'
'jessie-security':
location: "http://%{hiera('debian_mirror_server')}/%{hiera('debian_sec_mirror_dir')}"
release: 'jessie/updates'
repos: 'main contrib non-free'
required_packages: 'debian-archive-keyring'
pin: '500'
include_src: false
architecture: 'amd64,i386'
'scibian8':
location: "http://%{hiera('scibian_mirror_server')}/%{hiera('scibian_mirror_dir')}"
release: 'scibian8'
repos: 'main'
required_packages: 'scibian-archive-keyring'
pin: '1000'
include_src: false
architecture: 'amd64,i386'
##### SMTP/Postfix #####
profiles::postfix::relay::config_options:
relay_domains: '$mydestination example.org'
relayhost: 'smtp.example.org'
##### LDAP/SSSD #####
ldap_external: 'ldap.example.org'
profiles::auth::client::sssd_options_domain:
ldap_search_base: 'dc=example,dc=org'
ldap_user_search_base: 'ou=people,dc=example,dc=org'
ldap_group_search_base: 'ou=groups,dc=example,dc=org'
Note
|
This configuration supposes the APT, NTP, SMTP, DNS and LDAP settings are similar on all the HPC clusters of your organization. This might not be true in some specific organization environments. In this case, the settings of the affected services must be defined in the cluster specific layers of the hiera repository instead. |
The example values must be replaced with the settings corresponding to your organization environment.
Cluster directories
Some directories are required to store cluster specific files and settings inside the internal configuration repository. Create these directories with the following command:
# mkdir $ADMIN/hpc-privatedata/puppet-config/$CLUSTER \
        $ADMIN/hpc-privatedata/files/$CLUSTER \
        $ADMIN/hpc-privatedata/hieradata/$CLUSTER \
        $ADMIN/hpc-privatedata/hieradata/$CLUSTER/roles
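These commands rely on a $CLUSTER environment variable holding the cluster name. It is assumed to have been set beforehand, for example:

```shell
# Cluster name used throughout the installation procedure (example value)
export CLUSTER=foobar
```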
Puppet configuration
The hpc-config-push
Puppet-HPC utility expects to find the Puppet and Hiera
configuration files for the cluster under the puppet-config
directory of the
internal configuration repository. Simply copy the example configuration files
provided with Puppet-HPC:
# cp $ADMIN/puppet-hpc/examples/privatedata/{puppet.conf,hiera.yaml} \
     $ADMIN/hpc-privatedata/puppet-config/$CLUSTER/
The hiera.yaml
file notably specifies the layers of YAML files composing the
hiera repository. It can optionally be tuned with additional layers to fit your
needs.
Puppet-HPC requires the cluster name to be a fact (according to
Facter). For this purpose, define the file
$ADMIN/hpc-privatedata/puppet-config/$CLUSTER/hpc-config-facts.yaml
with the
following content:
---
cluster_name: 'foobar'
This YAML file is deployed by the Puppet-HPC hpc-config-{push,apply}
utilities on every
node in the /var/lib/puppet/facts.d
directory. The content of the YAML files
located in this directory is considered as valid custom facts.
Optionally, additional arbitrary custom facts can be defined for the cluster in this file at your convenience.
Cluster definition
The cluster specific layers of the Hiera repository must be initialized with a sufficient description of the HPC cluster. This description is the cluster definition.
Networks definition
A specific layer in the hiera repository stack is dedicated to all the networks
settings of the HPC cluster. This layer is defined in file
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/network.yaml
. Initialize this file
with the following content:
network::ib_enable: false
network::opa_enable: true
net::administration::ipnetwork: '10.1.0.0'
net::administration::netmask: '255.255.255.0'
net::administration::prefix_length: '/24'
net::administration::broadcast: '10.1.0.255'
net::lowlatency::ipnetwork: '10.4.0.0'
net::lowlatency::prefix_length: '/24'
net::management::ipnetwork: '10.2.0.0'
net::management::netmask: '255.255.255.0'
net::management::broadcast: '10.2.0.255'
net_topology:
wan:
name: 'WAN'
prefixes: 'wan'
ipnetwork: '10.3.0.0'
netmask: '255.255.255.0'
prefix_length: '/24'
gateway: '10.3.0.254'
broadcast: '10.3.0.255'
ip_range_start: '10.3.0.1'
ip_range_end: '10.3.0.254'
firewall_zone: 'wan'
administration:
name: 'CLUSTER'
ipnetwork: '10.1.0.0'
netmask: '255.255.255.0'
prefix_length: '/24'
gateway: '10.1.0.0' # fbproxy
broadcast: '10.1.0.255'
ip_range_start: '10.1.0.1'
ip_range_end: '10.1.0.254'
firewall_zone: 'clstr'
pool0:
ip_range_start: '10.1.0.1'
ip_range_end: '10.1.0.254'
lowlatency:
name: 'LOWLATENCY'
prefixes: 'opa'
ipnetwork: '10.4.0.0'
netmask: '255.255.255.0'
prefix_length: '/24'
broadcast: '10.4.0.255'
ip_range_start: '10.4.0.1'
ip_range_end: '10.4.0.254'
firewall_zone: 'clstr'
management:
name: 'MGT'
prefixes: 'mgt'
ipnetwork: '10.2.0.0'
netmask: '255.255.255.0'
prefix_length: '/24'
broadcast: '10.2.0.255'
ip_range_start: '10.2.0.1'
ip_range_end: '10.2.0.254'
firewall_zone: 'clstr'
bmc:
name: 'BMC'
prefixes: 'bmc'
ipnetwork: '10.2.0.0'
netmask: '255.255.255.0'
prefix_length: '/24'
broadcast: '10.2.0.255'
ip_range_start: '10.2.0.1'
ip_range_end: '10.2.0.254'
firewall_zone: 'clstr'
network::bonding_options:
bond0:
slaves:
- eth1
- eth2
options: 'mode=802.3ad primary=eth2 miimon=100 updelay=200 downdelay=200'
description: 'service nodes on adm/mgt networks'
network::bridge_options:
br0:
ports:
- bond0
description: 'service nodes on administration network'
br1:
ports:
- eth3
description: 'service nodes on management network'
br2:
ports:
- eth4
description: 'service nodes on WAN network'
master_network:
fbservice1:
fqdn: "fbservice1.%{hiera('domain')}"
networks:
administration:
'DHCP_MAC': 'aa:bb:cc:dd:ee:00'
'IP': '10.1.0.1'
'device': 'br0'
'hostname': 'fbservice1'
lowlatency:
'IP': '10.4.0.1'
'device': 'ib0'
'hostname': 'opafbservice1'
bmc:
'DHCP_MAC': 'aa:bb:cc:dd:ee:01'
'IP': '10.2.0.101'
'hostname': 'bmcfbservice1'
management:
'IP': '10.2.0.1'
'device': 'br1'
'hostname': 'mgtfbservice1'
wan:
'IP': '10.3.0.1'
'device': 'br2'
'hostname': 'wanfbservice1'
fbservice2:
fqdn: "fbservice2.%{hiera('domain')}"
networks:
administration:
'DHCP_MAC': 'aa:bb:cc:dd:ee:02'
'IP': '10.1.0.2'
'device': 'br0'
'hostname': 'fbservice2'
lowlatency:
'IP': '10.4.0.2'
'device': 'ib0'
'hostname': 'opafbservice2'
bmc:
'DHCP_MAC': 'aa:bb:cc:dd:ee:03'
'IP': '10.2.0.102'
'hostname': 'bmcfbservice2'
management:
'IP': '10.2.0.2'
'device': 'br1'
'hostname': 'mgtfbservice2'
wan:
'IP': '10.3.0.2'
'device': 'br2'
'hostname': 'wanfbservice2'
fbservice3:
fqdn: "fbservice3.%{hiera('domain')}"
networks:
administration:
'DHCP_MAC': 'aa:bb:cc:dd:ee:04'
'IP': '10.1.0.3'
'device': 'br0'
'hostname': 'fbservice3'
lowlatency:
'IP': '10.4.0.3'
'device': 'ib0'
'hostname': 'opafbservice3'
bmc:
'DHCP_MAC': 'aa:bb:cc:dd:ee:05'
'IP': '10.2.0.103'
'hostname': 'bmcfbservice3'
management:
'IP': '10.2.0.3'
'device': 'br1'
'hostname': 'mgtfbservice3'
wan:
'IP': '10.3.0.3'
'device': 'br2'
'hostname': 'wanfbservice3'
fbservice4:
fqdn: "fbservice4.%{hiera('domain')}"
networks:
administration:
'DHCP_MAC': 'aa:bb:cc:dd:ee:06'
'IP': '10.1.0.4'
'device': 'br0'
'hostname': 'fbservice4'
lowlatency:
'IP': '10.4.0.4'
'device': 'ib0'
'hostname': 'opafbservice4'
bmc:
'DHCP_MAC': 'aa:bb:cc:dd:ee:07'
'IP': '10.2.0.104'
'hostname': 'bmcfbservice4'
management:
'IP': '10.2.0.4'
'device': 'br1'
'hostname': 'mgtfbservice4'
wan:
'IP': '10.3.0.4'
'device': 'br2'
'hostname': 'wanfbservice4'
#### High-Availability Virtual IP addresses ######
vips:
service1:
network: 'administration'
ip: '10.1.0.101'
hostname: 'vipfbservice1'
router_id: 161
master: 'fbservice1'
members: 'fbservice[1-4]'
notify:
common:
- "%{hiera('private_files_dir')}/ha/rndc_reload.sh"
advert_int: '2'
service2:
network: 'administration'
ip: '10.1.0.102'
hostname: 'vipfbservice2'
router_id: 162
master: 'fbservice2'
members: 'fbservice[1-4]'
notify:
common:
- "%{hiera('private_files_dir')}/ha/rndc_reload.sh"
service3:
network: 'administration'
ip: '10.1.0.103'
hostname: 'vipfbservice3'
router_id: 163
master: 'fbservice3'
members: 'fbservice[1-4]'
notify:
common:
- "%{hiera('private_files_dir')}/ha/rndc_reload.sh"
service4:
network: 'administration'
ip: '10.1.0.104'
hostname: 'vipfbservice4'
router_id: 164
master: 'fbservice4'
members: 'fbservice[1-4]'
notify:
common:
- "%{hiera('private_files_dir')}/ha/rndc_reload.sh"
The first two parameters, network::{ib,opa}_enable
, define which high-performance interconnect
network technology is used in the HPC cluster (InfiniBand or Intel
Omni-Path).
The net::*
parameters and the net_topology
hash define the addressing
maps of the various IP networks of the cluster, along with some metadata such
as the network hostname prefixes, the DHCP dynamic pools and the firewall zones
associated with these IP networks.
The network::bonding_options
and network::bridge_options
hashes respectively
define all the network interfaces bondings and virtual bridges involved on the
nodes of the HPC cluster. Note that these settings are global to all nodes.
The master_network
hash defines the list of nodes and all their network
interfaces with the associated IP addresses, network hostnames and, where
relevant, MAC addresses (on the administration and bmc networks).
Finally, the vips
hash defines the virtual highly-available IP addresses (VIP)
managed by the nodes of the HPC cluster.
Initially, the YAML file must contain all the IP network definitions and the network settings of all the generic service nodes with their VIP.
General cluster settings
The cluster specific general parameters and services settings are located in
file $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
. Initialize this
file with the following content:
cluster_prefix: 'fb' # Prefix starting all nodes hostnames
user_groups: # Array of user groups allowed to access to the cluster
- 'grpusers1'
- 'grpusers2'
###### DNS Cluster settings ######
profiles::dns::client::nameservers:
- '10.1.0.101' # VIP addresses of generic service nodes on administration
- '10.1.0.102' # network
- '10.1.0.103'
- '10.1.0.104'
profiles::dns::server::config_options:
listen-on:
- '127.0.0.1'
- '10.1.0.1' # Static IP addresses of generic service nodes on
- '10.1.0.2' # administration network
- '10.1.0.3'
- '10.1.0.4'
- '10.1.0.101' # VIP addresses of generic service nodes on administration
- '10.1.0.102' # network
- '10.1.0.103'
- '10.1.0.104'
- '10.2.0.1' # Static IP addresses of generic service nodes on
- '10.2.0.2' # management network
- '10.2.0.3'
- '10.2.0.4'
###### Bootsystem ######
boot_params:
defaults: # default boot params to all nodes
domain: "%{hiera('domain')}"
kernel_opts: 'persistence formatcow nosmap'
cowsize: '2G'
dhcp_timeout: '120'
diskinstall_server: "%{hiera('server_web_boot')}"
diskless_server: '10.1.0.50:3137' # VIP address of P2P servers
nameserver: '10.1.0.101' # VIP address of 1st generic
# service on administration network
boot_os: 'scibian8_ram'
boot_dev: 'eth0'
console: 'ttyS0,115200n8'
fbservice[1-4]: # generic service nodes specific
# boot params
boot_os: 'scibian8_disk'
ipxebin: 'ipxe_noserial.bin'
# iPXE ROM downloaded by TFTP. There are 2 versions of the same ROM: with
# software serial console enabled for virtual machines and w/o software
# serial console for bare metal machines, as the serial console is transparently
# redirected to the BMC virtual serial port by the BIOS/UEFI on these nodes.
boottftp::hpc_files:
"%{hiera('tftp_dir')}/ipxe_serial.bin":
source: "%{hiera('private_files_dir')}/boot/ipxe/ipxe_serial.bin"
"%{hiera('tftp_dir')}/ipxe_noserial.bin":
source: "%{hiera('private_files_dir')}/boot/ipxe/ipxe_noserial.bin"
# Configuration files downloaded by HTTP by the debian-installer: the partition
# schema and hpc-config-apply configuration file.
boothttp::hpc_files:
"%{hiera('website_dir')}/disk/scibian8/partition-schema":
source: "%{hiera('private_files_dir')}/boot/disk-installer/scibian8/partition-schema"
"%{hiera('website_dir')}/disk/scibian8/hpc-config.conf":
source: "file:///etc/hpc-config.conf"
# The tarball containing the Debian network installer
boothttp::archives:
"%{hiera('website_dir')}/disk/scibian8/netboot.tar.gz":
source: "%{hiera('private_files_dir')}/boot/disk-installer/scibian8/netboot.tar.gz"
extract_path: "%{hiera('website_dir')}/disk/scibian8"
extract: true
boothttp::install_options:
scibian8:
### Localization
'd-i debian-installer/locale string': 'en_US.UTF-8'
'd-i debian-installer/language string': 'en'
'd-i debian-installer/country string': 'en'
'd-i localechooser/supported-locales multiselect': 'en_US.UTF-8'
# Keyboard selection.
'd-i keyboard-configuration/xkb-keymap select': 'en'
### Apt setup
'apt-setup-udeb apt-setup/services-select multiselect': 'none'
'apt-mirror-setup apt-setup/contrib boolean': 'true'
'apt-mirror-setup apt-setup/no_mirror boolean': 'false'
'apt-mirror-setup apt-setup/non-free boolean': 'true'
'apt-mirror-setup apt-setup/use_mirror boolean': 'true'
### Mirror settings
'd-i mirror/protocol select': 'http'
'd-i mirror/country string': 'manual'
'd-i mirror/http/hostname string': "%{hiera('debian_mirror_server')}"
'd-i mirror/http/directory string': "/%{hiera('debian_mirror_dir')}"
'd-i mirror/suite string': 'jessie'
'd-i debian-installer/allow_unauthenticated boolean': 'true'
### Network configuration
'd-i netcfg/enable boolean': 'true'
'd-i netcfg/choose_interface select': 'auto'
'd-i netcfg/use_autoconfig boolean': 'true'
'd-i netcfg/no_default_route boolean': 'true'
'd-i netcfg/get_domain string': "%{hiera('domain')}"
# VIP addresses of the generic service nodes on the administration network
'd-i netcfg/get_nameservers string': '10.1.0.101 10.1.0.102 10.1.0.103 10.1.0.104'
'd-i netcfg/dhcp_failed': 'note'
'd-i netcfg/dhcp_options select Configure network': 'manually'
'd-i netcfg/dhcp_timeout string': '25'
'd-i netcfg/dhcpv6_timeout string': '3'
# Hardware detect
'd-i hw-detect/start_pcmcia boolean': 'false'
'd-i hw-detect/load_firmware boolean': 'true'
### Partitioning
'd-i partman/early_command string': >
env -u http_proxy wget http://%{hiera('server_web_boot')}/disk/scibian8/partition-schema -O /partition-schema;
debconf-set partman-auto/disk "$(list-devices disk | head -n1)"
'd-i partman-auto/method string': 'lvm'
'd-i partman-partitioning/choose_label select': 'gpt'
'd-i partman-partitioning/confirm_write_new_label boolean': 'true'
'd-i partman-auto/expert_recipe_file string': '/partition-schema'
'partman-basicfilesystems partman-basicfilesystems/no_mount_point boolean': 'false'
'd-i partman-lvm/device_remove_lvm boolean': 'true'
'd-i partman-auto/purge_lvm_from_device boolean': 'true'
'd-i partman-md/device_remove_md boolean': 'true'
'd-i partman-md/confirm_nochanges boolean': 'true'
'd-i partman-lvm/confirm boolean': 'true'
'd-i partman-lvm/confirm_nooverwrite boolean': 'true'
'd-i partman-auto/choose_recipe select': 'multiraid'
'd-i partman-md/confirm boolean': 'true'
'd-i partman-md/confirm_nooverwrite boolean': 'true'
'd-i partman/confirm_write_new_label boolean': 'true'
'd-i partman/confirm_nooverwrite boolean': 'true'
'd-i partman/choose_partition select': 'finish'
'd-i partman/confirm boolean': 'true'
### Boot loader
'grub-installer grub-installer/only_debian boolean': 'true'
'grub-installer grub-installer/bootdev string': 'default'
'grub-installer grub-installer/skip boolean': 'false'
'grub-installer grub-installer/multipath boolean': 'true'
'grub-installer grub-installer/with_other_os boolean': 'false'
'grub-installer grub-installer/make_active boolean': 'true'
# Don't create a normal user account.
'user-setup-udeb passwd/make-user boolean': 'false'
'user-setup-udeb passwd/root-login boolean': 'true'
'user-setup-udeb passwd/shadow boolean': 'true'
### Clock and time zone setup
'clock-setup clock-setup/utc boolean': 'true'
'clock-setup clock-setup/system-time-changed boolean': 'false'
'clock-setup clock-setup/hwclock-wait boolean': 'false'
'clock-setup clock-setup/ntp boolean': 'false'
'tzsetup-udeb time/zone select': 'GMT'
### Base system installation
'bootstrap-base base-installer/kernel/image select': 'linux-image-amd64'
'bootstrap-base base-installer/initramfs-tools/driver-policy select': 'dep'
'base-installer base-installer/install-recommends boolean': 'false'
### Package selection
'tasksel tasksel/first multiselect': 'standard, ssh-server'
'd-i pkgsel/include string': >
apt-transport-https
scibian-archive-keyring
hpc-config-apply
# Additional repositories, local[0-9] available
'd-i apt-setup/local0/repository string': "http://%{hiera('scibian_mirror_server')}/%{hiera('scibian_mirror_dir')} scibian8 main contrib non-free"
'd-i pkgsel/upgrade select': 'full-upgrade'
# Other packages ##
'd-i popularity-contest/participate boolean': 'false'
'console-setup console-setup/charmap47 select': 'UTF-8'
'console-setup console-setup/codesetcode string': 'Lat15'
'console-setup console-setup/store_defaults_in_debconf_db boolean': 'true'
'console-setup console-setup/fontsize-fb47 select': '8x16'
'console-setup console-setup/fontsize string': '8x16'
### Late command ###
'd-i preseed/late_command string': >
/bin/in-target sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config;
env -u http_proxy wget http://%{hiera('server_web_boot')}/disk/scibian8/hpc-config.conf -O /target/etc/hpc-config.conf;
/bin/in-target mkdir -p /var/lib/puppet/facts.d;
/bin/in-target /usr/bin/env -u http_proxy hpc-config-apply -vvv;
# Root password, either in clear text
'd-i passwd/root-password-crypted password': "%{hiera('profiles::cluster::root_password_hash')}"
# Avoid that last message about the install being complete.
'd-i finish-install/reboot_in_progress note': ''
###### DHCP ######
profiles::dhcp::default_options:
- 'INTERFACES=br0 br1' # bridge interfaces of the generic service nodes on the
# administration and management networks
profiles::dhcp::includes:
adm-subnet:
'pool_name': 'subnet'
'subnet_name': 'adm'
'tftp': true
'pool':
'use-host-decl-names': 'on'
'deny': 'unknown-clients'
'max-lease-time': '1800'
# Range of IP addresses on the administration network
'range': '10.1.0.1 10.1.0.254'
'include': '/etc/dhcp/adm_subnet'
mgt-subnet:
'pool_name': 'subnet'
'subnet_name': 'mgt'
'tftp': false
'pool':
'use-host-decl-names': 'on'
'deny': 'unknown-clients'
'max-lease-time': '1800'
# Range of IP addresses on the management network
'range': '10.2.0.1 10.2.0.254'
'include': "/etc/dhcp/mgt_subnet"
profiles::dhcp::sharednet:
'name': "%{hiera('cluster_name')}net"
'subnet':
- 'name': 'adm'
'network': "%{hiera('net::administration::ipnetwork')}"
'netmask': "%{hiera('net::administration::netmask')}"
'domain-name': "%{hiera('domain')}"
# VIP addresses of the generic service nodes on the administration network
'domain-name-servers': '10.1.0.101, 10.1.0.102, 10.1.0.103, 10.1.0.104'
'broadcast': "%{hiera('net::administration::broadcast')}"
- 'name': 'mgt'
'network': "%{hiera('net::management::ipnetwork')}"
'netmask': "%{hiera('net::management::netmask')}"
'domain-name': "%{hiera('domain')}"
# Static IP addresses of the generic service nodes on the management network
'domain-name-servers': '10.2.0.1, 10.2.0.2, 10.2.0.3, 10.2.0.4'
'broadcast': "%{hiera('net::management::broadcast')}"
In addition to some general parameters (cluster_prefix and user_groups), the initial version of this file notably contains the configuration of the base services required to install nodes on disk (DNS, TFTP, HTTP, DHCP, Debian installer, etc.).
Also, in order to prevent users from accessing the cluster during the installation process, it is recommended to enable the maintenance mode in this file:
profiles::access::maintenance_mode: true
Service role
The Puppet role service associated with the generic service nodes must be defined with the corresponding profiles. This is achieved by initializing the file $ADMIN/hpc-privatedata/hieradata/$CLUSTER/roles/service.yaml with the following content:
profiles:
# common
- profiles::cluster::common
- profiles::systemd::base
- profiles::ssmtp::client
- profiles::network::base
- profiles::dns::client
- profiles::access::base
- profiles::openssh::server
- profiles::openssh::client
- profiles::environment::base
- profiles::environment::limits
- profiles::environment::service
- profiles::log::client
# HW host
- profiles::hardware::ipmi
- profiles::hardware::admin_tuning
# service
- profiles::hpcconfig::push
- profiles::hpcconfig::apply
- profiles::ntp::server
- profiles::openssh::client_identities
- profiles::clush::client
- profiles::ha::base
- profiles::http::secret
- profiles::apt::proxy
- profiles::log::server
- profiles::dns::server
- profiles::bootsystem::server
- profiles::dhcp::server
profiles::network::gw_connect: 'wan'
The first profiles (below the common comment) are common to all nodes of the cluster. The profiles after the HW host comment are common to all bare metal nodes. The last profiles, after the service comment, carry the base services hosted by the generic service nodes.
The last parameter, profiles::network::gw_connect, defines which network's gateway the nodes use as their default route.
Authentication and encryption keys
Configuration encryption keys
As explained in the Puppet-HPC reference documentation (chapter Software Architecture, section Sensitive Data Encryption), the sensitive data stored in the internal configuration repository are encrypted using two keys:
-
asymmetric PKCS7 key pair for encrypting values in Hiera,
-
symmetric AES key for encrypting files.
These keys must be generated following the bootstrap procedure available in the Puppet-HPC Reference Documentation.
SSH host keys
The SSH host keys must stay consistent between node re-installations and/or diskless reboots. To ensure this, the SSH host keys are generated in the cluster’s files directory of the internal configuration repository before their first installation and/or diskless boot.
First, create the hostkeys directory:
# cd $ADMIN # mkdir -p hpc-privatedata/files/$CLUSTER/hostkeys
During this step, the known_hosts file is also synchronized with the generated keys. This file is stored in hpc-privatedata/files/$CLUSTER/ssh/known_hosts. To build it, the script needs to know the local domain name of the cluster; by default, it uses the local domain of the machine where it runs. If this is not correct, you must provide the domain explicitly.
Run the following script:
# puppet-hpc/scripts/sync-ssh-hostkeys.sh hpc-privatedata $CLUSTER $CLUSTER.$NETDOMAIN
This script ensures that all nodes present in the master_network hash have valid SSH host keys.
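Under the hood, the synchronization essentially amounts to generating a key pair per node under hostkeys/ and registering the public keys in known_hosts. The following is a minimal sketch of that logic, with illustrative cluster, node and domain names; the real script handles all key types and iterates over the whole master_network hash:

```shell
#!/bin/sh
# Sketch of per-node host key generation, in the spirit of
# sync-ssh-hostkeys.sh. The cluster, node and domain names are assumptions.
set -e
CLUSTER=foobar
NODE=fbservice1
DOMAIN=hpc.example.com
KEYDIR="hpc-privatedata/files/$CLUSTER/hostkeys/$NODE"
KNOWN_HOSTS="hpc-privatedata/files/$CLUSTER/ssh/known_hosts"
mkdir -p "$KEYDIR" "$(dirname "$KNOWN_HOSTS")"
# Generate an RSA host key for the node if it does not exist yet
[ -f "$KEYDIR/ssh_host_rsa_key" ] || \
    ssh-keygen -t rsa -b 2048 -N '' -C "$NODE" \
        -f "$KEYDIR/ssh_host_rsa_key" >/dev/null
# Register the public key in known_hosts under the usual names of the node
printf '%s,%s %s\n' "$NODE" "$NODE.$DOMAIN" \
    "$(cut -d' ' -f1-2 "$KEYDIR/ssh_host_rsa_key.pub")" >> "$KNOWN_HOSTS"
```

Because the keys live in the configuration repository, re-installing or diskless-rebooting a node re-deploys the same host key, so clients never see a changed host key warning.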
SSH root key
For password-less SSH authentication from the admin and generic service nodes to all the other nodes of the cluster, an SSH authentication key pair is deployed for root on the nodes.
First, create the rootkeys sub-directory in the cluster's files directory of the internal configuration repository:
# cd $ADMIN # mkdir -p hpc-privatedata/files/$CLUSTER/rootkeys
Then, generate the key pair:
# ssh-keygen -t rsa -b 2048 -N '' -C root@$CLUSTER \ -f hpc-privatedata/files/$CLUSTER/rootkeys/id_rsa_root
If needed, adjust the key type and size. Encode the private key with the following helper script provided by Puppet-HPC:
# puppet-hpc/scripts/encode-file.sh hpc-privatedata $CLUSTER \ hpc-privatedata/files/$CLUSTER/rootkeys/id_rsa_root generating encoded file hpc-privatedata/files/foobar/rootkeys/id_rsa_root.enc
Do not forget to remove the generated unencrypted private key:
# rm hpc-privatedata/files/$CLUSTER/rootkeys/id_rsa_root
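The encoding performed by encode-file.sh relies on the cluster symmetric AES key. As a rough illustration of the principle only (the actual cipher settings and key handling are those of the helper script; the filenames and passphrase below are made up):

```shell
# Illustration of a symmetric encryption roundtrip similar in spirit to
# encode-file.sh. All names and the cipher parameters are assumptions.
set -e
echo 'dummy-private-key-material' > id_rsa_root    # stand-in for the real key
echo 'dummy-cluster-aes-passphrase' > cluster.key  # stand-in for the AES key
# Encrypt, then remove the clear-text file, keeping only the .enc version
openssl enc -aes-256-cbc -pbkdf2 -salt -pass file:cluster.key \
    -in id_rsa_root -out id_rsa_root.enc
rm id_rsa_root
# Puppet later decrypts the file on the nodes with the same key:
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:cluster.key \
    -in id_rsa_root.enc -out id_rsa_root.dec
```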
Finally, publish the public key with the following parameter in the cluster-specific layer of the hiera repository, $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml:
openssh::server::root_public_key: <pubkey>
Initial files
A few files are required in the cluster's files directory of the internal configuration repository to deploy the generic service nodes.
First, create the needed sub-directories:
mkdir -p $ADMIN/hpc-privatedata/files/$CLUSTER/boot/ipxe \ $ADMIN/hpc-privatedata/files/$CLUSTER/boot/disk-installer/scibian8 \ $ADMIN/hpc-privatedata/files/$CLUSTER/ha
Download the Debian network installer archive:
# wget http://deb.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz \ -O $ADMIN/hpc-privatedata/files/$CLUSTER/boot/disk-installer/scibian8/netboot.tar.gz
The iPXE ROM must be built from the sources downloaded from the iPXE website:
# wget https://git.ipxe.org/ipxe.git/snapshot/HEAD.tar.gz -O $ADMIN/ipxe.tar.gz # tar -C $ADMIN -xzf $ADMIN/ipxe.tar.gz # cd $ADMIN/ipxe-*/src # make bin/undionly.kpxe # cp bin/undionly.kpxe $ADMIN/hpc-privatedata/files/$CLUSTER/boot/ipxe/ipxe_noserial.bin
The ipxe_serial.bin variant must be generated after modifying the hard-coded serial console parameters in the iPXE source code, as documented in the official iPXE documentation.
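For instance, assuming the serial console is enabled through the CONSOLE_SERIAL define in config/console.h (with the port and speed set in config/serial.h), the edit before rebuilding could look like this sketch, applied here to a sample header line:

```shell
# Sketch: uncomment the CONSOLE_SERIAL define before rebuilding the ROM.
# A sample header line is used here; the real file is
# ipxe-*/src/config/console.h in the extracted iPXE sources.
set -e
cat > console.h <<'EOF'
//#define CONSOLE_SERIAL /* Serial port console */
EOF
sed -i 's|^//\(#define[[:space:]]*CONSOLE_SERIAL\)|\1|' console.h
grep CONSOLE_SERIAL console.h
# then rebuild and install the serial variant:
#   make bin/undionly.kpxe
#   cp bin/undionly.kpxe \
#      $ADMIN/hpc-privatedata/files/$CLUSTER/boot/ipxe/ipxe_serial.bin
```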
The iPXE menu boot entries are generated dynamically, depending on the requesting node name, by the CGI Python script $ADMIN/hpc-privatedata/files/$CLUSTER/boot/cgi/bootmenu.py:
#!/usr/bin/python
import ClusterShell.NodeSet
import cgi
import yaml
import sys
import syslog
sys.stderr = sys.stdout
params = cgi.FieldStorage()
target = params.getvalue('node')
print("Content-Type: text/plain")
print("")
config = yaml.safe_load(open('/etc/hpc-config/bootmenu.yaml'))
def lookup_param(key):
global config, target
for nodeset in config:
if nodeset == 'defaults':
continue
if target in ClusterShell.NodeSet.expand(nodeset) and key in config[nodeset]:
return config[nodeset][key]
return config['defaults'][key]
ipxe_menu = {
'params' : {
'timeout' : '3000',
'boot' : 'boot || goto failed',
'goto' : 'goto start',
},
'menu' : {
'scibian8_ram' : {
'label' : 'Run Scibian8 in RAM',
'url' : "http://" + lookup_param('diskless_server') + "/scibian8",
'initrd' : '${base-url}/initrd-3.16.0-4-amd64',
'kernel' : "${base-url}/vmlinuz-3.16.0-4-amd64 initrd=initrd-3.16.0-4-amd64 " +
" console=" + lookup_param('console') +
" ethdevice=" + lookup_param('boot_dev') +
" ethdevice-timeout=" + lookup_param('dhcp_timeout') +
" cowsize=" + lookup_param('cowsize') +
" transparent_hugepage=always " +
" disk-format=" + lookup_param('disk_format') +
" disk-raid=" + lookup_param('disk_raid') +
" boot=live " +
" fetch=http://" + lookup_param('diskless_server') + "/scibian8/scibian8.squashfs.torrent " +
lookup_param('kernel_opts'),
},
'scibian8_disk' : {
'label' : 'Install Scibian8',
'url' : "http://"+lookup_param('diskinstall_server')+"/disk/scibian8",
'initrd' : '${base-url}/debian-installer/amd64/initrd.gz',
'kernel' : "${base-url}/debian-installer/amd64/linux initrd=initrd.gz console="+lookup_param('console')+" url=${base-url}/install_config auto interface="+lookup_param('boot_dev')+" locale=en_US console-keymaps-at/keymap=fr keyboard-configuration/xkb-keymap=fr languagechooser/language-name=English netcfg/get_domain="+lookup_param('domain')+" netcfg/get_nameservers="+lookup_param('nameserver')+" netcfg/no_default_route=true debian-installer/add-kernel-opts=console="+lookup_param('console')+" priority=critical scibian-installer",
},
'discovery' : {
'label' : 'Discover/Rescue system',
'url' : "http://"+lookup_param('diskless_server')+"/scibian8",
'initrd' : '${base-url}/initrd-3.16.0-4-amd64',
'kernel' : "${base-url}/vmlinuz-3.16.0-4-amd64 initrd=initrd-3.16.0-4-amd64 console="+lookup_param('console')+" boot=discovery interface="+lookup_param('boot_dev'),
},
},
}
print("#!ipxe")
print("#####################################################################")
print("# IPXE Linux menu specification")
print("#####################################################################")
print("set esc:hex 1b # ANSI escape character - ^[")
print("set cls ${esc:string}[2J # ANSI clear screen sequence - ^[[2J")
print("menu Please choose an operating system to boot")
print("echo ${cls}")
menudft = None
for key in ipxe_menu["menu"]:
print("item "+key+" "+ipxe_menu["menu"][key]["label"])
if key == lookup_param('boot_os'):
menudft = key
if not menudft:
menudft = 'discovery'
print("choose --default "+menudft+" --timeout "+ipxe_menu["params"]["timeout"]+" target && goto ${target}")
syslog.syslog("generating menu for " + target + " with default kernel: " + ipxe_menu["menu"][menudft]["kernel"] )
for key in ipxe_menu["menu"]:
print("")
print(":"+key)
print("set base-url "+ipxe_menu["menu"][key]["url"])
print("initrd "+ipxe_menu["menu"][key]["initrd"])
print("kernel "+ipxe_menu["menu"][key]["kernel"])
print(ipxe_menu["params"]["boot"])
print(ipxe_menu["params"]["goto"])
The Debian installer partman utility requires a partition schema with rules and constraints to create the disk partitions. This file is located in $ADMIN/hpc-privatedata/files/$CLUSTER/boot/disk-installer/scibian8/partition-schema. In this partman-auto expert recipe format, each stanza gives a minimal size, a priority and a maximal size (in MB), followed by the filesystem type. Here is a complete example for this file:
512 512 1024 free
$iflabel{ gpt }
$reusemethod{ }
method{ efi }
format{ } .
256 256 256 ext2
$primary{ }
$bootable{ }
method{ format }
format{ }
use_filesystem{ }
filesystem{ ext2 }
mountpoint{ /boot } .
1 250000 -1 ext4
$defaultignore{ }
$primary{ }
method{ lvm }
vg_name{ sysvg } .
20000 100000 100000 ext4
$lvmok{ }
in_vg{ sysvg }
lv_name{ root }
method{ format }
format{ }
use_filesystem{ }
filesystem{ ext4 }
mountpoint{ / } .
30000 100000 100000 ext4
$lvmok{ }
in_vg{ sysvg }
lv_name{ var }
method{ format }
format{ }
use_filesystem{ }
filesystem{ ext4 }
mountpoint{ /var } .
2000 10000 20000 ext4
$lvmok{ }
in_vg{ sysvg }
lv_name{ tmp }
method{ format }
format{ }
use_filesystem{ }
filesystem{ ext4 }
mountpoint{ /tmp } .
2000 100000 2000+50% linux-swap
$lvmok{ }
in_vg{ sysvg }
lv_name{ swap }
method{ swap }
format{ } .
Finally, the version of the bind9 DNS server available in Debian Jessie is not able to automatically listen for incoming connections on IP addresses configured after the daemon has started. This is a problem when virtual IP addresses are moved to fail over a failing generic service node. This behaviour is worked around with the following script, located in $ADMIN/hpc-privatedata/files/$CLUSTER/ha/rndc_reload.sh:
#!/bin/sh
# This is needed with bind 9.5 to make it:
#
# - start listening on new interfaces
# - stop listening on interfaces that don't exist anymore
#
# This becomes useless starting from bind >= 9.10 thanks to new
# automatic-interface-scan feature.
rndc reconfig
This script is run when the virtual IP addresses move from one generic service node to another. It asks the bind9 daemon to reload its configuration, which also makes it listen on the new IP addresses.
Generic service nodes
Temporary installation services
The goal of this section is to configure the Temporary Installation Services on the Temporary Installation Node. This process is done in two steps:
-
A First Run using only local files or external services
-
A Second Run reconfiguring the Temporary Installation Node to use the services setup during the First Run with values that will also be used to install the other generic nodes remotely.
The reference architecture (with Ceph/S3) does not use the local Apache server in the final configuration. Some other architectures use a local Apache virtual host instead of Ceph/S3 (profiles::http::system). If this is the case, /var/www/html can be replaced by hiera('profiles::http::system::docroot') and http.service by web-system.service in the private_files_dir URL.
First Run
Consul is not available yet because the consul cluster needs a quorum to work, and the quorum can only be achieved once more than half of the generic service nodes are configured. The DNS server is therefore configured to return only the temporary installation node for all requests on the consul domain. This is done by temporarily adding the following parameters in the file $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml:
dns::server::virtual_relay: false
install_server_ip: '10.1.0.1' # static IP address of the temporary
# installation node on the administration
# network
Technically speaking, these parameters make bind authoritative on the virtual DNS zone before the Consul service discovery utility is available. The virtual zone contains all the symbolic names of the network services (e.g. http.virtual). This way, all services are directed to the temporary installation node, with the IP address provided in the install_server_ip parameter.
The first run also needs to work properly without a local DNS server and without a local repository cache proxy, since these services are configured during this first run. Local repositories must also be disabled during the first run.
private_files_dir: "file:///var/www/html/hpc-config/%{environment}/latest/files"
apt::proxy_host: ''
profiles::dns::client::nameservers:
- '172.16.1.1' # External DNS server
hpcconfig::push::config_options:
global:
cluster: "%{hiera('cluster_name')}"
mode: 'posix'
destination: "/var/www/html/hpc-config"
The configuration is pushed to local files while the temporary installation node is used. The settings above configure this, but the first push must use a configuration created manually in the file /etc/hpc-config/push.conf:
[global]
environment=production
version=latest
destination=/var/www/html/hpc-config
cluster=<CLUSTER NAME>
mode=posix
The directory where the keys were generated cannot be used as a key source for apply, because it is overwritten during the apply; it must therefore be copied before doing the apply. To deploy the configuration of the temporary installation node, run the following commands:
# cd $ADMIN # hpc-config-push # cp -a /etc/puppet/secure/keys keys # hpc-config-apply --source file:///var/www/html/hpc-config \ --keys-source=file://$ADMIN/keys \ --verbose # rm -rf $ADMIN/keys
Second Run
At this stage, the unique temporary installation node should be working, so the configuration is modified to use it. The following hiera keys must be changed in the file $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml:
private_files_dir: "http://http.service.%{hiera('virtual_domain')}/hpc-config/%{environment}/latest/files"
# apt:proxy_host is left to the default values
# profiles::dns::client::nameservers is configured as detailed above
Check the new settings with:
# cd $ADMIN # hpc-config-push # hpc-config-apply --verbose
If the two commands run without error, the initial setup succeeded.
At this stage, the temporary installation services are fully configured and available to install the other generic service nodes.
Base system installation
The other generic service nodes must now be rebooted in PXE mode to run the Debian installer and configure the base system:
for BMC in $(nodeset -e bmcfbservice[2-4]); do
ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC chassis bootdev pxe
ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC power reset
done
Replace the BMC credentials with the appropriate values.
Once the base system is fully installed, the nodes reboot and become reachable over SSH. Check this with:
# clush -bw fbservice[2-4] uname --------------- fbservice[2-4] (3) --------------- Linux
Ceph deployment
Deployment is based on a tool called ceph-deploy. This tool performs the setup steps of a Ceph component on a node. It is only used for the initial setup of the Ceph cluster. Once the cluster is running, the configuration is reported into the Puppet configuration in case it must be re-deployed.
The reference configuration uses one disk (or hardware RAID LUN) to hold the system (/dev/sda) and another to hold the Ceph OSD data and journal (/dev/sdb). Three or five nodes must be chosen to host the MON and MDS services; the remaining nodes are used only as OSD and RadosGW nodes.
The ceph-deploy utility generates authentication keys for Ceph. Once the cluster is running, these keys are manually collected and encrypted with eyaml to be included in the hiera configuration.
In the following example, the MONs and MDS are installed on nodes fbservice[2-4], while the node fbservice1 only hosts OSD and RGW services.
Packages installation
Install the ceph-deploy utility and the S3 CLI client s3cmd:
# apt-get install ceph-deploy s3cmd
The deployment of Ceph cluster generates a bunch of files (keyrings, configuration file, etc). Create a temporary directory to store these files:
# mkdir ~root/ceph-deploy # cd ~root/ceph-deploy
Install the Ceph software stack on all nodes of the Ceph cluster:
# ceph-deploy install --no-adjust-repos $(nodeset -e fbservice[1-4])
Cluster bootstrap
Initialize the cluster with the first MON server of the Ceph cluster as parameter:
# ceph-deploy new fbservice2
Install the admin credentials:
# ceph-deploy admin $(nodeset -e fbservice[1-4])
Create the MON servers:
# ceph-deploy mon add $(nodeset -e fbservice[2-4])
Create the OSD servers:
# ceph-deploy disk zap $(nodeset -e fbservice[1-4]:sdb) # ceph-deploy osd prepare $(nodeset -e fbservice[1-4]:sdb)
Create the MDS servers:
# ceph-deploy mds create $(nodeset -e fbservice[2-4])
Check the Ceph cluster status:
# ceph status
The command must report HEALTH_OK.
RadosGW
Enable RadosGW with the following command:
# ceph-deploy rgw create $(nodeset -e fbservice[1-4])
Libvirt RBD pool
The virtual machines use a specific libvirt storage pool to store their disk images. This libvirt storage pool uses Ceph RBD, so a specific Ceph pool is necessary. This is not handled by ceph-deploy:
# ceph osd pool create libvirt-pool 64 64
If the cluster has five OSDs or more, the numbers of PGs and PGPs can be set to 128 instead of 64.
The client credentials must be manually generated:
# ceph auth get-or-create client.libvirt \ mon 'allow r' \ osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool'
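These credentials are later consumed by libvirt. Purely as an illustrative sketch (the secret UUID and monitor hostnames below are assumptions, and the actual pool definition is handled by the Puppet profiles), a libvirt RBD storage pool referencing them could be declared as:

```shell
# Sketch of a libvirt RBD storage pool definition using the client.libvirt
# credentials. The secret UUID and monitor host names are illustrative.
cat > libvirt-pool.xml <<'EOF'
<pool type='rbd'>
  <name>rbd-pool</name>
  <source>
    <name>libvirt-pool</name>
    <host name='fbservice2'/>
    <host name='fbservice3'/>
    <host name='fbservice4'/>
    <auth type='ceph' username='libvirt'>
      <secret uuid='00000000-0000-0000-0000-000000000000'/>
    </auth>
  </source>
</pool>
EOF
# Definition and activation would then be (not run here):
# virsh pool-define libvirt-pool.xml
# virsh pool-start rbd-pool
```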
CephFS initialization
In high-availability mode, the Slurm controller requires a POSIX filesystem shared between the primary and backup controllers. In the Scibian HPC cluster reference architecture, CephFS is used for this filesystem. Create this CephFS filesystem with the following commands:
# ceph osd pool create cephfs_data 64 64 pool 'cephfs_data' created # ceph osd pool create cephfs_metadata 64 64 pool 'cephfs_metadata' created # ceph fs new cephfs cephfs_metadata cephfs_data new fs with metadata pool 15 and data pool 14
If the cluster has five OSDs or more, the numbers of PGs can be set to 128 for data and metadata pool.
RadosGW S3
A user must be created to access the RadosGW S3 API:
# radosgw-admin user create --uid=hpc-config --display-name="HPC Config push"
This command gives an access_key and a secret_key that can be used by hpc-config-push(1) or s3cmd(1).
Create a temporary configuration file for s3cmd with these keys:
# cat <<EOF >~/.s3cfg [default] access_key=<ACCESS_KEY> secret_key=<SECRET_KEY> host_bucket=%(bucket)s.service.virtual:7480 host_base=rgw.service.virtual:7480 EOF
Use the access_key and the secret_key provided by the radosgw-admin user create command.
To work properly with Amazon S3 tools and the consul DNS, RadosGW must be configured to accept requests on rgw.service.virtual and on <bucket_name>.service.virtual. To configure this, it is necessary to re-define the default realm, region and zonegroup.
The region is configured by writing a JSON region file (rgw-region.json):
{"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": ["rgw.service.virtual", "service.virtual"],
"master_zone": "",
"zones": [
{"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false"}
],
"placement_targets": [
{"name": "default-placement",
"tags": [] }],
"default_placement": "default-placement"
}
Inject this region file into RadosGW configuration:
# radosgw-admin realm create --rgw-realm=default --default # radosgw-admin region set --infile rgw-region.json # radosgw-admin region default --rgw-zonegroup=default # radosgw-admin zonegroup add --rgw-zonegroup=default --rgw-zone=default
Define default zone and zonegroup:
# radosgw-admin zone default --rgw-zone=default # radosgw-admin zonegroup default --rgw-zonegroup=default
Update the period:
# radosgw-admin period get # radosgw-admin period update --commit
After this step, the RadosGW daemons must be restarted on every node:
clush -w fbservice[1-4] 'systemctl restart ceph-radosgw@rgw.${HOSTNAME}.service'
Finally, create the bucket with s3cmd:
# s3cmd mb --acl-public s3://s3-system Bucket 's3://s3-system/' created
Transfer to Hiera
When the Ceph cluster is fully initialized, its configuration must be reported into the cluster-specific layer of the hiera repository, $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml. Here is an example of a Ceph cluster configuration in Hiera:
profiles::ceph::config_options:
global:
fsid: '9ce06246-210f-4f59-866a-1f384852da38'
mon_initial_members: 'fbservice2, fbservice3, fbservice4'
mon_host: 'fbservice2, fbservice3, fbservice4'
auth_cluster_required: 'cephx'
auth_service_required: 'cephx'
auth_client_required: 'cephx'
ceph_client_admin_key: <eyaml encrypted key>
ceph::keyrings:
client.admin.keyring:
client.admin:
key: "%{hiera('ceph_client_admin_key')}"
ceph.mon.keyring:
'mon.':
key: <eyaml encrypted key>
'caps mon': 'allow *'
ceph.bootstrap-mds.keyring:
client.bootstrap-mds:
key: <eyaml encrypted key>
ceph.bootstrap-osd.keyring:
client.bootstrap-osd:
'key': <eyaml encrypted key>
ceph.bootstrap-rgw.keyring:
client.bootstrap-rgw:
key: <eyaml encrypted key>
ceph::osd_config:
fbservice1:
id: '0'
device: '/dev/sdb1'
key: <eyaml encrypted key>
fbservice2:
id: '1'
device: '/dev/sdb1'
key: <eyaml encrypted key>
fbservice3:
id: '2'
device: '/dev/sdb1'
key: <eyaml encrypted key>
fbservice4:
id: '3'
device: '/dev/sdb1'
key: <eyaml encrypted key>
ceph::mon_config:
- fbservice2
- fbservice3
- fbservice4
ceph::mds_config:
- fbservice2
- fbservice3
- fbservice4
ceph::mds_keyring:
fbservice2:
mds.fbservice2:
key: <eyaml encrypted key>
fbservice3:
mds.fbservice3:
key: <eyaml encrypted key>
fbservice4:
mds.fbservice4:
key: <eyaml encrypted key>
ceph::rgw_config:
- fbservice1
- fbservice2
- fbservice3
- fbservice4
ceph::rgw_client_keyring:
fbservice1:
client.rgw.fbservice1:
key: <eyaml encrypted key>
fbservice2:
client.rgw.fbservice2:
key: <eyaml encrypted key>
fbservice3:
client.rgw.fbservice3:
key: <eyaml encrypted key>
fbservice4:
client.rgw.fbservice4:
key: <eyaml encrypted key>
Many parameters in this example must be replaced to match the exact Ceph cluster configuration.
First, the fsid can be obtained with:
# ceph fsid
The bootstrap keys have been generated in the temporary Ceph deployment directory:
# cd ~root/ceph-deploy # cat ceph.client.admin.keyring # cat ceph.mon.keyring # cat ceph.bootstrap-mds.keyring # cat ceph.bootstrap-osd.keyring # cat ceph.bootstrap-rgw.keyring
The OSD keys can be gathered with:
# clush -bw fbservice[1-4] 'cat /var/lib/ceph/osd/ceph-?/keyring'
The MDS keys can be gathered with:
# clush -bw fbservice[2-4] 'cat /var/lib/ceph/mds/ceph-${HOSTNAME}/keyring'
The RGW keys can be gathered with:
# clush -bw fbservice[1-4] 'cat /var/lib/ceph/radosgw/ceph-rgw.${HOSTNAME}/keyring'
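The keyring files are INI-like; the base64 secret to encrypt for hiera can be extracted with awk, for example as below. The keyring content is a made-up sample, and the eyaml command is left commented since it needs the cluster PKCS7 keys:

```shell
# Extract the base64 secret from a Ceph keyring file (sample content below),
# ready to be encrypted with eyaml for the hiera configuration.
set -e
cat > sample.keyring <<'EOF'
[client.rgw.fbservice1]
    key = AQBSdFVdAAAAABAAx0lN8rq2mVO2Jz5ZVTi2Hw==
EOF
KEY=$(awk -F' = ' '$1 ~ /key$/ {print $2}' sample.keyring)
echo "$KEY"
# eyaml encrypt --pkcs7-public-key=keys/public_key.pkcs7.pem -s "$KEY"
```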
Then, add the ceph::server profile into the service role:
--- a/hpc-privatedata/hieradata/foobar/roles/service.yaml
+++ b/hpc-privatedata/hieradata/foobar/roles/service.yaml
@@ -28,5 +28,6 @@
- profiles::bootsystem::server
- profiles::dhcp::server
- profiles::environment::limits
+ - profiles::ceph::server
profiles::network::gw_connect: 'wan'
Then push the new configuration:
# hpc-config-push
Theoretically, at this stage, the Ceph cluster can be fully configured with Puppet. It is strongly recommended to check this by re-installing one of the generic service nodes (except the temporary installation node) before going further. Note that in case of generic service node re-installation after the initial configuration, bootstrap steps may be necessary:
-
MDS and RadosGW, those services have no state outside of Rados, so no additional bootstrap is necessary
-
Mon Always necessary to bootstrap
-
OSD Must be bootstraped if the OSD volume (
/dev/sdb
) is lost.
Please refer to the bootstrap procedure section for all details.
Once the re-installation of a generic service node with Ceph is validated, the
ceph-deploy
temporary directory can be removed from the temporary installation
node:
# rm -r ~root/ceph-deploy
Network restrictions
By default with Puppet-HPC, the Ceph daemons' sockets are bound to the administration network interface of the generic service nodes. This is done on purpose, for security reasons, to avoid access to the Ceph cluster from outside of the administration network (typically from the wan network, outside of the cluster).
However, this can be easily changed by overriding this parameter in the hiera repository:
profiles::ceph::listen_network: 'wan' # Make ceph listen the wan network for
# connections, default is 'administration'
It is also possible to totally disable the network restriction settings on Ceph daemons with:
ceph::restrict_network: false
Migrate configuration
At this stage, the configuration was published on the HTTP server of the temporary installation node. Now that the Ceph cluster is available, the configuration can be migrated to the S3 backend of Ceph RadosGW.
Edit the cluster specific layer of the hiera repository $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml to make the hpc-config utilities push and download the configuration in a dedicated Ceph S3 bucket:
hpcconfig::apply::config_options:
  DEFAULT:
    source:
      value: "http://s3-system.service.%{hiera('virtual_domain')}:7480/hpc-config"
    keys_source:
      value: "http://secret.service.%{hiera('virtual_domain')}:%{hiera('secret_port')}"
hpcconfig::push::config_options:
  global:
    cluster: "%{hiera('cluster_name')}"
    mode: 's3'
  s3:
    access_key: "%{hiera('s3::access_key')}"
    secret_key: "%{hiera('s3::secret_key')}"
    bucket_name: 's3-system'
    host: 'rgw.service.virtual'
    port: '7480'
s3::access_key: DEC::PKCS7[<access_key>]!
s3::secret_key: DEC::PKCS7[<secret_key>]!
s3cmd::config_options:
  default:
    access_key: "%{hiera('s3::access_key')}"
    secret_key: "%{hiera('s3::secret_key')}"
    host_bucket: '%(bucket)s.service.virtual:7480'
    host_base: 'rgw.service.virtual:7480'
Push and apply configuration:
# hpc-config-push
# clush -bg service hpc-config-apply -v
At this stage, the hpc-config
utilities are configured to use Ceph. Try this
new configuration by running them once again:
# hpc-config-push
# clush -bg service hpc-config-apply -v
Finally, tune the private_files_dir
in file
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
to use Ceph S3 as well:
-private_files_dir: "http://http.service.%{hiera('virtual_domain')}/hpc-config/%{environment}/latest/files"
+private_files_dir: "http://s3-system.service.%{hiera('virtual_domain')}:7480/hpc-config/%{environment}/latest/files"
Run a final check:
# hpc-config-push
# clush -bg service hpc-config-apply -v
Starting from this moment, all the cluster configuration is hosted in the Ceph distributed storage cluster.
Consul deployment
All the base services are now deployed on all the generic service nodes. It is time to enable load-balancing and high-availability with the Consul service discovery tool.
Consul needs a shared secret key to encrypt communication between its distributed agents. Generate this key with:
# dd if=/dev/urandom bs=16 count=1 2>/dev/null | base64
The output of this command must be reported in the cluster’s specific layer of
the hiera repository $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
with eyaml:
consul::key: DEC::PKCS7[<key>]!
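For this version of Consul, the gossip encryption key is 16 random bytes encoded in base64. As a sanity check before storing the key with eyaml, the generated value can be verified to decode back to exactly 16 bytes (a minimal sketch, not part of the Puppet-HPC tooling):

```shell
# Generate a candidate Consul gossip key (16 random bytes, base64-encoded)
KEY=$(dd if=/dev/urandom bs=16 count=1 2>/dev/null | base64)

# A valid key decodes back to exactly 16 bytes
test "$(echo "$KEY" | base64 -d | wc -c)" -eq 16 && echo "key OK: $KEY"
```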
Add consul::server
profile to the service role:
--- a/hpc-privatedata/hieradata/foobar/roles/service.yaml
+++ b/hpc-privatedata/hieradata/foobar/roles/service.yaml
@@ -29,5 +29,6 @@
- profiles::dhcp::server
- profiles::environment::limits
- profiles::ceph::server
+ - profiles::consul::server
profiles::network::gw_connect: 'wan'
Then, run Puppet on all services nodes:
# hpc-config-push && clush -bg service hpc-config-apply -v
Check that all the generic service nodes are members of the Consul cluster with this command:
# clush --pick -Ng service consul members
Node        Address        Status  Type    Build  Protocol  DC
fbservice1  10.1.0.1:8301  alive   server  0.6.4  2         foobar
fbservice2  10.1.0.2:8301  alive   server  0.6.4  2         foobar
fbservice3  10.1.0.3:8301  alive   server  0.6.4  2         foobar
fbservice4  10.1.0.4:8301  alive   server  0.6.4  2         foobar
The output should report that all the services nodes are members and alive.
Remove dns::server::virtual_relay
and install_server_ip
parameters from
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
:
--- a/hpc-privatedata/hieradata/foobar/cluster.yaml
+++ b/hpc-privatedata/hieradata/foobar/cluster.yaml
@@ -225,8 +225,3 @@
# Static IP addresses of the generic service nodes on the management network
'domain-name-servers': '10.2.0.1, 10.2.0.2, 10.2.0.3, 10.2.0.4'
'broadcast': "%{hiera('net::management::broadcast')}"
-
-dns::server::virtual_relay: false
-install_server_ip: '10.1.0.1' # static IP address of the temporary
-                              # installation node on the administration
-                              # network
With this new configuration, the Bind DNS server relays all DNS requests on the virtual zone to the Consul DNS interface.
Push and apply the new configuration:
# hpc-config-push && clush -bg service hpc-config-apply -v
Finally, check that DNS requests on the virtual zone are managed by Consul with:
# dig +short http.service.virtual
10.1.0.4
10.1.0.2
10.1.0.3
The output must report multiple generic service nodes static IP addresses in random order.
Temporary installation node sweep
Since the beginning of the installation process, the temporary installation node hosts installation files and services required to install the other generic service nodes. Now, all the other generic service nodes host the same files and services. Finally, the temporary installation node must be re-installed to be strictly identical to the other generic service nodes in terms of configuration.
Note
The disks of the temporary installation node are going to be formatted and all data hosted on this node will be lost. It is probably time to back up all manual modifications made on this node and push them to the remote internal configuration Git repository.
Reboot the node in PXE mode through its BMC:
# export BMC=bmcfbservice1
# ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC chassis bootdev pxe
# ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC power reset
Wait for the network installation to proceed and the node to reboot on the system freshly installed on its disks.
Admin node
Once the Service nodes are fully configured (Ceph, DNS, Consul, DHCP, TFTP, HTTP for boot…), the cluster is able to reinstall any physical or virtual machine with load-balancing and high-availability.
The next node to install is the admin node, the central point of HPC cluster administration.
Base system
Add the admin role by creating the file
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/roles/admin.yaml
with the
following content:
profiles:
# common
- profiles::cluster::common
- profiles::systemd::base
- profiles::ssmtp::client
- profiles::network::base
- profiles::dns::client
- profiles::access::base
- profiles::openssh::server
- profiles::openssh::client
- profiles::environment::base
- profiles::environment::limits
- profiles::environment::service
- profiles::log::client
# HW host
- profiles::hardware::ipmi
- profiles::hardware::admin_tuning
# admin
- profiles::hpcconfig::push
- profiles::hpcconfig::apply
- profiles::ntp::client
- profiles::openssh::client_identities
- profiles::clush::client
- profiles::consul::client
- profiles::conman::client
- profiles::clara::base
- profiles::ceph::client
- profiles::s3::s3cmd
- profiles::jobsched::client
profiles::network::gw_connect: 'wan'
profiles::environment::service::packages:
- scibian-hpc-admin
The profiles listed after the admin comment carry the software required on the admin node. The profiles::environment::service::packages parameter has a specific value for this role in order to install the admin meta-package.
Append the node definition in the master_network
hash, for example:
master_network:
  [...]
  fbadmin1:
    fqdn: "fbadmin1.%{hiera('domain')}"
    networks:
      administration:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:08'
        'IP': '10.1.0.10'
        'device': 'eth0'
        'hostname': 'fbadmin1'
      management:
        'IP': '10.2.0.10'
        'device': 'eth1'
        'hostname': 'mgtfbadmin1'
      lowlatency:
        'IP': '10.4.0.10'
        'device': 'ib0'
        'hostname': 'opafbadmin1'
      bmc:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:09'
        'IP': '10.2.0.110'
        'hostname': 'bmcfbadmin1'
      wan:
        'IP': '10.2.0.10'
        'device': 'eth2'
        'hostname': 'wanfbadmin1'
Adjust the node boot parameters in the boot_params
hash, for example:
boot_params:
  [...]
  fbadmin1:
    boot_os: 'scibian8_disk'
    boot_dev: 'eth0'
    console: 'ttyS0,115200n8'
    ipxebin: 'ipxe_serial.bin'
Synchronize SSH host keys:
# puppet-hpc/scripts/sync-ssh-hostkeys.sh hpc-privatedata $CLUSTER
Push and apply the new configuration:
# hpc-config-push && clush -bg service hpc-config-apply -v
And reboot the node in PXE mode to proceed with the network installation:
# export BMC=bmcfbadmin1
# ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC chassis bootdev pxe
# ipmitool -I lanplus -U ADMIN -P ADMIN -H $BMC power reset
Wait for the network installation to proceed. Once the installation is over, the node reboots on the system freshly installed on its disks and becomes available through SSH. Starting from this point, all the following operations of the installation process are performed from this admin node.
Administration environment
The administration environment must be re-created following the same instructions given in the temporary installation node administration environment section.
The Clara utility is available on the admin node. Its ipmi plugin can be configured with this small snippet added with eyaml to the cluster specific layer of the hiera repository:
##### Clara #####
clara::ipmi_options:
  conmand: "%{hiera('cluster_prefix')}%{my_conman_server}1"
  prefix: 'bmc'
clara::password_options:
  ASUPASSWD: "%{hiera('cluster_decrypt_password')}"
  IMMUSER: 'ADMIN'
  IMMPASSWORD: DEC::PKCS7[<password>]!
Push and apply configuration on the admin node:
# hpc-config-push && hpc-config-apply -v
Then, the clara ipmi plugin can be used as explained in its documentation (man clara-ipmi(1)).
Service virtual machines
On Scibian HPC clusters, the additional services are hosted inside virtual machines for more flexibility and better resource partitioning. These service virtual machines run on the generic service nodes, where they are managed by the Libvirt service. The distributed instances of Libvirt are controlled centrally from the admin node with the Clara utility. The following sub-sections explain how to set up these software components.
Libvirt settings
The Libvirt service must create various virtual networks to connect the virtual machines to the HPC cluster and a storage pool on Ceph RDB interface to store the virtual disks of the virtual machines. These virtual resources are setup with the following configuration in the cluster specific layer of the hiera repository:
virt_ceph_uuid: '<uuid>'
profiles::virt::networks:
  'administration':
    'mode': 'bridge'
    'interface': 'br0'
  'management':
    'mode': 'bridge'
    'interface': 'br1'
  'wan':
    'mode': 'bridge'
    'interface': 'br2'
profiles::virt::secrets:
  'client.libvirt':
    'type': 'ceph'
    'uuid': "%{hiera('virt_ceph_uuid')}"
    'value': DEC::PKCS7[<key>]!
profiles::virt::pools:
  'rbd-pool':
    'type': 'rbd'
    'hosts':
      - 'fbservice2'
      - 'fbservice3'
      - 'fbservice4'
    'auth':
      'type': 'ceph'
      'username': 'libvirt'
      'uuid': "%{hiera('virt_ceph_uuid')}"
In this example, the following values must be replaced:
- <key> is given by the following command:
# ceph auth get-key client.libvirt
- <uuid> is an arbitrary UUID[6] to identify the secret uniquely. For example, it can be generated with this command:
# python -c 'import uuid; print uuid.uuid1()'
Push and apply configuration on the generic service nodes:
# hpc-config-push && clush -bg service hpc-config-apply
Clara configuration
Clara has dedicated configuration for its virt plugin. This configuration is set with the following two hashes in the cluster specific layer of the hiera repository:
clara::virt_options:
  'nodegroup:default':
    'default': 'true'
    'nodes': 'fbservice1,fbservice2,fbservice3,fbservice4'
  'pool:default':
    'default': 'false'
  'pool:rbd-pool':
    'default': 'true'
    'vol_pattern': '{vm_name}_{vol_role}'
  'template:default':
    'default': 'true'
    'xml': 'domain_default_template.xml'
    'vol_roles': 'system'
    'vol_role_system_capacity': '60000000000'
    'networks': 'administration'
clara::virt_tpl_hpc_files:
  '/etc/clara/templates/vm/domain_default_template.xml':
    source: "%{hiera('private_files_dir')}/virt/domain_default_template.xml"
The clara::virt_options hash notably specifies the list of generic service nodes that host the virtual machines, and the domain templates and parameters associated with each service virtual machine. For the moment, only the default domain template and parameters are set. The second hash, clara::virt_tpl_hpc_files, defines the templates of Libvirt XML domain definitions. In this example, there is one default domain XML template for all virtual machines, which should be fine for most Scibian HPC clusters.
The domain XML template must be located in
$ADMIN/hpc-privatedata/files/$CLUSTER/virt/domain_default_template.xml
. Here
is a full example of this file:
<domain type='kvm'>
<name>{{ name }}</name>
<memory unit='KiB'>{{ memory_kib }}</memory>
<currentMemory unit='KiB'>{{ memory_kib }}</currentMemory>
<vcpu placement='static'>{{ core_count }}</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='x86_64' machine='pc-i440fx-2.1'>hvm</type>
<bootmenu enable='yes'/>
<boot dev='hd'/>
<boot dev='network'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<cpu mode='custom' match='exact'> </cpu>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<pm>
<suspend-to-mem enabled='no'/>
<suspend-to-disk enabled='no'/>
</pm>
<devices>
<emulator>/usr/bin/kvm</emulator>
<disk type='network' device='disk'>
<source protocol='rbd' name='{{ volumes.system.path }}'>
<host name='<ip_mon_server_1>' />
<host name='<ip_mon_server_2>' />
<host name='<ip_mon_server_3>' />
</source>
<auth username='libvirt'>
<secret type='ceph' uuid='<uuid>'/>
</auth>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
</disk>
<disk type='block' device='cdrom'>
<driver name='qemu' type='raw'/>
<backingStore/>
<target dev='hda' bus='ide'/>
<readonly/>
<alias name='ide0-0-0'/>
</disk>
<controller type='usb' index='0' model='ich9-ehci1'>
<alias name='usb0'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci1'>
<alias name='usb0'/>
<master startport='0'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci2'>
<alias name='usb0'/>
<master startport='2'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci3'>
<alias name='usb0'/>
<master startport='4'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<controller type='ide' index='0'>
<alias name='ide0'/>
</controller>
<controller type='virtio-serial' index='0'>
<alias name='virtio-serial0'/>
</controller>
{% for network_name, network in networks.iteritems() %}
<interface type='network'>
<mac address='{{ network.mac_address }}'/>
<source network='{{ network_name }}'/>
<model type='virtio'/>
</interface>
{% endfor %}
<!--
<serial type='tcp'>
<source mode='bind' host='{{ serial_tcp_host }}' service='{{ serial_tcp_port }}'/>
<protocol type='telnet'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
-->
<serial type='pty'>
<target port='0'/>
<alias name='serial0'/>
</serial>
<channel type='spicevmc'>
<target type='virtio' name='com.redhat.spice.0'/>
</channel>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<graphics type='spice' port='5901' autoport='yes' listen='127.0.0.1'>
<listen type='address' address='127.0.0.1'/>
</graphics>
<sound model='ich6'>
<alias name='sound0'/>
</sound>
<video>
<model type='qxl' ram='65536' vram='65536' heads='1'/>
<alias name='video0'/>
</video>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir0'/>
</redirdev>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir1'/>
</redirdev>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir2'/>
</redirdev>
<redirdev bus='usb' type='spicevmc'>
<alias name='redir3'/>
</redirdev>
<memballoon model='virtio'>
<alias name='balloon0'/>
</memballoon>
<rng model='virtio'>
<backend model='random'>/dev/random</backend>
<alias name='rng0'/>
</rng>
</devices>
</domain>
In this example, the following values must be replaced:
- <ip_mon_server_*> are the static IP addresses of the Ceph MON servers on the administration network.
- <uuid> is the UUID for the Libvirt Ceph RBD secret generated in the previous sub-section.
Deploy these new settings by pushing and applying the configuration on the admin node:
# hpc-config-push && hpc-config-apply -v
Virtual machine definitions
Now that Libvirt and the Clara virt plugin are properly set up, the various service virtual machines can be defined. The steps to define the service virtual machines are mostly generic and common to all of them. As an example for this documentation, the two service virtual machines fbdoe[1-2] will be defined.
The first step is to define the boot_params
of the virtual machines in
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
:
boot_params:
  [...]
  fbdoe[1-2]:
    boot_os: 'scibian8_disk'
    ipxebin: 'ipxe_noserial.bin'
Optionally, in the same file, an additional domain template and parameters association can be appended to the clara::virt_options for these new virtual machines, if the default domain parameters are not appropriate:
clara::virt_options:
  [...]
  'template:proxy':
    'vm_names': 'fbdoe[1-2]'
    'xml': 'domain_default_template.xml'
    'vol_roles': 'system'
    'vol_role_system_capacity': '60000000000'
    'networks': 'administration,wan'
    'core_count': '16'
    'memory_kib': '16777216'
In this example, the following settings are overridden from the defaults:
- the virtual block storage device has a size of 60GB,
- 2 network devices are attached to the administration and wan networks,
- 16 virtual CPU cores,
- 16GB of RAM.
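The core_count and memory_kib parameters map to the <vcpu> and <memory> elements of the domain XML template. Since memory_kib is expressed in KiB, the value can be checked with simple shell arithmetic:

```shell
# 16777216 KiB divided twice by 1024 gives the size in GiB
echo "$((16777216 / 1024 / 1024)) GiB"   # prints: 16 GiB
```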
Then, the new role doe must be defined in file
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/roles/doe.yaml
with all the
appropriate profiles.
Push and apply configuration on admin node:
# hpc-config-push && hpc-config-apply -v
Extract the MAC addresses of the virtual machines on the administration network:
# clara virt getmacs <VM>
Then add the network settings of the virtual machines in the master_network
hash with their MAC addresses:
master_network:
  fbdoe1:
    fqdn: "fbdoe1.%{hiera('domain')}"
    networks:
      administration:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:0a'
        'IP': '10.1.0.11'
        'device': 'eth0'
        'hostname': 'fbdoe1'
      wan:
        'IP': '10.3.0.11'
        'device': 'eth1'
        'hostname': 'wanfbdoe1'
  fbdoe2:
    fqdn: "fbdoe2.%{hiera('domain')}"
    networks:
      administration:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:0b'
        'IP': '10.1.0.12'
        'device': 'eth0'
        'hostname': 'fbdoe2'
      wan:
        'IP': '10.3.0.12'
        'device': 'eth1'
        'hostname': 'wanfbdoe2'
Optionally, virtual IP addresses can also be defined for the virtual machines in
the vips
hash of the same file.
Generate the SSH host keys in synchronization with the master_network
:
# puppet-hpc/scripts/sync-ssh-hostkeys.sh hpc-privatedata $CLUSTER
Push and apply the new configuration on the generic service nodes:
# hpc-config-push && clush -bg service hpc-config-apply -v
Define the new virtual machines with Clara on the generic service node of your
choice, for example fbservice1
:
# clara virt define fbdoe[1-2] --host=fbservice1
Note
The choice of the generic service node is not critical as the service virtual machines can be migrated from one generic service node to another at any time.
Then start the virtual machines, wiping their virtual block storage devices so that they boot in PXE mode:
# clara virt start fbdoe[1-2] --wipe
Optionally, watch the serial console with:
# ssh -t fbservice1 -- virsh console fbdoe1
Required virtual machines
You are free to define the service virtual machines you want on Scibian HPC clusters. The service virtual machines can run any software services you would like. However, some specific generic virtual machines are required in the reference architecture to run some mandatory additional services.
The required service virtual machines are:
- two proxy virtual machines with the auth::replica profile for managing the LDAP directory replica. The installation of the LDAP directory replica on the proxy nodes is documented in the Directory replica sub-section of the LDAP Authentication section of this installation procedure.
- two batch virtual machines with the jobsched::server and db::server profiles for the Slurm controller, the SlurmDBD accounting service and the MariaDB Galera database. The installation of the Slurm server-side components on the batch nodes is documented in the Slurm section of this installation guide.
- two p2p virtual machines with the p2p::seeder, p2p::tracker and http::diskless profiles for serving files to boot diskless nodes with BitTorrent. The installation of the p2p nodes is pretty straightforward as long as the required profiles are enabled. The creation of the diskless environment is documented in the Build diskless image sub-section of the Frontend and compute nodes section of the installation procedure.
LDAP Authentication
Directory replica
User authentication on Scibian HPC clusters is based on an LDAP directory using the ldaps protocol (LDAP over SSL/TLS). This protocol requires the LDAP replicas to have a valid SSL certificate and asymmetric keys.
For production use, it is recommended to obtain a certificate signed by a valid PKI CA[7], either a public CA on the Internet or a CA internal to your organization. Otherwise, it is possible to use self-signed certificates.
Copy the private key and the certificate under the following paths:
- certificate: $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.crt
- private key: $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.key
Encrypt these files with the clara enc plugin:
# clara enc encode $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.crt
# clara enc encode $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.key
Remove the unencrypted files:
# rm $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.crt
# rm $ADMIN/hpc-privatedata/files/$CLUSTER/auth/$CLUSTER_ldap.key
Then, append the auth::replica profile and set the certificate owner to openldap in the proxy role:
--- a/hieradata/foobar/roles/proxy.yaml
+++ b/hieradata/foobar/roles/proxy.yaml
@@ -14,7 +14,7 @@ profiles:
# Proxy
- profiles::ntp::client
- profiles::network::wan_nat
+ - profiles::auth::replica
- profiles::postfix::relay
- profiles::ha::base
- profiles::hardware::admin_tuning
@@ -30,3 +30,24 @@ profiles:
profiles::network::gw_connect: 'wan'
shorewall::ip_forwarding: true
+
+certificates::certificates_owner: 'openldap'
Push and apply the configuration on the proxy nodes:
# hpc-config-push && clush -bg proxy hpc-config-apply -v
Finally, follow the steps documented in LDAP bootstrap procedure.
Clients setup
Once the LDAP replicas are bootstrapped and operational, it is possible to set up the NSS LDAP backend and PAM LDAP authentication on the nodes.
On Scibian HPC clusters, the NSS LDAP backend and PAM authentication over LDAP are both set up with the same auth::client profile. This profile must be used in combination with the access::base profile, which controls the remote access rules to the nodes. By default, this profile prevents remote access to the nodes with LDAP accounts. The access rules must explicitly whitelist users and/or administrators to allow remote access with SSH.
There are two main access whitelist parameters:
- profiles::access::base_options is the list of permanent access rules.
- profiles::access::production_options is the list of access rules disabled in maintenance mode.
The administrator-related access rules must be listed in base_options while the user-related access rules must only be present in the production_options list. This way, only administrators can access the HPC cluster in maintenance mode. For example:
profiles::access::base_options:
  - "+ : (admins) : ALL"
profiles::access::production_options:
  - "+ : (grpusers1) : ALL"
  - "+ : (grpusers2) : ALL"
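These entries follow the syntax of pam_access rules as described in access.conf(5): a permission field (+ or -), then users or (groups), then origins. As an illustrative sketch of how such rules combine (the exact file generated by the profile may differ):

```
# permission : users/(groups) : origins
+ : (admins)    : ALL   # members of group admins may always log in
+ : (grpusers1) : ALL   # user group allowed while in production mode
- : ALL         : ALL   # everybody else is denied
```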
These parameters must be set in the roles specific layer of the hiera repository as access rules depend on the role of the nodes. For example, users may access the frontend nodes but not the admin node.
Additionally, it is also possible to set up sudo rules with the sudo::base profile and the sudo::sudo_config_opts list. This parameter is basically a list of sudo rules. For example, to allow the administrators group to run any command with sudo on the admin node, add the following excerpt to file $ADMIN/hpc-privatedata/hieradata/$CLUSTER/roles/admin.yaml:
profiles::sudo::sudo_config_opts:
  - "%admins ALL = (ALL) ALL"
By default, the PAM and NSS LDAP backends connect to the HPC cluster internal LDAP replica, hosted on service virtual machines. In order to make LDAP authentication on the admin nodes and generic service nodes possible for the administrators when the virtual machines are offline (typically during maintenance), it is possible to add the following parameter in the associated roles:
profiles::auth::client::external_ldap: true
This way, the nodes will connect to the organization reference LDAP directory instead of the internal LDAP replica.
Push and apply the configuration on all the affected nodes with:
# hpc-config-push && clush -bg all hpc-config-apply
Slurm
Slurm workload manager is distributed among the HPC cluster nodes with multiple daemons and client programs. On Scibian HPC clusters, the server part of Slurm, i.e. the controller and the accounting services, runs in high-availability mode on the batch nodes. These components are managed by the jobsched::server profile. The batch nodes also need the db::server profile, and the ceph::client or nfs::mount profiles, which respectively set up the MariaDB Galera RDBMS[8] and the CephFS or NFS filesystem client.
Base Configuration
Slurm communications between nodes are secured using Munge which is based on a secret shared key. Generate this munge key with the following command:
# mkdir -p $ADMIN/hpc-privatedata/files/$CLUSTER/munge
# dd if=/dev/urandom bs=1 count=1024 > $ADMIN/hpc-privatedata/files/$CLUSTER/munge/munge.key
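MUNGE keys are typically 1024 bytes of random data, as created by the dd command above. A quick size check before encrypting the key (an illustrative sketch using a temporary path, not the real key location):

```shell
# Create a 1024-byte key in a temporary location and verify its size
TMPKEY=$(mktemp)
dd if=/dev/urandom bs=1 count=1024 of="$TMPKEY" 2>/dev/null
test "$(wc -c < "$TMPKEY")" -eq 1024 && echo "munge key size OK"
rm -f "$TMPKEY"
```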
Encrypt the key using Clara:
# clara enc encode $ADMIN/hpc-privatedata/files/$CLUSTER/munge/munge.key
Remove the unencrypted key:
# rm $ADMIN/hpc-privatedata/files/$CLUSTER/munge/munge.key
Setup the nodes and partitions managed by Slurm in the
slurm::partitions_options
hash in the cluster specific layer of the Hiera
repository. For example:
slurm::partitions_options:
- 'NodeName=fbcn[01-04] Sockets=2 CoresPerSocket=14 RealMemory=64000 State=UNKNOWN'
- 'NodeName=fbgn01 Sockets=2 CoresPerSocket=4 RealMemory=64000 Gres=gpu:k80:2 State=UNKNOWN'
- 'PartitionName=cn Nodes=fbcn[01-04] Default=YES MaxTime=INFINITE State=UP'
- 'PartitionName=gn Nodes=fbgn01 MaxTime=INFINITE State=UP'
- 'PartitionName=all Nodes=fbcn[01-04],fbgn01 MaxTime=INFINITE State=UP'
Please refer to Slurm documentation for more details about these settings.
In the same file, set up the LDAP/SlurmDBD users synchronization utility, for example:
profiles::jobsched::server::sync_options:
  main:
    cluster: "%{hiera('cluster_name')}"
    org: "%{hiera('org')}"
    policy: 'global_account'
  global_account:
    name: 'users'
    desc: 'Main users account'
Please refer to the example configuration file for more details.
Shared State Location
Still in the cluster specific layer of the Hiera repository, setup the shared storage directory.
CephFS
If you are using CephFS, configure the client mount with the following excerpt:
profiles::jobsched::server::ceph::keys:
  client:
    key: "%{hiera('ceph_client_admin_key')}"
profiles::jobsched::server::ceph::mounts:
  slurmctld:
    servers: # list of Ceph MON servers
      - fbservice2
      - fbservice3
      - fbservice4
    device: '/slurmctld'
    mountpoint: "%{hiera('slurm_state_save_loc')}"
    user: 'admin'
    key: 'client'
    mode: 'kernel'
NFS
If you are using an NFS HA Server:
profiles::jobsched::server::ceph::enabled: false
profiles::jobsched::slurm_config_options:
  [...]
  StateSaveLocation: '/admin/restricted/backup/slurm_state_save'
For NFS HA, at the role level, configure the NFS mount:
profiles:
  [...]
  - profiles::nfs::mounts
profiles::nfs::to_mount:
  home:
    server: 'fbnas'
    exportdir: '/srv/admin'
    mountpoint: '/admin'
    options: 'bg,rw,hard,vers=4'
Miscellaneous Tuning
Optionally, it is possible to tune Slurm, GRES, SlurmDBD and the job submit LUA script with the following parameters:
profiles::jobsched::slurm_config_options:
  PrivateData: 'jobs,reservations,usage'
  AccountingStorageEnforce: 'associations,limits,qos'
  GresTypes: 'gpu'
  SlurmCtldDebug: 'verbose'
  PriorityFlags: 'FAIR_TREE'
slurm::gres_options:
  - 'NodeName=fbgn01 Name=gpu Type=k80 File=/dev/nvidia0'
profiles::jobsched::server::slurmdbd_config_options:
  PrivateData: 'accounts,jobs,reservations,usage,users'
slurm::ctld::submit_lua_options:
  CORES_PER_NODE: '28'
MariaDB security hardening
By default, the MariaDB server is setup with parameters to harden its security. Notably, the following settings are deployed by default:
- max_user_connections is set to 100 (default is 0, i.e. unlimited), in order to prevent one user from grabbing all 151 available max_connections (default MariaDB value).
- secure_file_priv is set to an empty value in order to disable the potentially dangerous command LOAD DATA INFILE.
- the client histfile ~/.mysql_history is disabled by default.
Obviously, these settings can be altered in the hiera repository. Here is an example yaml excerpt to change these default values:
mariadb::disable_histfile: false
mariadb::galera_conf_options:
  mysqld:
    max_user_connections: '0' # unlimited
    secure_file_priv: '/'
Bootstrap
Some software components need to be manually bootstrapped on the batch nodes before being started. The shared storage can be on CephFS or on NFS HA; the suitable bootstrap procedure must be performed.
Please refer to the Bootstrap procedure chapter of this document for all details.
Configuration deployment
Once the configuration is set in the Hiera repository, push and apply the configuration on the admin and batch nodes:
# hpc-config-push && clush -bg admin,batch hpc-config-apply -v
Check Slurm is available by running the sinfo command on the admin node. If the command reports the nodes and partitions states without error, Slurm is running properly.
Frontend and compute nodes
On Scibian HPC clusters, the frontend and compute nodes download a system image at boot time and deploy it in RAM, which notably gives the possibility to have diskless nodes. For more details about this technique, please refer to the Diskless boot sub-section in the Advanced Topics section of the Architecture chapter of this document. The diskless image must be generated with the Clara images plugin on the admin node before booting the frontend and the compute nodes. These steps are explained in the following sub-sections.
Diskless image generation
The diskless image is generated by the Clara images plugin. This plugin needs some configuration in the cluster specific layer of the Hiera repository. Here is an example of such configuration:
clara_images_target_dir: "%{hiera('admin_dir')}/scibian8"
clara::common_options:
  allowed_distributions:
    value: 'scibian8'
clara::images_options:
  extra_packages_image: "scibian-archive-keyring,hpc-config-apply,scibian-hpc-commons"
  packages_initrd: "scibian-diskless-initramfs-config"
  etc_hosts: "10.1.0.101:vipfbservice1,10.1.0.101:apt.service.virtual,10.1.0.10:fbadmin1"
clara::config_options:
  images-scibian8:
    debiandist: 'jessie'
    debmirror: "http://%{hiera('debian_mirror_server')}/%{hiera('debian_mirror_dir')}"
    kver: "3.16.0-4-amd64"
    list_repos: "deb [arch=amd64,i386] http://%{hiera('debian_mirror_server')}/"
    trg_dir: "%{hiera('clara_images_target_dir')}"
    trg_img: "%{hiera('clara_images_target_dir')}/scibian8.squashfs"
    preseed_file: "%{hiera('clara_images_config_dir')}/scibian8/preseed"
    package_file: "%{hiera('clara_images_config_dir')}/scibian8/packages"
    script_post_image_creation: "%{hiera('clara_images_config_dir')}/scibian8/post.sh"
    list_files_to_install: "%{hiera('clara_images_config_dir')}/scibian8/filelist"
    dir_files_to_install: "%{hiera('clara_images_config_dir')}/scibian8/files_dir"
    foreign_archs: 'i386'
clara::live_dirs:
  "%{hiera('clara_images_config_dir')}":
    ensure: directory
  "%{hiera('clara_images_config_dir')}/scibian8":
    ensure: directory
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir":
    ensure: directory
clara::live_files:
  "%{hiera('clara_images_config_dir')}/scibian8/post.sh":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/post.sh"
    mode: '755'
  "%{hiera('clara_images_config_dir')}/scibian8/preseed":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/preseed"
  "%{hiera('clara_images_config_dir')}/scibian8/filelist":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/filelist"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/resolv.conf":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/resolv.conf"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/no-cache":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/no-cache"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/no-recommends":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/no-recommends"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/interfaces":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/interfaces"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/proxy":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/proxy"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/mk_ipmi_dev.sh":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/mk_ipmi_dev.sh"
  "%{hiera('clara_images_config_dir')}/scibian8/files_dir/hpc-config.conf":
    source: "%{hiera('private_files_dir')}/boot/live/scibian8/files_dir/hpc-config.conf"
The clara::live_files
parameter contains the list of files deployed under the
configuration directory of Clara. These files are:
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/post.sh
is a post-generation script run by Clara inside the image environment:
#!/bin/bash -e
# Fix Timezone data
echo GMT > /etc/timezone
dpkg-reconfigure -f noninteractive tzdata
# Fix hostname
echo "localhost" > /etc/hostname
# Create needed directory for Puppet
mkdir -p /var/lib/puppet/facts.d/
# Enable setuid on /bin/ping to let users run it because AUFS does not support
# xattr and therefore capabilities.
chmod 4755 /bin/ping
This script can notably be used to customize the image or to set up files and directories that are required very early in the live boot process, before Puppet runs.
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/preseed
contains the answers to the Debconf packages configuration questions:
console-common console-data/keymap/full select en
console-common console-data/keymap/policy select Select keymap from full list
console-data console-data/keymap/full select en
console-data console-data/keymap/policy select Select keymap from full list
console-setup console-setup/charmap47 select UTF-8
locales locales/default_environment_locale select en_US.UTF-8
locales locales/locales_to_be_generated multiselect en_US.UTF-8 UTF-8, en_US ISO-8859-1
keyboard-configuration keyboard-configuration/layout select English
keyboard-configuration keyboard-configuration/variant select English
keyboard-configuration keyboard-configuration/unsupported_layout boolean true
keyboard-configuration keyboard-configuration/model select International (with dead keys)
keyboard-configuration keyboard-configuration/layoutcode string intl
keyboard-configuration keyboard-configuration/ctrl_alt_bksp boolean false
keyboard-configuration keyboard-configuration/variantcode string oss
keyboard-configuration keyboard-configuration/modelcode string pc105
postfix postfix/main_mailer_type select No configuration
tzdata tzdata/Areas select Europe
tzdata tzdata/Zones/Europe select London
libpam-runtime libpam-runtime/conflicts error
mdadm mdadm/start_daemon boolean false
postfix postfix/mailname string localdomain
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/filelist
specifies the list of files to copy inside the generated image:
hpc-config.conf etc/ 0644
resolv.conf etc/ 0644
proxy etc/apt/apt.conf.d/ 0644
no-cache etc/apt/apt.conf.d/ 0644
no-recommends etc/apt/apt.conf.d/ 0644
interfaces etc/network/ 0644
mk_ipmi_dev.sh usr/local/sbin/ 0755
All the files under the files_dir
directory are copied without modification
into the image. The required files are:
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/resolv.conf
is the configuration file for DNS resolvers with the virtual IP addresses of the cluster’s internal DNS servers:
domain foobar.hpc.example.org
search foobar.hpc.example.org hpc.example.org
nameserver 10.1.0.101
nameserver 10.1.0.102
nameserver 10.1.0.103
nameserver 10.1.0.104
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/no-cache
disables local package caching in the APT package manager:
Dir::Cache::srcpkgcache "";
Dir::Cache::pkgcache "";
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/no-recommends
disables installation of recommended packages (soft dependencies) in the APT package manager:
APT::Install-Recommends "0";
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/interfaces
is a default network interfaces configuration file to enable DHCP on the eth0
interface:
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/proxy
sets up the cluster’s internal package proxy in the APT configuration:
Acquire::http::Proxy "http://apt.service.virtual:3142";
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/mk_ipmi_dev.sh
is a workaround script to create the BMC device inodes in the /dev
virtual filesystem very early in the diskless nodes boot process:
#!/bin/sh
DEVICE='/dev/ipmi0'
if [ -e ${DEVICE} ]
then
exit 0
else
MAJOR=$(grep ipmidev /proc/devices | awk '{print $1}')
mknod --mode=0600 ${DEVICE} c ${MAJOR} 0
fi
-
$ADMIN/hpc-privatedata/files/$CLUSTER/boot/live/scibian8/files_dir/hpc-config.conf
is a configuration file for the Puppet-HPC hpc-config-apply
utility:
[DEFAULT]
environment=production
source=http://s3-system.service.virtual:7480/hpc-config
keys_source=http://secret.service.virtual:1216
# Using /var/tmp to more easily manipulate /tmp mount
# point during a puppet run.
tmpdir=/var/tmp
Once all these files have been added to the cluster specific files directory, the Clara p2p plugin must be configured in the cluster specific layer of the Hiera repository with the following parameters:
clara::p2p_options:
  seeders: "%{hiera('cluster_prefix')}%{my_p2p_tracker}1,%{hiera('cluster_prefix')}%{my_p2p_tracker}2:%{hiera('clara_images_target_dir')}/scibian8.squashfs.torrent"
  trackers: "10.1.0.51,10.1.0.52:%{hiera('clara_images_target_dir')}/scibian8.squashfs.torrent"
  trackers_port: '6881'
  trackers_schema: 'http'
Push and apply the configuration on the admin node:
# hpc-config-push && hpc-config-apply
Now that Clara is setup, the image can be created with the following command:
# clara images create scibian8
Also create the associated initrd environment:
# clara images initrd scibian8
Deploy the generated image and initrd to the p2p nodes:
# clush -g p2p mkdir -p /var/www/diskless/scibian8
# clush -g p2p \
    --copy /var/cache/admin/scibian8/{initrd-3.16.0-4-amd64,vmlinuz-3.16.0-4-amd64} \
    --dest /var/www/diskless/scibian8
# clush -g p2p \
    --copy /var/cache/admin/scibian8/{scibian8.squashfs.torrent,scibian8.squashfs} \
    --dest /var/www/diskless/scibian8
Restart peer-to-peer services to load new files:
# clara p2p restart
The diskless environment is finally ready and available to frontend and compute nodes.
Boot nodes
Before booting the frontend and compute nodes, they must first be declared in
the internal configuration repository. Append the nodes to the
boot_params
hash in $ADMIN/hpc-privatedata/hieradata/$CLUSTER/cluster.yaml
:
boot_params:
[...]
fbfront[1-2]:
cowsize: '8G'
boot_os: 'scibian8_ram'
ipxebin: 'ipxe_noserial.bin'
fbcn[01-04]:
boot_os: 'scibian8_ram'
ipxebin: 'ipxe_noserial.bin'
fbgn01:
cowsize: '8G'
boot_os: 'scibian8_ram'
ipxebin: 'ipxe_noserial.bin'
The cowsize
must be increased to 8GB from the default 2GB on frontend and
graphical nodes because these nodes require many more packages to be
installed at boot time.
Then define the roles associated with the frontend and the compute nodes, for
example front, cn and gn. For these role definitions, keep in mind the
following rules:
-
The frontend role must include the jobsched::client profile, while the
compute nodes require the jobsched::exec profile instead.
-
The profiles::environment::userspace::packages list must include the
scibian-hpc-frontend meta-package in the frontend nodes role, the
scibian-hpc-compute meta-package in the standard compute nodes role and the
scibian-hpc-graphical meta-package in the graphical nodes role.
The nodes must be added into the master_network
hash in file
$ADMIN/hpc-privatedata/hieradata/$CLUSTER/network.yaml
with all their network
interfaces and the MAC addresses of the network interfaces connected to the
administration network, as well as those of their BMC.
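For illustration only, an entry of the master_network hash may look like the following sketch. The field names (fqdn, networks, DHCP_MAC, IP, device, hostname) and addresses here are assumptions and must be checked against the existing entries in your network.yaml:

```yaml
master_network:
  fbcn01:
    fqdn: 'fbcn01.foobar.hpc.example.org'
    networks:
      administration:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:01'  # MAC on the administration network
        'IP': '10.1.3.1'
        'device': 'eth0'
        'hostname': 'fbcn01'
      bmc:
        'DHCP_MAC': 'aa:bb:cc:dd:ee:02'  # MAC of the node BMC
        'IP': '10.2.3.1'
        'hostname': 'fbcn01-bmc'
```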
Generate all the SSH host keys:
# puppet-hpc/scripts/sync-ssh-hostkeys.sh hpc-privatedata $CLUSTER
Push and apply the configuration to the admin and generic service nodes:
# hpc-config-push && clush -bg admin,service hpc-config-apply -v
Finally, boot all the nodes in PXE mode with Clara:
# clara ipmi pxe @front,@cn,@gn
# clara ipmi boot @front,@cn,@gn
Optional features
Tuning
TBD
Firewall
TBD
Kerberos
TBD
Internal APT repository
TBD
Storage Multipath
TBD
Monitoring
TBD
Metrics
TBD
HPCStats
TBD
Slurm WCKeys
TBD
NFS High-Availability
TBD
Slurm power management
Generate and encrypt the SSH key used to poweroff the nodes from the batch nodes:
# mkdir hpc-privatedata/files/$CLUSTER/pwmgt
# ssh-keygen -N '' -C root@pwmgt -f hpc-privatedata/files/$CLUSTER/pwmgt/id_rsa_slurm
# clara enc encode hpc-privatedata/files/$CLUSTER/pwmgt/id_rsa_slurm
# rm hpc-privatedata/files/$CLUSTER/pwmgt/id_rsa_slurm
Then add those settings in the cluster specific layer of the hiera repository:
profiles::jobsched::pwmgt::enabled: true
slurmutils::pwmgt::ctld::config_options:
ipmi:
prefix: "%{hiera('ipmi_prefix')}"
user: "%{hiera('ipmi_user')}"
password: "%{hiera('ipmi_password')}"
slurmutils::pwmgt::ctld::priv_key_enc: "%{hiera('private_files_dir')}/pwmgt/id_rsa_slurm.enc"
slurmutils::pwmgt::ctld::decrypt_passwd: "%{hiera('cluster_decrypt_password')}"
slurmutils::pwmgt::exec::pub_key: <PUBKEY>
Where <PUBKEY>
is the public key in file
hpc-privatedata/files/$CLUSTER/pwmgt/id_rsa_slurm.pub
.
Finally, apply the new configuration on the batch nodes and all the compute nodes:
# hpc-config-push
# clush -bg batch hpc-config-apply -v
# clush -bg compute hpc-config-apply -v
Bootstrap procedures
This chapter contains all the procedures to bootstrap all the crucial services for a Scibian HPC system: LDAP, Ceph, MariaDB with Galera, SlurmDBD, etc.
LDAP bootstrap
As stated in the external services section of the Reference Architecture chapter, a central LDAP directory server external to the Scibian HPC cluster is required. The LDAP directory server on the cluster is just a replica of this central external server.
The Puppet-HPC openldap
module expects an LDIF file containing a full dump of
the LDAP replica configuration. The easiest way to produce this bootstrap LDIF
file is to install and configure an LDAP server replica manually and dump the
live configuration.
First, install an LDAP server with common LDAP utilities:
# apt-get install slapd ldap-utils
Select the HDB database backend. Then, configure the base DN, the domain name, the organization name according to your environment, and set the administration password.
Write the LDAP replication configuration LDIF file syncrepl_config.ldif
,
similarly to this example:
dn: olcDatabase={1}hdb,cn=config
changetype: modify
add: olcSyncrepl
olcSyncrepl: rid=001
  provider=<LDAP_SERVER_URL>
  bindmethod=simple
  timeout=0
  tls_cacert=<CA_CRT_CHAIN>
  network-timeout=0
  binddn="<BIND_DN>"
  credentials="<BIND_PASSWORD>"
  searchbase="dc=calibre,dc=edf,dc=fr"
  schemachecking=on
  type=refreshAndPersist
  retry="60 +"
-
add: olcUpdateref
olcUpdateref: <LDAP_SERVER_URL>
Where:
-
LDAP_SERVER_URL
is the URL to the organization central LDAP server, ex:ldaps://ldap.company.tld
. -
If using TLS/SSL,
CA_CRT_CHAIN
is the absolute path to the CA certificate chain (up-to root CA certificate), ex:/usr/local/share/ca-certificates/ca-chain.crt
-
BIND_DN
is the replication user DN, ex:cn=replication,dc=company,dc=tld
-
BIND_PASSWORD
is the password of the replication user
Inject this LDIF replication configuration file into the LDAP server:
# ldapmodify -a -Y EXTERNAL -H ldapi:// -f syncrepl_config.ldif
Using the same technique, configure the indexes, ACLs, TLS/SSL, password policy, Kerberos, etc. to your needs. Finally, generate the full LDAP config dump with:
# slapcat -b cn=config > config_replica.ldif
or:
# ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config > config_replica.ldif
The config_replica.ldif
file must be deployed encrypted within Puppet-HPC
private files directory. Please refer to Puppet-HPC Reference Documentation
for more details.
After a fresh installation of the cluster’s service virtual machines that host
the LDAP directory replicas, the config_replica.ldif
file is deployed by Puppet and the
LDAP replication must be bootstrapped with this script:
# make_ldap_replica.sh
The script will ask you to confirm by typing YES
and pressing Enter.
MariaDB/Galera bootstrap
The Puppet-HPC mariadb
module configures an active/active MariaDB cluster
based on the Galera replication library. On the service virtual machines that host
this database system, the corresponding mariadb
system service will not start
unless it is already started on another service virtual machine. If it is not
running anywhere else, the service must be bootstrapped with this command:
# galera_new_cluster
This command starts the MariaDB service on the local host in new cluster mode. The state of the local service can be checked with this command:
# systemctl status mariadb.service
This command must report a running mysqld
process. In some cases, typically
when a MariaDB/Galera cluster was not properly stopped, the command may fail
and report this error:
[ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
In this case, and if you are totally sure that MariaDB service is stopped on all nodes, the error can be ignored with the following command:
# sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
Then, the MariaDB/Galera cluster can be started again with galera_new_cluster
.
Once the service is started on all service virtual machines, you can check the cluster replication status with:
# mysql -e "SELECT VARIABLE_VALUE as cluster_size \
    FROM INFORMATION_SCHEMA.GLOBAL_STATUS \
    WHERE VARIABLE_NAME='wsrep_cluster_size'"
This result must be the number of expected active nodes in the MariaDB/Galera cluster (ex: 2).
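The check above can also be wrapped in a small sanity-check script comparing the reported size to the expected node count. In this sketch the mysql client is stubbed out so the example runs anywhere; on a real service virtual machine, remove the stub to query the live MariaDB instance:

```shell
# Stub standing in for the real mysql client (remove on a real node).
mysql() { echo 2; }

EXPECTED=2   # expected number of active MariaDB/Galera nodes
SIZE=$(mysql -N -e "SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME='wsrep_cluster_size'")

# Report whether the cluster is complete or degraded.
if [ "$SIZE" -eq "$EXPECTED" ]; then
    echo "Galera cluster complete ($SIZE nodes)"
else
    echo "Galera cluster degraded ($SIZE/$EXPECTED nodes)" >&2
fi
```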
SlurmDBD bootstrap
After its first installation on the cluster, the SlurmDBD accounting database is empty. First, the cluster must be created in the database:
# sacctmgr --immediate add cluster <name>
Where <name>
is the name of the cluster.
Then, once the
sync-accounts
utility is configured, run it to create all accounts and users:
# slurm-sync-accounts
Then, it is possible to create QOS and configure fair-share depending upon your needs.
If using wckeys, they must be bootstrapped by adding the first key manually
using the sacctmgr
command and then running the import script:
# sacctmgr -i add user root wckey=<init>
# slurm_wckeys_setup.sh
Ceph
Mon
After the reinstallation of one of the generic service nodes hosting a Mon, the Mon must be re-initialized. This procedure only works on a running cluster; the initial Mon creation uses another command.
From an admin node:
# cd <ceph deploy directory>
# ceph-deploy --overwrite-conf mon add <mon hostname>
OSD
This procedure only applies if the content of an OSD volume is lost. If the node is reinstalled without erasing the content of the OSD volume, the configuration in puppet will be enough to start the osd volume again.
The relevant OSD ID can be retrieved with:
# ceph osd tree
Before doing this procedure, make sure the OSD is really down and not mounted on the OSD node.
Removing old OSD
The old OSD must be removed from the configuration (stored by the MON).
# ceph osd crush remove osd.X
# ceph auth del osd.X
# ceph osd rm X
Re-creating the OSD
# cd <ceph deploy directory>
# ceph-deploy disk zap clserviceY:sdb
# ceph-deploy osd prepare clserviceY:sdb
The OSD ID and authentication key must be updated in the Hiera configuration. In most cases, the new OSD will take the same ID as the old one. You can get the new ID and the new key with:
# ceph osd tree
# ceph auth print-key osd.X
CephFS
A CephFS filesystem is used between the batch nodes to share the Slurm controller state. The filesystem must be initialized before being used by Slurm.
First, mount temporarily the CephFS filesystem:
# mount -t ceph -o name=admin,secretfile=/etc/ceph/client.key fbservice2,fbservice3,fbservice4:/ /mnt
Create a subdirectory for Slurm controller, set its ownership and restrict its mode:
# mkdir /mnt/slurmctld
# chown slurm: /mnt/slurmctld
# chmod 0700 /mnt/slurmctld
Finally, umount it:
# umount /mnt
Puppet-HPC is now able to use this filesystem for Slurm on batch nodes.
NFS HA bootstrap
The shared storage of the NFS server contains a directory that holds the state of the clients (mainly the locks). When the shared NFS storage is created, it must be formatted and this state directory must be created.
The shared storage must be on a specific LVM Volume Group. What the PVs are for this volume group and how they are configured depends on the hardware available.
In the following example, the PV/LV is VG_NAS
/LV_NAS
and is to be mounted
as /srv/admin
.
# mkfs.ext4 /dev/VG_NAS/LV_NAS
# mkdir /srv/admin
# mount /dev/VG_NAS/LV_NAS /srv/admin
# mkdir -p /srv/admin/restricted/state/nfs/v4recovery
# umount /srv/admin
After these steps, the keepalived daemon can be started on the nodes. The MASTER node will mount the storage and export it.
Production procedures
In this chapter are listed all the technical procedures to follow for regular operations occurring during the production phase of the supercomputer. This notably includes changing any encryption or authentication key, changing passwords, reinstalling nodes, etc.
MAC address change
This procedure explains how to modify the Puppet-HPC configuration to change a hardware Ethernet address after a motherboard replacement, for example.
First, the yaml file in the hieradata repository containing the
master_network
hash must be edited to replace the old hardware address. A
description of this hash can be found in the Installation section of this guide.
The modified configuration must be pushed to the shared administration
directory with the hpc-config-push
command:
# hpc-config-push
INFO: creating archive /tmp/puppet-config-push/tmp_ndq0ujz/puppet-config-environment.tar.xz
INFO: S3 push: pushing data in bucket s3-system
Then apply the configuration on the service
nodes, which run the DHCP server:
# hpc-config-apply
Note
|
It is not possible to run the hpc-config-apply command on all the
service nodes at exactly the same time. A short delay must be respected, as the
Ceph service can be disturbed by a restart of the network service.
|
Password/keys changes
Root password
The hashed root password is stored in the variable
profiles::cluster::root_password_hash
in yaml files. The value must be
encrypted using eyaml. It can be simply changed using the eyaml
command.
# eyaml edit cluster.yaml
...
profiles::cluster::root_password_hash: DEC::PKCS7[hashed_password]!
...
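The hashed value itself can be produced in several ways; here is a sketch using openssl, assuming OpenSSL 1.1.1 or later (mkpasswd -m sha-512 from the whois package is an alternative). The salt and password below are placeholders:

```shell
# Generate a SHA-512 crypt hash suitable for /etc/shadow-style variables.
# 'examplesalt' and 'new-root-password' are placeholders for the demo.
HASH=$(openssl passwd -6 -salt examplesalt 'new-root-password')
echo "$HASH"
```

The resulting string (starting with $6$examplesalt$) is the value to paste inside DEC::PKCS7[...]!.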
Once changed, the new configuration must be applied on all the machines of the cluster.
Root SSH key
The root SSH keys are stored in the internal repository. The private keys must
be encrypted. The RSA public key is also stored in the variable
openssh::server::root_public_key
. The key files and the
value of the variable must be changed at the same time. To avoid connection
problems, it is necessary to follow these steps in this order:
-
Change the key files and the variable
openssh::server::root_public_key
in the internal repository -
Apply the configuration on all the machines except the admin node
-
Apply the new configuration on the admin server.
Note
|
In case of desynchronization between the keys on the admin node and those on the other nodes, it is always possible to use the root password to connect. |
SSH host keys
The SSH host keys are stored, encrypted, in the internal repository. To avoid connections problems, it is necessary to follow these steps in this order:
-
Change the keys files in the internal repository
-
Apply the configuration on all the machines of the cluster, including the admin machine
-
Delete the file /root/.ssh/known_hosts on the admin node.
-
When connecting to the nodes, /root/.ssh/known_hosts will be automatically populated if the Scibian HPC default configuration is used.
Eyaml keys
Replacing the eyaml PKCS7 key pair actually consists of two actions:
-
Generate a new pair of keys (
eyaml createkeys
) -
Replace all the values encoded with the old pair with ones encoded with the new pair of keys.
Note
|
As these operations imply decrypting files and re-encrypting them with another key pair, it is not possible to perform other administrative operations (like applying the configuration on nodes) on the cluster at the same time. The key change operation must be fully completed before resuming "normal" administrative operations. |
These steps must be followed in order to safely change the eyaml keys:
Save the old keys:
# cp /etc/puppet/secure/keys/private_key.pkcs7.pem \
    /etc/puppet/secure/keys/private_key.pkcs7.pem.old
# cp /etc/puppet/secure/keys/public_key.pkcs7.pem \
    /etc/puppet/secure/keys/public_key.pkcs7.pem.old
Copy the new keys in /etc/puppet/secure/keys/.
Decrypt all the yaml files encoded using the old keys:
# eyaml decrypt \
    --pkcs7-private-key /etc/puppet/secure/keys/private_key.pkcs7.pem.old \
    --pkcs7-public-key /etc/puppet/secure/keys/public_key.pkcs7.pem.old \
    -e hieradata/<cluster>/cluster.yaml \
    > hieradata/<cluster>/cluster.decrypt.yaml
The cluster.decrypt.yaml
file contains all the secrets in plain text. It should be removed
as soon as possible.
Encrypt the files with the new keys:
# eyaml encrypt -e hieradata/<cluster>/cluster.decrypt.yaml \
    > hieradata/<cluster>/cluster.yaml
# rm hieradata/<cluster>/cluster.decrypt.yaml
Remove the old saved keys from the admin node:
# rm /etc/puppet/secure/keys/private_key.pkcs7.pem.old \
    /etc/puppet/secure/keys/public_key.pkcs7.pem.old
Create a tarball, encode it with clara enc
and add it to the files
directory of the internal repository:
# tar cJf /tmp/keys.tar.xz /etc/puppet/secure/keys
# clara enc encode /tmp/keys.tar.xz
# mv /tmp/keys.tar.xz.enc <internal repository>/files/<cluster>/eyaml
Where:
-
<internal repository> is the directory that contains the clone of the internal repository.
-
<cluster> is the name of the cluster.
At this stage, the keys are now stored encrypted in the internal repository and are available locally in the standard eyaml paths.
In the default Scibian-HPC configuration, the PKCS7 keys propagation service runs on all the generic service nodes. First, the encoded tarball must be manually copied on the nodes:
# scp <internal repository>/files/<cluster>/eyaml/keys.tar.xz <generic server X>:/tmp
Where <generic server X> is the hostname of the generic service node.
Then apply the configuration using the new keys:
# hpc-config-apply -vv --keys-source=/tmp
This will copy the eyaml PKCS7 key pair into the right directory to be served by the propagation service to all other nodes when applying the Puppet configuration. These last two operations must be executed on all the generic service nodes.
Don’t forget to remove the keys from the /tmp
directory on the admin node and
on all the service nodes.
# rm /tmp/keys.tar.xz
# clush -w @service rm /tmp/keys.tar.xz
Internal repository encoding key
Note
|
As these operations imply decrypting files and re-encrypting them with another key, it is not possible to perform other administrative operations (like applying the configuration on nodes) on the cluster at the same time. The key change operation must be fully completed before resuming "normal" administrative operations. |
Replacing the AES key used to encode files in the internal repository consists of several steps.
Generate a new AES key:
# openssl rand -base64 32
For each encoded file in the internal repository, it is necessary to decode it with the old key and re-encode it with the new one.
# clara enc decode <internal repository>/files/<cluster>/<filename>.enc
# openssl aes-256-cbc \
    -in <internal repository>/files/<cluster>/<filename> \
    -out <filename>.enc -k <AES KEY>
# rm <internal repository>/files/<cluster>/<filename>
Where:
-
<internal repository> is the directory that contains the clone of the internal repository
-
<cluster> is the name of the cluster
-
<filename> is the path of the file to encode
-
<AES KEY> is the random 256 bits key.
Using clara
for both operations, decode and encode, is not possible as it
supports only one AES key.
This re-encryption step can be automated with the reencode-file.sh
script in
the puppet-hpc
scripts dir:
# cd <internal repository>/files/<cluster>
# find -name "*.enc" \
    -exec <puppet-hpc path>/scripts/reencode-file.sh \
    /tmp/oldkey /tmp/newkey '{}' ';'
The files /tmp/oldkey
and /tmp/newkey
are files with just the old and new
AES key respectively. This script does not depend on clara
but basically
performs the same steps as above.
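The per-file re-encryption logic can be illustrated with a self-contained sketch, assuming the same openssl aes-256-cbc scheme used by clara enc. The keys and file content below are placeholders and the sketch runs in a throwaway directory:

```shell
set -e
workdir=$(mktemp -d)
echo 'secret content' > "$workdir/file"
oldkey='old-passphrase'   # placeholder keys for the demo
newkey='new-passphrase'

# Encode with the old key, as the file was originally stored.
openssl aes-256-cbc -in "$workdir/file" -out "$workdir/file.enc" -k "$oldkey"

# Re-encode: decrypt with the old key, re-encrypt with the new one.
openssl aes-256-cbc -d -in "$workdir/file.enc" -k "$oldkey" \
  | openssl aes-256-cbc -out "$workdir/file.enc.new" -k "$newkey"
mv "$workdir/file.enc.new" "$workdir/file.enc"

# Verify the round trip with the new key.
RESULT=$(openssl aes-256-cbc -d -in "$workdir/file.enc" -k "$newkey")
echo "$RESULT"
rm -r "$workdir"
```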
The AES key must be placed in cluster_decrypt_password in the cluster layer of the Hiera repository:
# eyaml edit hieradata/<cluster>/cluster.eyaml
Replace the key:
cluster_decrypt_password: DEC::PKCS7[<AES KEY>]!
Apply the new configuration on the admin node, to update clara
configuration:
# hpc-config-apply
Replication account password
The steps to change these credentials are described here:
-
Decode the configuration ldif file:
# clara enc edit <internal repository>/files/<cluster>/<filename>.enc
-
The field to change is
olcSyncrepl:
; it contains all the information necessary to connect to the master LDAP server (login, password, URI, etc.)
Apply the new configuration on the proxy nodes.
-
Follow the LDAP bootstrap procedure as described in LDAP bootstrap on each proxy node. It is recommended to wait until the first LDAP replica has fully synchronized before updating the second, so as not to disrupt authentication across the cluster.
Note
|
It is possible to change others values with this procedure, for example the root LDAP password. |
Monitoring certificates
The certificates used for monitoring are stored, encrypted, in the internal repository in <internal repository>/files/<cluster>/icinga2/certs/. Each host has a certificate and a key. The steps to follow to change them are:
-
Change the key and certificate files in the internal repository
-
Apply the configuration on the concerned node
-
Update the certificate on the Icinga2 server
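As an illustration of the first step, a self-signed key/certificate pair for one host can be generated with openssl as sketched below. The host name and validity period are placeholders, and sites with a proper Icinga2 CA should have the certificate signed by it instead of using a self-signed one:

```shell
host=fbservice1   # placeholder host name

# Generate a new 2048-bit RSA key and a self-signed certificate for the host.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=$host" \
  -keyout "$host.key" -out "$host.crt" 2>/dev/null

# Check that the private key matches the certificate (moduli must be equal).
CRT_MOD=$(openssl x509 -noout -modulus -in "$host.crt")
KEY_MOD=$(openssl rsa -noout -modulus -in "$host.key")
[ "$CRT_MOD" = "$KEY_MOD" ] && echo 'key and certificate match'

rm "$host.key" "$host.crt"
```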
Munge key
Note
|
Scheduling service and jobs must be stopped to change the munge key. |
Warning
|
This will kill running jobs. |
-
Stop the
slurmd
andslurmctld
daemons. -
Stop the munge daemon on all nodes.
-
Encrypt the new key with
Clara
and place it in <internal repository>/files/<cluster>/munge/munge.key.enc -
Apply the new configuration on all nodes.
-
Restart the daemons.
Repo keyring
Note
|
The packages must be saved in another place. |
The cluster must use a private cluster keyring. This keyring is used to sign the local packages repository.
It is stored in the internal repository: <internal repository>/files/<cluster>/repo/
Here are the steps to follow to change it:
-
Generate a new keyring:
# LANG=C gpg --no-default-keyring \
    --keyring <internal repository>/files/<cluster>/repo/cluster_keyring.gpg \
    --secret-keyring <internal repository>/files/<cluster>/repo/cluster_keyring.secret.gpg \
    --gen-key
-
Encode the secret file with
clara encode
. -
Apply the configuration on the admin node.
-
Delete the folder containing the local repository.
-
Re-create the repository with
clara
:
# clara repo key
# clara repo init scibian8-hpc
-
Add the previously saved packages with
clara
:
# clara repo add scibian8-hpc mypackage_1-2.dsc
# ...
MariaDB users
Generate passwords conforming to your organization’s policy and edit the following
parameters with eyaml
in the hiera repository:
-
slurmdbd_slurm_db_password
-
slurmdbd_slurmro_db_password
These parameters correspond to the passwords of the MariaDB users having respectively R/W and R/O grants on the SlurmDBD database.
Once modified, push and apply the configuration with the following commands:
# hpc-config-push && \
    clush --fanout=1 -bg batch hpc-config-apply -v
The hpc-config-apply
command will perform the following steps, on each batch
node:
-
Update the passwords in the configuration file of the Slurm
mysql-setup
utility. -
Update the passwords in the MariaDB database
-
Update SlurmDBD configuration (if R/W password changed)
-
Restart SlurmDBD (if R/W password changed)
The --fanout=1
parameter of the clush
command makes sure the configuration
is not applied simultaneously on both batch nodes. This could cause the SlurmDBD
daemon to be restarted at the same time and make this service unavailable for a
short period of time.
Administration node re-installation
This procedure will wipe the first disk of the admin node. If some customizations are not in the Puppet configuration, they must be handled separately.
Before powering off the administration node, check that:
-
There is an alternative route to connect to the service node (can be the service nodes themselves)
-
It is possible to connect to the BMC IPMI, and especially to the Serial Over LAN console
-
It is possible to connect to the Ethernet administration network switch
The administration node has no critical service in the reference architecture, so it can simply be powered off:
# poweroff
Note
|
In some Ethernet bonding setups, the node cannot do a PXE boot with an active bonding configuration on the Ethernet switch. If this is the case, refer to the documentation of the network switch to disable the bonding configuration. |
To be re-installed, the administration node must perform a network boot. This
can be configured with ipmitool(1)
installed on a host that has access to the
BMC network interface:
# ipmitool -I lanplus -H <bmc host> -U <bmc username> -P <bmc password> chassis bootdev pxe
# ipmitool -I lanplus -H <bmc host> -U <bmc username> -P <bmc password> chassis power on
Next steps will happen once the node is installed and has rebooted, the installation can be followed through serial console:
# ipmitool -I lanplus -H <bmc host> -U <bmc username> -P <bmc password> sol activate
Note
|
If the Ethernet switch configuration had to be modified to setup PXE boot, the modification must be reverted to its nominal status. |
Service node re-installation
Before re-installing a Service node, active Virtual Machines on the node should be migrated away from it. Clara can be used to list the active VMs and do the live migration.
Listing the VMs:
# clara virt list | grep clserviceX
Migrate the live VMs with the command:
# clara virt migrate <vmname> --dest-host clserviceY
These points should be checked before turning off a Service Node:
-
The ceph cluster should be
HEALTH_OK
(ceph health
), with at least three OSDs in
-
consult
should return services as passing on at least three nodes -
On an Intel Omni-Path cluster, the
opafabricinfo
should return at least one Master and one Standby node
Once there is no VM remaining on the node, it can be powered off safely; the other Service nodes should ensure there is no service outage. The power off can be done from the node itself:
# poweroff
Note
|
In some Ethernet bonding setups, the node cannot do a PXE boot with an active bonding configuration on the Ethernet switch. If this is the case, refer to the documentation of the network switch to disable the bonding configuration. |
To be re-installed, the service node must perform a network boot. This can be configured with clara:
# clara ipmi pxe clserviceX # clara ipmi on clserviceX
Next steps will happen once the node is installed and has rebooted; the installation can be followed through serial console:
# clara ipmi connect clserviceX
After a Service node re-installation, the Ceph services (OSD, MDS and RadosGW) should be reconfigured automatically by the Puppet-HPC configuration. The Mon service (not present on every node) must be bootstrapped again. This procedure is described with the other Ceph bootstrap procedures.
In order to validate the generic service node re-installation, there are some relevant checks to perform.
-
High-Speed network manager (Intel Omni-Path):
# opafabricinfo
The reinstalled node must appear as a Master or Standby node.
-
Check the ceph cluster is healthy:
# ceph status
The cluster should be HEALTH_OK, with all OSDs, Mons and MDSs present.
-
Consul:
# consul
All services on all nodes should have the state passing.
Note
|
If the Ethernet switch configuration had to be modified to setup PXE boot, the modification must be reverted to its nominal status. |
Frontend access
Draining
To perform a scheduled reboot of a frontend, it is better to avoid new connections going to the frontend node that will be rebooted. New connections are load balanced across the frontend nodes with IPVS, which also makes them highly available.
It is possible to remove a frontend from the pool of nodes accepting new
connections, without killing active connections, by using the ipvsadm
command to set the weight of the node to 0.
To list the current weight, on a frontend:
# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.16.1.16:22 rr persistent 600
  -> 172.16.1.11:22               Route   1      10         0
  -> 172.16.1.12:22               Route   1      6          0
  -> 172.16.1.13:22               Route   1      1          0
  -> 172.16.1.14:22               Route   1      15         0
  -> 172.16.1.15:22               Route   1      1          0
To avoid a frontend node being attributed new sessions, the weight of the node can be manually set to 0. This setting does not completely forbid new connections to the node: if a user already has a session, new sessions will go to the same node regardless of the weight. It also does not block connections made directly to the node instead of the virtual IP address.
# ipvsadm -e -t 172.16.1.16:22 -r 172.16.1.11:22 -w 0
# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.16.1.16:22 rr persistent 600
  -> 172.16.1.11:22               Route   0      10         0
  -> 172.16.1.12:22               Route   1      6          0
  -> 172.16.1.13:22               Route   1      1          0
  -> 172.16.1.14:22               Route   1      15         0
  -> 172.16.1.15:22               Route   1      1          0
The modification can be reversed by setting the weight back to 1 manually.
# ipvsadm -e -t 172.16.1.16:22 -r 172.16.1.12:22 -w 1
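To check the current weight of a given real server without reading the whole table, the `ipvsadm -ln` output can be filtered; a small helper, assuming the column layout shown above:

```shell
# Print the Weight column of one real server from `ipvsadm -ln` output
# read on stdin ($1 = -> addr, $4 = weight in the listing above).
rs_weight() {
  awk -v rs="$1" '$1 == "->" && $2 == rs { print $4 }'
}
# usage: ipvsadm -ln | rs_weight 172.16.1.11:22
```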
NFS HA
Starting a node
When a node starts, it should not start the keepalived service automatically. This permits a failed node to be started without it becoming master while a problem remains.
Before starting the keepalived service, the following conditions must be met:
-
The multipath-tools service must be active, with a running multipathd process
-
The keepalived service must be disabled
When these conditions are met, the service can be started:
# systemctl start keepalived
If the node is to become master (it is the master node in the VIP configuration, or the other node is down), check that the first check goes well. It runs every minute and logs to /var/log/user.log. The following message must appear:
Mar 17 17:19:01 fbnfs1 hpc_nfs_ha_server_check.sh.info: INFO: fbnfs1 All checks are OK
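Waiting for that first successful check can be scripted; a sketch that simply greps the log for the message above (the file argument defaults to the path mentioned earlier):

```shell
# Succeeds once hpc_nfs_ha_server_check.sh has logged a successful run.
checks_passed() {
  grep -q 'All checks are OK' "${1:-/var/log/user.log}"
}
# usage: until checks_passed; do sleep 10; done
```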
Manual Fail Over
If the master node disappears, because it is turned off or because the keepalived service is stopped, the failover will happen, but it will take a bit of time (a little more than a minute). This timeout can be entirely avoided by doing a manual failover of the master node before cutting the keepalived service.
To do this, the keepalived configuration must be changed manually on the node.
Edit the file /etc/keepalived/keepalived.conf
. Find the configuration for the
NFS VIP and change the priority to 40, and the role to BACKUP
. The service
must be reloaded:
# service keepalived reload
The failover should happen quickly. Once the node failed over, stop the keepalived service:
# systemctl stop keepalived
The original configuration must be restored before starting the service again. This will happen if you launch hpc-config-apply manually or if you reboot the node.
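The manual edit described above can be sketched with sed. The exact vrrp_instance layout of /etc/keepalived/keepalived.conf varies, so the expressions below are an assumption: verify them against the real file (and make sure they only touch the NFS VIP instance) before use.

```shell
# Demote the keepalived instance in the given file: state BACKUP, priority 40.
# Assumes `state` and `priority` each sit on their own line.
demote_keepalived() {
  f="${1:-/etc/keepalived/keepalived.conf}"
  sed -i -e 's/^\([[:space:]]*state[[:space:]]*\)MASTER/\1BACKUP/' \
         -e 's/^\([[:space:]]*priority[[:space:]]*\).*/\140/' "$f"
}
# then: service keepalived reload
```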
Services
This section contains useful procedures for common operations on infrastructure services.
Packages Caching purge
In order to invalidate and purge the cache content of the apt-cacher-ng
packages caching service, run the following commands consecutively:
# clush -bg service systemctl stop apt-cacher-ng.service
# clush -bg service rm -rf /var/cache/apt-cacher-ng
# clush -bg service mkdir /var/cache/apt-cacher-ng
# clush -bg service chown apt-cacher-ng: /var/cache/apt-cacher-ng
# clush -bg service systemctl start apt-cacher-ng.service
Virtual Machines
This section contains procedures related to virtual machine management with clara.
Deleting a Virtual Machine
A Virtual Machine is composed of two mostly independent objects:
-
The disk image
-
The definition on a host
The two objects must be deleted separately.
The first step is to stop the Virtual Machine:
# clara virt stop <vm_name>
Once it is in the state SHUTOFF
you can undefine it:
# clara virt undefine <vm_name>
The VM will still appear in clara virt list with the state MISSING. This
means clara still sees the disk image but not the Virtual Machine
definition. You can then proceed with deleting the disk image: check the
full disk image name with clara virt list --details to find the volume
name and the pool name.
On a physical host:
# virsh vol-delete --pool <pool_name> <volume_name>
On all other physical hosts:
# virsh pool-refresh <pool_name>
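The whole deletion sequence can be wrapped into one review-first script. A sketch; the pool and volume names must come from `clara virt list --details`, and every command is echoed so the operator can check the plan before running it for real:

```shell
# Print (not run) the commands deleting VM $1 whose disk volume $3 lives
# in storage pool $2. The pool-refresh must then be run on every other
# physical host as well.
delete_vm_plan() {
  vm="$1"; pool="$2"; vol="$3"
  echo "clara virt stop $vm"
  echo "clara virt undefine $vm"
  echo "virsh vol-delete --pool $pool $vol"
  echo "virsh pool-refresh $pool"
}
```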