SQR-031: DM-EFD deployment and operation

  • Angelo Fausti

Latest Revision: 2019-07-05

Note

This technote is not yet published.

Instructions to deploy and operate the DM-EFD

1   Why are we doing this?

The DM-EFD is a solution based on Kafka and InfluxDB for recording telemetry, commands, and events for LSST. It was prototyped and tested with the simulator for the M1M3 subsystem. The next logical step is to deploy and test the DM-EFD with real hardware.

The Auxiliary Telescope Camera (ATCamera) is being tested at the Tucson lab, which presents a good opportunity to test the DM-EFD by recording data from these tests.

We might need to deploy the DM-EFD at the Summit, the Base facility, and the LDF. The ability to deploy the DM-EFD quickly in different environments, and to reproduce those deployments, is therefore crucial. To solve this problem we’ve adopted Docker and Kubernetes as our deployment platform, and a combination of Terragrunt, Terraform, and Helm to manage and automate the deployments.

In this technote, we demonstrate that we can deploy the DM-EFD on a single machine with Docker and k3s (“kubes”), a lightweight Kubernetes distribution, using the same Terraform modules and Helm charts that we used in our Google Cloud deployment. We also provide instructions on how to operate and use the DM-EFD system.

2   The ATCamera test environment

As of June 2019, the ATCamera test environment at the Tucson lab runs SAL 3.9. The following subsystems are being tested and produce data: ATCamera, ATHeaderService, ATArchiver, ATMonochromator, ATSpectrograph, and Electrometer.

Kafka writers are responsible for sending messages from each SAL topic to the Kafka brokers.

The DM-EFD will be deployed on a dedicated machine, ts-csc-01 (140.252.32.142). The first step is to provision a Kubernetes cluster on that machine.

3   Provisioning a k3s (“kubes”) cluster

k3s is a lightweight Kubernetes distribution that we can run in a container.

3.1   Requirements

We assume a Linux box running CentOS 7. We’ve installed Docker CE and kubectl:

  • Docker CE 18.09.6
  • kubectl 1.14.1

Note

We also tried k3s locally, on Docker Desktop for Mac, but it cannot route traffic from the host to the container (the --network host option does not work).

3.2   Configure Docker to start on boot

CentOS uses systemd to manage which services start when the system boots. Run the following to configure Docker to start on boot.

sudo systemctl enable docker

3.3   Start the k3s master

Start the k3s master with the following commands:

export K3S_PORT=6443
export K3S_URL=http://localhost:${K3S_PORT}
export K3S_TOKEN=$(date | base64)
export HOST_PATH=/data # change depending on your host
export CONTAINER_PATH=/opt/local-path-provisioner
sudo docker run  -d --restart always --tmpfs /run --tmpfs /var/run --volume ${HOST_PATH}:${CONTAINER_PATH} -e K3S_URL=${K3S_URL} -e K3S_TOKEN=${K3S_TOKEN} --privileged --network host --name master docker.io/rancher/k3s:v0.5.0-rc1 server --https-listen-port ${K3S_PORT} --no-deploy traefik

The --restart always option ensures that the k3s master is automatically restarted after a system reboot.

Data is persisted at $HOST_PATH with the --volume option (see also 3.4   Deploy the local-path provisioner).

With the --network host option, network traffic is routed from the host to the container. We need that in order to reach the different services running on k3s.

Note that we are not deploying the Traefik Ingress Controller, which is included in the k3s Docker image, because the DM-EFD already deploys the NGINX Ingress Controller.

To connect to the master you need to copy the kubeconfig file from the container, and set the KUBECONFIG environment variable so that kubectl knows how to connect to the cluster:

sudo docker cp master:/etc/rancher/k3s/k3s.yaml k3s.yaml
export KUBECONFIG=$(pwd)/k3s.yaml
kubectl cluster-info

Kubernetes master is running at https://localhost:6443
CoreDNS is running at https://localhost:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

To connect to the cluster from another machine, copy the k3s.yaml file to that machine and replace localhost with 140.252.32.142.
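The substitution can also be scripted. The following is a minimal Python sketch, not part of the DM-EFD tooling: the function name point_kubeconfig_at and the inline sample are hypothetical, and it assumes the kubeconfig contains a server entry of the form https://localhost:<port>, as in the cluster-info output above.

```python
def point_kubeconfig_at(kubeconfig_text, host):
    """Return kubeconfig text with the API server host changed from localhost."""
    # Only the server URL is touched; the port and the rest of the file are kept.
    return kubeconfig_text.replace("https://localhost:", "https://{}:".format(host))


# Hypothetical excerpt of a k3s.yaml file, used here for illustration only.
sample = """apiVersion: v1
clusters:
- cluster:
    server: https://localhost:6443
  name: default
"""

remote = point_kubeconfig_at(sample, "140.252.32.142")
print(remote)
```

In practice the function would be applied to the contents of the copied k3s.yaml and the result written back to a file referenced by KUBECONFIG.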

3.4   Deploy the local-path provisioner

The local-path provisioner will create hostPath persistent volumes on the node automatically. The directory /opt/local-path-provisioner will be used as the path for provisioning. The provisioner will be installed in the local-path-storage namespace by default.

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

At this point you should see the following pods running in the cluster:

kubectl get pods --all-namespaces
NAMESPACE            NAME                                      READY   STATUS    RESTARTS   AGE
kube-system          coredns-695688789-r9gkt                   1/1     Running   0          5m
local-path-storage   local-path-provisioner-5d4b898474-vz2np   1/1     Running   0          4s
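If this check needs to be automated, the listing can be parsed. The sketch below is an illustration under assumptions, not part of the deployment: the function name pods_not_running is hypothetical, and it assumes the column layout shown above (a header row containing NAME and STATUS, with whitespace-separated fields).

```python
def pods_not_running(kubectl_output):
    """Return the names of pods whose STATUS column is not 'Running'."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    # Locate the columns by name so the function works with or without NAMESPACE.
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    not_running = []
    for line in lines[1:]:
        fields = line.split()
        if fields[status_idx] != "Running":
            not_running.append(fields[name_idx])
    return not_running


# Output copied from the listing above.
listing = """NAMESPACE            NAME                                      READY   STATUS    RESTARTS   AGE
kube-system          coredns-695688789-r9gkt                   1/1     Running   0          5m
local-path-storage   local-path-provisioner-5d4b898474-vz2np   1/1     Running   0          4s
"""

print(pods_not_running(listing))
```

The text passed to the function would normally come from running kubectl get pods --all-namespaces and capturing its standard output.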

3.5   Add workers (optional)

If there are more machines, you can easily add workers to the cluster. Copy the node-token from the master:

sudo docker cp master:/var/lib/rancher/k3s/server/node-token node-token

and start the worker(s):

export SERVER_URL=https://<master external IP>:${K3S_PORT}
export NODE_TOKEN=$(cat node-token)
export WORKER=kube-0
export HOST_PATH=/data # change depending on your host
export CONTAINER_PATH=/opt/local-path-provisioner
sudo docker run -d --tmpfs /run --tmpfs /var/run -v ${HOST_PATH}:${CONTAINER_PATH} -e K3S_URL=${SERVER_URL} -e K3S_TOKEN=${NODE_TOKEN} --privileged --name ${WORKER} rancher/k3s:v0.5.0-rc1

Note

By default, /opt/local-path-provisioner is used on all nodes to store persistent volume data; see the local-path provisioner configuration.

4   Deploy the DM-EFD

Once the cluster is ready we can deploy the DM-EFD.

4.1   Requirements

  • AWS credentials (we save the deployment configuration to an S3 bucket and create names for our services on Route53)
  • TLS/SSL certificates for the lsst.codes domain (certificates are shared via the SQuaRE Dropbox account)
  • Deployment configuration for the DM-EFD test environment (secrets are shared via the SQuaRE 1Password account)

Note

The current mechanism to share secrets and certificates is not ideal. We still need to integrate our DM-EFD deployment with the Vault service recently implemented by SQuaRE.

We automate the deployment of the DM-EFD with Terraform and Helm. Terragrunt is used to manage the different deployment environments (dev, test, stage, and production) while keeping the Terraform modules environment-agnostic. We also use Terragrunt to save the Terraform configuration and the state of a particular deployment remotely.

Install Terragrunt, Terraform, and Helm.

git clone https://github.com/lsst-sqre/terragrunt-live-test.git
cd terragrunt-live-test
make all
export PATH="${PWD}/bin:${PATH}"

Install the SSL certificates (this step requires access to the SQuaRE Dropbox account).

make tls

4.2   Initialize the deployment environment

The following commands initialize the deployment environment. Terragrunt uses an S3 bucket to save the deployment configuration, so this step requires the AWS credentials.

export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""

cd afausti/efd
make all
terragrunt init --terragrunt-source-update
terragrunt init

4.3   Deployment configuration

The DM-EFD deployment configuration on k3s is defined by a set of Terraform variables listed in the terraform-efd-k3s repository.

Edit the terraform.tfvars file with the values obtained from the SQuaRE 1Password account. Search for terraform vars (efd test).

Finally, deploy the DM-EFD with the following commands:

terragrunt plan
terragrunt apply

4.4   Outputs

If everything is correct, you should see output similar to the following, listing the deployed services:

confluent_lb_ips = [140.252.32.142]
grafana_fqdn = test-grafana-efd.lsst.codes
influxdb_fqdn = test-influxdb-efd.lsst.codes
nginx_ingress_ip = 140.252.32.142
prometheus_fqdn = test-prometheus-efd.lsst.codes

The Kafka cluster can be reached at test-efd.lsst.codes:31090.
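Before moving on to the Kafka-level tests, a plain TCP connection attempt can confirm that the address is reachable from a client machine. This is a minimal sketch using only the Python standard library; the function name can_connect is hypothetical, and a successful connection only shows that the port accepts TCP connections, not that the brokers are healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call for the address given above (requires network access):
# can_connect("test-efd.lsst.codes", 31090)
```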

5   Testing the DM-EFD

The DM-EFD deployment can be tested using kafkacat, a command-line utility implemented with librdkafka, the Apache Kafka C/C++ client library.

Run in producer mode (-P) to produce messages for a test topic:

kafkacat -P -b test-efd.lsst.codes:31090 -t test_topic
Hello EFD!
^D

Run in metadata listing mode (-L) to retrieve metadata from the cluster:

kafkacat -L -b test-efd0.lsst.codes:31090

The -d option enables librdkafka debugging. For instance, -d broker can be used to debug connection issues with the cluster:

kafkacat -L -b test-efd0.lsst.codes:31090 -d broker

6   Monitoring

The DM-EFD deployment includes dashboards, built on Prometheus metrics, for monitoring the k3s cluster and Kafka. You can log in with your GitHub credentials if you are a member of the lsst-sqre organization.

7   Using the DM-EFD

In this section we document some procedures that are useful for operating the DM-EFD. Please refer to DM-EFD prototype implementation based on Kafka and InfluxDB for an overview of the DM-EFD system.

Note

As of May 23, 2019 the Tucson test stand runs SAL version 3.8. This version does not include the Kafka writers. We are waiting for SAL 3.9 or later to be deployed to continue this work. The commands presented below have not yet been tested in that environment, but they illustrate how to interact with the cluster to perform tasks such as initializing a new SAL subsystem, checking the status of the SAL transform apps and the InfluxDB Sink connector, and retrieving data from the DM-EFD.

7.1   Initialize a SAL subsystem

The following command initializes a SAL subsystem, deploys the corresponding SAL transform app, and configures the InfluxDB Sink connector to consume the SAL topics of that subsystem. This example uses the ATCamera:

helm install --name ATCamera --namespace kafka-efd-apps --set subsystem=ATCamera lsstsqre/kafka-efd-apps
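To initialize several subsystems, the same command can be generated once per subsystem. The sketch below only builds and prints the command lines; it does not run Helm. The function name helm_install_command is hypothetical, and it assumes that the chart accepts every subsystem listed in Section 2 in the same way as ATCamera, which has not been verified here.

```python
import shlex

# Subsystems listed for the ATCamera test environment in Section 2.
SUBSYSTEMS = [
    "ATCamera",
    "ATHeaderService",
    "ATArchiver",
    "ATMonochromator",
    "ATSpectrograph",
    "Electrometer",
]


def helm_install_command(subsystem,
                         chart="lsstsqre/kafka-efd-apps",
                         namespace="kafka-efd-apps"):
    """Build the helm install command line for one SAL subsystem."""
    argv = [
        "helm", "install",
        "--name", subsystem,
        "--namespace", namespace,
        "--set", "subsystem={}".format(subsystem),
        chart,
    ]
    # Quote each argument so the printed line is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in argv)


commands = [helm_install_command(name) for name in SUBSYSTEMS]
for command in commands:
    print(command)
```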

7.2   Check a SAL transform app

Inspect the logs of the SAL transform app for a particular subsystem. This example uses the ATCamera:

kubectl logs $(kubectl get pods --namespace kafka-efd-apps -l "app=saltransform,subsystem=ATCamera" -o jsonpath="{.items[0].metadata.name}") --namespace kafka-efd-apps

7.3   Check the InfluxDB Sink connector

Inspect the Kafka Connect logs:

kubectl logs $(kubectl get pods --namespace kafka -l "app=cp-kafka-connect,release=confluent" -o jsonpath="{.items[0].metadata.name}") cp-kafka-connect-server --namespace kafka -f

7.4   Getting data from the DM-EFD

InfluxDB provides an HTTP API for accessing the data. The Python snippet below gets data for a particular SAL topic from the DM-EFD. In this example, we retrieve the temperature for CCD 0 over the last 24 hours:

import requests

INFLUXDB_API_URL = "https://test-influxdb-efd.lsst.codes"
INFLUXDB_DATABASE = "efd"

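def get_topic_data(topic, period="24h"):
    """Return the data recorded for a SAL topic over the given period as JSON."""
    # InfluxQL query against the "autogen" retention policy. The whitespace
    # between "-" and the duration literal is required by InfluxQL.
    query = 'SELECT * FROM "{}"."autogen"."{}" WHERE time > now() - {}'.format(
        INFLUXDB_DATABASE, topic, period)
    r = requests.post(url=INFLUXDB_API_URL + "/query", params={'q': query})
    # Raise on HTTP errors instead of trying to parse an error response
    r.raise_for_status()

    return r.json()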

get_topic_data("lsst.sal.ATCamera.ccdTemp0")
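The JSON returned by the InfluxDB 1.x /query endpoint nests the data under results, then series, with separate columns and values lists. The following sketch flattens that structure into one dictionary per row. The function name series_to_rows is hypothetical, and the sample response, including the field name temp and its values, is made up for illustration; it is not real DM-EFD data.

```python
def series_to_rows(response):
    """Flatten an InfluxDB 1.x query response into a list of row dictionaries."""
    rows = []
    # A response without data has no "series" key, hence the defaults.
    for result in response.get("results", []):
        for series in result.get("series", []):
            columns = series["columns"]
            for values in series["values"]:
                rows.append(dict(zip(columns, values)))
    return rows


# Hypothetical response in the shape produced by the /query endpoint.
sample_response = {
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "lsst.sal.ATCamera.ccdTemp0",
                    "columns": ["time", "temp"],
                    "values": [
                        ["2019-07-05T00:00:00Z", -100.2],
                        ["2019-07-05T00:00:10Z", -100.1],
                    ],
                }
            ],
        }
    ]
}

rows = series_to_rows(sample_response)
for row in rows:
    print(row)
```

With a reachable deployment, the same function could be applied to the value returned by get_topic_data.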