SQR-031: DM-EFD deployment and operation

  • Angelo Fausti

Latest Revision: 2019-05-23

Note

This technote is not yet published.

Instructions to deploy and operate the DM-EFD

1   Why are we doing this?

The DM-EFD is a solution based on Kafka and InfluxDB for recording the LSST telemetry and events. It was prototyped and tested with the simulator for the M1M3 subsystem. The next logical step is to deploy and test the DM-EFD in a more realistic environment.

The Tucson test stand is currently testing the Auxiliary Telescope Camera (ATCamera) and needs to record telemetry from these tests. Soon, we will have to deploy the DM-EFD at the NCSA test stand and at the summit for the Auxiliary Telescope, and finally for the main telescope and the whole observatory.

In this scenario, the ability to deploy the DM-EFD quickly on heterogeneous environments and to ensure the reproducibility of the deployments is crucial. To solve the portability problem we adopted Docker and Kubernetes. To manage and automate the deployments we use a combination of Terragrunt, Terraform, and Helm.

In this technote, we demonstrate that we can deploy the DM-EFD on a single machine with Docker and k3s (“kubes”), a lightweight Kubernetes, using the same Terraform modules and Helm charts that we used in our Google Cloud deployment. We also provide instructions on how to operate and use the DM-EFD system.

2   Deploy k3s (“kubes”)

k3s is a lightweight Kubernetes that we can run in a container.

2.1   Requirements

We assume you have a Linux box with Docker CE and kubectl installed. We used:

  • Docker CE 18.09.6
  • kubectl 1.14.1

Note

We also tried the DM-EFD deployment on k3s using Docker Desktop for Mac, but there the --network host option does not route traffic from the host to the container. That limits our deployment to Linux.

2.1.1   Configure Docker to start on boot

CentOS uses systemd to manage which services start when the system boots. Run the following to configure Docker to start on boot.

sudo systemctl enable docker

2.2   Start the k3s master

Start the k3s master with the following commands:

export K3S_PORT=6443
export K3S_URL=http://localhost:${K3S_PORT}
export K3S_TOKEN=$(date | base64)
export HOST_PATH=/data # change depending on your host
export CONTAINER_PATH=/opt/local-path-provisioner
sudo docker run -d \
  --tmpfs /run --tmpfs /var/run \
  -v ${HOST_PATH}:${CONTAINER_PATH} \
  -e K3S_URL=${K3S_URL} -e K3S_TOKEN=${K3S_TOKEN} \
  --privileged --network host --name master \
  docker.io/rancher/k3s:v0.5.0-rc1 \
  server --https-listen-port ${K3S_PORT} --no-deploy traefik

Note that we are not deploying Traefik because the DM-EFD already includes the NGINX Ingress Controller.
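
The K3S_TOKEN above is derived from the current date, which is easy to guess. As a hedged alternative, a random token could be generated with Python's standard secrets module. This is only a sketch: the helper name make_k3s_token is an assumption made for illustration and is not part of the original procedure.

```python
import secrets


def make_k3s_token(nbytes: int = 32) -> str:
    """Return a random hex string to use as a shared secret.

    `make_k3s_token` is a hypothetical helper, not part of k3s.
    """
    return secrets.token_hex(nbytes)


# Print a line that can be pasted into the shell session used above.
print("export K3S_TOKEN={}".format(make_k3s_token()))
```

The printed line replaces the export K3S_TOKEN=$(date | base64) step; the rest of the commands stay the same.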

To connect to the master you need to copy the kubeconfig file from the container:

sudo docker cp master:/etc/rancher/k3s/k3s.yaml k3s.yaml

At this point you can access the cluster:

export KUBECONFIG=$(pwd)/k3s.yaml
kubectl cluster-info

Kubernetes master is running at https://localhost:6443
CoreDNS is running at https://localhost:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

2.3   Deploy the local-path provisioner

The local-path provisioner will create hostPath persistent volumes on the node automatically. The directory /opt/local-path-provisioner will be used as the path for provisioning. The provisioner will be installed in the local-path-storage namespace by default.

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

At this point you should see the following pods running in the cluster:

kubectl get pods --all-namespaces
NAMESPACE            NAME                                      READY   STATUS    RESTARTS   AGE
kube-system          coredns-695688789-r9gkt                   1/1     Running   0          5m
local-path-storage   local-path-provisioner-5d4b898474-vz2np   1/1     Running   0          4s
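
When scripting the deployment, the pod listing above can be checked automatically instead of by eye. The following is a minimal sketch that parses the text printed by kubectl get pods --all-namespaces; the function name pods_not_ready is an assumption made for illustration, and the sample text is the listing shown above.

```python
def pods_not_ready(kubectl_output: str) -> list:
    """Return (namespace, name) pairs for pods that are not fully running.

    Expects the text printed by `kubectl get pods --all-namespaces`,
    including the header line. `pods_not_ready` is a hypothetical helper.
    """
    problems = []
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        namespace, name, ready, status = line.split()[:4]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append((namespace, name))
    return problems


SAMPLE = """\
NAMESPACE            NAME                                      READY   STATUS    RESTARTS   AGE
kube-system          coredns-695688789-r9gkt                   1/1     Running   0          5m
local-path-storage   local-path-provisioner-5d4b898474-vz2np   1/1     Running   0          4s
"""

print(pods_not_ready(SAMPLE))  # prints []
```

Note that this simple check also reports pods in the Completed state, so it is best suited to the early stage shown here, where only long-running pods are expected.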

2.4   Add workers (optional)

If there are more machines you can easily add workers to the cluster. Copy the node-token from the master:

sudo docker cp master:/var/lib/rancher/k3s/server/node-token node-token

and start the worker(s):

export SERVER_URL=https://<master external IP>:${K3S_PORT}
export NODE_TOKEN=$(cat node-token)
export WORKER=kube-0
export HOST_PATH=/data # change depending on your host
export CONTAINER_PATH=/opt/local-path-provisioner
sudo docker run -d \
  --tmpfs /run --tmpfs /var/run \
  -v ${HOST_PATH}:${CONTAINER_PATH} \
  -e K3S_URL=${SERVER_URL} -e K3S_TOKEN=${NODE_TOKEN} \
  --privileged --name ${WORKER} \
  rancher/k3s:v0.5.0-rc1

Note

By default, /opt/local-path-provisioner is used on all nodes to store persistent volume data. See the local-path provisioner configuration to change this path.

3   Deploy the DM-EFD

Once the cluster is ready we can deploy the DM-EFD.

3.1   Requirements

  • AWS credentials (we save the deployment configuration to an S3 bucket and create names for our services on Route53)
  • TLS/SSL certificates for the lsst.codes domain (certificates are shared via the SQuaRE Dropbox account)
  • Deployment configuration for the DM-EFD test environment (secrets are shared via the SQuaRE 1Password account)

Note

The current mechanism to share secrets and certificates is not ideal; we still need to integrate our DM-EFD deployment with the Vault service recently implemented by SQuaRE.

We automate the deployment of the DM-EFD with Terraform and Helm. Terragrunt is used to manage the different deployment environments (dev, test, stage, and production) while keeping the Terraform modules environment-agnostic. We also use Terragrunt to save the Terraform configuration and the state of a particular deployment remotely.

Install Terragrunt, Terraform, and Helm.

git clone https://github.com/lsst-sqre/terragrunt-live-test.git
cd terragrunt-live-test
make all
export PATH="${PWD}/bin:${PATH}"
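
Before continuing, it can be useful to confirm that the tools are visible on the PATH. The sketch below uses shutil.which from the Python standard library; the function name missing_tools and the default list of tool names are assumptions made for illustration.

```python
import shutil


def missing_tools(tools=("terragrunt", "terraform", "helm", "kubectl")):
    """Return the names in `tools` that cannot be found on PATH.

    `missing_tools` is a hypothetical helper; the default tool list is
    taken from the requirements described in this technote.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All deployment tools found.")
```

Run it from the same shell session in which PATH was exported, since the change to PATH does not persist across sessions.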

Install the SSL certificates (this step requires access to the SQuaRE Dropbox account).

make tls

3.2   Initialize the deployment environment

The following commands initialize the deployment environment. Terragrunt uses an S3 bucket to save the deployment configuration, so this step requires the AWS credentials.

export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""

cd afausti/efd
make all
terragrunt init --terragrunt-source-update
terragrunt init

3.3   Deployment configuration

The DM-EFD deployment configuration on k3s is defined by a set of Terraform variables listed in the terraform-efd-k3s repository.

Edit the terraform.tfvars file with the values obtained from the SQuaRE 1Password account. Search for terraform vars (efd test).

Finally, deploy the DM-EFD with the following commands:

terragrunt plan
terragrunt apply

3.4   Outputs

If everything is correct you should see an output similar to this, indicating the services deployed:

confluent_lb_ips = [140.252.32.142]
grafana_fqdn = test-grafana-efd.lsst.codes
influxdb_fqdn = test-influxdb-efd.lsst.codes
nginx_ingress_ip = 140.252.32.142
prometheus_fqdn = test-prometheus-efd.lsst.codes

The Kafka cluster can be reached at test-efd0.lsst.codes:9094.
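
A quick way to confirm that a deployed endpoint accepts connections is a plain TCP check. The following is a minimal sketch using the Python standard library; the function name can_connect is an assumption made for illustration, and a successful TCP connection only shows that the port is open, not that the service behind it is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    `can_connect` is a hypothetical helper. It does not speak the Kafka
    protocol; it only checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example (requires network access to the test deployment):
# can_connect("test-efd0.lsst.codes", 9094)
```

The same function can be pointed at port 443 of the Grafana, InfluxDB, and Prometheus names listed in the outputs, assuming they are served over HTTPS through the ingress.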

4   Testing the installation

The installation can be tested using kafkacat, a command-line utility implemented with librdkafka, the Apache Kafka C/C++ client library.

Run in producer mode (-P) to produce messages for a test topic:

kafkacat -P -b test-efd0.lsst.codes:9094 -t test_topic
Hello EFD!
^D

Run in metadata listing mode (-L) to retrieve metadata from the cluster:

kafkacat -L -b test-efd0.lsst.codes:9094

The -d option enables librdkafka debugging. For instance, -d broker can be used to debug connection issues with the cluster:

kafkacat -L -b test-efd0.lsst.codes:9094 -d broker
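
The producer test can also be scripted, for example as part of an automated smoke test. The sketch below wraps the kafkacat producer command shown above with subprocess; the function names kafkacat_produce_cmd and produce are assumptions made for illustration, and kafkacat must be installed for produce to work.

```python
import subprocess


def kafkacat_produce_cmd(broker: str, topic: str) -> list:
    """Build the kafkacat producer command as an argument list."""
    return ["kafkacat", "-P", "-b", broker, "-t", topic]


def produce(broker: str, topic: str, message: str) -> int:
    """Send one message through kafkacat's stdin and return its exit code.

    Both helpers are hypothetical wrappers around the command shown above.
    """
    result = subprocess.run(
        kafkacat_produce_cmd(broker, topic),
        input=message.encode("utf-8"),
    )
    return result.returncode


# Example (requires kafkacat and access to the test cluster):
# produce("test-efd0.lsst.codes:9094", "test_topic", "Hello EFD!\n")
```

Passing the command as a list avoids shell quoting issues with the message content.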

5   Monitoring

The DM-EFD deployment includes dashboards for monitoring the k3s cluster and Kafka, both instrumented with Prometheus metrics. You can log in with your GitHub credentials if you are a member of the lsst-sqre GitHub organization.

6   Using the DM-EFD

In this section we document some procedures that are useful for operating the DM-EFD. Please refer to DM-EFD prototype implementation based on Kafka and InfluxDB for an overview of the DM-EFD system.

Note

As of May 23, 2019, the Tucson test stand runs SAL version 3.8, which does not include the Kafka writers. We are waiting for SAL 3.9 or later to be deployed before continuing this work. The commands presented below have not yet been tested in that environment, but they illustrate how to interact with the cluster to perform tasks such as initializing a new SAL subsystem, checking the status of the SAL transform apps and the InfluxDB Sink connector, and retrieving data from the DM-EFD.

6.1   Initialize a SAL subsystem

The following command initializes a SAL subsystem, deploys the corresponding SAL transform app, and configures the InfluxDB Sink connector to consume the SAL topics of that subsystem. This example uses the ATCamera:

helm install --name ATCamera --namespace kafka-efd-apps --set subsystem=ATCamera lsstsqre/kafka-efd-apps

6.2   Check a SAL transform app

Inspect the logs of a SAL transform app for a particular subsystem. In this example the ATCamera:

kubectl logs $(kubectl get pods --namespace kafka-efd-apps -l "app=saltransform,subsystem=ATCamera" -o jsonpath="{.items[0].metadata.name}") --namespace kafka-efd-apps

6.3   Check the InfluxDB Sink connector

Inspect the Kafka Connect logs:

kubectl logs $(kubectl get pods --namespace kafka -l "app=cp-kafka-connect,release=confluent" -o jsonpath="{.items[0].metadata.name}") cp-kafka-connect-server --namespace kafka -f

6.4   Getting data from the DM-EFD

InfluxDB provides an HTTP API for accessing the data. The Python snippet below gets the data for a particular SAL topic from the DM-EFD. In this example, we retrieve the temperature for CCD 0 over the last 24 hours:

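import requests

INFLUXDB_API_URL = "https://test-influxdb-efd.lsst.codes"
INFLUXDB_DATABASE = "efd"


def get_topic_data(topic, period="24h"):
    """Return all fields of a SAL topic recorded within the given period."""
    # InfluxQL: database and measurement names go in double quotes.
    query = 'SELECT * FROM "{}"."autogen"."{}" WHERE time > now() - {}'.format(
        INFLUXDB_DATABASE, topic, period
    )
    r = requests.post(url=INFLUXDB_API_URL + "/query", params={"q": query})
    # Fail early on HTTP errors instead of trying to decode an error page.
    r.raise_for_status()

    return r.json()


get_topic_data("lsst.sal.ATCamera.ccdTemp0")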
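
The JSON returned by the InfluxDB 1.x /query endpoint nests the data under results, then series, with separate columns and values lists. The following sketch flattens that structure into one dictionary per row. The function name series_to_rows is an assumption made for illustration, and the sample payload uses a made-up column name and made-up values only to show the shape of the response.

```python
def series_to_rows(response: dict) -> list:
    """Flatten an InfluxDB 1.x /query JSON response into a list of dicts.

    `series_to_rows` is a hypothetical helper, not part of the DM-EFD.
    """
    rows = []
    for result in response.get("results", []):
        # A result has no "series" key when the query matched no data.
        for series in result.get("series", []):
            columns = series["columns"]
            for values in series.get("values", []):
                rows.append(dict(zip(columns, values)))
    return rows


# Hypothetical payload with made-up values, shaped like an InfluxDB 1.x reply.
SAMPLE = {
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "lsst.sal.ATCamera.ccdTemp0",
                    "columns": ["time", "value"],
                    "values": [
                        ["2019-05-23T00:00:00Z", -99.5],
                        ["2019-05-23T00:00:01Z", -99.4],
                    ],
                }
            ],
        }
    ]
}

for row in series_to_rows(SAMPLE):
    print(row)
```

With a reachable deployment, the same function could be applied to the return value of get_topic_data, for example series_to_rows(get_topic_data("lsst.sal.ATCamera.ccdTemp0")).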