Downtime is a serious issue in Pubnixes
I have been maintaining some public-facing services on Exozyme, a pubnix (shared Linux server), for a little under a year.
During that time I ran into outages where one of the services or containers went down, and every time I had to fix and restart it manually.
This post documents the strategies I applied to solve this problem and get consistent uptime from these services.
I have deployed many services, and checking them manually to see whether they are up is a tedious task. Thanks to iacore and Anthony Wang for creating https://status.exozy.me, which monitors the status of every service deployed on the server.
Services I deployed
I have deployed:
cyberchef and Mysite are deployed statically, so no daemon has to keep running in the background for them to stay up, but the rest are podman containers. Sometimes, when a system-wide outage occurs, they go down. This is where the problem arises, as I may not be available in time to fix the issue and bring them back up.
Automation to fix service downtime
That is why I created this script, which runs an hourly check on each service and restarts it if it is found to be down.
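#!/bin/sh
# Check every container once an hour and revive any that have exited.
while true; do
    /home/nvpie/.local/bin/healthcheck.sh yt-local
    /home/nvpie/.local/bin/healthcheck.sh umami
    /home/nvpie/.local/bin/healthcheck.sh spdf
    # calibre-web is restarted unconditionally on every pass
    systemctl --user restart calibre-web.service
    echo "health check finished for all containers"
    # POSIX sleep takes seconds; the "1h" suffix is a GNU extension
    sleep 3600
done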
This is revive_pods.sh, a POSIX shell script.
It is deployed as a systemd user service, so it runs constantly in the background.
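For anyone who wants to reproduce this, a user unit for the loop could look roughly like the sketch below. The unit name revive-pods.service, the function name write_revive_unit, and the restart settings are my assumptions for illustration, not a copy of the unit running on the server. The function only writes the file; enabling it is a separate manual step.

```shell
#!/bin/sh
# Sketch: write a systemd user unit that keeps revive_pods.sh running.
# Assumptions: the unit name (revive-pods.service), the Restart settings
# and the function name are illustrative, not the real server setup.

write_revive_unit() {
    unit_dir=$1      # e.g. "$HOME/.config/systemd/user"
    script_path=$2   # absolute path, e.g. "$HOME/.local/bin/revive_pods.sh"

    mkdir -p "$unit_dir" || return 1
    # Unquoted EOF so that $script_path is expanded into the unit file.
    cat > "$unit_dir/revive-pods.service" <<EOF
[Unit]
Description=Hourly health check for user containers

[Service]
ExecStart=$script_path
Restart=on-failure
RestartSec=60

[Install]
WantedBy=default.target
EOF
}
```

After calling write_revive_unit "$HOME/.config/systemd/user" "$HOME/.local/bin/revive_pods.sh", the unit can be picked up with systemctl --user daemon-reload and started with systemctl --user enable --now revive-pods.service. Depending on how the host is configured, loginctl enable-linger may also be needed so that user services keep running without an active login.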
It uses the following healthcheck.sh script to perform a health check on each container or service.
#!/bin/sh
# Usage: healthcheck.sh NAME...
# Each NAME is both the container name and the name of its user service.
for container in "$@"; do
    status=$(podman inspect -f '{{.State.Status}}' "$container")
    if [ "$status" = "exited" ]; then
        podman start "$container" && echo "Container $container started" || echo "failed to start container $container"
        systemctl --user restart "$container.service" && echo "service restarted" || echo "failed to restart service"
    else
        echo "Container $container is $status"
    fi
done
Since the services are exposed publicly through a Unix socket, socat needs to run with the unlink-early option, so that a residual socket file left behind when the service goes down is removed before socat binds again. It is therefore ideal to run your service like this:
socat UNIX-LISTEN:/srv/http/yt-local,mode=660,fork,unlink-early TCP:localhost:5052
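Following the convention used here, where the socket is named after the container (yt-local in the command above), the socat command can be derived from just the name and the local port. This is only a sketch: SOCKET_DIR and bridge_cmd are hypothetical names of mine, and the function prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: build the socat bridge command from a service name and local port.
# Assumptions: sockets live under /srv/http and are named after the
# container, as in the yt-local example; SOCKET_DIR and bridge_cmd are
# illustrative names, not part of the original setup.
SOCKET_DIR=${SOCKET_DIR:-/srv/http}

bridge_cmd() {
    name=$1   # container name, reused as the socket file name
    port=$2   # local TCP port the container listens on
    printf 'socat UNIX-LISTEN:%s/%s,mode=660,fork,unlink-early TCP:localhost:%s\n' \
        "$SOCKET_DIR" "$name" "$port"
}
```

Running bridge_cmd yt-local 5052 prints the same command line as above, which makes it easy to keep the socket name and the container name in sync when adding another service.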
healthcheck.sh takes container or service names as arguments, so the Unix socket name and the container name should be the same. It uses podman inspect to check the state of each container, then starts the container and its respective user service if it has exited; otherwise it just prints the current status.
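The flow described above (inspect, then act only on an exited container) can also be split into small functions, which makes it possible to try the logic without touching real containers. This is a sketch under my own assumptions: the PODMAN variable, the function names, and the "missing" placeholder do not exist in the real healthcheck.sh, and nothing here starts or restarts anything.

```shell
#!/bin/sh
# Sketch of the decision logic only; nothing is started or restarted here.
# Assumptions: PODMAN, container_status, needs_revive, plan and the
# "missing" placeholder are illustrative, not part of healthcheck.sh.
PODMAN=${PODMAN:-podman}

# Print the container's status, or "missing" if inspect fails
# (for example because no container with that name exists).
container_status() {
    "$PODMAN" inspect -f '{{.State.Status}}' "$1" 2>/dev/null || echo "missing"
}

# Succeed only for the state the real script acts on.
needs_revive() {
    [ "$1" = "exited" ]
}

# Print one line per name describing what a real run would do.
plan() {
    for name in "$@"; do
        status=$(container_status "$name")
        if needs_revive "$status"; then
            echo "$name: would start container and restart $name.service"
        else
            echo "$name: $status, nothing to do"
        fi
    done
}
```

Calling plan yt-local umami spdf gives a dry-run report for the same containers the hourly loop checks. Pointing PODMAN at a stub command is a simple way to see how the script reacts to an exited or missing container before relying on it during a real outage.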
I have been running this setup for weeks now, and even though we have had multiple outages, I am glad to say I have not seen a service-down status for a while.