Dkron – Distributed Cron System – Solution to replace traditional cron and optimise cloud cost

There are many cron systems available right now, and each serves its purpose. Still, anyone who has hit the limitations of traditional cron has probably wondered whether there is something better, especially in today’s cloud world. In this article, I want to talk about a distributed cron system called Dkron, and why it may be the replacement for traditional cron that you are looking for, especially if you want to optimize cloud costs. It has helped me, and it will probably help you in the same way.

What is Dkron?

Dkron is a distributed, fault-tolerant job scheduling system. It is similar to the Unix cron service but designed for modern distributed environments, such as a cluster of several machines or a microservices setup, so that jobs still run reliably even if some machines fail. It is open source and available for free, and it also offers a paid version. I think the free version is enough in most cases; go with the paid version only if you need its special features.

Simply put, Dkron is designed to address the limitations of traditional cron: scalability, high availability, and resilience to failures.

Problems with traditional cron?

Traditional cron has served us well so far, but it comes with some challenges. I can name a few here; over time they became the reasons that led me to look for an alternative.

1. Single Point of Failure (SPOF)

In traditional cron setups, jobs are scheduled and executed on a single machine. If that machine crashes, undergoes maintenance, or experiences a network issue, all jobs scheduled on it will fail to execute. This is what’s known as a single point of failure (SPOF): a part of a system that would stop the entire system from working if it were to fail.

Dkron avoids this SPOF pretty smoothly: when operating in HA mode, it distributes jobs across the nodes in its cluster. HA mode uses a consensus protocol (the Raft protocol) for leader election among server-mode nodes, ensuring that only one node (the leader) schedules jobs at a time. When the leader dies, another server-mode node is elected leader and continues scheduling jobs to server and agent nodes, so the system keeps working.

2. Lack of Scalability

A traditional cron system normally has a number of jobs allocated to a single server. When the workload grows and exhausts the CPU or memory, we have to either reduce the jobs on that server by manually changing the configuration or upgrade the server to a larger size. Imagine having many cron servers hitting their resource limits and having to change them one by one. This is annoying and inefficient, especially when we run the cron system in a cloud environment or want to use a Spot Instance strategy to reduce cost on AWS.

Dkron addresses this issue by allowing jobs to be scheduled across a cluster of servers (this works pretty well in my experience) and by balancing the load to some extent (in my usage it does not balance completely). This lets us scale the system horizontally when the number of jobs or the resource usage increases. My company runs a fleet of nearly 20 Dkron nodes on stable version 3.2.6 (version 4.x is out, but from what I have seen it still has issues). I find that Dkron does not balance load effectively because it cannot detect whether a server is about to run out of resources. For example, when the running jobs already occupy almost all of the CPU and memory, Dkron sometimes still schedules more jobs onto that server and overloads it. There are several mechanisms we can implement ourselves to prevent this, and I’ll talk about them at the end of this article. Hopefully this improves soon, but Dkron is still a good choice when we want more high availability and resilience.

3. Lack of Fault Tolerance

When operating a traditional cron system, I often ran into situations where a server suddenly crashed at night, when no one was available to fix it immediately. All of the jobs on that server failed to execute, which led to delays, and they had to wait until the next day when I or another developer was available to fix the server. This is what fault tolerance is about.

Dkron has a few mechanisms to handle this, which include:

When a node dies, the leader starts scheduling jobs to the other available nodes, as I mentioned in the SPOF section.
Job retries option: when a job on one node suddenly fails to execute due to server issues, Dkron tries to run the job again on another node until the retry count reaches its limit.
Dkron uses an embedded distributed KV store engine based on BuntDB, where the jobs’ data is stored persistently across the servers. So even if one server goes down, the cluster keeps working and can still query jobs from the storage engine normally.

Why use Dkron over others?

With the rise of microservices and cloud-native applications, there are many approaches to be found on the internet, but Dkron fits our requirements. When I searched for a distributed cron system, Dkron was the main candidate I found. An alternative I discovered is Kubernetes CronJobs, which also support distributed cron tasks.

However, once you weigh ease of use and the learning curve against the requirements of high availability, reliability, and fault tolerance in a fast-paced environment, the choice comes down to three questions. What problems does it tackle? Why do you consider adding it to your tech stack? How much benefit does it bring to your organisation, not just the IT department?

For example, if you just need a fast solution to replace traditional cron and you have zero knowledge of Kubernetes, why would you choose Kubernetes over Dkron? With Dkron, at least you know what you are doing. I’m not saying you don’t have to learn Dkron, but it’s much simpler than going with Kubernetes.

Conversely, if your team plans to redesign the system or part of it and wants to try a microservice architecture in the near future, this is probably a chance to adopt Kubernetes and replace traditional cron in one go.

Use Case:

For my company’s use case, we chose Dkron because it answers the questions I mentioned above:

What problems does it tackle?

Dkron meets our requirements of high availability, reliability, and fault tolerance. It addresses the problems that traditional cron was causing us. It is also the bridge to answering the third question.

Why do you consider adding to your tech stack?

To us, it’s a small tool that is fast to implement without much of a learning curve. It supports the next question below, which is the main purpose.

How much benefit does it bring to your organisation, not just an IT department?

Dkron fits our AWS cost optimization strategy, as we run Dkron on a fleet of EC2 instances. We apply the old-but-gold Spot Instance strategy, which brings the cost down significantly compared to running a fleet of traditional cron servers. The difference seemed insane the first time we found out. Our Savings Plan has helped, but it doesn’t yield the results we achieve with Spot Instances. For instance, a c7g family instance costs us $0.76440/hour through the Savings Plan (we use c7g.4xlarge and a lot of other c7g types), but as a Spot Instance the cost is just around 0.18xxx/hour, depending on the availability zone. That’s quite impressive! We combine Auto Scaling + Spot Instances + Dkron, which is very convenient: the system scales up or down with our jobs’ workload, and we can adjust it to our expectations.
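To put those two prices side by side, here is a small worked calculation. The $0.76440 figure is the Savings Plan price quoted above. Since I only gave the Spot price as 0.18xxx, the sketch brackets it with 0.18 and 0.19 rather than assuming the exact digits. The savings_pct helper is just a name made up for this example.

```shell
# Percentage saved when moving from one hourly price to a cheaper one.
savings_pct() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

# 0.76440 is the Savings Plan price from the article; 0.19 and 0.18 only
# bracket the "0.18xxx" Spot price and are not exact figures.
savings_pct 0.76440 0.19   # saving if the Spot price were at the top of the range
savings_pct 0.76440 0.18   # saving if the Spot price were at the bottom of the range
```

This prints 75.1 and 76.5, so under these assumptions the Spot price is roughly three quarters cheaper per hour than what we pay through the Savings Plan.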

How Dkron works?

Dkron nodes can work in two modes: agent or server.

Depending on your setup, you can run Dkron on a standalone server (acting as both server and agent) or as a cluster of nodes. For a cluster, the recommended deployment is either 3 or 5 nodes in server mode (I would suggest at least 3), and you can then add many more nodes to the cluster in agent mode. Among the server-mode nodes, only one plays the leader role and is responsible for starting job execution queries in the cluster. The others are members, ready to be elected when the leader goes down.
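The reason for choosing 3 or 5 server-mode nodes is that Raft needs a majority of them to agree on a leader. The sketch below is plain arithmetic, not Dkron code, and the function names are hypothetical. It shows how many server-mode nodes must stay up and how many can fail for a given cluster size.

```shell
# Majority quorum for a Raft cluster of N server-mode nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of server-mode nodes that can fail while a quorum survives.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  echo "servers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

A cluster of 4 tolerates no more failures than a cluster of 3, which is why odd sizes are the usual choice. Agent-mode nodes do not take part in this count.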

Basic definition

Here are just the basic definitions to help you grasp what Dkron’s components do. You can go to the Dkron docs to learn more.

A Dkron agent is a cluster member that can handle job executions, run your scripts and return the resulting output to the server.

A Dkron server is also a cluster member but can send job execution queries to agents or other servers, so servers can execute jobs too.

The main distinction is that server nodes order job executions, can be used to schedule jobs, handle data storage and participate in leader election.

Dkron clusters have a leader; the leader is responsible for starting job execution queries in the cluster.

Dkron mainly advocates the concept of doing one thing and being good at that one thing, which here is job scheduling in a modern distributed system.

Installation

Installing Dkron is pretty straightforward. I’ll show you how to set up an HA cluster, because we aim to build an HA cron system rather than a standalone one, which traditional cron can already serve well.

I’ll demo using 3 Ubuntu 22.04 servers running on VirtualBox:

dkron-1: 192.168.68.116
dkron-2: 192.168.68.117
dkron-3: 192.168.68.118

The installation steps are the same when you install on EC2 instances. If you want to deploy through Kubernetes, the steps are different; please refer to the installation docs to learn more.

Note: if you want a local DNS server so that you don’t have to remember the IP address of each machine, check out How to Set Up a Local DNS Server with Dnsmasq in Ubuntu 24.04 (Fast & Easy) to facilitate your testing.

On each Dkron server, add the Debian repo so that we can download and install the dkron package. I’m running as a non-root user.

echo "deb [trusted=yes] https://repo.distrib.works/apt/ /" | sudo tee -a /etc/apt/sources.list.d/dkron.list

Update APT repository:

sudo apt update

Install Dkron:

sudo apt install dkron

Setup the Cluster

Before proceeding, if you have iptables or another firewall enabled, you need to open these ports so that the Dkron servers can communicate with each other:

8946 for serf layer between agents
8080 for HTTP for the API and Dashboard
6868 for gRPC and raft layer communication between agents

Run the commands below to open the ports in UFW. (For this demo I turned the firewall off to keep things easier.)

sudo ufw allow 8946/tcp
sudo ufw allow 8946/udp
sudo ufw allow 8080/tcp
sudo ufw allow 6868/tcp

Optional: If you deploy Dkron on AWS EC2 instances, you need to open the ports in the Security Group:

8946/tcp, 8946/udp, 6868/tcp: open these between the Dkron servers.
8080/tcp: open this to the other private subnets in your network so that you can reach the dashboard on the Dkron servers from outside the cluster.

Next, we bootstrap a single Dkron server and capture its IP address. Once we have that node’s IP address, we place it in the configuration.

On dkron-1, run the following commands:

# Back up the default config file to use as reference later
sudo cp /etc/dkron/dkron.yml /etc/dkron/dkron.yml.bak

Run sudo nano /etc/dkron/dkron.yml and add the following content:

server: true
bootstrap-expect: 1

Press Ctrl + O and then Enter to save the content to the file, then press Ctrl + X to exit.

Start the dkron service to bootstrap dkron-1 so that it becomes the leader first.

sudo systemctl start dkron.service

When you check systemctl status dkron.service, you will see an info log saying that dkron-1 acquired the leadership role.

Next, stop the dkron service on the dkron-1 server so that we can add the IPs of the 3 Dkron servers to form the cluster.

sudo systemctl stop dkron.service

Now, on all 3 dkron servers, run sudo nano /etc/dkron/dkron.yml and add the following content:

server: true # runs this node in server mode; remove this line to run it in agent mode
bootstrap-expect: 3 # Change this param according to the number of server-mode servers running
retry-join:
- 192.168.68.116
- 192.168.68.117
- 192.168.68.118

Press Ctrl + O and then Enter to save the content to the file, then press Ctrl + X to exit.

The setting bootstrap-expect: 3 means your cluster requires 3 server-mode servers to form the cluster. Agent-mode servers are not counted.

Optional: Dkron offers Cloud Auto Join, so you don’t have to specify each IP manually. If you deploy Dkron on EC2 instances, for example, you can add a tag to each EC2 instance and reference its key and value in retry-join like below:

retry-join: ["provider=aws tag_key=My-Dkron-Cluster tag_value=enabled region=us-east-1"]

All servers that have this tag will automatically join the Dkron cluster. The same approach applies when deploying with Kubernetes.

Back to our setup. On the dkron-1 server, restart the dkron service so that the leader starts running first:

sudo systemctl restart dkron.service

On dkron-2 and dkron-3, start the dkron service on each:

sudo systemctl start dkron.service

On any of the Dkron servers, run dkron raft list-peers to list the Raft peers in the cluster. In my case, dkron-1 is the leader now.

Access the Dkron dashboard at http://192.168.68.116:8080/ui/ (you can use the IP of any server-mode server).

Create a job to test. Go to the Jobs section -> click Create.

Fill in the fields as in the example below and press Save:

Name: test-hello-world
Schedule: @every 10s (you can also use a cron expression here, e.g. 0 0 0 * * 0; note that it has six fields, starting with seconds). See cron-spec
Executor: shell
Executor config *: {"command": "echo \"Hello from parent\""}
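Unlike classic crontab lines, the example expression 0 0 0 * * 0 has six fields. As far as I understand the cron-spec page, the first field is seconds, followed by minute, hour, day of month, month and day of week. The helper below is a hypothetical sketch that only labels the fields so you can double-check an expression before saving a job; it does not validate the values themselves.

```shell
# Label the six fields of a Dkron-style cron expression.
# Fails with a non-zero status when the field count is not six.
explain_schedule() {
  printf '%s\n' "$1" | awk '
    NF != 6 { print "expected 6 fields, got " NF > "/dev/stderr"; exit 1 }
    { printf "second=%s minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n", $1, $2, $3, $4, $5, $6 }'
}

explain_schedule "0 0 0 * * 0"
```

A five-field crontab expression such as 0 0 * * 0 is rejected by this helper, which is a quick way to catch expressions copied over from a traditional crontab.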

Reload the page and click on the new job that you’ve created.

Go to the Executions tab, and you’ll see that the job has been executed successfully.

Normally, you would visit https://dkron.io/api/, prepare a list of the jobs you want, and create them through the API rather than doing it manually. There are many options to explore, so it’s best to visit the docs page and do your due diligence.
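As a starting point, the test job above can be created through the API with something like the sketch below. It assumes the POST /v1/jobs endpoint from the API docs and a server-mode node at 192.168.68.116. DKRON_URL, job_payload and create_job are names chosen for this example, and the payload builder does no JSON escaping, so keep double quotes and backslashes out of the values.

```shell
# Assumption: any server-mode node of the cluster; override via the environment.
DKRON_URL="${DKRON_URL:-http://192.168.68.116:8080}"

# Print a job definition as JSON. Values are inserted verbatim, so they
# must not contain double quotes or backslashes in this simple sketch.
job_payload() {
  printf '{"name": "%s", "schedule": "%s", "executor": "shell", "executor_config": {"command": "%s"}}\n' \
    "$1" "$2" "$3"
}

# Send the job definition to the Dkron API (requires curl and network access).
create_job() {
  job_payload "$@" | curl --fail --silent --show-error \
    -X POST -H 'Content-Type: application/json' \
    --data @- "$DKRON_URL/v1/jobs"
}
```

For example, create_job test-hello-world "@every 10s" "echo Hello from parent" should register the same job we created in the dashboard. Calling job_payload on its own prints the JSON without contacting the cluster, which is handy for reviewing a list of jobs first.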

That’s all. Your cluster is now ready to be used.

Recommendation: regarding fault tolerance, I think at this stage we still have to implement some additional measures to make this Dkron architecture work completely, until the tool itself improves.

Implement a script to monitor each Dkron server. If CPU or memory usage reaches a threshold, say 70%, put the server into maintenance mode. By that I mean using the tag strategy: when you create a job, it is only allowed to run on nodes that match the tags specified. For example, if jobs have the tag "my_role": "web", only nodes with this tag set on them can execute those jobs. When the server reaches the threshold, the script changes the node’s tag to "my_role": "maintenance" and then reloads the dkron service (remember: reload, not restart) so that the new tag configuration takes effect. The server then stops receiving new jobs and can finish the jobs that are currently running. When CPU and memory go back to a normal state, the script changes the tag back to what it was and reloads the dkron service again, and the server starts receiving jobs again.
Run fewer jobs on server-mode servers, and run jobs mainly on agent-mode servers. This keeps the server-mode servers from being overloaded, which can disrupt leader election.
Implement a mechanism to detect whether a job is already running on any node, to prevent it from running simultaneously if your system can’t handle that. At this stage, I have noticed that when the leader node goes down and a new leader is being elected, all of the jobs running during this event are executed twice, which causes annoying issues. It’s marked as a bug in https://github.com/distribworks/dkron/issues/1569. Hope they fix it soon.
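The first recommendation, the threshold monitor, can be sketched roughly as below. This is a minimal illustration rather than the script we run in production: the function names are hypothetical, it checks memory only to stay short, and it assumes that /etc/dkron/dkron.yml already contains a tags: section with an indented my_role: web line and that the dkron service supports reload as described above.

```shell
CONFIG="${CONFIG:-/etc/dkron/dkron.yml}"   # config path used in this article
THRESHOLD="${THRESHOLD:-70}"               # percent, as in the example above

# Percentage of memory in use, computed from /proc/meminfo (Linux only).
mem_used_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d\n", (t-a)*100/t}' /proc/meminfo
}

# Decide which role tag the node should carry for a given usage percentage.
desired_role() {
  if [ "$1" -ge "$2" ]; then echo maintenance; else echo web; fi
}

# Rewrite the indented "my_role:" line in the given config file.
# Returns 1 when the file already carries the requested role.
set_role() {
  grep -Eq "^[[:space:]]+my_role:[[:space:]]*$2[[:space:]]*$" "$1" && return 1
  sed -i -E "s/^([[:space:]]+my_role:).*/\1 $2/" "$1"
}

# One monitoring pass; run it as root from a systemd timer or a loop.
check_once() {
  role="$(desired_role "$(mem_used_pct)" "$THRESHOLD")"
  if set_role "$CONFIG" "$role"; then
    systemctl reload dkron.service   # reload, not restart
  fi
}
```

Calling check_once every minute or so is enough for this sketch. A real script would also look at CPU and would probably wait for a few consecutive readings before flipping the tag, so that a short spike does not take the node out of rotation.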

Hope this article inspires you in a way or gives you an idea of what you should do to optimize your system. Thanks.

By Binh
