Managing a Distributed Cluster?

https://www.devze.com 2023-02-24 17:20 Source: Web
Suppose that one has set up a Cassandra cluster. You've got a 10 TB database distributed evenly across 10 nodes, everything runs smoothly, etc.

Suppose that you have 100 machines at your disposal, each trying to read (different) data from the Cassandra cluster. In addition, you have many jobs that constantly need to be run, each job at a different time (and obviously, each job needs to be run on a different machine).

How do you manage all these tasks/jobs? How do you distribute the tasks between the machines? How do you keep track of the jobs/machines in the process?

Are there any open-source tools (preferably with a Python client) that help with this in a Linux environment?


What you need is a Grid/HPC framework to manage your distributed infrastructure and run jobs on it.

On Unix/Linux there are two systems that might be of good use to you: the Portable Batch System (PBS) and Condor.

How do you manage all these tasks/jobs?

Both Condor and PBS have a master node that acts as the receptor of every job/task; for each job/task you can assign a priority level and discriminators. The administrator of the cluster sets up rules based on those discriminators to schedule the jobs.

how do you distribute the tasks between the machines?

Condor or PBS will do it for you; you only need to submit the job to the master node and specify its priority, inputs and outputs, etc.
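As a rough sketch of what submission looks like from Python, the snippet below builds an HTCondor submit description and hands it to the `condor_submit` command-line tool. The executable path, arguments, and priority value are illustrative assumptions, not part of any particular cluster's setup:

```python
import os
import subprocess
import tempfile

# A minimal HTCondor submit description. The script name, arguments,
# and priority here are placeholders for your own job.
submit_description = """\
executable = /usr/bin/python3
arguments  = process_shard.py --shard 7
output     = job.out
error      = job.err
log        = job.log
priority   = 10
queue
"""

def submit_job(description):
    """Write the description to a temp file and submit it on the master node."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(description)
        path = f.name
    try:
        # condor_submit prints the assigned cluster id on success.
        result = subprocess.run(["condor_submit", path],
                                capture_output=True, text=True, check=True)
        return result.stdout
    finally:
        os.unlink(path)
```

PBS submission is analogous: you would write a job script with `#PBS` directives and pass it to `qsub` the same way.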

You can periodically check whether a job has finished, subscribe to notifications via different mechanisms, or do a sort of job.wait() to block until it finishes.
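The polling approach can be sketched as a small wrapper around `condor_q`: once a job leaves the queue, `condor_q` no longer reports it, so an empty result means the job is done. The cluster id and poll interval are assumptions you would replace with your own values:

```python
import subprocess
import time

def wait_for_job(cluster_id, poll_seconds=30):
    """Block until the given HTCondor cluster id is no longer in the queue.

    This is the poll-and-sleep variant of job.wait(); condor_q prints
    nothing for a cluster id once its jobs have left the queue.
    """
    while True:
        out = subprocess.run(
            ["condor_q", str(cluster_id)],
            capture_output=True, text=True,
        ).stdout
        if str(cluster_id) not in out:
            return  # job is no longer queued or running
        time.sleep(poll_seconds)
```

In practice you would prefer the built-in notification mechanisms (e.g. the job log or e-mail notification) over busy polling, but a loop like this is the simplest thing that works from a Python script.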

how do you keep track of the jobs / machines in the process?

Both PBS and Condor have top-like commands to list jobs that are queued, running, or cancelled. They also have utilities to stop or suspend a job if the process allows snapshots.

For a large cluster, my advice is to try Condor. It has been around for ages to solve problems exactly like the one you have, and there are examples available of using Condor from Python.

Other more recent solutions to consider are:

  • Celery, a distributed task queue for Python.
  • DiscoProject, a distributed computing framework based on the MapReduce paradigm.
