Dashboard

The Dashboard application gives a bird's-eye view of the currently running and queued jobs, along with the overall status of the JobServer scheduling engine and related resources such as Queues, distributed Agent nodes, and Mesos cluster nodes. It shows you at a glance which Partitions, Agents, and Mesos slave nodes are being used, and lets you drill down to see which jobs are running and where they are running.

From this tool you can create, edit, and view the Partitions available in JobServer. Partitions are a way of managing the resources available to JobServer. A Partition defines a set of resources, including the maximum number of jobs that can run at any one time. When a job is ready to run, it is placed into the Partition's corresponding Queue. The job waits in the Queue until the Partition has free resources available and a thread is ready to begin processing it. Jobs are moved from the Queue into a running state based on a priority scheme that uses first-in-first-out ordering within each priority level: jobs with higher priority are queued and run ahead of jobs with lower priority.
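
To make this ordering concrete, here is a minimal, illustrative Java sketch (not JobServer's actual implementation; the QueuedJob class and its fields are hypothetical) of a queue that runs higher-priority jobs first and falls back to first-in-first-out ordering within the same priority:

    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical illustration of the queue ordering described above:
    // higher priority runs first; equal priorities run first-in-first-out.
    class QueuedJob implements Comparable<QueuedJob> {
        private static final AtomicLong SEQ = new AtomicLong();

        final String name;
        final int priority;                      // higher value = runs sooner
        final long seq = SEQ.getAndIncrement();  // arrival order, the FIFO tiebreaker

        QueuedJob(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }

        @Override
        public int compareTo(QueuedJob other) {
            int byPriority = Integer.compare(other.priority, this.priority);
            return byPriority != 0 ? byPriority : Long.compare(this.seq, other.seq);
        }
    }

    // Jobs wait here until the Partition has a free slot:
    // PriorityBlockingQueue<QueuedJob> queue = new PriorityBlockingQueue<>();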

Please note, if you have view-only permissions to this module, some of these advanced configuration functions will not be available to you. Also, some features, such as Agents, are only available to JobServer Professional users (not available in JobServer Standard). Mesos-related features are available in both the JobServer Pro and Standard editions, but Mesos is only supported for JobServer server installations running on Linux. Here are the major features available from this tool:

Dashboard (Monitoring Partitions and Clusters)

Partition and Cluster Details

Queued Jobs

Edit Partition

Managing Agents

Add/Remove Partitions

Advanced Options


Dashboard (Monitoring Partitions and Clusters)

Scheduling Engine Status
The Dashboard shows the high-level status of all available Partitions and clusters, along with the current state of the JobServer Scheduling Engine (for example, whether it is actively running or sitting in the Idle state).

Show Cluster Details
If this checkbox option is selected, you can see details of which Agents and Mesos slave nodes are being used and their association with particular Partitions. Partitions can use three different cluster modes: Local, Agent, and Mesos. Local cluster mode (basically no clustering) runs jobs on the same local node as the job scheduling engine. Agent clustering lets you run jobs on one or more distributed Agents, so jobs can execute on remote machines. Using Agents, a Partition can increase its job processing capacity by spreading job processing across a defined set of nodes with a defined amount of capacity. By selecting Mesos as a Partition's cluster type, you can launch jobs across a Mesos cluster and let Mesos dynamically manage resource allocation.

Partition and Cluster Details

The Dashboard allows the user to view the current status of all available Partitions and their processing state. It shows the Partition's name, the current number of running jobs, the maximum number of jobs allowed, whether the Partition's job processing is enabled or disabled, and how many jobs are in the Queue waiting to be run.

Each row in the table represents an available Partition. If you show the Cluster Details, you will also see each Agent node or Mesos slave associated with the Partition and the state of each node/slave relative to that Partition. Each row contains the following columns/cells and related information:

Partition Name
Click on this column/cell for a Partition and it will take you to the "Edit Partition" screen, where you can edit the Partition's various properties and resources. There you can edit the Partition's job processing status (enabled or disabled) and set the maximum number of running jobs allowed, along with other features such as changing the Partition's Clustering Type.

If you are using the JobServer Pro edition, the Partition edit screen allows you to add and remove Agents for the Partition, or to associate the Partition with a Mesos cluster and distribute jobs across the available nodes in that cluster. Agents are a way to add processing capacity to a Partition by distributing load across a static set of Agent nodes on multiple machines. Mesos, on the other hand, lets you distribute load on a dynamic Mesos cluster and lets Mesos manage the resources dynamically. When using Agents, you can enable/disable multiple Agent nodes for a Partition. A disabled Agent node will not receive new jobs to run. You can also cap each Agent so that it does not exceed a certain number of concurrent jobs.

When using a Mesos cluster with your Partition, you can set the maximum number of jobs allowed to run on the Mesos cluster. This puts a hard limit on the number of jobs that can run on Mesos; however, it is up to the Mesos cluster to "offer" resources and slaves to the Partition. When using Mesos, you can set the maximum CPU capacity and memory allocation that a job will request in order to run on a slave. When Mesos offers up resources to your Partition, the Partition will try to match the job's CPU/memory requests to a Mesos slave node that meets those requirements. Note that if the Mesos cluster does not have enough slave/node capacity to service your Partition's job requests, jobs will sit in the Queue waiting for Mesos cluster resources to become available.
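
As a simplified illustration of this matching rule (the class and field names below are hypothetical, not JobServer's or Mesos's API), a job can be placed on a slave only if the offer covers both its CPU and memory requests:

    // Illustrative sketch of the resource-matching rule described above.
    class ResourceRequest {
        final double cpus;      // maximum CPU capacity the job requests
        final int memoryMb;     // maximum memory allocation the job requests
        ResourceRequest(double cpus, int memoryMb) {
            this.cpus = cpus;
            this.memoryMb = memoryMb;
        }
    }

    class SlaveOffer {
        final String slaveId;
        final double offeredCpus;
        final int offeredMemoryMb;
        SlaveOffer(String slaveId, double offeredCpus, int offeredMemoryMb) {
            this.slaveId = slaveId;
            this.offeredCpus = offeredCpus;
            this.offeredMemoryMb = offeredMemoryMb;
        }
    }

    final class OfferMatcher {
        // A job runs on the offered slave only if both requests are satisfied;
        // otherwise it stays in the Queue waiting for a suitable offer.
        static boolean satisfies(SlaveOffer offer, ResourceRequest job) {
            return offer.offeredCpus >= job.cpus
                && offer.offeredMemoryMb >= job.memoryMb;
        }
    }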

Running Jobs
This column/cell shows the number of currently running jobs. If you click on the link, you will see a popup that shows any currently running jobs. For example, a value of "0 / X" means that JobServer is not running or there are no actively running jobs. If JobServer is running, it will show something like "2 / 6". The first number is the number of jobs actively running in the Partition. The second number is the maximum number of jobs allowed to run on the Partition.

With the JobServer Professional edition, you will see something slightly different that looks like "2 / 6 (10)". The last number in parentheses, "(10)", is the hard upper limit on the number of jobs permitted by the Partition. The user sets this maximum capacity of the Partition. The second number, "6", is the effective maximum number of jobs allowed and is a "calculated" size. It is calculated because, when you are using remote Agents with a Partition, the actual job processing capacity may vary as Agents become available or unavailable. For example, an Agent host computer may become unavailable for many reasons. If this happens, the Partition will detect it and stop sending jobs to that Agent until the Agent is available again.

So the second number, "6" in this example, is the actual effective size and can differ from the user-entered maximum because remote Agents may not always be available for job processing. The effective maximum size is the sum of all Agent capacity plus the "primary/secondary" job processing capacity, and it indicates the true upper limit on the number of jobs allowed to run on the Partition at any given point in time. This effective maximum size is dynamic and changes as Agent status changes (Agents go offline and come back online), but it can never exceed the hard maximum capacity shown in parentheses. Note that you will only see the Partition's maximum capacity limit (the number in parentheses) if it differs from the effective maximum size. If they are equal, you will just see something like "3 / 6".
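
The arithmetic behind a display like "2 / 6 (10)" can be sketched as follows (illustrative Java only; the Agent interface and method names are hypothetical, not JobServer's API):

    import java.util.List;

    interface Agent {
        boolean isOnline();
        int maxConcurrentJobs();
    }

    final class PartitionCapacity {
        // Effective max = primary/secondary capacity plus the capacity of all
        // currently online Agents, never exceeding the user-set hard cap.
        static int effectiveMax(int hardMaxCapacity,   // the "(10)" in the example
                                int primaryCapacity,   // local primary/secondary slots
                                List<Agent> agents) {
            int sum = primaryCapacity;
            for (Agent a : agents) {
                if (a.isOnline()) {
                    sum += a.maxConcurrentJobs();      // offline Agents contribute nothing
                }
            }
            return Math.min(sum, hardMaxCapacity);     // the "6" in the example
        }
    }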

For Mesos Partitions, the Running Jobs field behaves a little differently. The effective maximum size of the Partition is dynamically determined by Mesos, so it is not reflected directly in this label/link. For example, with a Mesos cluster you might see a value of "2 / 7". The "7" in this case is not a true capacity setting, but only a hard limit on the maximum number of jobs allowed to run on the Mesos cluster. The actual capacity is up to Mesos to determine: Mesos will offer your Partition as many resources as it can that match the requests of the jobs running in the Partition, without exceeding the hard upper limit, which is "7" in this example. The "2" indicates the number of jobs currently running on the Mesos cluster. If you click on the link, you will get a popup that shows all the jobs currently running on the Mesos cluster and which slave nodes they are running on.

If there are running jobs, the first number will show a count greater than zero, and you can see the details of which jobs are running by clicking on the cell for the given Partition. This launches a popup that shows which jobs are running and gives some basic information on when and how the jobs were started, etc. You also have the option of requesting that a running job be killed/terminated from this tool. The job will be terminated by JobServer if and when it reaches a safe check point, so it may not respond immediately to the kill request. It is not guaranteed that the job will actually be killed if a safe check point is not reached before normal completion of the job. However, if the job was run in its own process or JVM, it will be killed immediately.
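
The "safe check point" behavior resembles a common cooperative-cancellation pattern, sketched below in illustrative Java (not JobServer's internal code; all names are hypothetical):

    // A job only honors a kill request when it reaches a safe checkpoint,
    // so it may not stop immediately, and may finish normally if no
    // checkpoint is reached before completion.
    class CheckpointedJob implements Runnable {
        private volatile boolean killRequested = false;

        void requestKill() {            // invoked when you request a kill from the popup
            killRequested = true;
        }

        @Override
        public void run() {
            while (hasMoreWork()) {
                doUnitOfWork();
                if (killRequested) {    // the safe checkpoint
                    cleanUp();
                    return;
                }
            }
        }

        private boolean hasMoreWork() { return false; }  // placeholders for real work
        private void doUnitOfWork() {}
        private void cleanUp() {}
    }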

If you are using Agents, you will also see the "Agents" column/row for each Partition. From this you can see how many jobs from this Partition are running on a particular Agent. If you click on the link, a popup shows you the details of the running jobs for that Agent/Partition combination. From there you can also kill running jobs if you like.

If you are using Mesos, you will also see the "Slaves" column/row for each Partition. From this you can see how many jobs from this Partition are running on a particular slave node. If you click on the link for the corresponding slave node, a popup shows you the details of the running jobs for that slave/Partition combination. From there you can also kill running jobs if you like.

Job Processing Status
This column shows whether the Scheduler associated with this Partition is capable of scheduling jobs (enabled or disabled). When disabled, jobs that are ready to run will not run until the Scheduler is enabled again. This lets you enable and disable job scheduling on a Partition-by-Partition basis. For the Agent rows, this column indicates whether the Agent is allowed to accept jobs for this Partition.

Queued Jobs

This column shows the number of queued jobs associated with the Partition. To view the details of the queued jobs, click on the highlighted link in the cell. This brings up a popup window that shows the specific jobs in the Queue for the given Partition (as well as for all Partitions). You also have the option of deleting any queued jobs and editing their ordering in the Queue.

Edit Partition

The Edit Partition screen lets you edit the major attributes and resources associated with a Partition. If you are using the JobServer Standard edition, you can choose from two Cluster Types: Local and Mesos. If you are using the JobServer Pro edition, you can choose from three Cluster Types: Local, Mesos, and Distributed Agents.

Max Jobs Allowed
Controls the maximum number of jobs that the Partition can run concurrently at any one time. Changing this value increases or decreases that maximum. Note that for the Mesos Cluster Type, this value only sets a hard upper limit and does not actually define the true capacity of the Partition.

Job Processing Status
If a Partition's job processing is disabled, jobs that are in the Queue will remain in the Queue. Even when a Partition is disabled, jobs can still get scheduled and placed into the Queue; however, they will wait in the Queue until the Partition is enabled again. If a Partition is disabled while jobs are running in it, the jobs that are already running will continue to run until completion, but newly scheduled jobs will remain in the Queue until the Partition is enabled again.

Agent Cluster Type
If you are running the JobServer Professional edition, you have additional options. With the Pro edition, you can assign any number of available Agents to a Partition. This allows a Partition to run jobs on multiple remote Agent servers, along with running jobs locally on the "primary/secondary" Partition machine. By default you will always have a "primary/secondary" resource to run jobs on; using Agents is optional. To use Agents you must enable them by selecting the "Use remote agents" checkbox. If this feature is not available or not selected, jobs can only run on the main JobServer processing engine. You can enable/disable each Agent/Partition combination and set its maximum concurrent jobs. Note that you can allocate more combined Agent capacity than you can actually use; the effective maximum is still capped by the Partition-level "Max Concurrent Jobs" value. For redundancy, it is recommended to allocate more Agent job processing capacity than you need; that way, if a single Agent goes down, the Partition has backup capacity on other Agents. This is just one example of a strategy you can use and is not required.

Job JVM Configuration Options
In a Partition with the Local or Agent Cluster Type, you have the option to configure jobs to run in their own dedicated and isolated JVM. Jobs in the Partition can either run in an isolated JVM (each job runs in its own dedicated JVM, separate from the main JobServer process) or run in the same JVM as the JobServer process. Isolating jobs in their own JVM is useful when you need to limit the chance of a misbehaving job negatively impacting the rest of the shared system. Jobs running in their own isolated JVM are also easier to kill and destroy. Keep in mind, though, that if you have a large number of jobs running concurrently, having each one run in its own JVM can consume a lot of system memory. If you have enough memory and related database resources, this will not be a problem.

You also have the option of letting the individual job designer decide where jobs run (shared or isolated JVM) by leaving the decision to them. The job designer can create the job and decide where it should run, or you can force all jobs in the Partition to use only one of the possible JVM options (isolated or shared). If you choose the isolated JVM option, you can also limit the maximum memory that the job and JVM can use, and you can pass additional custom JVM options to the JVM. If you set the maximum JVM memory at the Partition level, the job designer will not be able to increase the maximum JVM memory at the per-job level. Leaving the maximum JVM memory blank at the Partition level allows the user editing the job to set any JVM maximum memory setting they wish.
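
Under the hood, launching a job in an isolated JVM amounts to spawning a separate java process. The sketch below is illustrative only (the JobRunner bootstrap class is hypothetical; only the standard java flags are real):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class IsolatedJvmLauncher {
        // Spawn a dedicated JVM for one job, applying the Partition-level
        // max memory and any custom JVM options described above.
        static Process launch(String jobId, Integer maxHeapMb, String customJvmOpts)
                throws IOException {
            List<String> cmd = new ArrayList<>();
            cmd.add("java");
            if (maxHeapMb != null) {
                cmd.add("-Xmx" + maxHeapMb + "m");          // Partition-level memory cap
            }
            if (customJvmOpts != null && !customJvmOpts.trim().isEmpty()) {
                cmd.addAll(Arrays.asList(customJvmOpts.trim().split("\\s+")));
            }
            cmd.add("com.example.JobRunner");               // hypothetical bootstrap class
            cmd.add(jobId);
            return new ProcessBuilder(cmd).inheritIO().start();
        }
    }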

Mesos Cluster Type
When you select Mesos as your Cluster Type, you will have several options to set for how jobs will run in the Partition and within Mesos, such as the maximum number of jobs allowed on the Mesos cluster and the maximum CPU capacity and memory allocation a job will request (see the descriptions above).

Alert Emails
Alerts are sent to the email addresses listed when a job that is part of the Partition encounters any kind of unexpected failure. This allows a person or group of people to be notified if anything exceptional goes wrong with any job within the Partition.

Job alerts notify users when a job failure occurs during processing. These are typically failures associated with the Job/Tasklet throwing an unexpected exception that may result in the Tasklet or job failing to continue processing. For example, an uncaught out-of-memory error or SQL exception would constitute such a situation. A Job/Tasklet throwing TaskletFailureException will also trigger an alert. Note that errors and warnings logged via Log4J or the Java Logging API do not trigger an email alert. The email alerts use a cascading mechanism: an alert is first sent to the email addresses listed at the system level, then to the email addresses defined for the job's Partition, then to the job's Group alert addresses, and finally to the alert email addresses defined for the specific job. With this design you can set up a hierarchy of email alerts. For example, you can arrange to receive emails only when a specific job fails, or when any job in a particular Partition fails, etc.
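
The cascading order can be pictured like this (an illustrative sketch; the types and method names are hypothetical, not JobServer's API):

    import java.util.List;

    // Cascade described above: system level, then the job's Partition,
    // then the job's Group, and finally the job's own alert addresses.
    final class AlertCascade {
        static void onJobFailure(String partition, String group, String job,
                                 AlertDirectory dir) {
            send(dir.systemEmails(), job);
            send(dir.partitionEmails(partition), job);
            send(dir.groupEmails(group), job);
            send(dir.jobEmails(job), job);
        }

        static void send(List<String> recipients, String failedJob) {
            for (String email : recipients) {
                System.out.println("alert to " + email + ": job failed: " + failedJob);
            }
        }
    }

    interface AlertDirectory {    // hypothetical lookup of the configured addresses
        List<String> systemEmails();
        List<String> partitionEmails(String partition);
        List<String> groupEmails(String group);
        List<String> jobEmails(String job);
    }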

A Tasklet may also programmatically trigger alerts by using the SOAFaces API. Refer to the API method TaskletOutputContext.sendAlert().
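
For example, a Tasklet might raise an alert from inside its own logic. The fragment below is only a sketch: the single-String form assumed for sendAlert() is an assumption, so consult the SOAFaces Javadoc for the actual method signature and the surrounding Tasklet lifecycle:

    // Hypothetical usage; verify sendAlert()'s real signature in the
    // TaskletOutputContext Javadoc before relying on this form.
    void reportStaleFeed(TaskletOutputContext ctx) {
        ctx.sendAlert("Input data feed is stale; continuing with cached data.");
    }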

Managing Agents

You can add any number of Agents to a Partition. This allows you to distribute job processing capacity across remote Agent servers. You have the option to set the maximum number of jobs allowed to run per Agent/Partition combination, and you can enable or disable each Agent for a particular Partition. By default there will always be a "primary/secondary" Agent that runs jobs on the local JobServer host machine. If you do not want to use Agents, or do not have access to remote Agents, you do not need to concern yourself with this feature.

An Agent can't be deleted while it is running jobs or while it is still enabled. An Agent must be disabled, and must have no jobs running on it, before you can disassociate it from a Partition.

Add/Remove Partitions

Partitions can be added and removed through this screen. The user can create as many Partitions as their environment and available resources allow. An existing Partition can be deleted only when JobServer is not running (in the Idle state) and there are no jobs assigned to the Partition. The "RootPartition", however, can't be deleted, as it is the default Partition.

Advanced Options

The advanced options screen lets you configure some of the more advanced scalability options available in JobServer. You will only be able to access these options if you are using JobServer Professional. Note that some of these features require JobServer to be in the "Idle" state for the settings to take effect. This means performing a "jsshutdown" followed by a "jsstartup".

If you are using JobServer Professional, you have additional optional settings to configure. JobServer Professional has advanced features that allow an administrator to configure high-end scalability settings. By default, a single Scheduler resource is shared among all the Partitions. JobServer Professional, however, can be configured such that each Partition has its own private Scheduler. If your environment has a large number of jobs that run concurrently (e.g. thousands of jobs), this feature can extend the scalability of JobServer and allows for more fine-grained control over a Partition's configuration. Go to the "Advanced Options" screen to configure these options.

Scheduler Scan Threads
You can set the number of scan threads that the main Scheduler uses to find and run jobs that are ready to be scheduled. Increasing this number can improve the Scheduler's response times, especially when you have a large number of jobs that run at around the same time. Note that the more scan threads you use, the more system resources will be consumed. On a single-processor system, setting the scan threads above "2" may not buy you anything; however, on SMP and multi-core hardware it can significantly improve scheduling response times and throughput. Under normal conditions you do not need to concern yourself with this feature.

Do not edit this Scan Threads property unless you know what you are doing. This field controls the number of scheduler scan threads assigned to a Partition's Scheduler. Increasing the scan threads can improve the Scheduler's response times, especially when there are a large number of concurrently scheduled jobs. Note that this does NOT increase the number of concurrently running jobs allowed; it only controls the number of internal threads that try to put jobs into a ready-to-run state. Consult the JobServer Support Team for questions about this advanced feature.
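
Conceptually, the scan threads behave like a small worker pool that repeatedly looks for jobs that are due and moves them into a ready-to-run state. The sketch below is illustrative only, not JobServer's internals; findDueJob and markReadyToRun are hypothetical placeholders:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class SchedulerScanPool {
        // More scan threads means more jobs can be made ready in parallel,
        // at the cost of extra CPU and database resources.
        static ExecutorService start(int scanThreadCount) {
            ExecutorService pool = Executors.newFixedThreadPool(scanThreadCount);
            for (int i = 0; i < scanThreadCount; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        String dueJob = findDueJob();        // query the schedule
                        if (dueJob != null) {
                            markReadyToRun(dueJob);          // hand off to the Partition
                        } else {
                            try { Thread.sleep(250); }       // back off when nothing is due
                            catch (InterruptedException e) { return; }
                        }
                    }
                });
            }
            return pool;
        }

        static String findDueJob() { return null; }          // placeholder
        static void markReadyToRun(String job) {}            // placeholder
    }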

Scheduler Thread Per Partition
If you have a large number of jobs and Partitions, turning on this setting gives each Partition its own dedicated Scheduler thread. This also allows each Partition to be controlled and managed individually, with each Partition/Scheduler having its own dedicated set of scan threads, and each Scheduler can be enabled/disabled separately from other Scheduler/Partition pairs.

Database Resources Per Scan Thread
This feature essentially gives each scan thread its own database connection with which to talk to the database. It can improve Scheduler concurrency but can also consume significant database resources, especially if you have a large number of Scheduler scan threads and Partitions. With this feature turned off, all scan threads share the database connection of their parent Scheduler/Partition; this is the default. Consult the JobServer Support Team for questions about this feature.