This is not my normal fare. If you’re not a computer geek, you may find the following paragraphs a little technical and quite possibly uninteresting because of that. I’d encourage you to read on, though: what you should come away with is a new way to look at the problems you face and a strategy for dealing with them that will bring you much personal satisfaction, or at least cause you to pull as little hair out of your head as possible.
Start here: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion
There is never anything really new in the world of computing. All we have are problems that have been solved before and new flavors of those same problems and solutions. What really changes is that people forget that we already solved all of the really difficult problems many years ago. We had to, because they were new problems when computing was something fresh in industry. Now that computing is pervasive, what we have is a repeating cycle of identifying problems to be solved and then either figuring out how they’ve been solved before or ignoring the past (at our peril) and creating entirely new solutions which are, in fact, just different colors of the same solutions we came up with before… if we’re lucky. That amounts to a statement like, “Well, we have a really complex problem, so here’s a stunningly complicated solution.”
I, for one, detest the idea that complex problems need newly invented ultra-complex solutions simply because the problem appears superficially (or actually is) complex or new. There is no problem so complicated that a very simple solution cannot be identified if you think about the problem the right way. There are insanely few problems which are in reality the least bit new. At best, they’re just the same problem in a new shape or color, so to speak. In a moment, you’ll be introduced to my preferred method of solving problems, which always yields fairly simple solutions. It does that because it works like the thought process behind early Macintosh computers. Early Macs were built, seemingly, with a notion something like, “Give them so little memory and processing power that they won’t be able to do anything anyway.” I must at this point give a wink and a nod to Douglas Adams, who originally made that exact statement and from whom I’ve borrowed it. There’s a certain amount of sarcasm in that, but hang with me and you’ll see my point.
What I mean by all of that is that simplifying the problem comes down to seeing where the actual fundamental problem is, not where the superficial problem is. Here the fundamental problem is Mac users (of which I am one) wanting to do very intensive computational tasks on end-user-grade hardware. The superficial problem is that Macs are the preferred platform for those doing computationally intensive tasks, such as video editing, because they’re user-friendly, as opposed to Windows, which is user-unfriendly, and UNIX/Linux, which is downright user-hostile. UNIX/Linux server-grade hardware would be the right way to do these computationally intensive tasks, but such systems suck for humans to use. So Mac users are the fundamental problem. They picked the wrong tool. Apple responded by making sure that the user would realize that and would eventually put those workloads onto higher-end hardware. Now we have video editors doing very small bits of editing on very small bits of video on their Macs and then sending many such snippets to a larger compute cluster for rendering and final processing to come out with a whole “thing”.
Those familiar with “Grid Computing”, “High Performance Compute” and other flavors of the topic know that what you’re really dealing with is a system that understands bounded resource blocks and workload. What it amounts to is that you have a bucket of resource (CPU/Memory/Disk/Network) capacity and a bucket of workloads, each of which has a discrete moment of being started and will run to “completion”. You want to dispatch computation jobs to be executed, allow them to run to completion and then report on the status and the resources taken to accomplish that. What you don’t want to do is worry about uneven load profiles, manually intervene when jobs fail or systems fall over, or figure out which host to execute a job on.
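To make the two buckets concrete, here is a minimal sketch in Python. It is a toy, not how LSF or Kubernetes actually schedule: the names `Node`, `Job`, `dispatch` and `complete` are made up for illustration, and the placement rule is plain first-fit on CPU and memory only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One slice of the resource bucket (hypothetical model)."""
    name: str
    cpu_free: float   # cores still available
    mem_free: float   # GiB still available
    running: list = field(default_factory=list)


@dataclass
class Job:
    """One unit of workload that starts once and runs to completion."""
    name: str
    cpu: float
    mem: float


def dispatch(jobs, nodes):
    """First-fit placement. Returns (placements, names of jobs left queued)."""
    placements, queued = {}, []
    for job in jobs:
        for node in nodes:
            if node.cpu_free >= job.cpu and node.mem_free >= job.mem:
                # Claim the resources on the first node with room.
                node.cpu_free -= job.cpu
                node.mem_free -= job.mem
                node.running.append(job.name)
                placements[job.name] = node.name
                break
        else:
            # No node had room, so the job waits in the queue.
            queued.append(job.name)
    return placements, queued


def complete(job, node):
    """Hand a finished job's resources back to the node's bucket."""
    node.running.remove(job.name)
    node.cpu_free += job.cpu
    node.mem_free += job.mem
```

The point of the toy is that nobody picks a host by hand: a job that doesn’t fit simply waits, and it gets placed the next time `dispatch` runs after some other job has completed and returned its share of the bucket.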
Systems like LSF/OpenLava were created back in a day when hardware varied hugely in horsepower and there were lots of proprietary hardware platforms. Those factors were joined by others: making sure that software licenses, which were few in number, were always in use; fair-share allocation of computational horsepower and software licenses; and organizationally induced prioritization of this project versus that project.
Today, hardware performance is orders of magnitude better, and we’re worried less about computational horsepower than about footprint cost efficiency. Back in the old days we’d run on-premises clusters of large numbers of very expensive servers in very expensive data centers. Nowadays we have Cloud Service Providers, which can provide enormous amounts of extra computational capacity on demand; it can be spun up only for as long as it’s needed and then spun down immediately afterward to minimize run costs. We’ve eliminated the sunk portion of data center run cost from the equation.
As we all know, most of the really great inventions in history were made by eliminating something from a prior invention: A magnificent martini is made that way by the elimination, or at least minimization, of the Martini (vermouth) from the equation. In the same way, eliminating the concept of owning actual servers and putting the load in the cloud enables organizations to radically alter the cost associated with operating high performance computation grids.
Kubernetes has the ability to dispatch arbitrary code execution to nodes. The cluster is aware of which nodes are part of it and how much load they’re under, so it’s relatively easy to code up a little Python/Ruby/C/Whatever to interface with a SQL or NoSQL database, build a list of jobs needing dispatch and get them dispatched. When a queue of jobs builds up for lack of free resources, the code can, with very simple boundary configurations, elect either to launch new execution node instances on the CSP (Cloud Service Provider) infrastructure of choice or to persist with the queue at some non-zero depth.
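Here is roughly what that “little Python” could look like. Treat it as a sketch under assumptions: the `jobs` table and its columns, the `grid-job-` naming and the three helper functions are all invented for illustration, and SQLite stands in for whatever SQL or NoSQL store you actually use. The one non-invented part is the shape of the manifest, which follows the standard `batch/v1` Job layout described at the link above.

```python
import shlex
import sqlite3

# Hypothetical queue table; SQLite is only a stand-in for your real database.
SCHEMA = """
CREATE TABLE jobs (
    id      INTEGER PRIMARY KEY,
    image   TEXT NOT NULL,
    command TEXT NOT NULL,
    state   TEXT NOT NULL DEFAULT 'pending'
)
"""


def pending_jobs(db):
    """Return (id, image, command) rows for jobs not yet dispatched."""
    return db.execute(
        "SELECT id, image, command FROM jobs WHERE state = 'pending' ORDER BY id"
    ).fetchall()


def job_manifest(job_id, image, command):
    """Build a batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"grid-job-{job_id}"},
        "spec": {
            "backoffLimit": 3,  # give up after three failed retries
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "command": shlex.split(command),
                        }
                    ],
                }
            },
        },
    }


def nodes_to_add(queue_depth, max_queue_depth, jobs_per_node, max_new_nodes):
    """Boundary rule: how many nodes to launch for the current backlog."""
    if queue_depth <= max_queue_depth:
        return 0  # tolerate the queue, launch nothing
    excess = queue_depth - max_queue_depth
    needed = -(-excess // jobs_per_node)  # ceiling division
    return min(needed, max_new_nodes)
```

The dict that `job_manifest` returns can be dumped to JSON and handed to `kubectl apply -f`, or passed as the body to `BatchV1Api.create_namespaced_job` in the official Kubernetes Python client. Actually launching nodes is specific to your CSP, so `nodes_to_add` deliberately stops at telling you how many.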
The efficiency to be gained is not simply in the fact that the company no longer has to own large numbers of servers and pay for their continuous operation whether or not they are fully utilized. A huge gain is in the simple fact that CSPs tend toward pricing based on utilization of network bandwidth and on data ingress/egress from their assorted block or object storage systems, but not on in-cloud usage of those very same storage sub-systems. The actual cost of the CSP-provided CPU cycles, memory utilization and in-cloud storage access is heavily subsidized by out-of-cloud network/storage IO charges. High performance compute grids are almost universally highly intense in their utilization of CPU and memory and are notoriously light in their need to import/export large amounts of data from the computational environment.
The next big change we see is that jobs are, in large part, not actually arbitrary. Many jobs are regularized. That is, they are routine and come about as a byproduct of the development process. When you complete a piece of code, it needs to be unit tested and regression tested. When you design an ASIC, it generates follow-on load which is predictable. Many organizations rely on grid computing to run routine, regular reports, analytics and business processes. These are things that can be statically defined either in code or in databases. It’s a standard workload. Everything else is arbitrary workload.
So what we have here is an incipient change in how HPC gets done. The hard part had always been dispatching jobs. Now the hard part is architectural. Orchestrating job dispatch has been made trivially easy. Discerning what is a static job versus what is an arbitrary job, and automating the Kubernetes configuration accordingly, is the current challenge. Even this is trivially easy to accomplish, because it’s easy to determine the static versus arbitrary nature of any particular job.
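One way to sketch that split, assuming (and this is purely my assumption for the example) that a job definition carrying a cron-style `schedule` field is static and anything without one is arbitrary. The function names and dict keys are hypothetical; the manifest shape is the standard `batch/v1` CronJob.

```python
def is_static(job_def):
    """Assumed rule: a job with a schedule is static, anything else is arbitrary."""
    return bool(job_def.get("schedule"))


def cronjob_manifest(job_def):
    """Build a batch/v1 CronJob manifest for a static (scheduled) job."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": job_def["name"]},
        "spec": {
            "schedule": job_def["schedule"],  # standard five-field cron syntax
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": job_def["name"],
                                    "image": job_def["image"],
                                    "command": list(job_def["command"]),
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


def partition(job_defs):
    """Split definitions into (CronJob manifests, arbitrary job definitions)."""
    static = [cronjob_manifest(j) for j in job_defs if is_static(j)]
    arbitrary = [j for j in job_defs if not is_static(j)]
    return static, arbitrary
```

Static jobs are declared to the cluster once and Kubernetes runs them on schedule from then on; the arbitrary remainder goes through whatever one-off dispatch path you’ve built.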
I’m not saying that there’s no effort in creating the necessary bits of code and building the necessary back-end systems to accomplish these goals. What I’m saying is that we no longer need to pay IBM’s (or whoever’s) extortionist license fees for LSF (or whatever), and we no longer need to maintain extensive farms of servers or difficult-to-manage, highly specialized grid computing engines which require expensive-as-hell HPC experts like myself. All you need now is a basic bitch sysadmin who knows extremely common and popular technologies like NoSQL/SQL, Python/Perl/Ruby, Linux, Kubernetes, Docker, etc. There are maybe a few thousand people in the USA who really know how to make IBM’s LSF grid computing software work and how to troubleshoot it. There are probably a million or so Linux sysadmins (also like myself) who know NoSQL/SQL, Python/Perl/Ruby, Linux, Kubernetes, Docker, etc., and even if they don’t know one or more of those things, they’re all easy to learn if you’re already a Linux sysadmin. They’re easy to learn for us because they were bloody well meant to be. We’re a lazy bunch, which is why we automate everything we can figure out how to, so if we’re to use something it has to be easy to learn, easy to use and easy to automate, or we won’t do it.
So, now that I’ve given you this off-book use case for Kubernetes, get out and use it. Yes, it’ll take a few weeks longer to implement than LSF would, but in the end it’ll cost you millions of dollars less to maintain, and you won’t have to pay IBM’s (or anyone else’s) heart-thumpingly exorbitant license fees, which are deliberately structured to extract every last possible cent from your organization.
Go (to heck Big) Blue!