Batch schedulers
I used to work at AMD and Arm, both of which used IBM Platform LSF in their HPC clusters. Hardware engineers would use LSF to submit large simulation or emulation jobs to the cluster. Now that I’m at a software company, I’ve seen that k8s is really popular for running containerized services, which is a similar use case (long-running services instead of batch jobs).
My understanding is that k8s supports batch jobs as well, so I thought I’d poke around and see what batch schedulers exist. Here are some of the ones I could find:
- Slurm
- LSF
- TORQUE
- SGE
- LoadLeveler
- Kubernetes
- Nomad
This page talks about some of the differences between slurm/lsf/k8s:
There’s even a rosetta stone for going from lsf and other batch schedulers to slurm:
Slurm seems to be a popular alternative to LSF. Maybe one day I’ll figure out what it takes to run a Slurm cluster.
One thing I remember is that it was nice to submit batch jobs to the cluster without having to worry about containers and such. I normally worked on an interactive host, but the environment was pretty much the same as on the cluster machines (same NFS mounts, tools, etc.). I remember some pain around versioning, but it was mostly easy.
This is sorta an advantage over something like k8s, or even Nomad, where you have to write a job manifest before submitting a job. With LSF you could even do interactive sessions with something like bsub -I -q interactive -n 4,10 bash.
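To make the contrast concrete, here is a rough sketch of what each side asks of you for the same job. It only builds the bsub argument list and a minimal Kubernetes Job manifest; it doesn't submit anything. The queue name, the image name, and run_sim.sh are all made up for illustration, and the manifest is the bare minimum I believe a batch/v1 Job needs, not something taken from a real cluster.

```python
import json


def bsub_argv(command, queue="normal", min_slots=1, max_slots=None, interactive=False):
    """Build the argv for an LSF bsub submission (nothing is executed)."""
    argv = ["bsub"]
    if interactive:
        argv.append("-I")  # -I: interactive job, as in the bsub example above
    argv += ["-q", queue]  # queue names are site-specific; "normal" is an assumption
    slots = str(min_slots) if max_slots is None else f"{min_slots},{max_slots}"
    argv += ["-n", slots]  # -n min[,max]: number of job slots requested
    argv += list(command)
    return argv


def k8s_job_manifest(name, image, command, cpus=1):
    """Build a minimal batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,  # hypothetical image holding the tools
                            "command": list(command),
                            "resources": {"requests": {"cpu": str(cpus)}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    cmd = ["./run_sim.sh", "--seed", "1"]  # hypothetical simulation script

    # LSF: the whole submission fits on one line.
    print(" ".join(bsub_argv(cmd, queue="normal", min_slots=4, max_slots=10)))

    # k8s: the same job needs a manifest, plus an image that contains the tools.
    manifest = k8s_job_manifest("sim-1", "example.com/sim-tools:latest", cmd, cpus=4)
    print(json.dumps(manifest, indent=2))
```

The JSON it prints should be something you can feed to kubectl apply -f - since kubectl accepts JSON as well as YAML, though I haven't tried this against a real cluster. The bigger difference isn't really the line count: the LSF job inherits the environment of the host you submit from, while the k8s job needs an image with the tools already baked in.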
Appendix
I started thinking about this a little more after running into the following post on Hacker News: