Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!crdgw1!rpi!batcomputer!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: roll your own supercomputer
Message-ID: <43872@cornell.UUCP>
Date: 28 Jul 90 19:18:28 GMT
Sender: nobody@cornell.UUCP
Reply-To: carey@cs.wisc.edu (Michael Carey)
Distribution: comp
Organization: Dept. of Computer Science, U. Wisconsin
Lines: 39

> From: carey@cs.wisc.edu (Michael Carey)
> To: ken@gvax.cs.cornell.edu
> Subject: RE: roll your own supercomputer
> Status: R
> Ken,
> I saw your note on the net. Just FYI, there is actually a facility of
> that nature - not ISIS-based, but quite effective - available (at no charge
> for universities, I'm pretty sure) from the University of Wisconsin. It's
> called Condor, and currently manages jobs for about 170 workstations (from
> Sun, DEC, IBM, and HP) here. Folks in our dept who do simulation studies
> rely heavily on it as a way to get lots of CPU days in a short time; folks
> who do things like explore large search spaces (e.g., to understand what the
> search space looks like for optimizing very large join queries) often run
> many-hour programs on it. It does periodic checkpointing, and it hops off
> a workstation when the workstation's owner returns. If you're interested
> in it, or know of folks who would be, it's supported here by Mike Litzkow
> (mike@cream.cs.edu); he's one of the dept's research programmers. There
> was a paper about it in the 8th ICDCS Conference (in 1988) about it called
> "Condor - A Hunter of Idle Workstations" (by Litzkow, Livny, and Mutka).

...

I am aware of Condor, but just in case other readers of this group are
interested, I am posting this message. My feeling is that Condor makes a
lot of assumptions about why people are trying to manage the resources in
their machine and what it means to schedule a task.
Although Condor is quite nice for the simulation work being done at
Wisconsin, many applications would have problems with I/O performance
degradation by factors of 2-3, and the Condor concept of job checkpointing
is also very specific to the type of jobs Wisconsin is running on the
system. Also, I have the impression that Condor isn't very fault-tolerant,
but I could be out of touch with the most recent release of this system.
I would be more interested in seeing a "resource management tool" on which
more specific solutions such as Condor could be layered. Anyhow, thanks
for the pointer!