Lazy vs. Eager Workers

Coordinator
Jan 31, 2009 at 5:26 AM
An issue that has been kicking around in my head since the start of this project has been "lazy" versus "eager" workers. Basically, there are two scenarios for how task workers should behave:

  • Lazy: All the workers in a swarm merely wait until they are told explicitly what to do by a peer.
  • Eager: All the workers in a swarm actively query peers to find more tasks to process if they're idle.
It may be good in the long run to implement both of these methods of operation. In the short term, I think that one of them should be picked and implemented, in order to reduce feature creep and keep things simple. The question is.... lazy vs. eager?

Both methods of operation have associated pros and cons. Let's take a brief look (a rough sketch of both worker loops follows the lists):

Lazy:
  • Allows the peer who owns a job to more easily coordinate the execution of the job.
  • In code, the Worker only listens for incoming activity and responds, rather than creating network traffic on its own. This makes the network model slightly simpler.
Eager:
  • Ensures that all available processing power is utilized at any given time.
  • Workers coming online immediately begin working.
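Here's a rough sketch of the two loops, just to pin down the difference in code terms (the Worker/Peer names and the request_task() call are hypothetical, not anything that exists in PN264 yet):

```python
import queue
import time

class Worker:
    """Hypothetical worker; names here are purely illustrative."""

    def __init__(self, peer):
        self.peer = peer            # the peer that owns/supervises this worker
        self.inbox = queue.Queue()  # tasks pushed to us by the owning peer

    def run_lazy(self):
        # Lazy: block until a peer explicitly hands us a task.
        while True:
            task = self.inbox.get()
            task.process()

    def run_eager(self):
        # Eager: when idle, actively ask the swarm for more work.
        while True:
            try:
                task = self.inbox.get_nowait()
            except queue.Empty:
                task = self.peer.request_task()  # hypothetical "ask around" call
            if task is None:
                time.sleep(1)                    # nothing available anywhere; back off
                continue
            task.process()
```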
Jan 31, 2009 at 5:31 AM
It would seem to me that the eager method, although more complicated, would result in better efficiency. If one worker finds itself dragging, it could pass its load along the line, allowing the outside tasks to complete quicker, and then get back to processing once the outside tasks are complete.

Also, this makes the setup less centralized, spreading the work around without the extra step of verifying activity with a central server.
Coordinator
Jan 31, 2009 at 5:55 AM
Ok, "eager" was what I was thinking would work best, as well.

The next question is of networking implementation.

Ok, so you have a bunch of machines hooked up together over a LAN. You want to distribute encoding to all of them, so you run PN264 on each machine (which forces itself to a single application instance). Each of these machines is now a peer. So far, so good.

However, once you add the concept of workers to the peer model, you now have another layer of complexity. A quad-core machine would likely have 4 distinct workers, a dual-core would have 2 workers by default, etc. Now you have a group of peers, which communicate directly with each other, and you also have multiple workers per peer. Do you treat each worker as a separate entity and communicate with it separately from the peer that owns/supervises it? Or do you use the owning peer as a "proxy", so other peers only ever communicate with a worker through it?
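For comparison, here's roughly what the "peers as proxy" option looks like: remote peers only ever address the peer, and the peer fans tasks out to its local workers, one per core. All the names here are hypothetical, just to make the layering concrete:

```python
import os
import queue
import threading

def encode(task):
    """Placeholder for the real per-task encoding work."""
    pass

class Peer:
    """Hypothetical peer acting as a proxy for its local workers."""

    def __init__(self):
        self.task_queue = queue.Queue()
        # One worker per core; remote peers never see these directly.
        self.workers = [
            threading.Thread(target=self._worker_loop, daemon=True)
            for _ in range(os.cpu_count() or 1)
        ]
        for w in self.workers:
            w.start()

    def on_message(self, message):
        # Remote peers only ever talk to the peer; the peer decides
        # which local worker actually runs the task.
        if message.get("type") == "task":
            self.task_queue.put(message["payload"])

    def _worker_loop(self):
        while True:
            encode(self.task_queue.get())
```

The alternative (treating each worker as a separately addressable entity) would push the per-machine worker count into the peer-to-peer protocol itself.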
Jan 31, 2009 at 6:42 AM
I think I could explain this better in a flow chart, but give me a chance and see if you can follow (it's about bedtime).

You would have a central (main) server/rendering machine which would delegate tasks to the clients;
the clients would then balance the load across their processors (split the load equally into per-processor parts, maybe at 90% utilization).

If another task latches on to any of the processors, the newly burdened processor slides its remaining load to whichever processor has the lightest load, and the newly freed node runs whatever process it needs to run.

Once freed, the processor would then query for which chip has the largest load and grab half of that task.

This could eventually be scaled up across the network: as other hosts become free, they would ask around to help process unfinished data on the still-working clients.

This would again split the task with the second host, and so on. Basically, the load would first slide around on a host's own chips; then, as other hosts become unburdened, the load sharing would be scaled across hosts to take advantage of the free processors.
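In other words, roughly a work-stealing rule: an idle node takes half of the heaviest node's backlog. A purely illustrative sketch of that one step (nothing here is real PN264 code, and a real version would need messaging and locking between hosts):

```python
def steal_work(idle_node, nodes):
    """When a node goes idle, grab half of the heaviest node's backlog.

    `nodes` are hypothetical objects with a `pending` list of tasks;
    this only pins down the balancing rule, not the networking.
    """
    busiest = max(nodes, key=lambda n: len(n.pending))
    if busiest is idle_node or len(busiest.pending) < 2:
        return []                       # nothing worth stealing
    half = len(busiest.pending) // 2
    stolen = busiest.pending[-half:]    # take the back half of its queue
    del busiest.pending[-half:]
    idle_node.pending.extend(stolen)
    return stolen
```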

Again, it is late, and I could draw this better; hopefully you see the idea.


Coordinator
Feb 2, 2009 at 5:12 AM
Edited Feb 2, 2009 at 5:13 AM
Ok, it seems you're suggesting a client/server model, which is something I've been trying to avoid. Or, more accurately, in the model I'm thinking of, each peer is a server for its own jobs, which means the network as a whole has no central point of failure. Only a job has a central point of failure, which is the peer that owns that job.

Also, you bring up another point, which is the timing of task distribution. Originally I intended to have workers maintain a small queue of tasks and only request new tasks (and associated data) when the queue has slots available. The major issue with pre-distributing all task data to all known workers is the available resources on each machine.
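The pull-based version of that is pretty simple; something like this (hypothetical names, and the queue depth of 2 is just an example):

```python
class WorkerQueue:
    """Hypothetical bounded task queue for a single worker.

    The worker only asks its owning peer for more tasks when a slot
    opens up, so task data never has to be pre-distributed for the
    whole job.
    """

    def __init__(self, peer, depth=2):
        self.peer = peer
        self.depth = depth
        self.tasks = []

    def refill(self):
        # Request only as many tasks (and their frame data) as there
        # are free slots right now.
        while len(self.tasks) < self.depth:
            task = self.peer.request_task()  # hypothetical peer call
            if task is None:
                break
            self.tasks.append(task)
```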

A typical standard-definition encoding job, for example an hour and a half of movie footage, will be: 90 minutes * 60 seconds per minute * 29.97 frames per second = 161,838 frames, which will be transmitted/cached/stored uncompressed to save CPU cycles. So with a standard-definition frame being 720x480 @ 24 bits per pixel, you end up with 1,036,800 bytes per frame, for a total of 167,793,638,400 bytes (approx. 156 GB) for a 90-minute NTSC clip. In order to pre-distribute all task data for a given job, your swarm would need to be able to temporarily store 156 GB of data, just for a single job. For a 90-minute high-definition source, you're looking at ((((1920 * 1080 * 24 bits per pixel / 8 bits per byte) * (90 minutes * 60 seconds per minute * 29.97 frames per second)) / 1024) / 1024) / 1024 = 937.62002 GB of data.
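The same arithmetic, spelled out (nothing project-specific here, just the numbers above):

```python
FPS = 29.97
MINUTES = 90
frames = MINUTES * 60 * FPS                  # 161,838 frames

def raw_frame_bytes(width, height, bits_per_pixel=24):
    return width * height * bits_per_pixel // 8

sd_frame = raw_frame_bytes(720, 480)         # 1,036,800 bytes per SD frame
hd_frame = raw_frame_bytes(1920, 1080)       # 6,220,800 bytes per HD frame

sd_total_gb = frames * sd_frame / 1024**3    # ~156.27 GB
hd_total_gb = frames * hd_frame / 1024**3    # ~937.62 GB
```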

However, because the data will be retrieved from a non-linear media source (NSynth or Avisynth), we can work with such a large amount of data more efficiently. In most cases the best minimum and maximum frame counts for a Group-of-Pictures are (frame rate) and (frame rate * 10), respectively. For NTSC, min = 30, max = 300. Reading 300 frames from the media source, sending them to a peer for processing, then retrieving the much smaller output from the processing reduces the resource needs of the network as a whole by a huge amount.
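So the unit of distribution becomes a max-GOP-sized chunk of frames read from the non-linear source. A rough sketch of the chunking (hypothetical code, just to show the scale):

```python
FPS = 30                 # NTSC, rounded for GOP sizing
GOP_MIN = FPS            # 30 frames
GOP_MAX = FPS * 10       # 300 frames

def chunk_ranges(total_frames, chunk_size=GOP_MAX):
    """Yield (start, end) frame ranges, one max-size GOP per chunk.

    Each range is read from the NSynth/Avisynth source, sent to a peer
    for encoding, and only the much smaller compressed output comes back.
    """
    for start in range(0, total_frames, chunk_size):
        yield start, min(start + chunk_size, total_frames)

# A 90-minute NTSC clip (161,838 frames) splits into 540 chunks of at
# most 300 frames, so at most ~296 MB of raw SD frames is in flight per
# chunk instead of 156 GB for the whole job.
```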