I began the JParallel library a little while ago, right after the post about not doing my own parallel programming (it doesn’t make a lot of sense), and almost immediately got a response from Mohamed Hafez on the issue tracker recommending I add a bunch of features. A week later he implemented them in his own JRuby-specific library.
He uses the Java ThreadPoolExecutor and says that a pure Ruby version of this would be cool to make. So I spent a day trying to make that happen. Thank goodness for the thread-pool-and-other-stuff library by “meh”, ’cause I don’t know how to make thread pools of my own. It’s nice having a thread-pool library in one’s own language; it makes my life a lot easier.
Features of mohamedhafez/parallelizer
- Return exceptions as elements in the final returned collection.
Returning the exception when a computation fails is fairly straightforward. Easy fix.
- Have a persistent thread pool so we don’t incur the cost of creating a new one every time.
I didn’t have a clue how to get this done while making sure that the JParallel object itself is threadsafe. This probably took the most time (>3 hours; I’m not proud) to figure out from Hafez’s code.
- Use the calling thread to do some of the work, so it minimizes the amount of work delegated to a different thread.
I haven’t implemented this yet.
- Can consume an array of Procs instead of just data.
Not implemented yet.
- Has a timeout for each job.
Not implemented yet.
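The first feature in the list, returning exceptions as elements, can be sketched roughly as below. This is a minimal, sequential illustration and not JParallel’s or Parallelizer’s actual code; `map_capturing_errors` is a hypothetical name made up for the example.

```ruby
# Hypothetical helper: maps over items, but a failing computation
# contributes its exception object instead of aborting the whole map.
def map_capturing_errors(items)
  items.map do |item|
    begin
      yield item
    rescue StandardError => e
      e # the exception itself becomes the element
    end
  end
end

results = map_capturing_errors([1, 0, 2]) { |n| 10 / n }
# results[0] and results[2] are Integers; results[1] is a ZeroDivisionError
```

The caller can then pick out failures with something like `results.grep(Exception)`.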
Using a persistent thread-pool
This is a hairy problem. In the first round of this library, I’d create a new thread-pool, pass all the stuff-that-needs-to-be-done to it, and use “pool.shutdown” to wait for all the work to get done, and finally return the finished product. Easy peasy beautiful.
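That first-round approach looks something like the sketch below. To keep it runnable without any gems, `TinyPool` is a hypothetical stand-in written with the standard library; it is not Meh’s Thread::Pool, and `throwaway_pool_map` is likewise an invented name.

```ruby
# TinyPool: a hypothetical, stdlib-only stand-in for a thread-pool class.
class TinyPool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop returns nil once the queue is closed and drained.
        while (job = @queue.pop)
          job.call
        end
      end
    end
  end

  def process(&job)
    @queue << job
  end

  # Stop accepting work and block until every queued job has run.
  def shutdown
    @queue.close
    @workers.each(&:join)
  end
end

# A brand-new pool per call: shutdown doubles as "wait for everything".
def throwaway_pool_map(items, threads: 4)
  pool = TinyPool.new(threads)
  results = Array.new(items.size) # pre-sized; each task writes its own slot
  items.each_with_index do |item, i|
    pool.process { results[i] = yield(item) }
  end
  pool.shutdown
  results
end

squares = throwaway_pool_map([1, 2, 3]) { |n| n * n }
```

It works because the pool only ever holds the tasks of one call, so "the pool is finished" and "my call is finished" mean the same thing.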
The problem with a persistent thread pool is that the pool itself has no way of knowing whether a particular group of tasks is done. “pool.shutdown” waits till all the work is done. But what if you have 2 jobs going? How do you determine that all the tasks for job1 are done? You don’t want to wait for all the tasks of every job sent to the pool to complete. Thread-pools (I don’t think it’s just this implementation) don’t implement anything like that; it’s not their responsibility to do so. We need a way to track each task (each item in the input-collection) and its progress.
Hafez uses “futures” in his library. A “future” is a bit of computation that you do asynchronously, and you can poll it or wait till it’s done. Exactly what we need. Meh’s thread library implements futures. Yay! So all that’s needed is to wrap every bit of the computation that can be executed in parallel inside a future. Put those futures into an array, and then wait till all of them are done.
Since we don’t want to have every future start up a new thread (definitely defeats the purpose of having a thread-pool), we just need to feed each future into the thread-pool, and Bob’s your uncle. Hafez creates a future for each Computation (a component class of Parallelizer) and then has the thread-pool “execute” each future. I racked my brains for about 1.5 hours trying to figure out how to do that in Ruby. Computation implements “Callable”, some Java concurrency thing. And then I looked at the Thread::Future source code and found that you can just give a new Future a thread-pool to run itself on! So problem solved. Thank you, Meh.
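To make the idea concrete, here is a bare-bones future built on a plain Ruby thread. `TinyFuture` is a hypothetical class for illustration only; it is not Thread::Future from Meh’s library, and it naively spawns one thread per future.

```ruby
# TinyFuture: hypothetical, stdlib-only. One thread per future.
class TinyFuture
  def initialize(&computation)
    @thread = Thread.new(&computation)
  end

  # Poll without blocking.
  def done?
    !@thread.alive?
  end

  # Block until the computation finishes, then return its result.
  def value
    @thread.value
  end
end

# Wrap each item's computation in a future, collect them, then wait.
futures = [1, 2, 3].map { |n| TinyFuture.new { n * 10 } }
results = futures.map(&:value) # waits only on these three futures
```

The important part is the last line: the caller waits on its own futures and nothing else.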
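Put together, the shape of the solution is roughly the following. This is a sketch under assumptions, not the code in JParallel or in Meh’s library: `SharedPool`, `PoolFuture` and `parallel_map` are hypothetical names, written with the standard library so the example runs on its own.

```ruby
# SharedPool: hypothetical persistent pool; the workers live as long as the program.
class SharedPool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) { Thread.new { loop { @queue.pop.call } } }
  end

  def process(&job)
    @queue << job
  end
end

# PoolFuture: hypothetical future that runs on a pool instead of its own thread.
class PoolFuture
  def initialize(pool, &computation)
    @mutex = Mutex.new
    @finished = ConditionVariable.new
    @done = false
    pool.process do
      result = begin
        computation.call
      rescue StandardError => e
        e # failures are kept as the future's value
      end
      @mutex.synchronize do
        @value = result
        @done = true
        @finished.broadcast
      end
    end
  end

  # Block until this one task has finished, then return its result.
  def value
    @mutex.synchronize do
      @finished.wait(@mutex) until @done
      @value
    end
  end
end

# Each call tracks only its own futures, so the pool can be shared.
def parallel_map(pool, items, &block)
  futures = items.map { |item| PoolFuture.new(pool) { block.call(item) } }
  futures.map(&:value)
end

pool = SharedPool.new(4)
job1 = Thread.new { parallel_map(pool, [1, 2, 3]) { |n| n + 1 } }
job2 = Thread.new { parallel_map(pool, %w[a b]) { |s| s.upcase } }
job1_result = job1.value
job2_result = job2.value
```

Two jobs share one pool here, and neither waits on the other’s tasks, which is exactly what “pool.shutdown” could not give us.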
ToDo
- Change the readme for Meh’s library so that others don’t need to dig through the source-code to realize that a Thread::Future can just be passed a Thread::Pool.
- Implement the other features that Hafez has in the JRuby-specific code.
- Add more tests to JParallel.
Need to make sure that each instance of JParallel is threadsafe itself. I’ve no idea how to test this just yet. A basic Google search didn’t help. Will try again.
- Add tests to mohamedhafez/parallelizer.
- DRY up my code.