JIE: Solving the Distributed Lock Management Problem
Posted by rbpasker on April 18, 2008
Yesterday I wrote about the pitfalls of trying to do Distributed Lock Management. Well, it turns out my friends at Terracotta have been hard at work on an awesome management tool to let users detect the problem in a console. Nice job! I wonder, however, how easy it will be for a developer to spot the problem in the case where there are more than two locks involved in the deadlock.
What I would like to see is an automatic solution to the problem, one that detects such a deadly embrace and chooses a victim to kill. Detecting the deadlock would have to be heuristic, in the sense of watching the locks to see how long they are usually held, and considering only those locks that have been held well beyond their normal holding time.
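To make the hold-time heuristic concrete, here is a minimal sketch in Java. Everything in it is hypothetical: the class name HoldTimeMonitor, the string lock ids, and the "suspicion factor" threshold are my own assumptions, not anything Terracotta provides. It only nominates candidate locks; it does not prove that a deadlock exists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical monitor: flags locks held far longer than their usual hold time. */
class HoldTimeMonitor {
    private static final class Stats {
        long completedHolds;   // number of finished acquire/release pairs
        long totalHoldNanos;   // sum of the durations of those holds
        boolean held;          // true while some thread currently holds the lock
        long heldSinceNanos;   // timestamp of the current acquisition, if held
    }

    private final Map<String, Stats> statsByLock = new ConcurrentHashMap<>();
    private final double suspicionFactor; // assumed knob: how many times the mean counts as "too long"

    HoldTimeMonitor(double suspicionFactor) {
        this.suspicionFactor = suspicionFactor;
    }

    /** Record that lockId was acquired at nowNanos (e.g. System.nanoTime()). */
    void acquired(String lockId, long nowNanos) {
        Stats s = statsByLock.computeIfAbsent(lockId, k -> new Stats());
        synchronized (s) {
            s.held = true;
            s.heldSinceNanos = nowNanos;
        }
    }

    /** Record that lockId was released at nowNanos and fold the hold into the average. */
    void released(String lockId, long nowNanos) {
        Stats s = statsByLock.get(lockId);
        if (s == null) return;
        synchronized (s) {
            if (!s.held) return;
            s.totalHoldNanos += nowNanos - s.heldSinceNanos;
            s.completedHolds++;
            s.held = false;
        }
    }

    /** Locks currently held for more than suspicionFactor times their mean hold time. */
    List<String> suspects(long nowNanos) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Stats> e : statsByLock.entrySet()) {
            Stats s = e.getValue();
            synchronized (s) {
                if (!s.held || s.completedHolds == 0) continue; // no baseline yet
                double meanNanos = (double) s.totalHoldNanos / s.completedHolds;
                if (nowNanos - s.heldSinceNanos > suspicionFactor * meanNanos) {
                    result.add(e.getKey());
                }
            }
        }
        return result;
    }
}
```

Timestamps are passed in rather than read inside the class so the sketch is easy to exercise; a real detector would also want something sturdier than a running mean, since one slow but legitimate hold can look suspicious.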
In a distributed locking case, such as with Terracotta, the deadlock detector could call System.exit() in the victim VM and let the other VM continue along, and the management system would automagically restart the victim VM. It wouldn’t prevent the problem from happening again in 10 seconds, but it might at least ring lots of bells so someone can come look at the problem, rather than having the whole cluster deadlocked. In the single VM case, we’d have to wait for a proper solution to Thread.stop(), which I also talked about yesterday.
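As a rough illustration of the "detect, then kill the victim" idea, here is a sketch built on the JVM's standard ThreadMXBean. Two caveats: the DeadlockWatchdog class and its onDeadlock callback are names I made up, and ThreadMXBean only sees threads inside its own VM, so it would not notice a deadlock that spans VMs. For that, the detector would need lock information from the cluster itself, and I am not assuming any particular Terracotta API here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical watchdog: runs an action when the local JVM reports deadlocked threads. */
class DeadlockWatchdog {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    private final Runnable onDeadlock; // e.g. () -> System.exit(1) to sacrifice this VM

    DeadlockWatchdog(Runnable onDeadlock) {
        this.onDeadlock = onDeadlock;
    }

    /** Checks once; returns true (after running the action) if a deadlock was found. */
    boolean checkOnce() {
        // findDeadlockedThreads() also covers java.util.concurrent locks, but only
        // where the JVM supports it; otherwise fall back to monitor-only detection.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        if (ids == null) {
            return false; // both calls return null when nothing is deadlocked
        }
        onDeadlock.run();
        return true;
    }
}
```

Something like new DeadlockWatchdog(() -> System.exit(1)) could then be polled from a timer thread. Passing the action in as a Runnable keeps the drastic part swappable, so the same check can just ring the bells in an environment where killing the VM is too much.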
The other question I have about deadlock detection is whether some of it can be done via static analysis, but this is not my area of expertise. An alternative would be to use AOP to instrument the locks. I’m sure someone has already done this.
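For what instrumenting the locks might buy you, here is a small sketch of a lock-order checker. It is a hand-written stand-in for the calls an aspect would weave in around lock acquisition and release, not AOP itself, and the LockOrderGraph class and its method names are hypothetical. It records which locks are taken while others are held and reports when a new acquisition closes a cycle in that ordering, which works for any number of locks, not just two.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical lock-order graph: an edge A -> B means B was acquired while A was held. */
class LockOrderGraph {
    private final Map<String, Set<String>> acquiredAfter = new HashMap<>();
    private final ThreadLocal<Deque<String>> held = ThreadLocal.withInitial(ArrayDeque::new);

    /** Call just before the current thread acquires lockId; true if the ordering now has a cycle. */
    synchronized boolean beforeAcquire(String lockId) {
        Deque<String> mine = held.get();
        if (mine.contains(lockId)) { // reentrant acquisition adds no new ordering
            mine.push(lockId);
            return false;
        }
        boolean cycle = false;
        for (String h : mine) {
            acquiredAfter.computeIfAbsent(h, k -> new HashSet<>()).add(lockId);
            // A path from lockId back to h means some code takes them in the other order.
            if (reaches(lockId, h, new HashSet<>())) {
                cycle = true;
            }
        }
        mine.push(lockId);
        return cycle;
    }

    /** Call after the current thread releases lockId. */
    void afterRelease(String lockId) {
        held.get().remove(lockId); // removes one occurrence, so reentrant holds balance out
    }

    /** Depth-first search over the recorded edges. */
    private boolean reaches(String from, String target, Set<String> seen) {
        if (from.equals(target)) return true;
        if (!seen.add(from)) return false;
        for (String next : acquiredAfter.getOrDefault(from, Collections.<String>emptySet())) {
            if (reaches(next, target, seen)) return true;
        }
        return false;
    }
}
```

Note that this flags potential deadlocks (inconsistent lock ordering seen at any point in the run), not ones that are actually in progress, so it is closer in spirit to static analysis than to the hold-time heuristic.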