[Pogamut-list] Proper Server Termination

Wed Apr 27 13:38:22 CEST 2011

Re: Proper Server Termination

Author: jakub.gemrot

> Ok, some more questions related to server/bot termination. I've discovered that my code definitely has a memory leak,
> and maybe also a Thread leak, with the help of the NetBeans profiler. I verified a few things:

> If I run my bot by itself against an opponent on an externally launched server, there is no memory/thread leak.

Good.

> If I run my evolution code, but make it launch only one server with an unlimited time limit (so basically, one server
> gets launched with some bots), there is no memory/thread leak.

Does this mean you are using UCCWrapper?

> If I evolve using just one server at a time (no multithreading, but still using the same code that can launch multiple server threads),
> with a population size of 1 for several generations, the code leaks both memory and threads.

Do you mean the code as a whole, or just your own code? Or could this not be verified?

> I'm not sure of the cause, but I have some theories and a little data that I'm hoping will help.
> I'm wondering if whenever I stop/kill a server/bot, the resulting FatalErrorEvent is not closing
> all the running threads, and therefore not freeing the memory used by objects in those threads.
> How do I completely wipe out all objects and threads created by a server/bot? Simply using stop/kill
> doesn't seem to work. I've noticed in some of the Pogamut code interesting use of various sync tools,
> such as listeners for state changes, countdown latches, and things like that, but before I start
> trying to copy this code, I was wondering if anyone could help explain what is necessary and how/why
> it works.

No, FatalErrorEvent is propagated to every single agent thread. I.e., whenever a FatalErrorEvent happens,
it gracefully tears down ONE agent; if that fails, there is a bug.

Such an event might then be sensed by other parts of Pogamut, such as
UT2004BotRunner, which might then kill all the other agents.

The key point here is to discover which threads were not killed. You can find this out from the debugger itself
(no profiler needed), in either Eclipse or NetBeans: whenever such a FatalErrorEvent occurs and your JVM won't terminate
(meaning some threads are still running), simply pause the JVM and examine how many threads it has and what their names
are. All threads spawned by the Pogamut library are nicely named, so it is easy to tell them apart.
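
If pausing in the debugger is awkward in the middle of a long evolution run, the same information can be printed from inside the JVM. Here is a minimal sketch using only the standard Thread.getAllStackTraces() call; the class and method names (LiveThreadDump, describeLiveThreads) are made up for illustration and are not part of Pogamut:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LiveThreadDump {

    /** Returns one "name (daemon|user, STATE)" entry per live thread, sorted by name. */
    static List<String> describeLiveThreads() {
        List<String> lines = new ArrayList<>();
        // getAllStackTraces() returns a snapshot keyed by every live thread in the JVM.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            String kind = t.isDaemon() ? "daemon" : "user";
            lines.add(t.getName() + " (" + kind + ", " + t.getState() + ")");
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeLiveThreads()) {
            System.out.println(line);
        }
    }
}
```

Calling describeLiveThreads() once before the first generation and again after each server/bot shutdown, then comparing the two lists, should show which thread names keep accumulating.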

> Also, although the profiler lets me look at the allocated objects, it crashes whenever I try to use the
> "Record stack trace for allocations" option, so I'm not able to find the exact sources of the classes
> that are causing the problem. However, from looking at the number, age, and surviving generations
> of the different classes, I can say that the following seem to be causing problems:

Probably Guice and byte-code weaving are to blame... but I'm not sure.

> HashMaps, ArrayLists, Locations, Strings, char arrays and Object arrays. The char arrays are inside the
> Strings, and the Object Arrays are inside the ArrayLists. Strings are probably the keys for a lot of the HashMaps,
> and Locations are probably what's being looked up. From all of this, I have the feeling that navigation and
> path planning information is not being cleared out when the bots die, but maybe I'm jumping to conclusions.
> Without the ability to follow the stack traces, I can't know for certain.

No HashMap in Pogamut uses a String as a key, so that guess is wrong; we always use Token, which is a "named" String
with a much faster equals() method. The problem with using a profiler is that all objects in the core
consist of many lists / maps / Strings, so these appear to leak the most, but they are just the base of the
mountain of objects that leaked because some "high-level object" was not garbage-collected. You should therefore look for objects
named "XYZModule", "XYZAgent" or "XYZBot", which will probably be at the end of the list of objects (when sorted
by number of instances).
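
To avoid scrolling to the bottom of a huge instance list, the histogram can be filtered for those high-level names. The sketch below assumes the histogram is available as text in the layout printed by the JDK's `jmap -histo <pid>` (rank, instance count, bytes, class name); a profiler export may look different. HistogramFilter and highLevelRows are hypothetical names, and the three suffixes are simply the ones mentioned above:

```java
import java.util.ArrayList;
import java.util.List;

public class HistogramFilter {

    /** Keeps histogram rows whose simple class name ends in Module, Agent, or Bot. */
    static List<String> highLevelRows(List<String> histogramLines) {
        List<String> hits = new ArrayList<>();
        for (String line : histogramLines) {
            String[] cols = line.trim().split("\\s+");
            // Data rows start with a rank such as "12:"; skip header, separator, and total rows.
            if (cols.length < 4 || !cols[0].endsWith(":")) {
                continue;
            }
            String className = cols[3];
            String simpleName = className.substring(className.lastIndexOf('.') + 1);
            if (simpleName.endsWith("Module") || simpleName.endsWith("Agent") || simpleName.endsWith("Bot")) {
                hits.add(cols[1] + " x " + className); // "<instances> x <class>"
            }
        }
        return hits;
    }
}
```

If the instance count of such a class grows by one (or by the number of bots) with every generation, that class is a good starting point when looking for whatever still references it.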

> On the thread side, RMI connections, sockets, and TCP connections seem to be responsible. These threads
> tend to stay alive long after they should be closed.

There is a JMX problem: to publish an agent's JMX interface, one needs daemon threads running an RMI registry.
If that registry is not shut down properly (via PogamutPlatform.getPlatform().close();), it will run forever.
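
One way to make sure that call is never skipped is to put it in a finally block around the whole evolution run. The sketch below only shows the pattern: Evaluation and runAll are hypothetical stand-ins for your own code, and the shutdown hook is passed in as a Runnable so the example compiles without Pogamut on the classpath:

```java
import java.util.List;

public class PlatformCleanupSketch {

    /** Hypothetical stand-in for one evaluation (start server, run bots, stop them). */
    interface Evaluation {
        void run() throws Exception;
    }

    /**
     * Runs every evaluation in order, then invokes the shutdown hook exactly once,
     * even if one of the evaluations throws.
     */
    static void runAll(List<Evaluation> evaluations, Runnable platformShutdown) throws Exception {
        try {
            for (Evaluation evaluation : evaluations) {
                evaluation.run();
            }
        } finally {
            // In real Pogamut code the hook would wrap the call named above:
            // PogamutPlatform.getPlatform().close();
            platformShutdown.run();
        }
    }
}
```

Whether the platform can be reused after close() is not something this sketch assumes, which is why the hook runs once at the very end rather than after each evaluation.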

> I can provide more specific profiler data if needed. I would really like to figure this out.

Try JProfiler on your code. It is paid, but you can obtain a 14-day trial license; it is very powerful and much more useful
than the NetBeans profiler.

=========================

The thing we probably need the most right now is the list of threads that were not stopped after some error.

Cheers!
Jimmy
