Monday, November 23, 2015

WebLogic Server: Analyzing stuck threads

A while ago I was confronted with a WebLogic server which did not want to start. The log files did not give an indication of what was wrong. I tried the usual suspects such as clearing the tmp and removing lok files but they did solve the issue. How to proceed? In this blog article I'll provide an example of how such an issue can be debugged.

A stuck thread is a thread which takes longer than the configured stuck thread max wait time (600 seconds by default). To simulate a stuck thread issue I created a simple servlet which does a Thread.sleep upon a request. I'll show how to use various tools to detect the servlet causing the problem and even identity the specific request which causes the code to malfunction. A similar mechanism can be used to debug for example JDBC issues and can help in the identification of which connection causes an issue. The tools used do not require the server to be fully started.


Setup

I created a simple HTTP servlet which did a Thread.sleep. After the stuck thread max time has passed, this thread is marked as STUCK. If you want to do your own tests with stuck threads, you can download sample programs doing something similar such as; Stuck thread for free.
 
Get a thread dump

A thread dump indicates which threads have which state and on which thread a thread is waiting. It also provides a stack trace of each thread. There are various ways to obtain a thread dump. If you have this dump, you can use for example ThreadLogic to analyze it. You can create a thread stack dump from the WebLogic console but also by issuing a command in a console.

WebLogic Console

In you have stuck threads but the WebLogic Console is still available, you can go to Environment, Servers and select a server. Now you can go to Monitoring, Threads. Here you can look at threads and identify stuck and hogging threads. Also you can request a dump of the Thread stacks.


Command line

In case the WebLogic Console is not available, you can fall back to the command line. On Windows and Linux/Unix you can use jstack to have the JVM output the thread dump to the console. On Linux/Unix, you can also issue a kill -3 <PID> to ask the JVM to print a thread dump to its standard out. On WebLogic server this thread dump usually ends up in a .out file in the domain log directory.

In order to execute jstack or the kill command, you first have to identify your process. On Windows I recommend Microsoft Sysinternals Process Explorer (GUI). You can also use jps (part of the JDK) which is a command-line tool.


On Linux determining the java process can be done with for example:

ps axw | grep java


Look at the command-line which is executed. This will tell you which process belongs to which WebLogic server.

In my case, the WebLogic server is running as PID 9049. I can now issue a kill -3 9049 giving me in the thread dump in the out log file or a jstack -l 9049 giving me the output directly on the console.

In this article I've described various command line tools in more detail.

Analyze the thread dump

ThreadLogic

To help you analyze thread dumps, the Oracle A-Team has been so kind as to provide us ThreadLogic (introduced here). It is a fork from TDA expanded with several useful features such the option to obtain actual advice and gain some detailed insight. Also it contains some WebLogic specific patterns often found in thread dumps which greatly helps with the analyzis.


In this case requests to my StuckThread servlet are indicated as fatal. Sometimes this provides enough information to solve the issue. However as you have seen, the thread dump contains relatively little information. In order to learn more you can obtain and analyze a heap dump.

Get a heap dump

Of course, knowing which thread causes the issue usually only gives you a general idea of what could be wrong. A JDBC connection is having a problem. Nice, but if you have a lot of JDBC connections configured which connect to different databases. You usually want to know a bit more. You can obtain a heap dump quite easily with jmap;


This generates an output.dmp file which contains the Java heap space (its in a binary hprof format). This file can become quite large so make sure you have enough disk space. For JRockit (a JVM Oracle obtained from BEA Systems) the command is different. See here for some useful JRockit commands. On other JVM's, the commands also differ.

Analyze the heap dump

Once you have obtained a heap dump, you can start analyzing it. A useful tool for this is MAT. A Memory Analyzer Tool. This tool loads the entire heap dump in memory and starts analyzing it. Thus it is recommended to have some memory available to do this. It is not easy to use though and requires some practice to filter the relevant information from a heap dump. Download and start the tool. Load the heap dump you have created. In the dominator_tree (one of the top buttons after you have loaded your heap dump) you can search for a specific string or class. In this case I look for a GET request related to my troublesome servlet. You can also search for the class causing the stuck thread (which you can obtain from the thread dump / ThreadLogic).




I can then list objects with incoming references. By going a bit deeper I can obtain information on the request causing the stuck thread.



GET /StuckThreadServlet/stuckthreadservlet HTTP/1.1..Host: 127.0.0.1:7101..User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0..Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8..Accept-Language: en-US,en;q=0.5..Accept-Encoding: gzip, deflate

Analyzing a heap dump can look difficult at first. You can compare it with looking at SOA Suite debug log files, which can become quite large. At first it is difficult to find what you are looking for (a lot is happening in SOA Suite) but with a bit of experience after a while you start to see patterns and develop your own search methods.

Finally

Getting a heap dump without looking at a thread dump first is not recommended since it will complicate the analysis and ThreadLogic gives you a head start by providing advice on what you should look at.

When not to analyze

Of course if a line like the following in the domain or server log provides you with enough information, you do not require an analysis like this.

####<Nov 22, 2015 10:03:34 AM EST> <Error> <WebLogicServer> <localhost.localdomain> <DefaultServer> <[ACTIVE] ExecuteThread: '73' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <d699ecd1-da0e-4c0a-8674-4fe3a362b5ec-0000022d> <1448204614760> <[severity-value: 8] [rid: 0] [partition-id: 0] [partition-name: DOMAIN] > <BEA-000337> <[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "686" seconds working on the request "Http Request Information: weblogic.servlet.internal.ServletRequestImpl@575083fd[GET /StuckThreadServlet/stuckthreadservlet]
", which is more than the configured time (StuckThreadMaxTime) of "600" seconds in "server-failure-trigger". Stack trace:
        java.lang.Thread.sleep(Native Method)
        nl.amis.smeetsm.StuckThreadServlet.doGet(StuckThreadServlet.java:27)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:790)

Causes

I've not talked about solutions to stuck thread problems. In order to provide suggestions for solutions, you should first look at common causes. Stuck threads can be caused by various issues and there is no single solution which works in every case. Some things I often see are connection problems; a request is send somewhere and a response is expected, but that response never comes and a timeout parameter has not been set (if a timeout has been set, the connection times out). Load balancers, firewalls or bad application programming can cause this. This can occur with most requests which use a connection such as HTTP and JDBC requests. Stuck threads can also be a symptom of memory shortage. If the server starts swapping memory to disk, it can slow down to such an amount that the stuck thread max wait time is reached for actions which can usually be performed within a period before the timeout is reached. Also memory shortage can cause high CPU due to constant garbage collection in a JVM causing a a drop in performance which can lead to stuck threads.

Dealing with stuck threads automatically

Dealing with stuck threads can be done automatically as stated;
I'm not counting 'restarting the server' as dealing with the issue. You're often better off analyzing what is wrong and fixing it instead of configuring automatic stuck thread handling.

JVMD

JVMD, part of Oracle Enterprise Manager Weblogic Management Pack can also be used to analyze stuck threads. Not all customers / operations departments allow developers to access Oracle Enterprise Manager and not everywhere Weblogic Management Pack is available. The tools described in this blog article can be used in such a case.

Java Mission Control

I have not looked at Java Mission Control in this article. Since Oracle JDK 7 Update 40 it is bundled with the JDK. It offers features like 'flight recorder' to do offline trend analyzes (something which JConsole for example does not provide) and allows detection of memory leaks. When analyzing a heap dump, finding memory leaks is far more difficult; you want to look at differences in memory usage over time and not at a fixed moment.

References

Overview monitoring tools Oracle JRockit
http://javaoraclesoa.blogspot.nl/2012/09/overview-monitoring-tools-oracle-jrockit.html

Java Mission Control
http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html

JVMD
https://blogs.oracle.com/oem/entry/quickly_diagnose_the_root_cause1

Automatic handling of stuck threads
https://community.oracle.com/thread/1983700

Memory Analyzer Tool
https://eclipse.org/mat/

ThreadLogic
https://java.net/projects/threadlogic

Microsoft Sysinternals Process Explorer
https://technet.microsoft.com/en-us/sysinternals/processexplorer.aspx