Wednesday, March 12, 2014

Questions one should have answers for while setting up High Availability in Oracle SOA

Customers who are embarking on Oracle SOA for the first time, especially in a highly available mode, are often daunted by the various deployment topologies and the administrative tasks involved. While the Oracle Enterprise Deployment Guide (EDG) is very popular and is used as a reference for setting up high availability, a lot of questions still linger, such as why certain things are set up at all or why they are set up in a particular way. In fact, sometimes it is not clear why we are doing some of the steps in the process. This post attempts to clear up those vague concepts.

Question : What is the use of an external load balancer (software or hardware) when we plan to implement a WebLogic cluster that does load balancing anyway?

Answer : External load balancers provide an additional layer of security and more reliable load balancing, since they are not affected by the performance of the internal systems. They are used only to balance load across HTTP servers or, in some cases (not recommended), across application servers, using well-defined algorithms. Load balancing at the cluster level, on the other hand, is for clustered object stubs when no algorithm is specified at the object level. The algorithm set at the WebLogic cluster level supports load balancing for RMI objects and EJBs.
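To make "well-defined algorithm" a bit more concrete, here is a minimal sketch of round-robin, one common load-balancing algorithm. It is only an illustration in plain Python, not how any particular load balancer is implemented, and the server names ohs1 and ohs2 are made up.

```python
from itertools import cycle


def round_robin(servers):
    """Return a function that hands out the given servers in strict rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


# Hypothetical HTTP server names, for illustration only.
pick = round_robin(["ohs1", "ohs2"])
assignments = [pick() for _ in range(4)]
print(assignments)
```

Each incoming request simply goes to the next server in the list, so the load is spread evenly regardless of how busy each server currently is.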

Question : Can managed servers continue to run if the admin server goes down?
Answer : Yes, they do. They try to reconnect to the admin server at a frequency that is set in the configuration, but in the meantime they continue on their own. Another point to note is that managed servers can also start by themselves (with the help of Node Manager) without the admin server being up. This is called MSI mode, or Managed Server Independence mode. The problem with this mode is that a managed server cannot pick up the latest configuration changes made in the domain, because it runs with its own local copy of the configuration, which it collected the last time it started while the admin server was up and running.

Question : What is whole server migration? Why do I need it when my cluster performs a "failover" of various services when one of its members is down?

Answer : In a WebLogic Server cluster, most services are deployed homogeneously on all server instances in the cluster, enabling transparent failover from one server to another. In contrast, "pinned services" such as JMS and the JTA transaction recovery system are targeted at individual server instances within a cluster—for these services, WebLogic Server supports failure recovery with migration, as opposed to failover.
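The difference between failover and migration can be sketched with a toy model. This is plain Python and only illustrates the idea. It does not use any WebLogic API, the Cluster class is hypothetical, and soa_server2 is an assumed second cluster member.

```python
class Cluster:
    """Toy model of a cluster with clustered and pinned services."""

    def __init__(self, servers, pinned):
        self.running = set(servers)
        self.pinned = dict(pinned)  # service name -> the single server hosting it

    def route_clustered(self):
        """Failover: a clustered service lives on every member, so any running one will do."""
        if not self.running:
            raise RuntimeError("no running servers")
        return sorted(self.running)[0]

    def fail(self, server, migrate=True):
        """Take a server down; pinned services move only if migration is configured."""
        self.running.discard(server)
        survivors = sorted(self.running)
        for service, host in list(self.pinned.items()):
            if host == server:
                self.pinned[service] = survivors[0] if migrate and survivors else None


pinned = {"JMS": "soa_server1", "JTA": "soa_server1"}

with_migration = Cluster(["soa_server1", "soa_server2"], pinned)
with_migration.fail("soa_server1")

without_migration = Cluster(["soa_server1", "soa_server2"], pinned)
without_migration.fail("soa_server1", migrate=False)

print(with_migration.route_clustered(), with_migration.pinned)
print(without_migration.route_clustered(), without_migration.pinned)
```

In both runs the clustered request is still served by the surviving member, but the pinned services stay available only in the run where migration is set up.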

Question : Why are JMS and JTA targeted to a single server?

Answer : WebLogic services are divided into two categories: Clustered services and Singleton services. JMS and JTA come under Singleton services, which can be targeted only to a single instance in a cluster and, more importantly, are backed by a single persistent store (Note : the single persistent store is the key).
The reason for this setup is that it offers high QoS by ensuring message consistency and reliability. For example, a JMS message on a topic or a queue should not be stored in multiple places, for the sake of data consistency. Likewise, information related to the message, such as transactions and acknowledgements, should not be spread across servers.

Question : Is Whole Server migration the only way to ensure JMS and JTA "Singleton" services are recovered in case of failures?

Answer : No. We can do service migration too, i.e., migration of the individual JMS and JTA services to another server. The catch is that only JTA service migration works with the default persistent store. Other services such as JMS need custom persistent stores to be able to migrate to other servers.

Question : Would the problem be the same for any other Singleton services in my server?
Answer : Yes. Singleton services, as the name indicates, are targeted to a single server instance and use the one persistent store allocated to them in that server instance.

Question : What is the impact if server/service migration is not set up?
Answer : The instances that were running when the server crashed fail, and they may be retried depending on various parameters. If it is an asynchronous instance that was already dehydrated by the time of the crash, the server may try to recover it. In certain cases manual recovery is still the only option.

This still sounds theoretical, although it clears some of the vagueness around important concepts of HA architecture. A more practical look at the above, with examples, will follow in my upcoming blog post.

Quick recap of JVM in context of weblogic

This is probably an age-old concept that has been discussed many times since the beginning of Java, but it seems many programmers struggle with, or often lose track of, these trivial but important basics. This post is just to refresh that knowledge. In particular, people who work on WebLogic Server often encounter issues such as "OutOfMemoryError" and "PermGenSpace Out of memory" and end up tweaking these settings with little or no real understanding.

Let's see what the different JVM settings are (not a full list):

1. -Xms - Initial heap space.
2. -Xmx - Maximum heap space.
3. -XX:PermSize (and -XX:MaxPermSize) - Permanent Generation space.


Initial heap space - The amount of heap space allocated to the JVM heap at the time of server start.

Maximum heap space - Once the server is started and applications/instances are deployed or redeployed, the available (initial) heap space gets used up (free space decreases) and also becomes fragmented to some extent.
Fragmented means that the free heap memory is not available as one big chunk in one place but is scattered across multiple places. The problem with fragmentation is that when an object has to be created in the heap, no single free chunk may be big enough to hold the object entirely, even though the total free space is sufficient. The JVM cannot split an object across chunks, so it first has to compact the heap (move the live objects together), which is time consuming, or grow the heap.
Now let's come back to what maximum heap space is about. If, because of heavy use of the applications, heap usage reaches the initial heap space we allocated, the JVM obtains more memory from the operating system as needed, up to the max heap space.

How do I set these two parameters for better performance?
Answer : It depends on your need. Let's consider two cases: a) you know that you are going to run a huge load on the server that needs a lot of heap space, and b) you don't know how much heap space is needed initially. In the first case, you should set initial heap space = max heap space (-Xms equal to -Xmx), as this avoids unnecessary heap resizing once the initial heap threshold is reached. Since the heap is at its maximum from the start, the JVM never has to stop and request more memory from the OS, which saves time. But setting a high heap right from the beginning is not good for applications in the second category, where it is unclear whether usage will ever grow beyond the initial heap space. Also, when you allocate more initial heap space, the server may take more time to start.
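The two cases can also be checked mechanically. Here is a small sketch in plain Python that pulls -Xms and -Xmx out of a memory-argument string and says whether the heap is fixed in size (case a) or allowed to grow (case b). The helper names heap_sizes and describe are my own and are not part of any WebLogic or JVM tooling.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms / -Xmx.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_sizes(mem_args):
    """Return (-Xms, -Xmx) in bytes; a value is None if the flag is absent."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", mem_args):
        sizes[flag] = int(number) * _UNITS.get(unit.lower(), 1)
    return sizes.get("Xms"), sizes.get("Xmx")


def describe(mem_args):
    """Classify a memory-argument string by how its heap is sized."""
    xms, xmx = heap_sizes(mem_args)
    if xms is None or xmx is None:
        return "incomplete"
    if xms > xmx:
        return "invalid"
    return "fixed" if xms == xmx else "growable"


print(describe("-Xms1024m -Xmx1024m"))
print(describe("-Xms256m -Xmx512m -XX:PermSize=256m -XX:MaxPermSize=768m"))
```

A check like this is handy before a restart, because a typo that leaves -Xms larger than -Xmx will stop the JVM from starting at all.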

PermGen space - There are three kinds of racks (or generations) in the memory the JVM manages.
a) Young Generation b) Old Generation c) Permanent Generation (PermGen)
Young Generation - newly created objects are allocated here.
Old Generation - objects that have been around for a while and are still live are moved to the old generation. They are not frequently used but are still live.
Perm Generation - The space the JVM sets aside to store classes and other permanent, static data that always needs to be there. Unlike the young and old generations, it is sized separately from -Xms/-Xmx (with -XX:PermSize and -XX:MaxPermSize). How much is needed depends entirely on the number of applications used (the classes you deploy) and on the way the programming is done (e.g., usage of statics).
A few other memory terms -

Eden Space (heap - young generation): pool from which memory is INITIALLY allocated for most objects.
Survivor Space (heap - young generation): pool containing objects that have survived GC of the eden space.
Tenured Generation (heap - old generation): pool containing objects that have existed for some time in the survivor space.
Permanent Generation (non-heap): holds all the reflective data of the virtual machine itself, stores class level details, loading and unloading classes (e.g. JSPs), methods, String pool.
Code Cache (non-heap): the HotSpot JVM also includes a "code cache" containing memory used for compilation and storage of native code.

In order to set custom memory arguments for each of the servers (Admin and/or Managed), you need to set USER_MEM_ARGS in the respective server's env script, i.e., setSOADomainEnv.cmd / setOSBDomainEnv.cmd / setOEREnv.cmd respectively.

set USER_MEM_ARGS=-Xms256m -Xmx512m -XX:PermSize=256m -XX:MaxPermSize=768m

However, doing so poses a problem:
setDomainEnv calls the respective env files of the managed servers in the same domain,
say setSOADomainEnv.sh, setOSBDomainEnv.sh and setOERDomainEnv.sh, in that order.
Each of these env scripts sets its own USER_MEM_ARGS. As USER_MEM_ARGS is a single variable shared across them, it ends up holding the value set by the last script in the order, in this case setOERDomainEnv.sh. So all the servers get started with the same memory arguments, which is very bad.
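The "last script wins" behaviour is easy to reproduce outside WebLogic. The sketch below is plain Python that only mimics what happens to the shared variable. It does not run the real env scripts, and the memory values attached to each script are made up for illustration.

```python
def run_env_scripts(scripts):
    """Mimic setDomainEnv calling each env script in order.

    Every script assigns to the same USER_MEM_ARGS variable, so each call
    overwrites whatever the previous script had set.
    """
    env = {}
    for _script_name, mem_args in scripts:
        env["USER_MEM_ARGS"] = mem_args
    return env


# Script names are from the post; the values are hypothetical.
scripts = [
    ("setSOADomainEnv.sh", "-Xms256m -Xmx1024m"),
    ("setOSBDomainEnv.sh", "-Xms256m -Xmx1024m"),
    ("setOERDomainEnv.sh", "-Xms256m -Xmx512m"),
]

env = run_env_scripts(scripts)
print(env["USER_MEM_ARGS"])
```

Whichever server is being started, it sees only the value left behind by setOERDomainEnv.sh, so the 1024m intended for the SOA and OSB servers is silently lost.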

To get around this, I have written a very small tweak / custom script inside setDomainEnv.cmd/sh. It reads the name of the server being started and sets the memory parameters accordingly.

 @REM **********START CUSTOM SCRIPT**************  
 @REM This script is needed to set the right memory parameters to the JVM based on the server being started.  
 if "%SERVER_NAME%" == "AdminServer" (  
  set USER_MEM_ARGS=-Xms256m -Xmx512m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 if "%SERVER_NAME%" == "soa_server1" (  
  set USER_MEM_ARGS=-Xms256m -Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 if "%SERVER_NAME%" == "oer_server1" (  
  set USER_MEM_ARGS=-Xms256m -Xmx512m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 if "%SERVER_NAME%" == "osb_server1" (  
  set USER_MEM_ARGS=-Xms256m -Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 if "%SERVER_NAME%" == "bam_server1" (  
  set USER_MEM_ARGS=-Xms256m -Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 if "%SERVER_NAME%" == "osr_server1" (  
  set USER_MEM_ARGS=-Xms256m -Xmx512m -XX:PermSize=256m -XX:MaxPermSize=768m  
 )  
 @REM Set the memory args in the same way for any other servers that participate in the same domain  
 @REM ********END CUSTOM SCRIPT*********************  

With this script you need not bother about setting the memory arguments in the individual scripts at all, as this overrides all of them.

Note : This script has to be placed before the lines below in setDomainEnv.cmd

if NOT "%USER_MEM_ARGS%"=="" (
set MEM_ARGS=%USER_MEM_ARGS%
)

Update : I came across this blog by Antony Reynolds explaining the same issue for which he had a similar solution.

Garbage Collection:

The other lever we can use to get higher throughput is garbage collection tuning.
GC algorithms are of two types:
a) Parallel / Serial and b) Concurrent.
Parallel GC stops all application threads and performs the full GC. This generally gives better throughput but also higher latency, since it uses all the CPU resources during GC. It is commonly called "stop the world" GC, because everything else stops while it runs. It is named parallel because multiple threads are allocated to GC in parallel. The only difference between this and serial GC is that serial GC has only a single thread allocated for GC.

Concurrent GC, on the other hand, gives low latency but also lower throughput, since it performs GC while the application executes (though not in all phases of GC).
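The throughput vs latency trade-off can be put into numbers with a back-of-the-envelope calculation. The sketch below is plain Python and the figures are invented purely for illustration. They are not measurements from any JVM, and gc_profile is a hypothetical helper.

```python
def gc_profile(run_seconds, pauses, concurrent_overhead=0.0):
    """Return (throughput, worst_pause) for one run.

    throughput is the fraction of wall-clock time left for application work
    after subtracting stop-the-world pauses and any CPU time the collector
    takes away while running alongside the application.
    """
    lost = sum(pauses) + concurrent_overhead
    throughput = (run_seconds - lost) / run_seconds
    return throughput, max(pauses)


# Made-up figures for a 60-second run.
# Parallel GC: a couple of long stop-the-world pauses, nothing in between.
parallel = gc_profile(60.0, pauses=[1.0, 1.0])
# Concurrent GC: many tiny pauses, plus CPU spent collecting in the background.
concurrent = gc_profile(60.0, pauses=[0.05] * 10, concurrent_overhead=3.0)

print("parallel   throughput=%.3f worst pause=%.2fs" % parallel)
print("concurrent throughput=%.3f worst pause=%.2fs" % concurrent)
```

With these assumed figures the parallel collector leaves more time for the application overall, but its worst pause is much longer, which is exactly the trade-off described above.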

The HotSpot JVM provides the following options for GC
-XX:+UseParallelGC
-XX:+UseSerialGC

The JRockit JVM provides some useful command-line parameters for garbage collection -

-XgcPrio:pausetime (To minimize latency, concurrent GC)
-XgcPrio:throughput (To maximize throughput, parallel GC)
-XgcPrio:deterministic (To guarantee maximum pause time, for real time systems)

Force Garbage collection:

I found a great, simple program on the net that forces garbage collection. Here is how to run it:
1. Go to the bin folder of your WebLogic home, e.g. Oracle\Middleware\wlserver_10.3\common\bin
2. Run the command wlst forceGC.py.
forceGC.py can be placed under the common\bin folder, with the contents below


 # forceGC.py  
 # WLST script which forces a garbage collection on a server's JVM.  
 from java.util import *  
 from javax.management import *  
 import javax.management.Attribute  
 print 'starting the script .... '  
 # Replace the userid and password with your AdminServer userid and password.  
 # Change the IP address and port number accordingly.  
 connect('weblogic','weblogic123',url='t3://localhost:7001')  
 state('AdminServer')  
 # For Force GC ....  
 domainRuntime()  
 cd('/ServerRuntimes/AdminServer/JVMRuntime/AdminServer')  
 print ' Performing Force GC...'  
 cmo.runGC()  
 disconnect()  
 print 'End of script ...'  
 exit()  


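Imp Note : If you want to run the force GC for a managed server, change state('AdminServer') to state('WLS_SOA1'), where WLS_SOA1 is the managed server name, and change the cd() path in the same way to '/ServerRuntimes/WLS_SOA1/JVMRuntime/WLS_SOA1'. Otherwise runGC() is still invoked on the AdminServer's JVM. Make sure that the Node Managers are accessible and running.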
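To avoid editing the server name in several places, the MBean path can be built from a single variable. Below is a small helper in plain Python (it uses nothing beyond what Jython offers, so it should drop into a WLST script as well). It assumes the JVMRuntime MBean is named after the server, as it is in the script above, and the function name jvm_runtime_path is my own.

```python
def jvm_runtime_path(server_name):
    """Build the domainRuntime path of a server's JVMRuntime MBean.

    Assumes the JVMRuntime MBean carries the same name as the server,
    which is the layout used in the forceGC.py script.
    """
    return "/ServerRuntimes/%s/JVMRuntime/%s" % (server_name, server_name)


print(jvm_runtime_path("AdminServer"))
print(jvm_runtime_path("WLS_SOA1"))
```

Inside forceGC.py the result would be passed to cd() after domainRuntime(), with the same server name given to state().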

WebLogic DMS Spy, a WAR deployed on the WebLogic admin server at the URL http://adminhost:port/dms/index.html, provides us with very useful metrics. One of them is the JVM_Memory set, which gives us the used vs free heap and non-heap memory, as shown below






When I ran forceGC.py, my used heap memory came down within a few minutes. This is pretty useful in development environments when your server is running very slowly because it is low on memory that has not yet been garbage collected.

Conclusion : Use JVM memory settings according to your application's needs. Understanding the basic JVM memory concepts helps you configure better and troubleshoot more meaningfully.