TAO Performance and Footprint Tuning

TAO Compile-time and Run-time Performance and Footprint Tuning

Overview

TAO is increasingly being used to support high-performance distributed real-time and embedded (DRE) applications. DRE applications constitute an important class of distributed systems where predictability and efficiency are essential for success. This document describes how to configure TAO to enhance its throughput, scalability, and latency for a variety of applications. We also explain various ways to speed up the compilation of ACE+TAO and applications that use ACE+TAO.

As with most applications, including compilers, enabling optimizations can often introduce side-effects that may not be desirable for all use-cases. TAO's default configuration therefore emphasizes programming simplicity rather than top speed or scalability. Our goal is to assure that CORBA applications work correctly ``out-of-the-box,'' while also enabling developers to further optimize their CORBA applications to meet stringent performance requirements.

TAO's performance tuning philosophy reflects the fact that there are trade-offs between speed, size, scalability, and programming simplicity. For example, certain ORB configurations work well for a large number of clients, whereas others work better for a small number. Likewise, certain configurations minimize internal ORB synchronization and memory allocation overhead by making assumptions about how applications are designed.

This document is organized as follows:

Optimizing Run-time Throughput
- Optimizing Client Throughput
- Optimizing Server Throughput
Optimizing Run-time Scalability
- Optimizing Client Scalability
- Optimizing Server Scalability
Reducing Compilation Time
- Compilation Optimization
- Compilation Inlining
Reducing Memory Footprint
- Compile-time Footprint
- Run-time Footprint

Optimizing Throughput

In this context, ``throughput'' refers to the number of events occurring per unit time, where ``events'' can refer to ORB-mediated operation invocations, for example. This section describes how to optimize client and server throughput.

It is important to understand that enabling throughput optimizations for the client may not affect the server performance and vice versa. In particular, the client and server ORBs may be designed by different ORB suppliers.

Optimizing Client Throughput

Client ORB throughput optimizations improve the rate at which CORBA requests (operation invocations) are sent to the target server. Depending on the application, various techniques can be employed to improve the rate at which CORBA requests are sent and/or the amount of work the client can perform as requests are sent or replies received. These techniques consist of:

Run-time features offered by the ORB, such as Asynchronous Method Invocations (AMI)
ORB configurations, such as disabling synchronization of various parts of the ORB in a single-threaded application

We explore these techniques below.

Run-time Client Optimizations

For two-way invocations, i.e., those that expect a reply (including ``void'' replies), Asynchronous Method Invocations (AMI) can be used to give the client the opportunity to perform other work while a CORBA request is sent to the target, handled by the target, and the reply is received.

Client Optimizations via ORB Configuration

A TAO client ORB can be optimized for various types of applications:

Single-Threaded
- Other options include disabling synchronization in the components of TAO responsible for constructing and sending requests to the target and for receiving replies. These components are called ``connection handlers.'' To disable synchronization in the client connection handlers, simply add:
  
  -ORBClientConnectionHandler ST
  
  to the list of Client_Strategy_Factory options. Other values for this option, such as RW, are more appropriate for "pure" synchronous clients. See the -ORBClientConnectionHandler option documentation for details.
Low Client Scalability Requirements
- Clients with lower scalability requirements can dedicate a connection to one request at a time, which means that no other requests or replies will be sent or received, respectively, over that connection while a request is pending. The connection is exclusive to a given request, thus reducing contention on a connection. However, that exclusivity comes at the cost of a smaller number of requests that may be issued at a given point in time. To enable this behavior, add the following option to the Client_Strategy_Factory line of your svc.conf file:
  
  -ORBTransportMuxStrategy EXCLUSIVE
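
As a rough sketch of where these options go, a client svc.conf file might look like the following. It assumes the usual Service Configurator `static` directive form for the Client_Strategy_Factory. Whether to combine the two options or use only one depends on the application, so the second directive is shown commented out as an alternative.

```
# Client svc.conf (sketch): single-threaded client, so synchronization
# in the client connection handlers is disabled.
static Client_Strategy_Factory "-ORBClientConnectionHandler ST"

# Alternative (assumption: used instead of the line above) for a client
# with low scalability requirements: one pending request per connection.
# static Client_Strategy_Factory "-ORBTransportMuxStrategy EXCLUSIVE"
```

If the file is not named svc.conf in the working directory, its path can typically be passed to the ORB with the -ORBSvcConf command-line option.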

Optimizing Server Throughput

Throughput on the server side can be improved by configuring TAO to use a thread-per-connection concurrency model. With this concurrency model, a single thread is assigned to service each connection. That same thread is used to dispatch the request to the appropriate servant, meaning that thread context switching is kept to a minimum. To enable this concurrency model in TAO, add the following option to the Server_Strategy_Factory entry in your svc.conf file:

-ORBConcurrency thread-per-connection
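
For illustration, the complete Server_Strategy_Factory entry might be written as shown below. This is a minimal sketch that assumes the standard `static` directive form and no other server strategy options.

```
# Server svc.conf (sketch): dedicate one thread to each connection.
static Server_Strategy_Factory "-ORBConcurrency thread-per-connection"
```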

While the thread-per-connection concurrency model may improve throughput, it generally does not scale well due to limitations of the platform the application is running on. In particular, most operating systems cannot efficiently handle more than 100 or 200 threads running concurrently. Hence performance often degrades sharply as the number of connections increases beyond those numbers.

Other concurrency models are further discussed in the Optimizing Server Scalability section below.

Optimizing Scalability

In this context, ``scalability'' refers to how well an ORB performs as the number of CORBA requests increases. For example, a non-scalable configuration will perform poorly as the number of pending CORBA requests on the client increases from 10 to 1,000, and similarly on the server. ORB scalability is particularly important on the server since it must often handle many requests from multiple clients.

Optimizing Client Scalability

In order to optimize TAO for scalability on the client side, connection multiplexing must be enabled. Specifically, multiple requests may be issued and pending over the same connection. Sharing a connection in this manner reduces the amount of resources required by the ORB, which in turn makes more resources available to the application. To enable this behavior use the following Client_Strategy_Factory option:

-ORBTransportMuxStrategy MUXED

This is the default setting used by TAO.

Optimizing Server Scalability

Scalability on the server side depends greatly on the concurrency model in use. TAO supports two concurrency models:

Reactive, and
Thread-per-connection

The thread-per-connection concurrency model is described above in the Optimizing Server Throughput section.

A reactive concurrency model employs the Reactor design pattern to demultiplex incoming CORBA requests. The underlying event demultiplexing mechanism is typically one of the mechanisms provided by the operating system, such as the select(2) system call. To enable this concurrency model, add the following option to the Server_Strategy_Factory entry in your svc.conf file:

-ORBConcurrency reactive

This is the default setting used by TAO.

The reactive concurrency model provides improved scalability on the server side because fewer resources are used, which in turn allows a very large number of requests to be handled by the server-side ORB. This concurrency model provides much better scalability than the thread-per-connection model described above.

Further scalability tuning can be achieved by choosing a Reactor appropriate for your application. For example, if your application is single-threaded then a reactor optimized for single-threaded use may be appropriate. To select a single-threaded select(2) based reactor, add the following option to the Advanced_Resource_Factory entry in your svc.conf file:

-ORBReactorType select_st

If your application uses thread pools, then the thread pool reactor may be a better choice. To use it, add the following option instead:

-ORBReactorType tp_reactor

This is TAO's default reactor. See the -ORBReactorType documentation for other reactor choices.

Note that you may have to link the TAO_Strategies library into your application in order to take advantage of the Advanced_Resource_Factory features, such as alternate reactor choices.
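
Putting the server-side settings together, a svc.conf for a single-threaded reactive server could be sketched as follows. The `dynamic` directive is an assumption about how the Advanced_Resource_Factory is loaded from the TAO_Strategies library at run time. An application that links TAO_Strategies statically would use a `static` directive instead.

```
# Server svc.conf (sketch): reactive concurrency with a single-threaded
# select(2)-based reactor.

# Load the Advanced_Resource_Factory from the TAO_Strategies library
# (assumed dynamic loading; adjust if the library is linked statically).
dynamic Advanced_Resource_Factory Service_Object * TAO_Strategies:_make_TAO_Advanced_Resource_Factory () "-ORBReactorType select_st"

# Reactive concurrency is TAO's default; it is stated explicitly here.
static Server_Strategy_Factory "-ORBConcurrency reactive"
```

For a server that uses thread pools, replacing select_st with tp_reactor in the first directive selects the thread pool reactor instead.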

A third concurrency model, unsupported by TAO, is thread-per-request. In this case, a single thread is used to service each request as it arrives. This concurrency model generally provides neither scalability nor speed, which is the reason why it is often not used in practice.

Reducing Compilation Time

Compilation Optimization

When developing software that uses ACE+TAO you can reduce the time it takes to compile your software by not enabling your compiler's optimizer flags. These often take the form -O<n>.

Disabling optimization for your application will come at the cost of run time performance, so you should normally only do this during development, keeping your test and release build optimized.

Compilation Inlining

When compiler optimization is disabled, it is frequently the case that no inlining will be performed. In this case the ACE inlining adds to your compile time without any appreciable benefit. You can therefore decrease compile times further by building your application with the -DACE_NO_INLINE C++ flag.

In order for code built with -DACE_NO_INLINE to link, you will need to be using a version of ACE+TAO built with the "inline=0" make flag.

To accommodate both inline and non-inline builds of your application it will be necessary to build two copies of your ACE+TAO libraries, one with inlining and one without. You can then use your ACE_ROOT and TAO_ROOT variables to point at the appropriate installation.

Reducing Memory Footprint

Compile-time Footprint

Using -xO3 together with -xspace with the Sun CC 5.x compiler has been observed to reduce footprint substantially, on the order of 40%.

Footprint can also be reduced by specifying the following in your platform_macros.GNU file:

optimize=1
debug=0
CPPFLAGS += -DACE_NLOGGING=1

If portable interceptors aren't needed, code around line 729 in $TAO_ROOT/tao/orbconf.h can be modified to hard-code TAO_HAS_INTERCEPTORS as 0, and all interceptor code will be skipped by the preprocessor.

**IDL compiler options to reduce compile-time footprint**
Command-Line Option	Description and Usage
`-Sc`	Suppresses generation of the TIE classes (template classes used to delegate request dispatching when IDL interface inheritance would cause a 'ladder' of inheritance if the servant class had corresponding inheritance). This option can be used almost all the time.
`-Sa`	Suppresses generation of Any insertion/extraction operators. If the application IDL contains no Anys, and the application itself doesn't use them, this can be a useful option.
`-St`	Suppresses type code generation. Since Anys depend on type codes, this option will also suppress the generation of Any operators. Usage requires the same conditions as for the suppression of Any operators, plus no type codes in application IDL and no application usage of generated type codes.
`-GA`	Generates type code and Any operator definitions into a separate file with a 'A' suffix just before the `.cpp` extension. This is a little more flexible and transparent than using `-Sa` or `-St` if you are compiling to DLLs or shared objects, since the code in this file won't get linked in unless it's used.
`-Sp`	Suppresses the generation of extra classes used for thru-POA collocation optimization. If the application has no collocated client/server pairs, or if the performance gain from collocation optimization is not important, this option can be used.
`-H dynamic_hash` `-H binary_search` `-H linear_search`	Generates alternatives to the default code generated on the skeleton side for operation dispatching (which uses perfect hashing). These options each give a small amount of footprint reduction, each amount slightly different, with a corresponding tradeoff in speed of operation dispatch.

Run-time Footprint

**Control size of internal data structures**
Define	Default	Minimum	Maximum	Description
TAO_DEFAULT_ORB_TABLE_SIZE	16	1	-	The size of the internal table that stores all ORB Cores.

More information on reducing the memory footprint of TAO is available here.

Ossama Othman

Last modified: Thu Jul 14 16:36:12 CDT 2005