0001 Enduro/X Administration Manual
0002 ==============================
0003
0004 == How to configure Enduro/X
0005
To configure Enduro/X you have to complete the following steps:
0007
0008 . Have a separate system user for each Enduro/X instance.
0009
0010 . Setup System Environment (mount mq file system, configure mq params)
0011
0012 . Setup environment configuration
0013
0014 . Setup basic environment (demo)
0015
0016 . Startup the application
0017
0018 == Setup System
0019
Enduro/X supports different back-end message transports. The following mechanisms
are available:
0022
- EPOLL (for FreeBSD it is Kqueue) over the Posix queues. This is the fastest and
most preferred transport mechanism when available. A true one-queue/multiple-servers
mechanism is supported for different services across different XATMI
server binaries (transport code "epoll" for GNU/Linux and "kqueue" for FreeBSD).
In case of EPOLL on Linux kernels starting from version 4.5, Enduro/X uses the *EPOLLEXCLUSIVE*
flag, so that if multiple processes wait on the same XATMI service queue
for a service call, only one process is woken up. For FreeBSD there is no
such flag, thus in case of load-balanced processes there might be extra
wakeups (wasted CPU cycles) if multiple idle processes wait/advertise on
the same service call. The wasted CPU cycles are small (the process wakes up, tries
to receive a message, gets an *EAGAIN* error and goes back to polling).
The resource waste might become noticeable if the service TPS load is very high,
the actual work time of the service is short, and the service is load balanced
across a very large number of processes (for example 250).
0037
- SVAPOLL - This mode is almost equivalent to EPOLL mode, except that System V
queues are used. This mode is suitable only for IBM AIX, whose poll() system
call accepts System V queue IDs for polling. This mode adds two shared memory
mapping tables: one maps the System V msgid to the queue string name, and the
other maps the queue string name back to the msgid. Queue string names are used
internally by Enduro/X. In this mode, operations and configuration
at the Enduro/X level are the same as in epoll mode. Each binary
may advertise an arbitrary number of services, where each service is a shared queue
that may be shared with several other processes advertising that service.
Prior to IBM AIX 7.3, poll() had no equivalent of the Linux *EPOLLEXCLUSIVE* flag,
so IBM AIX was prone to CPU wasting (the thundering herd problem) at specific
workloads, as in FreeBSD (described above). Starting from AIX 7.3, IBM
has implemented the *POLLEXCL* flag for the poll() system call. With this
new flag, Enduro/X on IBM AIX works with the same performance benefits as on Linux (having
a true one-queue/multiple-servers topology for all XATMI services).
*Additional administrative notice:* Enduro/X requires that the *POLLEXCL_POLICY* environment
variable is set to *ONE*. Global shell settings are not needed
(unless for some special reason, e.g. a server is started outside *ndrxd*).
If for some reason another value needs to be used for the XATMI server process,
it can be set at the *<envs>/<env>* tag in *ndrxconfig.xml(5)*. But the recommended
way is to use *ONE* (or *ONE:RR*, as *RR* is the default). The poll() *POLLEXCL* usage
can be disabled for a particular process by setting *NDRX_NOPOLLEXCL*
to *Y* at the *<envs>/<env>* tag (though this is not recommended,
as it will introduce the thundering herd issue if several binaries of the same type are started).
*NOTE:* due to a bug in AIX 7.3
(https://www.ibm.com/support/pages/apar/IJ37827), *svapoll* mode can be used only starting from
AIX version 7.3 TL0, SP2 (where the bug is fixed).
0065
- System V message queues, generally the second best transport available
on Linux and Unix operating systems. The one-queue/multiple-servers mechanism is
available via the Request Address option (rqaddr) for XATMI servers. The limitation is that
each server running on the same request address *MUST* provide all services provided
by the other servers on that Request Address. This mechanism uses at least one
additional thread per XATMI participant for handling message send/receive time-outs.
In case of an XATMI server, one more thread is used for administrative message
handling and dispatching to the main server thread. Thus the compiler must support
multi-threaded operations for user applications (transport code "SystemV").
System V Enduro/X builds can be used as an alternative to the *kqueue* or *svapoll* modes,
to avoid the CPU wasting associated with extra wakeups of idle load-balanced
servers of the same XATMI service.
0078
- The third option is POLL over the Posix queues. This uses a round-robin approach
for message delivery to load-balanced servers. One additional thread per server
process is used to monitor the queues (transport code "poll").
0082
- The fourth option is an emulated message queue, which uses shared memory and
process-shared Posix locks to emulate the message queue (transport code "emq").
0085
0086 .Enduro/X IPC transport sub-systems
0087 [width="80%",cols="^2,^2,^2,^2,^2,^2, ^2",options="header"]
0088 |=========================================================
0089 |Operating System/IPC Transport|epoll |kqueue |systemv |poll |emq |svapoll
0090 |GNU/Linux|R |X |R |S |S |X
0091 |FreeBSD|X |R |S |S |S |X
0092 |IBM AIX|X |X |S |S |S |R
0093 |Oracle Solaris|X |X |R |S |S |X
0094 |MacOS|X |X |X |X |R |X
0095 |=========================================================
0096
0097 Legend:
0098
0099 'S' - supported.
0100
0101 'R' - supported and release provided.
0102
0103 'X' - not supported.
0104
Each of these IPC transports requires an operating-system-specific approach to
configuring the limits and other runtime attributes.
0107
Note that the transport is built into the Enduro/X distribution. Thus to change
the IPC transport, a different Enduro/X build must be installed (i.e. the transport
cannot be changed by parameters). As the ABI for user apps stays the same, the user
application does not need to be rebuilt.
0112
0113 === Release file format
0114
The release file name for Enduro/X encodes various pieces of information. For
example, the file names
0117
0118 . endurox-5.4.1-1.ubuntu18_04_GNU_epoll.x86_64_64.deb
0119
0120 . endurox-5.4.1-1.SUNOS5_10_GNU_SystemV.sparc_64
0121
encode the following information:
0123
0124 .Enduro/X distribution file name naming conventions
0125 [width="80%", options="header"]
0126 |=========================================================
0127 |Product name|Version|Release|OS Name|C Compiler ID|OS Version|IPC Transport|CPU Arch|Target
0128 |endurox |5.4.1 |1 |Ubuntu | 18.04| GNU GCC| EPOLL |x86_64 | 64 bit mode
0129 |endurox |5.4.1 |1 |SUNOS - Solaris | 5.10 (10)|GNU GCC |System V queues |SPARC | 64 bit mode
0130 |=========================================================
0131
0132 === Linux setup
0133
This chapter describes the typical GNU/Linux system configuration required
for Enduro/X. Two sets of configuration are available for Linux OS: one for
Posix queues with epoll, and another for System V queues.

Kernel parameter configuration is needed for the Enduro/X runtime. The values
given here are also sufficient to build the system and run the unit tests.
0140
0141 ==== Increase OS limits
0142
0143 ---------------------------------------------------------------------
0144 $ sudo -s
0145 # cat << EOF >> /etc/security/limits.conf
0146
# Do not limit message queue count.
# Some Linux 3.x series kernels have a bug that limits one
# system user to 1024 queues; in 2.6.x and 4.x kernels the
# count is unlimited (bounded only by memory).
0153 * soft msgqueue -1
0154 * hard msgqueue -1
0155
0156 # Increase the number of open files
0157 * soft nofile 1024
0158 * hard nofile 65536
0159
0160 EOF
0161 # exit
0162 $
0163 ---------------------------------------------------------------------
0164
0165 ==== Linux system setup for running in EPOLL/Posix queue mode
0166
This step requires mounting the Posix queue filesystem and changing the Posix queue limits.
0168
0169 ===== Mounting Posix queues
0170
This step does not apply to the following operating systems - for these, continue
with the next chapter:
0173
0174 . Ubuntu 16.04 and above
0175
0176 . Debian 8.x and above
0177
When running in epoll mode, Enduro/X needs access to the virtual file system which
provides Posix queue management. One way would be to mount it via "/etc/fstab",
but for older system compatibility we provide instructions that work on
all OSes. To mount the queues automatically at system startup on Linuxes which
support '/etc/rc.local', add the following lines before "exit 0":
0183
0184 ---------------------------------------------------------------------
0185 #!/bin/bash
0186
0187 # Mount the /dev/mqueue
0188 # Not for Debian 8.x: queue is already mounted, thus test:
0189
0190 if [ ! -d /dev/mqueue ]; then
0191 mkdir /dev/mqueue
0192 mount -t mqueue none /dev/mqueue
0193 fi
0194
0195 exit 0
0196 ---------------------------------------------------------------------
0197
Note that on CentOS/RHEL/Oracle Linux 7+ you need to give execute
permissions to rc.local:
0200
0201 ---------------------------------------------------------------------
0202 # chmod +x /etc/rc.local
0203 ---------------------------------------------------------------------
0204
0205 Load the configuration by doing:
0206
0207 ---------------------------------------------------------------------
0208 # /etc/rc.local
0209 ---------------------------------------------------------------------
0210
0211 ===== Setting Posix queue limits
0212
The next step is to configure the queue limits. This is done by changing Linux kernel
parameters in a persistent way, so that the new settings are applied at OS boot.
0215
0216 ---------------------------------------------------------------------
0217 $ sudo -s
0218
0219 # cat << EOF >> /etc/sysctl.conf
0220
0221 # Max Messages in Queue
0222 fs.mqueue.msg_max=10000
0223
0224 # Max message size, to pass unit tests, use 1M+1K
0225 fs.mqueue.msgsize_max=1049600
0226
0227 # Max number of queues system-wide
0228 fs.mqueue.queues_max=10000
0229
0230 EOF
0231
0232 # Apply kernel parameters now
0233 $ sudo sysctl -f /etc/sysctl.conf
0234
0235 # to check the values, use (print all) and use grep to find:
0236 $ sudo sysctl -a | grep msgsize_max
0237 ---------------------------------------------------------------------
0238
0239 ==== Setting System V queue limits
0240
To pass the Enduro/X unit tests, a certain queue configuration is required. Use the
following kernel settings:
0243
0244 ---------------------------------------------------------------------
0245
0246 $ sudo -s
0247
0248 # cat << EOF >> /etc/sysctl.conf
0249
0250 # max queues system wide, 32K should be fine
0251 # If more is required, then for some Linux distributions such as Ubuntu 20.04
0252 # kernel boot parameter ipcmni_extend shall be set.
0253 kernel.msgmni=32768
0254
0255 # max size of message (bytes), ~1M should be fine
0256 kernel.msgmax=1049600
0257
0258 # default max size of queue (bytes), ~10M should be fine
0259 kernel.msgmnb=104960000
0260
0261 EOF
0262
# apply the values now
0264 $ sudo sysctl -f /etc/sysctl.conf
0265
0266 # Check status...
0267 $ sudo sysctl -a | grep msgmnb
0268 ---------------------------------------------------------------------
0269
0270 === FreeBSD setup
0271
For FreeBSD the only officially supported transport is Posix queues, thus this
operating system requires some settings for these IPC resources to pass the unit
testing; the settings are also generally fine for an average application.
0275
0276 ==== Configuring the system
The queue file system must be mounted when the OS starts. Firstly we need a folder
'/mnt/mqueue' where the queues are mounted, and secondly we add an automatic
mount at system startup in '/etc/fstab'.
0280
0281 ---------------------------------------------------------------------
0282 # mkdir /mnt/mqueue
0283 # cat << EOF >> /etc/fstab
0284 null /mnt/mqueue mqueuefs rw 0 0
0285 EOF
0286 # mount /mnt/mqueue
0287 ---------------------------------------------------------------------
0288
0289 You also need to change the queue parameters:
0290
0291 ---------------------------------------------------------------------
0292 # cat << EOF >> /etc/sysctl.conf
0293
0294 # kernel tunables for Enduro/X:
0295 kern.mqueue.curmq=1
0296 kern.mqueue.maxmq=30000
0297 kern.mqueue.maxmsgsize=64000
0298 kern.mqueue.maxmsg=1000
0299
0300 EOF
0301
0302 # sysctl -f /etc/sysctl.conf
0303 ---------------------------------------------------------------------
0304
For LMDB testing, more semaphores must be allowed:
0306
0307 ---------------------------------------------------------------------
0308 # cat << EOF >> /boot/loader.conf
0309
0310 # kernel tunables for Enduro/X:
0311 kern.ipc.semmns=2048
0312 kern.ipc.semmni=500
0313
0314 EOF
0315
0316 ---------------------------------------------------------------------
0317
0318 After changing /boot/loader.conf, reboot of system is required.
0319
The Enduro/X testing framework uses '/bin/bash' in scripting, thus we must
make it available. Also perl is assumed to be at '/usr/bin/perl'. Thus:
0322 ---------------------------------------------------------------------
0323 # ln -s /usr/local/bin/bash /bin/bash
0324 # ln -s /usr/local/bin/perl /usr/bin/perl
0325 ---------------------------------------------------------------------
0326
*reboot* to apply the new settings (limits & mqueue mount).
0328
0329 === AIX setup
0330
AIX, on the other hand, does not require any fine tuning for System V queues,
because it automatically adjusts the queue limits. However, to pass the
Enduro/X standard unit tests, the security limits must be configured. The unit tests
use the standard user "user1" for this purpose. Here the stack, data memory size,
file size and rss sizes are set to unlimited. If stack/data/rss is
not set correctly, some multi-threaded components of Enduro/X might hang during
startup, for example *tpbridge(8)*.
0338
0339 --------------------------------------------------------------------------------
0340 $ su - root
0341
0342 # cat << EOF >> /etc/security/limits
0343
0344 user1:
0345 stack = 655360
0346 data = -1
0347 rss = -1
0348 fsize = -1
0349 EOF
0350 --------------------------------------------------------------------------------
0351
If the following error is encountered at runtime:
0353
0354 --------------------------------------------------------------------------------
0355
0356 fork: retry: Resource temporarily unavailable
0357
0358 --------------------------------------------------------------------------------
0359
Check the number of processes allowed per user:
0361
0362 --------------------------------------------------------------------------------
0363
0364 $ su - root
0365
0366 # /usr/sbin/lsattr -E -l sys0 | grep maxuproc
0367 maxuproc 40 Maximum number of PROCESSES allowed per user True
0368
0369 --------------------------------------------------------------------------------
0370
Update it to *2000*:
0372
0373 --------------------------------------------------------------------------------
0374
0375 # /usr/sbin/chdev -l sys0 -a maxuproc=2000
0376
0377 --------------------------------------------------------------------------------
0378
0379 === Solaris setup
0380
0381 To pass the Enduro/X unit tests on Solaris, System V queue settings must be applied.
0382
0383 ---------------------------------------------------------------------
0384 # cat << EOF >> /etc/system
0385 set msgsys:msginfo_msgmni = 10000
0386 set msgsys:msginfo_msgmnb = 10496000
0387
0388 EOF
0389 ---------------------------------------------------------------------
0390
Here 'msgmni' is the maximum number of queues that can be created and 'msgmnb'
is the maximum size of a single queue, which here is ~10MB.
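As a quick sanity check of the sizes above, the 'msgmnb' value can be converted to mebibytes with integer shell arithmetic:

```shell
# 10496000 bytes is ~10 MiB (integer division truncates the remainder).
msgmnb=10496000
echo "msgmnb ~ $(( msgmnb / 1024 / 1024 )) MiB"
```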
0393
0394 After changing the settings, reboot the server.
0395
0396
0397 === MacOS setup
0398
macOS does not require any kernel parameter changes for the queues, as an
emulated message queue is used here. It is only required that sufficient disk
space is available in the '/tmp' directory, as the memory-mapped queue files are
stored there.

As Enduro/X uses System V shared memory segments, the default sizes are not
sufficient, at least for the Enduro/X unit testing. Thus the limits need to be
changed:
0406
0407
Starting from OS X 10.3.9 the config file is */etc/sysctl.conf*; in older versions
use */boot/loader.conf*:
0410
0411 --------------------------------------------------------------------------------
0412 $ sudo -s
0413 # cat << EOF >> /etc/sysctl.conf
0414 kern.sysv.shmmax=838860800
0415 kern.sysv.shmmin=1
0416 kern.sysv.shmmni=10000
0417 kern.sysv.shmseg=50
0418 kern.sysv.shmall=204800
0419 kern.maxfiles=524288
0420 kern.maxfilesperproc=262144
0421
0422 EOF
0423 --------------------------------------------------------------------------------
0424
Starting from macOS version 10.15 (Catalina), */boot/loader.conf* no longer
works and a plist file needs to be installed. Prior to creating this file, it may
be necessary to disable SIP and remount the root / file system as read-write. But
first try to set up this file directly as the root user, and only if that does not
work, change the SIP mode and perform the fs-remount.
0430
0431 --------------------------------------------------------------------------------
0432
0433 $ sudo -s
0434
0435 # bash
0436
0437 # cd /Library/LaunchDaemons
0438
0439 # cat << EOF >> endurox.plist
0440 <?xml version="1.0" encoding="UTF-8"?>
0441 <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
0442 <plist version="1.0">
0443 <dict>
0444 <key>Label</key>
0445 <string>shmemsetup</string>
0446 <key>UserName</key>
0447 <string>root</string>
0448 <key>GroupName</key>
0449 <string>wheel</string>
0450 <key>ProgramArguments</key>
0451 <array>
0452 <string>/usr/sbin/sysctl</string>
0453 <string>-w</string>
0454 <string>kern.sysv.shmmax=838860800</string>
0455 <string>kern.sysv.shmmin=1</string>
0456 <string>kern.sysv.shmmni=10000</string>
0457 <string>kern.sysv.shmseg=50</string>
0458 <string>kern.sysv.shmall=204800</string>
0459 <string>kern.maxfiles=524288</string>
0460 <string>kern.maxfilesperproc=262144</string>
0461 </array>
0462 <key>KeepAlive</key>
0463 <false/>
0464 <key>RunAtLoad</key>
0465 <true/>
0466 </dict>
0467 </plist>
0468 EOF
0469
0470 --------------------------------------------------------------------------------
0471
After the above settings, a reboot is required.
0473
0474 == Setup environment configuration
0475
Enduro/X depends on a lot of environment variables. See the manpage of 'ex_env'
(<<EX_ENV>>) for all parameters that must be set up. A sample configuration is also
provided. Normally it is expected that a separate shell script file is set up
containing all parameters. Then, to load the environment, log in with the Enduro/X
user and run the following commands in your app dir, for example:
0481
0482 --------------------------------------------------------------------------------
0483 $ cd /endurox/app/conf
0484 $ . setapp
0485 --------------------------------------------------------------------------------
0486
0487 == Setting up Enduro/X demonstration environment
0488
This section describes how to create a basic Enduro/X environment. The document
will also explain the resources used by Enduro/X from the system setup and
administrative perspective. The section also explains the contents of each
generated file, so that the runtime can be manually reconstructed, which is
useful for the AIX operating system, where the "xadmin provision" command is not
available.
0495
0496 === Creating default runtime and starting it up
0497
To create a generic runtime with Enduro/X "stock" server processes, use the
following command:
0500
0501 --------------------------------------------------------------------------------
0502 $ xadmin provision -d
0503 To control debug output, set debugconfig file path in $NDRX_DEBUG_CONF
0504 N:NDRX:4:00000000: 0:7fc81a75c900:000:20181110:113655631:plugins_load:inbase.c:0180:No plugins defined by NDRX_PLUGINS env variable
0505 N:NDRX:5:00000000: 0:7fc81a75c900:000:20181110:113655631:cconfig_load:config.c:0429:CC tag set to: []
0506 N:NDRX:5:00000000: 0:7fc81a75c900:000:20181110:113655631:x_inicfg_new:inicfg.c:0114:_ndrx_inicfg_new: load_global_env: 1
0507 N:NDRX:5:00000000: 0:7fc81a75c900:000:20181110:113655631:ig_load_pass:config.c:0396:_ndrx_cconfig_load_pass: ret: 0 is_internal: 1 G_tried_to_load: 1
0508 N:NDRX:5:d5d3db3a: 8685:7fc81a75c900:000:20181110:113655632:x_inicfg_new:inicfg.c:0114:_ndrx_inicfg_new: load_global_env: 0
0509 Enduro/X 5.4.1, build Nov 7 2018 08:48:27, using SystemV for LINUX (64 bits)
0510
0511 Enduro/X Middleware Platform for Distributed Transaction Processing
0512 Copyright (C) 2009-2016 ATR Baltic Ltd.
0513 Copyright (C) 2017,2018 Mavimax Ltd. All Rights Reserved.
0514
0515 This software is released under one of the following licenses:
0516 AGPLv3 or Mavimax license for commercial use.
0517
0518 Logging to ./ULOG.20181110
0519
0520 ______ __ ___ __
0521 / ____/___ ____/ /_ ___________ _/_/ |/ /
0522 / __/ / __ \/ __ / / / / ___/ __ \ _/_/ | /
0523 / /___/ / / / /_/ / /_/ / / / /_/ //_/ / |
0524 /_____/_/ /_/\__,_/\__,_/_/ \____/_/ /_/|_|
0525
0526 Provision
0527
0528 Compiled system type....: LINUX
0529
0530
0531 *** Review & edit configuration ***
0532
0533 0: Edit qpath :Queue device path [/dev/mqueue]:
0534 1: Edit nodeid :Cluster node id [2]:
0535 2: Edit qprefix :System code (prefix/setfile name, etc) [test1]:
0536 3: Edit timeout :System wide tpcall() timeout, seconds [90]:
0537 4: Edit appHome :Application home [/tmp/demo]:
0538 6: Edit binDir :Executables/binaries sub-folder of Apphome [bin]:
0539 8: Edit confDir :Configuration sub-folder of Apphome [conf]:
0540 9: Edit logDir :Log sub-folder of Apphome [log]:
0541 10: Edit ubfDir :Unified Buffer Format (UBF) field defs sub-folder of Apphome [ubftab]:
0542 11: Edit tempDir :Temp sub-dir (used for pid file) [tmp]:
0543 12: Edit installQ :Configure persistent queue [y]:
0544 13: Edit tmDir :Transaction Manager Logs sub-folder of Apphome [tmlogs]:
0545 14: Edit qdata :Queue data sub-folder of Apphone [qdata]:
0546 15: Edit qSpace :Persistent queue space namme [SAMPLESPACE]:
0547 16: Edit qName :Sample persistent queue name [TESTQ1]:
0548 17: Edit qSvc :Target service for automatic queue for sample Q [TESTSVC1]:
0549 18: Edit eventSv :Install event server [y]:
0550 19: Edit cpmSv :Configure Client Process Monitor Server [y]:
0551 20: Edit configSv :Install Configuration server [y]:
0552 21: Edit bridge :Install bridge connection [y]:
0553 22: Edit bridgeRole :Bridge -> Role: Active(a) or passive(p)? [a]:
0554 24: Edit ipc :Bridge -> IP: Connect to [172.0.0.1]:
0555 25: Edit port :Bridge -> IP: Port number [21003]:
0556 26: Edit otherNodeId :Other cluster node id [2]:
0557 27: Edit ipckey :IPC Key used for System V semaphores [44000]:
0558 28: Edit ldbal :Load balance over cluster [0]:
0559 29: Edit ndrxlev :Logging: ATMI sub-system log level 5 - highest (debug), 0 - minimum (off) [5]:
0560 30: Edit ubflev :Logging: UBF sub-system log level 5 - highest (debug), 0 - minimum (off) [1]:
0561 31: Edit tplev :Logging: /user sub-system log level 5 - highest (debug), 0 - minimum (off) [5]:
0562 32: Edit usv1 :Configure User server #1 [n]:
0563 50: Edit ucl1 :Configure User client #1 [n]:
0564 55: Edit addubf :Additional UBFTAB files (comma seperated), can be empty []:
0565 56: Edit msgsizemax :Max IPC message size [56000]:
0566 57: Edit msgmax :Max IPC messages in queue [100]:
0567 ndrxconfig: [/tmp/demo/conf/ndrxconfig.xml]
0568 appini: [/tmp/demo/conf/app.ini]
0569 setfile: [/tmp/demo/conf/settest1]
0570
0571
0572 To start your system, run following commands:
0573 $ cd /tmp/demo/conf
0574 $ source settest1
0575 $ xadmin start -y
0576
0577
0578 Provision succeed!
0579 --------------------------------------------------------------------------------
0580
During provisioning, the following directory structure was created at the project
root, "/tmp/demo", where the following data is intended to be stored:
0583
.Enduro/X application directory structure
0585 [width="40%",options="header"]
0586 |=========================================================
|Directory|Files stored
0588 |ubftab|UBF field tables
0589 |tmlogs/rm1|transaction manager logs, sub-folder for resource manager 1
0590 |conf|configuration files
0591 |bin|program binaries (executables)
0592 |qdata|persistent queue data
0593 |tmp|temporary files like pid file, etc.
0594 |log|Enduro/X and user log files
0595 |=========================================================
0596
If the demo needs to be started on AIX, these folders need to be created by
hand.
0599
The most interesting artifacts at this step are the configuration files. The
provisioning generates the following files in the "conf" folder:
0602
0603 .Enduro/X typical application configuration files
0604 [width="40%", options="header"]
0605 |=========================================================
|File|Contents
0607 |app.ini|Application configuration
0608 |ndrxconfig.xml|Application server process configuration
0609 |settest1|Bash script for setting the Enduro/X environment
0610 |=========================================================
0611
The next chapters describe the contents of each configuration file.
0613
0614 ==== Configuration file: "app.ini" for Common-Configuration (CC) mode
0615
This file contains global settings (which alternatively can be set as environment
variables, see ex_env(5)) in the *[@global]* section. *app.ini* also contains debug
configuration in the *[@debug]* section (which alternatively can be configured in
a separate file, see ndrxdebug.conf(5)). The ini file is also used by other
Enduro/X services such as persistent queues, defined in *[@queue]*. The ini files
allow sections to inherit settings from parent sections. The sub-sections
can be configured at process level with the *NDRX_CCTAG* env variable, or this
can be done in *ndrxconfig.xml* at the *<cctag />* XML tag for XATMI servers and
the *cctag* attribute for CPMSRV clients.
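As a contrived illustration of the inheritance (the section and variable placement here is hypothetical, not from the demo), a sub-section sees every setting of its parent plus whatever it declares or overrides itself:

```ini
[@global]
; seen by every process
NDRX_TOUT=90

[@global/MYTAG]
; processes running with CCTAG "MYTAG" see NDRX_TOUT=90 from the
; parent section, plus the resource manager setting below
NDRX_XA_RES_ID=1
```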
0625
0626 The demo *app.ini* section looks like:
0627
0628 --------------------------------------------------------------------------------
0629 [@global]
0630 NDRX_CLUSTERISED=1
0631 NDRX_CMDWAIT=1
0632 NDRX_CONFIG=${NDRX_APPHOME}/conf/ndrxconfig.xml
0633 NDRX_DMNLOG=${NDRX_ULOG}/ndrxd.log
0634 NDRX_DPID=${NDRX_APPHOME}/tmp/ndrxd.pid
0635 NDRX_DQMAX=100
0636 NDRX_IPCKEY=44000
0637 NDRX_LDBAL=0
0638 NDRX_LEV=5
0639 NDRX_LOG=${NDRX_ULOG}/xadmin.log
0640 NDRX_MSGMAX=100
0641 NDRX_MSGSIZEMAX=56000
0642 NDRX_NODEID=2
0643 NDRX_QPATH=/dev/mqueue
0644 NDRX_QPREFIX=/test1
0645 NDRX_RNDK=0myWI5nu
0646 NDRX_SRVMAX=10000
0647 NDRX_SVCMAX=20000
0648 NDRX_TOUT=90
0649 NDRX_UBFMAXFLDS=16000
0650 NDRX_ULOG=${NDRX_APPHOME}/log
0651 FIELDTBLS=Exfields
0652 FLDTBLDIR=${NDRX_APPHOME}/ubftab
0653
0654 ; Environment for Transactional Queue
0655 [@global/RM1TMQ]
0656 NDRX_XA_RES_ID=1
0657 NDRX_XA_OPEN_STR=${NDRX_APPHOME}/qdata
0658 NDRX_XA_CLOSE_STR=${NDRX_APPHOME}/qdata
0659 NDRX_XA_DRIVERLIB=libndrxxaqdisks.so
0660 ; dylib needed for osx
0661 NDRX_XA_RMLIB=libndrxxaqdisk.so
0662 NDRX_XA_LAZY_INIT=0
0663
0664 [@debug]
0665 ; * - goes for all binaries not listed bellow
0666 *= ndrx=5 ubf=1 tp=5 file=
0667 xadmin= ndrx=5 ubf=1 tp=5 file=${NDRX_ULOG}/xadmin.log
0668 ndrxd= ndrx=5 ubf=1 tp=5 file=${NDRX_ULOG}/ndrxd.log
0669
0670 ; Queue definitions goes here, see man q.conf(5) for syntax
0671 [@queue]
0672 ; Default manual queue (reserved name '@'), unknown queues are created based on this template:
0673 @=svcnm=-,autoq=n,waitinit=0,waitretry=0,waitretryinc=0,waitretrymax=0,memonly=n,mode=fifo
0674
0675 [@queue/RM1TMQ]
0676 ; Sample queue (this one is automatic, sends messages to target service)
0677 TESTQ1=svcnm=TESTSVC1,autoq=y,tries=3,waitinit=1,waitretry=1,waitretryinc=2,waitretrymax=5,memonly=n,mode=fifo
0678 --------------------------------------------------------------------------------
0679
The above also describes the configuration for Resource Manager 1, which is used
by the persistent message queue. The Resource Manager settings apply at global
level and one process may only work with one RM; thus processes operating with a
particular Resource Manager shall use the CCTAG "RM1TMQ".
0684
0685 ==== Configuration file: "ndrxconfig.xml" for demo process descriptions
0686
The demo system does not include any user processes, but almost all Enduro/X
distributed special services are configured. The configuration of system
processes looks almost the same as for user processes, thus this gives some
insight into how to configure the system.
0691
0692 --------------------------------------------------------------------------------
0693 <?xml version="1.0" ?>
0694 <endurox>
0695 <!--
0696 *** For more info see ndrxconfig.xml(5) man page. ***
0697 -->
0698 <appconfig>
0699 <!--
0700 ALL BELLOW ONES USES <sanity> periodical timer
0701 Sanity check time, sec
0702 -->
0703 <sanity>1</sanity>
0704
0705 <!--
0706 Seconds in which we should send service refresh to other node.
0707 -->
0708 <brrefresh>5</brrefresh>
0709
0710 <!--
0711 Do process reset after 1 sec
0712 -->
0713 <restart_min>1</restart_min>
0714
0715 <!--
0716 If restart fails, then boot after +5 sec of previous wait time
0717 -->
0718 <restart_step>1</restart_step>
0719
0720 <!--
0721 If still not started, then max boot time is a 30 sec.
0722 -->
0723 <restart_max>5</restart_max>
0724
0725 <!--
0726 <sanity> timer, usage end
0727 -->
0728
0729 <!--
0730 Time (seconds) after attach when program will start do sanity & respawn checks,
0731 starts counting after configuration load
0732 -->
0733 <restart_to_check>20</restart_to_check>
0734
0735
0736 <!--
0737 Setting for pq command, should ndrxd collect service
0738 queue stats automatically If set to Y or y,
0739 then queue stats are on. Default is off.
0740 -->
0741 <gather_pq_stats>Y</gather_pq_stats>
0742
0743 </appconfig>
0744 <defaults>
0745
0746 <min>1</min>
0747 <max>2</max>
0748 <!--
0749 Kill the process which have not started in <start_max> time
0750 -->
0751 <autokill>1</autokill>
0752
0753 <!--
0754 The maximum time while process can hang in 'starting' state i.e.
0755 have not completed initialization, sec X <= 0 = disabled
0756 -->
0757 <start_max>10</start_max>
0758
0759 <!--
0760 Ping server in every X seconds (step is <sanity>).
0761 -->
0762 <pingtime>100</pingtime>
0763
0764 <!--
0765 Max time in seconds in which server must respond.
0766 The granularity is sanity time.
0767 X <= 0 = disabled
0768 -->
0769 <ping_max>800</ping_max>
0770
0771 <!--
0772 Max time to wait until process should exit on shutdown
0773 X <= 0 = disabled
0774 -->
0775 <end_max>10</end_max>
0776
0777 <!--
0778 Interval, in seconds, by which signal sequence -2, -15, -9, -9.... will be sent
0779 to process until it have been terminated.
0780 -->
0781 <killtime>1</killtime>
0782
0783 </defaults>
0784 <servers>
0785 <server name="cconfsrv">
0786 <min>2</min>
0787 <max>2</max>
0788 <srvid>1</srvid>
0789 <sysopt>-e ${NDRX_ULOG}/cconfsrv.log -r</sysopt>
0790 </server>
0791 <server name="tpevsrv">
0792 <min>2</min>
0793 <max>2</max>
0794 <srvid>20</srvid>
0795 <sysopt>-e ${NDRX_ULOG}/tpevsrv.log -r</sysopt>
0796 </server>
0797 <server name="tmsrv">
0798 <min>3</min>
0799 <max>3</max>
0800 <srvid>40</srvid>
0801 <cctag>RM1TMQ</cctag>
0802 <sysopt>-e ${NDRX_ULOG}/tmsrv-rm1.log -r -- -t1 -l${NDRX_APPHOME}/tmlogs/rm1</sysopt>
0803 </server>
0804 <server name="tmqueue">
0805 <min>1</min>
0806 <max>1</max>
0807 <srvid>60</srvid>
0808 <cctag>RM1TMQ</cctag>
0809 <sysopt>-e ${NDRX_ULOG}/tmqueue-rm1.log -r -- -m SAMPLESPACE -s1</sysopt>
0810 </server>
0811 <server name="tpbridge">
0812 <min>1</min>
0813 <max>1</max>
0814 <srvid>150</srvid>
0815 <sysopt>-e ${NDRX_ULOG}/tpbridge_2.log -r</sysopt>
0816 <appopt>-f -n2 -r -i 172.0.0.1 -p 21003 -tA -z30</appopt>
0817 </server>
0818 <server name="cpmsrv">
0819 <min>1</min>
0820 <max>1</max>
0821 <srvid>9999</srvid>
0822 <sysopt>-e ${NDRX_ULOG}/cpmsrv.log -r -- -k3 -i1</sysopt>
0823 </server>
0824 </servers>
0825 <!--
0826 Client section
0827 -->
0828 <clients>
0829 <!--
0830 Test parameter passing to process
0831 - To list clients:$ xadmin pc
0832 - To stop client: $ xadmin sc -t TAG1 -s SUBSECTION1
0833 - To boot client: $ xadmin bc -t TAG1 -s SUBSECTION1
0834 -->
0835 <client cmdline="your_test_binary.sh -t ${NDRX_CLTTAG} -s ${NDRX_CLTSUBSECT}">
0836 <exec tag="TAG1" subsect="SUBSECTION1" autostart="Y" log="${NDRX_ULOG}/testbin-1.log"/>
0837 <exec tag="TAG2" subsect="SUBSECTION2" autostart="Y" log="${NDRX_ULOG}/testbin-3.log"/>
0838 </client>
0839 <client cmdline="your_test_binary2.sh -t ${NDRX_CLTTAG}">
0840 <exec tag="TAG3" autostart="Y" log="${NDRX_ULOG}/testbin2-1.log"/>
0841 </client>
0842 </clients>
0843 </endurox>
0844
0845 --------------------------------------------------------------------------------
0846
The above configuration includes the maximum set of services enabled by default
by the provision script. This includes the configuration server (*cconfsrv(8)*),
which allows the configuration to be read from ini files via a standard
*tpcall(3)*. It also includes the event server, the persistent queue and the
transaction manager for the persistent queue. A bridge connection configured as
the active (client) side is added, and the client process monitor (*cpmsrv(8)*)
is started with server id 9999. Once *cpmsrv* is booted, it will start the
processes from the "<clients/>" tag.
0854
0855
0856 == Cluster configuration
0857
To set up a cluster, you have to set up bridge XATMI processes on each of the
machines. See the <<TPBRIDGE>> documentation for an understanding of clustering.
A sample setup of a cluster node which waits for a connection from Node 2 and
actively connects to Node 12 could look like:
0862
0863 --------------------------------------------------------------------------------
0864 <?xml version="1.0" ?>
0865 <endurox>
0866 <appconfig>
0867 <sanity>10</sanity>
0868 <brrefresh>6</brrefresh>
0869 <restart_min>1</restart_min>
0870 <restart_step>1</restart_step>
0871 <restart_max>5</restart_max>
0872 <restart_to_check>20</restart_to_check>
0873 </appconfig>
0874 <defaults>
0875 <min>1</min>
0876 <max>2</max>
0877 <autokill>1</autokill>
<respawn>1</respawn>
0879 <start_max>2</start_max>
0880 <pingtime>1</pingtime>
0881 <ping_max>4</ping_max>
0882 <end_max>3</end_max>
0883 <killtime>1</killtime>
0884 </defaults>
0885 <servers>
<!-- Link to cluster node 2, passive: we wait for the incoming connection -->
0887 <server name="tpbridge">
0888 <max>1</max>
0889 <srvid>101</srvid>
0890 <sysopt>-e /tmp/BRIDGE002 -r</sysopt>
0891 <appopt>-n2 -r -i 0.0.0.0 -p 4433 -tP -z30</appopt>
0892 </server>
<!-- Link to cluster node 12, active: we try to connect to it -->
0894 <server name="tpbridge">
0895 <max>1</max>
0896 <srvid>102</srvid>
0897 <sysopt>-e /tmp/BRIDGE012 -r</sysopt>
0898 <appopt>-n12 -r -i 195.122.24.13 -p 14433 -tA -z30</appopt>
0899 </server>
0900 </servers>
0901 </endurox>
0902 --------------------------------------------------------------------------------
0903
0904 === Starting the demo application server instance
0905
The startup is straightforward. The environment variables need to be loaded
either by the *source* command or by the dot (.) notation.
0908
0909 --------------------------------------------------------------------------------
0910 $ cd /tmp/demo/conf
0911 $ source settest1
0912 $ xadmin start -y
0913 Enduro/X 5.4.1, build Nov 7 2018 08:48:27, using SystemV for LINUX (64 bits)
0914
0915 Enduro/X Middleware Platform for Distributed Transaction Processing
0916 Copyright (C) 2009-2016 ATR Baltic Ltd.
0917 Copyright (C) 2017,2018 Mavimax Ltd. All Rights Reserved.
0918
0919 This software is released under one of the following licenses:
0920 AGPLv3 or Mavimax license for commercial use.
0921
0922 * Shared resources opened...
0923 * Enduro/X back-end (ndrxd) is not running
0924 * ndrxd PID (from PID file): 18037
0925 * ndrxd idle instance started.
0926 exec cconfsrv -k 0myWI5nu -i 1 -e /tmp/demo/log/cconfsrv.log -r -- :
0927 process id=18041 ... Started.
0928 exec cconfsrv -k 0myWI5nu -i 2 -e /tmp/demo/log/cconfsrv.log -r -- :
0929 process id=18045 ... Started.
0930 exec tpevsrv -k 0myWI5nu -i 20 -e /tmp/demo/log/tpevsrv.log -r -- :
0931 process id=18049 ... Started.
0932 exec tpevsrv -k 0myWI5nu -i 21 -e /tmp/demo/log/tpevsrv.log -r -- :
0933 process id=18053 ... Started.
0934 exec tmsrv -k 0myWI5nu -i 40 -e /tmp/demo/log/tmsrv-rm1.log -r -- -t1 -l/tmp/demo/tmlogs/rm1 -- :
0935 process id=18057 ... Started.
0936 exec tmsrv -k 0myWI5nu -i 41 -e /tmp/demo/log/tmsrv-rm1.log -r -- -t1 -l/tmp/demo/tmlogs/rm1 -- :
0937 process id=18072 ... Started.
0938 exec tmsrv -k 0myWI5nu -i 42 -e /tmp/demo/log/tmsrv-rm1.log -r -- -t1 -l/tmp/demo/tmlogs/rm1 -- :
0939 process id=18087 ... Started.
0940 exec tmqueue -k 0myWI5nu -i 60 -e /tmp/demo/log/tmqueue-rm1.log -r -- -m SAMPLESPACE -s1 -- :
0941 process id=18102 ... Started.
0942 exec tpbridge -k 0myWI5nu -i 150 -e /tmp/demo/log/tpbridge_2.log -r -- -f -n2 -r -i 172.0.0.1 -p 21003 -tA -z30 :
0943 process id=18137 ... Started.
0944 exec cpmsrv -k 0myWI5nu -i 9999 -e /tmp/demo/log/cpmsrv.log -r -- -k3 -i1 -- :
0945 process id=18146 ... Started.
0946 Startup finished. 10 processes started.
0947 --------------------------------------------------------------------------------
0948
0949 The application instance is started!
0950
0951 == Max message size and internal buffer sizes
0952
Starting from Enduro/X version 5.1, the maximum message size that can be
transported over the XATMI sub-system is limited by the operating system's queue
settings. For example, on Linux kernel 3.13 the message size limit
(/proc/sys/fs/mqueue/msgsize_max) is around 10 MB. The message size is
configured with the *NDRX_MSGMAX* environment variable, see the ex_env(5) man page.
0958
Regarding buffer sizes: when *NDRX_MSGMAX* is set below 64K, the buffer size is
fixed at 64K. This means that, for example, the network packet size used by
tpbridge is 64K.
0962
As the message size also serves as the internal buffer size, not all of the
space can be used for the payload data (for example a CARRAY or UBF buffer).
Some overhead is added by Enduro/X message headers, and for the bridge protocol
format extra data is added for the TLV structure. Thus, to be safe, if the
expected data size is 64K, then the message size (*NDRX_MSGMAX*) should be set
to something like 80KB.
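As a quick arithmetic sketch (the 16 KB margin is an assumption for illustration
only; the exact header overhead depends on the Enduro/X version and buffer type),
the value could be derived like this:

```shell
# Expected payload (64 KB) plus an assumed ~16 KB margin for Enduro/X
# message headers and bridge TLV overhead:
PAYLOAD=$((64 * 1024))
MARGIN=$((16 * 1024))
NDRX_MSGMAX=$((PAYLOAD + MARGIN))
echo "$NDRX_MSGMAX"
export NDRX_MSGMAX
```

This yields 81920 bytes, i.e. the "something like 80KB" suggested above.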
0969
0970 == Enduro/X Transaction & Message identifier
0971
Enduro/X generates a 16 byte long custom CID (Enduro/X cluster ID) identifier for the following purposes:
0973
0974 . Global Transaction ID
0975
0976 . TMQ Message ID.
0977
0978 The CID is composed of the following parts:
0979
0980 - Byte 1: Enduro/X cluster node id (NDRX_NODEID).
0981
- Bytes 2-5: PID of the process that generated the CID, in network byte order.
0983
- Byte 6: tv_usec youngest bits 7..14.
0985
- Bytes 7-9: Sequence counter; the start value is randomized during process init, in network byte order.
0987
- Byte 9 (oldest 7 bits): tv_usec youngest bits 0..6.
0989
- Bytes 9 (youngest bit)-14: 33 bit Unix epoch time stamp in seconds, in network byte order.
0991
0992 - Bytes 14-16: Random number.
0993
Random numbers are generated by rand_r(); the seed is randomized by time/pid/uid
and /dev/urandom or /dev/random (if available).
The CID guarantees that up to 16 million transaction IDs/TMQ IDs per second,
generated by a single process, are unique within the cluster.
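The "16 million per second" figure follows from the 3-byte sequence counter; a
quick check of the arithmetic:

```shell
# A 3-byte (24-bit) counter yields 2^24 distinct values, which bounds
# the unique IDs a single process can produce within one second.
SEQ_VALUES=$((1 << 24))
echo "$SEQ_VALUES"
```

2^24 = 16777216, i.e. roughly 16 million distinct counter values per second.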
0998
If the OS has a 64-bit pid_t (such as AIX), TMSRV and TMQUEUE additionally
include the server id (srvid) in the identifiers, which copes with the cases
where different PIDs have equal youngest 4 bytes.
1002
If the administrator changes the operating system time backwards (manually,
not by NTP), the sequence counter and random number protect against duplicates.
1006
1007 == Enduro/X Smart Cache
1008
Enduro/X supports an SOA-level cache. This means that the administrator can
configure the system so that certain services are cached. When a client process
calls some service X and gets valid results back, a data key is built (as
specified in the config) and the data for this key is saved to the
Lightning Memory-Mapped Database (LMDB). The next time the service is called,
the cache is checked: the key is built again and a lookup in LMDB is made. If
results are found in the db, the actual service X is not called; instead the
saved buffer from the cache is returned to the caller. The cache works for the
tpcall() function.
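The lookup-or-call flow described above can be sketched with a toy key-value
store (a plain directory here, standing in for LMDB; the key and response
values are illustrative only, not the real Enduro/X implementation):

```shell
# Toy illustration of the cache pattern -- NOT the real LMDB-backed code.
CACHE=/tmp/toycache
mkdir -p "$CACHE"
key="SVCX_user42"                 # key built from the request data

if [ -f "$CACHE/$key" ]; then
    # cache hit: return the saved buffer, service X is not called
    cat "$CACHE/$key"
else
    # cache miss: call the service (stand-in), then save the result
    result="response-for-user42"
    printf '%s\n' "$result" > "$CACHE/$key"
    printf '%s\n' "$result"
fi
```

On the second run with the same key, the hit branch returns the stored buffer
without invoking the service, which is exactly the saving the cache provides.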
1017
1018 Cache supports different features:
1019
. Limited or unlimited caches are available. An unlimited cache is bound only by
the physical dimensions of the db file (also specified in the configuration).
For a limited cache, the number of logical items stored can be set by the
'limit' parameter of the database configuration. If a limit is specified, the
strategy for removing excess records can be set in the database flags. The
supported strategies are: *LRU* - keep the most recently used records, *FIFO* -
delete records in chronological order (the oldest records added to the cache
are deleted first), *HITS* - the most frequently accessed records stay in the cache.
1028
. Multiple physical storage definitions, so that XATMI services can be allocated
in the same or in different physical storage. This can help to balance storage
space limitations against performance limitations (when multiple writes are
done in the same physical storage).
1033
. The cache is Enduro/X cluster aware. Records can be distributed and deleted
across the cluster nodes. Time-based synchronization is supported for the case
where both nodes add records to a non-existing cache cell at the same time: on
both cluster nodes the fresher record survives, and the older duplicate is
zapped by tpcall() or by tpcached.
1038
. Records can be grouped; for example, statement pages can all be linked to a
single user. If a transaction happens for the user, the whole group can be
invalidated, so the cache is built again. Grouping can also be used for Denial
Of Service (DoS) protection: Enduro/X can be configured to limit the maximum
number of new records in a group, after which any lookup of a non-existing data
element in the group causes the request to be rejected with a configured
tperrno, user return code and buffer.
1045
. Records in the cache can be cross-invalidated, meaning that "hooks" can be
put on certain service calls in order to invalidate - zap - the contents of
some other cache.
1049
. The cache supports refresh conditions: if a specific condition over the data
is true, the cached data is not returned; instead the service is invoked and
the result is re-cached (the old data is overwritten).
1053
1054
1055 image:tpcache.png[caption="Figure 1: ", title="Enduro/X Smart Cache", alt="endurox start cache"]
1056
1057
1058 === Limitations of the cache
1059
LMDB is built in such a way that while a write transaction on the database is
open, no other write can proceed in the meantime. Read-only transactions are
still processed while some other process holds the write transaction. If the
process holding the lock crashes (e.g. segfault, kill, etc.), the lock is
automatically freed. Thus, for example, when using *hits* or *lru* limited
caches, the cache must be updated during the tpcall(); a lock is needed, which
means all callers have to synchronize at that point, making it a bottleneck.
1068
1069
1070 == Configuring distributed transactions support
1071
Enduro/X supports two phase commit - distributed transactions. The system
provides a configuration interface for enabling up to 255 transaction groups. A
transaction group basically is a set of credentials describing how to connect
to a database. From the XA point of view, a group represents a transaction
branch. Typically, for the same transaction branch, resources (databases,
queues, etc.) allow only one process to be active on a particular transaction
within the branch. Thus if several processes need to do work in a global
transaction, either the processes must be located in different groups, or
within the same group a process must suspend the transaction before another
process continues it.
1081
Enduro/X configuration for distributed transactions uses the following terminology:
1083
*XA Driver Lib* - a set of libraries shipped with Enduro/X. These libraries are
the interface between database specifics and Enduro/X - basically adapters for
a wide range of different resources. Typically they resolve the XA switch in a
resource-specific way, so adding a new XA resource to Enduro/X is not a big
change: just writing an XA switch resolve function, typically a few lines of
code. The driver library is configured in the *NDRX_XA_DRIVERLIB* environment
variable.
1091
The following drivers (shared libraries .so or .dylib) are shipped with the Enduro/X distribution:
1093
. *libndrxxadb2s* (for static reg) and *libndrxxadb2d* (for dynamic reg) -
Loads the IBM DB2 Resource Manager. The resource manager driver is loaded
from the library set in the *NDRX_XA_RMLIB* env variable.
1097
. *libndrxxaoras(8)* (for static reg / "xaosw") and *libndrxxaorad* (for dynamic reg / "xaoswd") -
Loads the Oracle DB Resource Manager. The resource manager driver is loaded
from the library set in the *NDRX_XA_RMLIB* env variable.
1101
. *libndrxxanulls(8)* - null switch ('tmnull_switch'). This allows processes to
participate in a global transaction, but without any linkage to a real resource
manager. The *NDRX_XA_RMLIB* parameter shall be set to "-" (indicating that the
value is empty).
1105
. *libndrxxapq(8)* (PQ Driver) and *libndrxxaecpg(8)* (ECPG/PQ Driver) - these
drivers emulate the XA switch for PostgreSQL. The resource manager driver in
*NDRX_XA_RMLIB* shall be set to "-". The libpq library is pulled in by the
Enduro/X driver dependencies.
1110
. *libndrxxatmsx(8)* (built-in XA switch, resolved with the help of the
ndrx_xa_builtin_get() function) - this resolves the XA switch from the
process's built-in symbols. Built-in symbols can be added to a process by using
*buildserver(8)*, *buildclient(8)* and *buildtms(8)*. If a built-in switch is
not compiled in, the NULL switch is returned. For server processes the built-in
handler is provided by *libatmisrvinteg*; the pointer to the XA switch can be
passed to the _tmstartserver() entry point function. Usually the entry point
call is generated by the *buildserver* program.
1118
1119 . *libndrxxawsmqs(8)* (for static reg) and *libndrxxawsmqd(8)* (for dynamic reg) -
1120 IBM WebSphere MQ XA Driver loader. The *NDRX_XA_RMLIB* shall be set to libmqmxa64_r.so.
1121
Different configurations of transaction groups:
1123
1124 image:transaction_groups.png[caption="Figure 2: ", title="Transaction group configurations"]
1125
Transaction groups are configured in environment variables. Enduro/X stores the
configuration in ini files, in the *[@global]* section. Subsections are used to
define the different groups. These sub-sections can then be assigned to
different processes via the *NDRX_CCTAG* env variable (or CCTAG in
*ndrxconfig.xml(5)*). The full list of env variables and their function can be
seen in the ex_env(5) man page.
1131
An XA group configuration consists of the following env variables:
1133
1134
1135 . *NDRX_XA_RES_ID* - mandatory parameter, this is group number.
1136
1137 . *NDRX_XA_OPEN_STR* - mandatory parameter, driver open string.
1138
1139 . *NDRX_XA_CLOSE_STR* - mandatory parameter, driver close string.
1140
1141 . *NDRX_XA_DRIVERLIB* - mandatory parameter, Enduro/X resource driver loader.
1142
. *NDRX_XA_RMLIB* - mandatory parameter, the resource manager driver (if any).
For an empty value use "-".
1145
. *NDRX_XA_LAZY_INIT* - optional; if set to *1*, XA at process level will be
initialized only when the functionality is used.
1148
1149 . *NDRX_XA_FLAGS* - optional, reconnect flags and other XA switch work mode flags
1150 may be configured here.
1151
1152
The following configuration example shows 4 processes, each living in its own
transaction group. The groups and processes are as follows:
1155
1156 . *Group 1*: Client process will operate with NULL switch (*test_nullcl*).
1157
1158 . *Group 2*: Server process will operate with Oracle DB (*test_orasv*).
1159
1160 . *Group 3*: Server process will operate with PostgreSQL DB (*test_pgsv*).
1161
1162 . *Group 4:* TMQ transactional persistent queue sub-system (*tmqueue* queue server).
1163
1164
The following environment sub-sections/groups will be defined in *app.ini*:
1166
1167 --------------------------------------------------------------------------------
1168
1169 #
1170 # Group 1 Null switch
1171 #
1172 [@global/Group1]
1173 NDRX_XA_RES_ID=1
1174 NDRX_XA_OPEN_STR=-
1175 NDRX_XA_CLOSE_STR=-
1176 NDRX_XA_DRIVERLIB=libndrxxanulls.so
1177 NDRX_XA_RMLIB=-
1178 NDRX_XA_LAZY_INIT=1
1179
1180 #
1181 # Group 2 Oracle DB
1182 #
1183 [@global/Group2]
1184 NDRX_XA_RES_ID=2
1185 NDRX_XA_OPEN_STR="ORACLE_XA+SqlNet=ROCKY+ACC=P/endurotest/endurotest1+SesTM=180+LogDir=/tmp/xa+nolocal=f+Threads=true"
1186 NDRX_XA_CLOSE_STR=${NDRX_XA_OPEN_STR}
1187 NDRX_XA_DRIVERLIB=libndrxxaoras.so
1188 NDRX_XA_RMLIB=/u01/app/oracle/product/11.2.0/dbhome_1/lib/libclntsh.so.11.1
1189 NDRX_XA_LAZY_INIT=1
1190
1191 #
1192 # Group 3 PostgreSQL
1193 #
1194 [@global/Group3]
1195 NDRX_XA_RES_ID=3
1196 NDRX_XA_OPEN_STR={"url":"postgresql://testuser:testuser1@localhost:5432/testdb"}
1197 NDRX_XA_CLOSE_STR=${NDRX_XA_OPEN_STR}
1198 NDRX_XA_DRIVERLIB=libndrxxapq.so
NDRX_XA_RMLIB=-
1200 NDRX_XA_LAZY_INIT=1
1201
1202
1203 #
1204 # Group 4 TMQ
1205 #
1206 [@global/Group4]
1207 NDRX_XA_RES_ID=4
1208 NDRX_XA_OPEN_STR=datadir="${NDRX_APPHOME}/queues/QSPACE1",qspace="QSPACE1"
1209 NDRX_XA_CLOSE_STR=$NDRX_XA_OPEN_STR
1210 NDRX_XA_DRIVERLIB=libndrxxaqdisks.so
1211 NDRX_XA_RMLIB=libndrxxaqdisk.so
1212 NDRX_XA_LAZY_INIT=0
1213
1214 --------------------------------------------------------------------------------
1215
The following servers will be defined in *ndrxconfig.xml*. The configuration
file defines a Transaction Manager Server for each of the groups; a *tmsrv(8)*,
dynamically loading the switch (or built with buildtms), is a must-have for
each group:
1220
1221 --------------------------------------------------------------------------------
1222 <?xml version="1.0" ?>
1223 <endurox>
1224 <appconfig>
1225 ...
1226 </appconfig>
1227 <defaults>
1228 ...
1229 </defaults>
1230 <servers>
1231 <server name="tmsrv">
1232 <srvid>50</srvid>
1233 <min>1</min>
1234 <max>1</max>
1235 <cctag>Group1</cctag>
1236 <sysopt>-e ${NDRX_ULOG}/TM1.log -r -- -t60 -l${NDRX_APPHOME}/tmlogs/rm1 </sysopt>
1237 </server>
1238
1239 <server name="tmsrv">
1240 <srvid>150</srvid>
1241 <min>1</min>
1242 <max>1</max>
1243 <cctag>Group2</cctag>
1244 <sysopt>-e ${NDRX_ULOG}/TM1.log -r -- -t60 -l${NDRX_APPHOME}/tmlogs/rm2 </sysopt>
1245 </server>
1246
1247 <server name="tmsrv">
1248 <srvid>250</srvid>
1249 <min>1</min>
1250 <max>1</max>
1251 <cctag>Group3</cctag>
1252 <sysopt>-e ${NDRX_ULOG}/TM1.log -r -- -t60 -l${NDRX_APPHOME}/tmlogs/rm3 </sysopt>
1253 </server>
1254
1255 <server name="tmsrv">
1256 <srvid>350</srvid>
1257 <min>1</min>
1258 <max>1</max>
1259 <cctag>Group4</cctag>
1260 <sysopt>-e ${NDRX_ULOG}/TM1.log -r -- -t60 -l${NDRX_APPHOME}/tmlogs/rm4 </sysopt>
1261 </server>
1262
1263 <server name="test_orasv">
1264 <srvid>400</srvid>
1265 <cctag>Group2</cctag>
1266 <sysopt>-e ${NDRX_ULOG}/test_orasv.log -r</sysopt>
1267 </server>
1268
1269 <server name="test_pgsv">
1270 <srvid>500</srvid>
1271 <cctag>Group3</cctag>
1272 <sysopt>-e ${NDRX_ULOG}/test_pgsv.log -r</sysopt>
1273 </server>
1274
1275 <server name="tmqueue">
1276 <max>1</max>
1277 <srvid>600</srvid>
1278 <cctag>Group4</cctag>
1279 <sysopt>-e ${NDRX_ULOG}/tmqueue.log -r -- -s1</sysopt>
1280 </server>
1281
1282 </servers>
1283 <clients>
1284 <client cmdline="test_nullcl" CCTAG="Group1">
1285 <exec tag="NULLCL" autostart="Y" log="${NDRX_ULOG}/testnullbin.log"/>
1286 </client>
1287 </clients>
1288 </endurox>
1289
1290 --------------------------------------------------------------------------------
1291
Once the application is started, any other process may be started in a specific
transaction group by providing the environment variable first. For example, to
run a process in the Oracle DB environment (which is group 2), do the following
in the shell:
1295
1296 --------------------------------------------------------------------------------
1297 $ NDRX_CCTAG=Group2 ./test_oracl
1298 --------------------------------------------------------------------------------
1299
Note that this configuration assumes that the following folders are created:
1301
1302 . $\{NDRX_APPHOME\}/tmlogs/rm[1,2,3,4] - Transaction manager machine readable logs
1303 for transaction completion and recovery.
1304
1305 . $\{NDRX_APPHOME\}/queues/QSPACE1 - Folder for persistent queue data storage.
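Assuming *NDRX_APPHOME* is already exported for the instance (/tmp/demo is used
here only as an illustrative fallback), the required folders can be created
like this:

```shell
# Create transaction manager log folders and the queue space folder
NDRX_APPHOME="${NDRX_APPHOME:-/tmp/demo}"

for rm in rm1 rm2 rm3 rm4; do
    mkdir -p "${NDRX_APPHOME}/tmlogs/${rm}"
done

mkdir -p "${NDRX_APPHOME}/queues/QSPACE1"
```

The folders must exist before booting the instance, otherwise *tmsrv(8)* and
*tmqueue(8)* will fail to open their logs and queue storage.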
1306
1307 == Enduro/X Monitoring with SNMP
1308
1309 SNMP monitoring is provided by Enduro/X Enterprise Edition module, details
1310 are described in *endurox-ee* module documentation.
1311
1312 == Enduro/X Monitoring with NetXMS
1313
The NetXMS monitoring tool has an agent plugin for Enduro/X. This section
describes the basic elements of how to monitor Enduro/X with the help of this tool.
1316
Enduro/X exposes the following list of tables which can be monitored:
1318
1319 - *Endurox.Clients* - information about client processes.
1320
1321 - *Endurox.Machines* - information about cluster machines.
1322
1323 - *Endurox.Queues* - information about local queues.
1324
1325 - *Endurox.ServerInstances* - information about XATMI server processes.
1326
1327 - *Endurox.ServiceGroups* - dynamic information about XATMI services.
1328
1329 - *Endurox.Services* - static information about XATMI services.
1330
To start Enduro/X monitoring with NetXMS, firstly the agent must be compiled
with Enduro/X support. Thus the system has to have a compiler installed, and
Internet access is required (for fetching the sources from GitHub).
1334
1335 === Building the Agent
1336
To build the agent, the system must have a C/C++ compiler and the "git" tool
installed. Basically, if the Enduro/X build dependencies are met on the host,
then the NetXMS agent will build too. For more details consult the project
specific documentation.
1340
1341 But in general, to build the agent for Enduro/X, do the following steps:
1342
1343 --------------------------------------------------------------------------------
1344
1345 $ git clone https://github.com/netxms/netxms
1346 $ cd netxms
1347 $ ./reconf
1348 $ ./configure --with-agent --prefix=/path/to/install --with-tuxedo=/usr --disable-mqtt
1349 $ make
1350 $ sudo make install
1351 --------------------------------------------------------------------------------
1352
For a basic setup, you usually need to set up the agent configuration file to
allow incoming server connections, for example:
1355
1356 --------------------------------------------------------------------------------
1357
1358 # cat << EOF > /etc/nxagentd.conf
1359
1360 LogFile=/var/log/nxagentd
1361
1362 # IP white list, can contain multiple records separated by comma.
1363 # CIDR notation supported for subnets.
1364 MasterServers=127.0.0.0/8,172.17.0.1,192.168.43.98
1365 ServerConnection=192.168.43.98
1366 SubAgent=tuxedo.nsm
1367
1368 EOF
1369
1370 --------------------------------------------------------------------------------
1371
Once the configuration is done, *nxagentd* shall be started from the Enduro/X
environment, so that the agent is able to call the *tpadmsv(8)* services.
Usually the agent is started from *cpmsrv(8)*.
1375
To start the agent manually, the following commands may be used:
1377
1378 --------------------------------------------------------------------------------
1379 $ cd /path/to/install/bin
1380
-- have some debug in the current session:
$ ./nxagentd -D5

-- or to start as a daemon:
$ ./nxagentd -d
1386
1387
1388 --------------------------------------------------------------------------------
1389
In case of CPMSRV, the following can be used as configuration:
1391
1392 --------------------------------------------------------------------------------
1393 ...
1394 <!-- Client section -->
1395 <clients>
1396 ...
1397 <client cmdline="/path/to/install/bin/nxagentd -D5 -c/etc/nxagentd.conf" log="/tmp/nxagentd.log">
1398 <exec tag="NXAGENT" autostart="Y" />
1399 </client>
1400 ...
1401 </clients>
1402 --------------------------------------------------------------------------------
1403
1404
1405 === Checking the available parameters from server
1406
To check the list of parameters that can be monitored, use the following command:
1408
1409 --------------------------------------------------------------------------------
1410
1411 $ nxget -l <agent ip addr> Agent.SupportedParameters
1412 ...
1413 Endurox.Client.ActiveConversations(*)
1414 Endurox.Client.ActiveRequests(*)
1415 Endurox.Client.Machine(*)
1416 Endurox.Client.Name(*)
1417 Endurox.Client.State(*)
1418 Endurox.Domain.ID
1419 Endurox.Domain.Queues
1420 Endurox.Domain.Servers
1421 Endurox.Domain.Services
1422 Endurox.Domain.State
1423 Endurox.Machine.Accessers(*)
1424 Endurox.Machine.Clients(*)
1425 Endurox.Machine.Conversations(*)
1426 Endurox.Machine.State(*)
1427 Endurox.Queue.Machine(*)
1428 Endurox.Queue.RequestsCurrent(*)
1429 Endurox.Queue.State(*)
1430 Endurox.ServerInstance.CommandLine(*)
1431 Endurox.ServerInstance.Generation(*)
1432 Endurox.ServerInstance.Machine(*)
1433 Endurox.ServerInstance.Name(*)
1434 Endurox.ServerInstance.PID(*)
1435 Endurox.ServerInstance.State(*)
1436 Endurox.Service.State(*)
1437 Endurox.ServiceGroup.CompletedRequests(*)
1438 Endurox.ServiceGroup.FailedRequests(*)
1439 Endurox.ServiceGroup.LastExecutionTime(*)
1440 Endurox.ServiceGroup.MaxExecutionTime(*)
1441 Endurox.ServiceGroup.MinExecutionTime(*)
1442 Endurox.ServiceGroup.State(*)
1443 Endurox.ServiceGroup.SuccessfulRequests(*)
1444
1445 --------------------------------------------------------------------------------
1446
1447
To return the values from a particular table, use the following command:
1449
1450 --------------------------------------------------------------------------------
1451
1452 $ nxget -T <agent ip> <table name e.g. Endurox.Clients>
1453
1454 --------------------------------------------------------------------------------
1455
1456 ==== Monitoring list of the items
1457
In NetXMS it is possible to import and monitor a list of resources. That can be
done in the following way:
1460
Firstly, configure a Data Collection Item (DCI) for the new item. For example:
1462
image:netxms_new_dci.png[caption="Figure 4: ", title="New DCI", alt="New DCI"]
1464
*NOTE*: As Enduro/X uses commas in identifiers, in templates quotes must
surround the "'{instance}'" placeholder. For the following classes quotes are needed:
1467
1468 - Endurox.Queue
1469
1470 - Endurox.Clients
1471
1472
Next, configure the agent list from which to discover the items:
1474
1475 image:netxms_new_dci2.png[caption="Figure 5: ", title="Agent list", alt="Agent list"]
1476
Once this is configured, the instances shall be discovered. On the monitored
node in the NetXMS Console, press *left mouse button > Poll > Instance discovery*.
1479
1480
1481 After running the instance discovery, following output may be received:
1482
1483 --------------------------------------------------------------------------------
1484
1485 [02.09.2019 20:57:57] **** Poll request sent to server ****
1486 [02.09.2019 20:57:57] Poll request accepted
1487 [02.09.2019 20:57:57] Starting instance discovery poll for node mypc
1488 [02.09.2019 20:57:57] Running DCI instance discovery
1489 [02.09.2019 20:57:57] Updating instances for FileSystem.UsedPerc({instance}) [548]
1490 [02.09.2019 20:57:57] Updating instances for FileSystem.FreePerc({instance}) [552]
1491 [02.09.2019 20:57:57] Updating instances for Endurox.Client.State('{instance}') [627]
1492 [02.09.2019 20:57:57] Creating new DCO for instance "/n00b,clt,reply,tmsrv,29321,2"
1493 [02.09.2019 20:57:57] Creating new DCO for instance "/n00b,clt,reply,tmsrv,29304,2"
1494 [02.09.2019 20:57:57] Creating new DCO for instance "1/NXAGENT/-/1"
1495 [02.09.2019 20:57:57] Creating new DCO for instance "1/BINARY1/1"
1496 [02.09.2019 20:57:57] Creating new DCO for instance "1/BINARY2/2"
1497 [02.09.2019 20:57:57] **** Poll completed successfully ****
1498
1499 --------------------------------------------------------------------------------
1500
Among the latest values, the new instances can now be seen. In this particular
case the status of the clients is monitored:
1503
1504 image:netxms_clients_list.png[caption="Figure 6: ", title="Clients list", alt="Clients list"]
1505
1506
1507 === Configuration recipes for monitoring
1508
This chapter gives some recipes for efficiently configuring the NetXMS
monitoring system to show the following items on the dashboard:
1511
1512 - Show the single client process status (dead or running).
1513
1514 - Show the status for the group of processes or services (get the number of
1515 running instances) and show the last response times in the group of services.
1516
1517 - Show the total number of processed messages for some services and calculate the
1518 TPS. Also calculate the total failed messages.
1519
The solution is based on NetXMS version 3.1 (2019), where the status indicator
is only available for nodes and business services. This tutorial will use
business services for status indicators. Data for monitoring can be gathered in
two ways: one is by using DCIs (GetDCIValues(), with last-60-seconds visibility
in order not to see removed DCIs), and the other is by direct parameter
readings (AgentReadTable() and AgentReadParameter()).
1525
1526 This tutorial will use AgentRead functions.
1527
1528 === Client status monitoring
1529
For XATMI status monitoring, a script will be created which checks the presence
of a particular parameter and that its value matches the 'ACT' constant. If it
matches, the script returns *1*; if it does not match, or the parameter is not
present, the script returns *0*.
1534
Further, this script can be used for building a business service or a new DCI,
to get a numeric value for the client process status. This assumes that the
$node variable is available (i.e. the script will be executed for some
monitored node/server).
1538
1539
1540 --------------------------------------------------------------------------------
1541
1542 //Convert Enduro/X parameter state to number
1543 //@param parameter is parameter name like "EnduroX.Client.State('2/TXUPLD/RUN7/1')"
1544 // which is being monitored
1545 //@return 0 - parameter not found or not ACT, 1 - Parameter found and is ACT
1546 sub NdrxState2Num(parameter)
1547 {
1548 v = AgentReadParameter($node, parameter);
1549
1550 if (null==v)
1551 {
1552 return 0;
1553 }
1554
1555 if (v=="ACT")
1556 {
1557 return 1;
1558 }
1559
1560 return 0;
1561 }
1562
1563
1564 //If called from DCI...
1565 //return NdrxState2Num($1);
1566
1567 --------------------------------------------------------------------------------
1568
To register the script, in the NetXMS Management Console go to *Configuration > Script Library*
and in the window select "New..." to create a new script. The
name may be the same, 'NdrxState2Num'. Copy the contents into the window and
save.
1573
To call *NdrxState2Num()* from a DCI, create a wrapper script like this and save
it with the name *NdrxState2NumDci* under the Script Library.
1576
1577 --------------------------------------------------------------------------------
1578
1579 use NdrxState2Num;
1580
1581 //Wrapper for DCI
1582 return NdrxState2Num($1);
1583
1584 --------------------------------------------------------------------------------
1585
1586
To have status monitor indicators, the next step is to create a business service.
For example, we want to monitor the following processes (IDs for clients, obtained by
'$ nxget -T 127.0.0.1 Endurox.Clients' or '$ xadmin mibget -c T_CLIENT'):
1590
1591 - 2/TXUPLD/RUN1/1
1592
1593 - 2/TXUPLD/RUN2/1
1594
1595 - 2/TXUPLD/RUN3/1
1596
1597 - 2/TXUPLD/RUN4/1
1598
To do this, in the left menu under "Business Services", a new "Business Service"
needs to be created, under which a "Node link" must be added, and only then shall a
"Service check..." be added. In any other combination it won't work, and you will
see question marks in the icon tree of the NetXMS console.

To use the NdrxState2Num() script for process checking in a business service,
the following script can be used:
1606
1607 --------------------------------------------------------------------------------
1608 //Use script library
1609 use NdrxState2Num;
1610
1611 if (0==NdrxState2Num("EnduroX.Client.State('2/TXUPLD/RUN1/1')"))
1612 {
1613 return FAIL;
1614 }
1615
1616 return OK;
1617 --------------------------------------------------------------------------------
1618
1619 image:netxms_service_chk.png[caption="Figure 7: ", title="Business Service for status indicator"]
1620
1621
1622 === Getting the number of servers, response times, etc. for the XATMI services
1623
To get the number of service providers (XATMI servers advertising the service) and other aggregated
data, analysis will be done on Agent tables, for example "Endurox.ServiceGroups".

A script function will be created which provides the following aggregation options:
1628
- min - return the minimum value found in the group;

- max - return the maximum value found in the group;

- avg - return the average value of all matched items;

- sum - return the sum of the matched items;

- cnt - return the count of the items matched.
1638
The function shall accept the following arguments:

- Table name;

- Key column name;

- Key value expression (regular expression);

- Aggregation function name;

- Aggregation column name.

So, to first see the columns available for data analysis, you may use the following script
(execute the server script on the Node, i.e. Shift+Alt+S):
1651
1652 --------------------------------------------------------------------------------
1653 t = AgentReadTable($node, "Endurox.ServiceGroups");
1654
1655 if (null==t)
1656 {
1657 return "Table is not found? Is Agent configured for Enduro/X?";
1658 }
1659
1660 for (c : t->columns) {
1661 print(c->name . " | ");
1662 }
1663
1664 println("");
1665
1666 for (row : t->rows) {
1667 for(cell : row->values) {
1668 print(cell . " | ");
1669 }
1670
1671 println("");
1672 }
1673 --------------------------------------------------------------------------------
1674
1675 Sample output could be:
1676
1677 --------------------------------------------------------------------------------
1678 *** FINISHED ***
1679
1680 Result: (null)
1681
1682 SVCNAME | SRVGROUP | LMID | GROUPNO | RQADDR | STATE | RT_NAME | LOAD | PRIO | COMPLETED | QUEUED | SUCCESSFUL | FAILED | EXECTIME_LAST | EXECTIME_MAX | EXECTIME_MIN |
1683 @CCONF | 2/1 | 2 | 0 | | ACT | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1684 @CCONF | 2/2 | 2 | 0 | | ACT | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1685 .TMIB | 2/10 | 2 | 0 | | ACT | | 0 | 0 | 1094 | 0 | 1094 | 0 | 0 | 4000 | 0 |
1686 .TMIB-2-10 | 2/10 | 2 | 0 | | ACT | | 0 | 0 | 9 | 0 | 9 | 0 | 0 | 0 | 0 |
1687 .TMIB | 2/11 | 2 | 0 | | ACT | | 0 | 0 | 31 | 0 | 31 | 0 | 0 | 2000 | 0 |
1688 .TMIB-2-11 | 2/11 | 2 | 0 | | ACT | | 0 | 0 | 5 | 0 | 5 | 0 | 0 | 0 | 0 |
1689 DEBIT | 2/80 | 2 | 0 | | ACT | | 0 | 0 | 83649 | 0 | 83649 | 0 | 29000 | 35000 | 0 |
1690 DEBIT | 2/81 | 2 | 0 | | ACT | | 0 | 0 | 83629 | 0 | 83629 | 0 | 24000 | 32000 | 0 |
1691 CREDIT | 2/140 | 2 | 0 | | ACT | | 0 | 0 | 163463 | 0 | 163463 | 0 | 0 | 6000 | 0 |
1692 CREDIT | 2/141 | 2 | 0 | | ACT | | 0 | 0 | 3788 | 0 | 3788 | 0 | 0 | 4000 | 0 |
1693 CREDIT | 2/142 | 2 | 0 | | ACT | | 0 | 0 | 27 | 0 | 27 | 0 | 0 | 1000 | 0 |
1694 HANDLER | 2/240 | 2 | 0 | | ACT | | 0 | 0 | 55878 | 0 | 55878 | 0 | 36000 | 56000 | 0 |
1695 HANDLER | 2/241 | 2 | 0 | | ACT | | 0 | 0 | 55647 | 0 | 55647 | 0 | 29000 | 58000 | 0 |
1696 HANDLER | 2/242 | 2 | 0 | | ACT | | 0 | 0 | 55753 | 0 | 55753 | 0 | 32000 | 57000 | 0 |
1697 @CPMSVC | 2/9999 | 2 | 0 | | ACT | | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
1698 --------------------------------------------------------------------------------
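Before wiring the aggregation into NXSL, its intended semantics can be checked
outside NetXMS. The following shell sketch computes the same result as a
NdrxGet("Endurox.ServiceGroups", "SVCNAME", "^DEBIT$", "sum", "COMPLETED")
call would, over a pipe-delimited extract like the sample above (the rows and
the awk field positions are taken from that sample; this is an illustration,
not part of the NetXMS configuration):

```shell
#!/bin/sh
# Sum the COMPLETED column (10th pipe-delimited field in the sample
# output above) for rows whose SVCNAME matches ^DEBIT$.
sum=$(awk -F'|' '
    { gsub(/ /, "", $1) }          # trim padding from the key column
    $1 ~ /^DEBIT$/ { s += $10 }    # accumulate COMPLETED for matches
    END { print s }
' << 'EOF'
DEBIT | 2/80 | 2 | 0 | | ACT | | 0 | 0 | 83649 | 0 | 83649 | 0 | 29000 | 35000 | 0 |
DEBIT | 2/81 | 2 | 0 | | ACT | | 0 | 0 | 83629 | 0 | 83629 | 0 | 24000 | 32000 | 0 |
CREDIT | 2/140 | 2 | 0 | | ACT | | 0 | 0 | 163463 | 0 | 163463 | 0 | 0 | 6000 | 0 |
EOF
)
echo "$sum"   # 83649 + 83629 = 167278
```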
1699
Thus, the following script function can be written to get the count of the services
advertised:
1702
1703 --------------------------------------------------------------------------------
1704
1705 //Match the table entry, get the count
1706 //@param tableName e.g. "Endurox.ServiceGroups"
//@param keyColName key column name on which matching/counting is performed, e.g. "SVCNAME"
1708 //@param keyExpr regular expression to match given name, e.g. "^@CCONF$"
1709 //@param aggrFunc aggregation function name - min, max, sum, avg, cnt
1710 //@param aggrCol aggregation column used for min, max, sum and avg.
1711 //@return aggregated value
1712 sub NdrxGet(tableName, keyColName, keyExpr, aggrFunc, aggrCol)
1713 {
1714 ret = 0;
1715 t = AgentReadTable($node, tableName);
1716
1717 if (null==t)
1718 {
1719 return null;
1720 }
1721
chk_col = -1;
agg_col = -1;
1724
1725 for(i = 0; i < t->columnCount; i++)
1726 {
1727 if (t->getColumnName(i) == keyColName)
1728 {
1729 chk_col=i;
1730 }
1731 else if (t->getColumnName(i) == aggrCol)
1732 {
1733 agg_col=i;
1734 }
1735
1736 }
1737
1738 //No column found..
1739 if (-1==chk_col)
1740 {
1741 print("ERROR! Key column not found: ".keyColName."\n");
1742 return null;
1743 }
1744
1745 if (-1==agg_col && (aggrFunc=="min" || aggrFunc=="max" || aggrFunc=="sum" || aggrFunc=="avg"))
1746 {
1747 print("ERROR! Aggregation column not found: ".aggrCol."\n");
1748 return null;
1749 }
1750
1751 match_rows = 0;
1752 // Process the data...
1753 for(i = 0; i < t->rowCount; i++)
1754 {
1755 keycolvalue = t->get(i, chk_col);
1756
1757 if (keycolvalue ~= keyExpr)
1758 {
1759 match_rows++;
1760
1761 if (aggrFunc=="cnt")
1762 {
1763 ret++;
1764 }
1765 else
1766 {
1767 data = t->get(i, agg_col);
1768
1769 //print("AGG: ".data."\n");
1770
1771 if (aggrFunc=="sum" || aggrFunc=="avg")
1772 {
1773 ret+=data;
1774 }
1775 else if (aggrFunc=="min")
1776 {
1777 if (1==match_rows)
1778 {
1779 ret = data;
1780 }
1781 else if( data < ret )
1782 {
1783 ret = data;
1784 }
1785 }
1786 else if (aggrFunc=="max")
1787 {
1788 if (1==match_rows)
1789 {
1790 ret = data;
1791 }
1792 else if( data > ret )
1793 {
1794 ret = data;
1795 }
1796 }
1797
1799 }
1800 }
1801 }
1802
1803 if (0==match_rows && (aggrFunc=="min" || aggrFunc=="max" || aggrFunc=="sum" || aggrFunc=="avg"))
1804 {
1805 ret = null;
1806 }
1807 else if (aggrFunc=="avg")
1808 {
1809 ret = ret/match_rows;
1810 }
1811
1812 return ret;
1813 }
1814
1815 //To test:
1816 //return NdrxGet("Endurox.ServiceGroups", "SVCNAME", "^DEBIT$", "sum", "COMPLETED");
1817 //return NdrxGet("Endurox.ServiceGroups", "SVCNAME", "^DEBIT$", "avg", "COMPLETED");
1818 //return NdrxGet("Endurox.ServiceGroups", "SVCNAME", ".TMIB", "min", "COMPLETED");
1819 //return NdrxGet("Endurox.ServiceGroups", "SVCNAME", ".TMIB", "max", "COMPLETED");
1820 //return NdrxGet("Endurox.ServiceGroups", "SVCNAME", ".TMIB", "avg", "COMPLETED");
1821
1822 //To start the script from DCI, we need to actually call it:
1823 //return NdrxGet($1, $2, $3, $4, $5);
1824
1825 --------------------------------------------------------------------------------
1826
Store the script in the library as "NdrxGet".

To run "NdrxGet" from a DCI, let's create a wrapper script and save it as *NdrxGetDci*
in the script library.
1831
1832 --------------------------------------------------------------------------------
1833 use NdrxGet;
1834
1835 //Call this from DCI, pass the arguments
1836 //as: NdrxGet("Endurox.ServiceGroups","SVCNAME","HANDLER","sum","FAILED")
1837 return NdrxGet($1, $2, $3, $4, $5);
1838 --------------------------------------------------------------------------------
1839
Based on these scripts, Data Collection Items (DCIs) can be created for the hosts.
This document shows how to create the following data collection items.
1842
1843 ==== DCI: Average response time over several servers for one service
1844
The average response time here is measured for the service named "HANDLER".
1846
1847 - DCI Origin: Script;
1848
1849 - Parameter: NdrxGetDci("Endurox.ServiceGroups","SVCNAME","HANDLER","avg","EXECTIME_LAST");
1850
1851 - Data Type: Float
1852
1853
1854 image:netxms_avg_rsp.png[caption="Figure 8: ", title="Average response time"]
1855
1856
1857 ==== DCI: Number of successful processed messages for one service with several servers
1858
The number of successfully processed messages here is measured for the "HANDLER" service.
1860
1861 - DCI Origin: Script;
1862
1863 - Parameter: NdrxGetDci("Endurox.ServiceGroups","SVCNAME","HANDLER","sum","SUCCESSFUL")
1864
1865 - Data Type: Integer
1866
1867 image:netxms_succ.png[caption="Figure 9: ", title="Successful requests processed by service copies"]
1868
1869
1870 ==== DCI: Cumulative number of messages waiting in queues (for all services)
1871
This indicator usually shall be very small, like 0..1..2; if it grows higher,
it indicates that the system is unable to cope with the workload. It is recommended
to monitor this value.
1875
1876 - DCI Origin: Script;
1877
1878 - Parameter: NdrxGetDci("Endurox.Queues","NAME",".*","sum","RQ_CURRENT");
1879
1880 - Data Type: Integer
1881
1882 image:netxms_qsize.png[caption="Figure 10: ", title="Number of messages in queue"]
1883
1884
1885 ==== DCI: TPS for one service with several servers
1886
It is also useful to monitor the system throughput. This shall be done
on some 'main' service which handles all the incoming traffic. In this case,
the service "HANDLER" is monitored.
1890
1891 - DCI Origin: Script;
1892
1893 - Parameter: NdrxGetDci("Endurox.ServiceGroups","SVCNAME","HANDLER","sum","COMPLETED");
1894
1895 - Data Type: Integer
1896
1897 - Transformation: Average delta per second
1898
1899 image:netxms_tps.png[caption="Figure 11: ", title="TPS Configuration"]
1900
1901 image:netxms_tps_transf.png[caption="Figure 12: ", title="TPS transformation"]
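The "Average delta per second" transformation above is just a counter-delta
computation. It can be sketched as follows (the counter values and interval are
illustrative placeholders, not real agent readings):

```shell
#!/bin/sh
# TPS = (current COMPLETED counter - previous counter) / polling interval.
prev=83649      # COMPLETED value at the previous poll (illustrative)
curr=85149      # COMPLETED value at the current poll (illustrative)
interval=60     # DCI polling interval in seconds
tps=$(( (curr - prev) / interval ))
echo "$tps"     # (85149 - 83649) / 60 = 25
```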
1902
1903
1904 == High availability features
1905
This section lists notes regarding the high availability features available
in the Enduro/X product.
1908
1909 === Available features
1910
1911 Enduro/X is cluster-aware and scalable across several instances. Out of the box
1912 Enduro/X provides the following cluster-aware functions:
1913
1914 - *tpbridge(8)* process ensures that XATMI services are shared across the connected
1915 machines (nodes).
1916
- XATMI clients can call services across the cluster (on directly connected nodes),
and load balancing may be configured to split usage between network and local
services. In case local services become unavailable, the network services
are chosen to fulfill the request.
1922
1923 - Enduro/X smart cache can replicate cache records across the linked machines. With
1924 limitation, that if the link is lost between machines, the cache records might fork,
1925 as currently no mechanisms are provided to synchronize caches while the link is restored.
1926
- Process groups, which can be configured in singleton mode (basically failover mode).
Thus it is possible to configure that a set of processes (XATMI clients and servers)
in a cluster of several machines runs only on one machine at a time;
if the active machine goes out of order, failover happens to an available machine.
1931
1932 - In case of failover distributed transaction manager *tmsrv(8)* can be configured
1933 to automatically recover transaction logs written by another node. However, transaction logs
1934 must be placed on the shared file system.
1935
1936 - In case of failover, transactional queue *tmqueue(8)* can be configured
1937 to automatically load messages from the disk processed by another node.
1938 However, the message store must be placed on the shared file system.
1939
1940 Singleton process groups are available from Enduro/X version *8.0.10*.
1941
1942 === Distributed transactions in a failover environment
1943
The full life-cycle of a transaction is managed by the particular
transaction manager *tmsrv(8)* instance which started the
distributed transaction. A *tmsrv* instance is identified
by the Enduro/X cluster node id and the <srvid> tag from *ndrxconfig.xml(5)*.
1948
In case high availability is required for the distributed transaction
environment, the failover functionality may be configured for *tmsrv*, so that
if a node with active transaction managers becomes unavailable for some reason,
another Enduro/X node may take over the transactions started on the crashed node.
1954
By design, the distributed transaction manager *tmsrv(8)* uses file system storage
to store the transaction logs. As the first step of configuring failover, the transaction
manager shall store its logs on a shared file system, for example IBM Spectrum Scale *GPFS*,
*GFS2*, or any other FS which has Posix semantics. The logs directory for *tmsrv* is set by the
*-l* parameter.
1960
The second step for implementing failover for transaction managers is to configure
the singleton groups for the *tmsrv(8)* instances. Additionally, instances on the different
Enduro/X nodes shall point to the same transaction log directory on the shared fs.
As transaction logs include information about the Enduro/X cluster node id
and <srvid> values on which the transaction was started, it is crucial that
*tmsrv(8)* on different Enduro/X nodes match the following attributes:

- A common virtual Enduro/X cluster node id shall be set for the matching *tmsrv(8)* instances.
The value shall be passed to the *tmsrv* parameter *-n* and may be
set to any node number from the cluster.
1971
- Transaction manager instances on different Enduro/X nodes must
use matching *<srvid>* tags in *ndrxconfig.xml(5)*.
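As an illustration only (the server id, paths, and timeout shown here are
hypothetical; consult *ndrxconfig.xml(5)* and *tmsrv(8)* for the authoritative
syntax), matching *tmsrv* definitions on both Enduro/X nodes might look like:

```xml
<server name="tmsrv">
    <!-- <srvid> must be identical on all nodes sharing the log directory -->
    <min>1</min>
    <max>1</max>
    <srvid>300</srvid>
    <!-- -l: transaction logs on the shared fs, -n: common virtual node id -->
    <sysopt>-e ${NDRX_ULOG}/tmsrv.log -- -t60 -l /sharedfs/tmlogs/rm1 -n 1</sysopt>
</server>
```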
1974
After the transaction manager process definition in the singleton process group,
*tmrecoversv(8)* shall be configured. This ensures that any orphaned prepared
transaction branches are collected (rolled back). For failover groups,
it is recommended to use the *-i* (immediate) flag, so that the *tmrecoversv* run is done
without a delay. Orphans may happen if an Enduro/X cluster node crashes during the
prepare phase of the distributed transaction, as at that time, for performance reasons,
fsync() on the transaction log file storage is not used.
1982
When using *tmsrv* for a *tmqueue(8)* group, the *tmrecoversv* shall be put
right after the *tmqueue* process definition in *ndrxconfig.xml(5)*, instead of after the *tmsrv*.
1985
Another instance of *tmrecoversv* shall be configured at the end of the whole application
for each Enduro/X node. Such *tmrecoversv* shall run periodically. That will allow housekeeping
of any orphaned prepared transactions for which corrupted transaction logs
were left on the disk after the failover.
1990
Additionally, during the failover, the active-state (not yet prepared according to 2PC)
transaction branches may not be controlled by the *tmsrv* which took over the operation,
as for performance reasons *tmsrv* writes such info to disk with the *fflush()* C library
call only, which means that on the node that took over, the transaction data might not be
available, and active transactions in the resource managers cannot be rolled back by Enduro/X
tools. However, the developer may use the *exec_on_bootlocked* or *exec_on_locked* user exit points
and develop scripts for collecting active transactions from the resource managers.
Additional details for working with specific resource managers are outside of the scope
of this guide.
2000
2001 *tmsrv(8)* shall be configured to use *FSYNC* (or *FDATASYNC*) and *DSYNC* flags
2002 in the *NDRX_XA_FLAGS* env, so that the commit decision is fully logged to the storage
2003 device.
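As a sketch only (the exact flag-string syntax, including the separator, is an
assumption here and shall be verified against the Enduro/X environment
documentation before use), the environment could be set as:

```shell
#!/bin/sh
# Hypothetical sketch: enable synced transaction-log writes for tmsrv.
# The flag names come from this guide; verify the exact NDRX_XA_FLAGS
# syntax in the Enduro/X environment manual before use.
export NDRX_XA_FLAGS="FSYNC;DSYNC"
echo "$NDRX_XA_FLAGS"
```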
2004
*WARNING!* To use the failover functionality for distributed transactions, the
underlying hardware and software infrastructure shall provide certain capabilities,
which are described in the following chapter.
2008
2009 === Hardware and software requirements for transaction manager failover and transactional message queue
2010
2011 When the transaction is about to commit, *tmsrv(8)* writes the
2012 decision to the persistent storage. For transaction consistency, it is crucial that
2013 *no write happens from the failed node to the shared storage after the failover*.
2014 If such writes happen, transaction logs might become corrupted and
2015 Enduro/X transaction manager *may lose transactions*.
2016
For the transactional message queue *tmqueue*, it is mandatory that no writes to
the shared file system (including file renames, unlinks, and opening of new handles)
happen *from the failed node to the shared storage after the failover*.
2020
Before the failed node is repaired (automatically or manually) and the shared
fs becomes mounted back in read/write mode, the Enduro/X application
must be killed (at minimum all *tmsrv* and *tmqueue* processes which
were using the particular shared file system).
2025
Example cases where a write to disk from the failed node might happen:
2027
- The virtual machine with the currently active node is paused; the shared file
system assumes that the node has failed and left the cluster, and the shared file
system cluster gives the fcntl() lock (used by singleton groups) to the failover
node, which takes over. After the takeover, the paused node is resumed. In such
cases, shared file system corruption or transaction corruption is possible,
as concurrent writes to transaction logs are not supported
by the Enduro/X transaction manager.
2035
- The active node becomes so busy with work that it turns unresponsive (basically the server
is overloaded), so that the shared file system cluster may decide that the node has failed,
and it does the failover (releases locks/gives locks to another node). When the failed node's
load is back to normal, it starts to write to disk again, which may corrupt the data.
2040
To resolve such scenarios, hardware-level fencing is required.
In an ideal scenario, *STONITH* shall be employed, so that if a node fails,
it is restarted, and only then does the shared file system give the fcntl() lock
to the failover node. *GFS2* uses this mechanism. *STONITH* is the recommended
way to fence, as it also ensures that user application processes are
terminated immediately and no duplicate runs in the cluster are possible.
2047
*GPFS*, on the other hand, does not use *STONITH*; however, it offers two mechanisms
to fence such invalid writes:
2050
- Time-based leasing. At the driver level, *GPFS* detects that failover has happened
and the time lease has expired, after which no writes to disk happen. However, this is not
always reliable: with some VM hypervisors, if a VM is paused and then resumed,
time does not move forward during the pause, resulting in the time lease not being
expired on the failed node while it is assumed to be expired by the other nodes.
This might cause corruption of the shared file system, as the failed node at resume
writes immediately to the disk, which corrupts structures processed by
other nodes. This might lead to the *GPFS* partition being unmounted on all cluster nodes,
and *mmfsck* on the partition might be required to mount the partition back. Additionally,
after the *mmfsck*, some files might be lost/corrupted.
The second problem here is that, if such writes continue, they might cause Enduro/X
transaction log corruption. If still deciding to use time-based fencing, the
risks shall be considered, i.e. to what extent such pause/resume, server overload,
or other similar kinds of errors can actually happen in the given environment.
2065
- Another option, supported by IBM *GPFS*, is to have IO fencing,
such as Persistent Reserve (SCSI-3 protocol). In this case, at the hardware level,
the shared disk is disconnected from the failed node. This is the recommended way
to fence the failed node, as it ensures that no writes happen to the shared file system.
However, additional configuration is required so that during the partition unmount
at the failure, all the Enduro/X application processes are killed by the *GPFS*.
This is needed to protect the transaction logs from concurrent writes after *GPFS*
remounts the shared file system, when it re-initializes the Persistent Reservation
and mounts the disk back. Please consult with IBM about which hardware is supported by
*GPFS* for SCSI-3 Persistent Reservation fencing. An additional benefit of
this fencing mechanism is that cluster node loss detection is faster than
when using time leases.
2078
==== Shared FS main attributes for transactional operations (applies to tmsrv and tmqueue)
2080
The following key Posix file system properties must be met by the shared file system
to have transaction and persistent queue consistency after a crash:
2083
- A successful return of the fsync() or fdatasync() Unix calls guarantees
that the data is written to the file, even after a system crash.
2086
- When a new file is created and the program has issued the fsync() Unix call on the
file descriptor, there must be a guarantee that the file appears in the directory after
fsync() returns a successful result. For Linux, the "DSYNC" Enduro/X flag can be set,
which opens the directory in read-only mode and performs fsync() on the directory file descriptor.
2091
- When the program issues the rename() Unix call on a file, the operation is atomic, and the new path is
persisted on disk after rename() successfully returns. For Linux, the "DSYNC" Enduro/X flag
ensures that Linux synchronizes the directory to the final state.
2095
- When the program issues the unlink() Unix call on a file, the operation must be atomic and
persisted to disk after unlink() successfully returns. For Linux, the
"DSYNC" Enduro/X flag ensures that Linux synchronizes the directory
to the final state.
2100
If the above features are not provided by the operating system or the shared file system,
at the point of crash and failover there might be a chance of transaction
loss.
2104
2105 ==== Additional notes to the IBM Spectrum Scale (GPFS)
2106
To ensure that the Enduro/X application is killed on the shared fs failure,
a GPFS callback script shall be configured. The local event *unmount* is recommended
for this purpose and can be configured in the following way:

Firstly, let's create the script which will be called on the event. The script
invokes the Enduro/X xadmin "killall" command, which matches
the "$ ps -ef" pattern in grep style, extracts PIDs, and kills them with the -15 and -9
signals, with *ndrxd* killed first to avoid respawning;
this is done from the *root* user. The "xadmin killall" is quicker,
but it might not remove all processes; thus after that, from the application user and
application environment, *xadmin down* is called to ensure that all processes and
shared resources of the application are removed. Note that if a disk failure
happens, the application on the broken node is terminated and shall be booted
back manually. It might be possible to use the *mount* callback to start
the application back; however, that's outside of the scope of this document, and
for such a crash as FS loss, a server reboot would be recommended, whereby at startup
the server boot scripts would start the application back.
2124
2125 Forced termination script:
2126
2127 --------------------------------------------------------------------------------
2128
2129 $ su - root
2130
2131 # mkdir /opt/app
2132
2133 -- 1) In the following command replace "user1" with application user.
2134 -- 2) Replace "/home/user1/test/conf/settest1" with environment file of the application.
2135
2136 # cat << EOF > /opt/app/app_cleanup.sh
2137 #!/bin/bash
2138
2139 #
2140 # pre-mount script, after the crash ensure that
2141 # no processes from "user1" are working
2142 # add to GPFS by mmaddcallback app_cleanup --command=/opt/app/app_cleanup.sh --event unmount --sync
2143 #
2144 xadmin killall ndrxd user1
2145 su - user1 -c "source /home/user1/test/conf/settest1; xadmin down -y"
2146 exit 0
2147
2148 EOF
2149
2150 # chmod +x /opt/app/app_cleanup.sh
2151
2152 --------------------------------------------------------------------------------
2153
When the script is ready, it *must be copied to all the nodes in the cluster*. After that,
the callback can be enabled by:
2156
2157 --------------------------------------------------------------------------------
2158 $ su - root
2159 # mmaddcallback app_cleanup --command=/opt/app/app_cleanup.sh --event unmount --sync
2160 --------------------------------------------------------------------------------
2161
2162 To enable Persistent Reservation for the GPFS, use the following command:
2163
2164 --------------------------------------------------------------------------------
2165 $ su - root
2166 # mmchconfig usePersistentReserve=yes
2167 --------------------------------------------------------------------------------
2168
2169 To check the status of the Persistent Reservation enabled:
2170
2171 --------------------------------------------------------------------------------
2172 $ su - root
2173
2174 # mmlsnsd -X
2175
2176 Disk name NSD volume ID Device Devtype Node name or Class Remarks
2177 \-------------------------------------------------------------------------------------------------------
2178 nsd1 C0A8126F652700FD /dev/sdb generic g7node1 server node,pr=yes
2179 nsd1 C0A8126F652700FD /dev/sdb generic g7node2 server node,pr=yes
2180 nsd1 C0A8126F652700FD /dev/sdb generic g7node3 server node,pr=yes
2181
2182 --------------------------------------------------------------------------------
2183
*pr=yes* means that the persistent reservation feature is enabled.
2185
2186 ==== Cluster configurations for Red-Hat GFS2 and IBM Spectrum Scale GPFS
2187
While technically it is possible to configure *GFS2* and *GPFS* to use two-server-node
clusters, for simplicity of administration and better fault tolerance,
three-node clusters are recommended. All nodes shall have access to the shared disk storage.
2191
As for the Enduro/X application itself, it may run on two nodes, and the third node
may be used as a quorum node only for the shared file-system infrastructure.
2194
2195 The shared file-system configuration instructions are out of the scope of this document.
2196
2197 The typical topology of the Enduro/X cluster would look like the following:
2198
2199 image:ex_adminman_failover_hw.png[caption="Figure 13: ", title="Failover hardware", alt="failover hardware of two processing nodes ad quorum node"]
2200
For the disk connectivity, the shared disk array may be connected via
*FibreChannel* (FC) or *iSCSI*. FibreChannel is the preferred way of connecting
disks. For Fibre Channel, link speeds typically would be *16 Gbps* or *32 Gbps*.
For iSCSI, the link speed typically is *10G* or *25G*.
2205
2206 For the performance requirements, disk models, disk array, configuration guides
2207 and what is supported with the chosen shared file system,
2208 please consult with the corresponding Software and Hardware vendors.
2209
*NOTE:* For IBM Spectrum Scale GPFS hardware-level Persistent Reserve (SCSI-3 protocol)
fence support, consult with IBM on which devices support the given feature under GPFS.
2212
When choosing the hardware for transaction manager (*tmsrv*) and transactional
message queue (*tmqueue*) support, the key performance aspect of the hardware
is the number of fsync() operations per second the disk infrastructure is capable of.
fsync() is a call that issues a write to disk and returns only when the data is actually
persisted on the disk (i.e. it avoids the "cached only" state).
2218
Please ask the enterprise disk vendors which models they recommend for fast fsync()
operations (i.e. the number of fsync() calls per second, as pure random write IOPS do not
always correlate with the fsync() numbers). For example, the Samsung 980 Pro, according to
specs, has about 500K random write IOPS, and Intel S4600 SSDs have 72K random write IOPS;
however, in tests with fsync(), the Intel S4600 is about 10x faster than the Samsung
counterpart. Probably this is related to the fact that the Intel S4600 is an enterprise
SSD and might contain a capacitor-backed cache on the drive. In classic RAID setups,
it has also been shown that enterprise-level battery-backed RAID controllers have
better fsync() performance than drives without battery backup. In case no clear
information is available from the vendors, the disk selection may be made
on an empirical basis, by doing benchmarks on different configurations.
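For such an empirical check, a crude probe like the following shell sketch can give
a first impression of the fsync() rate (this assumes GNU dd with conv=fsync is
available; a proper benchmark tool with per-write fsync would be preferable, and the
block size and iteration count here are arbitrary):

```shell
#!/bin/sh
# Write N small blocks, each dd call fsync()-ing its data before exit,
# and report the elapsed wall-clock time on the target directory.
DIR=${1:-.}
N=100
start=$(date +%s)
i=0
while [ "$i" -lt "$N" ]; do
    dd if=/dev/zero of="$DIR/fsync_probe.tmp" bs=4096 count=1 conv=fsync 2>/dev/null
    i=$((i + 1))
done
end=$(date +%s)
rm -f "$DIR/fsync_probe.tmp"
echo "$N fsync'd writes in $((end - start))s"
```

Run it with the mount point of the candidate shared file system as the first argument.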
2230
2231 Typically Enduro/X has the following fsync numbers:
2232
2233 1. 2x fsync() per *tmsrv* transaction completion.
2234
2235 2. 3x fsync() operations per *tmqueue* operation (per msg enqueue or dequeue),
2236 not counting fsync related to *tmsrv*.
2237
2238 However, fsync() is not the only parameter which affects the application
2239 performance.
2240
2241 === tmqueue queue space in a failover environment
2242
*tmqueue(8)* server uses file-system-based storage for storing persistent
messages. For cluster-oriented operations, failover is supported, but in that
case the directory that holds the messages shall be stored on a shared file system,
for example *GFS2* or *GPFS*.
2247
2248 The directory for *tmqueue(8)* is configured in the XA open string for *libndrxxaqdisks(8)*.
2249 Failover for *tmqueue* can be enabled by
2250 configuring it in the singleton process group, as described in the next chapter.
2251
As the message files stored on this disk encode the original Enduro/X server ID
and Enduro/X cluster node ID, for failover mode all cluster nodes which
provide backup locations for the given queue space shall configure the following
attributes:

- A common/virtual Enduro/X cluster node id shall be set for the matching *tmqueue(8)* instances.
The value is passed via the *tmsrv* parameter *-n*, which may be set to any node
number from the cluster; however, it must be the same for all instances.

- Queue space server instances on different Enduro/X nodes must have
matching *<srvid>* tags in *ndrxconfig.xml(5)*.
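For instance, both nodes might carry an identical *ndrxconfig.xml* fragment such as the following (a sketch; the srvid value and the node number are illustrative):

```
<!-- identical on all nodes providing backup for the queue space -->
<server name="tmqueue">
    <min>1</min>
    <max>1</max>
    <srvid>80</srvid>                <!-- must match across the nodes -->
    <sysopt>-e ${NDRX_ULOG}/tmqueue.log -r -- -s1 -n2</sysopt> <!-- -n2: common virtual cluster node id -->
</server>
```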
2264
2265 === Singleton process groups
2266
If the requirement for the application is to have only one copy of a binary in
the server cluster, then singleton process groups can be used for this purpose.
Each singleton group requires a lock provider *exsinglesv(8)* process instance which, by
acquiring the lock, determines on which machine the singleton group is started.
*exsinglesv(8)* currently works with fcntl(2) Unix locks, meaning that the cluster
file system shall provide such functionality (e.g. *GFS2* or *GPFS*) and it must
be configured on the Enduro/X nodes to support singleton process groups.
2274
In the event of node failover, the next available node acquires
the lock, and after a certain time delay the singleton group is started on the new node.
Meanwhile, if the old node is still active but has lost the lock for some technical reason (such as
the shared file system becoming unavailable, or *exsinglesv* being shut down),
the processes in the groups are killed with signal 9.
2280
The processes in the groups (as with all processes in Enduro/X) are controlled by either
*ndrxd(8)* or *cpmsrv(8)*, thus group activation and deactivation (kill -9, if the lock was lost)
are performed by these binaries within the sanity check intervals, so reasonably
short sanity cycles shall be set for the processes (tag <sanity> in *ndrxconfig.xml(5)* for
*ndrxd* and the *-i* flag for *cpmsrv*). Reasonable values are *1* or *2* seconds;
however, depending on the configured process count, short cycles might increase
the system load spent on checks.
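A minimal configuration sketch of the short sanity cycles described above (values per the recommendation; the settings also appear in the full sample later in this chapter):

```
<appconfig>
    ...
    <sanity>1</sanity>               <!-- ndrxd sanity cycle, seconds -->
    ...
</appconfig>
...
<server name="cpmsrv">
    ...
    <sysopt>-e ${NDRX_ULOG}/cpmsrv.log -r -- -k3 -i1</sysopt>  <!-- -i1: cpmsrv check interval, seconds -->
</server>
```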
2288
By default, the singleton process group boot order is the same as for a normal startup. If
failover happens, once *exsinglesv* has acquired the lock, completed the failover wait and
reported the locked status to shared memory, then at the next *ndrxd* sanity cycle
all the XATMI server processes are booted in the original boot sequence.
When all servers in the group are started, *cpmsrv* continues with the
boot of the client processes.
2295
2296 By setting attribute *noorder="Y"* in *<procgroup>*, the boot order is ignored, and
2297 all server processes in the group are started in parallel. The *cpmsrv* will not
2298 wait for the process group servers to boot but will proceed immediately.
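A sketch of a parallel-boot group definition (the attributes other than *noorder* are taken from the sample later in this chapter):

```
<procgroups>
    <!-- noorder="Y": boot all servers of the group in parallel;
         cpmsrv does not wait for the group servers to finish booting -->
    <procgroup grpno="5" name="RM1TMQ" singleton="Y" noorder="Y"
        sg_nodes="1,2" sg_nodes_verify="Y"/>
</procgroups>
```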
2299
The maximum number of processes in a process group is limited by the
*NDRX_MSGMAX* setting from *ex_env(5)*. The default limit is *64*. The hard limit
is *99999*.
2303
Singleton process groups do not guarantee with 100% probability that processes
run in only one copy at a time on any Enduro/X node instance. There can
be cases, such as a VM pause and resume, where the current node is removed from
the shared FS cluster and its fcntl lock is released, so the other node gets
the shared file system lock and the group is booted on the new node.
Meanwhile, if for some reason the VM on the old node is resumed from the pause,
the old node's *exsinglesv*/*ndrxd*/*cpmsrv* will detect that the group lock was
lost and the group processes will be killed, but this detection requires a
run of the full *exsinglesv* check cycle (and the shared FS must not freeze the access),
and on top of that two additional sanity cycles of *ndrxd*/*cpmsrv* are required.
This means that during this time the singleton processes will run on two nodes at the same time.
To ensure that this never happens, a *STONITH* device must be configured. For example, *GFS2*
requires *STONITH* as part of the shared file system installation. For
other shared file system clusters, please consult the corresponding vendor.
However, even if *STONITH* is not available on the given shared file system, the Enduro/X
transaction manager can still keep distributed transaction consistency,
with the requirement that in the event of a shared fs cluster failure, the fs becomes
unavailable or read-only.
2322
Singleton groups maintain the following health checks:
2324
- Each group keeps a refresh time-stamp in shared memory.
*exsinglesv* periodically refreshes this time-stamp (a monotonic-clock-based time-stamp).
If processes such as *ndrxd* and *cpmsrv* detect during their sanity cycles that
the time-stamp has not been refreshed within the configured time (*NDRX_SGREFRESH* from
*ex_env(5)*), the group in shared memory is unlocked, and *ndrxd*/*cpmsrv*
proceed with killing all the group processes.
2331
- *exsinglesv* regularly verifies the lock status. The checks include
requests to other nodes via *tpbridge(8)* if the *exsingleckl(8)* and
*exsingleckr(8)* services are configured. If these services are not
used, then the checks are done against the ping/heartbeat file (to detect the highest lock
counter of all involved nodes). If the locked node detects that other nodes
have a higher counter value, then the given node performs a group unlock, after which
*ndrxd* and *cpmsrv* step in to kill the group processes.
2339
- In case *STONITH* is not used, *tmsrv(8)*, after each commit decision logged,
verifies that it is still on the locked node. Verification is done against the shared
memory entry of the singleton group; additionally, *tmsrv* checks the
status against the heartbeat file or reads the heartbeat status from other
nodes. If using a *STONITH* device, this check can optionally be disabled
by setting the *ndrxconfig.xml(5)* setting *sg_nodes_verify* to *N*. For
each transaction, 3x checks are made: at the start of the transaction, before
the commit decision is logged, and right after the decision is successfully logged. For
TMQ transactions, additional checks are made when the TMQ transaction is
started and at the point where it is being prepared.
2350
- To support failure recovery, the *tmsrv* and *tmqueue*
processes have been given a new CLI setting, *-X<seconds>*. This setting is used
to detect a parallel run of the *tmsrv* and *tmqueue* processes, to
recover from cases where parallel instances have generated queue messages or
transactions. *tmsrv* will load unseen transaction
logs on the fly; however, *tmqueue* will reboot in case new
messages appear on the Q-space disk, so that the queue space on the active
node reads the messages from the disk consistently. The *-X* setting for
*tmsrv* also allows removing expired/broken transaction logs,
which in some scenarios allows *tmrecoversv(8)* to remove
any locked prepared records.
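The timing relation behind the first health check can be sketched with the documented default values:

```shell
# exsinglesv refreshes the group time-stamp every chkinterval seconds; if no
# refresh lands within NDRX_SGREFRESH seconds, ndrxd/cpmsrv unlock the group
# and kill its processes. The refresh window must therefore span at least one
# full check cycle (the defaults give a comfortable margin).
CHKINTERVAL=10      # exsinglesv check/refresh interval, seconds
NDRX_SGREFRESH=30   # refresh timeout, seconds
[ "$NDRX_SGREFRESH" -ge $((CHKINTERVAL * 2)) ] && echo "refresh window OK"
```

Shrinking *chkinterval* speeds up failure detection but increases shared-fs traffic; *NDRX_SGREFRESH* must be scaled with it.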
2362
*NOTE:* when operating with process groups, it is not recommended to
store the *ndrxd*, *cpmsrv* and *exsinglesv* process logs on the shared
file system: in case of a shared file system failure, these processes may
get killed, with the result that singleton groups will not be unlocked locally
and the processes of the singleton groups will not be removed.
2368
2369 The following diagram depicts the core functionality of the singleton group
2370 failover, including distributed transaction and persistent queue
2371 processing.
2372
2373 image:ex_adminman_failover.png[caption="Figure 14: ", title="Failover processing", alt="failover processing diagram of two nodes"]
2374
2375 === Sample configuration of the failover setup with singleton groups
2376
This chapter provides a sample configuration for a cluster of two machines;
to keep the instructions short, a single singleton process group is
configured, which is located on both machines. By default, up to *64* process groups may
be configured, and they may all be configured in the same way as described here.
2381
2382 Process group "RM1TMQ" will be configured in the example, allocated on nodes
2383 *1* and *2*.
2384
2385 ==== Recommended process configuration for failover
2386
For *tmsrv*, the following additional settings shall be used:
2388
2389 .tmsrv - Recommended additional CLI parameters for failover processing
2390 [width="80%", options="header"]
2391 |=========================================================
2392 |Parameter|Description|Recommended value
2393 |-n<node_id> | Common cluster node id, used for log file recognition|
2394 Any node ID (NDRX_NODEID) used in a cluster. Shall be common for all participants in the cluster
|-X<seconds> | Number of seconds between disk check runs
(i.e. scans for unseen transaction logs, or housekeeping of broken records after their expiry).|*10*
|-h<seconds> | Number of seconds to wait before housekeeping removes corrupted transaction logs.
Corrupted logs might appear if the active node opened a transaction and did not reach the
committing phase. On the new node, some records in the resource (e.g. DB, TMQ) might get stuck in the prepare
phase until housekeeping happens.|*30*
|-a<nr_tries> | Number of attempts to try to load broken transaction logs.|*3*
2402 |=========================================================
2403
2404 .tmqueue - Recommended additional CLI parameters for failover processing
2405 [width="80%", options="header"]
2406 |=========================================================
2407 |Parameter|Description|Recommended value
2408 |-n<node_id> | Common cluster node id, used for log file recognition|
2409 Any node ID (NDRX_NODEID) used in a cluster. Shall be common for all participants in the cluster
|-X<seconds> | Number of seconds between disk check runs (i.e. scans for non-loaded messages
left over from another node after a concurrent run). Normally this shall not happen if proper fencing is used;
however, it is recommended to keep this setting for any failover setup.|*60*
|-t<seconds> | Default timeout for automatic/forward transactions. Due to the fact
that, for example, *GPFS* may freeze the working node if a non-active node leaves
the cluster, the transaction timeout shall be set to a reasonable value, so
that the transaction is not rolled back during the cluster reconfiguration.
This generally applies to any global transactions used in the application.
In the case of *GPFS*, tests have shown that for a good node, the
shared fs may freeze for up to *30* sec.|*60*
2420 |=========================================================
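The reasoning behind the *-t* recommendation can be checked with simple arithmetic (the 30-second freeze figure comes from the *GPFS* observation above):

```shell
# The forward-transaction timeout should comfortably exceed the worst-case
# shared-fs freeze seen during cluster reconfiguration, otherwise in-flight
# transactions are rolled back during the reconfiguration itself.
FS_FREEZE_MAX=30   # observed GPFS freeze on the good node, seconds
TMQ_TIMEOUT=60     # recommended tmqueue -t value, seconds
[ "$TMQ_TIMEOUT" -ge $((FS_FREEZE_MAX * 2)) ] && echo "timeout covers fs freeze"
```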
2421
2422 Additional notes:
2423
If a large number of messages is expected in the persistent queue (>5K), please increase
the process boot time in the *<server>* tag for *tmqueue*, for example:
2426
2427 --------------------------------------------------------------------------------
2428
2429 <server name="tmqueue">
2430 ...
2431 <start_max>200</start_max>
2432 <srvstartwait>200</srvstartwait>
2433 </server>
2434
2435 --------------------------------------------------------------------------------
2436
This allows the boot sequence to wait for the messages to load into the memory queue space
before moving on to the next XATMI server to boot.
Check the settings in *ndrxconfig.xml(5)*.
2440
2441 .tpbridge - Recommended additional CLI parameters for failover processing
2442 [width="80%", options="header"]
2443 |=========================================================
2444 |Parameter|Description|Recommended value
|-z<zero_snd_time> | Interval at which a zero-length message is sent to the socket to keep the
connection active|*5*
|-c<check_time> | Number of seconds between check intervals. This interval schedules
the *-z* keep-alive message sending|*2*
|-a<rcv_time> | Number of seconds within which something must be received from
the socket; if nothing is received, the connection is closed|*10*
2451 |=========================================================
2452
The above configuration ensures that the socket is reset after 10 seconds
in case the failed node is cut off, so that the failed node's services are no longer
seen as available on the surviving node.
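The interplay of the three parameters can be sketched as follows (values as recommended above):

```shell
# tpbridge sends a zero-length keep-alive every -z seconds (scheduled on the
# -c check interval); the peer must deliver something within -a seconds or the
# socket is closed. -a must exceed -z, or healthy links would be dropped.
Z_SND=5      # -z: keep-alive send interval
CHK=2        # -c: scheduler check interval
RCV_MAX=10   # -a: receive deadline
[ "$RCV_MAX" -gt "$Z_SND" ] && [ "$Z_SND" -gt "$CHK" ] \
    && echo "dead peer cut off within ${RCV_MAX}s"
```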
2456
2457 .exsinglesv/exsingleckl - Recommended [@exsinglesv/<subsection>] settings
2458 [width="80%", options="header"]
2459 |=========================================================
2460 |Parameter|Description|Recommended value
|noremote | Disable remote checks, i.e. do not call the @SGREM service on the
other node to check the active node lock status; use only @SGLOC, which
does the node counter checks in the heartbeat file| *1* - recommended
for *GPFS* setups, as read-only file access is possibly faster than network
access. The default is *0* - disabled (use @SGREM
for remote node status/counter reading)
2467 |=========================================================
2468
Default (and recommended) values for *exsinglesv(8)* (with the default *NDRX_SGREFRESH* env value) are:
2470
2471 - Check interval *10* seconds (*chkinterval*).
2472
- Maximum refresh time, after which the local node will unlock, is *30* seconds (*NDRX_SGREFRESH*).
I.e. *exsinglesv* must have run at least one check cycle within this time.
2475
- The other Enduro/X node's singleton process group boot time after the failover (the shared file system has
released *lockfile_1* to the other node) is *6* check cycles (*locked_wait*). This means that
the total time for the surviving node to boot depends on the shared file system's
reconfiguration timeouts/settings for releasing fcntl() locks,
plus this wait time (e.g. *6* cycles multiplied by the *10* sec *chkinterval*).
The wait time shall be at least two times higher than the other node's refresh time:
if the lock provider lost the lock on the other node for some
reason, the doubled value gives time for the other node's
*ndrxd*/*cpmsrv* processes to kill the group. Only after that time does
the new node boot the processes back up.
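Putting the defaults together gives a rough failover timeline. A sketch, where the shared-fs lock release time is an assumed example value:

```shell
# Time until the surviving node boots the singleton group: the shared file
# system must first release the fcntl() lock (FS_RELEASE is hypothetical and
# depends on the cluster fs configuration), then exsinglesv waits locked_wait
# check cycles before reporting the group as locked.
CHKINTERVAL=10   # chkinterval, seconds
LOCKED_WAIT=6    # locked_wait, in check cycles
FS_RELEASE=45    # assumed shared-fs fcntl lock release time, seconds
echo "~$((FS_RELEASE + LOCKED_WAIT * CHKINTERVAL))s until the group starts on the new node"
```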
2486
2487 ==== INI configuration part
2488
The following ini-based common configuration settings are needed on all cluster nodes which
participate in the singleton group:
2491
2492 --------------------------------------------------------------------------------
2493
2494 ################################################################################
# Additional debug settings for lock providers. Logging happens to the *tp* topic
2496 ################################################################################
2497
2498 [@debug]
2499 ...
2500 # lock providers are using "tp" log topic:
2501 exsinglesv=tp=5 file=
2502 exsingleckl=tp=5 file=
2503 exsingleckr=tp=5 file=
2504
2505 ################################################################################
2506 # Lock provider log files
2507 # Most of the settings are defaults (thus not listed here)
2508 ################################################################################
2509
2510 # settings for singleton group RM1TMQ, note that CCTAG here used is the same
2511 # as the group name, but these are two different identifiers, just used
2512 # with the same name for convenience.
2513 [@exsinglesv/RM1TMQ]
2514 lockfile_1=/path/to/shared/storage/RM1TMQ/lock_1
2515 lockfile_2=/path/to/shared/storage/RM1TMQ/lock_2
2516
2517 ################################################################################
# queues for each of the queue spaces
2519 ################################################################################
2520 [@queue/RM1TMQ]
2521 Q1=svcnm=QFWD1,autoq=T,tries=3,waitinit=0,waitretry=1,waitretrymax=5,memonly=n,mode=fifo,workers=1
2522 Q2=svcnm=QFWD2,autoq=T,tries=3,waitinit=0,waitretry=1,waitretrymax=5,memonly=n,mode=fifo,workers=1
2523
2524 ################################################################################
2525 # XA driver configuration
2526 ################################################################################
2527
2528 [@global/RM1TMQ]
2529 NDRX_XA_RES_ID=1
2530 NDRX_XA_OPEN_STR=datadir="/path/to/shared/storage/qsp/SAMPLESPACE",qspace="SAMPLESPACE"
2531 NDRX_XA_CLOSE_STR=${NDRX_XA_OPEN_STR}
2532 NDRX_XA_DRIVERLIB=libndrxxaqdisks.so
2533 NDRX_XA_RMLIB=libndrxxaqdisk.so
2534 NDRX_XA_LAZY_INIT=0
2535
2536 --------------------------------------------------------------------------------
2537
2538 ==== XML configuration part
2539
*ndrxconfig-dom1.xml* (for node 1) and *ndrxconfig-dom2.xml* (for node 2).
The configuration file instances are mostly the same for all involved nodes,
thus only one template is prepared in this guide:
2543
2544 --------------------------------------------------------------------------------
2545 <?xml version="1.0" ?>
2546 <endurox>
2547 <appconfig>
2548 ...
2549 <sanity>1</sanity>
2550 ...
2551 </appconfig>
2552 <procgroups>
2553 <procgroup grpno="5" name="RM1TMQ" singleton="Y" sg_nodes="1,2" sg_nodes_verify="Y"/>
2554 ...
2555 </procgroups>
2556 <defaults>
2557 ...
2558 </defaults>
2559 <servers>
2560 ...
2561 <!--
2562 Start of the group RM1TMQ
2563 -->
2564 <server name="exsinglesv">
2565 <min>1</min>
2566 <max>1</max>
2567 <srvid>30</srvid>
2568 <sysopt>-e ${NDRX_ULOG}/exsinglesv-a.log -r</sysopt>
2569 <!-- set as lock provider -->
2570 <procgrp_lp>RM1TMQ</procgrp_lp>
2571 <cctag>RM1TMQ</cctag>
2572 </server>
2573 <!-- support servers, local -->
2574 <server name="exsingleckl">
2575 <min>10</min>
2576 <max>10</max>
2577 <srvid>40</srvid>
2578 <sysopt>-e ${NDRX_ULOG}/exsingleckl-a.log -r</sysopt>
2579 <procgrp_lp>RM1TMQ</procgrp_lp>
2580 <cctag>RM1TMQ</cctag>
2581 </server>
2582 <!-- support servers, remote -->
2583 <server name="exsingleckr">
2584 <min>10</min>
2585 <max>10</max>
2586 <srvid>50</srvid>
2587 <sysopt>-e ${NDRX_ULOG}/exsingleckr-a.log -r</sysopt>
2588 <procgrp_lp>RM1TMQ</procgrp_lp>
2589 <cctag>RM1TMQ</cctag>
2590 </server>
2591 <server name="tmsrv">
2592 <min>3</min>
2593 <max>3</max>
2594 <srvid>60</srvid>
2595 <cctag>RM1TMQ</cctag>
2596 <!-- /kvmpool/data/test/tmlogs/rm1 points to shared folder -->
2597 <sysopt>-e ${NDRX_ULOG}/tmsrv-rm1.log -r -- -t60 -l/kvmpool/data/test/tmlogs/rm1 -n2 -X10 -h30 -a3</sysopt>
2598 <procgrp>RM1TMQ</procgrp>
2599 </server>
2600 <server name="tmqueue">
2601 <min>1</min>
2602 <max>1</max>
2603 <srvid>80</srvid>
2604 <cctag>RM1TMQ</cctag>
2605 <sysopt>-e ${NDRX_ULOG}/tmqueue-rm1.log -r -- -s1 -n2 -X60 -t60</sysopt>
2606 <procgrp>RM1TMQ</procgrp>
2607
<!-- Increase the time for start wait, as there might be lots of messages
to load
-->
2611 <start_max>200</start_max>
2612 <!-- do not start next server within startup time -->
2613 <srvstartwait>200</srvstartwait>
2614 </server>
2615 <server name="tmrecoversv">
2616 <min>1</min>
2617 <max>1</max>
2618 <srvid>250</srvid>
2619 <procgrp>RM1TMQ</procgrp>
2620 <sysopt>-e ${NDRX_ULOG}/tmrecoversv-a.log -r -- -i</sysopt>
2621 </server>
2622
2623 <!-- User-specific binaries: -->
2624 <server name="testsv.py">
2625 <min>3</min>
2626 <max>3</max>
2627 <srvid>1250</srvid>
2628 <sysopt>-e ${NDRX_ULOG}/testsv.py.log -r</sysopt>
2629 <procgrp>RM1TMQ</procgrp>
2630 <cctag>RM1TMQ</cctag>
2631 </server>
2632 <server name="banksv_a">
2633 <min>1</min>
2634 <max>1</max>
2635 <srvid>1500</srvid>
2636 <procgrp>GRP_A</procgrp>
2637 <sysopt>-e ${NDRX_ULOG}/banksv-a.log -r -- -s1</sysopt>
2638 </server>
2639 ... other RM1TMQ servers here ...
2640 <!--
2641 *** End of the group RM1TMQ ***
2642 -->
2643 <!--
2644 Establish link between nodes 1 & 2 (so transaction checks run faster)
2645 -->
2646 <server name="tpbridge">
2647 <max>1</max>
2648 <srvid>500</srvid>
2649 <sysopt>-e ${NDRX_ULOG}/tpbridge.log</sysopt>
2650 <!-- using -z5 / -c2 to enable broken connection detect after 10 sec -->
2651 <appopt>-f -n<other_node_id> -r -i <binding_or_host_ip> -p <binding_or_host_port> -t<A_for_client_B_for_server> -z5 -c2 -a10</appopt>
2652 </server>
2653 ...
2654 <server name="tmrecoversv">
2655 <min>1</min>
2656 <max>1</max>
2657 <srvid>4000</srvid>
2658 <sysopt>-e ${NDRX_ULOG}/tmrecoversv-periodic.log -r -- -p</sysopt>
2659 </server>
2660 <!--
2661 For demonstration purposes, client process will be also attached
2662 to the singleton group
2663 -->
2664 <server name="cpmsrv">
2665 <min>1</min>
2666 <max>1</max>
2667 <srvid>9999</srvid>
2668 <sysopt>-e ${NDRX_ULOG}/cpmsrv-dom1.log -r -- -k3 -i1</sysopt>
2669 </server>
2670 </servers>
2671 <clients>
2672 <!-- client processes linked to process group -->
2673 <client cmdline="restincl" procgrp="RM1TMQ">
2674 <exec tag="TAG1" subsect="SUBSECTION1" autostart="Y" log="${NDRX_ULOG}/restincl-a.log"/>
2675 </client>
2676 </clients>
2677 </endurox>
2678
2679
2680 --------------------------------------------------------------------------------
2681
Once the groups are defined, the Enduro/X instances can be started. Each group will
be booted on only one of the nodes *1* or *2*, but not on both at the same time.
Processes that are not booted
are put in the "wait" state. For example, starting node 1:
2686
2687 --------------------------------------------------------------------------------
2688
2689 $ xadmin start -y
2690
2691 --------------------------------------------------------------------------------
2692
2693 The test output from commands:
2694
2695 --------------------------------------------------------------------------------
2696
2697 $ xadmin start -y
2698 * Shared resources opened...
2699 * Enduro/X back-end (ndrxd) is not running
2700 * ndrxd PID (from PID file): 3494
2701 * ndrxd idle instance started.
2702 exec cconfsrv -k 0myWI5nu -i 1 -e /home/user1/test/log/cconfsrv.log -r -- :
2703 process id=3496 ... Started.
2704 exec cconfsrv -k 0myWI5nu -i 2 -e /home/user1/test/log/cconfsrv.log -r -- :
2705 process id=3497 ... Started.
2706 exec tpadmsv -k 0myWI5nu -i 10 -e /home/user1/test/log/tpadmsv.log -r -- :
2707 process id=3498 ... Started.
2708 exec tpadmsv -k 0myWI5nu -i 11 -e /home/user1/test/log/tpadmsv.log -r -- :
2709 process id=3499 ... Started.
2710 exec tpevsrv -k 0myWI5nu -i 20 -e /home/user1/test/log/tpevsrv.log -r -- :
2711 process id=3500 ... Started.
2712 exec exsinglesv -k 0myWI5nu -i 30 -e /home/user1/test/log/exsinglesv-a.log -r -- :
2713 process id=3501 ... Started.
2714 exec exsingleckl -k 0myWI5nu -i 40 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2715 process id=3502 ... Started.
2716 exec exsingleckl -k 0myWI5nu -i 41 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2717 process id=3503 ... Started.
2718 exec exsingleckl -k 0myWI5nu -i 42 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2719 process id=3504 ... Started.
2720 exec exsingleckl -k 0myWI5nu -i 43 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2721 process id=3505 ... Started.
2722 exec exsingleckl -k 0myWI5nu -i 44 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2723 process id=3506 ... Started.
2724 exec exsingleckr -k 0myWI5nu -i 50 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2725 process id=3507 ... Started.
2726 exec exsingleckr -k 0myWI5nu -i 51 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2727 process id=3508 ... Started.
2728 exec exsingleckr -k 0myWI5nu -i 52 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2729 process id=3509 ... Started.
2730 exec exsingleckr -k 0myWI5nu -i 53 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2731 process id=3510 ... Started.
2732 exec exsingleckr -k 0myWI5nu -i 54 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2733 process id=3511 ... Started.
2734 exec tmsrv -k 0myWI5nu -i 60 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -X10 -n2 -h30 -a3 -- :
2735 process id=3512 ... Started.
2736 exec tmsrv -k 0myWI5nu -i 61 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -X10 -n2 -h30 -a3 -- :
2737 process id=3524 ... Started.
2738 exec tmsrv -k 0myWI5nu -i 62 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -X10 -n2 -h30 -a3 -- :
2739 process id=3536 ... Started.
2740 exec tmqueue -k 0myWI5nu -i 80 -e /home/user1/test/log/tmqueue-rm1.log -r -- -s1 -n2 -X60 -t60 -- :
2741 process id=3548 ... Started.
2742 exec tmrecoversv -k 0myWI5nu -i 250 -e /home/user1/test/log/tmrecoversv-a.log -r -- -i -- :
2743 process id=3711 ... Started.
2744 exec testsv.py -k 0myWI5nu -i 1250 -e /home/user1/test/log/testsv.py.log -r -- :
2745 process id=3712 ... Started.
2746 exec testsv.py -k 0myWI5nu -i 1251 -e /home/user1/test/log/testsv.py.log -r -- :
2747 process id=3713 ... Started.
2748 exec testsv.py -k 0myWI5nu -i 1252 -e /home/user1/test/log/testsv.py.log -r -- :
2749 process id=3714 ... Started.
2750 exec tpbridge -k 0myWI5nu -i 150 -e /home/user1/test/log/tpbridge_2.log -r -- -n2 -r -i x.x.x.x -p 21003 -tA -z30 :
2751 process id=3715 ... Started.
2752 exec tmrecoversv -k 0myWI5nu -i 9900 -e /home/user1/test/log/tmrecoversv.log -- -p -- :
2753 process id=3722 ... Started.
2754 exec cpmsrv -k 0myWI5nu -i 9999 -e /home/user1/test/log/cpmsrv.log -r -- -k3 -i1 -- :
2755 process id=3723 ... Started.
2756 Startup finished. 27 processes started.
2757
2758
2759 $ xadmin pc
2760 * Shared resources opened...
2761 * ndrxd PID (from PID file): 3494
2762 TAG1/SUBSECTION1 - running pid 3725 (process group RM1TMQ (no 5), Sun Oct 8 07:19:12 2023)
2763
2764 --------------------------------------------------------------------------------
2765
2766 When node 2 is started, the following output is expected from the system (sample output):
2767
2768 --------------------------------------------------------------------------------
2769
2770 $ xadmin start -y
2771 * Shared resources opened...
2772 * Enduro/X back-end (ndrxd) is not running
2773 * ndrxd PID (from PID file): 31862
2774 * ndrxd idle instance started.
2775 exec cconfsrv -k 0myWI5nu -i 1 -e /home/user1/test/log/cconfsrv.log -r -- :
2776 process id=31864 ... Started.
2777 exec cconfsrv -k 0myWI5nu -i 2 -e /home/user1/test/log/cconfsrv.log -r -- :
2778 process id=31865 ... Started.
2779 exec tpadmsv -k 0myWI5nu -i 10 -e /home/user1/test/log/tpadmsv.log -r -- :
2780 process id=31866 ... Started.
2781 exec tpadmsv -k 0myWI5nu -i 11 -e /home/user1/test/log/tpadmsv.log -r -- :
2782 process id=31867 ... Started.
2783 exec tpevsrv -k 0myWI5nu -i 20 -e /home/user1/test/log/tpevsrv.log -r -- :
2784 process id=31868 ... Started.
2785 exec exsinglesv -k 0myWI5nu -i 30 -e /home/user1/test/log/exsinglesv-a.log -r -- :
2786 process id=31869 ... Started.
2787 exec exsingleckl -k 0myWI5nu -i 40 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2788 process id=31870 ... Started.
2789 exec exsingleckl -k 0myWI5nu -i 41 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2790 process id=31871 ... Started.
2791 exec exsingleckl -k 0myWI5nu -i 42 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2792 process id=31872 ... Started.
2793 exec exsingleckl -k 0myWI5nu -i 43 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2794 process id=31873 ... Started.
2795 exec exsingleckl -k 0myWI5nu -i 44 -e /home/user1/test/log/exsingleckl-a.log -r -- :
2796 process id=31874 ... Started.
2797 exec exsingleckr -k 0myWI5nu -i 50 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2798 process id=31875 ... Started.
2799 exec exsingleckr -k 0myWI5nu -i 51 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2800 process id=31876 ... Started.
2801 exec exsingleckr -k 0myWI5nu -i 52 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2802 process id=31877 ... Started.
2803 exec exsingleckr -k 0myWI5nu -i 53 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2804 process id=31878 ... Started.
2805 exec exsingleckr -k 0myWI5nu -i 54 -e /home/user1/test/log/exsingleckr-a.log -r -- :
2806 process id=31879 ... Started.
2807 exec tmsrv -k 0myWI5nu -i 60 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -n2 -X10 -h30 -a3 -- :
2808 process id=0 ... Waiting on group lock.
2809 exec tmsrv -k 0myWI5nu -i 61 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -n2 -X10 -h30 -a3 -- :
2810 process id=0 ... Waiting on group lock.
2811 exec tmsrv -k 0myWI5nu -i 62 -e /home/user1/test/log/tmsrv-rm1.log -r -- -t1 -l/kvmpool/data/test/tmlogs/rm1 -n2 -X10 -h30 -a3 -- :
2812 process id=0 ... Waiting on group lock.
2813 exec tmqueue -k 0myWI5nu -i 80 -e /home/user1/test/log/tmqueue-rm1.log -r -- -s1 -n2 -X60 -t60 -- :
2814 process id=0 ... Waiting on group lock.
2815 exec testsv.py -k 0myWI5nu -i 1250 -e /home/user1/test/log/testsv.py.log -r -- :
2816 process id=0 ... Waiting on group lock.
2817 exec testsv.py -k 0myWI5nu -i 1251 -e /home/user1/test/log/testsv.py.log -r -- :
2818 process id=0 ... Waiting on group lock.
2819 exec testsv.py -k 0myWI5nu -i 1252 -e /home/user1/test/log/testsv.py.log -r -- :
2820 process id=0 ... Waiting on group lock.
2821 exec tmrecoversv -k 0myWI5nu -i 250 -e /home/user1/test/log/tmrecoversv-a.log -r -- -i -- :
2822 process id=0 ... Waiting on group lock.
2823 exec tpbridge -k 0myWI5nu -i 150 -e /home/user1/test/log/tpbridge_2.log -r -- -n1 -r -i 0.0.0.0 -p 21003 -tP -z30 :
2824 process id=31880 ... Started.
2825 exec tmrecoversv -k 0myWI5nu -i 9900 -e /home/user1/test/log/tmrecoversv.log -- -p -- :
2826 process id=31887 ... Started.
2827 exec cpmsrv -k 0myWI5nu -i 9999 -e /home/user1/test/log/cpmsrv.log -r -- -k3 -i1 -- :
2828 process id=31888 ... Started.
2829 Startup finished. 19 processes started.
2830
2831 $ xadmin pc
2832 * Shared resources opened...
2833 * ndrxd PID (from PID file): 31862
2834 TAG1/SUBSECTION1 - waiting on group lock (process group RM1TMQ (no 5), Sun Oct 8 14:19:49 2023)
2835
2836 --------------------------------------------------------------------------------
2837
The group was locked on node *1*, and on node *2* the processes in group *RM1TMQ* are put
in the wait state (waiting on the lock).
2840
After the startup, the XATMI server process states can be inspected with the following
command:
2843
2844 $ xadmin ppm
2845
2846 Example from 2nd node:
2847
2848 --------------------------------------------------------------------------------
2849
2850
2851 $ xadmin ppm
2852 BINARY SRVID PID SVPID STATE REQST AS EXSQ RSP NTRM LSIG K STCH FLAGS
2853 -------- ----- -------- -------- ----- ----- -- ---- ---- ---- ---- - ---- -----
2854 cconfsrv 1 5711 5711 runok runok 1 0 61 0 61 1 61
2855 cconfsrv 2 5712 5712 runok runok 1 0 61 0 61 1 61
2856 tpadmsv 10 5713 5713 runok runok 1 0 61 0 61 1 61
2857 tpadmsv 11 5714 5714 runok runok 1 0 61 0 61 1 61
2858 tpevsrv 20 5715 5715 runok runok 1 0 61 0 61 1 61
2859 exsingle 30 5716 5716 runok runok 1 0 61 0 61 1 61 L
2860 exsingle 40 5717 5717 runok runok 1 0 61 0 61 1 61
2861 exsingle 41 5718 5718 runok runok 1 0 61 0 61 1 61
2862 exsingle 42 5719 5719 runok runok 1 0 61 0 61 1 61
2863 exsingle 43 5720 5720 runok runok 1 0 61 0 61 1 61
2864 exsingle 44 5721 5721 runok runok 1 0 61 0 61 1 61
2865 exsingle 50 5722 5722 runok runok 1 0 61 0 61 1 61
2866 exsingle 51 5723 5723 runok runok 1 0 61 0 61 1 61
2867 exsingle 52 5724 5724 runok runok 1 0 61 0 61 1 61
2868 exsingle 53 5725 5725 runok runok 1 0 61 0 61 1 61
2869 exsingle 54 5726 5726 runok runok 1 0 61 0 61 1 61
2870 tmsrv 60 0 0 wait runok 1 0 61 0 61 1 61
2871 tmsrv 61 0 0 wait runok 1 0 61 0 61 1 61
2872 tmsrv 62 0 0 wait runok 1 0 61 0 61 1 61
2873 tmqueue 80 0 0 wait runok 1 0 61 0 61 1 61
2874 testsv.p 1250 0 0 wait runok 1 0 61 0 61 1 61
2875 testsv.p 1251 0 0 wait runok 1 0 61 0 61 1 61
2876 testsv.p 1252 0 0 wait runok 1 0 61 0 61 1 61
2877 tmrecove 250 0 0 wait runok 1 0 61 0 61 1 61
2878 tpbridge 150 5727 5727 runok runok 1 0 61 0 61 1 61 BrC
2879 tmrecove 9900 5734 5734 runok runok 1 0 61 0 61 1 61
2880 cpmsrv 9999 5735 5735 runok runok 1 0 61 0 61 1 61
2881
2882
2883 --------------------------------------------------------------------------------
2884
Lock groups of the currently running application can be inspected with the
following command:
2887
2888 $ xadmin ppm -3
2889
2890 Example from the second node:
2891
2892 --------------------------------------------------------------------------------
2893
2894 $ xadmin ppm -3
2895 BINARY SRVID PID SVPID PROCGRP PGNO PROCGRPL PGLNO PGNLA
2896 -------- ----- -------- -------- -------- ----- -------- ----- -----
2897 cconfsrv 1 5711 5711 - 0 - 0 0
2898 cconfsrv 2 5712 5712 - 0 - 0 0
2899 tpadmsv 10 5713 5713 - 0 - 0 0
2900 tpadmsv 11 5714 5714 - 0 - 0 0
2901 tpevsrv 20 5715 5715 - 0 - 0 0
2902 exsingle 30 5716 5716 - 0 RM1TMQ 5 5
2903 exsingle 40 5717 5717 - 0 RM1TMQ 5 0
2904 exsingle 41 5718 5718 - 0 RM1TMQ 5 0
2905 exsingle 42 5719 5719 - 0 RM1TMQ 5 0
2906 exsingle 43 5720 5720 - 0 RM1TMQ 5 0
2907 exsingle 44 5721 5721 - 0 RM1TMQ 5 0
2908 exsingle 50 5722 5722 - 0 RM1TMQ 5 0
2909 exsingle 51 5723 5723 - 0 RM1TMQ 5 0
2910 exsingle 52 5724 5724 - 0 RM1TMQ 5 0
2911 exsingle 53 5725 5725 - 0 RM1TMQ 5 0
2912 exsingle 54 5726 5726 - 0 RM1TMQ 5 0
2913 tmsrv 60 0 0 RM1TMQ 5 - 0 0
2914 tmsrv 61 0 0 RM1TMQ 5 - 0 0
2915 tmsrv 62 0 0 RM1TMQ 5 - 0 0
2916 tmqueue 80 0 0 RM1TMQ 5 - 0 0
2917 testsv.p 1250 0 0 RM1TMQ 5 - 0 0
2918 testsv.p 1251 0 0 RM1TMQ 5 - 0 0
2919 testsv.p 1252 0 0 RM1TMQ 5 - 0 0
2920 tmrecove 250 0 0 RM1TMQ 5 - 0 0
2921 tpbridge 150 5727 5727 - 0 - 0 0
2922 tmrecove 9900 5734 5734 - 0 - 0 0
2923 cpmsrv 9999 5735 5735 - 0 - 0 0
2924
2925 --------------------------------------------------------------------------------
2926
Where the columns are:
2928
2929 - *PROCGRP* - Process group name set in <procgrp> for the server.
2930
2931 - *PGNO* - Process group number.
2932
- *PROCGRPL* - Group name for which this process is the lock provider.

- *PGLNO* - Group number for which this process is the lock provider.

- *PGNLA* - Group number which the lock provider reports as currently being
served.
2938
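For scripting, the whitespace-delimited `xadmin ppm -3` output can be parsed
into records. The following Python sketch is a hypothetical helper (not part
of xadmin); it assumes that column values never contain spaces, which holds
for the sample output above:

```python
# Column names as printed by "xadmin ppm -3" (see the sample output)
PPM3_COLS = ["BINARY", "SRVID", "PID", "SVPID", "PROCGRP",
             "PGNO", "PROCGRPL", "PGLNO", "PGNLA"]

def parse_ppm3(text):
    """Parse "xadmin ppm -3" output into a list of dicts (illustrative only)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(PPM3_COLS):
            continue                      # blank or malformed line
        if parts[0] == "BINARY" or set(parts[0]) == {"-"}:
            continue                      # header or ruler line
        rows.append(dict(zip(PPM3_COLS, parts)))
    return rows
```

Such a helper makes it easy, for example, to list all servers belonging to a
given singleton group from a shell script.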
2939
2940 === Group-based process startup, stopping and restart
2941
Enduro/X additionally provides a set of commands for controlling processes in
process groups (including singleton process groups):
2948
2949 *XATMI servers:*
2950
2951 - $ xadmin start -g <group name>
2952
2953 - $ xadmin stop -g <group name>
2954
2955 - $ xadmin sreload -g <group name>
2956
2957 - $ xadmin restart -g <group name>
2958
2959
2960 *Client processes:*
2961
2962 - $ xadmin bc -g <group name>
2963
2964 - $ xadmin sc -g <group name>
2965
2966 - $ xadmin rc -g <group name>
2967
2968
2969 === System maintenance
2970
To avoid initiating a failover (where the healthy node takes over the failed
node's lock and boots the group), Enduro/X provides the following commands:
2973
2974 - xadmin mmon
2975
2976 - xadmin mmoff
2977
The *mmon* command enables maintenance mode on the node where the command is
executed. While maintenance mode is on, the node will not take new locks or
perform failover. This is useful when a planned shutdown of the currently
active node is to be performed and failover to the other working nodes is not
wanted.
2982
The *mmoff* command puts the node back into normal operation.
2984
The current state of the maintenance mode can be checked with:
2987 - $ xadmin shmcfg
2988
For example:
2990
2991 --------------------------------------------------------------------------------
2992
2993 $ xadmin shmcfg
2994
2995 shmcfgver_lcf = 0
2996 use_ddr = 0
2997 ddr_page = 0
2998 ddr_ver1 = 0
2999 is_mmon = 1
3000
3001 --------------------------------------------------------------------------------
3002
indicates that maintenance mode is enabled (*is_mmon* is *1*). The setting is
stored in shared memory and can be enabled before the given node is started.

At full Enduro/X shutdown, if the flag was set, it is reset back to *0*.
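When automating maintenance windows, the `is_mmon` flag can be read by parsing
the `key = value` lines printed by `xadmin shmcfg`. The following Python
sketch is an illustrative helper (not part of Enduro/X), assuming the output
format shown above:

```python
def parse_shmcfg(text):
    """Parse "key = value" lines (as printed by "xadmin shmcfg") into a dict."""
    cfg = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # most shmcfg values are integers; keep anything else as a string
        try:
            cfg[key.strip()] = int(value)
        except ValueError:
            cfg[key.strip()] = value
    return cfg

def in_maintenance(text):
    """True if the node reports maintenance mode enabled (is_mmon = 1)."""
    return parse_shmcfg(text).get("is_mmon", 0) == 1
```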
3007
3008 To check singleton group statuses on the given machine, the following command may be used:
3009
3010 - xadmin psg
3011
3012 Example:
3013
3014 --------------------------------------------------------------------------------
3015 $ xadmin psg
3016
3017 SGID LCKD MMON SBOOT CBOOT LPSRVID LPPID LPPROCNM REFRESH RSN FLAGS
3018 ---- ---- ---- ----- ----- ------- -------- ---------------- -------- --- -----
3019 5 Y N Y Y 10 24642038 exsinglesv 1s 0 iv
3020
3021 --------------------------------------------------------------------------------
3022
Where the columns are:
3024
3025 *SGID* - Singleton Process Group id/number. Set in *<endurox>/<procgroups>/<procgroup>*
3026 attribute *grpno*.
3027
3028 *LCKD* - Is group locked locally. *Y* - locked, *N* - not.
3029
3030 *MMON* - RFU, always *N*.
3031
3032 *SBOOT* - Is server boot completed. *Y* - yes, *N* - no.
3033
*CBOOT* - Is client boot (in the <clients> section) completed. *Y* - yes, *N* - no.
3035
3036 *LPSRVID* - *srvid* of the lock provider server.
3037
*LPPID* - Lock provider PID.
3039
3040 *LPPROCNM* - Lock provider process name.
3041
*REFRESH* - Time since the last lock provider refresh (i.e. the group is
locked and lock verification has passed).
3044
*RSN* - Reason for lock loss. May take the following values: *1* - expired by
missing refresh, *2* - PID of lock holder missing, *5* - normal shutdown,
*6* - locking error (by exsinglesv), *7* - corrupted structures, *8* - locked
by network response (the other node has got the lock), *9* - network sequence
ahead of ours (>=), *10* - the other node's sequence in the heartbeat file
(lockfile_2) is greater than or equal to (>=) our lock sequence.
3050
*FLAGS* - *n* - no-order boot enabled (i.e. *noorder* set to *Y* for the
group), *i* - singleton group configured (in use), *v* - group verification
enabled (i.e. the group's *sg_nodes_verify* setting set to *Y*).
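For monitoring scripts, the *RSN* and *FLAGS* codes above can be mapped to
readable text with a small lookup table. The following Python sketch is purely
illustrative; the meaning of code *0* (seen in the sample output) is an
assumption, since the manual lists only the codes above:

```python
# Reason codes from the RSN column of "xadmin psg"
RSN_CODES = {
    0: "no lock loss recorded",  # assumption: 0 appears when no loss occurred
    1: "expired by missing refresh",
    2: "PID of lock holder missing",
    5: "normal shutdown",
    6: "locking error (by exsinglesv)",
    7: "corrupted structures",
    8: "locked by network response (other node has got the lock)",
    9: "network sequence ahead of ours (>=)",
    10: "other node's sequence in heartbeat file (lockfile_2) >= our lock sequence",
}

# Flag characters from the FLAGS column of "xadmin psg"
FLAG_CHARS = {
    "n": "no-order boot enabled (noorder=Y)",
    "i": "singleton group configured (in use)",
    "v": "group verification enabled (sg_nodes_verify=Y)",
}

def explain_psg(rsn, flags):
    """Return (reason text, list of flag meanings) for one "xadmin psg" row."""
    reason = RSN_CODES.get(rsn, "unknown reason code")
    meanings = [FLAG_CHARS[c] for c in flags if c in FLAG_CHARS]
    return reason, meanings
```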
3054
3055 == Standard configuration strings
3056
Enduro/X uses a standard configuration string format in several places. Such a
configuration string may encode *key=value* data, and plain *key* entries
(keywords) may also appear in the string. The separator between entries is any
combination of <space>, <tab> and <comma> characters. A value may be quoted
with single or double quotes. If a value contains the quote character that
opens it, that quote shall be escaped with a backslash; the quote character of
the other type does not need escaping. A backslash itself can also be escaped.
3064
3065 Examples:
3066
3067 --------------------------------------------------------------------------------
3068
3069 # key/value based:
3070
3071 - somekey1='value1, "other value"', somekey2=some_value somekey3="some value \" \\3"
3072
3073 which will have:
3074
3075 - [somekey1] with value [value1, "other value"]
3076 - [somekey2] with value [some_value]
3077 - [somekey3] with value [some value " \3]
3078
3079 # key based:
3080
3081 - keyword1, keyword2,,,, keyword3 keyword4
3082
3083 which will have:
3084
3085 - [keyword1]
3086 - [keyword2]
3087 - [keyword3]
3088 - [keyword4]
3089
3090 --------------------------------------------------------------------------------
3091
This format is not used everywhere in Enduro/X, but where it is used, the
documentation refers to it as the "standard configuration string".
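The quoting and escaping rules above can be made concrete with a small parser.
The following Python function is an illustrative reimplementation of the
format (the authoritative parser lives inside Enduro/X itself); keywords
without a value are stored as `None`:

```python
def parse_std_config(s):
    """Parse an Enduro/X "standard configuration string" into a dict."""
    items = {}
    i, n = 0, len(s)
    seps = " \t,"
    while i < n:
        if s[i] in seps:                  # skip any run of separators
            i += 1
            continue
        j = i                             # read key up to '=', separator or end
        while j < n and s[j] not in seps and s[j] != "=":
            j += 1
        key = s[i:j]
        if j < n and s[j] == "=":
            j += 1
            if j < n and s[j] in "'\"":   # quoted value
                quote = s[j]
                j += 1
                buf = []
                while j < n:
                    c = s[j]
                    if c == "\\" and j + 1 < n and s[j + 1] in (quote, "\\"):
                        buf.append(s[j + 1])   # unescape \<quote> or \\
                        j += 2
                    elif c == quote:
                        j += 1
                        break
                    else:
                        buf.append(c)
                        j += 1
                items[key] = "".join(buf)
            else:                         # unquoted value, up to next separator
                k = j
                while k < n and s[k] not in seps:
                    k += 1
                items[key] = s[j:k]
                j = k
        else:
            items[key] = None             # keyword without a value
        i = j
    return items
```

Running it on the examples above reproduces the listed results, e.g.
`somekey3` decodes to `some value " \3`.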
3094
== Troubleshooting
3096
3097 This section lists common issues and their resolutions for runtime operations.
3098
3099 === EDB_READERS_FULL: Environment maxreaders limit reached errors
3100
The *EDB_READERS_FULL* error may appear if the number of processes using LMDB
data access (as part of the UBF DB or Enduro/X Smart Cache) reaches the
readers limit (1000 by default). If more than this number of processes access
the LMDB database, the *EDB_READERS_FULL* error is generated.
3105
To solve this issue, the *max_readers* parameter must be adjusted in the Enduro/X ini files.
3107
If the error is related to the UBF DB (dynamic UBF field table):
3109
3110 --------------------------------------------------------------------------------
3111
3112 [@ubfdb[/CCTAG]]
3113 ...
3114 # Increase the MAX_READERS, for example 5000
3115 max_readers=5000
3116 resource=DATABASE_DIRECTORY
3117 ...
3118
3119 --------------------------------------------------------------------------------
3120
If the error is related to the Enduro/X Smart Cache:
3122
3123 --------------------------------------------------------------------------------
3124
3125 [@cachedb/DBNAME_SECTION]
3126
3127 # Increase the MAX_READERS, for example 5000
3128 max_readers=5000
3129 resource=DATABASE_DIRECTORY
3130
3131 --------------------------------------------------------------------------------
3132
The target value of *max_readers* depends on the number of processes actually
using the resource, so some estimation must be made to set the value correctly.
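As a rough illustration of such an estimate (this formula is an assumption,
not an official Enduro/X rule), one could take the expected process count and
add headroom, keeping the default of 1000 as a lower bound:

```python
def suggest_max_readers(process_count, headroom=1.5, floor=1000):
    """Suggest a max_readers value: the expected number of LMDB-using
    processes multiplied by a headroom factor, never below the built-in
    default of 1000. Illustrative heuristic only."""
    return max(floor, int(process_count * headroom))
```

For example, around 3500 processes with 50% headroom would suggest
*max_readers=5250*.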
3135
3136 To activate the parameter:
3137
. Update the configuration (as described above).

. Stop the Enduro/X application.

. Remove the contents of the resource directory 'DATABASE_DIRECTORY' (the path
pointed to by *resource*).

. Start the Enduro/X application. The new readers count shall be effective.
3146
3147 [bibliography]
3148
3149 == Additional documentation
3150
3151 This section lists additional related documents.
3152
3153 [bibliography]
3154 .Related documents
3155
3156 - [[[EX_OVERVIEW]]] ex_overview(guides)
3157 - [[[MQ_OVERVIEW]]] 'man 7 mq_overview'
3158 - [[[EX_ENV]]] ex_env(5)
3159 - [[[NDRXCONFIG]]] ndrxconfig.xml(5)
3160 - [[[DEBUGCONF]]] ndrxdebug.conf(5)
3161 - [[[XADMIN]]] xadmin(8)
3162 - [[[TPBRIDGE]]] tpbridge(8)
3163
3164 [glossary]
3165
3166 == Glossary
3167
This section lists specific keywords used in the document.
3169
3170 [glossary]
3171 ATMI::
3172 Application Transaction Monitor Interface
3173
3174 UBF::
Unified Buffer Format, an API similar to Tuxedo's FML
3176
3177
3178 ////////////////////////////////////////////////////////////////
3179 The index is normally left completely empty, it's contents being
3180 generated automatically by the DocBook toolchain.
3181 ////////////////////////////////////////////////////////////////