October 1995

Advanced Monitoring and Tuning

by Adrian Cockcroft


Welcome Back

Last month, I described some basic performance rules that can be used to monitor the behavior of a computer system. I also made available a performance toolkit that includes an implementation of the rules, and a GUI front end called ruletool.se. We also held a competition to give ruletool.se a better name; the winning name will be announced in a future column. This month, I will take a closer look at the virtual_adrian.se script and explain how the SE performance toolkit works so you can build your own scripts.


Look Out, There's a Guru About

Sysadmin: “What's wrong with this system? Can you tune it, please? Here's the root password.”

Performance Guru: “I've tuned a few kernel variables, but it looks pretty idle to me right now.”

Sysadmin: “Yes, but it gets really, really busy some of the time...”

Performance Guru: “You should start collecting some data. Call me again when you have some logs of when it is busy.”

Have you ever found that your most troublesome system is idle, or behaves perfectly, when you call in the consultants and performance gurus? Then, when you collect data that shows the problem, the guru wants some extra data that you didn't bother to collect.

The SE toolkit provides a way to capture and distribute the expertise of a performance specialist. I have used the toolkit to build a personalized performance monitoring and tuning tool called virtual_adrian.se. It starts up when the system boots, does some simple tuning, and every 30 seconds, it takes a look at the state of the system. When run with superuser permissions, it can monitor more of the system than ruletool.se. It can also make immediate changes to some kernel tunables.

When you run virtual_adrian.se, it keeps quiet until it sees a problem. Then it complains and shows the data that caused the complaint. For example, if it keeps complaining about a slow disk, then you should find a way to reduce the load on that disk. Unlike a real performance guru, the virtual guru can look at lots of data very quickly, and never gets bored or falls asleep. Most performance problems occur over and over again. The script picks out a large number of “obvious” problems. (Obvious to a guru :-).

If you disagree with my rules (which are a bit crude in places), or want to add your own rules, you can easily modify the script and rename it virtual_kimberley.se or whatever.


The virtual_adrian.se Performance Tuner and Monitor

The core of virtual_adrian.se is the same set of rules that are used by ruletool.se. There is an initial check and tune of the system, and a few extra rules that require superuser permissions. If you run virtual_adrian.se as a regular user, it runs with functionality equivalent to ruletool.se writing to a logfile.

Check and Tune the System

I'm always asked for a magic bullet - the secret tweak to a kernel variable that will miraculously make the whole machine run much faster. I'm sorry, but tuning the kernel should be the last thing you try. It is very rare for this kind of change to make a difference that can be reliably measured at all. That said, there are a small number of common kernel tuning variables that I like to tune. I generally bring older releases in line with later ones, so most tweaks are for Solaris 2.3 and are not needed for Solaris 2.4 or 2.5.

The usual method is to set variables in /etc/system. The problem is that these variables may be incorrectly set due to folklore. They may also be unnecessary or harmful in later releases, and the /etc/system file is often propagated intact to very different systems. Deciding when a tweak is useful and what the value should be for a particular release is complex. Your motto should be, 'If in doubt, do without!'

My solution is to remove everything from /etc/system that doesn't have a big comment that explains why it is there and how its value was derived. The well-known maxusers tunable automatically scales as you add memory, so you don't need to set it unless you have several GB of RAM and are short on kernel memory. I implemented a check-and-tune routine in virtual_adrian.se that checks and tunes some values directly. It tells you what it does, and tells you to set variables in /etc/system for the next reboot if they cannot be set online. The check is performed only once, at start-up, and is implemented by a routine called static_check(). Note that my script checks the current value of each tunable. Please do not blindly set all these values in your own /etc/system file, as you may decrease something that you thought you were increasing.
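The warning about blindly setting values can be made concrete with a small sketch. It is written in plain C rather than SE, and the helper name tune_at_least(), the enum, and the calling convention are assumptions made for illustration; none of them is taken from virtual_adrian.se. The idea shown is only this: compare the current value with the recommended one first, and never lower a value that is already larger.

```c
#include <assert.h>

/* Outcome of checking one tunable; these names are hypothetical. */
enum tune_action { TUNE_LEAVE_ALONE, TUNE_RAISE };

/*
 * Compare the current value of a tunable with a recommended minimum.
 * If the current value is already at least the recommendation, it is
 * left alone, so a larger value is never reduced by accident.
 * On return, *new_value holds the value that should be in effect.
 */
enum tune_action tune_at_least(long current, long recommended, long *new_value)
{
    assert(new_value != 0);
    if (current >= recommended) {
        *new_value = current;
        return TUNE_LEAVE_ALONE;
    }
    *new_value = recommended;
    return TUNE_RAISE;
}
```

With a current value of 8000 and a recommendation of 5000, this sketch leaves the value at 8000, which is the case that a copied /etc/system entry would get wrong.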

The tuning actions performed for Solaris 2.3 are:

Patched Solaris 2.3 kernels (patch 101318-45 or later) include the Solaris 2.4 inode cache algorithm. With such a kernel, it is not necessary to make the inode cache limit so big, but it doesn't hurt. (I don't have code to figure out the patch level!)

The tuning actions performed for Solaris 2.4 and 2.5 are:

File System Flush Process Monitor

The fsflush process is responsible for flushing modified file system data to disk. It usually flushes anything that has been resident for 30 seconds (autoup), and runs every five seconds (tune_t_fsflushr). Fsflush checks each page of memory in turn, so on systems with a lot of memory it can waste CPU time. If you are running a raw disk resident database, fsflush does not need to run often. In virtual_adrian.se, a configurable process monitor watches a single process and complains if it uses too much CPU. By default, process id 3 (fsflush) is monitored, and a complaint occurs if it takes more than five percent of one CPU.
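The arithmetic behind such a process monitor is simple enough to sketch. The plain C below (not SE) assumes that the accumulated CPU time of the watched process, in seconds, has been sampled at the start and end of an interval; how those samples are obtained is left out, and the function names are hypothetical. Only the five percent limit comes from the text.

```c
#include <assert.h>

/* Default limit from the text: five percent of one CPU. */
#define MAX_CPU_PERCENT 5.0

/*
 * Percentage of one CPU used by a process over an interval, given its
 * accumulated CPU time (in seconds) at the start and end of the interval.
 */
double cpu_percent(double cpu_secs_before, double cpu_secs_after,
                   double interval_secs)
{
    assert(interval_secs > 0.0);
    return 100.0 * (cpu_secs_after - cpu_secs_before) / interval_secs;
}

/* Returns 1 if the watched process should trigger a complaint, else 0. */
int process_too_busy(double percent)
{
    return percent > MAX_CPU_PERCENT;
}
```

For example, a process that accumulates 3 seconds of CPU time during a 30-second interval has used 10 percent of one CPU, which is over the limit.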

NFS Mount Point Service Time Monitor

The NFS protocol maintains retransmit timers for client-side requests when run over the UDP/IP protocol. These can be viewed with nfsstat -m on the NFS clients. Note that Solaris 2.5 supports NFS over TCP/IP; in this case the timers are not needed and are reported as zero. The timers can only be read when running as root, which is the main reason this check is not part of ruletool.se. I check that the overall smoothed round trip time (srtt for All:) is under 50ms. This is similar in concept to checking that disk I/O service time is better than 50ms in the disk rule.

Figure 1 Example output from nfsstat -m

-------------------------------------------------------------------------
/home/username from server:/export/home3/username
 Flags:   vers=2,hard,intr,down,dynamic,rsize=8192,wsize=8192,retrans=5
 Lookups: srtt=7 (17ms), dev=4 (20ms), cur=2 (40ms)
 Reads:   srtt=16 (40ms), dev=8 (40ms), cur=6 (120ms)
 Writes:  srtt=19 (47ms), dev=3 (15ms), cur=6 (120ms)
 All:     srtt=15 (37ms), dev=8 (40ms), cur=5 (100ms)
-------------------------------------------------------------------------
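As an illustration of the check described above, the following plain C sketch (not SE) pulls the millisecond value out of an "All:" line in the format of Figure 1 and compares it with the 50ms limit. It assumes the line is already available as text; the real script reads the timers directly rather than parsing nfsstat output, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Threshold from the text: overall smoothed round trip time under 50ms. */
#define NFS_SRTT_LIMIT_MS 50

/*
 * Extract the smoothed round trip time in milliseconds from the "All:"
 * line of nfsstat -m output. Returns the value in ms, or -1 if the line
 * is not an "All:" line in the format shown in Figure 1.
 */
int nfs_all_srtt_ms(const char *line)
{
    int raw, ms;

    assert(line != NULL);
    if (sscanf(line, " All: srtt=%d (%dms)", &raw, &ms) == 2)
        return ms;
    return -1;
}

/*
 * Returns 1 if the mount point looks slow. Returns 0 if it is within
 * the limit or if the line could not be parsed.
 */
int nfs_mount_slow(const char *line)
{
    return nfs_all_srtt_ms(line) >= NFS_SRTT_LIMIT_MS;
}
```

The "All:" line in Figure 1 reports 37ms, so this sketch would not complain about that mount point. A mount over TCP/IP, which reports zero, would also pass.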

Sample Output from virtual_adrian.se

When you install the SE toolkit, you are asked if you want to set up virtual_adrian.se to run automatically on every bootup. If you say yes, output is sent to the console and logged to /var/adm/sa/monitor.log. Let's take a look at the output collected from running it manually.

-------------------------------------------------------------------------
% virtual_adrian.se
Warning: Cannot init kvm: Permission denied
Warning: Kvm not initialized: Variables will have invalid values
Adrian is monitoring your system starting at: Thu Sep 21 00:37:26 1995

Warning: Cannot get info for pid 3.
superuser permissions are needed to access every process
Using predefined rules for disk, net, rpcc, swap, ram, kmem,
cpu, mutex, dnlc and inode

Checking the system every 30 seconds...
-------------------------------------------------------------------------

The warning about kvm (the raw kernel data interface) can be ignored, as the script does not attempt to use any kvm data unless it runs as root. When you run it as root, you see that the extra rules are configured.

-------------------------------------------------------------------------
# /opt/RICHPse/examples/virtual_adrian.se
Adrian is monitoring your system starting at: Thu Sep 21 00:46:01 1995

Process watcher pid set to 3, process name fsflush, max CPU usage  5.0%
NFS client threshold set at All: srtt=20 (50ms) max NFS round trip
Minimum client NFS ops/sec considered active 2.00/s
Using predefined rules for disk, net, rpcc, swap, ram, kmem,
cpu, mutex, dnlc and inode

Checking the system every 30 seconds...
-------------------------------------------------------------------------

I ran the command find / -ls >/dev/null, which makes the disk and the name caches rather busy, and got this output on a 32MB SPARCstation IPX.

-------------------------------------------------------------------------
Adrian detected slow disk(s): Thu Sep 21 00:48:33 1995
Move load from busy disks to idle disks
State  disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b  delay
red    c0t3d0   23.9  5.6   81.1   44.5  2.6  1.3  130.7  17  78 3856.1
amber  c0t5d0    1.3  1.9    7.1   35.2  0.0  0.1   30.3   0   6   94.9

Adrian detected Directory Name Cache problem (amber): Thu Sep 21 00:48:33 1995
Poor DNLC hitrate, increase ncsize
DNLC hitrate 44.1%, reference rate 125.50/s
DNLC has 617 entries, try increasing it (and inode cache) to 1234

Adrian detected Inode Cache problem (amber): Thu Sep 21 00:48:33 1995
Poor inode cache hitrate, increase ufs_ninode
Inode hitrate 26.9%, reference rate 64.63/s

Adrian detected RAM shortage (amber): Thu Sep 21 00:49:05 1995
The system is getting short on RAM, perhaps add some more
  procs        memory         page             faults              cpu
 r  b  w    swap    free    pi  po  sr   in    sy    cs  smtx us sy wt id
 0  0  0   43940     700     8   2  55  342   842   205    11 23 18 20 39
-------------------------------------------------------------------------

As you can see, the output is somewhat verbose, and is based on extended versions of familiar command output where appropriate.
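The numbers in the DNLC complaint above can be reproduced with a little arithmetic. The plain C sketch below (not SE) computes a hit rate from hit and miss counts and doubles the cache size. The doubling is inferred from the sample output (617 entries, suggestion 1234), not from the script's source, and the function names are hypothetical.

```c
#include <assert.h>

/* Cache hit rate as a percentage of all references (hits plus misses). */
double hit_rate_percent(unsigned long hits, unsigned long misses)
{
    unsigned long refs = hits + misses;

    assert(refs > 0);
    return 100.0 * (double)hits / (double)refs;
}

/*
 * Suggested new cache size: double the current number of entries,
 * matching the sample output (617 -> 1234). This is an assumption.
 */
unsigned long suggested_cache_size(unsigned long current_entries)
{
    return 2 * current_entries;
}
```

For instance, 441 hits out of 1000 references gives the 44.1% hit rate seen in the sample output.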

The whole point of the SE toolkit is that you can customize the tools very easily. To encourage you to do this, the rest of this column is a tour of the SE language and toolkit classes.


Features of the SE Language

The SE language is based on a subset of the C language. It is much easier to learn a dialect of an existing language than it is to learn a brand new language.

A typical script includes a whole bunch of header files that define the functions and classes to be used, then handles command line arguments and loops, reading a class variable and printing out the values.

Iostat Written in SE

The SE language comes with a collection of scripts that clone the common UNIX performance utilities, like iostat and vmstat. The code is very simple and easy to understand as shown in Figure 2 below.

Figure 2 Code for iostat.se

-------------------------------------------------------------------------
#! /opt/RICHPse/bin/se
 
#include <stdio.se>
#include <stdlib.se>
#include <unistd.se>
#include <string.se>
#include <kstat.se>
#include <sysdepend.se>
#include <p_iostat_class.se>
#include <dirent.se>
#include <inst_to_path_class.se>

#define SAMPLE_INTERVAL   5

main(int argc, string argv[2])
{
  p_iostat p_iostat$disk;
  p_iostat tmp_disk;
  int i;
  int interval = SAMPLE_INTERVAL;
  int ndisks;

  switch(argc) {
  case 1:
    break;
  case 2:
    interval = atoi(argv[1]);
    break;
  default:
    printf("use: %s [interval]\n", argv[0]);
    exit(1);
  }
  ndisks = p_iostat$disk.disk_count;
  for(;;) {
    sleep(interval);
    printf("extended disk statistics\n");
    printf("disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %%w  %%b\n");
    for(i=0; i < ndisks; i++) {
      p_iostat$disk.number$ = i;
      tmp_disk = p_iostat$disk;
      printf("%-8.8s %4.1f %4.1f %6.1f %6.1f %4.1f %4.1f %6.1f %3.0f %3.0f\n",
        tmp_disk.name$,
        tmp_disk.reads, tmp_disk.writes,
        tmp_disk.kreads, tmp_disk.kwrites,
        tmp_disk.avg_wait, tmp_disk.avg_run,
        tmp_disk.service,
        tmp_disk.wait_percent, tmp_disk.run_percent);
    }
  }
}
-------------------------------------------------------------------------

To illustrate the language, let's walk through this script:

  1. The first line tells the shell where to find the interpreter for this script.
  2. The #include lines bring in the definitions, and the code that does the real work.
  3. The definitions in main() define two copies of the per-disk iostat class. One has the special prefix “p_iostat$” that matches the function name defined in the class. This is an active variable. By convention, dollar signs are embedded in active variables to indicate that they are special. The other is just a regular data structure of the same storage type. It is used to hold a temporary snapshot copy of the data. Internally, a snapshot of the kernel's disk counters is made as the class initializes itself.
  4. Simple command line processing allows the measurement interval to be specified.
  5. The first read of the class picks out how many disks there are.
  6. The script loops forever until it is interrupted. It sleeps for a few seconds to start with, so the subsequent measurement is made over the specified interval.
  7. After printing the headers, the script loops through the disks. For each disk, you must write the index into the class. The index tells the class which disk you want data for.
  8. A structure-to-structure copy takes a snapshot of all the data for a single disk. Before the data is read from the class the kernel counters are read, differenced from the previous measurements, and processed into floating point values.
  9. All that remains is to print out the data in the right format.

The output matches the regular iostat -x command.

-------------------------------------------------------------------------
% iostat.se
extended disk statistics
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b
sd3       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd5       0.0  4.6    0.0   29.2  0.0  0.1   18.3   1   5
-------------------------------------------------------------------------
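Step 8 of the walkthrough says that the kernel counters are differenced from the previous measurements and turned into floating point values. The plain C sketch below (not SE) shows that idea for the first four columns. The structure layouts and names are hypothetical and much simpler than the real kernel data; only the differencing technique is the point.

```c
#include <assert.h>

/* Hypothetical raw counters for one disk; not the real kernel layout. */
struct disk_counters {
    unsigned long reads;          /* completed read operations */
    unsigned long writes;         /* completed write operations */
    unsigned long bytes_read;
    unsigned long bytes_written;
};

/* Per-second rates, like the first four columns of iostat -x. */
struct disk_rates {
    double reads_per_sec;
    double writes_per_sec;
    double kreads_per_sec;
    double kwrites_per_sec;
};

/* Difference two counter snapshots taken interval_secs apart. */
struct disk_rates diff_disk_counters(const struct disk_counters *prev,
                                     const struct disk_counters *cur,
                                     double interval_secs)
{
    struct disk_rates r;

    assert(prev != 0 && cur != 0);
    assert(interval_secs > 0.0);
    r.reads_per_sec = (double)(cur->reads - prev->reads) / interval_secs;
    r.writes_per_sec = (double)(cur->writes - prev->writes) / interval_secs;
    r.kreads_per_sec =
        (double)(cur->bytes_read - prev->bytes_read) / 1024.0 / interval_secs;
    r.kwrites_per_sec =
        (double)(cur->bytes_written - prev->bytes_written) / 1024.0 / interval_secs;
    return r;
}
```

As a check against the sample output, 23 writes totalling 149504 bytes over a 5-second interval work out to 4.6 w/s and 29.2 Kw/s, the figures shown for sd5.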


Rule Construction

Let's take a look at a typical rule to see how it is implemented. You will see how easy it is to define and use your own performance rules. In just a few hours, you can build a complete customized virtual guru of your own. If you would like to share your work, send it to se-feedback@chessie.eng.sun.com, and we will try to add it to the next release of the SE Performance Toolkit.

States and Actions

I decided to extend the usual red/amber/green conditions that most tools implement. I wanted to indicate a few extra conditions that sometimes occur:

Each state is associated with an action, indicating the nature of the problem and, if possible, what to do about it.

Rules as Objects

I used the object oriented class mechanism of the language to implement the rules. This gives several benefits:

Pure Rule Objects

I built what I called pure rules. These are provided with input data to evaluate by writing to elements of the class. They are evaluated by reading the rule and examining both the state code and action string that are always provided as elements of the class. These rules make no assumptions about where the data came from. They could be given simulated data, historical data samples, or live data.

Live Rule Objects

The next set of rule classes produced were called live rules. These package up the code needed to:

A live rule is completely trivial to use in a script. It is self initializing and self updating. You just declare an instance of the class, and each time you read it, you get the current state and action string.

-------------------------------------------------------------------------
lr_disk_t lr_disk$dr;
lr_disk_t tmp_dr;
/* use the live disk rule */
tmp_dr = lr_disk$dr;
if ( tmp_dr.state > ST_GREEN) {
	printf("The disks are in the %s state: %s\n",
		state_string(tmp_dr.state), tmp_dr.action);
}
-------------------------------------------------------------------------


The CPU Power Rule

This is a relatively simple rule, so it will be explained in detail. The full implementation of all the rules is too complex to explain here, but you can browse the pure_rules.se and live_rules.se header files.

Rule Definition

Each rule was initially defined in terms of the output of standard system commands. For example, the shorthand vmstat30.r means: Run the command vmstat with a 30-second interval, and look at the column labelled r.

The number of CPUs on the system must be known. It is referred to as ncpus in the rules. The run queue length is divided by the number of CPUs. This is based on the assumption that every CPU takes a job off the run queue in each time slice.

Table 1 CPU Rule

-------------------------------------------------------------------------
CPU RULE                              LEVEL    ACTION
-------------------------------------------------------------------------
0 == vmstat30.r                       White    1. CPU Idle
0 < (vmstat30.r / ncpus) < 3.0        Green    No Problem
3.0 <= (vmstat30.r / ncpus) < 5.0     Amber    2. CPU Busy
5.0 <= (vmstat30.r / ncpus)           Red      2. CPU Busy
-------------------------------------------------------------------------

  1. CPU Idle - The CPU power of this system is underutilized. Fewer or less powerful CPUs could do this job.

  2. CPU Busy - There is insufficient CPU power, and jobs spend an increasing amount of time in the queue before being assigned to a CPU. This reduces throughput and increases interactive response times.

CPU Pure Rule Code

Each pure rule has a series of threshold values and defaults. They can be set via environment variables. To describe each threshold, an initialized data structure is declared. This data structure is used by a number of standard functions that get or print the threshold. Each structure contains the environment variable name, the default value, the units of the value (e.g. pages or milliseconds), precision values for formatting the value with printf, and a descriptive string. Thresholds can be integer, double precision floating point, or string types. The code is shown in Figure 3 below.

Figure 3 Code for Pure CPU Rule

-------------------------------------------------------------------------
rule_thresh_dbl cpu_runq_idle = {"RUNQ_IDLE", 0.0, "",
	4, 1, "Spare CPU capacity" };
rule_thresh_dbl cpu_runq_busy = {"RUNQ_BUSY", 3.0, "",
	4, 1, "OK up to this level" };
rule_thresh_dbl cpu_runq_overload = {"RUNQ_OVERLOAD",
	5.0, "", 4, 1, "Warning up to this level" };
 
print_pr_cpu(ulong file) {
    print_thresh_dbl(file, cpu_runq_idle);
    print_thresh_dbl(file, cpu_runq_busy);
    print_thresh_dbl(file, cpu_runq_overload);
}

class pr_cpu_t {
  /* output variables */
  int state;
  string action;
  /* input variables */
  ulong timestamp;
  int runque; /* i.e. p_vmstat.runque load level */
  int ncpus; /* i.e. sysconf(_SC_NPROCESSORS_ONLN) */
  /* threshold variables */
  double cpu_idle;
  double cpu_busy;
  double cpu_overload;

  pr_cpu$()
  {
    double cpu_load;
    ulong lasttime; /* previous timestamp */
    if (timestamp == 0) { /* reset defaults */
      cpu_idle = get_thresh_dbl(cpu_runq_idle);
      cpu_busy = get_thresh_dbl(cpu_runq_busy);
      cpu_overload = get_thresh_dbl(cpu_runq_overload);
      return;
    }
    if (timestamp != lasttime) {
      cpu_load = runque; cpu_load /= ncpus;
      if (cpu_load <= cpu_idle)
        {
        state = ST_WHITE;
        action = "There is more CPU power configured than you need right now";
        }
      else
        {
        if (cpu_load < cpu_busy)
          {
          state = ST_GREEN;
          action = "No problem";
          }
        else
          {
          if (cpu_load < cpu_overload)
            {
            state = ST_AMBER;
            action = "The CPU is quite busy, perhaps add more CPU power";
            }
          else
            {
            state = ST_RED;
            action = "CPU overload, add more power or quit some programs";
            }
          }
        }
      lasttime = timestamp;
    }
  }
};
-------------------------------------------------------------------------

The code defines three threshold structures and a function to print them to a file descriptor. The function is provided for convenient use in scripts.

The CPU rule itself is defined as a class. The first part is just like a regular C structure definition. By convention, output variables for the state and action are always defined. Input variables are used to provide information needed by the rule (runque and ncpus) and time stamp each invocation of the rule. Threshold variables hold the current values of the thresholds.

The final element of the class is the block of code that is first executed when the class is declared and again whenever its data is read. The function name ends in a “$” and must be used as the prefix for the name of any active instances of the class.

Some local variables are defined. By convention, the time stamp is used in two special ways. A zero time stamp executes the rule initialization code. In this case, each threshold is set by a function that uses the value of the corresponding environment variable or, if that variable is not defined, the default. If the time stamp is unchanged from the last invocation of the rule, the code is not executed. This prevents unnecessary evaluations of the rule. I use a one-second resolution time stamp.
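The environment-variable lookup with a default can be sketched in plain C (not SE) as follows. The structure is a cut-down, hypothetical version of the threshold descriptor, keeping only the variable name and the default, and thresh_value() is an illustrative name rather than the toolkit's get_thresh_dbl(). Falling back to the default when the variable does not hold a number is also an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified threshold descriptor: environment variable name and default. */
struct thresh_dbl {
    const char *env_name;
    double default_value;
};

/*
 * Return the threshold value: the environment variable if it is set to
 * something that parses as a number, otherwise the built-in default.
 */
double thresh_value(const struct thresh_dbl *t)
{
    const char *text;
    char *end;
    double value;

    assert(t != NULL && t->env_name != NULL);
    text = getenv(t->env_name);
    if (text == NULL || *text == '\0')
        return t->default_value;      /* variable not defined */
    value = strtod(text, &end);
    if (end == text)
        return t->default_value;      /* not a number: assumed fallback */
    return value;
}
```

With a descriptor of {"RUNQ_BUSY", 3.0}, running a script with RUNQ_BUSY=4.5 in the environment would yield 4.5, and running it without the variable would yield 3.0.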

Finally, the per-CPU run queue length is calculated and compared with the thresholds to determine the state code and action string.

The rule is used by setting the input variables, updating the time stamp, and reading the output variables. This will be shown in the definition of the live rule.
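The time stamp protocol can also be sketched in plain C (not SE). Since C has no active class variables, an explicit function call stands in for reading the rule. The structure, the function name, the evaluations counter, and the numeric ordering of the state codes are assumptions made for illustration; the thresholds are fixed at the defaults from Figure 3 to keep the sketch short.

```c
#include <assert.h>

/* State codes; the ordering is assumed, not taken from the toolkit. */
enum rule_state { ST_WHITE, ST_GREEN, ST_AMBER, ST_RED };

struct pure_cpu_rule {
    /* input variables */
    unsigned long timestamp;
    int runque;
    int ncpus;
    /* output variable */
    enum rule_state state;
    /* internal bookkeeping */
    unsigned long lasttime;        /* previous time stamp */
    double idle, busy, overload;   /* thresholds */
    int evaluations;               /* counts real evaluations, for illustration */
};

/* Stands in for reading the active class variable in SE. */
void pure_cpu_rule_read(struct pure_cpu_rule *r)
{
    double load;

    assert(r != 0);
    if (r->timestamp == 0) {            /* zero time stamp: reset defaults */
        r->idle = 0.0;
        r->busy = 3.0;
        r->overload = 5.0;
        return;
    }
    if (r->timestamp == r->lasttime)    /* unchanged: skip the evaluation */
        return;
    assert(r->ncpus > 0);
    load = (double)r->runque / r->ncpus;
    if (load <= r->idle)
        r->state = ST_WHITE;
    else if (load < r->busy)
        r->state = ST_GREEN;
    else if (load < r->overload)
        r->state = ST_AMBER;
    else
        r->state = ST_RED;
    r->evaluations++;
    r->lasttime = r->timestamp;
}
```

A caller first reads the rule once with a zero time stamp to reset it, then for each sample sets runque and a new time stamp and reads the rule again. Reading twice with the same time stamp leaves the state untouched, even if the inputs have changed in between.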

CPU Live Rule Code

The live rule combines the pure rule with the code needed to read the current values of the required input variables. As shown in Figure 4 below, the definition is simpler than the pure rule. The only data items defined are a state code and action string. Again, the class function name is used as a prefix for active variables. An active instance of the pure rule is defined, along with a temporary copy that just holds the defined data.

The live rule initializes itself when it is first declared, i.e. before the script starts to run, by reading the time, updating the global copy of the vmstat class data, setting the number of CPUs correctly in the pure rule, then resetting the pure rule while initializing the temporary copy. The state code and action string are set up.

The first time the live rule is actually read, it updates the global copy of the vmstat class, then sets the run queue and time stamp in the pure rule, and invokes it by reading it into the temporary copy. The state code and action string are propagated unchanged up to the live rule values.

Figure 4 Code for Live CPU Rule

-------------------------------------------------------------------------
class lr_cpu_t {
  /* output variables */
  int state;
  string action;

  lr_cpu$()
  {
    ulong lasttime = 0; /* previous timestamp */
    ulong timestamp = 0;
    pr_cpu_t pr_cpu$cpu;
    pr_cpu_t tmp_cpu;

    if (timestamp == 0) {
      timestamp = time();
      pvm_update(timestamp);
      pr_cpu$cpu.ncpus = GLOBAL_pvm_ncpus;
      pr_cpu$cpu.timestamp = 0;
      tmp_cpu = pr_cpu$cpu; /* reset pure rule */
      action = uninit;
      state = ST_WHITE;
      lasttime = timestamp;
      return;
    }
    timestamp = time();  
    if (timestamp == lasttime) { 
      return;
    }
    /* use the rule */
    pvm_update(timestamp);
    pr_cpu$cpu.runque = GLOBAL_pvm[0].runque;
    pr_cpu$cpu.timestamp = timestamp;  
    tmp_cpu = pr_cpu$cpu;
    state = tmp_cpu.state;
    action = tmp_cpu.action;
    lasttime = timestamp;   
  }
};
-------------------------------------------------------------------------

This is one of the simplest live rules, but it is used in the same way as the most complex ones. Whenever you want to know what state the CPU is in, simply define an instance of the live rule class, then read it.


The Free Memory Non-Rule

I deliberately avoid using some metrics in rules. They can be side effects of other problems, or metrics that don't have the same meaning in Solaris as in other versions of UNIX. The “free memory” reported by vmstat and sar is one of them. Please ignore it. If you really want to know why, read the piece I wrote for my SunWorld Online Q&A column this month.

That's All Folks!

Thank you for reading to the end of this column. I've completed the introduction to the SE toolkit that I started last month. I will return to the subject of tools in the future.


Next month: How to get highest performance from the least effort.

Send your comments and questions to adrian.cockcroft@sun.com.

Copyright 1994-1998 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303 USA.
All rights reserved.