October 1995

Advanced Monitoring and Tuning

Last month, I described some basic performance rules that can be used to monitor the behavior of a computer system. I also made available a performance toolkit that includes an implementation of the rules, and a GUI front end called ruletool.se. We also held a competition to give ruletool.se a better name; the winning name will be announced in a future column. This month, I will take a closer look at the virtual_adrian.se script and explain how the SE performance toolkit works so you can build your own scripts.

Look Out, There's a Guru About

Sysadmin: “What's wrong with this system? Can you tune it, please? Here's the root password.”

Performance Guru: “I've tuned a few kernel variables, but it looks pretty idle to me right now.”

Sysadmin: “Yes, but it gets really, really busy some of the time...”

Performance Guru: “You should start collecting some data. Call me again when you have some logs of when it is busy.”

Have you ever found that your most troublesome system is idle, or behaves perfectly, when you call in the consultants and performance gurus? Then, when you collect data that shows the problem, the guru wants some extra data that you didn't bother to collect.

The SE toolkit provides a way to capture and distribute the expertise of a performance specialist. I have used the toolkit to build a personalized performance monitoring and tuning tool called virtual_adrian.se. It starts up when the system boots, does some simple tuning, and every 30 seconds, it takes a look at the state of the system. When run with superuser permissions, it can monitor more of the system than ruletool.se. It can also make immediate changes to some kernel tunables.

When you run virtual_adrian.se, it keeps quiet until it sees a problem. Then it complains and shows the data that caused the complaint. For example, if it keeps complaining about a slow disk, then you should find a way to reduce the load on that disk. Unlike a real performance guru, the virtual guru can look at lots of data very quickly, and never gets bored or falls asleep. Most performance problems occur over and over again. The script picks out a large number of “obvious” problems. (Obvious to a guru :-).
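The "quiet until it sees a problem" behavior can be pictured with a small Python sketch. This is not the SE code used by virtual_adrian.se; the names run_checks and monitor, and the idea of a check returning either None or a (headline, evidence) pair, are assumptions made for illustration only.

```python
import time

def run_checks(checks):
    """Run every check once; return complaint text only for failing checks."""
    complaints = []
    for check in checks:
        result = check()
        if result is not None:          # stay quiet unless a problem is seen
            headline, evidence = result
            complaints.append(f"{headline}\n{evidence}")
    return complaints

def monitor(checks, interval=30, cycles=None, sleep=time.sleep, out=print):
    """Poll forever (or for `cycles` passes), printing only complaints."""
    done = 0
    while cycles is None or done < cycles:
        for text in run_checks(checks):
            out(text)                   # show the data behind the complaint
        done += 1
        if cycles is None or done < cycles:
            sleep(interval)             # wait for the next sample period
```

Passing cycles, sleep, and out as parameters is only a convenience so that the loop can be exercised without waiting 30 seconds per pass.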

If you disagree with my rules (which are a bit crude in places), or want to add your own rules, you can easily modify the script and rename it virtual_kimberley.se or whatever.

The `virtual_adrian.se` Performance Tuner and Monitor

The core of virtual_adrian.se is the same set of rules that are used by ruletool.se. There is an initial check and tune of the system, and a few extra rules that require superuser permissions. If you run virtual_adrian.se as a regular user, it runs with functionality equivalent to ruletool.se writing to a logfile.

-------------------------------------------------------------------------
/home/username from server:/export/home3/username
 Flags:   vers=2,hard,intr,down,dynamic,rsize=8192,wsize=8192,retrans=5
 Lookups: srtt=7 (17ms), dev=4 (20ms), cur=2 (40ms)
 Reads:   srtt=16 (40ms), dev=8 (40ms), cur=6 (120ms)
 Writes:  srtt=19 (47ms), dev=3 (15ms), cur=6 (120ms)
 All:     srtt=15 (37ms), dev=8 (40ms), cur=5 (100ms)
-------------------------------------------------------------------------
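The millisecond figures in parentheses in the listing above appear to be scaled from the raw timer values: srtt by 2.5, dev by 5, and cur by 20, truncated to whole milliseconds. These factors are inferred from the sample numbers alone (for example, srtt=7 gives 17ms and cur=6 gives 120ms), so treat them as an assumption. A short Python sketch of the arithmetic, using a hypothetical helper called to_ms:

```python
# Scale factors inferred from the sample listing; they are an assumption,
# not taken from NFS documentation.
MS_PER_UNIT = {"srtt": 2.5, "dev": 5.0, "cur": 20.0}

def to_ms(field, raw):
    """Convert a raw timer value to whole milliseconds (truncated)."""
    return int(raw * MS_PER_UNIT[field])

# Reproduce the "Writes" line of the listing: srtt=19, dev=3, cur=6
writes = {field: to_ms(field, raw)
          for field, raw in (("srtt", 19), ("dev", 3), ("cur", 6))}
```

On this reading, the threshold shown later in the startup messages, srtt=20 (50ms), follows the same scaling.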

Sample Output from `virtual_adrian.se`

When you install the SE toolkit, you are asked if you want to set up virtual_adrian.se to run automatically on every bootup. If you say yes, output is sent to the console and logged to /var/adm/sa/monitor.log. Let's take a look at the output collected from running it manually.

-------------------------------------------------------------------------
% virtual_adrian.se
Warning: Cannot init kvm: Permission denied
Warning: Kvm not initialized: Variables will have invalid values
Adrian is monitoring your system starting at: Thu Sep 21 00:37:26 1995

Warning: Cannot get info for pid 3.
superuser permissions are needed to access every process
Using predefined rules for disk, net, rpcc, swap, ram, kmem,
cpu, mutex, dnlc and inode

Checking the system every 30 seconds...
-------------------------------------------------------------------------

The warning about kvm (the raw kernel data interface) can be ignored, as the script does not attempt to use any kvm data unless it runs as root. When you run it as root you see the extra rules are configured.

-------------------------------------------------------------------------
# /opt/RICHPse/examples/virtual_adrian.se
Adrian is monitoring your system starting at: Thu Sep 21 00:46:01 1995

Process watcher pid set to 3, process name fsflush, max CPU usage  5.0%
NFS client threshold set at All: srtt=20 (50ms) max NFS round trip
Minimum client NFS ops/sec considered active 2.00/s
Using predefined rules for disk, net, rpcc, swap, ram, kmem,
cpu, mutex, dnlc and inode

Checking the system every 30 seconds...
-------------------------------------------------------------------------

I then ran the command find / -ls >/dev/null, which makes the disk and the name caches rather busy, and got this output on a 32MB SPARCstation IPX.

-------------------------------------------------------------------------
Adrian detected slow disk(s): Thu Sep 21 00:48:33 1995
Move load from busy disks to idle disks
State  disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b  delay
red    c0t3d0   23.9  5.6   81.1   44.5  2.6  1.3  130.7  17  78 3856.1
amber  c0t5d0    1.3  1.9    7.1   35.2  0.0  0.1   30.3   0   6   94.9

Adrian detected Directory Name Cache problem (amber): Thu Sep 21 00:48:33 1995
Poor DNLC hitrate, increase ncsize
DNLC hitrate 44.1%, reference rate 125.50/s
DNLC has 617 entries, try increasing it (and inode cache) to 1234

Adrian detected Inode Cache problem (amber): Thu Sep 21 00:48:33 1995
Poor inode cache hitrate, increase ufs_ninode
Inode hitrate 26.9%, reference rate 64.63/s

Adrian detected RAM shortage (amber): Thu Sep 21 00:49:05 1995
The system is getting short on RAM, perhaps add some more
  procs        memory         page             faults              cpu
 r  b  w    swap    free    pi  po  sr   in    sy    cs  smtx us sy wt id
 0  0  0   43940     700     8   2  55  342   842   205    11 23 18 20 39
-------------------------------------------------------------------------

As you can see, the output is somewhat verbose, and is based on extended versions of familiar command output where appropriate.

The whole point of the SE toolkit is that you can customize the tools very easily. To encourage you to do this, the rest of this column is a tour of the SE language and toolkit classes.

Features of the SE Language

The SE language is based on a subset of the C language. It is much easier to learn a dialect of an existing language than it is to learn a brand new language.

Scripts are passed through the standard C preprocessor before being executed.
C statement types are all implemented apart from goto.
C operators are all implemented apart from a few unary and ternary operators.
Most C data types are implemented, including floating point, but there are no pointers in the language. A new type called string is added. Some common C library routines are redefined to avoid pointer arguments.
Arrays, structures, and initialized structures are implemented.
A mechanism for defining entry points into the standard dynamically linked C libraries is provided. This avoids the need to supply more than a basic set of built-in functions in the interpreter.
A large set of header files is provided. These files define structures that map onto the many kernel data structures which provide performance information.
The most significant extension to the language is a simple object oriented class. This is just a structure definition with a function definition embedded in it. A variable of this type can be defined so that whenever an element of the structure is accessed, the function is executed first.
By defining classes that map onto the kernel data sources, every time they are read the current values of each metric are provided.
The code required to provide data in ready-to-use formats is also embedded in classes. For example, an iostat class provides the per-second rates for each disk as floating point values in a data structure.
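For readers who do not have SE to hand, the idea of a structure whose embedded function runs before an element is read can be approximated in Python with a property. This is only an analogy, not how the SE interpreter works; ActiveCounter and its source argument are hypothetical names.

```python
class ActiveCounter:
    """Reading .value first runs the embedded function, then returns the data."""

    def __init__(self, source):
        self._source = source    # callable standing in for a kernel data source
        self.refreshes = 0       # how many times the embedded code has run

    @property
    def value(self):
        # The embedded function executes on every read of the element,
        # so the caller always sees the current value of the metric.
        self.refreshes += 1
        return self._source()
```

A caller simply reads counter.value and never calls an update routine, which is the property that makes SE scripts so short.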

A typical script includes a whole bunch of header files that define the functions and classes to be used, then handles command line arguments and loops, reading a class variable and printing out the values.

Iostat Written in SE

The SE language comes with a collection of scripts that clone the common UNIX performance utilities, like iostat and vmstat. The code is very simple and easy to understand as shown in Figure 2 below.

Figure 2 Code for iostat.se

-------------------------------------------------------------------------
#! /opt/RICHPse/bin/se
 
#include <stdio.se>
#include <stdlib.se>
#include <unistd.se>
#include <string.se>
#include <kstat.se>
#include <sysdepend.se>
#include <p_iostat_class.se>
#include <dirent.se>
#include <inst_to_path_class.se>

#define SAMPLE_INTERVAL   5

main(int argc, string argv[2])
{
  p_iostat p_iostat$disk;
  p_iostat tmp_disk;
  int i;
  int interval = SAMPLE_INTERVAL;
  int ndisks;

  switch(argc) {
  case 1:
    break;
  case 2:
    interval = atoi(argv[1]);
    break;
  default:
    printf("use: %s [interval]\n", argv[0]);
    exit(1);
  }
  ndisks = p_iostat$disk.disk_count;
  for(;;) {
    sleep(interval);
    printf("extended disk statistics\n");
    printf("disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %%w  %%b\n");
    for(i=0; i < ndisks; i++) {
      p_iostat$disk.number$ = i;
      tmp_disk = p_iostat$disk;
      printf("%-8.8s %4.1f %4.1f %6.1f %6.1f %4.1f %4.1f %6.1f %3.0f %3.0f\n",
        tmp_disk.name$,
        tmp_disk.reads, tmp_disk.writes,
        tmp_disk.kreads, tmp_disk.kwrites,
        tmp_disk.avg_wait, tmp_disk.avg_run,
        tmp_disk.service,
        tmp_disk.wait_percent, tmp_disk.run_percent);
    }
  }
}
-------------------------------------------------------------------------

To illustrate the language, let's walk through this script:

The first line tells the shell where to find the interpreter for this script.
The #include lines bring in the definitions, and the code that does the real work.
The definitions in main() define two copies of the per-disk iostat class. One has the special prefix “p_iostat$” that matches the function name defined in the class. This is an active variable. By convention, dollar signs are embedded in active variables to indicate that they are special. The other is just a regular data structure of the same storage type. It is used to hold a temporary snapshot copy of the data. Internally, a snapshot of the kernel's disk counters is made as the class initializes itself.
Simple command line processing allows the measurement interval to be specified.
The first read of the class picks out how many disks there are.
The script loops forever until it is interrupted. It sleeps for a few seconds to start with, so the subsequent measurement is made over the specified interval.
After printing the headers, the script loops through the disks. For each disk, you must write the index into the class. The index tells the class which disk you want data for.
A structure-to-structure copy takes a snapshot of all the data for a single disk. Before the data is read from the class the kernel counters are read, differenced from the previous measurements, and processed into floating point values.
All that remains is to print out the data in the right format.

The output matches the regular iostat -x command.

-------------------------------------------------------------------------
% iostat.se
extended disk statistics
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b
sd3       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd5       0.0  4.6    0.0   29.2  0.0  0.1   18.3   1   5
-------------------------------------------------------------------------
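The differencing step described in the walk-through can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the contents of p_iostat_class.se; the function name disk_rates and the counter names are assumptions, and the input numbers are made up so that the result reproduces the sd5 row above.

```python
def disk_rates(prev, curr, elapsed):
    """Return per-second rates from two snapshots of cumulative counters."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return {
        "reads":   (curr["reads"] - prev["reads"]) / elapsed,
        "writes":  (curr["writes"] - prev["writes"]) / elapsed,
        # byte counters are converted to kilobytes before dividing by time
        "kreads":  (curr["nread"] - prev["nread"]) / 1024 / elapsed,
        "kwrites": (curr["nwritten"] - prev["nwritten"]) / 1024 / elapsed,
    }

# Hypothetical snapshots taken 5 seconds apart
before = {"reads": 100, "writes": 200, "nread": 4096, "nwritten": 8192}
after = {"reads": 100, "writes": 223, "nread": 4096, "nwritten": 8192 + 149504}
rates = disk_rates(before, after, 5)
```

With these inputs the write rate works out to 4.6 per second and 29.2 Kbytes per second.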

Rule Construction

Let's take a look at a typical rule to see how it is implemented. You will see how easy it is to define and use your own performance rules. In just a few hours, you can build a complete customized virtual guru of your own. If you would like to share your work, send it to se-feedback@chessie.eng.sun.com, and we will try to add it to the next release of the SE Performance Toolkit.

States and Actions

I decided to extend the usual red/amber/green conditions that most tools implement. I wanted to indicate a few extra conditions that sometimes occur:

White state - completely idle, untouched
Blue state - imbalanced, idle while other instances of the resource are overloaded
Green state - no problem, normal operating state
Amber state - warning condition
Red state - overloaded or problem detected
Black state - critical problem that may cause processes or the whole system to fail

Each state is associated with an action, indicating the nature of the problem and, if possible, what to do about it.
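One way to picture the six states is as an ordered enumeration, so that a script can compare them or pick the worst one. The Python sketch below is an assumption about ordering only: it follows the order of the list above, and the numeric values are not taken from the SE header files.

```python
from enum import IntEnum

class State(IntEnum):
    """Severity order follows the list above; the numbers are assumptions."""
    WHITE = 0   # completely idle, untouched
    BLUE = 1    # imbalanced, idle while other instances are overloaded
    GREEN = 2   # no problem, normal operating state
    AMBER = 3   # warning condition
    RED = 4     # overloaded or problem detected
    BLACK = 5   # critical problem

def needs_attention(state):
    """True for any state worse than green."""
    return state > State.GREEN

def worst(states):
    """Return the most severe state in a collection."""
    return max(states)
```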

Rules as Objects

I used the object oriented class mechanism of the language to implement the rules. This gives several benefits:

A way to package each rule
A common interface based on a data structure
The implementation is hidden in the class function
Changes to the rule can be made without changing the interface as long as the input data is the same
Rules become reusable components stored in a header file
Existing rules provide templates for the creation of new rules
Common operations, such as setting custom thresholds, can be performed using generic functions

Pure Rule Objects

I built what I called pure rules. These are provided with input data to evaluate by writing to elements of the class. They are evaluated by reading the rule and examining both the state code and action string that are always provided as elements of the class. These rules make no assumptions about where the data came from. They could be given simulated data, historical data samples, or live data.

Live Rule Objects

The next set of rule classes produced were called live rules. These package up the code needed to:

Read data from the current system
Feed it to a copy of the pure rule
Pass on the state code and action string
Pass on derived data used to determine the state and action
Maintain globally available copies of the complete vmstat, iostat, and netstat classes

A live rule is completely trivial to use in a script. It is self initializing and self updating. You just declare an instance of the class, and each time you read it, you get the current state and action string.

-------------------------------------------------------------------------
lr_disk_t lr_disk$dr;
lr_disk_t tmp_dr;
/* use the live disk rule */
tmp_dr = lr_disk$dr;
if ( tmp_dr.state > ST_GREEN) {
	printf("The disks are in the %s state: %s\n",
		state_string(tmp_dr.state), tmp_dr.action);
}
-------------------------------------------------------------------------

The CPU Power Rule

This is a relatively simple rule, so it will be explained in detail. The full implementation of all the rules is too complex to explain here, but you can browse the pure_rules.se and live_rules.se header files.

Rule Definition

Each rule was initially defined in terms of the output of standard system commands. For example, the shorthand vmstat30.r means: Run the command vmstat with a 30-second interval, and look at the column labelled with an r.

The number of CPUs on the system must be known. It is referred to as ncpus in the rules. The run queue length is divided by the number of CPUs. This is based on the assumption that every CPU takes a job off the run queue in each time slice.

Table 1 CPU Rule

-------------------------------------------------------------------------
CPU RULE                             LEVEL    ACTION
-------------------------------------------------------------------------
0 == vmstat30.r                      White    1. CPU Idle
0 < (vmstat30.r / ncpus) < 3.0       Green    No Problem
3.0 <= (vmstat30.r / ncpus) < 5.0    Amber    2. CPU Busy
5.0 <= (vmstat30.r / ncpus)          Red      2. CPU Busy
-------------------------------------------------------------------------

CPU Idle - The CPU power of this system is underutilized. Fewer or less powerful CPUs could do this job.
CPU Busy - There is insufficient CPU power, and jobs spend an increasing amount of time in the queue before being assigned to a CPU. This reduces throughput and increases interactive response times.
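Table 1 can be restated as a small function. The Python sketch below uses the default thresholds from the table; the function name cpu_state and the lowercase state strings are illustrative choices, not part of the toolkit.

```python
def cpu_state(runque, ncpus, idle=0.0, busy=3.0, overload=5.0):
    """Classify the run queue length per CPU using the Table 1 thresholds."""
    load = runque / ncpus        # run queue length divided by number of CPUs
    if load <= idle:
        return "white"           # CPU idle
    if load < busy:
        return "green"           # no problem
    if load < overload:
        return "amber"           # CPU busy, warning
    return "red"                 # CPU busy, overloaded
```

For example, a run queue of 9 on a two-CPU machine gives a load of 4.5 per CPU, which falls in the amber band.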

CPU Pure Rule Code

Each pure rule has a series of threshold values and defaults. They can be set via environment variables. To describe each threshold, an initialized data structure is declared. This data structure is used by a number of standard functions that get or print the threshold. Each structure contains the environment variable name, the default value, the units of the value (e.g. pages or milliseconds), precision values for formatting the value with printf, and a descriptive string. Thresholds can be integer, double precision floating point, or string types. The code is shown in Figure 3 below.

Figure 3 Code for Pure CPU Rule

-------------------------------------------------------------------------
rule_thresh_dbl cpu_runq_idle = {"RUNQ_IDLE", 0.0, "",
	4, 1, "Spare CPU capacity" };
rule_thresh_dbl cpu_runq_busy = {"RUNQ_BUSY", 3.0, "",
	4, 1, "OK up to this level" };
rule_thresh_dbl cpu_runq_overload = {"RUNQ_OVERLOAD",
	5.0, "", 4, 1, "Warning up to this level" };
 
print_pr_cpu(ulong file) {
    print_thresh_dbl(file, cpu_runq_idle);
    print_thresh_dbl(file, cpu_runq_busy);
    print_thresh_dbl(file, cpu_runq_overload);
}

class pr_cpu_t {
  /* output variables */
  int state;
  string action;
  /* input variables */
  ulong timestamp;
  int runque; /* i.e. p_vmstat.runque load level */
  int ncpus; /* i.e. sysconf(_SC_NPROCESSORS_ONLN) */
  /* threshold variables */
  double cpu_idle;
  double cpu_busy;
  double cpu_overload;

  pr_cpu$()
  {
    double cpu_load;
    ulong lasttime;       /* previous timestamp */
    if (timestamp == 0) { /* reset defaults */
      cpu_idle = get_thresh_dbl(cpu_runq_idle);
      cpu_busy = get_thresh_dbl(cpu_runq_busy);
      cpu_overload = get_thresh_dbl(cpu_runq_overload);
      return;
    }
    if (timestamp != lasttime) {
      cpu_load = runque; cpu_load /= ncpus;
      if (cpu_load <= cpu_idle)
        {
        state = ST_WHITE;
        action = "There is more CPU power configured than you need right now";
        }
      else
        {
        if (cpu_load < cpu_busy)
          {
          state = ST_GREEN;
          action = "No problem";
          }
        else
          {
          if (cpu_load < cpu_overload)
            {
            state = ST_AMBER;
            action = "The CPU is quite busy, perhaps add more CPU power";
            }
          else
            {
            state = ST_RED;
            action = "CPU overload, add more power or quit some programs";
            }
          }
        }
      lasttime = timestamp;
    }
  }
};
-------------------------------------------------------------------------

The code defines three threshold structures and a function to print them to a file descriptor. The function is provided for convenient use in scripts.

The CPU rule itself is defined as a class. The first part is just like a regular C structure definition. By convention, output variables for the state and action are always defined. Input variables are used to provide information needed by the rule (runque and ncpus) and time stamp each invocation of the rule. Threshold variables hold the current values of the thresholds.

The final element of the class is the block of code that is first executed when the class is declared and again whenever its data is read. The function name ends in a “$” and must be used as the prefix for the name of any active instances of the class.

Some local variables are defined. By convention, the time stamp is used in two special ways. A zero time stamp executes the rule initialization code; in this case, each threshold is set by a function that checks the environment variable and falls back to the default if the variable is not defined. If the time stamp is unchanged from the last invocation of the rule, the code is not executed, which prevents unnecessary evaluations of the rule. I use a one-second resolution time stamp.

Finally, the per-CPU run queue length is calculated and compared with the thresholds to determine the state code and action string.

The rule is used by setting the input variables, updating the time stamp, and reading the output variables. This will be shown in the definition of the live rule.
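The threshold lookup described above, where an environment variable overrides a built-in default, might look like this in Python. It is a sketch of the idea rather than a port of get_thresh_dbl; the name threshold_from_env is hypothetical, and falling back to the default on a malformed value is an assumption about behavior that the text does not specify.

```python
import os

def threshold_from_env(name, default, environ=None):
    """Return the threshold from the environment, or the default if unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default               # variable not defined: use the default
    try:
        return float(raw)
    except ValueError:
        return default               # assumption: ignore malformed values

# Default thresholds and variable names as they appear in Figure 3
cpu_busy = threshold_from_env("RUNQ_BUSY", 3.0, environ={})
cpu_overload = threshold_from_env("RUNQ_OVERLOAD", 5.0,
                                  environ={"RUNQ_OVERLOAD": "6.5"})
```

The optional environ argument exists only so the lookup can be tried without changing the real process environment.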

CPU Live Rule Code

The live rule wraps the pure rule together with the code needed to read the current values of the required input variables. As shown in Figure 4 below, the definition is simpler than that of the pure rule. The only data items defined are a state code and action string. Again, the class function name is used as a prefix for active variables. An active instance of the pure rule is defined, along with a temporary copy that just holds the defined data.

The live rule initializes itself when it is first declared, i.e. before the script starts to run, by reading the time, updating the global copy of the vmstat class data, setting the number of CPUs correctly in the pure rule, then resetting the pure rule while initializing the temporary copy. The state code and action string are set up.

The first time the live rule is actually read, it updates the global copy of the vmstat class, then sets the run queue and time stamp in the pure rule, and invokes it by reading it into the temporary copy. The state code and action string are propagated unchanged, up to the live rule values.

Figure 4 Code for Live CPU Rule

-------------------------------------------------------------------------
class lr_cpu_t {
  /* output variables */
  int state;
  string action;

  lr_cpu$()
  {
    ulong lasttime = 0; /* previous timestamp */
    ulong timestamp = 0;
    pr_cpu_t pr_cpu$cpu;
    pr_cpu_t tmp_cpu;

    if (timestamp == 0) {
      timestamp = time();
      pvm_update(timestamp);
      pr_cpu$cpu.ncpus = GLOBAL_pvm_ncpus;
      pr_cpu$cpu.timestamp = 0;
      tmp_cpu = pr_cpu$cpu; /* reset pure rule */
      action = uninit;
      state = ST_WHITE;
      lasttime = timestamp;
      return;
    }
    timestamp = time();  
    if (timestamp == lasttime) { 
      return;
    }
    /* use the rule */
    pvm_update(timestamp);
    pr_cpu$cpu.runque = GLOBAL_pvm[0].runque;
    pr_cpu$cpu.timestamp = timestamp;  
    tmp_cpu = pr_cpu$cpu;
    state = tmp_cpu.state;
    action = tmp_cpu.action;
    lasttime = timestamp;   
  }
};
-------------------------------------------------------------------------

This is one of the simplest live rules, but it is used in the same way as the most complex ones. Whenever you want to know what state the CPU is in, simply define an instance of the live rule class, then read it.
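The division of labor between the two kinds of rule can also be sketched outside SE. In the Python below, evaluate_cpu stands in for the pure rule and LiveCpuRule for the live rule; the class, its read method, and the read_runque and clock parameters are all hypothetical names, and the one-second time stamp check mirrors the convention described for Figure 3 and Figure 4.

```python
import time

def evaluate_cpu(runque, ncpus):
    """Stand-in for the pure rule, using the default thresholds from Table 1."""
    load = runque / ncpus
    if load <= 0.0:
        return "white", "CPU idle"
    if load < 3.0:
        return "green", "No problem"
    if load < 5.0:
        return "amber", "CPU busy"
    return "red", "CPU busy"

class LiveCpuRule:
    """Fetches live data, feeds the pure rule, and passes on state and action."""

    def __init__(self, read_runque, ncpus, clock=time.time):
        self._read_runque = read_runque    # hypothetical live data source
        self._ncpus = ncpus
        self._clock = clock
        self._lasttime = int(clock())      # set at declaration, before any read
        self.state, self.action = "white", "uninit"

    def read(self):
        now = int(self._clock())           # one-second resolution time stamp
        if now != self._lasttime:          # skip repeat reads in the same second
            self.state, self.action = evaluate_cpu(self._read_runque(),
                                                   self._ncpus)
            self._lasttime = now
        return self.state, self.action
```

Because the pure rule is a separate function, the same thresholds could equally be applied to logged or simulated run queue values.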

The Free Memory Non-Rule

I deliberately avoid using some metrics in rules. They can be side effects of other problems, or metrics that don't have the same meaning in Solaris as in other versions of UNIX. The “free memory” reported by vmstat and sar is one of them. Please ignore it. If you really want to know why, read the piece I wrote for my SunWorld Online Q&A column this month.

That's All Folks!

Thank you for reading to the end of this column. I've completed the introduction to the SE toolkit that I started last month. I will return to the subject of tools in the future.

Next month: How to get highest performance from the least effort.

Send your comments and questions to adrian.cockcroft@sun.com.

by Adrian Cockcroft
