CIS 5406 - Lecture Notes # 22 - Performance Analysis

                          COMPUTER AND NETWORK
                         SYSTEM  ADMINISTRATION
                         Summer 1998 - Lesson 11

                          Performance Analysis


A. Introduction

   1. When performance is bad, user complaints come in the form of:
      "Why is the system so sloooooow?"
      or
      "My job is taking forever to run!"
      Users will report slow keyboard response or long compilation times.
      Hopefully you as the administrator notice these problems first
      before the bombardment of user complaints.

   2. Where to start? What to monitor?
      Performance is affected by the efficiency of the
      four main resources that a system offers:

      - CPU 
      - Memory 
      - Disk 
      - Network 

   3. These are all related.

      - NFS traffic depends on network bandwidth as well
        as disk bandwidth

      - disk bandwidth depends on memory if disk caching
        is in place

   4. What is good performance?

      - the system administrator must distinguish between
        poor performance caused by system malfunctioning
        and that caused by heavy usage

      - times of heavy usage are good times to analyze the
        system and see where bottlenecks are

      - this will help you determine where to put scarce
        funds

      - long term analysis


B. CPU monitoring

   1. time: time a command

      - several system commands will time a job

      - /usr/bin/time, /usr/5bin/time (Solaris), shell's built-in "time"

      Example (/usr/bin/time): 
      % /usr/bin/time  find / -name csh.1 -print
        /usr/share/man.xi.orig/man1/csh.1

        real        3.2
        user        0.4
        sys         1.9

      real: wall clock time
      user: user CPU time
      sys: system CPU time

      Example: (csh built-in time):
      % time  find / -name csh.1 -print
        /usr/share/man.xi.orig/man1/csh.1

        0.39u 1.64s 0:02.56 79.2%

      0.39u: user CPU time
      1.64s: system CPU time
      0:02.56: wall clock time
      79.2%: percentage of time spent on CPU ((u+s)/w)
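
      The (u+s)/w calculation above can be checked by hand. The sketch
      below parses the csh summary line from the example and recomputes
      the percentage. The function names are illustrative, not part of
      any standard tool. It yields about 79.3%, close to the 79.2% that
      csh printed (csh presumably works from unrounded times).

```python
def parse_csh_time(line):
    """Parse a csh `time` summary such as '0.39u 1.64s 0:02.56 79.2%'.

    Returns (user_seconds, system_seconds, wall_seconds).
    """
    fields = line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    # Wall clock is printed as [hours:]minutes:seconds; fold it into seconds.
    wall = 0.0
    for part in fields[2].split(":"):
        wall = wall * 60 + float(part)
    return user, system, wall


def cpu_percent(user, system, wall):
    """Percentage of wall clock time spent on the CPU: (u + s) / w."""
    return 100.0 * (user + system) / wall


user, system, wall = parse_csh_time("0.39u 1.64s 0:02.56 79.2%")
print(f"{cpu_percent(user, system, wall):.1f}%")
```

      A value well below 100% means the job spent much of its time
      waiting (for disk, network, or its turn on the CPU) rather than
      computing.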

   3. uptime: report current time, amount of time system has been up,
        number of users, load average

      Example:
      % uptime
        3:03pm  up 1 day(s),  1:20,  14 users,  load average: 0.20, 0.09, 0.08

      - load average is a rough measure of CPU use
      - reports the average number of runnable (active) processes over
        the last 1, 5, and 15 minutes
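
      The three load averages are easy to pull out of uptime (or rup)
      output for logging or comparison. A minimal sketch, assuming the
      "load average: a, b, c" layout shown above; the function name is
      made up for illustration:

```python
import re


def parse_load_averages(line):
    """Return the (1-minute, 5-minute, 15-minute) load averages in a line."""
    m = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    if m is None:
        raise ValueError("no load average found in: " + line)
    return tuple(float(x) for x in m.groups())


line = "3:03pm  up 1 day(s),  1:20,  14 users,  load average: 0.20, 0.09, 0.08"
one, five, fifteen = parse_load_averages(line)
# A 1-minute figure above the 15-minute figure suggests load is rising.
trend = "rising" if one > fifteen else "steady or falling"
print(one, five, fifteen, trend)
```

      On a live UNIX system, Python's os.getloadavg() returns the same
      three numbers without parsing any command output.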

   4. rup: show host status of remote machines

      Example:
      % rup xi upsilon sed nu linuxfs1 linuxfs2
              xi    up  1 day,   1:31,    load average: 0.13, 0.26, 0.19
         upsilon    up 63 days, 23:31,    load average: 0.00, 0.02, 0.02
             sed    up  1 day,  21:47,    load average: 0.00, 0.00, 0.00
              nu    up 37 days, 21:39,    load average: 0.11, 0.09, 0.00
        linuxfs1    up  1 day,  23:15,    load average: 0.00, 0.06, 0.09
        linuxfs2    up 14 days, 17:35,    load average: 0.03, 0.01, 0.00

   5. ps: report process status 

      - has many options - read man page for specifics

      Example: (Solaris) 
      % ps -ef
          UID   PID  PPID  C    STIME TTY      TIME CMD
         root     0     0  0   Jul 03 ?        0:00 sched
         root     1     0  0   Jul 03 ?        0:04 /etc/init -r
         root     2     0  0   Jul 03 ?        0:00 pageout
         root     3     0  0   Jul 03 ?        3:36 fsflush
         root   449     1  0   Jul 03 ?        0:00 /usr/lib/saf/sac -t 300
         root   224     1  0   Jul 03 ?        1:09 /usr/lib/autofs/automountd
         root   136     1  0   Jul 03 ?        0:27 /usr/sbin/rpcbind
        healy  7763  7712  1 14:18:43 pts/7    0:12 emacs signal.c
        koshy  3279  3276  0   Jul 03 pts/13   0:00 -reg-csh
         root 17242   243  0 19:20:12 ?        0:00 /usr/samba/bin/smbd -D

      UID: login name
      PID: process id
      PPID: process ID of the parent process
      C: processor utilization used for scheduling
      STIME: start time
      TTY: associated terminal
      TIME: accumulated CPU time
      CMD: command 

      - ps is often used in pipes

      Example:
      % ps -ef | grep httpd
      nobody  9538   299  0 15:29:23 ?      0:00 /usr/local/etc/httpd/httpd 
      nobody  9302   299  0 15:22:59 ?      0:00 /usr/local/etc/httpd/httpd
      nobody  9557   299  0 15:31:29 ?      0:00 /usr/local/etc/httpd/httpd 
      nobody  9540   299  0 15:29:24 ?      0:00 /usr/local/etc/httpd/httpd 
      nobody  9112   299  0 15:17:34 ?      0:00 /usr/local/etc/httpd/httpd 
      nobody  9304   299  0 15:23:22 ?      0:00 /usr/local/etc/httpd/httpd 
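
      When the pipeline needs more than a text match, the ps -ef columns
      can be split into fields first. The sketch below assumes the
      Solaris column layout shown above and uses a few lines copied from
      the examples; the function names are illustrative. Note that STIME
      is one word for today's processes ("14:18:43") and two words for
      older ones ("Jul 03").

```python
SAMPLE = """
    UID   PID  PPID  C    STIME TTY      TIME CMD
   root     3     0  0   Jul 03 ?        3:36 fsflush
  healy  7763  7712  1 14:18:43 pts/7    0:12 emacs signal.c
 nobody  9538   299  0 15:29:23 ?        0:00 /usr/local/etc/httpd/httpd
 nobody  9302   299  0 15:22:59 ?        0:00 /usr/local/etc/httpd/httpd
"""


def parse_ps_line(line):
    """Split one `ps -ef` output line into a dictionary of fields."""
    f = line.split()
    if ":" in f[4]:                      # started today: STIME is hh:mm:ss
        stime, rest = f[4], f[5:]
    else:                                # started earlier: STIME is "Mon dd"
        stime, rest = f[4] + " " + f[5], f[6:]
    return {
        "uid": f[0], "pid": int(f[1]), "ppid": int(f[2]), "c": int(f[3]),
        "stime": stime, "tty": rest[0], "time": rest[1],
        "cmd": " ".join(rest[2:]),
    }


def parse_ps_ef(text):
    """Parse full `ps -ef` output, skipping the header line."""
    lines = text.strip().splitlines()
    return [parse_ps_line(ln) for ln in lines[1:]]


procs = parse_ps_ef(SAMPLE)
httpd = [p for p in procs if p["cmd"].endswith("httpd")]
print(len(httpd), "httpd processes:", [p["pid"] for p in httpd])
```

      Matching on the CMD field alone avoids one common annoyance of
      "ps -ef | grep httpd": the grep process itself showing up in the
      results.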

   6. top: display and update information about the top cpu processes

      - excellent tool for overall view of system

      - combines output of several commands (uptime, ps, vmstat)

      Example:
      % top
      last pid:  9649;  load averages:  0.03,  0.05,  0.12                   15:36:19
      113 processes: 112 sleeping, 1 on cpu
      CPU states: 97.6% idle,  1.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
      Memory: 152M real, 48M free, 54M swap, 721M free swap

       PID USERNAME PRI NICE  SIZE   RES STATE   TIME   WCPU    CPU COMMAND
      9649 barnash   31    0 1864K 1472K cpu     0:00  1.26%  0.99% top
      5356 sheff     33    0 2064K 1384K sleep   8:19  0.33%  0.39% xsysstats
      9114 nobody    33    0 1968K 1472K sleep   0:00  0.11%  0.16% httpd
      9585 nobody    35    0 1968K 1448K sleep   0:00  0.11%  0.12% httpd
      9304 nobody    33    0 1984K 1480K sleep   0:00  0.01%  0.10% httpd

      PID: process id
      USERNAME: name of the process's owner
      PRI: current priority of the process
      NICE: nice amount (in the range -20 to 20)
      SIZE: total size of the process (text, data and stack; kilobytes)
      RES: current amount of resident memory (kilobytes)
      STATE: current state (sleep, wait, run, idl, zomb, stop)
      TIME: number of system and user cpu seconds the process has used
      WCPU: weighted percentage of cpu time 
      CPU: raw percentage of cpu time
      COMMAND: name of the command

      - from within top, you can control behavior of processes with renice and kill 
        renice: 
           - change nice number (requested execution priority) 
           - Syntax: r new-nice-number pid
           - nice range either -20 to 20 or 0 to 39
           - the lower the nice number the higher the priority
           - only superuser can lower nice number       

        kill: terminate process
           - Syntax: k [-signal] pid
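
      The same renice operation is available to programs. The sketch
      below is an illustration only, assuming a UNIX system and Python 3;
      the function name is made up. It raises the nice number of the
      calling process, which any user may do; lowering it requires
      superuser privileges, as noted above. On Linux the usable range is
      -20 to 19.

```python
import os


def renice_self(increment):
    """Raise this process's nice number by `increment` (lower its priority).

    Returns the (before, after) nice numbers.
    """
    # A pid of 0 means "the calling process".
    before = os.getpriority(os.PRIO_PROCESS, 0)
    os.setpriority(os.PRIO_PROCESS, 0, before + increment)
    after = os.getpriority(os.PRIO_PROCESS, 0)
    return before, after


before, after = renice_self(1)
print(f"nice number went from {before} to {after}")
```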

   7. Task Manager - Windows NT

      - ctrl-alt-delete and choose Task Manager 
        or right click on taskbar and choose Task Manager

      - Applications: shows which applications are active
           - status should be "Running"
           - if status is "Not Responding" you can use End Task to kill it
           - double click on application or click "Switch To" to bring it to front
           - click "New Task" to start new application
           - right click on application to bring up menu of options

      - Processes: shows which processes are active (similar to top)
           - applications have one or more processes, but not all processes
             belong to an application
           - can end a process by choosing "End Process"; be sure you
             know what the process is before ending it
           - right click on process to set priority
           - can reorder listing by clicking on column headings

      - Performance: CPU and memory utilization
           - graphical representation of CPU utilization
           - minimize Task Manager for CPU utilization graphic on Task Bar

   8. QuickSlice - WinNT Server Resource Kit
      - nice graphical tool for analyzing cpu utilization

C. Memory performance analysis

   1. buying more memory is generally the cheapest way to 
      improve performance 

   2. generally, the active processes together require more physical memory
      than is available
      - paging: involves moving sections of a process's memory to disk
      - page fault: occurs when a process needs a page of memory that is not
                    resident and must be read in from disk
      - swapping: writing an entire process to disk, freeing all of its memory


   3. swapper (BSD) / sched (Solaris)
  
      - the swapper swaps out processes that have been idle for more than
        20 seconds (preventative swapping - normal housekeeping)
      - if the pagedaemon cannot keep free memory above lotsfree and the
        number of Kbytes of free memory falls below minfree, then
        the swapper kicks in (desperation swapping)

      - the swapper chooses a process to swap out based on 2 criteria:
        > longest sleep time
        > if none are sleeping, then use resident memory size
          (the swapper chooses largest 4 processes, then picks the one
           which has been resident longest)

      - when a process is swapped out, everything goes - even the user 
        structure and the page tables

      - swapping is much more expensive than paging so a highly loaded
        system - that invokes swapping frequently - does not perform well
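
      The two selection criteria listed above can be sketched as a small
      function. This is a simplified model for illustration, not the
      kernel's actual code; the Proc record and its field names are
      assumptions.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    sleep_secs: int      # seconds asleep; 0 means the process is not sleeping
    resident_kb: int     # resident memory size
    resident_secs: int   # how long the process has been resident in memory


def choose_swap_victim(procs):
    """Pick the process to swap out, following the two criteria above."""
    if not procs:
        return None
    # Criterion 1: the process that has been sleeping longest.
    sleepers = [p for p in procs if p.sleep_secs > 0]
    if sleepers:
        return max(sleepers, key=lambda p: p.sleep_secs)
    # Criterion 2: of the 4 largest processes, the one resident longest.
    largest = sorted(procs, key=lambda p: p.resident_kb, reverse=True)[:4]
    return max(largest, key=lambda p: p.resident_secs)


# Hypothetical process table: an idle editor and shell next to a busy job.
table = [Proc(101, 0, 5000, 10), Proc(102, 30, 100, 500), Proc(103, 90, 200, 50)]
print(choose_swap_victim(table).pid)
```

      In this made-up table the victim is the process that has slept
      longest (pid 103), even though it is small. This is why interactive
      processes suffer first under desperation swapping.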


   4. When do we have problems?

      - preventative swapping is normal
      - a ps -aux usually shows many swapped out processes
           STAT column - W as second letter means swapped out
      - Linux top also has a STAT column
      - paging is also part of normal operations

        > a new process must have new pages brought into memory
        > also must page in when it references non-recently used section
          of memory

      - page faults always cause a performance degradation

      - usually, the pagedaemon quickly fixes the problem by
        getting rid of unneeded pages and loading the needed ones

      - when the pagedaemon fails then desperation swapping begins
  
      - what types of processes are likely to be swapped out by
        desperation swapping?

        > ans: ones that sleep: editors, shells, generally interactive
               processes

        > keyboard response time goes to pot since a keystroke requires
          a disk access (and the disk is probably heavily loaded at this
          time)

   5. how to diagnose

      1. tools - BSD: vmstat
                 S5:  sar
                 Solaris: mpstat
                 WinNT: Task Manager

      2. these tools report:

         page-ins
         page-outs
         swap-ins
         swap-outs

      3. page-ins

         - most UNIX systems use 'demand paging'
         - when a process is started only the memory 
           maps for the process are loaded in physical
           memory
         - the first access to each page that is not yet resident
           causes a page fault, and the page is brought in 'on demand'
         - the alternative is 'pre-paging'
         - thus page-ins are normal

      4. swap-ins
         - a new process acts like a swap-in
         - not very useful

      5. page-outs

         - this is a first indicator that your memory is
           inadequate
         - some page-out activity is normal
         - does the frequency of page-outs dramatically
           increase whenever system performance is sluggish?

         - acceptable rate is O/S and hardware dependent
         - in order to know you need to establish baselines of
           activity
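
         One simple way to use a baseline is to flag a page-out rate
         that sits far above what the machine normally shows. The sketch
         below is only an illustration: the baseline numbers are
         invented, and the threshold of mean plus 3 standard deviations
         is an arbitrary assumption that would need tuning per system.

```python
from statistics import mean, stdev


def pageout_alarm(baseline_rates, current_rate, k=3.0):
    """Compare a page-out rate against a recorded baseline.

    Returns (is_abnormal, threshold), where the threshold is the baseline
    mean plus k standard deviations.
    """
    threshold = mean(baseline_rates) + k * stdev(baseline_rates)
    return current_rate > threshold, threshold


# Hypothetical page-out rates sampled while the system felt responsive.
baseline = [2, 0, 4, 1, 3, 0, 2, 4]

abnormal, threshold = pageout_alarm(baseline, 56)
print(f"threshold={threshold:.2f} abnormal={abnormal}")
```

         Collecting the baseline during normal and heavy (but healthy)
         usage makes the comparison meaningful when users later report
         sluggish performance.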

      6. swap-outs 
         - a heavy amount of swap-outs signifies a problem

      7. Example (BSD):
         % vmstat -S

procs   memory              page                disk       faults     cpu
r b w avm   fre   si  so pi  po  fr  de  sr  d0 d1 d2 d3  in  sy  cs us sy id
0 0 0   0  3028    4   1  1   2   1   0   0   2  2  0  0  0  82 177  89 33 9


        - procs 

          Number of processes:
     
             r  - runnable (not waiting for I/O or sleeping)

             b  - blocked for resources (i/o, paging, etc.)

             w  - runnable or short sleeper (<  20  secs)  but
                  swapped

        - any number but 0 in the w column indicates what?

          > ans: desperation swapping
    
        -  memory 

             avm - number of active virtual Kbytes (used in last 20 secs)

             fre - size of the free list in Kbytes 

               > when this gets close to lotsfree, then page-outs begin

        - page 

          Report information about swapping, page faults, and paging
             activity

          Reported in units per second (averaged over last 5 seconds)

             si - procs swap-ins
             so - procs swap-outs (not due to idle)
             pi - kilobytes per second paged in
             po - kilobytes per second paged out
             fr - kilobytes freed per second
             de - anticipated short term memory shortfall in Kbytes
             sr - pages scanned by clock algorithm, per-second

        - disk

          Report number of disk operations per second.

        - faults 

          Report trap/interrupt rate averages per second over last 5 seconds
          
             in - (non clock) device interrupts per second
             sy - system calls per second
             cs - CPU context switch rate (switches/sec)

        - cpu  

          Give a breakdown of percentage usage of CPU time.
       
             us - user time for normal and low priority processes
             sy - system time
             id - CPU idle

        - we are most concerned with swap-outs and page-outs

procs   memory              page               disk       faults     cpu
r b w avm   fre  si so  pi  po  fr  de  sr d0 d1 d2 d3  in  sy  cs us sy id

0 0 0   0  2508  20  0   0   0   0   0   0 13  0  0  0 226 216 350  7  6 87
0 0 0   0  2280   0  0  16   0   0   0   0  3  0  0  0 258 361 343  5  8 87
0 0 0   0  2104  21  0 124  56 184   0 111  5  0  0  0 545 667 563 14 16 70
0 0 0   0  2120   0  0  36  12  60   0  37  0  0  0  0 338 387 345  3  5 92
0 0 0   0  2076   0  0  12   0  28   0  23  1  0  0  0 263 271 370  3  4 92
0 1 0   0  2048   5  0   0   0  44  16  33  1  0  0  0 320 473 497  6  9 85
8 1 0   0  2116  10  0   0   0 100   0  56 23  0  0  0 514 377 898 14 14 72
0 0 0   0  2084   5  0  24  16 148   0  67  6  0  0  0 350 424 529  9 10 81
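
        Scanning these columns by eye gets tedious on a long run. The
        sketch below parses the sample above and pulls out the columns
        of interest (w, so, po). It assumes the BSD column layout shown
        here; the function names are illustrative.

```python
VMSTAT = """
r b w avm   fre  si so  pi  po  fr  de  sr d0 d1 d2 d3  in  sy  cs us sy id
0 0 0   0  2508  20  0   0   0   0   0   0 13  0  0  0 226 216 350  7  6 87
0 0 0   0  2280   0  0  16   0   0   0   0  3  0  0  0 258 361 343  5  8 87
0 0 0   0  2104  21  0 124  56 184   0 111  5  0  0  0 545 667 563 14 16 70
0 0 0   0  2120   0  0  36  12  60   0  37  0  0  0  0 338 387 345  3  5 92
0 0 0   0  2076   0  0  12   0  28   0  23  1  0  0  0 263 271 370  3  4 92
0 1 0   0  2048   5  0   0   0  44  16  33  1  0  0  0 320 473 497  6  9 85
8 1 0   0  2116  10  0   0   0 100   0  56 23  0  0  0 514 377 898 14 14 72
0 0 0   0  2084   5  0  24  16 148   0  67  6  0  0  0 350 424 529  9 10 81
"""


def parse_vmstat(text):
    """Return (header, rows); rows are lists of ints matching the header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        values = [int(v) for v in ln.split()]
        if len(values) == len(header):   # skip malformed or partial lines
            rows.append(values)
    return header, rows


def column(header, rows, name):
    """Values of one column. Uses the first header match, so it is only
    safe for unique names such as w, so and po (sy appears twice)."""
    i = header.index(name)
    return [r[i] for r in rows]


header, rows = parse_vmstat(VMSTAT)
print("swapped (w):", column(header, rows, "w"))
print("swap-outs  :", column(header, rows, "so"))
print("page-outs  :", column(header, rows, "po"))
```

        In this sample the w and so columns are all zero, so there is no
        desperation swapping; page-outs occur in three of the eight
        intervals.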

     8. Example (Solaris):
     % sar -g 5
     SunOS xi 5.5.1 Generic_103640-03 sun4u    07/05/97

     20:15:43  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
     20:15:48     0.00     0.00     0.00     0.00     0.00

        - pgout/s: number of page-out requests per second
        - ppgout/s: number of pages paged out per second
        - pgfree/s: number of pages reclaimed (placed on the free list)
          per second
        - pgscan/s: number of pages scanned per second in order
          to find candidates to reclaim
        - %ufs_ipf: percentage of UFS inodes taken off the free list
          that had reusable pages associated with them

     9. Example: Task Manager
        - Choose "Processes" tab
        - From "View" menu, choose "Select columns..."
        - Can choose from a variety of choices including Page Faults,
          Virtual Memory Size, etc.
        - "Performance" tab has graphical depiction of memory usage
          and other statistics
          - how many handles, threads and processes exist
          - total physical memory, how much is free and how much used for cache
          - commit charge shows how much memory is allocated to applications
            and system programs. Also shows memory limit and peak.
          - memory used by kernel, how much is paged and nonpaged

     10. wmem freeware utility for NT: shows RAM and paging information
         ftp://ftp.winsite.com/pub/pc/winnt/dskutil/wmem.zip
         ftp://mirrors.aol.com/pub/cica/pc/winnt/dskutil/wmem.zip