HPC(High Performance Computer) Linux Cluster HowTo

4.5. Install & Run

Now let us install benchmark programs on your cluster system and measure its performance directly. This document installs and tests NetPIPE and SCALAPACK.

4.5.1. Test System Specifications

CPU : AMD MP 1900+ 1.5 GHz, 256 KB cache (single)
RAM : 2 GB PC2100 ECC memory
HDD : 10000 RPM SCSI
NIC : 3Com Gigabit card, 3Com 10/100 LAN card

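4.5.2. NetPIPE

NetPIPE is a benchmark program that measures network performance between machines or between Ethernet cards. It supports both PVM and MPI and runs on many hardware platforms. To install it, download the source from http://www.scl.ameslab.gov/netpipe/ . At the time of writing, the latest NetPIPE version is 3.3. Working in a directory shared across the cluster (/home/share in the author's environment) is more convenient.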

[micro@master share]$ tar xzf NetPIPE_3.3.tar.gz 
[micro@master share]$ cd NetPIPE_3.3

Edit the makefile. Little needs changing; only the MPI settings need attention. Set MPICC to the MPI compiler wrapper that matches your system: to benchmark LAM-MPI with NetPIPE, point it at LAM's mpicc; to test MPICH, point it at MPICH's mpicc.

[micro@master NetPIPE_3.3]$ vi makefile 
# For MPI, mpicc will set up the proper include and library paths
MPICC       = /usr/local/mpich/bin/mpicc    	<- edit this line
……………………………….
MPI2CC   = /usr/local/mpich/bin/mpicc		<- edit this line

After editing, compile. To test plain TCP performance only, build with make tcp.

[micro@master NetPIPE_3.3]$ make tcp

NetPIPE measures performance with a two-way ping-pong test. For a TCP benchmark, run it as the receiver on one machine and as the transmitter on the other.

[micro@master NetPIPE_3.3]$ ./NPtcp -r &
[micro@master NetPIPE_3.3]$ rsh node01
[micro@node01 NetPIPE_3.3]$ ./NPtcp -t -h master 
Send and receive buffers are 512000 and 512000 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
  0:       1 bytes    500 times -->      0.10 Mbps in      78.87 usec
  1:       2 bytes   1267 times -->      0.19 Mbps in      78.89 usec
  2:       3 bytes   1267 times -->      0.29 Mbps in      78.84 usec
  3:       4 bytes    845 times -->      0.39 Mbps in      79.14 usec
  4:       6 bytes    947 times -->      0.58 Mbps in      79.00 usec
………………………………
123: 8388611 bytes      3 times -->    505.81 Mbps in  126529.34 usec
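The ping-pong pattern can be sketched in a few lines of Python. This is an illustrative sketch only, not NetPIPE's implementation: it bounces a message between two threads over a local socket pair (socket.socketpair) and halves the round-trip time. The names echo_server, recv_exact and ping_pong are hypothetical, and timings taken on a local socket pair say nothing about the Ethernet link itself.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes (a single recv may return fewer)."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(sock, size, reps):
    """Receiver side: read a message and send it straight back."""
    for _ in range(reps):
        sock.sendall(recv_exact(sock, size))


def ping_pong(size, reps=200):
    """Return the average one-way time in seconds for `size`-byte messages."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(b, size, reps))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(reps):
        a.sendall(payload)      # ping
        recv_exact(a, size)     # pong
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    # One repetition is a full round trip, so halve it for the one-way time.
    return elapsed / (2 * reps)


if __name__ == "__main__":
    for size in (1, 1024, 65536):
        print(f"{size:6d} bytes  {ping_pong(size) * 1e6:10.2f} usec one-way")
```

Like NetPIPE, the sketch repeats each message size many times and reports an average, because a single round trip is too short to time reliably.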

In the author's environment the peak result was as follows.

Table 4-1. NetPIPE benchmark result (TCP)

Message Size	Bandwidth	Latency (usec)
8.3 Mbytes	505 Mbps	126529
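The bandwidth column can be checked against the message size and the elapsed time. The sketch below assumes that NetPIPE counts one Mb as 2^20 bits; that is inferred from the numbers in the listing, not taken from the NetPIPE documentation. The function name netpipe_mbps is hypothetical.

```python
def netpipe_mbps(nbytes, usec):
    """Bandwidth in Mbps, assuming 1 Mb = 2**20 bits (inferred, see above)."""
    bits = nbytes * 8
    seconds = usec * 1.0e-6
    return bits / seconds / 2**20


if __name__ == "__main__":
    # Last line of the NPtcp listing: 8388611 bytes in 126529.34 usec.
    print(f"{netpipe_mbps(8388611, 126529.34):.2f} Mbps")
```

With decimal megabits (10^6 bits) the same line would come out near 530 Mbps, so the unit matters when comparing against figures from other tools.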

Now measure network performance over MPI. This document measures the MPI performance of LAM-MPI. If you have already set mpicc in the makefile, compile.

[micro@master NetPIPE_3.3]$ make mpi

The NPmpi program should now be built. Run it with LAM-MPI's mpirun.

[micro@master NetPIPE_3.3]$ mpirun -O -np 2 ./NPmpi
0: master
1: node01
  0:       1 bytes    500 times -->      0.09 Mbps in      81.29 usec
  1:       2 bytes   1230 times -->      0.19 Mbps in      81.40 usec
  2:       3 bytes   1228 times -->      0.28 Mbps in      81.38 usec
  3:       4 bytes    819 times -->      0.38 Mbps in      81.16 usec
………………………………
123: 8388611 bytes      3 times -->    504.27 Mbps in  126917.33 usec
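Comparing matching lines of the NPtcp and NPmpi listings gives a rough idea of what the MPI layer costs on top of raw TCP. The sketch below uses only the times printed in the two listings; the function name overhead_percent is hypothetical.

```python
def overhead_percent(tcp_usec, mpi_usec):
    """Extra time of the MPI run relative to the TCP run, in percent."""
    return (mpi_usec - tcp_usec) / tcp_usec * 100.0


if __name__ == "__main__":
    # (message size in bytes, NPtcp usec, NPmpi usec) from the listings above
    samples = [(1, 78.87, 81.29), (8388611, 126529.34, 126917.33)]
    for size, tcp, mpi in samples:
        print(f"{size:8d} bytes: {overhead_percent(tcp, mpi):.2f} % slower over MPI")
```

In these two samples the relative cost is larger for the 1-byte message than for the 8 MB message.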

The peak MPI network performance of LAM-MPI was as follows.

Table 4-2. NetPIPE benchmark result (MPI)

Message Size	Bandwidth	Latency (usec)
8.3 Mbytes	508 Mbps	94347

Try testing the MPI network performance of MPICH in the same way.

4.5.3. SCALAPACK (LINPACK Benchmark)

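LINPACK is a package for solving systems of linear equations. Its functionality is included in SCALAPACK, and most of its work consists of floating-point operations. The routines at the heart of the LINPACK benchmark solve a system of N linear equations by Gaussian elimination and are built on BLAS (Basic Linear Algebra Subprograms). BLAS is the most basic library underneath the LINPACK benchmark: a collection of functions that implement elementary linear algebra operations. It is written in Fortran, and its functions are grouped into levels according to whether the operands and results are vectors or matrices. You can benchmark with the reference BLAS, but you can also use ATLAS (Automatically Tuned Linear Algebra Software) to generate a library of routines optimized for your platform. This document uses BLAS, BLACS, ATLAS and SCALAPACK for benchmarking. Only a brief guide to installation and execution is given here; for further details see the SCALAPACK home page http://www.netlib.org/scalapack/ or the benchmark guide of the Korea Cluster Technology Center (http://www.cluster.or.kr/board/read.php?table=benchmark).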
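The three BLAS levels can be illustrated with one operation from each. The sketch below is plain Python written for illustration, not the BLAS interface: the function names borrow from the BLAS routine families AXPY, GEMV and GEMM, but the signatures are simplified and hypothetical (the real routines also take scaling factors, strides and leading dimensions).

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A*x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A*B."""
    columns = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in columns]
            for row in a]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    print(axpy(2.0, x, [0.5, 0.5]))
    print(gemv(a, x))
    print(gemm(a, a))
```

Level 3 routines do on the order of n^3 arithmetic on n^2 data, which is why a tuned matrix-matrix multiply such as the one ATLAS generates has so much influence on LINPACK results.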

4.5.3.1. LINPACK

Download BLAS from http://www.netlib.org/blas/blas.tgz and install it.

[micro@master share]$ mkdir BLAS
[micro@master share]$ cd BLAS
[micro@master BLAS]$ tar xzf ../blas.tgz

Compile it. Using a compiler optimized for your processor can improve performance (for example pgcc on Intel processors or Compaq's ccc).

[micro@master BLAS]$ f77 -c *.f

Build a library from the generated object files (*.o).

[micro@master BLAS]$ ar cr blas_LINUX.a *.o

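Install BLACS (Basic Linear Algebra Communication Subprograms). BLACS is a message-passing library for linear algebra that handles communication between processors across a variety of distributed-memory environments. Separate versions exist for PVM and MPI, so download the files you need. MPI is used here, so download mpiblacs.tgz and blacstester.tgz from http://www.netlib.org/blacs/ . Unpacking mpiblacs.tgz creates the BLACS directory.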

[micro@master share]$ tar xzf mpiblacs.tgz 
[micro@master share]$ tar xzf blacstester.tgz BLACS/TESTING/*
[micro@master share]$ cd BLACS

Copy the Bmake file that matches your machine from the BMAKES directory into the BLACS directory.

[micro@master BLACS]$ cp BMAKES/Bmake.MPI-LINUX ./Bmake.inc

Edit Bmake.inc. The file is divided into three sections, each defining macros needed during compilation. Section 1 defines macros for the locations of libraries and executables and for the names of the files that make produces. Section 2 defines the C preprocessor values used by BLACS. Section 3 defines macros that set the compilers and the linker/loaders.

[micro@master BLACS]$ vi Bmake.inc

#============ SECTION 1: PATHS AND LIBRARIES =======================
SHELL = /bin/sh			<- shell to use

BTOPdir = $(HOME)/BLACS  	<- top-level BLACS directory

COMMLIB = MPI			<- communication library: one of CMMD, MPI, PVM, MPL, NX

PLAT = LINUX               	<- platform

BLACSdir    = $(BTOPdir)/LIB	<- location of the BLACS libraries
BLACSDBGLVL = 1			<- debug level (0 = NO, 1 = YES)
BLACSFINIT  = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT  = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB    = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
			<- library names

MPIdir = /usr/local/mpich		<- MPICH location
MPIdev = ch_p4mpd			<- MPICH device type
MPIplat = LINUX
MPILIBdir = $(MPIdir)/$(MPIdev)/lib	<- MPICH library directory
MPIINCdir = $(MPIdir)/$(MPIdev)/include	<- MPICH header directory
MPILIB = $(MPILIBdir)/libmpich.a	<- MPICH library file

BTLIBS = $(BLACSFINIT) $(BLACSLIB) $(BLACSFINIT) $(MPILIB)
			<- libraries needed for testing
INSTdir = $(BTOPdir)/INSTALL/EXE
TESTdir = $(BTOPdir)/TESTING/EXE
FTESTexe = $(TESTdir)/xFbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
CTESTexe = $(TESTdir)/xCbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
#================= End SECTION 1===============================
#============== SECTION 2: BLACS INTERNALS ========================
SYSINC = -I$(MPIINCdir)
INTFACE = -Df77IsF2C	<- Fortran 77 to C interface convention: Add_, NoChange, UpCase or f77IsF2C. If unsure, run the INSTALL/EXE/xintface program.
SENDIS =           <- If set to -DSndIsLocBlk, MPI_Send is treated as a
                   locally-blocking routine, which is more efficient.
                   If left blank, globally-blocking is assumed.
BUFF =
TRANSCOMM = -DUseMpich <- The value to set here differs between platforms. The comments in the file direct you to run xtc_CsameF77 and xtc_UseMpich from BLACS/INSTALL. Build them in the BLACS/INSTALL directory as follows:
	$ make xtc_CsameF77
	$ make xtc_UseMpich
Running them from BLACS/INSTALL/EXE prints the value to set:
	$ mpirun -np 2 xtc_CsameF77
	.............
	Set TRANSCOMM = -DUseMpich
	$ ./xtc_UseMpich
	Set TRANSCOMM = -DUseMpich
WHATMPI =
SYSERRORS =
DEBUGLVL = -DBlacsDebugLvl=$(BLACSDBGLVL)
DEFS1 = -DSYSINC $(SYSINC) $(INTFACE) $(DEFBSTOP) $(DEFCOMBTOP) $(DEBUGLVL)
BLACSDEFS = $(DEFS1) $(SENDIS) $(BUFF) $(TRANSCOMM) $(WHATMPI) $(SYSERRORS)
#================= End SECTION 2===============================
#================= SECTION 3: COMPILERS ============================
F77            = f77            <- Fortran compiler
#F77NO_OPTFLAGS = -Nx400
F77FLAGS       = $(F77NO_OPTFLAGS) -O
F77LOADER      = $(F77)
F77LOADFLAGS   =
CC             = gcc            <- C compiler
CCFLAGS        = -O4
CCLOADER       = $(CC)
CCLOADFLAGS    =

ARCH      = ar
ARCHFLAGS = r
RANLIB    = ranlib
#================= End SECTION 3 ===============================

Compile.

[micro@master BLACS]$ make mpi

The file LIB/blacs_MPI-LINUX-1.a should be created. The SRC/ directory contains the routines that users can call, all of which have both C and Fortran 77 interfaces. All non-communication routines begin with the prefix blacs_. BLACS internal routines and global variables all carry the prefix BI_.
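The library file names follow directly from the COMMLIB, PLAT and BLACSDBGLVL macros in Section 1 of Bmake.inc. The sketch below expands them the way make would, which can help when checking that SLmake.inc later points at files that actually exist. The function name blacs_library_names is hypothetical.

```python
def blacs_library_names(commlib, plat, dbglvl, blacsdir="LIB"):
    """Expand the BLACSFINIT, BLACSCINIT and BLACSLIB macros from Bmake.inc."""
    suffix = f"{commlib}-{plat}-{dbglvl}.a"
    return {
        "BLACSFINIT": f"{blacsdir}/blacsF77init_{suffix}",
        "BLACSCINIT": f"{blacsdir}/blacsCinit_{suffix}",
        "BLACSLIB": f"{blacsdir}/blacs_{suffix}",
    }


if __name__ == "__main__":
    # Values used in the Bmake.inc listing above.
    for macro, path in blacs_library_names("MPI", "LINUX", 1).items():
        print(f"{macro:10s} = {path}")
```

Changing BLACSDBGLVL to 0 changes the trailing digit of every file name, so the debug level has to match between Bmake.inc and SLmake.inc.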

Install ATLAS (Automatically Tuned Linear Algebra Software). Download the files from the ATLAS home page (http://math-atlas.sourceforge.net) and install them. At the time of writing, the latest ATLAS version is 3.5.2.

[micro@master share]$ tar xzf atlas3.5.2.tar.gz
[micro@master share]$ cd ATLAS
[micro@master ATLAS]$ make config CC=gcc  # if CC is not specified, gcc is used
[micro@master ATLAS]$ make config
gcc -o xconfig config.c
./xconfig
ATLAS configure started.
160
159
……
001
Enter number at top left of screen [0]: 160 # enter the largest number shown on screen
====================================================================
IMPORTANT
====================================================================
Before going any further, check
http://math-atlas.sourceforge.net/errata.html.
This is the ATLAS errata file, which keeps a running count of all known
ATLAS bugs and system problems, with associated workarounds or fixes.
IF YOU DO NOT CHECK THIS FILE, YOU MAY BE COMPILING A LIBRARY WITH KNOWN BUGS.

Have you scoped the errata file? [y]: y 	# read the errata document first
Configure will ask a series of questions, in one of two forms. The first form of question is a menu of choices. One option in almost all menus is
'Other/UNKNOWN'. If you are unsure of the answer, always choose this option.
…………..(omitted)……….
Are you ready to continue? [y]: y 
I need to know if you are using a cross-compiler (i.e., you are compiling on a different architecture than you want the library built for).

Are you using a cross-compiler? [n]: n
Probing to make operating system determination:
Operating system configured as Linux # check that this is correct

Probing for architecture:
Architecture is set to ATHLON # check that this is correct

Probing for supported ISA extensions:
make[2]: *** [atlas_run] Error 132
make[1]: *** [IRun_SSE1] Error 2
SSE2: NO.
SSE1: DETECTED!
Number of CPUs: 1
Required cache flush detected as : 524288 bytes
Looking for compilers:

F77 = /usr/bin/g77 -funroll-all-loops -O3
CC = /usr/bin/gcc -fomit-frame-pointer -O3 -funroll-all-loops
MCC = /usr/bin/gcc -fomit-frame-pointer -O

Looking for BLAS (this may take a while):
Unable to find usable BLAS, BLASlib left blank.
FINDING tar, gzip, AND gunzip
tar : /bin/tar
gzip : /bin/gzip
gunzip : /bin/gunzip


ATLAS has default parameters for OS='Linux' and system='ATHLON'.
If you want to just trust these default values, you can use express setup,
drastically reducing the amount of questions you are required to answer

use express setup? [y]: y
……………
Enter Architecture name (ARCH) [Linux_ATHLONSSE1]: (press Enter)

[micro@master ATLAS]$ make install arch=<arch>

arch is the architecture name, which is printed at the end of the config step. Following the config output above, enter make install arch=Linux_ATHLONSSE1.

[micro@master ATLAS]$ make install arch=Linux_ATHLONSSE1
.........
(omitted; takes more than an hour)
ATLAS install complete. Examine
ATLAS/bin//INSTALL_LOG/SUMMARY.LOG for details.

Next, install SCALAPACK. MPICH, BLAS and BLACS must already be installed. Download the latest scalapack from http://www.netlib.org/scalapack/ ; unpacking it creates the SCALAPACK directory.

[micro@master share]$ tar xzf scalapack.tgz

Edit SLmake.inc. This file is included by every Makefile and defines the macros needed for installation. Look in the INSTALL directory, copy the SLmake.inc variant that suits your system, and edit it.

[micro@master share]$ cd SCALAPACK
[micro@master SCALAPACK]$ cp INSTALL/SLmake.LINUX ./SLmake.inc
[micro@master SCALAPACK]$ vi SLmake.inc

Keep the defaults for most values. Since ATLAS was used earlier to build a BLAS optimized for this platform, adjust the related settings to match.

############################################################################
#
#  Program:         ScaLAPACK
#
#  Module:          SLmake.inc
#
#  Purpose:         Top-level Definitions
#
#  Creation date:   February 15, 2000
#
#  Modified:
#
#  Send bug reports, comments or suggestions to scalapack@cs.utk.edu
#
############################################################################
#
SHELL         = /bin/sh
#
#  The complete path to the top level of ScaLAPACK directory, usually
#  $(HOME)/SCALAPACK
#
home          = $(HOME)/SCALAPACK
#
#  The platform identifier to suffix to the end of library names
#
PLAT          = LINUX
#
#  BLACS setup.  All version need the debug level (0 or 1),
#  and the directory where the BLACS libraries are
#
BLACSDBGLVL   = 1
BLACSdir      = $(HOME)/BLACS/LIB
#
#  MPI setup; tailor to your system if using MPIBLACS
#  Will need to comment out these 6 lines if using PVM
#
USEMPI        = -DUsingMpiBlacs
#SMPLIB        = /usr/lib/mpi/build/LINUX/ch_p4/lib/libmpich.a
SMPLIB        = /usr/local/mpich/lib/libmpich.a
BLACSFINIT    = $(BLACSdir)/blacsF77init_MPI-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT    = $(BLACSdir)/blacsCinit_MPI-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB      = $(BLACSdir)/blacs_MPI-$(PLAT)-$(BLACSDBGLVL).a
TESTINGdir    = $(home)/TESTING

#
#  PVMBLACS setup, uncomment next 6 lines if using PVM
#
#USEMPI        =
#SMPLIB        = $(PVM_ROOT)/lib/$(PLAT)/libpvm3.a
#BLACSFINIT    =
#BLACSCINIT    =
#BLACSLIB      = $(BLACSdir)/blacs_PVM-$(PLAT)-$(BLACSDBGLVL).a
#TESTINGdir    = $(HOME)/pvm3/bin/$(PLAT)

CBLACSLIB     = $(BLACSCINIT) $(BLACSLIB) $(BLACSCINIT)
FBLACSLIB     = $(BLACSFINIT) $(BLACSLIB) $(BLACSFINIT)

#
#  The directories to find the various pieces of ScaLapack
#
PBLASdir      = $(home)/PBLAS
SRCdir        = $(home)/SRC
TESTdir       = $(home)/TESTING
PBLASTSTdir   = $(TESTINGdir)
TOOLSdir      = $(home)/TOOLS
REDISTdir     = $(home)/REDIST
REDISTTSTdir  = $(TESTINGdir)
#
#  The fortran and C compilers, loaders, and their flags
#
F77           = /usr/local/mpich/bin/mpif77
CC            = /usr/local/mpich/bin/mpicc
NOOPT        = 
F77FLAGS     =  -funroll-all-loops -O3 $(NOOPT)
DRVOPTS      = $(F77FLAGS)
CCFLAGS      = -O4
SRCFLAG       =
#F77LOADER     = $(F77)
F77LOADER     = $(F77)
CCLOADER      = $(CC)
F77LOADFLAGS  =
CCLOADFLAGS   =
#
#  C preprocessor defs for compilation 
#  (-DNoChange, -DAdd_, -DUpCase, or -Df77IsF2C)
#
CDEFS         = -Df77IsF2C -DNO_IEEE $(USEMPI)
#
#  The archiver and the flag(s) to use when building archive (library)
#  Also the ranlib routine.  If your system has no ranlib, set RANLIB = echo
#
ARCH          = ar
ARCHFLAGS     = cr
RANLIB        = ranlib
#
#  The name of the libraries to be created/linked to
#
SCALAPACKLIB  = $(home)/libscalapack.a
#BLASLIB       = $(HOME)/BLAS/blas_LINUX.a
# Point to the BLAS library built by ATLAS.
BLASLIB       = -L$(HOME)/ATLAS/lib/Linux_ATHLONSSE1 -lf77blas -latlas
#
PBLIBS        = $(SCALAPACKLIB) $(FBLACSLIB) $(BLASLIB) $(SMPLIB)
PRLIBS        = $(SCALAPACKLIB) $(CBLACSLIB) $(SMPLIB)
RLIBS         = $(SCALAPACKLIB) $(FBLACSLIB) $(CBLACSLIB) $(BLASLIB) $(SMPLIB)
LIBS          = $(PBLIBS)
############################################################################

Compile. If errors occur during compilation, fix SLmake.inc and compile again.

[micro@master SCALAPACK]$ make lib

The file libscalapack.a is created under the SCALAPACK directory. If everything has gone well so far, run a simple test program.

[micro@master SCALAPACK]$ cd TESTING
[micro@master TESTING]$ cd LIN
[micro@master LIN]$ make double
[micro@master LIN]$ cd ..
[micro@master TESTING]$ /usr/local/mpich/bin/mpirun -np [number of processes] ./xdlu 
ScaLAPACK Ax=b by LU factorization.
'MPI Machine'

Tests of the parallel real double precision LU factorization and solve.
The following scaled residual checks will be computed:
 Solve residual         = ||Ax - b|| / (||x|| * ||A|| * eps * N)
 Factorization residual = ||A - LU|| / (||A|| * eps * N)
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME    : Indicates whether WALL or CPU time was used.
M       : The number of rows in the matrix A.
N       : The number of columns in the matrix A.
NB      : The size of the square blocks the matrix A is split into.
NRHS    : The total number of RHS to solve for.
NBRHS   : The number of RHS to be put on a column of processes before going
          on to the next column of processes.
P       : The number of process rows.
Q       : The number of process columns.
THRESH  : If a residual value is less than THRESH, CHECK is flagged as PASSED
LU time : Time in seconds to factor the matrix
Sol Time: Time in seconds to solve the system.
MFLOPS  : Rate of execution for factor and solve.

The following parameter values will be used:
  M       :         10000
  N       :         10000
  NB      :            36
  NRHS    :             3
  NBRHS   :             3
  P       :             1
  Q       :             7

Relative machine precision (eps) is taken to be       0.111022E-15
Routines pass computational tests if scaled residual is less than   1.0000

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------
WALL 10000 10000  36     3    3    1    7   100.52     0.46  6607.53 PASSED
Finished      1 tests, with the following results:
    1 tests completed and passed residual checks.
    0 tests completed and failed residual checks.
    0 tests skipped because of illegal input values.

END OF TESTS.
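The PASSED flag in the output above comes from the scaled residual check printed in the header. The sketch below shows the idea on a tiny system: it solves Ax = b by Gaussian elimination with partial pivoting and evaluates the solve residual ||Ax - b|| / (||x|| * ||A|| * eps * N). It is an illustration, not the ScaLAPACK routine; the function names are hypothetical and the use of infinity norms is an assumption, since the output does not say which norm the tester uses.

```python
import sys


def solve(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Bring the largest remaining entry of column k onto the diagonal.
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def scaled_residual(a, x, b):
    """||Ax - b|| / (||x|| * ||A|| * eps * N), infinity norms (assumed)."""
    n = len(a)
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi
         for row, bi in zip(a, b)]
    norm_r = max(abs(e) for e in r)
    norm_x = max(abs(e) for e in x)
    norm_a = max(sum(abs(e) for e in row) for row in a)
    eps = sys.float_info.epsilon / 2  # about 1.11e-16, as in the output above
    return norm_r / (norm_x * norm_a * eps * n)


if __name__ == "__main__":
    a = [[4.0, -2.0, 1.0], [-2.0, 4.0, -2.0], [1.0, -2.0, 4.0]]
    b = [11.0, -16.0, 17.0]
    x = solve(a, b)
    print("x =", x)
    print("passes:", scaled_residual(a, x, b) < 1.0)  # threshold 1.0 as above
```

Dividing by eps and N makes the check independent of problem size and machine precision, which is why one threshold of 1.0 can be applied to every test.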

Results similar to the above should appear. The following assumes SCALAPACK is installed and explains how to run the LINPACK benchmark on a single node. To use the ATLAS-optimized routines, set SLmake.inc to use the BLAS routines provided by ATLAS.

[micro@master SCALAPACK]$ vi SLmake.inc 
......(omitted)
BLASLIB       = -L$(HOME)/ATLAS/lib/Linux_ATHLONSSE1 -lf77blas -latlas
......(omitted)

By default the SCALAPACK/TESTING directory contains 13 .dat files and the LIN and EIG directories. LIN and EIG hold the benchmark source code, written in Fortran, and their Makefiles. The LIN directory contains the programs for Linear Equations Testing; the EIG directory contains the programs for Eigenroutine Testing. The .dat files in the TESTING directory and their purposes are as follows.

BLLT.dat 'ScaLAPACK, Version 1.2, banded linear systems input file'
BLU.dat  'ScaLAPACK, Version 1.2, banded linear systems input file'
BRD.dat  'ScaLAPACK BRD input file'
HRD.dat  'ScaLAPACK HRD input file'
INV.dat  'ScaLAPACK, Version 1.0, Matrix Inversion Testing input file'
LLT.dat  'ScaLAPACK, LLt factorization input file'
LS.dat   'ScaLAPACK LS solve input file'
LU.dat   'SCALAPACK, LU factorization input file'
NEP.dat  'SCALAPACK NEP (Nonsymmetric Eigenvalue Problem) input file'
QR.dat   'ScaLAPACK, Orthogonal factorizations input file'
SEP.dat  'ScaLAPACK Symmetric Eigensolver Test File'
SVD.dat  'ScaLAPACK Singular Value Decomposition  input file'
TRD.dat  'ScaLAPACK TRD computation input file'

Each .dat file stores the parameters needed by its benchmark. Take a look at LU.dat, which is needed for the test.

[micro@master TESTING]$ vi LU.dat 
-- LU.dat --
'SCALAPACK, LU factorization input file'
'MPI Machine'
'LU.out'                output file name (if any)
6                       device out
4                       number of problems sizes
4 10 17 13 23 31 57     values of M
4 12 13 13 23 31 50     values of N
3                       number of NB's
2 3 4 5                 values of NB
3                       number of NRHS's
1 3 9 28                values of NRHS
3                       Number of NBRHS's
1 3 5 7                 values of NBRHS
4                       number of process grids (ordered pairs of P & Q)
1 2 1 4 2 3 8           values of P
1 2 4 1 3 2 1           values of Q
1.0                     threshold
T                       (T or F) Test Cond. Est. and Iter. Ref. Routines
-- LU.dat --
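In the listing above each count line is followed by one or more value lines, and only the first `count` values on those lines are used. The sketch below applies that rule to the problem-size and process-grid lines copied from the listing. It is a hypothetical helper for reading the file by eye, not part of ScaLAPACK, and the names leading_ints and parse are made up for this example.

```python
LU_DAT_LINES = """\
4                       number of problems sizes
4 10 17 13 23 31 57     values of M
4 12 13 13 23 31 50     values of N
4                       number of process grids (ordered pairs of P & Q)
1 2 1 4 2 3 8           values of P
1 2 4 1 3 2 1           values of Q
"""


def leading_ints(line):
    """Collect the integers at the start of a line, stopping at the label."""
    values = []
    for token in line.split():
        if not token.isdigit():
            break
        values.append(int(token))
    return values


def parse(text):
    """Return the (M, N) problem sizes and (P, Q) grids that will be used."""
    rows = [leading_ints(line) for line in text.splitlines()]
    n_sizes, m_all, n_all, n_grids, p_all, q_all = rows
    sizes = list(zip(m_all[:n_sizes[0]], n_all[:n_sizes[0]]))
    grids = list(zip(p_all[:n_grids[0]], q_all[:n_grids[0]]))
    return sizes, grids


if __name__ == "__main__":
    sizes, grids = parse(LU_DAT_LINES)
    print("problem sizes (M, N):", sizes)
    for p, q in grids:
        print(f"grid {p} x {q} needs {p * q} processes")
```

The surplus values on each line are simply ignored, which makes it easy to keep a longer list in the file and change only the count.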

The general format is a count followed by the list of values for each benchmark parameter. For example, number of problems sizes is 4, so the values used for the test are M = 4, 10, 17, 13 and N = 4, 12, 13, 13. The values of P and Q define the process grid: P is the number of process rows and Q the number of process columns. In the first case above, P = 1 and Q = 1, so 1 x 1 = 1 and the test runs on one processor. In the second case, P = 2 and Q = 2, so 2 x 2 = 4 and the test runs on four processors (nodes).

To build the benchmark programs, run make [type] in the LIN or EIG directory. There are four types: single, double, complex and complex16. For example, make single builds the benchmark programs for single-precision floating point, and make all compiles the programs for all four types at once. The file names and counts for each type are as follows.

[micro@master LIN]$ make single 
[micro@master LIN]$ ls ../xs* 
xsdblu*  xsdtlu*  xsgblu*  xsinv*  xsllt*  xsls*  xslu*  xspbllt*  xsptllt*  xsqr*
[micro@master LIN]$ make double
[micro@master LIN]$ ls ../xd*
xddblu*  xddtlu*  xdgblu*  xdinv*  xdllt*  xdls*  xdlu*  xdpbllt*  xdptllt*  xdqr*
[micro@master LIN]$ make complex 
[micro@master LIN]$ ls ../xc*
xcdblu*  xcdtlu*  xcgblu*  xcinv*  xcllt*  xcls*  xclu*  xcpbllt*  xcptllt*  xcqr*
[micro@master LIN]$ make complex16 
[micro@master LIN]$ ls ../xz* 
xzdblu*  xzdtlu*  xzgblu*  xzinv*  xzllt*  xzls*  xzlu*  xzpbllt*  xzptllt*  xzqr*

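In all, 40 executables are generated. Running make the same way in the EIG directory builds its programs too. Try running each program. To run under MPI, use mpirun -np N program. The LINPACK benchmark is driven by the parameters in LU.dat. ScaLAPACK operates on blocks, and to get peak performance on a parallel computer such as a cluster you need to find a block size suited to the machine. A rough value can be calculated first and then refined empirically over many runs. To find the largest problem size the computer can handle (Nmax), increase the problem size step by step on a single processor until you find the largest size that fits in the node's memory. Finally, based on this, run the LU factorization routine at the largest size that the parallel computer with many nodes can handle to obtain the peak performance (Rmax).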
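A starting point for the Nmax search can be calculated from the memory size, since an N x N matrix of double-precision values occupies 8 * N * N bytes. The sketch below is only a rough estimate under stated assumptions: the 0.8 fraction (leaving about 20% of memory for the operating system and workspace) is a rule of thumb chosen for this example, not a value from this document, and estimate_nmax is a hypothetical name.

```python
import math


def estimate_nmax(nodes, mem_per_node_bytes, fraction=0.8, bytes_per_element=8):
    """Largest N such that an N x N matrix fits in `fraction` of total memory."""
    usable = int(nodes * mem_per_node_bytes * fraction)
    return math.isqrt(usable // bytes_per_element)


if __name__ == "__main__":
    two_gb = 2 * 1024**3  # memory per node in the test system
    for nodes in (1, 7):
        print(f"{nodes} node(s): start searching near N = {estimate_nmax(nodes, two_gb)}")
```

The estimate only narrows the range; the empirical search described above is still needed, because the drivers allocate additional workspace beyond the matrix itself.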

In LIN/pdludriver.f, written in Fortran, vary the value of TOTMEM and find the range in which a segmentation fault occurs. This depends not only on main memory but also on the size of the swap area: setting TOTMEM larger than the swap area will cause a segmentation fault. The source must of course be recompiled after each change. With 2 GB of memory and 500 MB of swap, TOTMEM was set to 500000000. Edit LU.dat in the TESTING directory as follows and run xdlu. If the computation needs more memory than main memory provides, you will see the hard disk working as the swap area is accessed.

-- LU.dat --
'SCALAPACK, LU factorization input file'
'MPI Machine'
'LU.out'                output file name (if any)
6                       device out
6                       number of problems sizes
1000 1200 1400 1600 1800 2000 values of M
1000 1200 1400 1600 1800 2000 values of N
1                       number of NB's
60                      values of NB
1                       number of NRHS's
1                       values of NRHS
1                       Number of NBRHS's
1                       values of NBRHS
1                       number of process grids (ordered pairs of P & Q)
1                       values of P
1                       values of Q
1.0                     threshold
T                       (T or F) Test Cond. Est. and Iter. Ref. Routines
-- LU.dat --

Running this twice gave the following results. First run:

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------
WALL  1000  1000  60     1    1    1    1     0.64     0.01  1026.42 PASSED
WALL  1200  1200  60     1    1    1    1     1.05     0.02  1078.62 PASSED
WALL  1400  1400  60     1    1    1    1     1.67     0.02  1083.34 PASSED
WALL  1600  1600  60     1    1    1    1     2.29     0.03  1177.45 PASSED
WALL  1800  1800  60     1    1    1    1     3.13     0.04  1227.84 PASSED
WALL  2000  2000  60     1    1    1    1     4.37     0.05  1207.76 PASSED

Second run:

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------

WALL  1000  1000  60     1    1    1    1     0.63     0.01  1032.06 PASSED
WALL  1200  1200  60     1    1    1    1     1.05     0.02  1079.69 PASSED
WALL  1400  1400  60     1    1    1    1     1.59     0.02  1134.49 PASSED
WALL  1600  1600  60     1    1    1    1     2.28     0.03  1184.27 PASSED
WALL  1800  1800  60     1    1    1    1     3.12     0.04  1231.93 PASSED
WALL  2000  2000  60     1    1    1    1     4.37     0.05  1207.30 PASSED
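The MFLOPS column can be approximated from N, NRHS and the two times. The sketch below assumes an operation count of (2/3)N^3 - (1/2)N^2 for the factorization plus 2*N^2*NRHS for the solve; that count is the usual estimate for LU on a square matrix and is an assumption here, not something stated in the output. Because the times are printed to only two decimals, the result agrees with the table only approximately. The name lu_mflops is hypothetical.

```python
def lu_mflops(n, nrhs, lu_time, sol_time):
    """Approximate MFLOPS for factoring an n x n matrix and solving nrhs systems."""
    flops = (2.0 / 3.0) * n**3 - 0.5 * n**2 + 2.0 * n**2 * nrhs
    return flops / (lu_time + sol_time) / 1.0e6


if __name__ == "__main__":
    # Last row of the second run: N = 2000, LU time 4.37 s, solve time 0.05 s.
    print(f"{lu_mflops(2000, 1, 4.37, 0.05):.2f} MFLOPS (table reports 1207.30)")
```

The N^3 term dominates, so doubling N multiplies the work by about eight; this is why the run time grows so quickly with problem size in the tables.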

MFLOPS rises as the problem size grows, then drops once the run is large enough to use swap. Next, run with different NB values. Adjust M and N to suit your system and vary NB from 28 to 60.

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------

WALL  5000  5000  28     1    1    1    7    19.85     0.14  4170.73 PASSED
WALL  5000  5000  30     1    1    1    7    14.85     0.14  5562.10 PASSED
WALL  5000  5000  32     1    1    1    7    15.40     0.13  5367.77 PASSED
WALL  5000  5000  34     1    1    1    7    15.89     0.15  5198.10 PASSED
WALL  7000  7000  28     1    1    1    7    49.39     0.24  4608.81 PASSED
WALL  7000  7000  30     1    1    1    7    37.77     0.27  6013.41 PASSED
WALL  7000  7000  32     1    1    1    7    38.96     0.25  5833.21 PASSED
WALL  7000  7000  34     1    1    1    7    39.07     0.26  5816.04 PASSED
WALL 10000 10000  28     1    1    1    7   133.66     0.41  4973.60 PASSED
WALL 10000 10000  30     1    1    1    7    99.69     0.45  6659.18 PASSED
WALL 10000 10000  32     1    1    1    7   102.15     0.43  6500.25 PASSED
WALL 10000 10000  34     1    1    1    7   101.73     0.40  6529.03 PASSED
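With many rows it is easier to let a short script pick the best block size for each problem size. The sketch below uses the (N, NB, MFLOPS) values copied from the table above; best_nb_per_size is a hypothetical helper name.

```python
RESULTS = [
    # (N, NB, MFLOPS) copied from the table above
    (5000, 28, 4170.73), (5000, 30, 5562.10),
    (5000, 32, 5367.77), (5000, 34, 5198.10),
    (7000, 28, 4608.81), (7000, 30, 6013.41),
    (7000, 32, 5833.21), (7000, 34, 5816.04),
    (10000, 28, 4973.60), (10000, 30, 6659.18),
    (10000, 32, 6500.25), (10000, 34, 6529.03),
]


def best_nb_per_size(results):
    """Map each problem size N to the (NB, MFLOPS) pair with the highest rate."""
    best = {}
    for n, nb, mflops in results:
        if n not in best or mflops > best[n][1]:
            best[n] = (nb, mflops)
    return best


if __name__ == "__main__":
    for n, (nb, mflops) in sorted(best_nb_per_size(RESULTS).items()):
        print(f"N = {n:5d}: best NB = {nb} at {mflops:.2f} MFLOPS")
```

For this data the same NB wins at every problem size, which suggests the optimum depends more on the machine than on N within this range.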

The best performance measured in the experiment above is 6659.18 MFLOPS at M=N=10000, NB=30. Now fix M=N at a single value and look for the optimal NB.

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------

WALL  1000  1000  28     1    1    1    7     1.09     0.03   595.45 PASSED
WALL  1000  1000  30     1    1    1    7     0.33     0.02  1939.09 PASSED
WALL  1000  1000  32     1    1    1    7     0.33     0.02  1911.82 PASSED
WALL  1000  1000  34     1    1    1    7     0.37     0.02  1699.22 PASSED
WALL  1000  1000  36     1    1    1    7     0.38     0.02  1695.32 PASSED
WALL  1000  1000  38     1    1    1    7     0.40     0.02  1606.20 PASSED

With M=N fixed at 1000, NB = 30 gave the best performance.

4.5.3.2. HPL (High-Performance Linpack Benchmark)

Next, benchmark with HPL, a program used to benchmark large-memory systems (it is the program used by the TOP500 site, which ranks the world's supercomputers). Before installing HPL, BLAS, MPICH, CBLAS and so on must be installed. ATLAS must be installed too, because its BLAS routines are used here. For CBLAS installation, see the Korea Cluster Technology Center page http://www.cluster.or.kr/board/read.php?table=benchmark=3 or http://www.netlib.org/blas/ . Download HPL from http://www.netlib.org/benchmark/hpl/ and unpack it.

[micro@master share]$ tar xzf hpl.tgz

From the setup directory inside hpl, copy the make file that matches your platform into the hpl top-level directory. Here the target is Linux on an Athlon chip with CBLAS, the C interface to BLAS, so the file name is as follows.

[micro@master share]$ cd hpl 
[micro@master hpl]$ cp setup/Make.Linux_ATHLON_CBLAS .

Edit the file.

[micro@master hpl]$ vi Make.Linux_ATHLON_CBLAS
------ Make.Linux_ATHLON_CBLAS -------
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch

ARCH         = Linux_ATHLON_CBLAS

TOPdir       = $(HOME)/hpl
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a

#CC           = gcc
CC           = /usr/local/mpich/bin/mpicc  <- MPICH C compiler
NOOPT        =
#CCFLAGS      = -fomit-frame-pointer -O3 -funroll-loops -W -Wall
CCFLAGS      = -fomit-frame-pointer -O3 -funroll-loops
#
#LINKER       = gcc
LINKER       = /usr/local/mpich/bin/mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo

MPdir        = /usr/local/mpich
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpich.a

F2CDEFS      =
NOOPT        =
F77          = /usr/local/mpich/bin/mpif77
F77LOADER    = /usr/local/mpich/bin/mpif77
F77FLAGS     = -O $(NOOPT)

LAdir        = $(HOME)/ATLAS/lib/Linux_ATHLONSSE1
LAinc        = $(HOME)/ATLAS/include/Linux_ATHLONSSE1
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a

#HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
		<- change the line above to the line below
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) -I$(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)

HPL_OPTS     = -DHPL_CALL_CBLAS
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
------ Make.Linux_ATHLON_CBLAS -------

Next, edit the arch setting in Make.top and in the Makefile.

#
arch             = Linux_ATHLON_CBLAS
#

Compile by entering make arch=[your system]. A [your system] directory should then be created under the bin directory.

[micro@master hpl]$ make arch=Linux_ATHLON_CBLAS 
[micro@master hpl]$ cd bin/Linux_ATHLON_CBLAS

In bin/Linux_ATHLON_CBLAS you will find the files HPL.dat and xhpl. HPL.dat holds the parameters for the benchmark, much like the configuration file used for the LINPACK benchmark earlier, and xhpl is the executable that actually runs the benchmark. Let us look at the format of HPL.dat.

[micro@master Linux_ATHLON_CBLAS]$ vi HPL.dat
----- HPL.dat -----
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
85           NBs
1            # of process grids (P x Q)
1            Ps
7            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
----- HPL.dat -----
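The Ps and Qs lines in HPL.dat have to describe a grid that the started processes can fill. The sketch below lists every P x Q grid whose product equals a given process count, assuming the grid should use exactly the processes that are started; process_grids is a hypothetical helper name.

```python
def process_grids(nprocs):
    """Return every (P, Q) pair with P * Q == nprocs, ordered by P."""
    return [(p, nprocs // p) for p in range(1, nprocs + 1) if nprocs % p == 0]


if __name__ == "__main__":
    for nprocs in (4, 7, 8):
        print(nprocs, "processes:", process_grids(nprocs))
```

With a prime number of processes such as 7, only the flat 1 x 7 and 7 x 1 grids are possible, which matches the P = 1, Q = 7 setting in the listing above.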

As you can see, it does not differ much from LU.dat of the LINPACK benchmark. The differences include a single problem size N instead of separate M and N, and a configurable swapping threshold. For details see the tuning page http://www.netlib.org/benchmark/hpl/tuning.html . Now run xhpl.

[micro@master Linux_ATHLON_CBLAS]$ mpirun -np 7 xhpl
====================================================================
HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27, 2000
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
====================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   10000
NB     :      85
P      :       1
Q      :       7
PFACT  :   Crout
NBMIN  :       4
NDIV   :       2
RFACT  :   Right
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11R2C4        10000    85     1     7              70.49          9.460e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0646673 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0153022 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034203 ...... PASSED
============================================================================
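The Gflops figure in the result line can be reproduced from N and the run time. The sketch below assumes the operation count (2/3)N^3 + (3/2)N^2 for solving one dense linear system; treat the formula as an assumption of this example, and hpl_gflops as a hypothetical name.

```python
def hpl_gflops(n, seconds):
    """Gflops for solving one n x n system in `seconds` of wall time."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9


if __name__ == "__main__":
    # Result line above: N = 10000 solved in 70.49 seconds.
    print(f"{hpl_gflops(10000, 70.49):.3f} Gflops")
```

Because the work is fixed by N, a shorter run time after changing NB or the process grid translates directly into a higher Gflops figure.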

The result above is 9.46 Gflops at N = 10000, NB = 85. As with LINPACK, adjust the problem size and NB to suit your system and work towards the best performance it can deliver. HPL.dat is read each time xhpl runs, so changes to it take effect without recompiling. If you change the make file instead, rebuild as follows.

[micro@master Linux_ATHLON_CBLAS]$ rm -f ./xhpl
[micro@master Linux_ATHLON_CBLAS]$ cd ../../
[micro@master hpl]$ make clean
[micro@master hpl]$ make all
