IBM Support

Adding SAN storage to PureData System for Analytics

Question & Answer


Question

How do I add SAN storage to PureData System for Analytics?

Answer

Limitations

Platform    File System              Considerations
Red Hat 5   ext3, gpfs               Max volume size 16 TB
Red Hat 6   ext3, ext4, xfs, gpfs    Minimum HPF 5.3.3

Note: XFS is not supported on Red Hat 5.

Backup Streams/Threads 

The number of files in concurrent use during loading and unloading is a key factor to consider. When designing a backup strategy, keep in mind that multi-stream/thread backup support was introduced in NPS 6.0 and multi-stream restore was introduced in NPS 7.2.

On TwinFin systems, the sweet spot for backup performance with multiple threads is four.

On Striper and Mako systems, the limiting factor is not the number of streams but the total number of parallel jobs on the SPUs, which is 48. Keep this number in mind when designing a backup strategy, as exceeding it will impact both the backup and other system users.
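
As a hypothetical illustration only (option syntax varies by NPS release; confirm with nzbackup -h before use), one common way to drive a multi-stream backup is to give nzbackup several target locations; on a TwinFin, four locations lines up with the sweet spot above:

#nzbackup -db proddb -dir /backup1 /backup2 /backup3 /backup4

The database name and the four directories here are placeholders.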

Expectations

Expect a rate of approximately 1.0 TB per hour, assuming enterprise-class storage and a healthy fibre network; at that rate, for example, a 20 TB backup takes roughly 20 hours.

Step 1: Planning

Be sure the system actually has Fibre Channel host bus adapters (FC HBAs) before proceeding.

To determine if the system has the cards, and if they are plugged in, run:
#/opt/nz-hwsupport/pts/external_storage_tool.pl san -b info

Note: all commands are run as root, logged in directly to the node IP as root (not via the wall IP, and not via sudo from the nz user).

Output from a system without HBA installed

ERROR: No HBAs found.

Output from a system with HBA installed, but not used

Host Bus Adapter (HBA): 7
Info: IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x on PCI bus 8b device 00 irq 58 port 0
Status: Link State appears to be down. There may be a problem with the fibre channel connection.
Link State: Linkdown
WWN: 0x10000090fa2901fa
Firmware Rev: 2.01A11 (U3D2.01A11), sli-3
Driver Ver: Emulex LightPulse Fibre Channel SCSI driver 10.6.0.20
Max Luns: 255
Model: 42D0494
Current Speed: unknown
Supported Speeds: 2 Gbit, 4 Gbit, 8 Gbit
Block Devices:
None


Output from a fully configured system

Host Bus Adapter (HBA): 5
Info: IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 24 device 00 irq 209 port 0
Status: SAN storage ready for the system to use.
Link State: Link Up - Ready
   Private Loop
WWN: 0x10000000c99828da
Firmware Rev: 2.72X2 (Z3F2.72X2), sli-3
Driver Ver: Emulex LightPulse Fibre Channel SCSI driver 8.2.0.128.3p
Max Luns: 255
Model: 42C2071
Current Speed: 4 Gbit
Supported Speeds: 1 Gbit, 2 Gbit, 4 Gbit
Block Devices:
[5:0:0:0] /dev/sdb
[5:0:0:1] /dev/sdc
[5:0:0:2] /dev/sdd
[5:0:0:3] /dev/sde
[5:0:0:4] /dev/sdf

 

There are several important pieces of information in the HBA section:


1. WWN: the address of the card, against which the SAN admin will share storage (it can also be read from sysfs, as shown below).
2. Link State: indicates whether the cable is connected.
3. Block Devices: the storage partitions that have been shared with this adapter. This is not always accurate; it is best to also run cat /proc/partitions.
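
The same details can be cross-checked directly through sysfs (a minimal sketch; the host number matches the HBA number reported by the tool, host5 in the fully configured example above):

#cat /sys/class/fc_host/host5/port_name
#cat /sys/class/fc_host/host5/port_state

port_name prints the WWN, and port_state shows whether the link is up (for example, Online or Linkdown).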

Do not proceed if the system does not have the HBA installed. This is handled through nzscheduling. Consult http://www-01.ibm.com/support/docview.wss?uid=swg21977823 for more details.

 

Now that the system has the necessary HBA to support SAN, the SAN admin needs to share storage.

 

Run cat /proc/partitions to show the correct devices.

By default, the internal drive is /dev/sda[1-12], but this is subject to change when adding a SAN.

 

# cat /proc/partitions
major minor  #blocks  name

   8     0  712888320 sda
   8     1    1052226 sda1
   8     2  314576797 sda2
   8     3   37752750 sda3
   8     4          1 sda4
   8     5  290286486 sda5
   8     6   16779861 sda6
   8     7   16779861 sda7
   8     8    8385898 sda8
   8     9    8385898 sda9
   8    10    8385898 sda10
   8    11    6297448 sda11
   8    12    4192933 sda12
   8    16   52428800 sdb
   8    32    1048576 sdc
   8    48    1048576 sdd
   8    64    1048576 sde
   8    80    1048576 sdf


In the above case, the HBA can already see the client storage: sdb, sdc, sdd, sde, and sdf.

It is important to confirm this is the case on both hosts before proceeding.

It is a good idea to ask the following questions of the SAN admin at this time:


1. LUN count (distinct RAID groups)?
2. LUN size?
3. File system type? (recall the restrictions above)
4. Naming convention for volume groups and logical volumes?
5. How should we split storage across mount points?
6. Is multipathing required?

Once the storage is visible and you have a plan for how the customer wants to break up the storage, you are ready to begin.
If the storage is not visible but the link is up, you must reboot the host.
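
Before rebooting, a SCSI bus rescan is sometimes worth trying; this is a sketch only, not a guaranteed alternative (replace the host numbers with the SCSI host numbers from lsscsi):

#echo "- - -" > /sys/class/scsi_host/host5/scan
#echo "- - -" > /sys/class/scsi_host/host6/scan
#cat /proc/partitions

If the devices still do not appear, reboot the host as stated above.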

Step 2: configuration of multipathing

If the client is not using multipathing, you can skip right to the section on LVM.

Blocklisting

Multipathing will attempt to multipath all SCSI devices, even those internal to the appliance, which causes many issues. It is important to configure multipathing to IGNORE the internal RAID devices.

The lsscsi command will list the block devices attached to the host.

Example of a system with multiple SAN mounts:

[0:2:0:0]    disk    IBM      ServeRAID M5015  2.13  /dev/sda
[5:0:0:0]    disk    DGC      VRAID            0533  /dev/sdb
[5:0:0:1]    disk    DGC      RAID 5           0533  /dev/sdc
[5:0:0:2]    disk    DGC      RAID 5           0533  /dev/sdd
[5:0:0:3]    disk    DGC      RAID 5           0533  /dev/sde
[5:0:0:4]    disk    DGC      RAID 5           0533  /dev/sdf
[6:0:0:0]    disk    DGC      VRAID            0533  /dev/sdg
[6:0:0:1]    disk    DGC      RAID 5           0533  /dev/sdh
[6:0:0:2]    disk    DGC      RAID 5           0533  /dev/sdi
[6:0:0:3]    disk    DGC      RAID 5           0533  /dev/sdj
[6:0:0:4]    disk    DGC      RAID 5           0533  /dev/sdk

The device that begins with IBM ServeRAID is the internal RAID controller, and the one we need to blocklist.

You will need to edit /etc/multipath.conf to look like this:

blocklist {
     device {
        vendor                  "IBM"
        product                 "ServeRAID M5015"
        }
}
defaults {
        user_friendly_names yes
}
devices {
device {
        vendor                  "*"
        product                 "*"
        features                "0"
        hardware_handler        "0"
        path_grouping_policy    multibus
        no_path_retry           6
        rr_weight               uniform
        rr_min_io               1000
}
}


Replace the vendor and product values with the values from the customer's lsscsi output.

This configuration will:

1. Blocklist the internal RAID controller.
2. Assign user-friendly names such as mpatha, mpathb, and so on, which are easier to reference later.
3. Set up round-robin pathing across all devices.

Start the multipath daemon and enable it at boot:

service multipathd start
chkconfig multipathd on

Double-check that the internal RAID controller is blocklisted:

multipath -v3 /dev/sda

(replace sda with the appropriate sd* letter from the lsscsi output)

 

Example of incorrect blocklist

sda: prio = 1
3600605b0031160d01dd5ee962b5fd144: pgfailover = -1 (internal default)
3600605b0031160d01dd5ee962b5fd144: pgpolicy = failover (internal default)
3600605b0031160d01dd5ee962b5fd144: selector = round-robin 0 (internal default)
3600605b0031160d01dd5ee962b5fd144: features = 0 (internal default)
3600605b0031160d01dd5ee962b5fd144: hwhandler = 0 (internal default)


Example of correct blocklist

 sda: (IBM:ServeRAID M5015) blocklisted by product

Next, confirm that the devices are properly multipathed and assigned to their round-robin groups.

multipath -l is used to list the multipath devices.

In the examples below, the LUN is named mpath7 and it is 50 GB.


mpath7 (360060160eca03c00ac2772679697e511) dm-0 DGC,VRAID
[size=50G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=0][active]
 \_ 6:0:0:0 sdg 8:96  [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 5:0:0:0 sdb 8:16  [active][undef]

The value in parentheses, 360060160eca03c00ac2772679697e511, is the unique ID (WWID) of this LUN.

Compare with the same device when properly configured for round robin

mpath7 (360060160eca03c00ac2772679697e511) dm-0 DGC,VRAID
[size=50G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 5:0:0:0 sdb 8:16  [failed][undef]
 \_ 6:0:0:0 sdg 8:96  [active][undef]

In the first example, there are two round-robin path groups, one for each link.
In the second example, there is a single round-robin group containing both paths.

We want the two paths to work in unison, not independently of each other.

 

Copy the multipath.conf file to the peer node, and ensure the multipath -l output and blocklist look correct there as well, for example:
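
A sketch, assuming the peer host resolves as ha2 and root ssh between the hosts is configured:

#scp /etc/multipath.conf ha2:/etc/multipath.conf
#ssh ha2 'service multipathd restart; multipath -l'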

Step 3: Configuring LVM

LVM has three basic parts: assign the disks to be used (physical volumes), group them (volume groups), and then break the groups down into volumes (logical volumes).

1. Assign physical volumes

pvcreate /dev/mapper/mpathX
(/dev/sdX if not using multipathing)

Run pvs to confirm they show up as required:

PV                                           VG   Fmt  Attr PSize  PFree
  /dev/mpath/360060160eca03c00ac2772679697e511      lvm2 a--  50.00G 50.00G

2. Assign physical volumes to volume group

vgcreate -s 512M vgX /dev/mapper/mpathY

It is optimal to have a one-to-one relation between PVs and VGs. This ensures backups are split more evenly across the back-end disks. If speed does not matter, add all PVs to a single VG; otherwise, create as many VGs as there are PVs.

At the end of this step, you will have vg0, vg1, and so on, for example:
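
A sketch of the one-to-one layout with two multipathed LUNs (mpatha and mpathb are placeholders; use the names shown by multipath -l):

#vgcreate -s 512M vg0 /dev/mapper/mpatha
#vgcreate -s 512M vg1 /dev/mapper/mpathb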

Run vgs to confirm they show as expected:


  VG   #PV #LV #SN Attr   VSize  VFree
  vg0    1   0   0 wz--n- 49.50G 49.50G

3. Create volumes

lvcreate -L <size> -n lvol0 vg0
or
lvcreate -l 100%FREE -n lvol0 vg0

You can carve out a set amount of storage using the first form, or use the -l 100%FREE form to consume all remaining space, for example:
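
One volume per volume group, following the one-to-one layout above (names are placeholders):

#lvcreate -l 100%FREE -n lvol0 vg0
#lvcreate -l 100%FREE -n lvol1 vg1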

Run lvs to confirm they show as expected:
LV    VG   Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lvol0 vg0  -wi-a- 49.50G

4. Creating filesystems

You will issue all commands against the LV.

example:
#mkfs.ext3 /dev/mapper/vg0-lvol0

You will obtain the correct path from the lvs command above.
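
The same pattern applies to the other file systems supported on Red Hat 6, if one of those was agreed on during planning (a sketch):

#mkfs.ext4 /dev/mapper/vg0-lvol0
or
#mkfs.xfs /dev/mapper/vg0-lvol0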

5. Test mounting

Do not create a mount point nested under /nz; instead, create the backup mount points directly under /.

Example

#mkdir /backup

#mount -t ext3 /dev/mapper/vg0-lvol0 /backup

This directory should be owned by the nz user so that backups work properly, for example:
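
A sketch, assuming the standard nz user and group names on the appliance:

#chown nz:nz /backup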

As this is created under /, it must be mirrored on the other host.

To mount the file system on the passive node, you must first unmount it from the active node. Concurrent mounts and updates from both hosts will corrupt the file system.

#umount /backup

On ha2:

#pvscan
#vgscan
#lvscan
#vgchange -ay

These commands import all the LVM information from the steps you completed on ha1.

#lvs

Ensure the filesystem shows up

#mkdir /backup

#mount -t ext3 /dev/mapper/vg0-lvol0 /backup

6. Unmount /backup.

7. Make a commented-out entry in /etc/fstab referring to the SAN, for future reference, for example:
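
A sketch of such an entry (the leading # is the fstab comment character; the cluster software, not fstab, controls the mount):

# /dev/mapper/vg0-lvol0  /backup  ext3  defaults,noauto  0 0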

Step 4: Adding to the cluster

SAN storage may be added to the cluster; NFS may not. Keep this in mind.

Adding to the cluster itself does not require an outage; however, any typo or irregularity in this action will result in downtime. On production systems, it is better to be safe than sorry.

There are two options when adding to the cluster: as part of the nps resource group, or as a separate resource group.

 

Each has pros and cons.

Adding to the nps group ensures the SAN is reachable before NPS can come online; if the SAN fails, so will NPS.

Adding outside the nps group gives fault tolerance, but if the SAN is not reachable, jobs may write to the host disks instead, running them out of space or causing errors.

 

Regardless of the option chosen, always download the latest hw-support tools.

 

Both paths begin the same way:

#/opt/nz-hwsupport/pts/external_storage_tool.pl

Select 1 for cluster
Select 1 to add new mount

Enter information as it is prompted:


Please Enter a mount device (ie: /dev/mapper/san_m1):

#/dev/vg0/lvol0
Please Enter a directory to mount on (ie: /media/mymount):

#/bkp
Please Enter Mount Options, separated by commas (ie: defaults,rw):

#defaults
Please Enter the HA cluster resource name:

#san_bkp

 

If a failure of this mount occurs such that it cannot be mounted or cannot stay mounted, would you like to fail NPS to the standby host?(y/n)


This prompt determines whether the SAN is added to the nps group or to its own resource group.

Enter y or n depending on the client's choice, and press Return to finish.

 

#df -h

Confirm the SAN is mounted.

 

Example of a SAN added to the nps group

# crm_mon -1


============
Last updated: Tue Feb 16 15:26:58 2016
Current DC: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e)
2 Nodes configured.
3 Resources configured.
============

Node: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e): online
Node: nz80000-h1 (3e8fc051-bbf0-488d-8e55-b22f8639b23d): online

Resource Group: nps
    drbd_exphome_device (heartbeat:drbddisk):   Started nz80000-h2
    drbd_nz_device      (heartbeat:drbddisk):   Started nz80000-h2
    exphome_filesystem  (heartbeat::ocf:Filesystem):    Started nz80000-h2
    nz_filesystem       (heartbeat::ocf:Filesystem):    Started nz80000-h2
    fabric_ip   (heartbeat::ocf:IPaddr):        Started nz80000-h2
    wall_ip     (heartbeat::ocf:IPaddr):        Started nz80000-h2
    nz_dnsmasq  (lsb:nz_dnsmasq):       Started nz80000-h2
    nzinit      (lsb:nzinit):   Started nz80000-h2
    san_bkp     (heartbeat::ocf:Filesystem):    Started nz80000-h2
fencing_route_to_ha1    (stonith:apcmastersnmp):        Started nz80000-h2
fencing_route_to_ha2    (stonith:apcmastersnmp):        Started nz80000-h1


Example of a SAN in a distinct resource group

# crm_mon -1


============
Last updated: Tue Feb 16 15:33:04 2016
Current DC: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e)
2 Nodes configured.
4 Resources configured.
============

Node: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e): online
Node: nz80000-h1 (3e8fc051-bbf0-488d-8e55-b22f8639b23d): online

Resource Group: nps
    drbd_exphome_device (heartbeat:drbddisk):   Started nz80000-h2
    drbd_nz_device      (heartbeat:drbddisk):   Started nz80000-h2
    exphome_filesystem  (heartbeat::ocf:Filesystem):    Started nz80000-h2
    nz_filesystem       (heartbeat::ocf:Filesystem):    Started nz80000-h2
    fabric_ip   (heartbeat::ocf:IPaddr):        Started nz80000-h2
    wall_ip     (heartbeat::ocf:IPaddr):        Started nz80000-h2
    nz_dnsmasq  (lsb:nz_dnsmasq):       Started nz80000-h2
    nzinit      (lsb:nzinit):   Started nz80000-h2
fencing_route_to_ha1    (stonith:apcmastersnmp):        Started nz80000-h2
fencing_route_to_ha2    (stonith:apcmastersnmp):        Started nz80000-h1


Resource Group: non_essential_mounts
    other_node_watcher  (heartbeat::ocf:Dummy): Started nz80000-h2
    san_bkp     (heartbeat::ocf:Filesystem):    Started nz80000-h2


How to remove a SAN from Netezza PDA systems:

1. Confirm with the customer that the SAN is no longer being used.

2. Remove the SAN using the same tool as when adding:

#/opt/nz-hwsupport/pts/external_storage_tool.pl
Select 1.) High Availability Cluster
Select 2.) Remove a Mount From the Cluster (remove_mount)

Select the mount you want to remove

3. Check /etc/fstab and remove any lines for that mount.

4. Take note of the output of df, lvs, and pvs, as they will be needed in the next steps:

for example:

#df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 36569968 3062792 31619540 9% /
/dev/sda10 8123168 2063260 5640616 27% /usr
/dev/sda12 4061540 208008 3643888 6% /usr/local
/dev/sda9 8123168 731196 6972680 10% /var
/dev/sda8 8123168 2636748 5067128 35% /opt
/dev/sda7 16253924 971480 14443452 7% /tmp
/dev/sda5 281194908 38305164 228375420 15% /nzscratch
/dev/sda1 1019208 71468 895132 8% /boot
none 79990784 182528 79808256 1% /dev/shm
/dev/drbd0 16387068 3778168 11776464 25% /export/home
/dev/drbd1 309510044 162873204 130914556 56% /nz
/dev/mapper/backup_vg-backuplv
12518413000 6960306164 4922350804 59% /backups

# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
backuplv backup_vg -wi-ao 11.84T

# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb1 backup_vg lvm2 a-- 476.81G 0
/dev/sdc1 backup_vg lvm2 a-- 476.81G 0
/dev/sdf1 backup_vg lvm2 a-- 5.46T 0
/dev/sdg1 backup_vg lvm2 a-- 5.46T 640.00M

5. Double check that the SAN is unmounted:

# umount /backups


6. Remove the logical volumes:

lvremove /dev/mapper/backup_vg-backuplv

7. Remove the volume groups assigned to the logical volumes removed:

vgremove backup_vg

You can find which VG is assigned to which LV by checking the output of lvs collected in step 4:

# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
backuplv backup_vg -wi-ao 11.84T

8. Remove the physical volumes assigned to the volume groups removed:

pvremove /dev/sdb1
pvremove /dev/sdc1
pvremove /dev/sdf1
pvremove /dev/sdg1


You can find which PV is assigned to which VG by checking the output of pvs collected in step 4:

# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb1 backup_vg lvm2 a-- 476.81G 0
/dev/sdc1 backup_vg lvm2 a-- 476.81G 0
/dev/sdf1 backup_vg lvm2 a-- 5.46T 0
/dev/sdg1 backup_vg lvm2 a-- 5.46T 640.00M


It is ONLY at this point that the SAN admin can de-allocate the LUNs for the SAN.

If the LUNs are removed prior to this de-configuration, host-level corruption and database hangs may occur.

Appendix:

A. Adding an extra LUN to an existing file system, or expanding an existing LUN (only possible with LVM)

Ensure the LUN shows the correct size in /proc/partitions on both nodes.

If it does not, reboot.

Add the new LUN to LVM, or scan in the larger LUN:
#pvcreate /dev/sdc

or

#pvscan

Add the PV to the VG, or scan in the larger-sized PV:

#vgextend vg0 /dev/sdc

or

#vgscan

Confirm VG shows extra space

#vgs

Unmount filesystem

#umount /bkp

Expand the LV:

#lvextend -l +100%FREE /dev/vg0/lvol0
(replace +100%FREE with a specific size using -L <size> if required)

FSCK the logical volume:

#fsck -f /dev/vg0/lvol0

Expand the file system:

#resize2fs /dev/vg0/lvol0

Remount

# mount -t ext3 /dev/vg0/lvol0 /bkp
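
If the file system is xfs rather than ext3/ext4, resize2fs does not apply; an xfs file system is grown while mounted, for example (a sketch):

#mount -t xfs /dev/vg0/lvol0 /bkp
#xfs_growfs /bkp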

B. Troubleshooting SAN performance

1. Update to the latest HBA firmware.

2. Check /var/log/messages for errors.

3. Check dmesg output for errors.

4. Perform a dd test to measure write and read speed.


Note: write a minimum of 2x the system memory to prevent caching from throwing off the results.

Example dd commands (SAN mounted under /backup, writing ~256 GB):

Write:


#dd if=/dev/zero of=/backup/dd_test bs=64K count=3800K

Read:


#dd if=/backup/dd_test of=/dev/null bs=64K
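
To keep the Linux page cache from inflating the read numbers, the cache can be dropped between the write and read tests (a sketch):

#sync
#echo 3 > /proc/sys/vm/drop_caches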

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Storage","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2019

UID

swg21700900