Question & Answer
Question
How do I add SAN storage to PureData System for Analytics?
Answer
Limitations
Platform  | Filesystem            | Considerations
Red Hat 5 | ext3, gpfs            | Max volume size 16 TB
Red Hat 6 | ext3, ext4, xfs, gpfs | Minimum HPF 5.3.3
Backup Streams/Threads
The number of files in concurrent use during loading and unloading is a key factor to consider. When designing a backup strategy, keep in mind that multi-stream/thread backup support was introduced in NPS 6.0 and multi-stream restore was introduced in NPS 7.2.
On TwinFin systems, the sweet spot for backup performance with multiple threads is 4.
On Striper and Mako systems, the number of streams is not the limiting factor; rather, it is the total number of parallel jobs on the SPUs, which is 48. Keep this number in mind when designing a backup strategy, as exceeding it will impact both the backup and other system users.
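One way to budget against the 48-job ceiling is to divide it by the streams per backup; the helper below encodes that arithmetic, and the commented nzbackup line sketches a multi-stream invocation (one stream per -dir location; the database name and directory paths are assumptions, so verify the nzbackup syntax against your NPS release's documentation):

```shell
#!/bin/bash
# Budget concurrent multi-stream backups against the 48 parallel-job
# ceiling on the SPUs (Striper/Mako).
max_concurrent_backups() {
  # $1: streams per backup job
  echo $(( 48 / $1 ))
}

max_concurrent_backups 4   # prints 12: twelve 4-stream backups saturate the SPUs

# Illustrative multi-stream backup, one stream per -dir location
# (database name and directories are assumptions):
# nzbackup -db PROD -dir /backup1 /backup2 /backup3 /backup4
```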
Expectations
Expect roughly 1.0 TB per hour, assuming enterprise-class storage and a healthy fibre network.
Step 1: Planning
Be sure the system actually has Fibre Channel host bus adapters (FC HBAs) before proceeding.
To determine if the system has the cards, and if they are plugged in, run:
#/opt/nz-hwsupport/pts/external_storage_tool.pl san -b info
Note: all commands are run as root, logged in directly as root on the node IP (not via the wall IP, and not via sudo from the nz user).
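The link information can also be cross-checked straight from sysfs, independent of the PTS tool: on Linux, each FC HBA port appears under /sys/class/fc_host with its WWN (port_name) and port state. A minimal sketch (the sysfs-root parameter is hypothetical, present only so the function can be exercised against a fake tree):

```shell
#!/bin/bash
# Report FC link state from sysfs. port_state reads "Online" when the
# link is up and "Linkdown" when the cable/zoning is not ready.
fc_link_report() {
  # $1: sysfs root (defaults to the real /sys; override only for testing)
  local root="${1:-/sys}" host
  for host in "$root"/class/fc_host/host*; do
    [ -d "$host" ] || { echo "no FC HBAs found"; return 1; }
    printf '%s wwn=%s state=%s\n' \
      "${host##*/}" \
      "$(cat "$host/port_name" 2>/dev/null)" \
      "$(cat "$host/port_state" 2>/dev/null)"
  done
}

# Run on each node as root: fc_link_report
```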
Output from a system without HBA installed
ERROR: No HBAs found.
Output from a system with HBA installed, but not used
Host Bus Adapter (HBA): 7
Info: IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x on PCI bus 8b device 00 irq 58 port 0
Status: Link State appears to be down. There may be a problem with the fibre channel connection.
Link State: Linkdown
WWN: 0x10000090fa2901fa
Firmware Rev: 2.01A11 (U3D2.01A11), sli-3
Driver Ver: Emulex LightPulse Fibre Channel SCSI driver 10.6.0.20
Max Luns: 255
Model: 42D0494
Current Speed: unknown
Supported Speeds: 2 Gbit, 4 Gbit, 8 Gbit
Block Devices:
None
Output from a fully configured system
Host Bus Adapter (HBA): 5
Info: IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 24 device 00 irq 209 port 0
Status: SAN storage ready for the system to use.
Link State: Link Up - Ready
Private Loop
WWN: 0x10000000c99828da
Firmware Rev: 2.72X2 (Z3F2.72X2), sli-3
Driver Ver: Emulex LightPulse Fibre Channel SCSI driver 8.2.0.128.3p
Max Luns: 255
Model: 42C2071
Current Speed: 4 Gbit
Supported Speeds: 1 Gbit, 2 Gbit, 4 Gbit
Block Devices:
[5:0:0:0] /dev/sdb
[5:0:0:1] /dev/sdc
[5:0:0:2] /dev/sdd
[5:0:0:3] /dev/sde
[5:0:0:4] /dev/sdf
There are several important pieces of information in the HBA section:
1. WWN: the address of the card, against which the SAN admin will share storage.
2. Link State: whether the cable is connected.
3. Block Devices: the storage partitions that have been shared with this adapter. This list is not always accurate; it is best to confirm with cat /proc/partitions.
Do not proceed if the system does not have the HBA installed. HBA installation is handled through nzscheduling. Consult http://www-01.ibm.com/support/docview.wss?uid=swg21977823 for more details.
Now that the system has the necessary HBA to support SAN, the SAN admin needs to share storage.
Run cat /proc/partitions to show the correct devices.
By default, the internal drive is /dev/sda[1-12], but this is subject to change when adding a SAN.
# cat /proc/partitions
major minor #blocks name
8 0 712888320 sda
8 1 1052226 sda1
8 2 314576797 sda2
8 3 37752750 sda3
8 4 1 sda4
8 5 290286486 sda5
8 6 16779861 sda6
8 7 16779861 sda7
8 8 8385898 sda8
8 9 8385898 sda9
8 10 8385898 sda10
8 11 6297448 sda11
8 12 4192933 sda12
8 16 52428800 sdb
8 32 1048576 sdc
8 48 1048576 sdd
8 64 1048576 sde
8 80 1048576 sdf
In the above case, the HBA can already see the client storage (sdb, sdc, sdd, sde, sdf).
It is important to confirm this is the case on both hosts before proceeding.
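To confirm both hosts see the same LUNs, the whole-disk entries from /proc/partitions can be diffed across the pair. A sketch, assuming passwordless ssh to the peer node and using "ha2" as a placeholder hostname:

```shell
#!/bin/bash
# Extract whole-disk SAN candidates (sd devices with no partition
# number) from /proc/partitions-style input, so the two hosts can be
# compared line by line.
san_disks() {
  awk '$4 ~ /^sd[a-z]+$/ { print $4 }'
}

# Usage sketch (run on ha1; "ha2" is a placeholder peer hostname):
#   diff <(san_disks < /proc/partitions) \
#        <(ssh ha2 cat /proc/partitions | san_disks) \
#     && echo "both hosts see the same disks"
```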
It is a good idea to ask the SAN admin the following questions at this time:
1. LUN count (distinct RAID groups)?
2. LUN size?
3. Filesystem type? (Recall the restrictions in the table above.)
4. Naming convention for volume groups and logical volumes?
5. How should the storage be split across mount points?
6. Is multipathing required?
Once the storage is visible and you have a plan for how the client wants to break it up, you are ready to begin.
If storage is not visible but the link is up, you must reboot the host.
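A reboot is the safe path, but on many Linux systems a SCSI bus rescan can surface newly zoned LUNs without one, so it may be worth trying first and falling back to the reboot if the devices still do not appear. A sketch (the sysfs-root parameter is hypothetical, added only so the function can be tested against a fake tree):

```shell
#!/bin/bash
# Writing "- - -" (all channels, targets, LUNs) to each SCSI host's
# scan file asks the kernel to rescan that bus; newly zoned LUNs then
# show up in /proc/partitions.
rescan_scsi_hosts() {
  # $1: sysfs root (defaults to /sys; override only for testing)
  local root="${1:-/sys}" scan
  for scan in "$root"/class/scsi_host/host*/scan; do
    [ -e "$scan" ] || continue
    echo "- - -" > "$scan" && echo "rescanned ${scan%/scan}"
  done
}

# Run as root on the node: rescan_scsi_hosts
# Then re-check: cat /proc/partitions
```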
Step 2: configuration of multipathing
If the client is not using multipathing, skip ahead to the section on LVM.
Blocklisting
Multipathing will attempt to multipath all SCSI devices, even those internal to the appliance, which causes many issues. It is important to configure multipathing to IGNORE the internal RAID devices.
The lsscsi command will list the block devices attached to the host.
Example of a system with multiple SAN mounts:
[0:2:0:0] disk IBM ServeRAID M5015 2.13 /dev/sda
[5:0:0:0] disk DGC VRAID 0533 /dev/sdb
[5:0:0:1] disk DGC RAID 5 0533 /dev/sdc
[5:0:0:2] disk DGC RAID 5 0533 /dev/sdd
[5:0:0:3] disk DGC RAID 5 0533 /dev/sde
[5:0:0:4] disk DGC RAID 5 0533 /dev/sdf
[6:0:0:0] disk DGC VRAID 0533 /dev/sdg
[6:0:0:1] disk DGC RAID 5 0533 /dev/sdh
[6:0:0:2] disk DGC RAID 5 0533 /dev/sdi
[6:0:0:3] disk DGC RAID 5 0533 /dev/sdj
[6:0:0:4] disk DGC RAID 5 0533 /dev/sdk
The device which begins with IBM ServeRAID is the internal RAID controller, and it is the device we need to blocklist.
You will need to edit /etc/multipath.conf to look like this (note that the multipath tools shipped with these releases spell the directive "blacklist"):
blacklist {
    device {
        vendor "IBM"
        product "ServeRAID M5015"
    }
}
defaults {
    user_friendly_names yes
}
devices {
    device {
        vendor "*"
        product "*"
        features "0"
        hardware_handler "0"
        path_grouping_policy multibus
        no_path_retry 6
        rr_weight uniform
        rr_min_io 1000
    }
}
You will need to replace the vendor and product values with the customer's output from the lsscsi section.
This configuration will:
1. Blocklist the internal RAID controller.
2. Assign user-friendly names such as mpatha, mpathb, etc., which are easier to reference later.
3. Set up round-robin pathing to all devices.
Start the multipath service and enable it at boot:
service multipathd start
chkconfig multipathd on
Double-check that the internal RAID controller is blocklisted:
multipath -v3 /dev/sda
(replacing sda with the sd* letter from the lsscsi output)
Example of incorrect blocklist
sda: prio = 1
3600605b0031160d01dd5ee962b5fd144: pgfailover = -1 (internal default)
3600605b0031160d01dd5ee962b5fd144: pgpolicy = failover (internal default)
3600605b0031160d01dd5ee962b5fd144: selector = round-robin 0 (internal default)
3600605b0031160d01dd5ee962b5fd144: features = 0 (internal default)
3600605b0031160d01dd5ee962b5fd144: hwhandler = 0 (internal default)
Example of correct blocklist
sda: (IBM:ServeRAID M5015) blacklisted by product
Next, confirm the devices are properly multipathed and assigned to their round-robin groups.
multipath -l lists the multipath devices.
In the following output, the LUN is named mpath7 and is 50 GB:
mpath7 (360060160eca03c00ac2772679697e511) dm-0 DGC,VRAID
[size=50G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=0][active]
\_ 6:0:0:0 sdg 8:96 [active][undef]
\_ round-robin 0 [prio=0][enabled]
\_ 5:0:0:0 sdb 8:16 [active][undef]
360060160eca03c00ac2772679697e511 is the WWID (unique identifier) of this LUN.
Compare with the same device when properly configured for round-robin:
mpath7 (360060160eca03c00ac2772679697e511) dm-0 DGC,VRAID
[size=50G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
\_ 5:0:0:0 sdb 8:16 [failed][undef]
\_ 6:0:0:0 sdg 8:96 [active][undef]
In the first example, there are two round-robin queues, one for each link. In the second example, there is a single round-robin queue.
We want the two paths to work in unison, not independently of each other.
Copy the multipath.conf file to the peer node and confirm that the multipath -l output and blocklisting look correct there as well.
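Before moving on, the multipath -l output can be sanity-checked mechanically: with the multibus policy, each LUN should show exactly one round-robin path group containing both paths. A sketch keyed to the output format shown above:

```shell
#!/bin/bash
# Feed one LUN's `multipath -l` stanza on stdin; with the multibus
# policy there should be exactly one round-robin path group holding
# both paths (one per HBA link).
check_multibus() {
  local out groups paths
  out=$(cat)
  groups=$(printf '%s\n' "$out" | grep -c 'round-robin')
  paths=$(printf '%s\n' "$out" | grep -Ec '[0-9]+:[0-9]+:[0-9]+:[0-9]+ +sd')
  echo "path groups=$groups paths=$paths"
  [ "$groups" -eq 1 ] && [ "$paths" -eq 2 ]
}

# Usage: multipath -l mpath7 | check_multibus && echo OK || echo "fix multipath.conf"
```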
Step 3 : Configuring LVM
LVM has three basic parts: assign disks to be used (physical volumes), group them (volume groups), and then carve the groups into volumes (logical volumes).
1. Assign physical volumes
pvcreate /dev/mapper/mpathX
(use /dev/sdX if not using multipathing)
Run 'pvs' to confirm they show up as expected:
PV VG Fmt Attr PSize PFree
/dev/mpath/360060160eca03c00ac2772679697e511 lvm2 a-- 50.00G 50.00G
2. Assign physical volumes to volume group
vgcreate -s 512M vgX /dev/mapper/mpathY
It is optimal to have a 1-to-1 relationship between PVs and VGs; this ensures backups are more evenly split across the back-end disks. If speed doesn't matter, add all PVs to a single VG; otherwise, create as many VGs as there are PVs.
At the end of this step, you will have vg0, vg1, etc.
Run 'vgs' to confirm they show as expected:
VG #PV #LV #SN Attr VSize VFree
vg0 1 0 0 wz--n- 49.50G 49.50G
3. Create volumes
lvcreate -L <size> -n lvol0 vg0
or
lvcreate -l 100%FREE -n lvol0 vg0
Use the first form to carve out a set amount of storage, or the 100%FREE form to use all remaining space.
Run 'lvs' to confirm they show as expected:
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lvol0 vg0 -wi-a- 49.50G
4. Creating filesystems
You will issue all commands against the LV.
Example:
#mkfs.ext3 /dev/mapper/vg0-lvol0
Obtain the correct path from the lvs output above.
5. Test mounting
Do not create a mount point nested under /nz; instead, create the backup mount points under /.
Example
#mkdir /backup
#mount -t ext3 /dev/mapper/vg0-lvol0 /backup
This folder should be owned by nz so that backups work properly.
Because it is created under /, it must be mirrored on the other host.
To mount on the passive node, you must first unmount from the active node; concurrent mounts and updates will corrupt the filesystem.
#umount /backup
On ha2:
#pvscan
#vgscan
#lvscan
#vgchange -ay
These commands import all the LVM information from the steps you completed on ha1.
#lvs
Ensure the filesystem shows up
#mkdir /backup
#mount -t ext3 /dev/mapper/vg0-lvol0 /backup
6. Unmount /backup.
7. Make an entry in /etc/fstab (commented out) referring to the SAN, for future reference.
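The whole of Step 3 for a single LUN, collected into one sketch. Device names, /backup, and ext3 follow the examples above; the LVM commands are shown commented out because they must run as root against real devices, while the fstab_comment helper (a hypothetical convenience) generates the commented-out line for step 7:

```shell
#!/bin/bash
# End-to-end sketch of steps 1-7 for one LUN (names from the examples
# above; run the real commands by hand as root on the active node).
#
#   pvcreate /dev/mapper/mpatha               # 1. physical volume
#   vgcreate -s 512M vg0 /dev/mapper/mpatha   # 2. volume group, 512M extents
#   lvcreate -l 100%FREE -n lvol0 vg0         # 3. logical volume, all space
#   mkfs.ext3 /dev/mapper/vg0-lvol0           # 4. filesystem
#   mkdir /backup                             # 5. mount point under /
#   mount -t ext3 /dev/mapper/vg0-lvol0 /backup
#   chown nz:nz /backup                       # backups run as the nz user
#   umount /backup                            # 6. leave unmounted for the cluster

# 7. keep a commented-out fstab line for reference:
fstab_comment() {
  # $1: device, $2: mount point, $3: filesystem type
  printf '#%s  %s  %s  defaults  0 0\n' "$1" "$2" "$3"
}
fstab_comment /dev/mapper/vg0-lvol0 /backup ext3   # append to /etc/fstab by hand
```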
Step 4 : Adding to the cluster
A SAN may be added to the cluster; NFS may not. Keep this in mind.
Adding to the cluster itself does not require an outage; however, any typo or irregularity in this action will result in downtime. On production systems, better safe than sorry.
There are two options when adding to the cluster: as part of the NPS resource group, or as a separate resource group. Each has pros and cons.
Adding to the NPS group ensures the SAN is reachable before NPS can come online; if the SAN fails, so will NPS.
Adding outside the NPS group gives fault tolerance, but if the SAN is not reachable, the host may write to the host disks instead, running out of space or causing errors in jobs.
Regardless of the option chosen, always download the latest hw-support tools first.
Both paths begin the same way:
#/opt/nz-hwsupport/pts/external_storage_tool.pl
Select 1 for cluster.
Select 1 to add a new mount.
Enter information as prompted:
Please Enter a mount device (ie: /dev/mapper/san_m1):
#/dev/vg0/lvol0
Please Enter a directory to mount on (ie: /media/mymount):
#/bkp
Please Enter Mount Options, separated by commas (ie: defaults,rw):
#defaults
Please Enter the HA cluster resource name:
#san_bkp
If a failure of this mount occurs such that it cannot be mounted or cannot stay mounted, would you like to fail NPS to the standby host?(y/n)
This prompt determines whether the SAN is added to the NPS group or to its own group.
Enter y or n depending on the client's choice, and press Return to finish.
#df -h
Confirm the SAN is mounted.
Example of a SAN in the NPS group:
# crm_mon -1
============
Last updated: Tue Feb 16 15:26:58 2016
Current DC: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e)
2 Nodes configured.
3 Resources configured.
============
Node: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e): online
Node: nz80000-h1 (3e8fc051-bbf0-488d-8e55-b22f8639b23d): online
Resource Group: nps
drbd_exphome_device (heartbeat:drbddisk): Started nz80000-h2
drbd_nz_device (heartbeat:drbddisk): Started nz80000-h2
exphome_filesystem (heartbeat::ocf:Filesystem): Started nz80000-h2
nz_filesystem (heartbeat::ocf:Filesystem): Started nz80000-h2
fabric_ip (heartbeat::ocf:IPaddr): Started nz80000-h2
wall_ip (heartbeat::ocf:IPaddr): Started nz80000-h2
nz_dnsmasq (lsb:nz_dnsmasq): Started nz80000-h2
nzinit (lsb:nzinit): Started nz80000-h2
san_bkp (heartbeat::ocf:Filesystem): Started nz80000-h2
fencing_route_to_ha1 (stonith:apcmastersnmp): Started nz80000-h2
fencing_route_to_ha2 (stonith:apcmastersnmp): Started nz80000-h1
Example of a SAN in a distinct group:
# crm_mon -1
============
Last updated: Tue Feb 16 15:33:04 2016
Current DC: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e)
2 Nodes configured.
4 Resources configured.
============
Node: nz80000-h2 (4adecb0c-2d04-4c1b-8ca6-6603c399a70e): online
Node: nz80000-h1 (3e8fc051-bbf0-488d-8e55-b22f8639b23d): online
Resource Group: nps
drbd_exphome_device (heartbeat:drbddisk): Started nz80000-h2
drbd_nz_device (heartbeat:drbddisk): Started nz80000-h2
exphome_filesystem (heartbeat::ocf:Filesystem): Started nz80000-h2
nz_filesystem (heartbeat::ocf:Filesystem): Started nz80000-h2
fabric_ip (heartbeat::ocf:IPaddr): Started nz80000-h2
wall_ip (heartbeat::ocf:IPaddr): Started nz80000-h2
nz_dnsmasq (lsb:nz_dnsmasq): Started nz80000-h2
nzinit (lsb:nzinit): Started nz80000-h2
fencing_route_to_ha1 (stonith:apcmastersnmp): Started nz80000-h2
fencing_route_to_ha2 (stonith:apcmastersnmp): Started nz80000-h1
Resource Group: non_essential_mounts
other_node_watcher (heartbeat::ocf:Dummy): Started nz80000-h2
san_bkp (heartbeat::ocf:Filesystem): Started nz80000-h2
How to remove a SAN from Netezza PDA systems:
1. Make sure with the customer that the SAN is not being used.
2. Remove the SAN using the same tool as when adding:
#/opt/nz-hwsupport/pts/external_storage_tool.pl
Select 1.) High Availability Cluster
Select 2.) Remove a Mount From the Cluster (remove_mount)
Select the mount you want to remove
3. Check /etc/fstab and remove any lines for that mount.
4. Take note of the output of df, lvs, and pvs, as they will be needed in the next steps.
For example:
#df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 36569968 3062792 31619540 9% /
/dev/sda10 8123168 2063260 5640616 27% /usr
/dev/sda12 4061540 208008 3643888 6% /usr/local
/dev/sda9 8123168 731196 6972680 10% /var
/dev/sda8 8123168 2636748 5067128 35% /opt
/dev/sda7 16253924 971480 14443452 7% /tmp
/dev/sda5 281194908 38305164 228375420 15% /nzscratch
/dev/sda1 1019208 71468 895132 8% /boot
none 79990784 182528 79808256 1% /dev/shm
/dev/drbd0 16387068 3778168 11776464 25% /export/home
/dev/drbd1 309510044 162873204 130914556 56% /nz
/dev/mapper/backup_vg-backuplv
12518413000 6960306164 4922350804 59% /backups
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
backuplv backup_vg -wi-ao 11.84T
# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb1 backup_vg lvm2 a-- 476.81G 0
/dev/sdc1 backup_vg lvm2 a-- 476.81G 0
/dev/sdf1 backup_vg lvm2 a-- 5.46T 0
/dev/sdg1 backup_vg lvm2 a-- 5.46T 640.00M
5. Double-check that the SAN is unmounted:
# umount /backups
6. Remove the logical volumes:
lvremove /dev/mapper/backup_vg-backuplv
7. Remove the volume groups that contained the logical volumes removed:
vgremove backup_vg
You can find which VG was assigned to which LV by checking the lvs output collected in step 4:
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
backuplv backup_vg -wi-ao 11.84T
8. Remove the physical volumes that backed the volume groups removed:
pvremove /dev/sdb1
pvremove /dev/sdc1
pvremove /dev/sdf1
pvremove /dev/sdg1
You can find which PV was assigned to which VG by checking the pvs output collected in step 4:
# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb1 backup_vg lvm2 a-- 476.81G 0
/dev/sdc1 backup_vg lvm2 a-- 476.81G 0
/dev/sdf1 backup_vg lvm2 a-- 5.46T 0
/dev/sdg1 backup_vg lvm2 a-- 5.46T 640.00M
It is ONLY at this point that the SAN admin can de-allocate the LUNs on the SAN.
If the LUNs are removed prior to this de-configuration, host-level corruption and database hangs may occur.
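Because the teardown order matters (LVs before the VG, the VG before its PVs, and LUN de-allocation last), a dry-run helper that only prints the commands makes the sequence reviewable before anything destructive runs. The names below come from the example lvs/pvs output in step 4; run the printed commands by hand as root once reviewed:

```shell
#!/bin/bash
# Emit the removal commands in the only safe order: logical volume,
# then the volume group, then each physical volume. Printing instead
# of executing lets the sequence be reviewed first.
emit_san_teardown() {
  # $1: VG name, $2: LV name, remaining args: PV devices (from pvs)
  local vg="$1" lv="$2" pv
  shift 2
  echo "lvremove -f /dev/$vg/$lv"
  echo "vgremove $vg"
  for pv in "$@"; do
    echo "pvremove $pv"
  done
}

emit_san_teardown backup_vg backuplv /dev/sdb1 /dev/sdc1 /dev/sdf1 /dev/sdg1
```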
Appendix:
A. Adding an extra LUN to an existing filesystem, or expanding an existing LUN (only possible with LVM)
Ensure the LUN shows the correct size in /proc/partitions on both nodes. If not, reboot.
Add the new LUN to LVM, or scan in the larger LUN:
#pvcreate /dev/sdc
or
#pvscan
Add the PV to the VG, or scan in the larger-sized PV:
#vgextend vg0 /dev/sdc
or
#vgscan
Confirm the VG shows the extra space:
#vgs
Unmount the filesystem:
#umount /bkp
Expand the LV:
#lvextend -l +100%FREE /dev/vg0/lvol0 (replace with a specific size if required)
fsck the volume:
#fsck -f /dev/vg0/lvol0
Expand the filesystem:
#resize2fs /dev/vg0/lvol0
Remount
# mount -t ext3 /dev/vg0/lvol0 /bkp
B. Troubleshooting SAN performance
1. Update to latest HBA firmware
2. Check /var/log/messages for errors
3. Check dmesg command for errors
4. Perform a dd test to measure write speed.
Note: write a minimum of 2x system memory to prevent caching from throwing off the results.
Example dd commands (SAN mounted under /backup, writing ~256 GB):
Write:
dd if=/dev/zero of=/backup/dd_test bs=64K count=3800K
Read:
dd if=/backup/dd_test of=/dev/null bs=64K
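Both the sizing rule (at least 2x system memory) and the resulting throughput come down to simple arithmetic; the helpers below encode it, using the 64K block size from the dd example above (the function names are illustrative):

```shell
#!/bin/bash
# dd_count: number of 64K blocks needed to write a given multiple of
# system RAM (rule of thumb above: at least 2x RAM to defeat caching).
dd_count() {
  # $1: memory in KB (e.g. MemTotal from /proc/meminfo), $2: multiplier
  echo $(( $1 * $2 / 64 ))
}

# mb_per_sec: throughput from bytes written and elapsed seconds, as
# reported at the end of a dd run.
mb_per_sec() {
  # $1: bytes written, $2: elapsed seconds
  echo $(( $1 / $2 / 1048576 ))
}

# 128 GB of RAM, 2x safety factor:
dd_count $((128 * 1024 * 1024)) 2     # prints 4194304 (use as bs=64K count=...)
# 268435456000 bytes written in 256 s:
mb_per_sec 268435456000 256           # prints 1000 (MB/s)
```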
Document Information
Modified date:
17 October 2019
UID
swg21700900