IBM Support What's New?

Extract substrings from long string based on connecting character

Technote (FAQ)


I have a string variable named EmployeeID in an SPSS Statistics data file. This variable has several blocks of information that are separated by hyphens. I wish to extract each of these blocks of information between the hyphens and store them in new string variables. The length of a given block may vary across cases but there will always be 5 blocks that are separated by 4 hyphens. The new variable names and the string length required for each are as follows:
Division 4
Region 4
Location 4
Team 8
Employee A
Can you provide a set of commands that would extract the blocks of information to new string variables?.


The syntax commands below assume that there is an active data file with the string variable EmployeeID, which contains the 5 blocks of information as described in the Question above. For this example, we'll assume that EmplyeeID has a width of 23 characters. The commands extract the substring between the beginning of EmployeeID and the first hyphen and saves that substring as Division. The substring and hyphen are removed from the input string and remaining substrings are extracted in a similar manner in turn.

Employrem is a string variable that contains the remainder of EmployeeID as each section is extracted from the EmployeeID content. Employrem becomes the input after Division is extracted so that EmployeeID is left intact and we can always look for the first hyphen in Employrem to extract the next variable.

New string variables must be declared with their widths before referencing them in other commands. The STRING command in the following syntax allows 4 characters each for Division, Region, Location, and Employee and 8 characters for Team as requested in the Question. Allowing more characters than needed should provide no problems but defining the variable with too few characters would prevent the relevant substring from being saved in the new variable. Employrem is declared to have a width of 23 characters, so that it could contain all of the content of EmployeeID.

Dash1 is a numeric variable that identifies the location of the first hyphen in EmployeeID (and later dash variables identify the first hyphen in the ever-shrinking employrem).
CHAR.INDEX(EmployeeID,'-') finds the first '-' in the string in EmployeeID.
CHAR.SUBSTR(EmployeeID,1,dash1-1) extracts a substring from EmployeeID, starting in position 1 for a length of (dash1 - 1) characters.
CHAR.SUBSTR(EmployeeID,dash1+1) extracts a substring from EmployeeID, starting in position dash1+1 (dropping the hyphen) to the end of EmployeeID (because the third argument, length of substring, was omitted).

* Commands to extract 5 substrings from EmployeeID .

string employrem (a23) division region location (A4) team (A8) employee (A4).

compute dash1=CHAR.INDEX(EmployeeID,'-').
compute division=CHAR.SUBSTR(EmployeeID,1,dash1-1).
compute employrem=CHAR.SUBSTR(EmployeeID,dash1+1).

compute dash2=CHAR.INDEX(employrem,'-').
compute region=CHAR.SUBSTR(employrem,1,dash2-1).
compute employrem=CHAR.SUBSTR(employrem,dash2+1).

compute dash3=CHAR.INDEX(employrem,'-').
compute location=CHAR.SUBSTR(employrem,1,dash3-1).
compute employrem=CHAR.SUBSTR(employrem,dash3+1).

compute dash4=CHAR.INDEX(employrem,'-').
compute team=CHAR.SUBSTR(employrem,1,dash4-1).
compute employrem=CHAR.SUBSTR(employrem,dash4+1).

compute employee=CHAR.SUBSTR(employrem,1,4).

Document information

More support for: SPSS Statistics

Software version: Not Applicable

Operating system(s): Platform Independent

Reference #: 1653807

Modified date: 2013-11-06