Restructuring Boster-format pile sort data for scaling and cluster analysis
I have data from a pile sort study, i.e. subjects placed objects (or cards depicting objects or concepts) into a set of piles based on the subjects' judgements of the similarity of the objects. Subjects are given the same number of objects but they are not restricted in the number of piles that they form, so the number of piles varies across subjects. I wish to perform multidimensional scaling (MDS) and cluster analysis on these objects, based on their perceived similarity as reflected in the pile sorting data. I need to restructure the pile sort data into a K-by-K similarity matrix, where K is the number of objects, to perform this analysis.
Although the data collection process (pile sorting) and analysis goals (MDS and cluster analysis) are identical to those described in Technote 1488410, my data are provided in a different structure than the data described there. In Technote 1488410, each subject represents a row, or case, in the data file. There are 21 variables that represent the 21 objects in that example, and the content of each variable is the number of the pile into which the subject placed that object. My data are presented in the Boster format, where there is a case for each pile created by each subject, with each case holding the numbers of the objects that were placed in the corresponding pile. If Subject 100 had formed 5 piles, there would be 5 cases representing Subject 100. If that Subject had placed Objects 4, 11, 16, and 19 into a pile together, then there would be a record with Subject 100's ID and 4 variables (perhaps OBJ1 to OBJ4) that held the values 4, 11, 16, and 19. Whereas the data structure described in Technote 1488410 identifies objects that were sorted together by the pile number assigned to object-specific variables, the Boster format identifies objects that were sorted together by their appearance together in a data row.
How can I restructure Boster-format pile sort data into a similarity matrix that can be analyzed by MDS or cluster procedures? The similarity matrix should be a co-occurrence matrix, i.e. the value in elements (4,7) and (7,4) of the matrix would be the number of subjects that placed objects 4 and 7 into the same pile.
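To make the target matrix concrete, here is a minimal Python sketch that counts co-occurrences directly from Boster-format rows. It is not part of the SPSS solution; it is only an independent cross-check of what the finished matrix should contain. The rows are the two sample subjects from the syntax example in this note, and the function name `cooccurrence` is made up for illustration.

```python
from itertools import combinations

# Boster-format rows from the sample data in this note:
# (subject, objects that the subject placed together in one pile).
rows = [
    (1, [4, 10, 12, 13, 14, 20]),
    (1, [2, 7]),
    (1, [3, 6, 17, 18, 19, 21]),
    (1, [9, 15, 16]),
    (1, [1, 5, 8, 11]),
    (2, [1, 5, 12, 15, 16, 20]),
    (2, [13, 19]),
    (2, [2, 6, 7, 17, 21]),
    (2, [3, 4, 8, 9, 10, 11, 14, 18]),
]


def cooccurrence(rows, n_objects):
    """Count, for each pair of objects, the subjects who piled them together."""
    pairs_by_subject = {}
    for subject, pile in rows:
        seen = pairs_by_subject.setdefault(subject, set())
        members = sorted(set(pile))
        for obj in members:
            seen.add((obj, obj))          # an object always co-occurs with itself
        for a, b in combinations(members, 2):
            seen.add((a, b))              # each pair counted once per subject

    matrix = [[0] * n_objects for _ in range(n_objects)]
    for seen in pairs_by_subject.values():
        for a, b in seen:
            matrix[a - 1][b - 1] += 1
            if a != b:
                matrix[b - 1][a - 1] += 1  # keep the matrix symmetric
    return matrix


sim = cooccurrence(rows, 21)
print(sim[3][9], sim[9][3])  # objects 4 and 10: piled together by both subjects
print(sim[3][6])             # objects 4 and 7: never piled together
```

Unlike the SPSS syntax in this note, this sketch counts a pair whenever the two objects share any pile, so the two approaches are only guaranteed to agree when no subject places an object in more than one pile.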
The easiest way to turn Boster-format pile sort data into a similarity matrix is to first restructure it into the one-case-per-subject format that was the starting point for Technote 1488410. After that initial restructuring, the steps from that technote are used to build the similarity matrix.
An example of this extended command set is provided below.
In the following example, the data is read in-line with a DATA LIST command. The LIST format (i.e., DATA LIST LIST) tells SPSS that each new line of data begins a new record (a new case, but not necessarily a new Subject). You'll see warnings in the output where your data is read as input text. These warnings reflect the fact that many of the data rows have fewer data values than the number of variables in the variable list. This situation is expected for the Boster data format and is not an indication of an error.
The format that was used here to construct the proximity matrix does not allow for the placement of an item into more than one pile by a single subject. It uses the first observed pile for the item as the item's pile number for that subject. If allowing multiple piles for one item per subject is important, we may be able to find a work-around.
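The first-observed-pile rule can be illustrated with a short Python sketch. This is a hedged illustration of the logic, not a translation of the SPSS syntax: the data are hypothetical (four objects, two subjects, with Subject 1 deliberately placing object 3 in two piles), and the function names `to_item_pile` and `similarity` are invented for this example.

```python
# Hypothetical Boster-format rows in file order: (subject, objects in one pile).
# Subject 1 places object 3 in two piles to show the first-pile rule.
rows = [
    (1, [1, 3]),
    (1, [2, 3, 4]),
    (2, [1, 2]),
    (2, [3, 4]),
]


def to_item_pile(rows, n_objects):
    """One record per subject; position k-1 holds the pile number of object k."""
    piles_seen = {}
    table = {}
    for subject, pile in rows:
        # Piles are numbered per subject in the order they appear in the file.
        piles_seen[subject] = piles_seen.get(subject, 0) + 1
        pile_no = piles_seen[subject]
        record = table.setdefault(subject, [None] * n_objects)
        for obj in pile:
            if record[obj - 1] is None:   # keep only the first observed pile
                record[obj - 1] = pile_no
    return table


def similarity(table, n_objects):
    """Count subjects whose recorded pile numbers match for each object pair."""
    sim = [[0] * n_objects for _ in range(n_objects)]
    for record in table.values():
        for j in range(n_objects):
            for k in range(n_objects):
                if record[j] is not None and record[j] == record[k]:
                    sim[j][k] += 1
    return sim


table = to_item_pile(rows, 4)
sim = similarity(table, 4)
print(table[1])   # [1, 2, 1, 2]: object 3 keeps pile 1, its first pile
print(sim[1][2])  # 0: objects 2 and 3 are not counted together for Subject 1
```

In this sketch Subject 1's second placement of object 3 (with objects 2 and 4) is ignored, so that pairing never reaches the similarity matrix. That is the practical consequence of the limitation described above.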
In the various GET FILE and SAVE commands in the syntax, the file path is not specified, which keeps the commands generic for readers of the technote. You could use full paths within quotes, as in 'C:\Pilesort\hits.sav'.
* Data read from text file in Boster format.
* The variables one to eight allow up to 8 objects per pile per subject but more could be allowed.
* The data for only two subjects is shown here to conserve space.
* Subject 1 has 5 piles and Subject 2 has 4 piles.
* Data for Subject 3's first pile would follow the 4th line of data for Subject 2 .
* Note that this data is randomly generated.
set undefined = nowarn.
data list list / Subject one two three four five six seven eight .
begin data
1 4 10 12 13 14 20
1 2 7
1 3 6 17 18 19 21
1 9 15 16
1 1 5 8 11
2 1 5 12 15 16 20
2 13 19
2 2 6 7 17 21
2 3 4 8 9 10 11 14 18
end data.
compute Sub_seq = $casenum.
rank Sub_seq by Subject / rank into Pile.
formats Subject one to eight Pile (F4).
save outfile = Boster_Format.sav
 /keep = Subject Pile One to Eight .
* From Boster format to 1-case per Subject format with variable for each item (1-21) and values being the piles into which
* the subject sorted each item.
get file = Boster_Format.sav .
vector pile_item = one to eight.
vector item (21).
loop #j = 1 to 8.
if (not(missing(pile_item(#j)))) item(pile_item(#j)) = Pile.
end loop.
aggregate outfile = Item_Pile_Format.sav
/break = Subject
/item1 to item21 = first(item1 to item21).
* This method does not allow for entry of an item into more than one pile.
* When a subject places an object in more than one pile, use of the FIRST() function in the above AGGREGATE
* command results in the smaller pile number (the first observed pile) being used for that item.
* incorporate Technote 1488410 methods.
GET FILE= Item_Pile_Format.sav.
* First, create a file in which each case is a 21x21 matrix.
* If the respondent had sorted items j and k in a pile together,
* then hit(j) in the kth row and hit(k) in the jth row would
* equal 1;
* if j = k, hit(j,j) = 1;
* otherwise, hit(j,k)= hit(k,j) = 0.
vector it = item1 to item21.
vector hit (21, F4).
loop #j = 1 to 21.
+ compute item = #j.
+ loop #k = 1 to 21.
+ compute hit(#k) = (it(#j) = it(#k)).
+ end loop.
+ xsave outfile = hits.sav /keep = Subject item hit1 to hit21 .
end loop.
execute.
* Build a file in which there is a single 21x21 matrix,
* aggregated across all cases. The values are the number
* of times the row and column stimuli appeared together
* in a pile. Add ROWTYPE_ and VARNAME_ columns that
* ALSCAL, CLUSTER, and PROXIMITIES can read.
get file = hits.sav .
* see the end of this note for alternate code that employs
* less generic variable names for the final matrix .
aggregate outfile = *
/break = item
/item1 to item21 = sum(hit1 to hit21).
* Variables ROWTYPE_ and VARNAME_ are set up for the cluster analysis.
string ROWTYPE_ VARNAME_ (A8).
COMPUTE ROWTYPE_ = 'PROX' .
compute VARNAME_ = concat('ITEM',ltrim(string(item,f2))) .
* Be sure that PROX and the variable name stem (ITEM in this
* example) are in capital letters.
VALUE LABELS rowtype_ 'PROX' 'SIMILARITY'.
save outfile = qsmat.sav
/ keep = ROWTYPE_ VARNAME_ item1 to item21.
get file = qsmat.sav .
* Analyze the data with ALSCAL (which performs Multidimensional Scaling, or MDS)
* and CLUSTER (hierarchical clustering), which are both in the Statistics Base module.
* LEVEL = ORDINAL (SIMILAR) in the ALSCAL command below
* helps ALSCAL recognize the data as a similarity matrix,
* rather than the default distance matrix input .
ALSCAL VARIABLES= item1 to item21
 /LEVEL=ORDINAL (SIMILAR)
/CRITERIA=CONVERGE(.001) STRESSMIN(.005) ITER(30)
/PRINT=DATA HEADER .
* Cluster needs the /MATRIX IN subcommand to recognize data as a proximities matrix.
* The "SIMILARITY" value label for ROWTYPE_= "PROX" tells CLUSTER that the
* proximities are similarities .
CLUSTER item1 to item21
 /MATRIX IN (*)
/PRINT SCHEDULE CLUSTER(3,7)
/PLOTS DENDROGRAM .
* analyze the data with PROXSCAL, an MDS procedure in the Categories module.
PROXSCAL VARIABLES=item1 to item21
/CRITERIA=DIMENSIONS(2,2) MAXITER(100) DIFFSTRESS(.0001) MINSTRESS(.0001)
/PRINT=COMMON DISTANCES STRESS DECOMPOSITION