Find most frequent combinations of multiple responses (e.g., courses taken)
I have a data set of student enrollment in college course. Each record has a student ID variable (ID) and the name of a course in which the student enrolled this quarter (coursename, a 9-character string variable). Each student may have taken up to 5 courses, so there may be up to 5 records for each student in the data set. I wish to determine the most frequently occurring course combinations that were taken by students. There are several ways to report the frequency of each course name in the data set, but I do not see a way to report the frequency of combinations of courses taken by the same students.
Can this be accomplished in SPSS?
Here are the steps for one approach to the problem of tabulating the most frequent combinations. A long string variable will be created with the names of all of the courses taken by the respective student,
separated by semicolons within the string variable. A sample set of syntax commands is provided at the end of this technical note.
1. Sort the data file by Student ID and Coursename within ID.
From the Data menu, choose Sort. Paste Id into the box, followed by coursename.
2. Create a variable, here called COURSESEQ, which identifies the sequence of a course record for each student.
From the Transpose menu, choose Compute Variable.
In the Compute dialog, type "courseseq" (without the quotation marks) into the Target field and type "1" (without quotes, or click the "1" button in the calculator keypad in the dialog.
Reopen the Transform->Compute dialog. COURSESEQ should still be in the Target Variable" box.
Change the Numeric Expression to "lag(courseseq) + 1" (without quotes) and click the If button at the bottom of the dialog.
Click the "include if case satisfies condition" radio button.
Type "(id = lag(id))" (without quotes) into the box below that option and click Continue.
Click OK in the main Compute dialog. You will be asked if you want to "Change existing variable?". Click OK.
Open the Variable View window of the Data Editor by clicking the Variable View button in the lower left
corner of the Variable View. COURSESEQ will be the last variable in the Name column. In the Decimals column and COURSESEQ row, make sure that the number of decimals equals 0. You can click on the right side of this cell to open a scroll bar to change the number of decimals. (By default, SPSS Statistics creates new numeric variables with 2 decimals displayed.) Setting the decimals for courseseq to 0 will simplify the variable names that are created in the following step.
3. Restructure the data from the "long" format (multiple records per student with one course name per record) to the "wide" format (one record per student with multiple course name variables in each record. From the Data menu, choose Restructure. In the first window of the Data Restructure Wizard,
Click the radio button beside "Restructure selected cases into variables" and click Next.
From the "Variables in the current file" box, move ID into the "Identifier Variable(s)" box and COURSESEQ into the "Index Variable(s)" box by highlighting each variable and clicking on the corresponding arrow. Click Next.
In the "Sorting Data" window, you can click the radio button for "No, the data are already sorted" as you had sorted the data in Step 1 above. Click Next.
In the "Options" window, click the radio button for "Group by original variable" and click Next.
In the Finish window, you have the choice of "Restructure the data now" or "Paste the syntax...". If you want to do everything by the menu system, or GUI (Graphic User Interface), then choose the Restructure now option.
The restructured data set will now be in the Data Editor with the name of the original data file. To avoid over-writing the original file, it would be wise to rename this restructured data to a new file name from the "File->Save as" menu. The restructured data will have one record per ID. The COURSENAME variable from the original will now form the variables COURSENAME.1 to COURSENAME.5. If a student had fewer than 5 courses, than some of these variables will be blank for that student.
4. The names of courses taken by each student, which are sorted into alphabetical order within the student record (by the sort performed in the Long form in step 1), are concatenated into one long string, named STUDENT_COURSE_LIST.
Open the Transform menu and choose "Compute Variable".
Type "student_course_list" into the "Target Variable" box. Click the "Type and Label" button underneath the Target box.
Under "Type", choose String and type 50 into the "Width" box. A width of 50 will allow concatenation of the five 9-character name strings and the 4 semicolons between strings. This value will depend on the number and width of strings being concatenated. Click Continue.In the "String Expression" box, type:
Varible names (oursename.1, etc.) can be pasted into the expression by highlighting them in the variable list on the left side of the dialog and clicking the right-pointing arrow. Function names can be highlighted from the function lists on the right side. Click "String" in the "Function group" box and the string functions will appear in the "Functions and Special Variables" box, from which they can be highlighted and pasted into the string expression.
The CONCAT() function concatenates string variables and literal strings (such as ";") within the parentheses together. Items in the list must be separated by commas.
The RTRIM function removes trailing characters from the first argument (e.g., COURSENAM.1). If there is no second argument in the parentheses, then the character being trimmed is the blank character. Here RTRIM() prevents the appearance of blocks of blanks in the concatenated string.
5. The Aggregate procedure is used to produce a data set with one record for each value of
STUDENT_COURSE_LIST and a variable, NUM_students, with the number of students with that course list.
In the Data menu, choose Aggregate.
In the Aggregate dialog, paste STUDENT_COURSE_LIST into the "Break Variable(s):" box.
Click the check box beside "Number of cases" and enter a name for this variable, such as Num_students, in the "Name" box.
Under "Save", you can choose either the second option to create a data set or the third option to create a file. If you create a data set set, you'll need to activate that data set (from the Windows menu in SPSS Statistics) before proceeding in Step 6.. If you create a new file, then you'll need to open that file from File->Open->Data.
There is a section titled "Options for very large data sets" at the bottom of the Aggregate dialog.
If the data file is very large, ,especially if there is a large number of unique values for the break variable,
then you may want to take the option to "Sort the file before aggregating" or precede the Aggregate operation with a file sort by STUDENT_COURSE_LIST from the Data->Sort Cases dialog. If you took the latter option, then you would click the radio button beside "File is already sorted on break variable(s)" in the Aggregate dialog. Sorting by the break variable or telling AGGREGATE that the file is presorted by the break variable reduces the memory load of the AGGREGATE procedure.
6. Once opened or activated, the aggregated file is then sorted in descending order by NUM_students and in ascending order of STUDENT_COURSE_LIST within NUM_Students. This results in a file where the most frequent combinations are at the top of the file and easily viewed. For a given frequency, the combinations are in alphabetical order of the first course in the combination. Because courses were alphabetically ordered at the beginning stages, we avoid situations where a particular combination of courses can appear as separate combinations with distinct orders.
In the Data menu, choose "Sort Cases".
In the Sort dialog, paste Num_students into the "Sort by" box and click the radio button beside "Descending".
Then paste STUDENT_COURSE_LIST into the "Sort by" box below Num_students and click the radio button beside "Ascending".
7. You could list the course combinations for the 100 most frequent combinations by listing the first 100 cases in this sorted and aggregated file.
In the Analyze menu, choose Reports and then choose "Case Summaries" from the side menu for Reports.
In the Summarize Cases dialog, paste Num_students and STUDENT_COURSE_LIST into the variable box. Leave the "Grouping variable(s)" box empty.
From the lower part of the Summarize Cases dialog, check "Display cases" and "Limit cases to first". Enter 100 into the box beside "Limit cases to first". (You can leave this box unchecked to print the whole file to the output viewer.)
8. You can save the sorted and aggregated file from the "File->Save as" menu while that data is forward (i.e., the active window).
Here is the corresponding set of syntax commands. File paths have been included in some commands for illustration. The syntax begins with Step 1 in the description above, assuming that the long format of the data is open in the Data Editor. You would need to change the file paths to reflect your own
directory names. The only existing variable names that are assumed in the original file are ID (student ID) and COURSENAME, as specified in the Problem Description. All of the other variables are created by the command sequence and would not need to be changed to match your data. (Of course, you could change the name of computed variables as well, if you wished.)
* Find the frequency of each observed combination of courses.
* Restructure data from long format (1 record per course per student) into wide format (one record
* per student with course names (up to 5) as separate variables .
SORT CASES BY ID(A) COURSENAME(A).
compute courseseq = 1.
if (id = lag(id)) courseseq = lag(courseseq) + 1.
formats courseseq (f2).
/COUNT=course_taken "Courses taken by Student".
string student_course_list (A50).
compute student_course_list = concat(rtrim(coursename.1),";",rtrim(coursename.2),
SORT CASES BY student_course_list.
get file = 'C:\course_combinations\Course_combos.sav' .
sort cases by NUM_students (d) student_course_list (a).
list NUM_students student_course_list
/cases = from 1 to 100 .