In this contest, the dataset is divided into three subsets: the training set, the test set, and the last transaction set. The training set is obtained from the original dataset by removing a specified number of users (called test users) together with their data. To make sure that the items viewed by test users also exist in the training set, each such item must occur at least 20 times in the training set after the test users' data have been removed. For each removed test user, the last transaction goes into the last transaction set and the remaining data go into the test set. You should use the training set to train your model, then apply the model to the data in the test set to predict the last transaction of each test user.
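The split described above can be sketched as follows. This is only an illustration under stated assumptions, not the GenData implementation: `Transaction`, `Split`, and `SplitByTestUsers` are hypothetical names, a transaction is assumed to be a single (user, item) record, the input is assumed to be in chronological order, and the minimum item count filter is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record type: one (user, item) view. The real data format may differ.
struct Transaction {
    std::string userId;
    std::string itemId;
};

// The three subsets produced by the split.
struct Split {
    std::vector<Transaction> train;      // data of all non-test users
    std::vector<Transaction> test;       // data of test users, except their last transaction
    std::vector<Transaction> lastTrans;  // last transaction of each test user
};

// Hypothetical helper: splits chronologically ordered transactions
// given an already chosen set of test users.
Split SplitByTestUsers(const std::vector<Transaction>& all,
                       const std::set<std::string>& testUsers) {
    // Remember the position of the last transaction of every test user.
    std::map<std::string, std::size_t> lastIndex;
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (testUsers.count(all[i].userId) > 0) {
            lastIndex[all[i].userId] = i;
        }
    }

    Split out;
    for (std::size_t i = 0; i < all.size(); ++i) {
        const Transaction& t = all[i];
        const auto it = lastIndex.find(t.userId);
        if (it == lastIndex.end()) {
            out.train.push_back(t);      // not a test user
        } else if (it->second == i) {
            out.lastTrans.push_back(t);  // the test user's last transaction
        } else {
            out.test.push_back(t);       // earlier data of a test user
        }
    }
    // Every input record ends up in exactly one subset.
    assert(out.train.size() + out.test.size() + out.lastTrans.size() == all.size());
    return out;
}
```

Calling `SplitByTestUsers(all, testUsers)` returns the three subsets in one pass over the data. How the test users are chosen is not covered by this sketch.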
You can download the datasets that have already been generated if you don't want to generate them yourself.
Click here to download the datasets generated from the data of site 5202 (1000 test users, minimum click = 15).
Click here to download the datasets generated from the data of site 9426 (1000 test users, minimum click = 10).
Click here to download the datasets generated from the data of site 3699 (1000 test users, minimum click = 10).
Click here to download the datasets generated from the data of site 9093 (1000 test users, minimum click = 6).
Click here to download the datasets generated from the data of site 8631 (1000 test users, minimum click = 15).
Click here to download the datasets generated from the data of all sites (5000 test users, minimum click = 15).
The GenData class has 9 member functions:
To generate the data, you could do it like this:
You can use the GenData class to evaluate your results. The criterion is the hit rate on the transactions in the last transaction set.
To compute the hit rate, simply call TestHitRate(), which returns the rate.
Before you call this function, make sure LastTransList.txt (generated by my code) and PredictedList.txt (generated by your code) are in the same directory as the code.
The format of PredictedList.txt is as follows:
Each tuple has three parts: the product id (e.g. 15691689), the user id (e.g. 00104665), and the rating (e.g. 0.5). This example is for top-10 recommendation, so for each user in the test set the algorithm should recommend the 10 products with the highest ratings.
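Choosing the top-rated products for one user could look like the sketch below. `Prediction` and `TopN` are hypothetical names introduced here for illustration, not part of the GenData class, and the ids are kept as strings on the assumption that leading zeros (as in 00104665) matter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One predicted tuple: product id, user id, rating.
struct Prediction {
    std::string productId;
    std::string userId;
    double rating;
};

// Hypothetical helper: returns the n highest-rated predictions of one user,
// ordered from highest to lowest rating. Fewer than n are returned when
// fewer candidates are available.
std::vector<Prediction> TopN(std::vector<Prediction> candidates, std::size_t n) {
    const std::size_t keep = std::min(n, candidates.size());
    assert(keep <= candidates.size());
    const auto middle = candidates.begin() + static_cast<std::ptrdiff_t>(keep);
    // Only the first `keep` positions need to be sorted.
    std::partial_sort(candidates.begin(), middle, candidates.end(),
                      [](const Prediction& a, const Prediction& b) {
                          return a.rating > b.rating;
                      });
    candidates.erase(middle, candidates.end());
    return candidates;
}
```

For a top-10 list, call `TopN(candidatesOfUser, 10)` once per test user and write the returned tuples to PredictedList.txt.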
The format of LastTransList.txt is as follows:
Each tuple has two parts: the product id (e.g. 15691722) and the user id (e.g. 36076801). The hit rate is computed by comparing the tuples in the two files: Hit Rate = (# tuples in both files) / (# tuples in LastTransList.txt).
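The hit rate formula can be checked on in-memory data with a sketch like the following. It is not the code behind TestHitRate(): `Tuple` and `ComputeHitRate` are hypothetical names, file parsing is left out, and a predicted tuple is assumed to count as a hit when its (product id, user id) pair also appears in the last transaction list.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A (product id, user id) pair. Ids are kept as strings so that
// leading zeros (e.g. "00104665") are preserved.
using Tuple = std::pair<std::string, std::string>;

// Hypothetical helper: returns
// (# tuples in both lists) / (# tuples in lastTrans).
double ComputeHitRate(const std::vector<Tuple>& predicted,
                      const std::vector<Tuple>& lastTrans) {
    if (lastTrans.empty()) {
        return 0.0;  // avoid division by zero
    }
    const std::set<Tuple> predictedSet(predicted.begin(), predicted.end());
    std::size_t hits = 0;
    for (const Tuple& t : lastTrans) {
        if (predictedSet.count(t) > 0) {
            ++hits;  // this last transaction was among the recommendations
        }
    }
    assert(hits <= lastTrans.size());
    return static_cast<double>(hits) / static_cast<double>(lastTrans.size());
}
```

With 1000 test users and one last transaction per user, a return value of 0.05 would mean that 50 last transactions appeared in the recommended lists.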
An example of how to use the GenData class is shown in Contest.cpp:
/*========================================================
  File name:     Contest.cpp
  Author:        Xiwei Wang
  Created:       Nov.7, 2010
  Last Modified: Nov.23, 2010
  Description:   main file for test purpose
=========================================================*/

#include "GenData.h"

int main()
{
    GenData *p_gd = new GenData();  // create an object of GenData

    // generate the data files from the original data set
    p_gd->SetDataFileName("DataForClass.txt");  // set the original data file name
    p_gd->SetSiteID("5202");                    // set the site ID
    p_gd->LoadSiteData();                       // load all the transactions of site 5202
    p_gd->WriteSiteDataToFile();                // write all the transaction data of site 5202 to the file
    p_gd->GenerateTrainTestFiles(1000, 15);     // generate the three subsets with 1000 users in test set and the minimum view count is 15

    // test the hit rate for the results
    cout << "Hit Rate:" << p_gd->TestHitRate() << endl;

    delete p_gd;  // delete the object
    return 0;
}
The sample code is packaged in a tar archive named contest.tar, which contains 4 files:
Contest.cpp, GenData.h, GenData.cpp, makefile
First extract the archive, go to the “contest” directory, and type “make” to build the “contest” executable. Then type “./contest” to run the program.
Make sure DataForClass.txt is copied to the contest directory before you run it.
Click here to download contest.tar