Test data generation and results evaluation

Description of Datasets (Training data & Test data)

In this contest, the dataset is divided into three subsets: the training set, the test set and the last transaction set. The training set is obtained from the original dataset by removing a specified number of users (called test users) together with their data. To make sure that the items viewed by test users also exist in the training set, each such item must occur at least 20 times in the training set after the test users' data has been removed. For the removed test users, the last transaction of each user forms the last transaction set, and their remaining data forms the test set. You should train your model on the training set and apply it to the data in the test set to predict the last transaction of each test user.
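The split of the removed test users' data can be sketched as follows. This is only an illustration, not the GenData implementation: the Transaction and Split types and the SplitTestUsers function are hypothetical names, and the sketch assumes the input holds only the test users' transactions in chronological order.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One transaction: a user viewed a product. IDs are kept as strings
// so that leading zeros (e.g. "00104665") are preserved.
struct Transaction {
    std::string productId;
    std::string userId;
};

struct Split {
    std::vector<Transaction> test;       // all but the last transaction of each test user
    std::vector<Transaction> lastTrans;  // the last transaction of each test user
};

// Hypothetical helper. Assumption: 'transactions' contains only the removed
// test users' data, ordered from oldest to newest.
Split SplitTestUsers(const std::vector<Transaction>& transactions)
{
    // Record the index of each user's final transaction.
    std::map<std::string, std::size_t> lastIndex;
    for (std::size_t i = 0; i < transactions.size(); ++i)
        lastIndex[transactions[i].userId] = i;

    Split out;
    for (std::size_t i = 0; i < transactions.size(); ++i) {
        if (lastIndex[transactions[i].userId] == i)
            out.lastTrans.push_back(transactions[i]);
        else
            out.test.push_back(transactions[i]);
    }
    return out;
}
```

With this layout, a user who has only two transactions contributes one entry to each output list.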

You can download datasets that have already been generated if you don't want to generate them yourself.
Click here to download the datasets generated from the data of site 5202 (1000 test users, minimum click = 15).
Click here to download the datasets generated from the data of site 9426 (1000 test users, minimum click = 10).
Click here to download the datasets generated from the data of site 3699 (1000 test users, minimum click = 10).
Click here to download the datasets generated from the data of site 9093 (1000 test users, minimum click = 6).
Click here to download the datasets generated from the data of site 8631 (1000 test users, minimum click = 15).
Click here to download the datasets generated from the data of all sites (5000 test users, minimum click = 15).

Generate the datasets with the GenData class

In GenData class, there are 9 member functions:

To generate the data, follow these steps:

  1. Use SetDataFileName(string FileName) to set the name of the original data file. E.g. SetDataFileName("DataForClass.txt").
  2. Use SetSiteID(string SiteID) to set the id of the site from which you would like to build the three subsets. E.g. SetSiteID("5202"). If you would like to process all the data in the original dataset, simply use SetSiteID("ALL").
  3. Use LoadSiteData() to load all the data from the specified site. You can also restrict the range of the data with LoadSiteData(size_t istart, size_t iend). E.g. LoadSiteData(0, 9999).
  4. Use GenerateTrainTestFiles(size_t numtestusers, size_t minclicks) to generate the training set, test set and last transaction set. numtestusers is the number of test users, and minclicks is the minimum click threshold, i.e. each user in the test set must have viewed at least minclicks items. E.g. GenerateTrainTestFiles(1000, 5). NOTE: the minimum value of minclicks is 2, in which case a test user has one transaction in the test set and one in the last transaction set.
  5. The generated files are Train.txt, Test.txt and LastTrans.txt.

Evaluate the Results

You can use the GenData class to evaluate your results. The criterion is the hit rate on the transactions in the last transaction set.

To compute the hit rate, simply call TestHitRate() and it will return the rate.

Before you call this function, make sure LastTransList.txt (generated by the GenData code) and PredictedList.txt (generated by your code) are in the same directory as the code.

The format of PredictedList.txt is as follows:


15690609,00104665 0.5
15691689,00104665 0.455069
15692766,00104665 0.437516
15691143,00104665 0.436599
15692274,00104665 0.426644
15692457,00104665 0.34569
15691275,00104665 0.321064
15693027,00104665 0.309536
15690783,00104665 0.30613
15690864,00104665 0.300629

Each tuple has three parts: the product id (e.g. 15690609), the user id (e.g. 00104665) and the rating (e.g. 0.5). The product id and user id are separated by a comma, and the rating follows after a space. This example is a top-10 list: for each user in the test set, the algorithm should recommend the 10 products with the highest ratings.
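A line in this format could be read with a small parser like the one below. It is a minimal sketch under the format described above; the Prediction struct and ParsePredictionLine function are hypothetical names and are not part of the GenData class.

```cpp
#include <sstream>
#include <string>

struct Prediction {
    std::string productId;
    std::string userId;   // kept as a string to preserve leading zeros
    double rating;
};

// Parses one line of the form "productId,userId rating".
// Returns false if the comma, the user id or the rating is missing.
bool ParsePredictionLine(const std::string& line, Prediction& out)
{
    const std::string::size_type comma = line.find(',');
    if (comma == std::string::npos)
        return false;

    out.productId = line.substr(0, comma);

    // The remainder is "userId rating", separated by whitespace.
    std::istringstream rest(line.substr(comma + 1));
    return static_cast<bool>(rest >> out.userId >> out.rating);
}
```

The user id is stored as text rather than as a number because ids such as 00104665 would otherwise lose their leading zeros and no longer match the ids in the other files.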


The format of LastTransList.txt is as follows:


15691722,36076801
15691572,35782808
15690567,79834543
15690903,58631080
15692472,26595739
15693087,70355960
15691647,45011026
15692706,16548390
15691092,33494567
15690804,02625677
15690987,00154157

Each tuple has two parts: the product id (e.g. 15691722) and the user id (e.g. 36076801). The hit rate is computed by comparing the (product id, user id) pairs in both files: Hit Rate = (# tuples in both files) / (# tuples in LastTransList.txt).
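The formula can be illustrated with the sketch below. It is not the code behind TestHitRate(); the HitRate function is a hypothetical helper, and it assumes each tuple has already been reduced to a "productId,userId" key string (for predictions, the rating is dropped).

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not the GenData implementation.
// Each entry is a "productId,userId" key, e.g. "15691722,36076801".
double HitRate(const std::vector<std::string>& predictedKeys,
               const std::vector<std::string>& lastTransKeys)
{
    if (lastTransKeys.empty())
        return 0.0;  // avoid dividing by zero

    // Put the predictions in a set so each lookup is fast.
    const std::set<std::string> predicted(predictedKeys.begin(), predictedKeys.end());

    std::size_t hits = 0;
    for (const std::string& key : lastTransKeys)
        if (predicted.count(key) > 0)
            ++hits;

    return static_cast<double>(hits) / static_cast<double>(lastTransKeys.size());
}
```

For example, if two of four last transactions appear among the predicted tuples, the function returns 0.5.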


Example code

An example of how to use the GenData class is shown in Contest.cpp:

/*========================================================
File name:	Contest.cpp
Author:		Xiwei Wang
Created:	Nov.7, 2010
Last Modified:	Nov.23, 2010

Description:	main file for test purpose
=========================================================*/

#include <iostream>
#include "GenData.h"

int main()
{
	GenData *p_gd = new GenData();			// create a GenData object

	// generate the data files from the original data set
	p_gd->SetDataFileName("DataForClass.txt");	// set the original data file name
	p_gd->SetSiteID("5202");			// set the site ID
	p_gd->LoadSiteData();				// load all the transactions of site 5202
	p_gd->WriteSiteDataToFile();			// write all the transaction data of site 5202 to a file
	p_gd->GenerateTrainTestFiles(1000, 15);		// generate the three subsets: 1000 test users, each with at least 15 clicks

	// compute the hit rate of the results
	// (LastTransList.txt and PredictedList.txt must be in the same directory)
	std::cout << "Hit Rate: " << p_gd->TestHitRate() << std::endl;

	delete p_gd;	// delete the object
	return 0;
}

Build and run the code

The sample code is packaged as a tar archive, contest.tar, which contains 4 files:

Contest.cpp, GenData.h, GenData.cpp, makefile

First extract the archive, change to the "contest" directory and type "make". This builds the "contest" executable, which you can run by typing "./contest".

Make sure DataForClass.txt has been copied to the contest directory before you run the program.

Click here to download contest.tar


Last updated on Jan. 2, 2011