Team Members: Zhewei Shi (zheweis), Weijie Chen (weijiec)
Project Webpage: https://zheweishi.github.io/15618-Project/
After submitting the project proposal, we did more research on the topic we are going to dive into. With sufficient background study, we started working on the implementation of the alternating least squares (ALS) algorithm for recommendation systems. We have finished implementing ALS in both a single-process model and an MPI model, and we have conducted many experiments on the MovieLens dataset. The results show that our implementation trains efficiently when the number of processes is below 16, and that it achieves decent accuracy.
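For reference, ALS alternates between fixing the movie factors to solve for each user's factor vector and then fixing the user factors to solve for each movie's. Below is a sketch of the standard L2-regularized ALS updates (our exact regularization scheme may differ; notation is ours: f latent features, regularization weight λ, ratings r_ij, I_i the set of movies rated by user i, and I_j the set of users who rated movie j):

```latex
u_i \leftarrow \Big( \sum_{j \in I_i} m_j m_j^{\top} + \lambda I_f \Big)^{-1} \sum_{j \in I_i} r_{ij}\, m_j
\qquad
m_j \leftarrow \Big( \sum_{i \in I_j} u_i u_i^{\top} + \lambda I_f \Big)^{-1} \sum_{i \in I_j} r_{ij}\, u_i
```

Each update reduces to an independent f × f linear solve per user or movie row, and that independence is what the MPI version parallelizes over.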
We downloaded a sample dataset from the MovieLens website and conducted many experiments on it. Currently, we use root mean squared error (RMSE) as the criterion for error measurement.
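As a concrete reference, the metric can be computed as in the following sketch (the `Rating` struct and the `predict` callback are illustrative placeholders, not our exact interfaces):

```cpp
#include <cmath>
#include <vector>

// An observed rating triple (user index, movie index, rating value).
struct Rating { int user; int movie; double value; };

// RMSE = sqrt( (1/N) * sum over observed (i,j) of (r_ij - prediction_ij)^2 ).
// `predict(user, movie)` stands in for the dot product of the learned
// user and movie factor vectors.
template <typename PredictFn>
double rmse(const std::vector<Rating>& ratings, PredictFn predict) {
    double sumSquaredError = 0.0;
    for (const Rating& r : ratings) {
        const double err = r.value - predict(r.user, r.movie);
        sumSquaredError += err * err;
    }
    return std::sqrt(sumSquaredError / ratings.size());
}
```

The following are some of the experimental results we obtained: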
(1) The training error decreases with the number of iterations on a single core (serial version):
(2) We use the same initialization values for all experiments. The training error on multiple cores is identical to that from a single core, which verifies the correctness of the parallel MPI version of our code (a small sketch of this check appears after the speedup results below).
(3) The speedup results:
Experiments on Andrew Linux servers (numIterations = 30, numFeatures = 10)
Experiments on Latedays cluster (numIterations = 5, numFeatures = 16)
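To make the check from result (2) concrete: one way to guarantee identical starting points is to seed the random generator with the same constant in every run and on every MPI rank, as in this illustrative sketch (the function name and seed value are assumptions, not our exact code):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Seed every run (serial or MPI, on every rank) with the same constant so
// all versions start from identical factor matrices; any divergence in the
// training error curve then indicates a parallelization bug.
std::vector<double> initFactors(std::size_t count, unsigned seed = 15618) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> factors(count);
    for (double& x : factors) x = dist(gen);
    return factors;
}
```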
We can further analyze our experiment results:
(1) The smaller numIterations is, the better the speedup we get. The start of each iteration depends on the results of the previous iteration, so this dependency between iterations is inherently serial and hard to parallelize.
(2) The larger the dataset is, the better the speedup we get. Our approach parallelizes over the user/movie factor matrices, with each worker responsible for a subset of the rows, so a larger dataset gives every worker more independent work per iteration to amortize the communication cost. This is why our algorithm shows a clear performance improvement on larger datasets (see the sketch below).
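The sketch below illustrates that partitioning scheme for the user half-iteration (the names, the row-solve stub, and the contiguous block distribution are illustrative assumptions, not necessarily our exact code). Each rank updates its block of user rows and then all ranks exchange the full factor matrix before the movie half-iteration:

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Update this rank's block of user factor rows, then gather all rows.
// userFactors is a row-major numUsers x f matrix replicated on every rank.
void updateUsersDistributed(std::vector<double>& userFactors,
                            int numUsers, int f, int rank, int nprocs) {
    const int rowsPerRank = (numUsers + nprocs - 1) / nprocs;
    const int begin = std::min(rank * rowsPerRank, numUsers);
    const int end   = std::min(begin + rowsPerRank, numUsers);

    for (int u = begin; u < end; ++u) {
        // Solve the f x f regularized normal equations for user u here,
        // writing the result into userFactors[u * f .. u * f + f).
    }

    // Share every rank's block so the next half-iteration sees all rows.
    std::vector<int> counts(nprocs), displs(nprocs);
    for (int p = 0; p < nprocs; ++p) {
        const int b = std::min(p * rowsPerRank, numUsers);
        const int e = std::min(b + rowsPerRank, numUsers);
        counts[p] = (e - b) * f;
        displs[p] = b * f;
    }
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   userFactors.data(), counts.data(), displs.data(),
                   MPI_DOUBLE, MPI_COMM_WORLD);
}
```

The all-gather at the end of each half-iteration is the main communication cost, and a larger dataset provides more row solves to amortize it, which matches the trend we observe.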
So far, we have only conducted experiments on a sample dataset from MovieLens; we have not yet done extensive and comprehensive experiments and analysis of the ALS algorithm. In the following weeks, we will have more time to devote to this project, and we believe we can produce all of the deliverables planned in our proposal. The online learning part, which we set as a bonus goal, depends on the progress of the SGD implementation.
Based on our current experimental results, we have made some modifications to our goals:
- Plan To Achieve
(1) ALS
Implementation of sequential / parallelized ALS using MPI on Xeon CPUs
Parallelized ALS achieves nearly linear speedup before reaching 10x (changed from 20x based on our current experiment results)
Experiments and analysis of ALS on public datasets (MovieLens + Netflix)
(2) SGD
Implementation of sequential / parallelized SGD using OpenMP on Xeon Phi
Parallelized SGD achieves nearly linear speedup before reaching 10x
Experiments and analysis of SGD on the same public datasets
- Hope To Achieve
Further optimization of ALS / SGD to achieve 20x speedup (newly added)
Explore parallelization on online learning for recommendation systems
During the poster session, we plan to show:
(1) speedup performance graphs for both the ALS and SGD algorithms
(2) train / test error graphs for both algorithms
If possible, we will spend some time exploring parallelization of online learning, and we may build a demo showing the online recommendation system.