Team Members: Zhewei Shi (zheweis), Weijie Chen (weijiec)
Project Webpage: https://zheweishi.github.io/15618-Project/
We are going to implement two parallelized matrix factorization algorithms for recommendation systems using the MPI and OpenMP frameworks. The implementation and analysis will be carried out on Xeon CPU and Xeon Phi platforms.
In mathematics, matrix factorization (or matrix decomposition) is the representation of a matrix as a product of matrices. It is one of the most popular collaborative filtering techniques for recommendation systems. We first briefly introduce the problem that we will be working on in this project:
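Given m users and n items, let M be the m × n interaction (rating) matrix, of which only a small fraction of entries are observed. Matrix factorization seeks a user factor matrix U (m × k) and an item factor matrix I (n × k), with k much smaller than m and n, such that M ≈ U Iᵀ. A standard formulation (a sketch in our own notation; the exact loss and weighting may change as we implement) minimizes the regularized squared error over the set of observed entries Ω:

\min_{U, I} \sum_{(u,i) \in \Omega} \left( M_{ui} - U_u I_i^{\top} \right)^2 + \lambda \left( \lVert U \rVert_F^2 + \lVert I \rVert_F^2 \right)

where U_u and I_i are the feature vectors (rows) of user u and item i, and λ controls the Tikhonov regularization term.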
Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches for computing this factorization (that is, the matrices U and I). We discuss each algorithm, and how we plan to parallelize it, in turn.
If we use a mean-squared loss with a Tikhonov regularization term, the optimal U for a fixed I can be obtained in closed form. Moreover, the calculation for each row of U is independent of the others. To update U, we can therefore update all of its rows simultaneously, and we can parallelize ALS by parallelizing the update of U (step 2 of the ALS procedure) and, symmetrically, the update of I (step 3).
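Concretely, setting the gradient of the regularized loss above with respect to a single row U_u to zero (with I fixed) gives the normal-equation update, sketched here following [1] (Ω_u denotes the items rated by user u, I_{Ω_u} the corresponding rows of I, M_{u,Ω_u} the corresponding ratings, and I_k the k × k identity; reference [1] additionally scales λ by the number of ratings per user):

U_u \leftarrow \left( I_{\Omega_u}^{\top} I_{\Omega_u} + \lambda I_k \right)^{-1} I_{\Omega_u}^{\top} M_{u,\Omega_u}

Each U_u depends only on the fixed I and on user u's own ratings, which is exactly why the rows can be updated in parallel (for example, one OpenMP thread or MPI rank per group of rows).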
Assuming that we have v observed interaction values in the matrix M and run SGD for T rounds, the algorithm is as follows:
```
Initialize matrices U and I
for t = 1 to T:
    Draw i from {1...v} uniformly at random
    Calculate the loss of the i-th interaction using the current U and I
    Update U and I based on the loss
```
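To make the last step concrete, below is a minimal C++ sketch of a single SGD update for one observed entry. The names are placeholders rather than our final interface (`V` stands for the item matrix I, to avoid clashing with the identity symbol; `lr`, `lambda`, and `K` are the learning rate, regularization weight, and factor dimension).

```cpp
#include <cstddef>

// One SGD step for the observed entry (u, i) with rating m_ui.
// U and V are row-major factor matrices (num_users x K and num_items x K);
// V plays the role of the item matrix I. Sketch only -- names and data
// layout are placeholders, not a final design.
void sgd_update(float* U, float* V, int u, int i, float m_ui,
                int K, float lr, float lambda) {
    float* Uu = U + (size_t)u * K;
    float* Vi = V + (size_t)i * K;

    // Prediction error for this interaction.
    float err = m_ui;
    for (int k = 0; k < K; ++k) err -= Uu[k] * Vi[k];

    // Gradient step on both factor vectors, with L2 regularization.
    for (int k = 0; k < K; ++k) {
        float uk = Uu[k], vk = Vi[k];
        Uu[k] += lr * (err * vk - lambda * uk);
        Vi[k] += lr * (err * uk - lambda * vk);
    }
}
```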
SGD is harder to parallelize because the process is inherently serial. One possible approach is to let multiple threads overwrite the U and I matrices at the same time, relying on the interaction matrix M being sufficiently sparse. Another is to partition M into a grid of independent blocks: each thread only picks random pairs from its assigned block, and the independence between blocks makes the SGD updates parallelizable.
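As a rough illustration of the first (lock-free) option, an OpenMP loop in the spirit of Hogwild [3] could look like the sketch below. It reuses the `sgd_update` sketch from above, and for brevity it sweeps over a pre-collected list of observed entries each round instead of drawing them uniformly at random; `T_rounds`, `lr`, and `lambda` are placeholder parameters.

```cpp
#include <omp.h>
#include <vector>

struct Entry { int u, i; float rating; };

// Defined in the previous sketch.
void sgd_update(float* U, float* V, int u, int i, float m_ui,
                int K, float lr, float lambda);

// Hogwild-style parallel SGD (sketch): all threads update the shared U and V
// without locks, relying on the sparsity of M to keep write conflicts rare [3].
void parallel_sgd(float* U, float* V, const std::vector<Entry>& samples,
                  int K, int T_rounds, float lr, float lambda) {
    for (int t = 0; t < T_rounds; ++t) {
        #pragma omp parallel for schedule(static)
        for (long s = 0; s < (long)samples.size(); ++s) {
            const Entry& e = samples[s];
            sgd_update(U, V, e.u, e.i, e.rating, K, lr, lambda);  // lock-free write
        }
    }
}
```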
ALS and SGD have quite different workload patterns, so the challenges they exhibit are also different.
Dependency between update steps. While one matrix is being updated, the other is held fixed, so the sequence of update steps is inherently serial. Because of this dependency, we can only explore parallelism within a single update step.
High ratio of communication to computation. While updating the user/item matrix, each worker must keep a complete, up-to-date copy of the other matrix. If the work is distributed across workers, there will be massive communication between them when an update step starts or ends.
Data compression. Fortunately, locality is not a key problem for ALS, because each time the model retrieves the complete (and contiguous) feature vector of a user/item. However, we should still pay attention to representing and storing the data efficiently so that we can further exploit locality.
Dependency between steps. The updates of the user and item feature vectors in different steps depend on one another, so it is very hard to parallelize across steps. Keeping all threads busy and fully parallelizing the algorithm is the main challenge for SGD.
Bad locality. Because each data pair (interaction) is selected from the dataset at random, there are many random memory accesses, which can lead to a high cache-miss rate.
Before each iteration starts, every worker must obtain a complete, updated copy of the data, so the communication cost of the ALS algorithm is substantial. This cost, between threads and between nodes, grows with the scale of the data and the number of workers involved.
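For the MPI version of ALS, this exchange would most likely be an all-gather of the freshly updated factor rows at the end of each half-iteration. Below is a minimal sketch, assuming for simplicity that every rank owns an equal-sized contiguous slice of U; real code would likely need MPI_Allgatherv for uneven partitions.

```cpp
#include <mpi.h>
#include <vector>

// After each rank updates its own slice of U (slice_rows x K floats, stored
// contiguously in U_local), every rank needs the complete updated U before
// the item-side update can start. Equal slice sizes are assumed here.
void exchange_user_factors(std::vector<float>& U_local,
                           std::vector<float>& U_full,  // num_ranks * U_local.size()
                           MPI_Comm comm) {
    MPI_Allgather(U_local.data(), (int)U_local.size(), MPI_FLOAT,
                  U_full.data(),  (int)U_local.size(), MPI_FLOAT, comm);
}
```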
Most datasets can fit into the memory of a Xeon Phi, so ensuring that all threads work cooperatively is the main challenge in our work. In addition, because of the limited cache size and the random access pattern of SGD, there will be many cache misses during computation.
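On the storage side (the data-compression point raised for ALS above, which also affects SGD's memory footprint), one candidate we are considering is a CSR-style layout for the interaction matrix, so that each user's observed ratings sit contiguously in memory. A minimal sketch:

```cpp
#include <vector>

// CSR-style storage for the sparse interaction matrix M (sketch).
// The ratings of user u occupy vals[row_ptr[u] .. row_ptr[u+1]) and the
// corresponding item ids occupy the same range of col_idx, so an ALS row
// update streams through one contiguous range instead of a dense m x n matrix.
struct SparseRatings {
    std::vector<int>   row_ptr;  // size num_users + 1
    std::vector<int>   col_idx;  // item id of each observed rating
    std::vector<float> vals;     // rating values
};
```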
We will implement this project from scratch, using the following papers as references:
[1] Zhou, Yunhong, et al. “Large-scale parallel collaborative filtering for the netflix prize.” International Conference on Algorithmic Applications in Management. Springer, Berlin, Heidelberg, 2008.
[2] Zhuang, Yong, et al. “A fast parallel SGD for matrix factorization in shared memory systems.” Proceedings of the 7th ACM conference on Recommender systems. ACM, 2013.
[3] Recht, Benjamin, et al. “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent.” Advances in neural information processing systems. 2011.
[4] Gemulla, Rainer, et al. “Large-scale matrix factorization with distributed stochastic gradient descent.” Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
[5] Yu, Hsiang-Fu, et al. “Parallel matrix factorization for recommender systems.” Knowledge and Information Systems 41.3 (2014): 793-819.
We plan to develop the whole project in C++ on the latedays cluster. C++ offers a good trade-off between productivity and performance and has many useful libraries for efficient matrix computation. Because we will target Xeon CPUs and Xeon Phi, the latedays cluster is a good platform for accessing these resources.
Date | Goal |
---|---|
10/31 | Submit Project Proposal |
11/03 | Background Study |
11/10 | Serial Implementation of ALS |
11/17 | Parallel Implementation of ALS |
11/19 | Project Intermediate Checkpoint Report Due |
11/24 | Experiments on ALS / Serial Implementation of SGD |
12/01 | Parallel Implementation of SGD + Experiments and Analysis |
12/08 | Online Learning Exploration |
12/15 | Wrap Up and Final Project Submission |