
Fast Matrix Factorization for Recommendation Systems

Team Members: Zhewei Shi (zheweis), Weijie Chen (weijiec)

Project Webpage: https://zheweishi.github.io/15618-Project/

Summary

We are going to implement two parallel matrix factorization algorithms for recommendation systems, using the MPI and OpenMP frameworks. The implementation and analysis will target Xeon CPU and Xeon Phi platforms.

Background

In mathematics, matrix factorization (or matrix decomposition) represents a matrix as a product of matrices. It is one of the most popular collaborative filtering techniques for recommendation systems. The problem we will work on is the standard one: given a sparse user-item interaction matrix M, find a user factor matrix U and an item factor matrix I such that M ≈ UIᵀ on the observed entries.
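A common objective, consistent with references [1] and [5] (the notation is ours: Ω is the set of observed entries of M, u_r and i_c are the r-th row of U and the c-th row of I, and λ weights the Tikhonov regularization discussed below), is:

\min_{U, I} \; \sum_{(r,c) \in \Omega} \left( M_{rc} - u_r^{\top} i_c \right)^2 \; + \; \lambda \left( \lVert U \rVert_F^2 + \lVert I \rVert_F^2 \right)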

Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches for computing this factorization (i.e., the matrices U and I). We discuss each algorithm, and how we plan to parallelize it, in turn.

ALS

If we apply mean-square loss and append a Tikhonov regularization term to the loss function, the optimal U with I held fixed can be obtained in closed form, and the solution for each row of U is independent of the others. ALS therefore alternates as follows:

1. Initialize matrix I (e.g., randomly)
2. Fix I and update each row of U by solving its regularized least-squares problem
3. Fix U and update each row of I in the same (symmetric) way
4. Repeat steps 2-3 until convergence

Since the per-row calculations are independent, all rows of U can be updated simultaneously, so we can parallelize ALS by parallelizing the updates of U (step 2) and of I (step 3).
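A minimal sketch of the closed-form per-row update, assuming the Eigen library (the function name and dense layout are our illustration, not a fixed design decision):

#include <Eigen/Dense>
#include <vector>

// Solve (I_r^T I_r + lambda * Id) u_r = I_r^T m_r for one user r, where
// I_r stacks the factor rows of the items that user r interacted with.
// items: indices of those items; ratings: the matching entries of M;
// I: numItems x k item factor matrix; lambda: regularization weight.
Eigen::VectorXd updateUserRow(const std::vector<int>& items,
                              const std::vector<double>& ratings,
                              const Eigen::MatrixXd& I, double lambda) {
  const int k = static_cast<int>(I.cols());
  Eigen::MatrixXd A = lambda * Eigen::MatrixXd::Identity(k, k);
  Eigen::VectorXd b = Eigen::VectorXd::Zero(k);
  for (std::size_t j = 0; j < items.size(); ++j) {
    Eigen::VectorXd ij = I.row(items[j]).transpose();
    A.noalias() += ij * ij.transpose();  // accumulate I_r^T I_r
    b.noalias() += ratings[j] * ij;      // accumulate I_r^T m_r
  }
  return A.ldlt().solve(b);  // closed-form regularized least squares
}

Each call reads only I and writes only one row of U, which is exactly why the row updates can run in parallel.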

SGD

Assuming that matrix M contains v observed interaction values and that we run SGD for T rounds, the algorithm is:

Initialize matrices U and I
for t = 1 to T:
  Draw i from {1...v} uniformly at random
  Compute the loss of the i-th interaction using the current U and I
  Update the affected row of U and row of I along the gradient of that loss
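For the user r and item c of the sampled interaction, this update takes the standard form (η is the learning rate; the notation is ours):

e_{rc} = M_{rc} - u_r^{\top} i_c,\qquad
u_r \leftarrow u_r + \eta \left( e_{rc}\, i_c - \lambda u_r \right),\qquad
i_c \leftarrow i_c + \eta \left( e_{rc}\, u_r - \lambda i_c \right)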

SGD is harder to parallelize because the process is inherently serial. One possible approach is to let multiple threads overwrite U and I at the same time without any locking, assuming that the interaction matrix M is sparse enough to make conflicting writes rare; this is the Hogwild approach [3]. Another is to grid the interaction matrix M into blocks that share no rows or columns, so that each thread picks random interactions only from its assigned block; the independence between blocks is what parallelizes SGD [2][4].
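A minimal sketch of the lock-free strategy under OpenMP (the data layout, function name, and hyperparameters are our assumptions for illustration):

#include <omp.h>
#include <random>
#include <vector>

struct Interaction { int user, item; double rating; };

// Lock-free parallel SGD in the spirit of Hogwild [3]: threads update the
// shared factors concurrently, relying on the sparsity of M to make
// conflicting writes rare. U and I are row-major with k factors per row.
void hogwildSgd(const std::vector<Interaction>& data,
                std::vector<double>& U, std::vector<double>& I,
                int k, double lr, double lambda, long T) {
  #pragma omp parallel
  {
    std::mt19937 rng(42 + omp_get_thread_num());  // per-thread RNG
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    #pragma omp for
    for (long t = 0; t < T; ++t) {
      const Interaction& x = data[pick(rng)];
      double* u = &U[static_cast<std::size_t>(x.user) * k];
      double* i = &I[static_cast<std::size_t>(x.item) * k];
      double err = x.rating;
      for (int f = 0; f < k; ++f) err -= u[f] * i[f];
      for (int f = 0; f < k; ++f) {  // unsynchronized gradient step
        double uf = u[f];
        u[f] += lr * (err * i[f] - lambda * uf);
        i[f] += lr * (err * uf - lambda * i[f]);
      }
    }
  }
}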

Challenge

ALS and SGD have quite different workload patterns, so the challenges they exhibit also differ.

Workload

ALS

SGD

System Constraints

ALS (Xeon CPUs + MPI)

Before each iteration of computation starts, every worker needs a complete, up-to-date copy of the shared factor data, so the communication cost of the ALS algorithm is substantial. This cost grows with both the scale of the data and the number of workers involved.
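To make the cost concrete, a sketch of the per-iteration exchange we have in mind (assuming MPI; the row partition and function name are hypothetical):

#include <mpi.h>
#include <vector>

// Each rank updates its contiguous block of rows of U, then all ranks
// gather the complete updated U before the symmetric I-update begins.
// U: numUsers x k doubles, row-major; rank r owns rows
// [rowStart[r], rowStart[r + 1]).
void exchangeFactor(std::vector<double>& U, int k,
                    const std::vector<int>& rowStart, int nprocs) {
  std::vector<int> counts(nprocs), displs(nprocs);
  for (int r = 0; r < nprocs; ++r) {
    counts[r] = (rowStart[r + 1] - rowStart[r]) * k;
    displs[r] = rowStart[r] * k;
  }
  // Every rank ends up with a full copy of U; this message volume is
  // the communication cost described above.
  MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 U.data(), counts.data(), displs.data(),
                 MPI_DOUBLE, MPI_COMM_WORLD);
}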

SGD (Xeon Phi + OpenMP)

Most datasets can fit into the memory of the Xeon Phi, so the main challenge in our work is ensuring that all threads work together efficiently. In addition, because of the limited cache size and the random access pattern of SGD, there will be many cache misses during computation.

Resources

We will implement this project from scratch, using the following papers as references:

[1] Zhou, Yunhong, et al. “Large-scale parallel collaborative filtering for the netflix prize.” International Conference on Algorithmic Applications in Management. Springer, Berlin, Heidelberg, 2008.

[2] Zhuang, Yong, et al. “A fast parallel SGD for matrix factorization in shared memory systems.” Proceedings of the 7th ACM conference on Recommender systems. ACM, 2013.

[3] Recht, Benjamin, et al. “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent.” Advances in neural information processing systems. 2011.

[4] Gemulla, Rainer, et al. “Large-scale matrix factorization with distributed stochastic gradient descent.” Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.

[5] Yu, Hsiang-Fu, et al. “Parallel matrix factorization for recommender systems.” Knowledge and Information Systems 41.3 (2014): 793-819.

Goals and Deliverables

Plan To Achieve

ALS

SGD

Hope To Achieve

Backoff Plan

Platform Choice

We plan to develop the whole project in C++ on the latedays cluster. C++ offers a good trade-off between productivity and performance, and it has many useful libraries for efficient matrix computation. Since we will use Xeon CPUs and Xeon Phi, the latedays cluster is a good platform for getting access to these resources.

Tentative Schedule

Date   Goal
10/31  Submit Project Proposal
11/03  Background Study
11/10  Serial Implementation of ALS
11/17  Parallel Implementation of ALS
11/19  Project Intermediate Checkpoint Report Due
11/24  ALS Experiments / Serial Implementation of SGD
12/01  Parallel Implementation of SGD + Experiments and Analysis
12/08  Online Learning Exploration
12/15  Wrap Up and Final Project Submission
