Thesis Title: Data Distribution and Performance Optimization Models for Parallel Data Mining
We have embarked upon a multitude of approaches to improve the efficiency of selected fundamental tasks in data mining. The present thesis is concerned with improving the efficiency of parallel processing methods for large amounts of data. We have devised new parallel frequent itemset mining algorithms that work on both sparse and dense datasets, and 1-D and 2-D parallel algorithms for the all-pairs similarity problem.
Two new parallel frequent itemset mining (FIM) algorithms named NoClique and NoClique2 parallelize our sequential vertical frequent itemset mining algorithm named bitdrill, and uses a method based on graph partitioning by vertex separator (GPVS) to distribute and selectively replicate items. The method operates on a graph where vertices correspond to frequent items and edges correspond to frequent itemsets of size two. We show that partitioning this graph by a vertex separator is sufficient to decide a distribution of the items such that the sub-databases determined by the item distribution can be mined independently. This distribution entails an amount of data replication, which may be reduced by setting appropriate weights to vertices. The data distribution scheme is used in the design of two new parallel frequent itemset mining algorithms. Both algorithms replicate the items that correspond to the separator. NoClique replicates the work induced by the separator and NoClique2 computes the same work collectively. Computational load balancing and minimization of redundant or collective work may be achieved by assigning appropriate load estimates to vertices. The performance is compared to another parallelization that replicates all items, and ParDCI algorithm.
We introduce another parallel FIM method using a variation of item distribution with selective item replication. We extend the GPVS model for parallel FIM we have proposed earlier, by relaxing the condition of independent mining. Instead of finding independently mined item sets, we may minimize the amount of communication and partition the candidates in a fine-grained manner. We introduce a hypergraph partitioning model of the parallel computation where vertices correspond to candidates and hyperedges correspond to items. A load estimate is assigned to each candidate with vertex weights, and item frequencies are given as hyperedge weights. The model is shown to minimize data replication and balance load accurately. We also introduce a re-partitioning model since we can generate only so many levels of candidates at once, using fixed vertices to model previous item distribution/replication. Experiments show that we improve over the higher load imbalance of NoClique2 algorithm for the same problem instances at the cost of additional parallel overhead.
For the all-pairs similarity problem, we extend recent efficient sequential algorithms to a parallel setting, and obtain document-wise and term-wise parallelizations of a fast sequential algorithm, as well as an elegant combination of two algorithms that yield a 2-D distribution of the data. Two effective algorithmic optimizations for the term-wise case are reported that make the term-wise parallelization feasible. These optimizations exploit local pruning and block processing of a number of vectors, in order to decrease communication costs, the number of candidates, and communication/computation imbalance. The correctness of local pruning is proven. Also, a recursive term-wise parallelization is introduced. The performance of the algorithms are shown to be favorable in extensive experiments, as well as the utility of two major optimizations.