19-475-0503 MINING OF MASSIVE DATA SETS
Core/Elective: Elective | Semester: 5 | Credits: 4
Course Description

Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the
fast development of networking, data storage, and data collection capacity, Big Data is now rapidly
expanding in all science and engineering domains. Traditional data mining algorithms therefore need to be
adapted to deal with ever-expanding datasets of tremendous volume.

Course Objectives

To understand the algorithms that are applied to large amounts of data
To gain hands-on experience with distributed file systems and MapReduce as a tool for creating
parallel algorithms
To explore streaming data and some of the techniques and algorithms specifically extended for
mining stream data

Course Content

Module I
Introduction to MapReduce – the map and reduce tasks, MapReduce workflow, fault tolerance. Algorithms for MapReduce – matrix multiplication, relational algebra operations. Complexity theory for MapReduce.
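
As an illustration of the matrix multiplication topic, the following is a minimal single-machine sketch of the one-pass MapReduce formulation of P = MN. It only simulates the map, shuffle, and reduce steps in plain Python; the function names (map_phase, shuffle, reduce_phase, mapreduce_matmul) are illustrative assumptions and do not belong to any MapReduce framework.

```python
from collections import defaultdict


def map_phase(M, N):
    """Emit ((i, k), (tag, j, value)) pairs for the product P = M N."""
    rows_m, cols_n = len(M), len(N[0])
    # Element m_ij is needed by every output cell (i, k).
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            for k in range(cols_n):
                yield (i, k), ("M", j, m_ij)
    # Element n_jk is needed by every output cell (i, k).
    for j, row in enumerate(N):
        for k, n_jk in enumerate(row):
            for i in range(rows_m):
                yield (i, k), ("N", j, n_jk)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Match M and N entries on the shared index j and sum the products."""
    m = {j: v for tag, j, v in values if tag == "M"}
    n = {j: v for tag, j, v in values if tag == "N"}
    return key, sum(m[j] * n[j] for j in m if j in n)


def mapreduce_matmul(M, N):
    """Run the simulated job and assemble the result matrix."""
    groups = shuffle(map_phase(M, N))
    P = [[0] * len(N[0]) for _ in M]
    for key, values in groups.items():
        (i, k), total = reduce_phase(key, values)
        P[i][k] = total
    return P
```

For example, mapreduce_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) returns [[19, 22], [43, 50]]. In a real cluster the map output is far larger than the input, which is the kind of communication cost that the complexity theory for MapReduce studies.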

Module II
Locality-Sensitive Hashing – shingling of documents, min-hashing. Distance measures, nearest neighbors, frequent itemsets. LSH families for distance measures. Applications of LSH. Challenges when sampling from massive data.
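
As an illustration of shingling and min-hashing, the sketch below builds character k-shingles and estimates Jaccard similarity from MinHash signatures. It is a minimal example under stated assumptions: the linear hash family (a*x + b) mod p, the use of zlib.crc32 to turn shingles into integers, and all function names are choices made here for illustration. Documents are assumed to contain at least k characters.

```python
import random
import zlib

PRIME = 2**61 - 1  # a Mersenne prime, used as the modulus of the hash family


def shingles(text, k=5):
    """Return the set of character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def make_hash_funcs(n, seed=0):
    """Draw n (a, b) coefficient pairs for h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, hash_funcs):
    """For each hash function, keep the minimum hash value over the set."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_funcs]


def estimate_similarity(sig1, sig2):
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Both documents must be hashed with the same list returned by make_hash_funcs. The fraction of agreeing positions approximates the Jaccard similarity of the shingle sets, and the approximation tightens as the number of hash functions grows.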

Module III
Mining data streams – stream model, stream data sampling, filtering streams (Bloom filters), counting distinct elements in a stream (Flajolet-Martin algorithm). Moment estimates – Alon-Matias-Szegedy algorithm, counting problems for streams, decaying windows.
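
Two of the stream techniques listed above, Bloom filters and Flajolet-Martin distinct counting, can be sketched as follows. This is a minimal illustration, not a tuned implementation: the SHA-256 based helper _hash, the one-byte-per-bit storage, and the plain median used to combine the Flajolet-Martin estimates are simplifying assumptions made here.

```python
import hashlib


def _hash(item, seed):
    """Deterministic 64-bit hash of an item; each seed acts as a separate hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        return [_hash(item, seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def trailing_zeros(x, width=64):
    """Number of trailing zero bits in x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def flajolet_martin(stream, num_hashes=32):
    """Estimate the number of distinct items as the median of 2**R over the hash functions,
    where R is the largest trailing-zero count seen for that hash function."""
    max_zeros = [0] * num_hashes
    for item in stream:
        for seed in range(num_hashes):
            zeros = trailing_zeros(_hash(item, seed))
            max_zeros[seed] = max(max_zeros[seed], zeros)
    estimates = sorted(2 ** r for r in max_zeros)
    return estimates[len(estimates) // 2]
```

Repeated items do not change the Flajolet-Martin estimate, because each hash function depends only on which distinct items have been seen. The estimate is always a power of two and should be read as an order-of-magnitude figure.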

Module IV
MapReduce and link analysis – PageRank iteration using MapReduce, topic-sensitive PageRank. On-line algorithms – greedy algorithms, matching problem, the AdWords problem (the Balance algorithm).
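
As an illustration of the PageRank iteration, the sketch below runs the update on a small in-memory graph, with comments marking which step a map task and which step a reduce task would perform. It is a single-machine sketch under stated assumptions: the teleport parameter beta = 0.85, the fixed iteration count, the even redistribution of rank from dead ends, and the function name pagerank are choices made here. Every link target is assumed to appear as a key of the links dictionary.

```python
def pagerank(links, beta=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # "Map" step: each page sends an equal share of its rank to its successors.
        contributions = {u: 0.0 for u in nodes}
        for u, outs in links.items():
            if outs:
                share = rank[u] / len(outs)
                for v in outs:
                    contributions[v] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                for v in nodes:
                    contributions[v] += rank[u] / n
        # "Reduce" step: sum the shares per page and add the teleport term.
        rank = {v: beta * contributions[v] + (1.0 - beta) / n for v in nodes}
    return rank
```

The ranks always sum to 1. Topic-sensitive PageRank differs only in the teleport term, which is distributed over a chosen set of pages instead of over all pages.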

Module V
Computational model for data mining – storage, cost model, and main memory bottleneck. Hash-based algorithms for mining association rules – improvements to A-Priori, Park-Chen-Yu (PCY) algorithm, multistage algorithm, approximate algorithm, limited-pass algorithms – simple randomized algorithm, Savasere-Omiecinski-Navathe (SON) algorithm, Toivonen's algorithm.
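
As an illustration of the Park-Chen-Yu improvement to A-Priori, the sketch below finds frequent pairs in two passes. The first pass counts single items and hashes every pair into a bucket; the second pass counts only pairs whose items are both frequent and whose bucket is frequent. The function name pcy_frequent_pairs, the bucket count of 101, and the CRC32-based bucket hash are assumptions made here for illustration. The baskets argument must be a list or another collection that can be iterated twice.

```python
import zlib
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(baskets, support, num_buckets=101):
    """Return {pair: count} for all item pairs appearing in at least `support` baskets."""

    def bucket(pair):
        # Deterministic hash of a pair into one of num_buckets buckets.
        return zlib.crc32(repr(pair).encode("utf-8")) % num_buckets

    # Pass 1: count single items and the number of pairs hashing to each bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # Between passes, the bucket counts shrink to a bitmap of frequent buckets.
    frequent_bucket = [c >= support for c in bucket_counts]

    # Pass 2: count only candidate pairs (frequent items and frequent bucket).
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if frequent_bucket[bucket(pair)]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= support}
```

A frequent pair always lands in a frequent bucket, so the bucket filter never discards a correct answer; it only reduces the number of candidate pairs that must be counted in main memory during the second pass.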

REFERENCES

1. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman; Mining of Massive Datasets, 2nd ed., Cambridge India, 2016
2. Charu C. Aggarwal; Data Streams: Models and Algorithms, 1st ed., Springer, 2007
3. Michael I. Jordan et al.; Frontiers in Massive Data Analysis, 1st ed., National Academies Press, 2013
4. Nathan Marz and James Warren; Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Manning Publications, 2015


Copyright © 2009-20 Department of Computer Science, CUSAT