




194750201 FOUNDATIONS OF DATA SCIENCE

Core/Elective: Core
Semester: 2
Credits: 4
Course Description 
While traditional areas of computer science remain highly
important, researchers of the future will increasingly be
involved with using computers to understand and extract
usable information from massive data arising in applications,
not just with how to make computers useful on specific
well-defined problems. This course introduces the statistics
and computer science concepts required to master data
science as a subject.
Course Objectives 
To introduce the mathematical foundations to deal with high-dimensional data
To introduce concepts like random graphs, random walks, and Markov chains
To understand the basic underpinnings of machine learning algorithms

Course Content 
Module I
High-dimensional space: Law of large numbers, geometry
of high dimensions, properties of the unit ball, Gaussians
in high dimension, random projection and the Johnson-Lindenstrauss
Lemma, separating Gaussians – Singular Value Decomposition:
Power method to compute SVD, singular vectors and
eigenvectors, applications of SVD
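As an illustration of the power method listed in Module I, the following is a minimal sketch rather than course material. It assumes NumPy, and the helper name `top_singular_triplet` and the example matrix are hypothetical. It estimates the leading singular value and singular vectors of a matrix by repeatedly applying A^T A to a random starting vector.

```python
import numpy as np

def top_singular_triplet(A, iters=500, seed=0):
    """Estimate the leading singular triplet (u, sigma, v) of A.

    Hypothetical helper: power iteration on A^T A, whose dominant
    eigenvector is the top right singular vector of A.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)          # one application of A^T A
        norm = np.linalg.norm(w)
        if norm == 0.0:            # A maps v to zero; nothing to amplify
            break
        v = w / norm
    sigma = np.linalg.norm(A @ v)  # singular value is the stretch along v
    u = A @ v / sigma if sigma > 0 else np.zeros(A.shape[0])
    return u, sigma, v

# Example matrix (an assumption for illustration): singular values 3 and 1.
A = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
u, sigma, v = top_singular_triplet(A)
```

Convergence is fast only when the top two singular values are well separated; when they are close, many more iterations may be needed.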
Module II
Random Graphs: G(n,p) model, phase transitions, giant
component, branching process, cycles and full connectivity
– Growth models of Random Graphs: Growth models with
and without preferential attachment, small-world graphs
Module III
Random walks and Markov chains: Stationary distribution,
MCMC, Gibbs sampling, areas and volumes, convergence
of random walks, random walks in Euclidean space, web
as a Markov chain
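The stationary distribution named in Module III can be illustrated with a short sketch. It assumes NumPy, a hypothetical two-state transition matrix `P`, and a chain that is irreducible and aperiodic, so that repeated multiplication by `P` converges. The function name `stationary_distribution` is also an assumption.

```python
import numpy as np

def stationary_distribution(P, iters=1000):
    """Approximate pi with pi = pi P by iterating from the uniform distribution.

    Hypothetical helper: P is a row-stochastic transition matrix.
    """
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P                # one step of the chain on distributions
    return pi

# Example chain (an assumption for illustration).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
```

For this example the balance condition 0.1 * pi[0] = 0.5 * pi[1] gives the distribution (5/6, 1/6), which the iteration approaches.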
Module IV
Learning and VC dimension: Linear Separators, the Perceptron
Algorithm, and Margins, Nonlinear Separators, Support
Vector Machines, and Kernels, Strong and Weak Learning
– Boosting – Vapnik-Chervonenkis dimension: Examples
of Set Systems, The Shatter Function, The VC Theorem,
Simple Learning
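The Perceptron Algorithm from Module IV can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not course material: it assumes NumPy, labels in {+1, -1}, a hypothetical toy data set that is linearly separable, and a constant 1 appended to each point so the separator need not pass through the origin.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Return a weight vector w with y_i * (w . x_i) > 0 if one is found.

    Hypothetical helper: on each mistake, add y_i * x_i to w.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified or on the boundary
                w += yi * xi
                mistakes += 1
        if mistakes == 0:            # a full pass with no updates: done
            break
    return w

# Toy data (an assumption); the last column is the constant bias feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-3.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

If the data are not linearly separable, the loop simply stops after `max_epochs` passes and the returned `w` may still misclassify some points.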
Module V
Algorithms for Massive Data Problems: Locality-Sensitive
Hashing: shingling of documents, minhashing, distance
measures, nearest neighbors, frequent itemsets, LSH
families for distance measures, applications of LSH
– Challenges when sampling from massive data: Frequency
Moments of Data Streams, Counting Frequent Elements,
Matrix Algorithms Using Sampling, Sketch of a Large
Matrix, Sketches of Documents
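Shingling and minhashing from Module V can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names, the character-level shingle length `k`, the example strings, and the use of random linear hash functions over CRC32 values as a stand-in for random permutations are all choices made for illustration. Inputs are assumed to produce non-empty shingle sets.

```python
import random
import zlib

def shingles(text, k=3):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=200, seed=0):
    """Minhash signature of a non-empty set of strings.

    Each hash h(x) = (a*x + b) mod p approximates a random permutation.
    """
    p = (1 << 61) - 1                      # a Mersenne prime
    rng = random.Random(seed)              # same seed -> same hash family
    params = [(rng.randrange(1, p), rng.randrange(0, p))
              for _ in range(num_hashes)]
    # CRC32 gives a stable integer per string across runs, unlike hash().
    values = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % p for x in values) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over a lazy dog")
true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

The estimate approaches the true Jaccard similarity as `num_hashes` grows; both documents must be hashed with the same seed so that their signatures are comparable.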

REFERENCES
1. Avrim Blum, John Hopcroft, Ravindran Kannan, Foundations of
Data Science, 2018
https://www.cs.cornell.edu/jeh/book.pdf
2. Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman,
Mining of Massive Datasets, 2e, Cambridge University Press,
2016
3. Charu C. Aggarwal, Data Streams: Models and Algorithms,
1e, Springer, 2007
4. Michael I. Jordan et al., Frontiers in Massive Data
Analysis, 1e, National Academies Press, 2013
5. Nathan Marz, James Warren, Big Data: Principles
and Best Practices of Scalable Realtime Data Systems,
Manning Publications, 2015











