Math 166 / CSCI 145 - Data Mining
(Spring 14)
Course Information
- Instructor : Blake Hunter
- Email : bhunter@cmc.edu
- Office : Adams 212
- Office Hours : TBA
- Lectures : MWF from 11:00 to 11:50, room TBA
- Syllabus : .pdf
- Information flyer: .pdf
- Course Website : Sakai
Textbooks
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Hastie, Tibshirani, and Friedman.
- Matlab: A Practical Introduction to Programming and Problem Solving, 3rd Ed., Stormy Attaway.
Course Description
- Data mining is the process of discovering patterns in big data using techniques from mathematics, computer science, and statistics, with applications ranging from biology and neuroscience to history and economics. The goal of the course is to teach students fundamental data mining techniques that are commonly used in practice. Students will also learn advanced data mining techniques, including linear classifiers, support vector machines, clustering, dimension reduction, transductive learning, and topic modeling. The course is designed to be applicable and fulfilling both for strongly motivated students who have taken Linear Algebra and for advanced mathematics or computer science majors.
- Prerequisite : MATH 60 (Linear Algebra), and either a proof-based course above MATH 100 or CSCI 62 (Data Structures and Advanced Programming); or consent of the instructor.
- Note : Familiarity with a computer programming language (C++, Java, Matlab, Python or R) or CSCI 51 (Intro to Computer Science) is suggested but not required.
Grading
- Homework - 20%
- Midterms (2) - 40%
- Final Project - 10%
- Final Exam - 30%
Tentative Weekly Schedule
- Overview; programming in Matlab (variables, logic, loops, functions, vectors, matrices, classes/objects, ...)
- Linear algebra review (vectors, matrices, eigenvalues/eigenvectors, norms, inner products, Matlab matrix operations, ...); mathematical proofs review (logic, direct proof, proof by contradiction, proof by contrapositive)
- Supervised learning, regression, linear classifiers
- Support vector machines (SVM)
- Kernel SVM, soft-margin SVM, Midterm 1
- Clustering and k-nearest neighbors
- k-means and spectral clustering
- Dimension reduction - singular value decomposition (SVD)
- Image processing, image features, object recognition
- Unsupervised/supervised image segmentation, Midterm 2
- Image classification, computer vision
- Principal component analysis (PCA)
- Topic modeling and non-negative matrix factorization
- More on topic modeling / natural language processing
- Makeup; (if time allows) transductive learning, diffusion, motion by mean curvature (MBO)
- Student presentations
- Final Exam
Final Projects
- Each student will work in a group of 3-5 students on a final project related to the material covered in the course. The goal of the final project is for each student to demonstrate their knowledge of the material by applying and extending the algorithms presented in class to an application of their choice (in consultation with the course instructor). Each group will submit a written proposal, a progress report, and a final report, and will deliver a 20-minute oral presentation of their results. Students are expected to meet regularly with the course instructor to discuss project development.
Final Project Topics
- Enron email - email text mining and time series analysis
- Yahoo Finance - stock and market analysis, clustering, visualization, auto-summary
- Hyperspectral Imaging - pixel classification and object detection
- Audio - voice command classification (OK Google / Siri)
- Enron Social Network - Email network, social network analysis
- MRI brain images - image noise statistical analysis and reduction
- Video - Scene detection and object recognition
- LAPD field interview cards - community detection
- Robot survey data - object classification (Xbox Kinect data)
- Google election polling data - trend detection and topic model fitting
- Data acquired from outside the course - from other classes, labs, the internet, or industry
Homework
- There will be 8 bi-weekly homework assignments. Each assignment has three parts:
- - a math component
- - a computational component
- - an advanced theory component, with a choice between a theoretical math option (requiring MATH 100+) and an advanced computer science option (requiring CSCI 62).