Sunday, November 15, 2015

Week 4

We are halfway through the structured part of the program, and the curriculum is getting more practical each week.  There is still lots of theory, but lectures now include more practical guidance on tuning.  Meanwhile, our daily exercises give us hands-on practice with the modeling tools we will be using as practicing data scientists.


Key topics were Random Forest, Boosting, SVM and Profit Curves.  Although one can learn these topics from the sklearn docs, we are being taught how to apply each algorithm effectively.  

I started working on an old Kaggle competition, Burn CPU Burn (https://inclass.kaggle.com/c/model-t4), to start applying what I have learned and to see how well I can do against the competition.  The data is very wide, so applying my new EDA skills was critical in narrowing the feature set to something more manageable.

An initial model with Ridge regression got me to 15th place.  Switching to Random Forest moved me up to 8th place.  I will keep working on this as I learn to use new algorithms and tuning techniques, and will likely return to feature engineering.

Time to start thinking about the capstone project...

Tuesday, November 10, 2015

Week 3


I thought this week would be less intense.  The core topics were linear and logistic regression.  After going through the Andrew Ng Machine Learning course on Coursera, I figured this would be basic stuff.

I was clearly mistaken.  It was not simply a matter of running regressions and calling it good.  Classical statistics offers a huge body of knowledge on evaluating the quality of a regression.  On top of that, we have been implementing machine learning models and techniques, such as regularization and gradient descent, from scratch.
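To give a flavor of what "from scratch" means, here is a minimal sketch of ridge (L2-regularized) linear regression fit by batch gradient descent.  This is my own illustration, not the course's exercise code: the function name, the synthetic data, the learning rate and the penalty strength alpha are all assumptions, and there is no intercept term to keep it short.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=1.0, lr=0.1, n_iters=5000):
    """Minimize (||Xw - y||^2 + alpha * ||w||^2) / n by batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        residuals = X @ w - y
        # Gradient of the penalized loss with respect to w
        grad = (2.0 / n_samples) * (X.T @ residuals + alpha * w)
        w -= lr * grad
    return w

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_gd = ridge_gradient_descent(X, y, alpha=1.0)

# Closed-form ridge solution for comparison: (X^T X + alpha * I)^-1 X^T y
w_closed = np.linalg.solve(X.T @ X + 1.0 * np.eye(3), X.T @ y)
print(w_gd, w_closed)
```

Comparing against the closed-form solution is a handy sanity check that the gradient is right.  On real data the learning rate usually needs tuning, and features should be scaled first.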

It definitely felt like we were drinking from a firehose this week.

Day in the life of a data science fellow

Every day starts at 8:30 with a 30-minute quiz.  It is typically a programming exercise that focuses on some topic from the last week.  A morning lecture follows for about an hour and a half.  Then, we go off to work on an individual programming exercise related to the lecture.  After lunch is another lecture followed by a pair programming exercise.  We are assigned a new partner each day.  The day ends when the pair exercise is complete or you have had enough.

The programming exercises are typically well designed to reinforce the daily lectures.  The scope of the exercises is generally quite ambitious, so it is not uncommon to have them spill into the lunch hour or into the evening.

This daily routine is broken up occasionally by assessments.  We had a two-hour assessment last week which consisted of programming and math problems.  I don't think most people were able to finish, but this is not like college.  The assessments are there to help you identify areas for improvement.

The tight feedback loop of lectures and exercises combined with regular quizzes and assessments has been very effective in reinforcing learning.  

Tuesday, November 3, 2015

Week 2

Week 2 felt like being back in a college classroom.  Calculus in lecture!  It's been a long time since I had done any calculus, but I was surprised how easy it was to follow the derivations of formulas from first principles.  Actually doing the calculus myself would take a little more ramp up, but at least it was not completely foreign.

As with programming, there are many students without a strong math background, so I don't know how the lectures make sense to them.  Certainly no spoonfeeding in this bootcamp.  You can get through it without a technical background, but you will get more out of it if you come in with strong technical chops.

Topics this week were probability, experiment design, classical statistics (e.g. frequentist hypothesis testing) and Bayesian statistics.  Coming from a physics background, I was never formally taught probability and statistics.  It was just a topic you were supposed to figure out, but having a good foundation would have made statistical mechanics so much less painful.

edX offers an MIT Probability course which is the best treatment I have found online:  https://courses.edx.org/courses/MITx/6.041x_1/1T2015/info.  Definitely watch the solutions videos, especially on counting.  I thought I had learned to count in kindergarten, but apparently I was mistaken.

In addition to Bayesian statistics, bootstrapping and multi-armed bandits were highlights.  Bootstrapping is nothing short of magic.  Bayesian statistics is much more satisfying than frequentist statistics, if for no other reason than not having to use phrases like, "do not fail to reject the null hypothesis".
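For anyone who has not seen the bootstrap, the idea is to resample your data with replacement many times, recompute the statistic on each resample, and read a confidence interval off the spread of those estimates.  Here is a minimal sketch of the percentile version using only the standard library.  The function name, the number of resamples and the made-up Gaussian sample are my own assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_resamples samples drawn with replacement
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo = estimates[int(tail * n_resamples)]
    hi = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lo, hi

# Made-up sample (assumed for illustration): 100 draws from a normal distribution
sample_rng = random.Random(42)
sample = [sample_rng.gauss(10.0, 2.0) for _ in range(100)]

low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The same function works for statistics with no tidy formula for the standard error: pass stat=statistics.median, for example.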

Multi-armed bandits for A/B testing make a lot of intuitive sense.  But after running some Monte Carlo simulations on different bandit algorithms, I found that performance better than chance is not guaranteed unless the differences in clickthrough rates between A and B are fairly large (e.g. 2X).
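Here is a rough sketch of the kind of Monte Carlo experiment I mean, using epsilon-greedy, one of the simplest bandit algorithms.  It is not the code from our exercise: the clickthrough rates, epsilon, the number of pulls and the number of runs are all assumed values, and the results will shift as you change them.  With two arms, "chance" corresponds to sending half of the traffic to the better arm.

```python
import random

def epsilon_greedy_share(rates, n_pulls=1000, epsilon=0.1, rng=None):
    """Simulate one epsilon-greedy run; return the fraction of pulls on the best arm."""
    if rng is None:
        rng = random.Random()
    n_arms = len(rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    best_arm = max(range(n_arms), key=lambda a: rates[a])
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the highest observed rate, breaking ties at random
            observed = [wins[a] / pulls[a] if pulls[a] else 0.0 for a in range(n_arms)]
            top = max(observed)
            arm = rng.choice([a for a in range(n_arms) if observed[a] == top])
        pulls[arm] += 1
        # Simulated click with the arm's true (hidden) clickthrough rate
        if rng.random() < rates[arm]:
            wins[arm] += 1
    return pulls[best_arm] / n_pulls

def monte_carlo(rates, n_runs=100, seed=0, **kwargs):
    """Average the best-arm share over many independent simulated runs."""
    rng = random.Random(seed)
    shares = [epsilon_greedy_share(rates, rng=rng, **kwargs) for _ in range(n_runs)]
    return sum(shares) / n_runs

# Assumed clickthrough rates: a small gap and a 2X gap
small_gap = monte_carlo([0.050, 0.055])
large_gap = monte_carlo([0.05, 0.10])
print(f"best-arm share, small gap: {small_gap:.2f}")
print(f"best-arm share, 2X gap:    {large_gap:.2f}")
```

Swapping in another algorithm (softmax, UCB1, Bayesian bandits) only means changing how the arm is chosen inside the loop, which makes it easy to compare them on the same rates.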

Along with the stats, the hands-on exercises are making us all stronger in python, numpy, pandas and matplotlib.  Feels good to be able to start wielding the data scientist toolkit with more confidence.