I am a firm believer in learning by doing.

One of the most commonly studied problems in ML is the Titanic dataset on Kaggle; it is kind of the "Hello World" of the field. I replicated many of the kernels on Kaggle, and one idea caught my attention: what if, rather than predicting who survived, I could predict employee responses in a survey?

I work on surveys a lot, and one of the newer approaches is not to report an average score but NPS – Net Promoter Score. Essentially, you ask one question, usually "How likely is it that you would recommend our company/product/service to a friend or colleague?", on a scale of 0-10, where 0 is not likely at all and 10 is extremely likely.

Respondents scoring 0-6 are detractors, 7-8 are passives, and 9-10 are promoters. The NPS score is then % of promoters – % of detractors, with a range of -100 to +100.
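The bucketing and score above can be sketched in a few lines of Python (the example scores are made up for illustration):

```python
# Classify a 0-10 score into an NPS bucket.
def nps_bucket(score):
    if score <= 6:
        return "detractor"
    if score <= 8:
        return "passive"
    return "promoter"

# NPS = % promoters - % detractors, on a -100..+100 scale.
def nps(scores):
    buckets = [nps_bucket(s) for s in scores]
    promoters = buckets.count("promoter") / len(buckets)
    detractors = buckets.count("detractor") / len(buckets)
    return 100 * (promoters - detractors)

# Made-up responses: 2 promoters, 1 passive, 2 detractors out of 5.
print(nps([10, 9, 7, 3, 6]))  # -> 0.0
```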

Can I predict from current data whether an employee will be a detractor or not?

Being a noob, I ran into many problems:

  1. The survey data was spread across multiple Excel files; can I combine them into one?
  2. The date of joining of an employee is a date; can I convert it to days in the organization?
  3. How can I handle categorical data, e.g., male and female? What about 10 business units, multiple COEs, projects, managers, and even locations?
  4. How can I split the data into training and test sets?
  5. How can I do feature extraction and normalization?
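For the record, here is a rough sketch of how those steps can look with pandas and scikit-learn. All column names (`doj`, `gender`, `bu`, `is_detractor`) are made up for illustration, and two tiny in-memory frames stand in for the real Excel files (with real files you would use `pd.read_excel` on each and concatenate the results):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Combine multiple files into one DataFrame. With real files:
#      frames = [pd.read_excel(f) for f in ["survey_bu1.xlsx", "survey_bu2.xlsx"]]
frames = [
    pd.DataFrame({"doj": ["2015-01-10", "2016-03-05"],
                  "gender": ["M", "F"],
                  "bu": ["BU1", "BU2"],
                  "is_detractor": [1, 0]}),
    pd.DataFrame({"doj": ["2014-07-20", "2017-02-14"],
                  "gender": ["F", "M"],
                  "bu": ["BU3", "BU1"],
                  "is_detractor": [0, 1]}),
]
df = pd.concat(frames, ignore_index=True)

# 2. Convert date of joining into tenure in days.
df["doj"] = pd.to_datetime(df["doj"])
df["days_in_org"] = (pd.Timestamp("2018-01-01") - df["doj"]).dt.days

# 3. One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["gender", "bu"])

# 4. Split into training and test sets.
X = df.drop(columns=["doj", "is_detractor"])
y = df["is_detractor"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 5. Normalize features: fit the scaler on training data only,
#    then apply the same transform to the test data.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)
```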

Well, I know how to do all of this in Excel, but Python is another matter, especially with limited knowledge. And then two more questions:

  1. Which classifier models will work best?
  2. How can I tune them?

So I went ahead and started visualizing the data as done in the Titanic kernels. I used fields from the survey response like date of joining, business unit, COE, project, and the other question responses.

Lo and behold, I had a model ready with almost 90% accuracy. But I had made a major mistake, which I discovered when I tried to run it on test data.

This is one mistake that I will remember all my life. If you look closely, I had taken the other question responses as features. What? The data I actually want to predict on will never have the other survey question responses.

Let me explain.

An employee is asked the "Likely to Recommend" question along with some six other questions, like "Are your inputs appreciated?" and "Do you see opportunities to upskill?". The mistake I made was to treat those answers as features. The data I want to predict on will never have them.

So I went back and ran again. The accuracy dropped.

My code had two major inspirations:

  1. https://www.kaggle.com/sachinkulkarni/titanic/an-interactive-data-science-tutorial
  2. Data School – Scikit Learn class

I then manually compared the results with the predictions of 7 models and found two to be best: RandomForestClassifier and KNeighborsClassifier. I then used grid search to find the optimal hyperparameters for KNN.
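For reference, a grid search over KNN hyperparameters looks roughly like this. Synthetic data stands in for the private survey data, and the parameter grid is my own guess, not necessarily the one used here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (private) survey features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Search over the number of neighbours and the weighting scheme,
# scoring each combination with 5-fold cross-validation.
param_grid = {"n_neighbors": list(range(1, 31)),
              "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                    scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```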

Result – with 50% of employees not responding to a survey, I can predict with some accuracy whether they will be a detractor or not.

I used one technique that made sense to me: hold-out data. This is a small part of the data, set aside much like validation data, which I used to check the accuracy of the model even after validation. But when I optimized KNN, I went no holds barred. Let's see if this approach is right or not.
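The hold-out idea can be sketched as two stages: carve off a hold-out set first, tune and validate on the rest, and score on the hold-out only at the very end. Synthetic data again stands in for the survey:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Set aside a hold-out set that is never touched during model selection.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune / validate only on the remaining data (cross-validation here).
model = KNeighborsClassifier(n_neighbors=5)
cv_acc = cross_val_score(model, X_rest, y_rest, cv=5).mean()

# Final sanity check on the untouched hold-out data.
model.fit(X_rest, y_rest)
hold_acc = model.score(X_hold, y_hold)
print(round(cv_acc, 3), round(hold_acc, 3))
```

If the hold-out accuracy is far below the cross-validation accuracy, the tuning has likely overfitted the validation data.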

Lessons learnt

  1. You cannot learn everything just by replicating a published dataset and its results; you have to work on data that is your own. While the concepts are the same, live data is much harder to understand and manipulate. You have to get your hands dirty.
  2. The prediction accuracy on live data is much lower than published results, and that's OK. This is a start; learn more and apply the knowledge over and over again. "Sharpen the saw every day."
  3. Using more than one model is now the norm; it's known as ensemble modelling. I will need to learn how to use it.

Next steps

  1. Try to use XGBoost and Deep learning to see if the accuracy can be improved.
  2. I came across an excellent YouTube series – can I improve the models based on his interpretation? His data preprocessing method is fantastic. Brilliant.
  3. I have used label encoding, i.e., converted BUs to the numbers 0 to 10, but that implies one BU is closer to another (e.g., 3 is closer to 4 than 0 is to 10), which is logically wrong. One alternative is pandas get_dummies; I will try to use that instead of label encoding.
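A small illustration of that difference, with made-up BU names: label encoding imposes an artificial ordering on the BUs, while get_dummies creates one independent 0/1 column per BU.

```python
import pandas as pd

df = pd.DataFrame({"bu": ["Retail", "Banking", "Insurance", "Retail"]})

# Label encoding: each BU becomes an integer (alphabetical categories),
# which falsely implies some BUs are "closer" to each other than others.
df["bu_label"] = df["bu"].astype("category").cat.codes
print(df["bu_label"].tolist())  # -> [2, 0, 1, 2]

# One-hot encoding: one independent 0/1 column per BU, no ordering.
dummies = pd.get_dummies(df["bu"], prefix="bu")
print(list(dummies.columns))  # -> ['bu_Banking', 'bu_Insurance', 'bu_Retail']
```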

Note: The data is private, and I cannot share it, but the code is attached if someone wants to see it. It does not include the data visualization and data consolidation, which contain some analysis that I cannot share. Please leave a message if you want to see it, and I'll sanitize and share.

Final Code

PS: Don’t judge me on the code quality, I am still learning.