Shrinkage Methods in Linear Regression

  Have you ever asked, “Why is linear regression giving me such good accuracy on the training set but low accuracy on the test set, in spite of adding all the available features to the model?” The question seems inexplicable to many people but is answered by a concept called...

Data Science Engineer – Who, What, & Why?

  Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies. Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new...

Friend Follower Analysis using Apache Spark GraphX’s PageRank algorithm

GraphX is Apache Spark’s API for graphs and graph-parallel computation. This includes transformation, exploration, and graph computation. Data can be viewed both as a graph and as collections. This use case discusses friend follower analysis using Apache Spark GraphX’s PageRank operator. PageRank measures the importance of each vertex in a graph, by determining which vertices have the...
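
To make the idea of vertex importance concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is not GraphX's operator, which may differ in normalization and convergence handling. The `pagerank` function, the damping factor of 0.85, the iteration count, and the `follows` edge list are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over (follower, followed) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex passes its rank, in equal shares, to the vertices it follows.
        for src in vertices:
            for dst in out_links[src]:
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        ranks = new_ranks
    return ranks


# Hypothetical follower graph: an edge (a, b) means "a follows b".
follows = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "alice"),
]
ranks = pagerank(follows)
most_important = max(ranks, key=ranks.get)
print(most_important, ranks)
```

In this toy graph the account followed by the most users ends up with the highest rank, and the account it follows inherits much of that importance.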

Data Science – Let the Data Sing

  The hype is real. But let’s get past it. What exactly is Data Science? And why is it the next big thing? Massive amounts of data are being generated every second. The total amount of data in the world is 4.4 zettabytes. And this is not just the internet data. We are talking...

Feature selection using Decision Tree

  One of the key differentiators in any data science problem is the quality of feature selection and importance. When we have a lot of data available to be used by our model, feature selection becomes unavoidable, both because of computational constraints and because eliminating noisy variables leads to better prediction. Also,...

Hyperparameter Optimization and Why is it important?

  A machine learning model consists of various parameters that need to be learned from the data. The crux of machine learning is fitting a model to the data. This process of fitting the model parameters to existing data is called model training. Hyperparameters refer to another kind of parameter...

How to avoid overfitting while training?

Overfitting happens mostly because the model becomes too complex. Such a model generalizes poorly, as it memorizes the noise in the training data. A model is usually fit by achieving the highest accuracy on the training data set. However, its effectiveness is judged by its performance on test data. Overfitting occurs...
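
The gap between training and test accuracy can be illustrated with a small sketch in plain Python. Everything here is an assumption made for illustration: the synthetic data, the 20% label noise, and the two models. The "memorizer" is a 1-nearest-neighbour lookup that copies training labels, noise included, while `simple_rule` stands in for a low-complexity model.

```python
import random

random.seed(0)


def make_data(n, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # inject label noise
        data.append((x, label))
    return data


def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == y for x, y in data) / len(data)


train, test = make_data(200), make_data(200)


def memorizer(x):
    # Overly flexible model: return the label of the closest training point.
    return min(train, key=lambda point: abs(point[0] - x))[1]


def simple_rule(x):
    # Low-complexity model: a single threshold, ignoring individual noisy labels.
    return x > 0.5


train_acc = accuracy(memorizer, train)
test_acc = accuracy(memorizer, test)
print("memorizer   train:", train_acc, "test:", test_acc)
print("simple rule train:", accuracy(simple_rule, train),
      "test:", accuracy(simple_rule, test))
```

The memorizer scores perfectly on the data it was fit to because each training point is its own nearest neighbour, yet it does worse on held-out data, where the memorized noise no longer matches. The simple rule typically shows a much smaller gap between the two sets.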