Posts

Model Drift in Machine Learning

We expect a model to perform in production as well as it did on the training data. However, if the distribution of the production data differs from that of the training data, model drift may result. Model drift refers to the decay of a model’s predictive power. It occurs when the training data is poorly sampled or when the underlying business context changes.

Why is it important to monitor model drift?
Monitoring performance for model drift is necessary to keep predictions accurate and to check whether retraining is required.

Kinds of drift
Data drift (feature drift): a change in the distribution of the input features.
Target drift: a change in the distribution of the target variable.
Concept drift: a change in the pattern or relationship between the predictors and the outcome.

Data drift: This is also known as feature drift, population drift or covariate drift. Data drift is observed when there is a change in the distribution of features in
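
As a rough illustration of how a shift in a feature's distribution might be detected, the sketch below compares a training sample with a production sample using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The excerpt does not name a detection method, so the choice of statistic, the function names, the toy data and the 0.3 threshold are all assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds an (arbitrary) threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical values of one feature (customer age) at training time
# and in two later production windows.
training_ages = [22, 25, 27, 30, 31, 35, 38, 41, 45, 50]
production_similar = [23, 26, 28, 29, 33, 36, 37, 42, 44, 49]
production_shifted = [48, 52, 55, 57, 60, 61, 63, 66, 70, 72]

print(has_drifted(training_ages, production_similar))
print(has_drifted(training_ages, production_shifted))
```

In practice a fixed threshold is a blunt instrument; a statistical test with a p-value (for example `scipy.stats.ks_2samp`) run per feature on a schedule is a common alternative.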

Model Cards for Model Reporting

What are model cards?
Machine learning models today have a lot of potential, so knowing a model’s usage and limitations is crucial. Model cards aim to provide that information in a holistic and comprehensive way. They are short records of various aspects of an ML model.

Why do we need model cards?
Lack of documentation: Without documentation, users might not be aware of a model’s valid uses. This can have a serious impact in areas such as healthcare, law and order, and employment. For example, if a healthcare model is trained on data from a specific geography (say, India), it might not be applicable to the American population, so the model should not be used in the treatment of American patients. Important information, such as the kind of data used to train the model, is also often missing.
Lack of transparency: Machine learning models often do not state their ethical considerations clearly. This can also lead to systematic biase

Feature Selection Approach

What is feature selection?
In machine learning, feature selection refers to the process of selecting the features that contribute most to the output, or dependent variable. It reduces the number of features in the dataset and improves the model’s performance by reducing the computational effort.

Why is feature selection required?
Overfit: Having too many features tends to decrease prediction accuracy and lead to overfitting. Overfitting occurs when the model fits the training data too closely but fails to perform on new data.
Redundancy: Having too many features can lead to redundancy, as some features might provide information that is already available through other features.
Cost: Having too many features also increases the cost of training and deploying the model.
Interpretability: It is easier to interpret and explain a model with fewer features, and to understand its value, than one which has more fea
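
To make the idea of keeping "the features which contribute most to the output" concrete, here is a minimal sketch of one simple filter-style approach: rank each feature by the absolute Pearson correlation with the target and keep the top k. This is only one of many possible techniques and is not necessarily the one the post goes on to describe. The feature names, the toy data and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_top_k(features, target, k):
    """Return the k feature names most correlated with the target, plus all scores."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy dataset: predicting a house price.
features = {
    "area_sqft": [500, 750, 1000, 1250, 1500, 1750],
    "num_rooms": [1, 2, 2, 3, 4, 4],
    "house_id": [3, 1, 6, 2, 5, 4],   # arbitrary identifier, weakly related
    "constant": [1, 1, 1, 1, 1, 1],   # no variation at all
}
price = [50, 74, 101, 126, 149, 176]

selected, scores = select_top_k(features, price, k=2)
print(selected)
```

A correlation filter only captures linear, one-feature-at-a-time relationships, and it does not address redundancy: `area_sqft` and `num_rooms` are both kept here even though they largely carry the same information.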

Case Study on Ride-Hailing App

Case Study: This is an open-ended case study about the reliability of a ride-hailing service having decreased in a particular area. I had to suggest the data variables I would analyze to solve the problem.

Interviewer: The reliability of Ola cars has decreased in a particular area. What data parameters will you consider in analyzing the problem?
Priyanka: I asked some clarifying questions to get more insight into the problem. How exactly do you measure reliability?
Interviewer: Reliability would be a measure of rides getting requested vs. requests getting accepted, i.e., out of the requests accepted, how many are actually completed.
Priyanka: OK. Since the platform has both a driver side and a rider side, do you want me to focus on a specific one?
Interviewer: You are free to choose.
Priyanka: OK, I would like to think about the data variables that a driver considers while accepting a ride:
Distance/length of trip: Drivers won’t prefer short rides as they might

Evolving ways of Data File Format

Data file formats define standard ways of storing information in a file or database. We require different file formats for different use cases. For example, if we know that only Python systems are going to read our file, we can choose the Pickle format, as it is highly optimized. CSV has been the most widely used option for data storage. Using CSV, we can read from and write to most data software. However, there is no schema attached and no standard way to handle control characters, and it is not the best way to deal with complex data. In this article, we will discuss how data file formats have evolved:

Parquet: Parquet is one of the most common data storage formats for Big Data, as its columnar layout makes it very fast to read. It also understands most of the data types used by Pandas, including multi-index data frames. It is optimized to work with complex data in bulk and offers several efficient compression and encoding schemes. It is mostly used as a data warehouse or data lake storage
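
The point above about CSV having no schema attached can be seen with a small round-trip experiment using only the Python standard library. The records and function names below are made up for illustration. After a CSV round trip every value comes back as a string, while Pickle restores the original Python types.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical records with an int, a float and a date column.
rows = [
    {"id": 1, "price": 9.99, "sold_on": date(2023, 1, 15)},
    {"id": 2, "price": 24.5, "sold_on": date(2023, 2, 1)},
]


def csv_round_trip(records):
    """Write records to an in-memory CSV and read them back."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


def pickle_round_trip(records):
    """Serialize records with pickle and load them back."""
    return pickle.loads(pickle.dumps(records))


from_csv = csv_round_trip(rows)
from_pickle = pickle_round_trip(rows)

print(type(from_csv[0]["price"]).__name__)     # type after the CSV round trip
print(type(from_pickle[0]["price"]).__name__)  # type after the pickle round trip
```

Pickle's convenience comes with a caveat: it is Python-specific, and unpickling data from an untrusted source is unsafe, so it suits the "only Python systems will read this" case mentioned above.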