Unit - 4
Development of ML Model
Classification is the process of identifying a function that aids in the classification of a dataset based on several factors. A computer programme is trained on the training dataset and then categorises the data into distinct classes based on that training.
The classification algorithm's goal is to identify the mapping function that maps the input (x) to a discrete output (y).
Email spam detection is a good example for understanding the classification problem. The model is trained on millions of emails using various parameters, and when it receives a new email it determines whether or not it is spam. If the email is spam, it is moved to the Spam folder.
Classification is a supervised approach to learning a target class function that maps each attribute set to one of the predetermined class labels. In other words, classification is a type of predictive modelling that predicts a target class from a set of input data.
There are many different forms of classification issues, including:
● Binary Classification
● Multi-class Classification
● Multi-label Classification
Binary classification
Binary classification is a supervised classification problem in which the target class label has two classes and the goal is to predict one of them. The task usually involves one class representing a normal state and the other an abnormal state.
Fig 1: Binary classification
Example Problems:
● Spam Detection: The goal of the spam detection problem is to determine whether or not the input mail/message is spam. 'Not spam' is the normal state in this problem, while 'spam' is the abnormal state.
● Cancer Detection: The goal of the cancer detection problem is to determine whether or not the candidate has cancer. 'No cancer' is the normal state in this problem, while 'cancer' is the abnormal state.
Multi-class classification
Multi-class classification, also known as multinomial classification, is a classification task with more than two class labels. Unlike binary classification, there is no concept of normal and abnormal states; the classifier predicts that the object belongs to exactly one of several known classes.
For a multi-class classification problem, two approaches have been proposed:
One-vs-Rest - For N classes, N classifier models are fitted. The final output will be determined by the class with the highest prediction probability.
One-vs-One - One classifier model is fitted for each pair of classes, giving N*(N-1)/2 models in total; the final class is usually chosen by majority voting.
Fig 2: Multi class classification
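To make the two strategies concrete, the short sketch below (an illustration using scikit-learn on synthetic data, not part of the original text) fits both a One-vs-Rest and a One-vs-One wrapper around a logistic regression classifier:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic three-class dataset for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs-Rest: fits N binary classifiers, one per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# One-vs-One: fits N*(N-1)/2 binary classifiers, one per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("One-vs-Rest accuracy:", ovr.score(X_test, y_test))
print("One-vs-One accuracy:", ovo.score(X_test, y_test))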
Multi-label Classification
Multi-label classification is a classification problem in which each example can be assigned more than one target class label. Unlike binary and multi-class classification, where exactly one class is predicted, a multi-label classification problem can predict several classes as output.
Fig 3: Multi label classification
Key takeaway
Classification is the process of identifying a function that aids in the classification of a dataset based on several factors. A computer programme is trained on the training dataset and then categorises the data into distinct classes based on that training
Clustering is the process of grouping data. Based on some similarity measure, the resulting groups should be such that data within the same group are similar to each other and data in different groups are dissimilar. A good clustering algorithm creates groups that maximise the similarity within each group while minimising the similarity between different groups.
Application of clustering
Classes are conceptually meaningful groups of objects with shared characteristics. Data described in terms of classes gives a better picture than raw data.
An image, for example, may be described as "a picture of houses, vehicles, and people" or by specifying its pixel values. The first approach is more useful for finding out what the picture is about.
Clustering is also seen as a prelude to more sophisticated data processing techniques.
As an example,
● A music streaming service could divide users into groups based on their musical preferences. Instead of estimating music recommendations for each individual user, recommendations can be computed for each group. It is also simpler for a new customer to find out who their closest neighbours are.
● In linear regression, training time grows roughly with n², where n is the number of samples. In large datasets, nearby data points can therefore be grouped together and replaced by a single representative point, and regression can then be run on this smaller dataset.
● There may be millions of colours in a single image. To make the colour palette smaller, similar RGB values could be grouped together.
Types of groups of clusters
Cluster classes are classified into four categories:
- Partitional - A set of non-overlapping clusters; each data point belongs to exactly one cluster.
- Hierarchical - Clusters have sub-clusters in a hierarchical system.
- Overlapping - A data point may belong to multiple clusters, which means the clusters overlap.
- Fuzzy - Each data point belongs to every cluster with an associated weight. If there are five clusters, each data point belongs to all five clusters with five weights, and the sum of these weights is 1.
Types of clusters
Depending on the issue at hand, clusters may be of various forms. Clusters can be anything from
- Well separated clusters
Each data point in a well-separated cluster is closer to all points within the cluster than to any point outside of it. When clusters are well apart, this is more likely to occur.
- Prototype based clusters
In prototype-based clustering, each cluster is represented by a prototype (a representative data point). Data points are assigned to clusters so that each point is closer to its own cluster's prototype than to any other cluster's prototype.
- Graph based clusters
If the data is represented as a graph with nodes as data points and edges as connections, a cluster is a group of connected data points that have no connection to any point outside the cluster. Depending on the problem, "connected" can mean various things.
- Density based clusters
A density-based cluster is a dense set of data points surrounded by a low-density area.
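As a concrete illustration of prototype-based clustering, the sketch below (synthetic data, assumed purely for illustration) groups points into three clusters with k-means, where each cluster is represented by its centroid:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Cluster prototypes (centroids):\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", kmeans.labels_[:10])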
Key takeaway
Clustering is the process of grouping data. Based on some similarity measure, the resulting groups should be such that data within the same group are similar and data in different groups are dissimilar.
A good clustering algorithm creates groups that maximise the similarity within each group while minimising the similarity between different groups.
The technique of discovering correlations between dependent and independent variables is known as regression. It aids in the prediction of continuous variables such as market trends, house values, and so forth.
The regression algorithm's goal is to identify the mapping function that maps the input variable (x) to the continuous output variable (y).
For example, let's say we want to forecast the weather, so we'll apply the Regression approach. When it comes to weather prediction, the model is trained on historical data, and after it is finished, it can accurately predict the weather for future days.
We employ many types of algorithms in Machine Learning to allow machines to learn the relationships within the data and generate predictions based on patterns or rules discovered in the dataset. As a result, regression is a machine learning technique in which the model predicts the outcome as a continuous numerical value.
Regression analysis is a technique for determining the relationship between a single dependent variable (the target variable) and numerous independent variables. It is commonly used in finance, investment, and other fields. Typical regression problems include estimating the price of a property, the stock market, or an employee's compensation.
Need for Regression techniques
The benefits of regression analysis and regression-based forecasting can help a small business, and indeed any business, gain a better understanding of the variables (or factors) that can affect its success in the coming weeks, months, and years.
Data are vital figures that characterise a company's whole operation. Regression analysis aids in the examination of data quantities and assists large corporations and enterprises in making better judgments. Regression forecasting is the process of examining the correlations between data points in order to predict the future.
Various algorithm
1. Linear Regression
2. Decision Tree
3. Support Vector Regression
4. Lasso Regression
5. Random Forest
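As a minimal sketch of the first algorithm in the list, the example below fits a linear regression on synthetic data (illustrative only) and predicts a continuous output for unseen inputs:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for, e.g., house prices
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

print("Predicted continuous values:", model.predict(X_test[:5]))
print("R^2 on the test set:", model.score(X_test, y_test))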
Key takeaway
The technique of discovering correlations between dependent and independent variables is known as regression. It aids in the prediction of continuous variables such as market trends, house values, and so forth.
The application of machine learning, often supervised, semi-supervised, or reinforcement learning, in the creation of ranking models for information retrieval systems is known as learning to rank or machine-learned ranking (MLR). The training data is made up of lists of items with a partial order defined between each list's contents. This order is usually established by assigning each item a numerical or ordinal score or a binary judgement (e.g., "relevant" or "not relevant"). The ranking model's goal is to rank, i.e. to produce a permutation of items in new, unseen lists in a manner similar to the training data's ranks.
Learning-to-rank is a machine learning framework that seeks to order items based on their preference, relevance, or rating. Approaches span several families of machine learning methods, including regression-based, SVM-based, neural network-based, evolutionary, and boosting techniques, each with its own characteristics.
Learning-to-rank algorithms can also be used in conjunction with other machine learning paradigms such as semi-supervised learning, active learning, reinforcement learning, and deep learning, and can be combined with parallel or big-data analytics for computational and storage benefits. Many real-time applications use learning-to-rank for preference learning.
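One simple way to realise this in code is the pointwise approach sketched below: a regressor is trained on graded relevance labels, and candidate documents are then ranked by their predicted scores. This is only one of the approaches mentioned above, and the features and labels here are synthetic placeholders:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row is a (query, document) feature vector; y is a graded relevance label
rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)
y_train = rng.randint(0, 4, size=100)   # 0 = not relevant ... 3 = highly relevant

ranker = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Rank the candidate documents of a new query by predicted relevance
X_candidates = rng.rand(10, 5)
scores = ranker.predict(X_candidates)
ranking = np.argsort(scores)[::-1]       # indices from most to least relevant
print("Ranked document indices:", ranking)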
7 steps in ML modeling
Step 1: Collect Data
Given the problem you wish to tackle, you'll need to conduct research and gather data to feed your machine. The quality and quantity of information you obtain are critical since they will have a direct impact on how well or poorly your model performs. You might have the data in an existing database or you'll have to start from scratch. If the project is small, you can construct a spreadsheet that can be simply exported as a CSV file later. Web scraping is also commonly used to automatically collect data from multiple sources, such as APIs.
Step 2: Prepare the data
Now is a good time to visualise your data and see whether there are any relationships between the various features. Choosing features is important because the ones you select will directly affect execution times and results. If necessary, PCA can also be used to reduce the number of dimensions.
Furthermore, you must balance the amount of data you have for each outcome class; otherwise, learning may be biased towards one type of answer and your model will fail when attempting to generalise.
You must also divide the data into two groups: one for training and the other for model evaluation, which can be done roughly in an 80/20 ratio, but this can vary based on the scenario and the amount of data available.
You can also pre-process your data at this point by normalising it, removing duplicates, and correcting errors.
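A small sketch of this preparation step is shown below (assuming the iris dataset purely for illustration): the data is split roughly 80/20, the features are normalised, and PCA optionally reduces the number of dimensions. The scaler and PCA are fitted on the training data only, so no information from the evaluation set leaks into training:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)      # normalise using training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

pca = PCA(n_components=2).fit(X_train_s)    # reduce dimensions if necessary
X_train_p = pca.transform(X_train_s)
X_test_p = pca.transform(X_test_s)
print(X_train_p.shape, X_test_p.shape)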
Step 3: Choose the model
You can use methods for classification, prediction, linear regression, clustering (e.g., k-means or k-nearest neighbours), deep learning (e.g., neural networks), Bayesian methods, and so on, depending on your goal.
Depending on the data you'll be processing, such as photos, sound, text, and numerical values, you can employ a variety of models. We'll look at some models and their applications in the table below, which you can use in your projects:
Model | Applications |
Linear Regression | Price prediction |
Fully connected networks | Classification |
Convolutional Neural Networks | Image processing |
Recurrent Neural Networks | Voice recognition |
Random Forest | Fraud Detection |
Reinforcement Learning | Learning by trial and error |
Generative Models | Image creation |
K-means | Segmentation |
k-Nearest Neighbors | Recommendation systems |
Bayesian Classifiers | Spam and noise filtering |
Step 4: Train your model
You'll need to train the model on your dataset, iterating until you see incremental improvements in the prediction rate. Remember to randomly initialise the weights of your model (weights are the numbers that multiply or alter the relationships between the inputs and outputs); the selected algorithm will adjust them as training proceeds.
Step 5: Evaluation
You'll need to check your trained model against the evaluation data set, which contains inputs the model has not seen, and verify its accuracy. If the accuracy is 50% or lower, the model is useless, because making decisions with it would be like flipping a coin. If you achieve a score of 90% or higher, you can be more confident in the model's predictions.
Step 6: Parameter Tuning
If you didn't get accurate predictions during the evaluation and your precision isn't as high as you'd want, you may have overfitting or underfitting issues, and you'll need to go back to the training step before changing your model's parameters.
You can increase the number of epochs used to iterate your training data. The "learning rate," which is usually a variable that multiplies the gradient to gradually get it closer to the global -or local- minimum to reduce the function's cost, is another significant parameter.
Changing the learning rate by 0.1 has a very different effect when the current value is 0.1 than when it is 0.001, and such changes can have a major impact on the model's execution time. You can also choose the maximum error that your model is permitted to have. Training your machine might take anywhere from a few minutes to hours, or even days. These parameters are referred to as hyperparameters. This "tuning" is currently more of an art than a science, but it will improve with practice.
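To illustrate why the starting value matters, the toy example below (not from the text) minimises the simple cost function f(w) = w**2 with gradient descent; the same 20 epochs behave very differently for different learning rates:

def gradient_descent(learning_rate, epochs=20, w=5.0):
    """Minimise f(w) = w**2 with plain gradient descent."""
    for _ in range(epochs):
        grad = 2 * w                     # derivative of w**2
        w = w - learning_rate * grad     # the learning rate scales the gradient step
    return w

for lr in (0.001, 0.01, 0.1):
    print(f"learning_rate={lr}: w after 20 epochs = {gradient_descent(lr):.4f}")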
There are generally many settings to tweak, and their combinations can interact in complex ways. Each algorithm has its own set of parameters that must be tuned. In artificial neural networks (ANNs), for example, you must specify the number of hidden layers in the architecture and gradually experiment with more or fewer neurons in each layer. Achieving good results requires effort and patience.
Step 7: Prediction or Inference
You may now use your Machine Learning model to infer outcomes in real-world settings.
The first step in the machine learning pipeline is to collect data for training the ML model. The accuracy of ML systems' predictions is only as good as the data used to train them. Some of the issues that can emerge during data collection are as follows:
● Inaccurate data - It's possible that the information gathered has nothing to do with the problem description.
● Missing data - It's possible that some sub-data is missing. For some types of predictions, this could take the shape of missing images or empty values in columns.
● Data imbalance - Some data groups or categories may have an unusually large or small number of associated samples. As a result, these groups can end up underrepresented in the model.
● Data bias - The algorithm could transmit inherent biases on gender, politics, age, or geography, for example, depending on how the data, individuals, and labels are chosen. It's challenging to spot and eliminate data bias.
To overcome these issues, a variety of strategies can be used:
● Pre-cleaned, freely available datasets - Take advantage of existing, open-source expertise if the issue statement (for example, image classification, object recognition) corresponds with a clean, pre-existing, correctly formed dataset.
● Web crawling and scraping - Websites can be crawled and scraped for data using automated tools, bots, and headless browsers.
● Private data - Engineers that specialise in machine learning can generate their own data. When the amount of data needed to train the model is little and the issue statement is too particular to generalise across an open-source dataset, this is useful.
● Custom data - For a price, agencies can develop or crowdsource data.
Raw data and photos from the real world are frequently incomplete, unreliable, and lacking in specific behaviours or trends. They're also likely to be riddled with errors. As a result, when they've been gathered, they're pre-processed into a format that the machine learning algorithm can utilize to build the model.
Pre-processing entails a variety of processes and actions, including:
● Data cleaning - These manual and automatic procedures remove data that has been erroneously added or categorised.
● Data imputations - Most machine learning frameworks provide methods and APIs for filling in missing data. Common techniques include imputing missing values with the mean, median, or mode of the data in the given field, or with estimates from k-nearest neighbours (k-NN).
● Oversampling - Bias or imbalances in the dataset can be rectified by using methods like repetition, bootstrapping, or the Synthetic Minority Over-Sampling Technique (SMOTE) to generate more observations/samples, which can then be added to the under-represented classes.
● Data integration - Incompleteness in a single dataset can be overcome by combining many datasets to create a big corpus.
● Data normalization - The magnitude of the values in a dataset affects the memory and processing required for each training iteration. Normalisation rescales the data to a common range, lowering the order of magnitude of the values; a small sketch of imputation and normalisation follows this list.
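The sketch below illustrates two of the steps above, data imputation and data normalisation, on a made-up feature matrix with missing values:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # missing value to be imputed
              [3.0, 600.0],
              [np.nan, 400.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)   # data imputation
X_normalised = MinMaxScaler().fit_transform(X_imputed)        # data normalisation
print(X_normalised)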
The process of selecting one of several potential models for a predictive modelling challenge is known as model selection.
When it comes to model selection, there may be numerous conflicting factors beyond model performance, such as complexity, maintainability, and available resources.
Probabilistic measurements and resampling procedures are the two basic types of model selection techniques.
Model selection is the process of choosing one final machine learning model for a training dataset from a pool of candidate machine learning models.
Model selection is a procedure that can be used to compare models of different types (e.g., logistic regression, SVM, KNN, and so on) as well as models of the same type configured with different hyperparameters (e.g., different kernels in an SVM).
We may, for example, have a dataset for which we want to create a classification or regression predictive model. We have no way of knowing which model will perform better on this problem because it is unknown. As a result, we fit and assess a variety of models to the problem.
The process of selecting one of the models as the final model to solve the problem is known as model selection.
Model selection is distinct from model evaluation.
Model selection, for example, is the process of evaluating or assessing candidate models in order to select the best one. Model evaluation is the process of assessing a model after it has been chosen, in order to describe how well it is expected to perform in general.
Model Selection Techniques
The optimum model selection strategy necessitates "adequate" data, which can be almost limitless depending on the problem's complexity.
We would divide the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and pick them on the validation set, then report the final model's performance on the test set in this ideal scenario.
There are two main types of techniques that can be used to approximate the ideal case of model selection:
Probabilistic Measures - Using in-sample error and complexity, select a model.
Resampling Methods - Select a model based on the estimated out-of-sample error.
Probabilistic Measures
Probabilistic model selection is assessed using an Information Criterion (IC): a scoring approach that chooses the best among candidate models using a probability framework based on the log-likelihood from Maximum Likelihood Estimation (MLE).
Unlike resampling techniques, probabilistic measures are a very useful way of selecting a model while considering both performance and complexity.
A model with fewer parameters is less complex and, for that reason, is preferred, because it is more likely to generalise well on average.
The following are four regularly used probabilistic model selection measures:
● Akaike Information Criterion (AIC).
● Bayesian Information Criterion (BIC).
● Minimum Description Length (MDL).
● Structural Risk Minimization (SRM).
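For reference, the first two criteria are commonly defined as follows (standard definitions, not taken from the text above), where L̂ is the maximised likelihood of the model, k the number of parameters, and n the number of observations; lower values indicate a better trade-off between fit and complexity:
AIC = 2k - 2 ln(L̂)
BIC = k ln(n) - 2 ln(L̂)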
When using simpler linear models such as linear regression or logistic regression, probabilistic measures are useful because the model complexity penalty (e.g., in-sample bias) is known and tractable.
Resampling Methods
Out-of-sample data is used to estimate the performance of a model (or, more correctly, the model creation process).
This is done by splitting the training dataset into training and test subsets, fitting a model on the training subset, and evaluating it on the test subset. The procedure can then be repeated numerous times, and the average performance across trials is reported.
It's a Monte Carlo estimate of model performance on out-of-sample data, albeit each trial isn't technically independent because, depending on the resampling strategy used, the same data may appear numerous times in various training or test datasets.
Data is resampled into train/test for a number of iterations in the resampling process of model selection, followed by training on train and evaluation on test set.
The performance of the model chosen using this technique is evaluated, not its complexity.
Performance is calculated on out-of-sample data, also known as unseen data; resampling procedures use it to estimate the model's error.
The following are three typical resampling model selection methods:
● Random train/test splits.
● Cross-Validation (k-fold, LOOCV, etc.).
● Bootstrap.
The widely used k-fold cross-validation method, for example, divides the training dataset into k folds, such that each sample appears in a test set exactly once.
Another is the leave one out (LOOCV) method, in which the test set is made up of a single sample and each sample is given the chance to be the test set, necessitating the construction and evaluation of N (the number of samples in the training set) models.
Key takeaway
Model selection is the process of choosing one final machine learning model for a training dataset from a pool of candidate machine learning models.
Model selection is a procedure that can be used to compare models of different types as well as models of the same type with different model hyper parameters.
Random Splits are used to sample a percentage of data at random and divide it into training, testing, and, ideally, validation sets. The advantage of this strategy is that the original population is likely to be well represented in all three groupings. Random splitting, to put it another way, prevents biased data sampling.
It's crucial to remember that the validation set is used in model selection. The validation set is the second test set, and it's understandable to wonder why there are two test sets.
The test set is used to evaluate the model during the feature selection and tuning phase. This signifies that the model parameters and feature set have been chosen to produce the best results on the test set. As a result, the validation set is used for the final evaluation, which contains wholly unseen data points (not used in the tuning and feature selection modules).
The data to be passed through the model is divided according to a train/test ratio. The train_test_split function can be used to accomplish this, as shown below.
from sklearn.model_selection import train_test_split

# df is assumed to be a pandas DataFrame that already contains a 'target' column
X = df.drop(['target'], axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Without a fixed random_state, the outcome of re-running the train/test split code is different each time it is run, so you cannot be sure how your model will perform on data it has not seen.
Advantages of train/test split:
● A single train/test split is K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
● Examining the detailed outcomes of the testing process is easier.
K - fold cross-validation
Cross-validation is a resampling technique for evaluating machine learning models on a small sample of data.
The procedure has a single parameter, k, which specifies the number of groups into which a given data sample should be divided. As a result, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the method, for example k=10 becomes 10-fold cross-validation.
Cross-validation is a technique used in applied machine learning to estimate a machine learning model's skill on unknown data. That is, to use a small sample to assess how the model will perform in general when used to generate predictions on data that was not utilised during the model's training.
It's a popular strategy since it's straightforward to grasp and produces a less biased or optimistic estimate of model competence than other approaches, such as a simple train/test split.
The cross-validation technique shuffles the dataset at random and then divides it into k groups. Following that, when iterating over each group, the group should be considered a test set, while the rest of the groups should be combined into a training set. The model is then tested on the test group, and the process is repeated for the remaining k groups.
As a result, at the end of the process, one will have k different test group findings. The best model can then be readily chosen by selecting the model with the highest score.
The following is the general procedure:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
● Take the group as a hold-out or test data set
● Take the remaining groups as a training data set
● Fit a model on the training set and evaluate it on the test set
● Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to a single group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set once and is used to train the model k-1 times.
Any data preparation prior to fitting the model should take place on the CV-assigned training dataset within the loop rather than on the broader data set. The same applies to any hyperparameter tuning. Failure to perform these operations within the loop can lead to data leakage and an over-optimistic estimate of model skill.
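A short sketch of this procedure with scikit-learn is shown below (the iris dataset is assumed purely for illustration); wrapping the scaler and the model in a pipeline keeps the data preparation inside each cross-validation fold, as recommended above:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # k=10 folds

scores = cross_val_score(model, X, y, cv=cv)
print("Score per fold:", scores)
print("Mean skill estimate:", scores.mean())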
Stratified K-Fold
The technique for stratified K-Fold is similar to that of K-Fold cross-validation with one major difference: unlike k-fold cross-validation, stratified k-fold considers the values of the target variable.
If the target variable is a categorical variable with two classes, for example, stratified k-fold ensures that each test fold has the same ratio of the two classes as the complete dataset.
This improves the accuracy of model evaluation and reduces the bias in model training.
Configuration of k
For your data sample, the k value should be carefully chosen.
A poorly chosen value for k can give a misrepresentative view of the model's skill, such as a score with high variance (one that changes a great deal depending on the data used to fit the model) or high bias (such as an overestimate of the model's skill).
The following are three approaches to determining a value for k:
● Representative - The value for k is chosen so that each train/test batch of data samples is statistically representative of the larger dataset.
● k=10 - The value for k is set to 10, which has been found through experimentation to produce a model skill estimate with low bias and low variance.
● k=n - The value of k is set to n, where n is the dataset size, so that each sample is used once as the hold-out test set. This method is called leave-one-out cross-validation.
Advantages of Cross- validation include:
● Out-of-sample accuracy can now be estimated more accurately.
● Every observation is used for both training and testing, resulting in a more "efficient" use of data.
Model evaluation is critical when constructing a predictive machine learning model. Building a predictive model without validating it is not enough; a good model is one that delivers an acceptable level of accuracy on unseen data. To get there, you need to keep an eye on the evaluation metrics and make adjustments until you reach the desired accuracy rate.
Confusion matrix, true positive, false positive
The confusion matrix, also known as the error matrix, is a table that describes a classification model's performance on a set of test data. Conventionally, one class is treated as the positive class and the other as the negative class. It is a two-dimensional matrix in which each row represents instances in the predicted class and each column represents instances in the actual class, or vice versa.
● True positives occur when you anticipate that an observation belongs to a particular class and it actually does.
● When you forecast that an observation does not belong to a class and it truly does not belong to that class, you have a true negative.
● False positives arise when you incorrectly forecast that an observation belongs to a particular class when it does not.
● False negatives occur when you incorrectly forecast that an observation does not belong to a particular class when it actually does.
These four outcomes are often plotted on a confusion matrix.
Accuracy
The most basic metric is accuracy, which is defined as the number of correctly classified test instances divided by the total number of test cases.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It can be used to solve a wide range of problems, although it isn't very useful when dealing with unbalanced datasets.
When identifying fraud in bank data, for example, the ratio of fraud to non-fraud cases might be as high as 1:99. If accuracy is employed in these circumstances, the model will be 99 percent accurate, correctly predicting all test cases as non-fraud. The model that is 99 percent accurate will be absolutely useless.
If a poorly trained model predicts all 1,000 (say) data points as non-fraud, it will miss all 10 fraud cases. Yet measuring accuracy would show it correctly predicts 990 data points, giving it a score of (990/1000)*100 = 99%!
As a result, accuracy is a misleading indicator of a model's health.
As a result, a metric that can focus on the ten fraud data points that the model completely missed is necessary in this scenario.
Precision
Precision is a statistic for determining whether or not a classification is right.
Precision = TP / (TP + FP)
Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The larger the ratio, the higher the precision, which indicates a better ability of the model to correctly label the positive class.
Precision is important in the subject of predictive maintenance, which involves predicting when a machine will need to be fixed in advance. Because the cost of maintenance is typically substantial, inaccurate estimates can result in a loss for the organisation. In these situations, the model's capacity to correctly categorise the positive class and reduce the number of false positives is critical!
Recall
The number of accurately detected positive cases out of the total number of positive instances is known as recall.
Recall = TP / (TP + FN)
Returning to the fraud problem, a high recall value indicates that a large number of fraud cases were recognised out of the total number of frauds.
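The small sketch below computes these metrics with scikit-learn for a pair of illustrative label vectors (the values are made up):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)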
Hyperparameters are controllable parameters that allow you to fine-tune the training process for your model. With neural networks, for example, you can choose the number of hidden layers and nodes in each layer. Hyperparameters have a big impact on model performance.
Hyperparameter tuning, also known as hyperparameter optimization, is the process of determining the optimal configuration of hyperparameters. The procedure is usually both computationally and manually intensive.
The process of selecting a set of optimal hyperparameters for a learning algorithm is known as hyperparameter tuning. A hyperparameter is a model argument whose value is set before the learning process begins. Hyperparameter tuning is key to getting the best performance out of machine learning algorithms.
A mathematical model containing a number of parameters that must be learned from data is referred to as a Machine Learning model. We can fit the model parameters by training a model using existing data.
Hyperparameters, on the other hand, are a type of parameter that cannot be learned directly from the standard training procedure. They are normally fixed prior to the start of the training procedure. These parameters describe crucial aspects of the model, such as its complexity and learning rate.
The following are some instances of model hyperparameters:
● The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
● The learning rate for training a neural network.
● The C and sigma hyperparameters for support vector machines.
● The k in k-nearest neighbors.
Fig 4: Hyper-parameter tuning vs Model training
Hyperparameters: There are no hyperparameters in standard linear regression. Regularization is a hyperparameter in linear regression variants (ridge and lasso). As hyperparameters, the decision tree has a maximum depth and a minimum number of observations in each leaf.
Optimal Hyperparameters: Hyperparameters regulate the model's overfitting and under-fitting. Different datasets have different optimal hyperparameters. The following methods are taken to obtain the best hyperparameters:
1. For each proposed hyperparameter setting the model is evaluated
2. The hyperparameters that give the best model are selected.
Hyperparameter Search: Grid search selects a grid of hyperparameter values and evaluates every combination; the minimum and maximum values for each hyperparameter must be chosen, often by guesswork. Random search evaluates a random sample of points on the grid and often performs better than grid search. Smart hyperparameter tuning selects a few hyperparameter settings, evaluates the validation metrics, adjusts the hyperparameters, and re-evaluates the validation metrics. Spearmint (hyperparameter optimization using Gaussian processes) and Hyperopt (hyperparameter optimization using tree-based estimators) are two examples of smart hyperparameter tuning tools.
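As a minimal grid-search sketch with scikit-learn (the grid values are illustrative; C and gamma play the role of the C and sigma hyperparameters mentioned earlier):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # every combination is evaluated with 5-fold CV
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)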
Key takeaway
Hyperparameters are controllable parameters that allow you to fine-tune the training process for your model. With neural networks, for example, you can choose the number of hidden layers and nodes in each layer
When anticipating the likelihood of a given result, such as whether or not a customer would churn in 30 days, "prediction" refers to the output of an algorithm after it has been trained on a previous dataset and applied to new data. For each record in the new data, the algorithm will generate probable values for an unknown variable, allowing the model builder to determine what that value will most likely be.
The term "prediction" has the potential to be deceiving. In some circumstances, such as when utilising machine learning to pick the next best move in a marketing campaign, it actually does mean you're forecasting a future outcome. Other times, the "prediction" concerns, for example, whether or not a previously completed transaction was fraudulent. In that situation, the transaction has already occurred, but you're attempting to determine whether it was legitimate, allowing you to take necessary action.
Why are Predictions Important?
Machine learning model predictions allow organisations to generate very accurate guesses about the likely outcomes of a query based on historical data, which might be about anything from customer attrition to possible fraud. These supply the company with information that has a measurable business value. For example, if a model predicts that a client is likely to churn, the company can reach out to them with tailored messaging and outreach to prevent the customer from leaving.
Key takeaway
The term "prediction" has the potential to be deceiving. In some circumstances, such as when utilising machine learning to pick the next best move in a marketing campaign, it actually does mean you're forecasting a future outcome.