Building Blocks of a Learning Algorithm

Vibhor Nigam
6 min read · Jan 18, 2020

A lot of articles have been written on different algorithms present in the field of Machine Learning with most of them dividing the learning space between supervised and unsupervised learning and explaining (sometimes in detail) algorithms available in these spaces.

In this article, I have tried to break down a learning algorithm into its basic components: architecture, model, loss function, optimization, and regularization, which collectively can be understood as the building blocks of any discriminative supervised learning algorithm.

Architecture

Architecture is a word that gained prominence with the rise of neural networks and deep learning. Once one starts researching deep learning, the architecture of a neural network gains as much importance as, if not more than, the optimization and regularization methods being used. That does not mean architecture was absent from earlier algorithms.

In simpler terms, architecture refers to the framework within which the mathematical functions of a learning algorithm operate during the training phase to learn adequate weights for the model which can then be used in case of predictions.

If the definition seems too technical, let's understand it with a simple example. Take logistic regression. Its likelihood is represented by the equation

L(β) = Π p(xi)^yi * (1 - p(xi))^(1 - yi)

where p(x) = 1 / (1 + e^-(β0 + β1*x)). Details of how this equation is derived can be seen at [1].

If you look closely, this equation has the architecture of linear regression, with p being the logistic function that provides the non-linearity. Such models in fact belong to a category called generalized linear models, which fit an equation of the form y = f(a + b*x1 + c*x2), where in the case of logistic regression f(z) = 1/(1+e^{-z}). [2]
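As a minimal sketch of this generalized-linear-model form (the function names and coefficient values here are illustrative, not taken from any particular library):

```python
import math

def logistic(z):
    # The logistic function f(z) = 1 / (1 + e^-z),
    # which squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x1, x2, a, b, c):
    # GLM form y = f(a + b*x1 + c*x2) with a logistic link,
    # i.e. the prediction equation of logistic regression
    return logistic(a + b * x1 + c * x2)
```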

Similarly, in the case of a neural network, architecture refers to the number of layers, the number of neurons in each layer, and the interconnectivity between layers. Based on these different layouts, neural networks are categorized as feed-forward neural networks, recurrent neural networks (RNNs), convolutional neural networks (CNNs), LSTMs, and other architectures.

Image of simple neural network architecture.

For instance, a recurrent neural network feeds its output at time t-1 back in as input at time t, a CNN includes convolutional layers that perform a different function than a standard fully connected layer, and an LSTM has a rather complex architecture, an explanation of which is beyond the scope of this article.
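To make the feed-forward case concrete, here is a toy forward pass through one hidden layer, written with plain Python lists; the layer sizes, weights, and biases are made up purely for illustration:

```python
def dense(x, W, b, activation=None):
    # One fully connected layer: each output neuron computes a
    # weighted sum of all inputs plus a bias (W has one row per neuron)
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bj
           for row, bj in zip(W, b)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out

def relu(v):
    # Common hidden-layer activation: max(0, v)
    return max(0.0, v)

# A 2-input, 3-hidden-unit, 1-output feed-forward network
x = [1.0, 2.0]
hidden = dense(x, W=[[0.5, -0.1], [0.3, 0.8], [-0.2, 0.4]],
               b=[0.0, 0.1, 0.0], activation=relu)
output = dense(hidden, W=[[1.0, -0.5, 0.25]], b=[0.0])
```

Changing the architecture here just means changing how many `dense` calls are chained and how wide each one is.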
If you want to learn about neural networks in detail, please feel free to go through my previous article.

Model

A model is a generic term that is used in different domains to represent different concepts. In the world of machine learning, one can mostly consider a model to be a statistical model.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. [3]

Each model assumes a mathematical relationship between the dependent and independent variables. The process of training a model generally consists of training the weights for a model so as to capture the relationships between the dependent and independent variables, which then can be used for prediction, classification or forecasting tasks on a new set of data.

In some cases, such as Bayesian models, instead of training weights, the model relationship itself is used to generate the required output.

Loss Function

At its core, a loss function is incredibly simple: it’s a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number. If they’re pretty good, it’ll output a lower number. [4]

A Plot of different loss functions ~ courtesy researchgate.net

Loss functions are generally chosen to be differentiable and, where possible, convex, so that optimization can reach a minimum value. Reaching the minimum loss indicates the best parameters/weights a model can learn within its constraints and with the data provided. Training is stopped once this minimum is reached.

Log loss in logistic regression, hinge loss in SVMs, and mean squared error (quadratic loss) in linear regression are a few of the most common loss functions.
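All three of these losses can be written down directly. A rough sketch, assuming binary labels in {0, 1} for log loss and {-1, +1} for hinge loss:

```python
import math

def mse(y_true, y_pred):
    # Mean squared error: average of the squared residuals
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    # Binary cross-entropy; probabilities are clipped to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def hinge_loss(y_true, scores):
    # Hinge loss for labels in {-1, +1}: zero once the margin exceeds 1
    return sum(max(0.0, 1.0 - y * s)
               for y, s in zip(y_true, scores)) / len(y_true)
```

Note how each one behaves as described above: perfect predictions give zero (or near-zero) loss, and worse predictions give a higher number.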

Optimization

Optimization is a technique used in machine learning to tune the parameters of a model so as to drive its loss to the minimum.

Gradient Descent Visualization ~ courtesy datasciencecentral.com

One of the most common optimization algorithms is gradient descent which is used extensively in the field of machine learning to find a local minimum.

Stochastic gradient descent, mini-batch gradient descent (mostly used in neural networks), and adaptive moment estimation (Adam) are some other popular optimization algorithms.

Regularization

In mathematics, statistics, and computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. [5]

Plots of contours of L2 regularization(left) and L1 regularization(right) ~ courtesy datascience.stackexchange.com

In the field of machine learning, regularization generally refers to adding a penalty term to the objective function and controlling the model complexity through that penalty term. It is mainly used to avoid overfitting by penalizing large weights. Depending upon the kind of penalty term added, the regularization is called L1 or L2 regularization.

L1 regularization imposes a penalty that drives the weights of non-contributing features to exactly zero while leaving contributing features with non-zero weights. For this reason, it can be used for feature selection as well.
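The penalty terms themselves are simple to state. A sketch of both, where `lam` is the regularization strength, a hyperparameter you would tune:

```python
def l2_penalty(weights, lam):
    # Ridge penalty: lam * sum of squared weights
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    # Lasso penalty: lam * sum of absolute weights
    return lam * sum(abs(w) for w in weights)

def regularized_loss(data_loss, weights, lam, kind="l2"):
    # Objective = data-fit loss + penalty on model complexity
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    return data_loss + l2_penalty(weights, lam)
```

During optimization the gradient of the penalty pulls every weight toward zero; the constant pull of the L1 term is what lets it zero out weights entirely, while L2 only shrinks them.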

Conclusion

What I have tried to highlight through this article are the basic components that are shared across all types of discriminative algorithms and models. On a closer look, each of these algorithms can be divided into the basic modules of architecture, model, loss function, an optimization technique, and a regularization technique. There are multiple versions of each of these modules and different variations of them are tried while training an ML model to get the best results.

With time, some choices have become standard practice, like using gradient descent for optimization or L2 regularization, but that does not mean other techniques are inefficient or cannot be used. In fact, in the field of neural networks, the Adam optimizer is steadily replacing plain gradient descent as the optimization technique of choice.

Understanding these basic components is necessary for a data scientist, as training an effective model involves more than collecting a large amount of data and choosing a particular algorithm. Tweaking the other components, such as the optimization or regularization technique, can give a significant boost as well.

References

1. https://www.saedsayad.com/logistic_regression.htm

2. http://www.win-vector.com/blog/2012/08/what-does-a-generalized-linear-model-do/

3. https://en.wikipedia.org/wiki/Statistical_model

4. https://algorithmia.com/blog/introduction-to-loss-functions

5. https://en.wikipedia.org/wiki/Regularization_(mathematics)
