Machine Learning’s Hidden Façade: Math Learning

Basic math behind ML algorithms

Yelim Kim
students x students

One of the largest things I learned about Machine Learning over the course of my research is that ML is really a game of numbers.

Seriously. In simple terms, all an ML algorithm does is take in a set of numerical inputs, run it through a series of math operations (multiplication, addition, exponentiation (raising a number to the power of another number), log, etc), and output a new array of numbers that tells us what the input data means in a form that we can understand.

A simplified representation of how machine learning works

So, if you’re planning to learn and/or build any type of Machine Learning algorithm, it’s crucial that you understand the mathematical operations behind it, because most of what the code does is translate those mathematical processes into the language of a computer.

Alright, so how do you learn it? Well, you’ve got two options: (1) invest 5 hours into going through and understanding the “Introduction to Neural Networks” module at this Udacity course, or (2) read through this 28-minute article overviewing the basic mathematical and logical operation behind Neural Networks. If you’re ready, I’m ready. Let’s get right into it.

Table of Contents

  1. What is a Neural Network (NN)?
  2. Basic math concepts
  3. How perceptrons work
     3a) Feedforward
     3b) Backpropagation
  4. Neural networks: a network of perceptrons
     4a) Feedforward in NNs
     4b) Backpropagation in NNs
  5. Conclusion

What is a Neural Network (NN)?

Machine learning (ML) is a type of computer algorithm that can take in a set of inputs, process the inputs to extract patterns from them, and make a prediction of the corresponding outcome based on those patterns. For example, an ML model can look at a set of images of animals, learn the patterns that distinguish cats from dogs, and output a prediction of whether a new image it’s never seen before shows a cat or a dog.

A subfield of machine learning is deep learning (DL). Deep learning models are ML models that don’t require the user to tell them which specific patterns to look for. Back to the dog and cat example, a DL model doesn’t need to be told that the ears are a crucial distinguishing factor; it figures out which features are the most important by itself. Some DL models don’t even need images labeled as dog or cat: although they learn well from data with the answers written on them, they can also group data points into categories like dogs vs cats without being told the answer. If you think about it, this is SICK. DLs are like babies that all of a sudden learn to read just by staring at the pages of a book.

How does it do this? Like in soft robotics, where the composition of the robots allows them to be sturdy and flexible at the same time, the magic of DLs originates from the composition of the DL algorithm. We call these components Neural Networks (NN), or Artificial Neural Networks (ANN). In this article, I’ll simply be calling them NNs (although “neural network” technically refers to the physical networks of neurons in your brain, not a computer algorithm).

As a primer, NNs consist of layers, each made from nodes. The model consists of an input layer that stores the input data from the user, a set of hidden layers that apply math operations to the input data, and an output layer that returns the prediction of the model back to the user.

The properties and type of an NN (ex. CNN, RNN, GAN) depend on the type of hidden layers you add to your network, because the hidden layers handle all the pattern extraction and data processing. You can think of NNs as Subway sandwiches: you can pick what you want to add in between the two pieces of bread (representing the input layer and output layer), and the unique set of toppings (hidden layers) is what makes each sandwich taste different.

Basic math concepts

I know you’re excited to learn the actual ML mechanism, but before we jump into the meat, let’s review some math concepts that you should know to fully understand the math. Consider it the extra stretch before you dive into a pool 🌊 (and you probably know what happens if you don’t warm up before going in; you drown 💀).

Derivatives

You probably know what a slope is; it’s the rate of change of y as x changes along a straight line. In other words, a slope tells you whether y increases or decreases as x steadily increases, and how quickly.

Well, derivatives measure this rate of change at a single point on the curve. With derivatives, you can find out how quickly y is changing as x increases, even when the function of y is not a straight line.

There are numerous applications for derivatives (ex. calculating the speed of an object, minimum/maximum values in a curve), but generally, we use derivatives to find in what direction (positively or negatively) a value (y) is changing as another value (x) changes, and therefore what we can do to x to increase or decrease y (the latter will be super pertinent to ML).

There are two commonly used notations for derivatives — d/dx(f(x)) (f = the value being changed, x = the independent value that the change of f depends upon) and f’(x).

Here are the steps to finding the derivative of a value:

(1) Find a formula for the value in terms of another value, like x.

(2) Apply a set of rules that tell you how to find the derivative depending on what type of expression (e^x, x², cos(x), f(g(x)), 1/x, etc) is present in that function. A really important rule that you have to know is the chain rule, which states that the derivative of a nested function f(g(h(t))) is the product of the derivatives of each function with respect to the function nested directly inside it, all the way from f down to t (see image below).

If you really want to understand the underlying math for deriving certain formulas used in the ML algorithm, you should also know:

Power rule

The derivative of a constant (a single number) is 0

When you’re finding a derivative with respect to one variable, every other variable is treated like a constant

Quotient (division) rule

Special derivatives: derivatives of special functions (you’ll only need to know the derivative of ln(x), which is d/dx(ln(x)) = 1/x)

Special derivatives for different functions.
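
To see the chain rule, the power rule, and the ln(x) derivative working together, here is a minimal Python sketch. The nested function ln(x²) and the helper names (`chain_rule_derivative`, `numerical_derivative`) are assumptions picked for illustration; they don't come from the course. It compares the chain-rule answer with a slope measured between two very close points on the curve.

```python
import math

def g(x):
    # inner function: g(x) = x^2 (power rule: g'(x) = 2x)
    return x ** 2

def f(u):
    # outer function: f(u) = ln(u) (special derivative: f'(u) = 1/u)
    return math.log(u)

def chain_rule_derivative(x):
    # d/dx f(g(x)) = f'(g(x)) * g'(x) = (1 / x^2) * 2x
    return (1.0 / g(x)) * (2.0 * x)

def numerical_derivative(func, x, eps=1e-6):
    # slope of the straight line through two nearby points on the curve
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 2.0
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is a handy sanity check whenever you derive a formula by hand.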

Summation notation

If you’re familiar with Python, summation is essentially a for loop that evaluates the function inside the bracket once for each integer from the starting value of i up to m (that’s m - i + 1 times, or m times when i starts at 1):

Example of summation in Python. The “function_terms” list is a list of the result of each time a value i is plugged into the function (1², 2², 3² and so on)

More technically, it says that starting from i = 1 and counting up (in integers) to m, you’re plugging in the i value into the formula on the right side of the symbol and adding up the result of all of these substitutions.
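
As a rough sketch of that loop, here is the sum of i² for i = 1 to m in Python. The choice of m = 5 and the function i² are assumptions for illustration; `function_terms` follows the name used in the caption above.

```python
m = 5  # upper limit of the summation (illustrative value)

function_terms = []
for i in range(1, m + 1):       # i counts 1, 2, ..., m
    function_terms.append(i ** 2)  # plug i into the formula i^2

total = sum(function_terms)     # add up all the substitutions
print(function_terms, total)
```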

Matrices

A matrix is a 2-dimensional array of numbers arranged in a rectangular layout. Simply, it’s a rectangle filled with numbers. This rectangle consists of smaller units that each contain a single number. The dimensions of the matrix are given by how many of these units it has along each side: the number of rows and the number of columns.

A matrix with a dimension of 1 x m or m x 1 is called a vector.

Organizing the numbers into a matrix allows us to perform arithmetic operations like addition, subtraction, multiplication, division, cos(), square root, and exponentiation, at the same time on all the numbers in the matrix.

To show you what I mean by this, here’s a list of common arithmetic operations done with matrices in ML:

  1. Matrix addition.

The numbers on the corresponding spot within the matrix are added to each other for all the numbers in the matrix.

In this way, addition is done “simultaneously” for all the numbers on the two matrices. Caveat: the two matrices must have the same dimensions for addition to work.

2. Matrix scalar multiplication.

Multiplying the matrix by a single constant. Each element of the matrix is multiplied by that constant, so multiplication is done at the same time for all the numbers in the matrix. A scalar means a single number.

3. Matrix dot product.

The numbers in each column of the matrix on the right are multiplied element-by-element with the numbers in each row of the matrix on the left.

Think of each column of the matrix on the right as a person and each row of the matrix on the left as their bed. For every multiplication, the person lies down on a bed that matches their length, so that the numbers from the column (person) can be multiplied by the corresponding numbers in the row.

The results of these multiplications for one “person” and one “bed” are added up to produce one element of the output matrix.

The resulting matrix has the same # of rows as the bed matrix (left) and the same # of columns as the person matrix (right) (shortcut: the dimensions of the output matrix are as if you took the outer two dimension parameters from the parent matrices).

Here, multiplication AND addition are done simultaneously for multiple numbers. Caveat: the number of rows of the matrix on the right should be the same as the number of columns of the matrix on the left.
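
The three operations can be sketched in plain Python with nested lists, so nothing needs to be installed. The helper names `mat_add`, `scalar_mul`, and `mat_dot` are made up for this example; in practice a library such as NumPy does the same work much faster.

```python
def mat_add(a, b):
    # element-by-element addition; both matrices need the same dimensions
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("matrices must have the same dimensions")
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scalar_mul(c, a):
    # every element is multiplied by the same single number c
    return [[c * x for x in row] for row in a]

def mat_dot(left, right):
    # columns of the left matrix must match rows of the right matrix
    if len(left[0]) != len(right):
        raise ValueError("columns of left must equal rows of right")
    columns = list(zip(*right))  # the "people" from the right matrix
    return [[sum(l * r for l, r in zip(row, col)) for col in columns]
            for row in left]     # each row of the left matrix is a "bed"

A = [[1, 2, 3],
     [4, 5, 6]]    # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]     # 3 x 2

added = mat_add(A, A)
scaled = scalar_mul(2, A)
product = mat_dot(A, B)  # (2 x 3) dot (3 x 2) gives a 2 x 2 matrix
print(added, scaled, product)
```

Notice that the product has the row count of the left matrix and the column count of the right matrix, which is the "outer two dimensions" shortcut.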

Properties of logs and exponentials

Hey, you’re almost there! The two last concepts I’ll be covering are more supplementary and will help you understand why we’re using certain functions, like the sigmoid function as we’ll see later.

  1. Exponents occur when a number is multiplied by itself a certain number of times. We call the number being multiplied the base and the number of times it’s multiplied the exponent (yeah, the same name as the operation itself).

Property to remember: exponents with a positive base ALWAYS result in a positive number.

2. Logarithms are the exact opposites of exponents. They take in the end result of an exponent and the base of the exponent, and output the number of times the base would have been multiplied.

Property to remember: the sum of multiple logarithms with the same base is equal to the logarithm of the product of all the numbers inside each of these added logarithms.
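
Both properties are easy to check numerically. This is a small sketch using Python's built-in `math` module; the specific numbers are arbitrary values chosen for illustration.

```python
import math

# Property 1: a positive base raised to any exponent stays positive,
# even when the exponent is very negative.
tiny = math.exp(-10)   # e^(-10)
print(tiny > 0)

# Logarithms undo exponents: log base 2 of 2^5 gives back (about) 5.
recovered = math.log(2 ** 5, 2)

# Property 2: the sum of logs equals the log of the product.
a, b, c = 0.5, 0.25, 0.8
sum_of_logs = math.log(a) + math.log(b) + math.log(c)
log_of_product = math.log(a * b * c)
print(sum_of_logs, log_of_product)
```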

How perceptrons work

Perceptrons (also called nodes or layer units) are the basic building blocks of a neural network, and yep, they’re to an NN what neurons are to a brain. In fact, the mechanism of perceptrons was designed based on the structure of a neuron. Each perceptron takes in a set of inputs, mathematically modifies the inputs, runs the result through an activation function to squash it into an appropriate range, then outputs a set of one or more numbers.

An ML algorithm is created when the individual perceptrons are linked to one another, so that the output of a set of perceptrons becomes the input for another perceptron (I’ll talk about this in the later sections). For now, let’s zoom into one perceptron in an algorithm to see what’s mathematically happening inside a perceptron. Then, in the second part, I’ll show you what’s happening at the scale of a whole NN and how a Neural Network model trains.

Feedforward in perceptrons

A perceptron consists of 4 parts: the input, weights, activation function, and output. Feedforward is when a value enters a perceptron as an input, is modified by the weights and the activation function, and is returned as an output. Here’s what happens in each part of the perceptron.

  1. Input: receiving inputs.

First, a set of inputs is collected from the perceptrons in the previous layer into a matrix.

The input of a perceptron is sometimes confusingly also called the input layer (the same name as the layer inside a neural network where the user gives a dataset to the NN).

To clarify, the input layer of an NN is a group of perceptrons that receive their inputs directly from the user, instead of from other perceptrons.

2. Weights : adding significance to the inputs.

A perceptron has a set of numbers, called weights. Each weight is assigned to one corresponding input, and the weights determine how important each input is to the perceptron’s output.

During feedforward, each weight is multiplied by its matching input and all the products are summed up. If you think of a perceptron as a function, the inputs would be the variables (x, y, z, etc) in the function and the weights the coefficients of those variables.

Clearly, it’s pretty accurate to say that the weights determine the outcome of the perceptron.

However, it would take a looong time and computer memory to multiply the weights one-by-one to the inputs. Luckily, we now know a way to do all these operations simultaneously… matrix operation! The weights are organized into a matrix and multiplied to the input matrix by matrix dot product (see above).

This results in the product of each input and the corresponding weight and the subsequent summation of these products for all the received inputs.

After all the products between the weights and the inputs are summed up, a single number called the bias is added to the total. The bias regulates the significance of the perceptron itself to the whole NN algorithm. It can be either a positive or negative number, meaning that it can increase or decrease the perceptron’s influence on the model’s output.

To recap: the weights and biases are the agents that manipulate the input values to control how much each input and the perceptron itself matters to the overall outcome of the NN. They determine the accuracy of the perceptron and the whole NN model.

3. Activation function: condensing the weights.

In ML, we want the outputs of all the nodes to be in the same range of numbers, particularly between 0 and 1, so that the outcome of each node equally influences the final prediction of the model. To make sure that each perceptron is within this range, the result from the W (weight) * X (input) + b is run through an activation function. There are several activation functions used, the most common being the sigmoid function (σ), tanh(x), and rectified linear unit (ReLU).

In this example, we’ll be sticking to the sigmoid function. Sigmoid, however, is rarely used in hidden layers anymore because it is slower to compute than alternatives like ReLU and makes it difficult to precisely optimize the weights (backpropagation, in the next section). We’re talking about sigmoid here only because it was one of the first activation functions used in ML algorithms (and most ML courses start with the sigmoid function).

4. Output: sending out the result.

This final result after applying the weights and squishing to fit between 0 and 1 is passed along onto another set of perceptrons in the next layer to be used as inputs.

The goal of a perceptron is to find a straight line that can separate the dataset into two groups (categories). It does this by multiplying the inputs by the weights and adding the bias, so that when the resulting number is greater than 0, the input is predicted to be in one category, and if it’s less than 0, it’s put into the other category.
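
The four parts of a perceptron can be sketched in a few lines of Python. The input, weight, and bias values here are made-up numbers for illustration, and `perceptron` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    # squashes any number into the range between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    # multiply each weight by its matching input, sum the products, add the bias
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # run the result through the activation function
    return weighted_sum, sigmoid(weighted_sum)

inputs = [1.0, 2.0]     # illustrative input values
weights = [0.5, -1.0]   # one weight per input
bias = 0.25

z, out = perceptron(inputs, weights, bias)
category = 1 if z > 0 else 0   # which side of the straight line the input falls on
print(z, out, category)
```

With sigmoid, a weighted sum above 0 always maps to an output above 0.5, so checking `z > 0` and checking `out > 0.5` give the same category.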

But hold up… aren’t neural networks super complex and can solve sophisticated problems? How do these simple perceptrons make up such powerful algorithms? Well, here comes our next section.

Neural networks: a network of perceptrons

As I mentioned earlier, perceptrons in a neural network are connected to other perceptrons, where the output of perceptrons from one layer becomes the input for the perceptrons in the other layer.

Each layer multiplies the output of the previous perceptrons by its weights, adds a bias to it, and squishes it with an activation function to output a more complex, non-linear prediction curve. In other words, we’re making non-linear curves by mixing straight lines.

Neural networks consist of layers of perceptrons, where each layer is an array of perceptrons laid side by side.

Each line connecting the nodes between layers represents a weight, and each node has one bias along with an activation function.

There are two components of an ML algorithm: feedforward and backpropagation.

Feedforward in NNs

Organizing the weights, biases, and inputs

Since a neural network consists of connected layers of individual perceptrons, there are many more weights and biases, as well as inputs (we have a whole input layer instead of a single perceptron for storing all the inputs, and the model is usually fed a whole dataset, such as a group of hundreds of images, instead of a single image).

To keep track of the many inputs, all input values in the dataset fed into the model are organized into one matrix, similar to the perceptron input matrix, except that it’s a stack of multiple perceptron input matrices:

The input matrix for the perceptron was drawn here as a vertical vector (instead of as the horizontal vector we’ve been using) for convenience.

Each datapoint in the dataset isn’t just a single number; it’s usually an array of multiple numbers (x1, x2, x3, …), with each number representing the value of a certain parameter/characteristic of that datapoint. For example, a single grayscale image consists of a 2D array of numbers, with each number corresponding to the brightness (0 (black) to 1 (white)) of a pixel in the image.

The convention is that inputs like this are organized into a matrix, so that each row represents a new datapoint and each column represents a parameter of the datapoint (ex. the color value for a pixel at a certain location inside the image).

Weights are also organized into a matrix. Each layer in the NN model has a matrix containing all the weights in that layer. Similar to the perceptrons, the inputs derived from the previous layer are multiplied in a dot product with the weights in the current layer. To meet the dimension requirement for matrix dot products, the weight matrix has the same number of rows as the number of parameters/characteristics (= the number of columns) in the input matrix. Each column of the weight matrix contains the weights for one node in the current layer, so the number of columns in the weight matrix is the number of nodes in that layer.

Feedforward process

Starting off from the input layer, the inputs, organized into a matrix (# of rows: # of datapoints; # of columns: # of parameters), are multiplied by the weight matrix of the next layer, added to the bias of the next layer, and run through the activation function (where the sigmoid function is applied to each element of the result). This result is then fed into the weights of the layer after, and so on up to the output layer.

Mistake in the image: h1, h2, and h should be the sigmoid function of the Weight-Input matrix dot product
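
Here is a minimal sketch of that feedforward process for a tiny network, written with plain Python lists. The layer sizes (3 input parameters, 2 hidden nodes, 1 output node), all the weight and bias values, and the helper name `layer_forward` are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mat_dot(left, right):
    # dot product: rows of the left matrix with columns of the right matrix
    columns = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, col)) for col in columns]
            for row in left]

def layer_forward(inputs, weights, biases):
    # inputs:  (# datapoints) x (# parameters)
    # weights: (# parameters) x (# nodes in this layer), one column per node
    # biases:  one bias per node in this layer
    weighted = mat_dot(inputs, weights)
    return [[sigmoid(z + b) for z, b in zip(row, biases)] for row in weighted]

X = [[0.0, 0.5, 1.0],
     [1.0, 0.5, 0.0]]    # 2 datapoints, 3 parameters each
W1 = [[0.2, -0.4],
      [0.7, 0.1],
      [-0.5, 0.3]]       # 3 rows (parameters) x 2 columns (hidden nodes)
b1 = [0.1, -0.1]
W2 = [[1.0],
      [-1.0]]            # 2 rows (hidden nodes) x 1 column (output node)
b2 = [0.0]

hidden = layer_forward(X, W1, b1)       # 2 x 2: one row per datapoint
output = layer_forward(hidden, W2, b2)  # 2 x 1: one prediction per datapoint
print(hidden)
print(output)
```

Each call to `layer_forward` is one layer: the output matrix of one call becomes the input matrix of the next.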

Converting normal numbers to probabilities (output layer)

Previously, we were just looking at individual perceptrons, which output a single number. When we’re looking at networks of perceptrons, the output layer may consist of multiple nodes. If this is the case, each output node would store a probability that the given datapoint belongs to a certain category (assuming that the input can only belong to one of the given categories and each node stores the probability for one category). This means that the probabilities of all the output nodes must add up to 1.

Because of this, we can’t apply the regular activation functions, like ReLU and sigmoid, to the output layer, since they would squeeze the W * X + b values into numbers that don’t add up to 1 for each datapoint. So, we need a special activation function that takes in the resulting matrix from W * X + b (where each row represents a datapoint and each column holds the prediction value for one category, a value directly related to how likely the datapoint belongs to that category), and turns it into the probability for each category. Note: Probabilities are all numbers between 0 and 1, while the prediction values directly from the W, X dot product can be any number (ex. 100, -5, 25, 0.5).

Well, what do we use then? Softmax.

Mathematically, the softmax function divides e (Euler’s number) raised to the power of the prediction value for one category by the sum of these exponentials for all the categories.

Summing up the softmax probability for all categories returns 1, showing us that the softmax function gives us a probability corresponding to each prediction value.

The softmax function in a neural network would look like this:
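
As a rough sketch, softmax for the prediction values of one datapoint could be written like this in Python. The scores are the example numbers mentioned earlier (100, -5, 25, 0.5). Subtracting the largest score before exponentiating is a common numerical-stability trick that is an addition here, not part of the description above; it leaves the resulting probabilities unchanged.

```python
import math

def softmax(scores):
    # shift by the largest score so e^(score) never becomes a huge number;
    # the shift cancels out in the division below
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    # each category's exponential divided by the sum over all categories
    return [e / total for e in exps]

scores = [100.0, -5.0, 25.0, 0.5]   # one prediction value per category
probs = softmax(scores)
print(probs, sum(probs))
```

The category with the largest prediction value ends up with the largest probability, and all the probabilities add up to 1.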

To recap, during the feedforward of a neural network:

  1. The dataset is organized into a matrix, where each row represents a single datapoint
  2. The dataset matrix undergoes a dot product with a weight matrix of the next layer, where each column contains all the weights for one node in the layer.
  3. The dot product is fed into an activation function, and the result is used as the input for the nodes in the layer after.
  4. Steps 2 and 3 are repeated until you reach the output layer. The output layer uses a special activation function called softmax that raises e to the power of each dot product and divides it by the sum of these powers of e over all the categories.

Backpropagation in NNs

What is backpropagation?

A neural network algorithm doesn’t just end with running a bunch of inputs through a model. We want to train the model so that it makes more accurate predictions every time another dataset is run through it.

We learned that the weights and biases determine the accuracy of a perceptron’s predictions, so we’ll be tinkering with the weights and biases during backpropagation. After every round of feedforward, a round of backpropagation would take place where the computer looks at (1) how inaccurate the model was and (2) how each weight/bias contributed to the inaccuracy, and tailors each of the weights and biases accordingly to improve the accuracy. We call this training process backpropagation.

Gradient descent

The mathematical operation behind backpropagation is called gradient descent. During gradient descent, we tell the computer to find the relationship between the error and every single weight (and bias) of the model, and therefore whether each weight should be increased or decreased, and by how much. Using this, we update each weight in small steps in the direction that lowers the error, based on how the error changes as the weight changes.

Every time all the weights and biases are updated once using the whole dataset, we call this “step” an epoch. There are 2 major parts in gradient descent: (1) finding the “error” in the model and (2) minimizing that error.

From now on, I’ll be talking about how gradient descent updates each weight of the model. The process for updating the bias is identical to this process, so when I say “weight”, just assume I mean it for both weights and biases.

Let’s break it down ⛏.

Step 1) Finding the error

To know how we can lower the error, we first need a way to measure the error, or find the error function.

One of the most widely used error functions is called cross entropy. The cross entropy function essentially adds up the log() of the probability that the model predicted for the right answer across all the data points in the dataset.

How did this cross entropy formula come to be?

One way to assess how good the model is at predicting is to find the probability it assigned to the category that turned out to be the right answer; in other words, how sure the model was of the right answer.

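To find the total accuracy of the model, we measure this for all the data points the model was trained on and multiply the probabilities together. The result is called the likelihood, and the model that makes it as large as possible gives us the maximum likelihood.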

The only problem is that, since each probability would be less than 1, this multiplication of the probability for the whole dataset would result in a super tiny number, which is inconvenient to deal with.

Can you think of how we can turn this multiplication operation into an addition operation?

Answer: log(A x B x C) = log(A) + log(B) + log(C).

Well, we could put a log() on the whole product, so that it could equal the sum of the log() of each probability. For example, it would look like this for 3 data points:

But here’s the challenge: how do we make sure that only the probabilities pertaining to the correct category get added up? Is there a way to apply the above log() to a matrix containing the model’s predicted probabilities for all the categories and eliminate the log() of the probabilities for the categories that weren’t the right answer?

Here’s where one-hot encoding comes in. Basically, the correct answer (y) for what category the image is in can be expressed in a table of 0s and 1s, so that each row represents each datapoint and each column asks, “Does the image of this datapoint belong to category_n?” If yes, the table has the value 1 in that place, and if no, it has 0:

If we look at just one datapoint for simplicity, we can use one-hot encoding by multiplying the y value (0 or 1) for each category in the one-hot encoding table by the log of the model’s predicted probability.

y_img1, dog = the value in the one-hot encoding table for row “img1” and column “is the image a dog?”

When the probability belongs to the category of the correct answer, the log() of that probability is multiplied by 1 (no change to the original log()), and if it belongs to a category that wasn’t the answer, it’s multiplied by 0 (eliminated). Great, problem solved. Well, we have one last thing to tackle.

You might have noticed that, since we’re summing up logs of numbers less than 1, each log() value would be a negative number. But who likes negative numbers? …

Well, I don’t, so let’s multiply this sum of logs by -1 to change it back to positive. Multiplying by -1 means that we’re changing from using the sum of the log()s to measure the accuracy of the model to measuring the inaccuracy of the model.

Lastly, let’s turn this total sum of logs() into the average inaccuracy of the model for every data point we run on by dividing the sum by the number of datapoints used.

To put everything that we just talked about into one equation, we get the following:
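
Putting the steps together (one-hot multiplication, sum of logs, multiply by -1, divide by the number of datapoints), a minimal Python sketch could look like this. The three datapoints, their labels, and the predicted probabilities are made-up values for illustration, and `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(one_hot_labels, predicted_probs):
    # one_hot_labels and predicted_probs: one row per datapoint,
    # one column per category; every probability must be greater than 0
    m = len(one_hot_labels)
    total = 0.0
    for y_row, p_row in zip(one_hot_labels, predicted_probs):
        for y, p in zip(y_row, p_row):
            # y is 1 for the correct category (keeps the log) and 0 otherwise (drops it)
            total += y * math.log(p)
    # flip the sign to measure inaccuracy, then average over the datapoints
    return -total / m

labels = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]            # one-hot table: 3 datapoints, 3 categories
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]       # model's predicted probabilities
confident = [[0.9, 0.05, 0.05],
             [0.05, 0.9, 0.05],
             [0.05, 0.05, 0.9]]  # a model that is more sure of the right answers

error = cross_entropy(labels, probs)
lower_error = cross_entropy(labels, confident)
print(error, lower_error)
```

The more probability the model puts on the correct categories, the smaller the error gets.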

Step 2) Minimizing the error

Once we have the average inaccuracy of the model on ALL the datapoints in the dataset, we can use it to work out how to lower the error.

Remember how the weights and biases are the key determinants of a model’s accuracy? We would be changing the model’s accuracy by directly controlling the weights and biases of all the perceptrons in the model. To decide if we should increase or decrease each weight and bias and by how much, we take the derivative of the cross entropy with respect to each weight and bias.

Essentially, you want to update each weight by subtracting from the original weight the product of the learning rate (α) and the derivative of E with respect to that weight (w := w − α · dE/dw), as in the equation below:

The weird equal sign with two dots on the side is a symbol for assignment, meaning that the value on the right side of the sign is assigned to the variable on the left side

Again, I’ll break it down for you.

  1. Uhh why is there a minus sign?

As I said before, derivatives measure the rate of change of a function: in which direction the values are changing (positive means up, negative means down) and at what pace. No matter what direction, we always want to go DOWN in the model’s error, and the minus sign in front of the derivative term guarantees that the change in the weights would lead to a lower error.

Let’s take an example. If the derivative is positive, we’d want to go left (decrease the weight) to lower the error.

If it’s negative, we’d want to go to the right (increase the weight) to lower the error.

Notice how the direction in which we change our weight is always opposite to the direction in which the error changes as the weight increases. In other words, you want to change the weight in the direction opposite to the sign of the rate of change (derivative) of the error.
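
A quick way to convince yourself that the minus sign does the right thing is to try it on a simple bowl-shaped error curve. The function E(w) = (w − 3)², the starting weights, and the learning rate of 0.1 are all assumptions chosen for illustration; a real model's error function is far more complicated.

```python
def error(w):
    # a made-up error curve with its lowest point at w = 3
    return (w - 3.0) ** 2

def error_derivative(w):
    # power rule + chain rule: d/dw (w - 3)^2 = 2(w - 3)
    return 2.0 * (w - 3.0)

learning_rate = 0.1

def update(w):
    # subtract the (scaled) derivative from the original weight
    return w - learning_rate * error_derivative(w)

w_right = 5.0   # right of the minimum: derivative is positive
w_left = 1.0    # left of the minimum: derivative is negative

new_right = update(w_right)  # moves left (weight decreases)
new_left = update(w_left)    # moves right (weight increases)
print(new_right, error(new_right))
print(new_left, error(new_left))
```

In both cases the weight moves toward the bottom of the bowl and the error drops.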

2. How to find the derivative

Here’s a list of all the derivatives of E you have to find:

Well, in case you’re also wondering why I mentioned the chain rule at the beginning, here’s why. As a review, the chain rule says that the derivative of a value f with respect to another value t is the product of the derivatives between each pair of consecutive values that link f to t.

A really convenient thing about the chain rule is that it allows us to represent a seemingly complex derivative as the product of simpler, more straightforward derivatives.

The chain rule’s useful in our situation because we can think of the error of the model as a nested function: the error depends on the model’s output (h), which depends on the output layer’s weights (W²_ij) and on h1, which in turn depends on the weights of the hidden layer (W¹_ij). This allows us to apply the chain rule to the error function (E) like below:

So, we can now find the expression for each of these individual derivatives (see image below), multiply them together, and plug them into our formula for updating each weight (and bias):

Here’s an example of solving for the formula to one of the multiplied derivatives (dh/dh1):
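
To see the chain rule at work on a whole (tiny) network, here is a hedged Python sketch. It makes several simplifying assumptions that are not from the text above: one datapoint, two inputs, two hidden nodes, a single sigmoid output node instead of softmax, the two-category form of cross entropy, and variable names (`z1`, `a1`, `z2`, `y_hat`) that are invented for this example. It computes the derivative of the error with respect to one hidden-layer weight by multiplying the simple derivatives, then checks the answer by nudging the weight slightly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # hidden layer: weighted sum (z1) and sigmoid output (a1) for each node j
    z1 = [sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j]
          for j in range(len(b1))]
    a1 = [sigmoid(z) for z in z1]
    # output node: weighted sum (z2) and prediction (y_hat)
    z2 = sum(W2[j] * a1[j] for j in range(len(a1))) + b2
    return a1, sigmoid(z2)

def error(y, y_hat):
    # cross entropy for one datapoint with two categories
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def chain_rule_gradient(x, y, W1, b1, W2, b2, i, j):
    # dE/dW1[i][j] as a product of simple derivatives
    a1, y_hat = forward(x, W1, b1, W2, b2)
    dE_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
    dyhat_dz2 = y_hat * (1 - y_hat)      # derivative of sigmoid
    dz2_da1 = W2[j]
    da1_dz1 = a1[j] * (1 - a1[j])        # derivative of sigmoid
    dz1_dw = x[i]
    return dE_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw

def numerical_gradient(x, y, W1, b1, W2, b2, i, j, eps=1e-6):
    # nudge one weight up and down and measure how the error changes
    def error_with(delta):
        W = [row[:] for row in W1]
        W[i][j] += delta
        return error(y, forward(x, W, b1, W2, b2)[1])
    return (error_with(eps) - error_with(-eps)) / (2 * eps)

x, y = [0.5, -1.0], 1                 # one datapoint and its correct answer
W1 = [[0.1, 0.4], [-0.3, 0.2]]        # W1[i][j]: input i to hidden node j
b1 = [0.0, 0.1]
W2, b2 = [0.6, -0.5], 0.05
learning_rate = 0.5

analytic = chain_rule_gradient(x, y, W1, b1, W2, b2, 0, 1)
numeric = numerical_gradient(x, y, W1, b1, W2, b2, 0, 1)

error_before = error(y, forward(x, W1, b1, W2, b2)[1])
W1[0][1] = W1[0][1] - learning_rate * analytic   # the weight update
error_after = error(y, forward(x, W1, b1, W2, b2)[1])
print(analytic, numeric, error_before, error_after)
```

The chain-rule value and the nudged estimate should match closely, and the error after the update should be a little lower than before.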

3. Learning rate (α)

At the beginning of this section, I mentioned that gradient descent happens in steps, as we take one step for each epoch down the graph of E towards a minimal E value. Each “step” here refers to the process of updating all the weights and biases in the model by subtracting the derivative of E from the original weight (or bias).

But, it turns out that we can’t just change the weights by the pure value of the derivative — the change in the weights would then be too drastic and you’d be overshooting.

Notice how the step size (the size of the orange vector representing each step taken) decreases or increases by the magnitude of derivative of the E function

So, we need to multiply this by a number below 1 (but larger than 0) to water down the effects of the derivative and take smaller steps to adjust the weights and biases. We call this constant a learning rate (α). A model has ONE learning rate, so one learning rate is used for updating all the weights and biases for errors from all the datapoints.
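
The effect of the learning rate is easy to see on a toy error curve. This sketch assumes the made-up error function E(w) = (w − 3)², whose minimum sits at w = 3; the helper name `gradient_descent` and the two learning rates are illustrative choices.

```python
def error_derivative(w):
    # derivative of the toy error curve E(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    history = [w]
    for _ in range(steps):
        # w := w - (learning rate) * dE/dw
        w = w - learning_rate * error_derivative(w)
        history.append(w)
    return history

# Using the pure derivative (learning rate of 1) overshoots the minimum
# and bounces back and forth between 0 and 6 forever.
no_scaling = gradient_descent(0.0, 1.0, 4)

# A learning rate between 0 and 1 takes smaller steps and settles near w = 3.
small_steps = gradient_descent(0.0, 0.1, 50)

print(no_scaling)
print(small_steps[-1])
```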

Woah, that was a long section… here’s a recap:

  1. Gradient descent is the logical and mathematical operation behind backpropagation, where the weights and biases (the controllers of the model’s outcome) are tailored every time feedforward occurs to lower the model’s inaccuracy.
  2. Applying gradient descent involves (1) finding the error of the model, (2) finding the derivative of that error to find how the error is changing with respect to the weights, and thus if each weight should be increased or decreased and by how much, and (3) using the derivative to update the weights to reduce the error.
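  3. A common error function used is cross entropy, defined by E = −(1/m) · Σ_i Σ_j y_ij · log(ŷ_ij), where m is the number of datapoints, y_ij is the one-hot value for datapoint i and category j, and ŷ_ij is the model’s predicted probability for that category.

4. To find the derivative of E with respect to a weight w, we use the chain rule to multiply the derivatives of the values between E and w (ŷ, h, h1): dE/dw = dE/dŷ · dŷ/dh · dh/dh1 · dh1/dw.

5. When updating the weights, the derivative of the error is subtracted from the original weight, because you’re trying to lower the error. Before the subtraction, the derivative is multiplied by the learning rate (which controls how dramatically the weights are modified at each epoch, or round of weight updating for the whole dataset).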

TL;DR

  1. Machine learning is a type of algorithm that extracts patterns from a set of inputs and uses them to predict something about a new input.
  2. Deep learning (DL) is a subset of machine learning for neural networks with multiple hidden layers. A neural network is the code behind a machine learning model, and consists of multiple layers — the input layer, hidden layers, and the output layer. DL algorithms can handle unlabeled data and can determine which patterns to look for by themselves with little human intervention.
  3. The basic building block of a neural network (NN) is a perceptron, which is a linear predictor that takes in several inputs, multiplies them by a set of weights, adds a bias to the product, and squeezes the result into a number between 0 and 1 by an activation function. Perceptrons make prediction lines that attempt to separate datapoints belonging to two or more different categories.
  4. A more complex and powerful algorithm can be made by stacking layers of perceptrons on top of each other to create a neural network. In an NN, the output of one perceptron is used as the input of the perceptron on the next layer, and therefore the linear predictions from the previous set of perceptrons are plugged into another linear relationship in the next layer, combining the linear predictions of the perceptrons into a non-linear prediction model.
  5. A NN algorithm consists of two parts — feedforward (running a set of inputs through all the perceptrons in the model to produce a corresponding output) and backpropagation (taking the output from the model to find the error of the model, how each weight and bias in the model contributed to that error, and updating the weights and biases to lower the error accordingly)
  6. A common way to measure the error of a model is called Class Entropy, where the log() of the probability predicted by the model of the correct category is summed up for all the datapoints, multiplied by -1 (to turn the measurement into the inaccuracy of the model instead of its accuracy), and divided by the total number of datapoints used to find the average inaccuracy per datapoint.
  7. To update each weight and bias, the derivative of the class entropy is found by the chain rule, multiplied to a constant called the learning rate, and subtracted from the original weight value.
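As a rough illustration of points 3 and 4, the sketch below runs feedforward through a tiny network of three perceptrons in plain Python. The weights and biases are hand-picked assumptions (not learned, and not taken from this article), chosen so that the two hidden perceptrons each draw one straight line and the output perceptron combines them into a pattern that no single line could separate.

```python
import math


def sigmoid(h):
    # Activation function: squeezes any number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-h))


def perceptron(inputs, weights, bias):
    # Multiply each input by its weight, add them up, add the bias, then activate.
    h = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(h)


def feedforward(x1, x2):
    # Hidden layer: two perceptrons, each a single linear boundary.
    h1 = perceptron([x1, x2], [20.0, 20.0], -10.0)   # fires if x1 OR x2 is on
    h2 = perceptron([x1, x2], [-20.0, -20.0], 30.0)  # fires unless both are on
    # Output layer: takes the hidden outputs as its inputs.
    return perceptron([h1, h2], [20.0, 20.0], -30.0)  # fires if h1 AND h2 fire


# The network outputs a high value only when exactly one input is on.
predictions = {(a, b): feedforward(a, b) for a in (0, 1) for b in (0, 1)}
for point, probability in predictions.items():
    print(point, round(probability, 3))
```

In a real model these weights would start out random and be found by backpropagation instead of being typed in by hand.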

Personal Remarks

Machine learning is becoming the next programming; it’s affecting our lives more than ever, and more and more industries are implementing, or will implement, machine learning in their systems. That makes it extremely beneficial for people to have some knowledge of and experience with its concepts and algorithms.

As a calculus lover (taking Calculus III right now), I’ve been immersed in the world of machine learning for the last couple of weeks, learning the mechanism behind neural networks (which is super cool, with all the applications of calculus and linear algebra in real action), how to build basic neural networks from scratch, and how to use frameworks like TensorFlow, PyTorch, and torchvision (a companion library of PyTorch). I’ve primarily been learning from the “Intro to Deep Learning with PyTorch” Udacity course (linked below), which I’m finding very well-developed and easy to follow.

For those of you trying to get into machine learning and deep neural networks (neural networks with 2 or more hidden layers): before you start building machine learning algorithms, make sure that you have a sufficient level of experience with Python. Look at this Codecademy course and follow along with this hangman tutorial to get to that level (it should probably take you 1–2 weeks to complete if you invest around 7 hours a week).

If you want to go further…

Hi! My name is Yelim, and I’m an innovator and builder passionate about AI, tissue engineering, and soft robotics. If you enjoyed this article, check out my socials below or connect with me through email: yelim.kim0229@gmail.com.

I’ll be writing about a mini AI project that I made recently to grasp the basics of ML, so make sure to come back for that! I look forward to seeing you again, and please leave a clap or comment on this article (bonus point if you do both)!

LinkedIn | YouTube | Personal Website | Instagram | Twitter

