What is the derivative function used in backpropagation?

I'm learning AI, and this confuses me: is the derivative used in backpropagation the derivative of the activation function or the derivative of the loss function?

These terms are confusing: the derivative of the activation function, or the partial derivative with respect to the loss function?

I'm still not getting it right.

backpropagation activation-function loss-functions

asked Dec 18 '18 at 7:59 by datdinhquoc

2 Answers

In backpropagation, both the derivative of the loss function and the derivative of the activation function are used for error minimization.

• The derivative of the loss function is used to compute the gradients between the last hidden layer and the output layer.

• The derivative of the activation function is used to compute the gradients of all layers except the output layer.

The weights of a hidden layer feed into the activation of the next layer, so for those layers the derivative of the activation function is used.

The weights of the last hidden layer feed into the output layer, whose output is evaluated by the loss function, so the derivative of the loss function is used there.
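
To make the distinction concrete, here is a minimal NumPy sketch (not from the original answer; the two-layer shapes, sigmoid hidden activation, linear output, and squared-error loss are illustrative assumptions) showing where each derivative enters a single backward pass:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: 3 inputs, 4 hidden units, 1 output.
    rng = np.random.default_rng(0)
    x = rng.normal(size=3)              # input vector
    y = np.array([1.0])                 # target label
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    # Forward pass (intermediate values kept for the backward pass).
    h = sigmoid(W1 @ x + b1)            # hidden activation
    o = W2 @ h + b2                     # linear output
    loss = 0.5 * np.sum((o - y) ** 2)   # squared-error loss

    # Backward pass.
    delta_out = o - y                                 # derivative of the LOSS, used at the output layer
    grad_W2 = np.outer(delta_out, h)
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)   # derivative of the ACTIVATION (sigmoid'), hidden layer
    grad_W1 = np.outer(delta_hidden, x)

In this sketch the derivative of the loss enters once, at the output, while a derivative of the activation enters once for the hidden layer, which is the split described above.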






answered Dec 18 '18 at 8:36 by Shubham Panchal, edited Dec 18 '18 at 18:26 by jazib jamil

– FauChristian (Dec 18 '18 at 10:53): Please learn the math and correct the answer. It is too important a question to leave incorrect.

– DuttaA (Dec 18 '18 at 11:14): @FauChristian I can't even understand the question, and the answer is also incomprehensible... What exactly is the OP trying to ask?

– Shubham Panchal (Dec 18 '18 at 13:45): @datdinhquoc wants to know the difference between the partial derivative of the loss function and the derivative of the activation function, both of which are used in backpropagation.




















Overview

The derivatives of functions are used to determine what changes to the input parameters correspond to a desired change in output, at any given point in forward propagation and in the evaluation of cost, loss, or error: whatever quantity the learning process is conceptually attempting to minimize. Minimizing it is the conceptual and algebraic inverse of maximizing valuation, yield, or accuracy.

Back-propagation estimates, within a search, the next best step toward the objective quantified in the cost function. The result of the search is a set of parameter matrices, each element of which is what is sometimes called a connection weight. Improving the values of these elements in pursuit of minimal cost is the artificial network's basic approach to learning.

Each step is an estimate because the cost function is a finite difference, whereas the partial derivatives express the slope of a hyper-plane tangent to the surfaces defined by the functions that comprise forward propagation. The goal is to set up circumstances so that successive approximations approach the ideal represented by minimization of the cost function.

Back-propagation Theory

Back-propagation is a scheme for distributing a correction signal, arising from cost evaluation, after each sample or mini-batch of samples. With a form of Einsteinian notation, the current convention for this distributive, incremental parameter improvement can be expressed concisely.



$$ \Delta P = \dfrac {c(\vec{o}, \vec{\ell}) \; \alpha} {\big[ \prod^+ \! P \big] \; \big[ \prod^+ \! a'(\vec{s} \, P + \vec{z}) \big] \; \big[ c'(\vec{o}, \vec{\ell}) \big]} $$

The plus sign in $\prod^+\!$ designates that the factors multiplied must be downstream in the forward signal flow from the parameter matrix being updated.



In sentence form: $\Delta P$ at any layer is the quotient of the cost function $c$ (given the label vector $\vec{\ell}$ and the network output signal $\vec{o}$), attenuated by the learning rate $\alpha$, over the product of all the derivatives leading up to the cost evaluation. The multiplication of these derivatives arises through recursive application of the chain rule.

It is because the chain rule is a core method for evaluating the feedback signal that partial derivatives must be used: for the chain rule to apply, all variables must be bound except one dependent and one independent variable.



The derivatives include three types (a small worked example combining them follows the list).

• All layer input factors, the weights in the parameter matrix used to attenuate the signal during forward propagation, which are equal to the derivatives of those signal paths

• All the derivatives of the activation functions $a$, each evaluated at the sum of the matrix-vector product of the parameters with the signal at that layer and the bias vector

• The derivative of the cost function $c$, evaluated at the current output value $\vec{o}$ with the label $\vec{\ell}$
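
As a worked example of how these factors arise through the chain rule (a scalar network with one unit per layer, chosen here purely for illustration), let $s_1 = p_1 x + z_1$, $h = a(s_1)$, $s_2 = p_2 h + z_2$, and $o = a(s_2)$. The chain rule then gives

$$ \dfrac{\partial c}{\partial p_1} = c'(o, \ell) \; a'(s_2) \; p_2 \; a'(s_1) \; x, $$

which contains exactly the three factor types listed above: the cost derivative, the activation derivatives, and the downstream weight $p_2$ (together with the local input $x$).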


Answer to the Question

Note that, as a consequence of the above, the derivatives of both the cost (or loss, or error) function and any activation functions are necessary.

Redundant Operation Removal for an Efficient Algorithm Design

Practical back-propagation algorithms save computing resources and time using three techniques (a rough sketch of the first two follows the list).




• Temporary storage of the values at which each derivative is evaluated (since they were already calculated during forward propagation)

• Temporary storage of products, to avoid redundant multiplication operations (a form of reverse-mode automatic differentiation)

• Use of reciprocals, because division is more costly than multiplication at the hardware level
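
As a rough Python sketch of the first two techniques (the function names, sigmoid activations, and squared-error output delta are my own illustrative assumptions, not part of the original answer), the forward pass can cache the values needed later, and the backward pass can carry a single delta vector so the product of downstream factors is never rebuilt from scratch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        # Cache every activation; sigmoid'(z) can be recovered later as a * (1 - a).
        cache = {"a0": x}
        a = x
        for i, (W, b) in enumerate(zip(weights, biases), start=1):
            a = sigmoid(W @ a + b)
            cache[f"a{i}"] = a
        return a, cache

    def backward(y, weights, cache):
        # One delta vector carries the accumulated product of all downstream factors.
        L = len(weights)
        a_out = cache[f"a{L}"]
        delta = (a_out - y) * a_out * (1.0 - a_out)   # cost derivative * output activation derivative
        grads = [None] * L
        for i in range(L, 0, -1):
            grads[i - 1] = np.outer(delta, cache[f"a{i-1}"])   # gradient for weights[i-1]
            if i > 1:
                a_prev = cache[f"a{i-1}"]
                # Downstream product reused from delta; only one new multiply per layer.
                delta = (weights[i - 1].T @ delta) * a_prev * (1.0 - a_prev)
        return grads

    # Illustrative usage: 3 inputs, 4 hidden units, 1 output.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
    biases = [np.zeros(4), np.zeros(1)]
    out, cache = forward(rng.normal(size=3), weights, biases)
    grads = backward(np.array([1.0]), weights, cache)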


In addition to these practical principles of algorithm design, other algorithm features arise from extensions of basic back-propagation. Mini-batch SGD (stochastic gradient descent) applies averaging to improve convergence reliability and accuracy in most cases, provided the hyper-parameters and initial parameter states are well chosen. Gradual reduction of the learning rate, momentum, and various other techniques are often used to further improve outcomes in deeper artificial networks.






answered Dec 22 '18 at 13:00 by Douglas Daseeco, edited Dec 22 '18 at 14:45












