What is the derivative function used in backpropagation?














I'm learning AI, but this confuses me: is the derivative function used in backpropagation the derivative of the activation function or the derivative of the loss function?

These terms are confusing: the derivative of the activation function, the partial derivative with respect to the loss function?

I'm still not getting it right.










      backpropagation activation-function loss-functions






asked Dec 18 '18 at 7:59 by datdinhquoc









          2 Answers



















In back propagation, both the derivative of the loss function and the derivative of the activation function are used for error minimization.

• The derivative of the loss function is used to compute the gradients between the last hidden layer and the output layer.

• The derivative of the activation function is used to compute the gradients of all layers except the output layer.

The weights from a layer get activated in the next layer, so in this scenario the derivative of the activation function is used.

The weights from the last hidden layer get activated in the output layer, where the loss function is evaluated, so here the derivative of the loss function is used.
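For concreteness, here is a minimal NumPy sketch (illustrative code and values of my own, not the poster's) of a one-hidden-layer network with a sigmoid hidden activation, a linear output, and squared error. The loss derivative enters at the output layer; the activation derivative enters when the error is propagated back to the hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                         # one sample with 4 features
y = np.array([[1.0]])                               # target

W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))  # hidden-layer parameters
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))  # output-layer parameters

# forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2                                # linear output unit

# backward pass
dL_dyhat = y_hat - y                                # derivative of 0.5*(y_hat - y)^2: the loss derivative
dW2 = dL_dyhat @ a1.T                               # gradient for the output-layer weights
delta1 = (W2.T @ dL_dyhat) * a1 * (1 - a1)          # sigmoid'(z1) = a1*(1 - a1): the activation derivative
dW1 = delta1 @ x.T                                  # gradient for the hidden-layer weights
```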






answered Dec 18 '18 at 8:36 by Shubham Panchal (edited Dec 18 '18 at 18:26 by jazib jamil)
• Please learn the math and correct the answer. It is too important a question to leave incorrect. – FauChristian, Dec 18 '18 at 10:53

• @FauChristian I can't even understand the question, and the answer is also incomprehensible... What exactly is the OP trying to mean or know? – DuttaA, Dec 18 '18 at 11:14

• @datdinhquoc wants to know the difference between the partial derivative of the loss function and the activation function which are used in back propagation. – Shubham Panchal, Dec 18 '18 at 13:45




















          Overview



The derivatives of functions are used to determine what changes to input parameters correspond to a desired change in output, at any given point in the forward propagation and in the evaluation of the cost, loss, or error (whatever quantity the learning process is conceptually attempting to minimize). This is the conceptual and algebraic inverse of maximizing valuation, yield, or accuracy.



Back-propagation estimates the next best step of a search toward the objective quantified in the cost function. The result of the search is a set of parameter matrices, each element of which represents what is sometimes called a connection weight. Improving the values of these elements in pursuit of minimal cost is the artificial network's basic approach to learning.



Each step is an estimate because the cost function is a finite difference, whereas the partial derivatives express the slope of a hyperplane tangent to the surfaces representing the functions that make up forward propagation. The goal is to set up circumstances so that successive approximations approach the ideal represented by minimization of the cost function.
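As a toy illustration of such successive approximations (my own example with made-up numbers, not part of the answer), repeated steps along the negative derivative of a one-parameter cost approach its minimizer:

```python
# cost(p) = (p - 3)^2 has its minimum at p = 3; each update moves p along the
# negative derivative, so the iterates are successive approximations of 3.
p, alpha = 0.0, 0.1          # initial parameter and learning rate (illustrative values)
for _ in range(50):
    grad = 2.0 * (p - 3.0)   # exact derivative of the cost at the current point
    p -= alpha * grad        # incremental parameter improvement
print(p)                     # approximately 3.0
```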



          Back-propagation Theory



Back-propagation is a scheme for distributing a correction signal, which arises from the cost evaluation after each sample or mini-batch of samples. With a form of Einsteinian notation, the current convention for distributive, incremental parameter improvement can be expressed concisely.



$$ \Delta P = \dfrac {c(\vec{o}, \vec{\ell}) \; \alpha} {\big[ \prod^+ \! P \big] \; \big[ \prod^+ \! a'(\vec{s} \, P + \vec{z}) \big] \; \big[ c'(\vec{o}, \vec{\ell}) \big]} $$



The plus sign in $\prod^+$ designates that the factors multiplied must be downstream in the forward signal flow from the parameter matrix being updated.



In sentence form: $\Delta P$ at any layer is the quotient of the cost function $c$ (given the label vector $\vec{\ell}$ and the network output signal $\vec{o}$), attenuated by the learning rate $\alpha$, over the product of all the derivatives leading up to the cost evaluation. The multiplication of these derivatives arises through recursive application of the chain rule.



          It is because the chain rule is a core method for feedback signal evaluation that partial derivatives must be used. All variables must be bound except for one dependent and one independent variable for the chain rule to apply.



The derivatives involved are of three types (a small numerical sketch follows the list).

• All layer input factors, the weights in the parameter matrix used to attenuate the signal during forward propagation, which are equal to the derivatives of those signal paths

• All the derivatives of the activation functions $a$, each evaluated at the matrix-vector product of the parameters and the signal at that layer plus the bias vector, $\vec{s} \, P + \vec{z}$

• The derivative of the cost function $c$, evaluated at the current output value $\vec{o}$ with the label $\vec{\ell}$
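Here is a minimal one-weight sketch (my own notation and values, not the answer's) showing the three derivative types multiplying together under the chain rule, with a finite-difference check of the resulting gradient:

```python
import numpy as np

# Single chain: z = w*x + b, a = tanh(z), cost c = 0.5*(a - label)^2.
# dc/dw is the product of the three derivative types listed above:
#   dz/dw = x                  (layer input factor / signal path)
#   da/dz = 1 - tanh(z)**2     (activation derivative)
#   dc/da = a - label          (cost derivative)
x, w, b, label = 0.7, 1.3, 0.1, 0.5

z = w * x + b
a = np.tanh(z)

grad_chain = (a - label) * (1.0 - np.tanh(z) ** 2) * x   # chain-rule product

# finite-difference check of the same gradient
eps = 1e-6
cost = lambda w_: 0.5 * (np.tanh(w_ * x + b) - label) ** 2
grad_fd = (cost(w + eps) - cost(w - eps)) / (2 * eps)

print(grad_chain, grad_fd)   # the two values agree to several decimal places
```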


          Answer to the Question



          Note that, as a consequence of the above, the derivatives of both the cost (or loss or error) function and any activation functions are necessary.



          Redundant Operation Removal for an Efficient Algorithm Design



Actual back-propagation algorithms save computing resources and time using three techniques (a sketch of the first two follows the list).




          • Temporary storage of the value used for evaluation of the derivative (since it was already calculated during forward propagation)

          • Temporary storage of products to avoid redundant multiplication operations (a form of reverse mode automatic differentiation)

          • Use of reciprocals because division is more costly than multiplication at a hardware level
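A small sketch (illustrative code under my own naming, not the answer's) of the first two techniques: the forward pass caches each layer's activation once, and the backward pass carries a single accumulated delta so that every derivative factor is multiplied exactly once, as in reverse-mode accumulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 1]                                         # layer widths (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]

x = rng.normal(size=(4, 1))
y = np.array([[1.0]])

# forward pass: cache the activations needed later to evaluate derivatives
acts = [x]
for W, b in zip(Ws, bs):
    acts.append(sigmoid(W @ acts[-1] + b))

# backward pass: delta accumulates the product of downstream derivatives,
# so each factor is multiplied exactly once
delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])           # loss and output activation derivatives
grads = []
for i in reversed(range(len(Ws))):
    grads.insert(0, delta @ acts[i].T)                       # gradient for layer i's weight matrix
    if i > 0:
        delta = (Ws[i].T @ delta) * acts[i] * (1 - acts[i])  # reuse cached activation derivative
```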


          In addition to these practical principles of algorithm design, other algorithm features arise from extensions of basic back-propagation. Mini-batch SGD (stochastic gradient descent) applies averaging to improve convergence reliability and accuracy in most cases, provided hyper-parameters and initial parameter states are well chosen. Gradual reduction of learning rates, momentum, and various other techniques are often used to further improve outcomes in deeper artificial networks.
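A sketch of those extensions on a toy one-parameter problem (hyper-parameter values are illustrative, not prescribed by the answer): mini-batch averaging of noisy gradients, a momentum term, and a gradually decaying learning rate.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5.0                                          # single parameter; the cost (p - 0)^2 is minimized at 0
velocity, momentum = 0.0, 0.9

for step in range(200):
    alpha = 0.1 / (1.0 + 0.01 * step)            # gradual learning-rate reduction
    noise = rng.normal(size=32)                  # mini-batch of noisy gradient samples
    g = (2.0 * p + noise).mean()                 # average gradient over the mini-batch
    velocity = momentum * velocity - alpha * g   # momentum update
    p += velocity
print(p)                                         # drifts toward the minimizer at 0
```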






answered Dec 22 '18 at 13:00 by Douglas Daseeco (edited Dec 22 '18 at 14:45)