Asymmetric cost function in neural networks





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty{ margin-bottom:0;
}






up vote
2
down vote

favorite












I am trying to build a deep neural network based on asymmetric loss functions that penalizes underestimation of a time series. Preferably, by the use of the LINEX loss function (Varian 1975):
$ quad quad
L_{a,b}(y,hat{y}) = b(e^{-a(y-hat{y})} + a(y-hat{y}) - 1), quad quad quad text{with } a neq 0 text{ and } b>0
$



but I can't find any research papers where this is done, and only very few on other asymmetric loss functions as well.



The function is differentiable and gives reasonable results for values of a $approx0$ using neuralnet(), for which the loss function approximates a square error function, but very poor results for increasing values of a.



This might explain why there are not many papers on asymmetric loss functions in neural networks, but why does it perform so bad when the asymmetry becomes larger?



EDIT



With asymmetric loss functions, I mean loss functions that are biased and with different slopes for negative and positive error. Examples are given below.
A few examples of asymmetric loss functions



Concerning my network:
I used the neuralnet() package testing several options with 1 hidden layer for both sigmoid and tanh activation functions. At the end I used an identity function. At the LINEX loss function stated above, y is the desired output and $hat{y}$ the activation output from the network. I have min-max normalized all 8 inputs as well as outputs y.



With the statement




if a$approx$0, the loss function approximates a square error function
I mean that the form of the LINEX loss function looks similar to a squared error function (symmetric), see picture below for example of LINEX loss wih b = 1 and a = 0.001
enter image description here




To restate my question: is there more research known that works with asymmetric loss functions in neural networks (preferably the LINEX)? If not, why? Since it is widely used for other model types.










share|cite|improve this question
























  • What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
    – D.W.
    Nov 14 at 16:16










  • I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
    – Michieldo
    Nov 14 at 17:35










  • It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
    – D.W.
    Nov 14 at 18:33












  • Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
    – D.W.
    Nov 14 at 18:35












  • @D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
    – Michieldo
    Nov 14 at 20:20

















up vote
2
down vote

favorite












I am trying to build a deep neural network based on asymmetric loss functions that penalizes underestimation of a time series. Preferably, by the use of the LINEX loss function (Varian 1975):
$ quad quad
L_{a,b}(y,hat{y}) = b(e^{-a(y-hat{y})} + a(y-hat{y}) - 1), quad quad quad text{with } a neq 0 text{ and } b>0
$



but I can't find any research papers where this is done, and only very few on other asymmetric loss functions as well.



The function is differentiable and gives reasonable results for values of a $approx0$ using neuralnet(), for which the loss function approximates a square error function, but very poor results for increasing values of a.



This might explain why there are not many papers on asymmetric loss functions in neural networks, but why does it perform so bad when the asymmetry becomes larger?



EDIT



With asymmetric loss functions, I mean loss functions that are biased and with different slopes for negative and positive error. Examples are given below.
A few examples of asymmetric loss functions



Concerning my network:
I used the neuralnet() package testing several options with 1 hidden layer for both sigmoid and tanh activation functions. At the end I used an identity function. At the LINEX loss function stated above, y is the desired output and $hat{y}$ the activation output from the network. I have min-max normalized all 8 inputs as well as outputs y.



With the statement




if a$approx$0, the loss function approximates a square error function
I mean that the form of the LINEX loss function looks similar to a squared error function (symmetric), see picture below for example of LINEX loss wih b = 1 and a = 0.001
enter image description here




To restate my question: is there more research known that works with asymmetric loss functions in neural networks (preferably the LINEX)? If not, why? Since it is widely used for other model types.










share|cite|improve this question
























  • What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
    – D.W.
    Nov 14 at 16:16










  • I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
    – Michieldo
    Nov 14 at 17:35










  • It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
    – D.W.
    Nov 14 at 18:33












  • Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
    – D.W.
    Nov 14 at 18:35












  • @D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
    – Michieldo
    Nov 14 at 20:20













up vote
2
down vote

favorite









up vote
2
down vote

favorite











I am trying to build a deep neural network based on asymmetric loss functions that penalizes underestimation of a time series. Preferably, by the use of the LINEX loss function (Varian 1975):
$ quad quad
L_{a,b}(y,hat{y}) = b(e^{-a(y-hat{y})} + a(y-hat{y}) - 1), quad quad quad text{with } a neq 0 text{ and } b>0
$



but I can't find any research papers where this is done, and only very few on other asymmetric loss functions as well.



The function is differentiable and gives reasonable results for values of a $approx0$ using neuralnet(), for which the loss function approximates a square error function, but very poor results for increasing values of a.



This might explain why there are not many papers on asymmetric loss functions in neural networks, but why does it perform so bad when the asymmetry becomes larger?



EDIT



With asymmetric loss functions, I mean loss functions that are biased and with different slopes for negative and positive error. Examples are given below.
A few examples of asymmetric loss functions



Concerning my network:
I used the neuralnet() package testing several options with 1 hidden layer for both sigmoid and tanh activation functions. At the end I used an identity function. At the LINEX loss function stated above, y is the desired output and $hat{y}$ the activation output from the network. I have min-max normalized all 8 inputs as well as outputs y.



With the statement




if a$approx$0, the loss function approximates a square error function
I mean that the form of the LINEX loss function looks similar to a squared error function (symmetric), see picture below for example of LINEX loss wih b = 1 and a = 0.001
enter image description here




To restate my question: is there more research known that works with asymmetric loss functions in neural networks (preferably the LINEX)? If not, why? Since it is widely used for other model types.










share|cite|improve this question















I am trying to build a deep neural network based on asymmetric loss functions that penalizes underestimation of a time series. Preferably, by the use of the LINEX loss function (Varian 1975):
$ quad quad
L_{a,b}(y,hat{y}) = b(e^{-a(y-hat{y})} + a(y-hat{y}) - 1), quad quad quad text{with } a neq 0 text{ and } b>0
$



but I can't find any research papers where this is done, and only very few on other asymmetric loss functions as well.



The function is differentiable and gives reasonable results for values of a $approx0$ using neuralnet(), for which the loss function approximates a square error function, but very poor results for increasing values of a.



This might explain why there are not many papers on asymmetric loss functions in neural networks, but why does it perform so bad when the asymmetry becomes larger?



EDIT



With asymmetric loss functions, I mean loss functions that are biased and with different slopes for negative and positive error. Examples are given below.
A few examples of asymmetric loss functions



Concerning my network:
I used the neuralnet() package testing several options with 1 hidden layer for both sigmoid and tanh activation functions. At the end I used an identity function. At the LINEX loss function stated above, y is the desired output and $hat{y}$ the activation output from the network. I have min-max normalized all 8 inputs as well as outputs y.



With the statement




if a$approx$0, the loss function approximates a square error function
I mean that the form of the LINEX loss function looks similar to a squared error function (symmetric), see picture below for example of LINEX loss wih b = 1 and a = 0.001
enter image description here




To restate my question: is there more research known that works with asymmetric loss functions in neural networks (preferably the LINEX)? If not, why? Since it is widely used for other model types.







r deep-learning loss-functions






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited Nov 14 at 20:19

























asked Nov 14 at 11:12









Michieldo

515




515












  • What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
    – D.W.
    Nov 14 at 16:16










  • I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
    – Michieldo
    Nov 14 at 17:35










  • It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
    – D.W.
    Nov 14 at 18:33












  • Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
    – D.W.
    Nov 14 at 18:35












  • @D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
    – Michieldo
    Nov 14 at 20:20


















  • What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
    – D.W.
    Nov 14 at 16:16










  • I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
    – Michieldo
    Nov 14 at 17:35










  • It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
    – D.W.
    Nov 14 at 18:33












  • Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
    – D.W.
    Nov 14 at 18:35












  • @D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
    – Michieldo
    Nov 14 at 20:20
















What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
– D.W.
Nov 14 at 16:16




What are $y,hat{y}$? Can you clarify your notation? Is $hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLu; with a softmax; something else? What are the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $aapprox 0$, this function does not approximate squared error ($(y-hat{y})^2$ is very different from $e^{-a(y-hat{y})}$).
– D.W.
Nov 14 at 16:16












I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
– Michieldo
Nov 14 at 17:35




I assumed there would be a general reason for the lack of research papers. But y is the desired output and $hat{y}$ the activation output from the network. I tried several activation functions with an identity activation at the end. y lies between 0 and 1. For a≈0, the function actually takes the form of a squared error if you would plot it.
– Michieldo
Nov 14 at 17:35












It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
– D.W.
Nov 14 at 18:33






It's not clear what you mean by asymmetric, but I doubt your premise that there few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
– D.W.
Nov 14 at 18:33














Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
– D.W.
Nov 14 at 18:35






Please edit your question to incorporate this information (e.g., definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As far as similarity to squared loss, it just isn't true. If you showed your work and showed the plot you got perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
– D.W.
Nov 14 at 18:35














@D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
– Michieldo
Nov 14 at 20:20




@D.W., thank you for the extensive comments. I added more information, hope my question is more clear now.
– Michieldo
Nov 14 at 20:20










2 Answers
2






active

oldest

votes

















up vote
4
down vote














This might explain why there are not many papers on asymmetric loss functions.




That's not true. Cross-entropy is used as loss function in most classification problems (and problems that aren't standard classification, like for example autoencoder training and segmentation problems), and it's not symmetric.






share|cite|improve this answer





















  • My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
    – Michieldo
    Nov 14 at 11:48


















up vote
1
down vote













It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.



It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.



There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal for the regime of interest. For instance, the Taylor series approximation of the function $f(x) = b(e^{ax} + ax - 1)$ is $f(x) = -b + 2abx + frac12 a^2 b x^2 + O(x^3)$, so you could try training your network using the symmetric loss function $g(y,hat{y}) = -b + frac12 a^2 b (y-hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.



It is unusual to min-max normalize outputs of the network. I'm not even sure what that would involve. Also if you are using the sigmoid activation function, then your outputs should already be normalized to be within -1..1, so it is not clear why you are normalizing them.



It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLu.



There are many details to make a neural network train effectively, based on initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).






share|cite|improve this answer





















  • Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
    – Michieldo
    Nov 14 at 23:13










  • @Michieldo, sure, those are typically used for classification. Even so...
    – D.W.
    Nov 14 at 23:59











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f376935%2fasymmetric-cost-function-in-neural-networks%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
4
down vote














This might explain why there are not many papers on asymmetric loss functions.




That's not true. Cross-entropy is used as loss function in most classification problems (and problems that aren't standard classification, like for example autoencoder training and segmentation problems), and it's not symmetric.






share|cite|improve this answer





















  • My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
    – Michieldo
    Nov 14 at 11:48















up vote
4
down vote














This might explain why there are not many papers on asymmetric loss functions.




That's not true. Cross-entropy is used as loss function in most classification problems (and problems that aren't standard classification, like for example autoencoder training and segmentation problems), and it's not symmetric.






share|cite|improve this answer





















  • My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
    – Michieldo
    Nov 14 at 11:48













up vote
4
down vote










up vote
4
down vote










This might explain why there are not many papers on asymmetric loss functions.




That's not true. Cross-entropy is used as loss function in most classification problems (and problems that aren't standard classification, like for example autoencoder training and segmentation problems), and it's not symmetric.






share|cite|improve this answer













This might explain why there are not many papers on asymmetric loss functions.




That's not true. Cross-entropy is used as loss function in most classification problems (and problems that aren't standard classification, like for example autoencoder training and segmentation problems), and it's not symmetric.







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered Nov 14 at 11:24









Jakub Bartczuk

3,4421827




3,4421827












  • My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
    – Michieldo
    Nov 14 at 11:48


















  • My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
    – Michieldo
    Nov 14 at 11:48
















My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
– Michieldo
Nov 14 at 11:48




My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
– Michieldo
Nov 14 at 11:48












up vote
1
down vote













It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.



It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.



There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal for the regime of interest. For instance, the Taylor series approximation of the function $f(x) = b(e^{ax} + ax - 1)$ is $f(x) = -b + 2abx + frac12 a^2 b x^2 + O(x^3)$, so you could try training your network using the symmetric loss function $g(y,hat{y}) = -b + frac12 a^2 b (y-hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.



It is unusual to min-max normalize outputs of the network. I'm not even sure what that would involve. Also if you are using the sigmoid activation function, then your outputs should already be normalized to be within -1..1, so it is not clear why you are normalizing them.



It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLu.



There are many details to make a neural network train effectively, based on initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).






share|cite|improve this answer





















  • Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
    – Michieldo
    Nov 14 at 23:13










  • @Michieldo, sure, those are typically used for classification. Even so...
    – D.W.
    Nov 14 at 23:59















up vote
1
down vote













It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.



It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.



There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal for the regime of interest. For instance, the Taylor series approximation of the function $f(x) = b(e^{ax} + ax - 1)$ is $f(x) = -b + 2abx + frac12 a^2 b x^2 + O(x^3)$, so you could try training your network using the symmetric loss function $g(y,hat{y}) = -b + frac12 a^2 b (y-hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.



It is unusual to min-max normalize outputs of the network. I'm not even sure what that would involve. Also if you are using the sigmoid activation function, then your outputs should already be normalized to be within -1..1, so it is not clear why you are normalizing them.



It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLu.



There are many details to make a neural network train effectively, based on initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).






share|cite|improve this answer





















  • Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
    – Michieldo
    Nov 14 at 23:13










  • @Michieldo, sure, those are typically used for classification. Even so...
    – D.W.
    Nov 14 at 23:59













up vote
1
down vote










up vote
1
down vote









It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.



It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.



There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal for the regime of interest. For instance, the Taylor series approximation of the function $f(x) = b(e^{ax} + ax - 1)$ is $f(x) = -b + 2abx + frac12 a^2 b x^2 + O(x^3)$, so you could try training your network using the symmetric loss function $g(y,hat{y}) = -b + frac12 a^2 b (y-hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.



It is unusual to min-max normalize outputs of the network. I'm not even sure what that would involve. Also if you are using the sigmoid activation function, then your outputs should already be normalized to be within -1..1, so it is not clear why you are normalizing them.



It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLu.



There are many details to make a neural network train effectively, based on initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).






share|cite|improve this answer












It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.



It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.



There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal for the regime of interest. For instance, the Taylor series approximation of the function $f(x) = b(e^{ax} + ax - 1)$ is $f(x) = -b + 2abx + frac12 a^2 b x^2 + O(x^3)$, so you could try training your network using the symmetric loss function $g(y,hat{y}) = -b + frac12 a^2 b (y-hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.



It is unusual to min-max normalize outputs of the network. I'm not even sure what that would involve. Also if you are using the sigmoid activation function, then your outputs should already be normalized to be within -1..1, so it is not clear why you are normalizing them.



It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLu.



There are many details to make a neural network train effectively, based on initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered Nov 14 at 22:34









D.W.

2,59912244




2,59912244












  • Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
    – Michieldo
    Nov 14 at 23:13










  • @Michieldo, sure, those are typically used for classification. Even so...
    – D.W.
    Nov 14 at 23:59


















  • Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
    – Michieldo
    Nov 14 at 23:13










  • @Michieldo, sure, those are typically used for classification. Even so...
    – D.W.
    Nov 14 at 23:59
















Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
– Michieldo
Nov 14 at 23:13




Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss namely for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter for which I can put high weights on underestimation. Therefore, a Linex kind of loss function would be preferred.
– Michieldo
Nov 14 at 23:13












@Michieldo, sure, those are typically used for classification. Even so...
– D.W.
Nov 14 at 23:59




@Michieldo, sure, those are typically used for classification. Even so...
– D.W.
Nov 14 at 23:59


















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f376935%2fasymmetric-cost-function-in-neural-networks%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Сан-Квентин

8-я гвардейская общевойсковая армия

Алькесар