Asymmetric cost function in neural networks
I am trying to build a deep neural network with an asymmetric loss function that penalizes underestimation of a time series, preferably the LINEX loss function (Varian 1975):
$$L_{a,b}(y,\hat{y}) = b\left(e^{-a(y-\hat{y})} + a(y-\hat{y}) - 1\right), \qquad \text{with } a \neq 0 \text{ and } b > 0$$
but I can't find any research papers where this is done, and only very few that use other asymmetric loss functions.
The function is differentiable and gives reasonable results with neuralnet() for values of $a \approx 0$, for which the loss function approximates a squared error function, but very poor results for increasing values of $a$.
This might explain why there are not many papers on asymmetric loss functions in neural networks, but why does it perform so badly when the asymmetry becomes larger?
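For concreteness, a minimal sketch of the loss in R (the parameter values are arbitrary and only illustrate which side of the error gets the exponential penalty):

```r
# LINEX loss as defined above: L(y, yhat) = b * (exp(-a*(y - yhat)) + a*(y - yhat) - 1)
linex <- function(y, yhat, a, b = 1) {
  e <- y - yhat
  b * (exp(-a * e) + a * e - 1)
}

# With this parameterization, a negative a puts the exponential penalty on
# under-estimation (yhat < y) and only a roughly linear penalty on over-estimation:
linex(y = 1, yhat = 0.5, a = -2)   # under-estimate: ~0.72
linex(y = 1, yhat = 1.5, a = -2)   # over-estimate:  ~0.37
```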
EDIT
By asymmetric loss functions I mean loss functions that are biased, with different slopes for negative and positive errors. Examples are given below.
Concerning my network:
I used the neuralnet() package, testing several options with 1 hidden layer for both sigmoid and tanh activation functions; for the output layer I used the identity function. In the LINEX loss function stated above, $y$ is the desired output and $\hat{y}$ the activation output of the network. I have min-max normalized all 8 inputs as well as the outputs $y$.
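Roughly, the setup looks like the sketch below. This is an illustration, not verified code: whether neuralnet() accepts this particular custom err.fct, and in which argument order (prediction vs. target) it is called, should be checked against the package documentation; the data frame dat and its column names are placeholders.

```r
library(neuralnet)

# Placeholder data frame 'dat': target column y and 8 min-max scaled inputs x1..x8.
a <- -2   # asymmetry parameter (sign and size to be tuned)
b <- 1

# LINEX error in the form stated above. Assumption: neuralnet() accepts a custom
# differentiable err.fct; argument order and symbolic-differentiation constraints
# should be verified against the package documentation.
linex_err <- function(x, y) b * (exp(-a * (y - x)) + a * (y - x) - 1)

fit <- neuralnet(
  y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
  data          = dat,
  hidden        = 5,          # one hidden layer with 5 units (arbitrary)
  act.fct       = "tanh",     # or "logistic"
  err.fct       = linex_err,
  linear.output = TRUE        # identity activation on the output layer
)
```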
With the statement "if $a \approx 0$, the loss function approximates a squared error function" I mean that the shape of the LINEX loss function looks similar to a (symmetric) squared error function; see the picture below for an example of the LINEX loss with b = 1 and a = 0.001.
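The picture itself is not reproduced here, but it can be regenerated with base R; the dashed comparison curve, the second-order Taylor approximation $(a^2 b/2)\,e^2$ of the loss around zero error, is added for reference:

```r
a <- 0.001; b <- 1
linex_loss <- function(e) b * (exp(-a * e) + a * e - 1)   # e = y - yhat

curve(linex_loss, from = -3, to = 3, lwd = 2,
      xlab = "error e = y - yhat", ylab = "loss",
      main = "LINEX loss with b = 1, a = 0.001")
# Dashed curve: second-order Taylor approximation (a^2 * b / 2) * e^2
curve((a^2 * b / 2) * x^2, add = TRUE, lty = 2, col = "red")
```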
To restate my question: is there more research that works with asymmetric loss functions in neural networks (preferably the LINEX)? If not, why not, given that such losses are widely used for other model types?
r deep-learning loss-functions
asked Nov 14 at 11:12, edited Nov 14 at 20:19 – Michieldo
What are $y, \hat{y}$? Can you clarify your notation? Is $\hat{y}$ the output of the neural network? What is the last layer of your network? Does your network end with a ReLU; with a softmax; something else? What is the range of values of $y$? If you want us to help form guesses about why it isn't performing well, you'll need to edit the question to provide a lot more context. Also, when $a \approx 0$, this function does not approximate squared error ($(y-\hat{y})^2$ is very different from $e^{-a(y-\hat{y})}$).
– D.W.
Nov 14 at 16:16
I assumed there would be a general reason for the lack of research papers. But $y$ is the desired output and $\hat{y}$ the activation output of the network. I tried several activation functions, with an identity activation at the end. $y$ lies between 0 and 1. For $a \approx 0$, the function actually takes the form of a squared error if you plot it.
– Michieldo
Nov 14 at 17:35
It's not clear what you mean by asymmetric, but I doubt your premise that there are few or no papers that use asymmetric loss functions. You seem to start from an assumption (there are few papers that use asymmetric loss functions), then make a dubious inference (maybe my network is performing badly because its loss function is asymmetric), and both of those seem pretty sketchy to me. I would encourage you to question your assumptions.
– D.W.
Nov 14 at 18:33
Please edit your question to incorporate this information (e.g., the definition of your notation) into the question, so people don't have to read the comments to understand what you are asking. As for the similarity to squared loss, it just isn't true. If you showed your work and the plot you got, perhaps we could help you understand why it isn't. The function $e^{-ax}$ behaves very differently from the function $x^2$.
– D.W.
Nov 14 at 18:35
@D.W., thank you for the extensive comments. I added more information; I hope my question is clearer now.
– Michieldo
Nov 14 at 20:20
2 Answers
"This might explain why there are not many papers on asymmetric loss functions."
That's not true. Cross-entropy is used as the loss function in most classification problems (and in problems that aren't standard classification, such as autoencoder training and segmentation), and it is not symmetric.
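A small numerical illustration of that asymmetry, treating the target as a probability:

```r
# Binary cross-entropy, with the target y treated as a probability:
bce <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

bce(y = 0.7, p = 0.6)   # prediction 0.1 below the target: ~0.633
bce(y = 0.7, p = 0.8)   # prediction 0.1 above the target: ~0.639
# Equal-sized errors in opposite directions get different penalties, i.e. the loss is asymmetric.
```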
answered Nov 14 at 11:24 – Jakub Bartczuk
My apologies, I forgot to mention that I'm interested in a loss function where the level of asymmetry can be chosen. In my case, to avoid underestimation of demand.
– Michieldo
Nov 14 at 11:48
It's not correct that there are few papers that use an asymmetric loss function. For instance, the cross-entropy loss is asymmetric, and there are gazillions of papers that use neural networks with a cross-entropy loss. Same for the hinge loss.
It's not correct that neural networks necessarily perform badly if you use an asymmetric loss function.
There are many possible reasons why a neural network might perform badly. If you wanted to test whether your loss is responsible for the problem, you could replace your asymmetric loss with a symmetric loss that is approximately equal in the regime of interest. For instance, the second-order Taylor expansion of the function $f(x) = b(e^{ax} + ax - 1)$ around $x = 0$ is $f(x) = 2abx + \frac12 a^2 b x^2 + O(x^3)$, so you could try training your network with the symmetric loss function $g(y,\hat{y}) = \frac12 a^2 b (y-\hat{y})^2$ and see how well it works. I conjecture it will behave about the same, but that's something you could test empirically.
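A quick numerical sanity check of that expansion (the values of $a$, $b$ and the error range are arbitrary):

```r
a <- 0.5; b <- 1
f      <- function(x) b * (exp(a * x) + a * x - 1)          # loss as written in this answer
taylor <- function(x) 2 * a * b * x + 0.5 * a^2 * b * x^2   # second-order expansion around 0

x <- seq(-0.2, 0.2, by = 0.05)
max(abs(f(x) - taylor(x)))   # ~2e-4: the quadratic approximation is close for small errors
```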
It is unusual to min-max normalize the outputs of the network. I'm not even sure what that would involve. Also, if you are using the sigmoid activation function, then your outputs are already constrained to lie within 0..1 (or -1..1 for tanh), so it is not clear why you are normalizing them.
It is known that sigmoid and tanh activation functions often don't work that well; training can be very slow, or you can have problems with dead neurons. Modern networks usually use a different activation function, e.g., ReLU.
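A small illustration of the saturation issue, comparing sigmoid and ReLU gradients at a few points:

```r
sigmoid   <- function(x) 1 / (1 + exp(-x))
d_sigmoid <- function(x) sigmoid(x) * (1 - sigmoid(x))   # gradient of the sigmoid
d_relu    <- function(x) as.numeric(x > 0)               # gradient of ReLU (0 for x <= 0)

x <- c(-6, -2, 0, 2, 6)
rbind(sigmoid = d_sigmoid(x), relu = d_relu(x))
# The sigmoid gradient shrinks towards 0 as |x| grows (saturation, hence slow training),
# while the ReLU gradient stays 1 for all positive inputs.
```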
There are many details involved in making a neural network train effectively: initialization, the optimization algorithm, learning rates, network architecture, and more. I don't think you have any justification for concluding that the poor performance you are observing necessarily has anything to do with the asymmetry in your loss function. And a question here might not be the best way to debug your network (certainly, the information provided here isn't enough to do so, and such a question is unlikely to be of interest to others in the future).
answered Nov 14 at 22:34 – D.W.
Thank you, this helps a lot and I will work on all the points! For the part on the papers, correct me if I'm wrong since I'm still quite new to this subject, but aren't cross-entropy and hinge loss mainly for classification problems (I'm doing regression analysis)? I need to penalize underestimation heavily and would therefore prefer a parameter with which I can put a high weight on underestimation. Therefore, a LINEX kind of loss function would be preferred.
– Michieldo
Nov 14 at 23:13
@Michieldo, sure, those are typically used for classification. Even so...
– D.W.
Nov 14 at 23:59