Maximum Likelihood Estimation (MLE)
Many cost functions are the result of applying Maximum Likelihood. For instance, the Least Squares cost function can be obtained via Maximum Likelihood under the assumption of Gaussian noise; Cross-Entropy is another example, arising from Maximum Likelihood for Bernoulli or categorical outputs.
The likelihood of a parameter value (or vector of parameter values), θ, given outcomes x, is equal to the probability (density) assumed for those observed outcomes given those parameter values, that is

$$\mathcal{L}(\theta \mid x) = P(x \mid \theta)$$
The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques.
In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems.
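As a concrete illustration (a minimal sketch not taken from the original notes; the data and function names are made up), the MLE of a Gaussian's mean and standard deviation can be recovered either by numerically maximizing the log-likelihood or from the closed-form sample statistics:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # observed outcomes

def neg_log_likelihood(params, data):
    """Negative Gaussian log-likelihood; minimizing it maximizes the likelihood."""
    mu, log_sigma = params          # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Closed-form MLE for comparison: sample mean and (biased) sample standard deviation
print(mu_hat, sigma_hat)
print(x.mean(), x.std())
```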
Cross-Entropy
Cross-entropy can be used to define the loss function in machine learning and optimization. The true probability $p_i$ is the true label, and the given distribution $q_i$ is the predicted value of the current model:

$$H(p, q) = -\sum_i p_i \log q_i$$

For binary logistic regression with targets $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$, this yields the cross-entropy error function

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
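A minimal numpy sketch of the binary cross-entropy above (function name and example values are illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps avoids taking log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.99])
print(binary_cross_entropy(y_true, y_pred))  # small value for good predictions
```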
Logistic
The logistic loss function, for a label $y \in \{-1, +1\}$ and classifier score $f(x)$, is defined as:

$$V(f(x), y) = \frac{1}{\ln 2} \ln\left(1 + e^{-y f(x)}\right)$$
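A short numpy sketch (the $1/\ln 2$ factor only rescales the loss to base-2 logarithms; `np.logaddexp` is used for numerical stability):

```python
import numpy as np

def logistic_loss(y, score):
    """Logistic loss for labels y in {-1, +1} and raw classifier scores."""
    return np.logaddexp(0.0, -y * score) / np.log(2)

print(logistic_loss(np.array([1, -1]), np.array([2.0, -3.0])))
```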
Quadratic
The use of a quadratic loss function is common, for example when using least-squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is $t$, then a quadratic loss function is:

$$\lambda(x) = C (t - x)^2$$

for some constant $C$; setting $C = 1$ gives the familiar squared error $(t - x)^2$.
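Averaged over a dataset this is the mean squared error; a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the quadratic loss averaged over all samples."""
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.3])))
```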
0-1 Loss
In statistics and decision theory, a frequently used loss function is the 0-1 loss function:

$$L(\hat{y}, y) = I(\hat{y} \neq y)$$

where $I$ is the indicator function.
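In numpy the averaged 0-1 loss is simply the misclassification rate (a sketch with illustrative values):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified samples (mean of the 0-1 loss)."""
    return np.mean(y_true != y_pred)

print(zero_one_loss(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))  # 0.5
```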
Hinge Loss
The hinge loss is a loss function used for training classifiers. For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as:

$$\ell(y) = \max(0, 1 - t \cdot y)$$
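A minimal numpy sketch, averaging the hinge loss over a batch:

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss for labels t in {-1, +1} and raw classifier scores y."""
    return np.mean(np.maximum(0.0, 1.0 - t * y))

print(hinge_loss(np.array([1, -1, 1]), np.array([0.8, -2.0, -0.3])))
```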
Exponential
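The exponential loss commonly used for classification (for example by AdaBoost) is

$$V(f(x), y) = e^{-\beta y f(x)}$$

for some $\beta > 0$, with labels $y \in \{-1, +1\}$.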
Hellinger Distance
It is used to quantify the similarity between two probability distributions. It is a type of f-divergence.
To define the Hellinger distance in terms of measure theory, let P and Q denote two probability measures that are absolutely continuous with respect to a third probability measure λ. The square of the Hellinger distance between P and Q is defined as the quantity

$$H^2(P, Q) = \frac{1}{2} \int \left( \sqrt{\frac{dP}{d\lambda}} - \sqrt{\frac{dQ}{d\lambda}} \right)^2 d\lambda$$
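For discrete distributions this reduces to $H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_i \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$; a minimal numpy sketch:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print(hellinger(p, q))  # 0 for identical distributions, at most 1
```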
Kullback–Leibler Divergence
It is a measure of how one probability distribution diverges from a second, expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.
Discrete:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

Continuous:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
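A sketch of the discrete form in numpy (the small `eps` guard is an implementation convenience, not part of the definition):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(P || Q); eps guards against log(0) and division by zero."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print(kl_divergence(p, q))  # 0 only when P == Q; note it is not symmetric
```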
Itakura–Saito distance
The Itakura–Saito distance is a measure of the difference between an original spectrum $P(\omega)$ and an approximation $\hat{P}(\omega)$ of that spectrum. Although it is not a perceptual measure, it is intended to reflect perceptual (dis)similarity:

$$D_{IS}(P, \hat{P}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{P(\omega)}{\hat{P}(\omega)} - \log \frac{P(\omega)}{\hat{P}(\omega)} - 1 \right] d\omega$$
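A sketch of a discretized, element-wise version of this divergence (as used, for example, in non-negative matrix factorization); the spectra here are illustrative positive vectors:

```python
import numpy as np

def itakura_saito(p, p_hat):
    """Itakura-Saito divergence between two positive spectra, averaged over bins."""
    ratio = p / p_hat
    return np.mean(ratio - np.log(ratio) - 1.0)

p = np.array([1.0, 2.0, 0.5])
p_hat = np.array([1.1, 1.8, 0.6])
print(itakura_saito(p, p_hat))  # 0 only when the spectra match exactly
```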