For reasons discussed previously I believe that every scientific measurement lives on a finite sample set. But it is tiresome to work with enormous explicit finite sample sets, like, for example, the actual values that a 64 bit IEEE floating point number can take on... they're not actually evenly spaced, for example. What we tend to do is deal with discrete sample spaces with explicit values when the set is small enough (2 or 10 or 256 or something like that) and deal with "continuous" distributions as approximations when there are lots of values and the finite set of values are close enough together (for example a voltage measured by a 24 bit A/D converter in which the range 0-1V is represented by the numbers 0-16777215, so that the interval between sample values is about 0.06 micro-volts, which corresponds to 0.06 micro-amps for a microsecond into a microfarad capacitor, or around 374,000 electrons).
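(Just to check those constants, here's the arithmetic as a few lines of Python; the numbers are the ones quoted above, nothing new.)

```python
# Back-of-envelope check of the A/D converter numbers quoted above.
dV = 1.0 / 2**24                        # one code step over a 0-1V range, ~5.96e-8 V
C = 1e-6                                # one microfarad
charge = C * dV                         # charge needed to move the voltage one step
print(dV * 1e6, "microvolts")           # ~0.06
print(charge / 1.602e-19, "electrons")  # ~3.7e5, in line with the rough figure above
```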
Because of this, the nonstandard number system of IST corresponds pretty well to what we're typically doing. Suppose for example x ~ normal(0,1) in a statistical model. We can pick a large enough number, like 10, and a small enough number, like 0.000001, and grid out all the individual values between -10 and +10 in steps of 0.000001, and very rarely is anyone going to have a problem with this discrete distribution instead of the normal one. Anyone who does have a problem should remember that we're free to choose a smaller grid, and that their normal RNG might be giving them single precision floating point numbers that have 24 bit mantissas anyway... IST formalizes this by some stuff (axioms, lemmas, etc.) that proves the existence, in IST, of an infinitesimal number that is so small no "standard" math could distinguish it from zero, and yet it isn't zero.
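To make this concrete, here's a small sketch of mine (not part of the original argument) that snaps draws from a continuous normal RNG onto exactly that grid and checks that nothing you'd care about statistically can tell the difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from a continuous normal, then snap to the grid described above:
# values between -10 and +10 in steps of 1e-6.
dx = 1e-6
z = rng.normal(0.0, 1.0, size=1_000_000)
z_grid = np.clip(np.round(z / dx) * dx, -10.0, 10.0)

# The rounding error is at most dx/2, so any summary statistic agrees
# to about 6 decimal places.
print(np.max(np.abs(z - z_grid)))                    # <= 5e-7
print(z.mean() - z_grid.mean(), z.std() - z_grid.std())
```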
So, now we could say we have the problem of picking a distribution to represent some data, and we know only that the data has mean 0 and standard deviation 1. We appeal to the idea that we'd like to maximize a measure of uncertainty conditional on mean 0 and standard deviation 1. In discrete outcomes, there's an obvious choice of uncertainty metric, it's one of the entropies

$$S = -\sum_i p_i \log(p_i)$$
Where the free choice of logarithm base is equivalent to a free choice of a scale constant, which is why I say "entropies" above. Informally, since the log of a number between 0 and 1 (a probability) is always negative, the negative of the log is positive. The smaller you make each of the $p_i$ values, the bigger you make each of the $-\log(p_i)$ values. So maximizing the entropy is like pushing down on all the probabilities. The fact that total probability stays equal to 1 limits how hard you can push down, so that in the end the total probability is spread out over more and more of the possible outcomes. If there are no constraints, all the probabilities become equal (the uniform distribution). Other constraints limit how hard you can push down in certain areas (ie. if you want a mean of 0 you probably can't push the whole range around 0 down too hard) so you wind up with more "lumpy" distributions or whatever depending on your constraints.
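A toy illustration of the "pushing down" picture, on a four-outcome sample space (the particular numbers are just mine, for illustration):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p)).sum()

# With only the "sums to 1" constraint, the uniform distribution wins: any
# attempt to push one probability down forces another one up, and the
# entropy drops.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~ 1.386, the maximum on 4 outcomes
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~ 0.94, lumpier and lower
```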
The procedure for maximizing this sum subject to the constraints is detailed elsewhere. The basic technique is to take a derivative with respect to each of the $p_i$ values and set all the derivatives equal to 0. To add the constraints, you use the method of Lagrange multipliers. The result would be each

$$p_i = \exp(-\lambda_0 - \lambda_1 x_i - \lambda_2 x_i^2)$$

and the $\lambda_1, \lambda_2$ will depend on the constraints (mean 0, standard deviation 1) in our case, and the $\lambda_0$ chosen to normalize the total probability to 1.
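Here's a rough numerical sketch of that calculation on a coarse stand-in grid (dx = 0.01 rather than anything infinitesimal), solving for the multipliers via the usual convex dual instead of symbolically; the $\lambda_0$-style normalization is handled by dividing by the sum at the end. The point is just that the recovered $p_i$ really is $\exp(-x^2/2)$ up to normalization, i.e. a discretized standard normal:

```python
import numpy as np
from scipy.optimize import minimize

# Coarse stand-in for the grid: maximize entropy over {x_i} subject to
# sum p_i = 1, sum p_i x_i = 0, sum p_i x_i^2 = 1.
dx = 0.01
x = np.arange(-10, 10 + dx, dx)

def dual(lam):
    # Convex dual of the constrained problem: the maximizer has the form
    # exp(-lam0 - lam1*x - lam2*x^2); lam0 is fixed by normalization, so
    # we only need to search over (lam1, lam2).
    lam1, lam2 = lam
    logZ = np.log(np.sum(np.exp(-lam1 * x - lam2 * x**2)))
    return logZ + lam1 * 0.0 + lam2 * 1.0  # target mean 0 and E[x^2] = 1

lam1, lam2 = minimize(dual, x0=[0.0, 0.1], method="Nelder-Mead").x
p = np.exp(-lam1 * x - lam2 * x**2)
p /= p.sum()

print(lam1, lam2)                        # roughly 0 and 0.5, i.e. exp(-x^2/2)
print((p * x).sum(), (p * x**2).sum())   # roughly 0 and 1
print(np.abs(p - np.exp(-x**2 / 2) * dx / np.sqrt(2 * np.pi)).max())  # tiny
```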
Now, suppose you want to work with a "continuous" variable. In nonstandard analysis we can say that our model is that the possible outcomes are on an infinitesimal grid with grid size $dx$ and constrained to be between the values $-N\,dx$ and $N\,dx$ for $N$ a nonstandard integer. So the possible values are $x_i = -N\,dx + i\,dx$ for all the $i$ values between 0 and $2N$. We define a nonstandard probability density function $p(x)$ to be a constant over each interval of length $dx$, and the probability to land at the grid point in the center (or left side or some fixed part) of the interval is $p(x_i)\,dx$.
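In finite, runnable form (with dx merely small and N merely large, as stand-ins for infinitesimal and nonstandard), the construction looks like this:

```python
import numpy as np

# Finite stand-in for the infinitesimal grid: x_i = -N*dx + i*dx for i = 0..2N,
# with the "density" p constant over each cell and cell probability p(x_i)*dx.
dx, N = 1e-3, 10_000                         # so the domain is [-10, 10]
x = -N * dx + dx * np.arange(2 * N + 1)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # any limited density works here
print((p * dx).sum())                        # ~1: a legitimate discrete distribution
```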
Now we calculate the nonstandard entropy

$$S = -\sum_i p(x_i)\,dx \log(p(x_i)\,dx)$$

Now clearly the argument to the logarithm is infinitesimal since $p(x_i)$ is limited and $dx$ is infinitesimal, so $-\log(p(x_i)\,dx)$ is nonstandard (very very large and positive). But, it's a perfectly good number. There is a finite number of terms in the sum so the sum is well defined. The value of the sum is of course a nonstandard number, but we could ask, how to set the $p(x_i)$ values such that the sum achieves its largest (nonstandard) value. Clearly the answer is going to be the same kind of expression as before, because we're doing the same calculation (hand waving goes here, feel free to formalize this in the comments) so we're going to wind up with:

$$p(x_i) = \exp(-\lambda_0 - \lambda_1 x_i - \lambda_2 x_i^2)$$

Where $p(x)$ refers to the nonstandard function which is constant over each interval, the standardization of this $p(x)$ is going to be the usual normal distribution.
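You can see the "very very large and positive" behavior numerically: the grid entropy of a (discretized) standard normal grows like -log(dx) as the grid is refined, while the maximizing shape stays put. A quick check, again with small-but-finite dx standing in for the infinitesimal:

```python
import numpy as np

def grid_entropy(dx, a=10.0):
    # -sum q_i log q_i where q_i = p(x_i) * dx are the cell probabilities
    # of a standard normal density on a grid of spacing dx over [-a, a].
    x = np.arange(-a, a, dx) + dx / 2
    q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * dx
    q /= q.sum()
    return -(q * np.log(q)).sum()

for dx in [1e-1, 1e-2, 1e-3, 1e-4]:
    # The differential entropy of N(0,1) is 0.5*log(2*pi*e) ~ 1.4189; the
    # grid entropy exceeds it by about -log(dx), which is what "unlimited
    # for infinitesimal dx" looks like at finite precision.
    print(dx, grid_entropy(dx), grid_entropy(dx) + np.log(dx))
```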
The point is, just because the entropy is nonstandard doesn't mean it doesn't have a maximum, and so long as the maximum occurs for some function of x whose standardization exists, we can take that standard probability density as the maximum entropy result we should use. This procedure is justified in large part because the continuous function is being used to approximate a grid of points anyway!
If you don't like this result, you could always use the relative entropy (ie. replace the logarithm expression with $\log\!\left(\frac{p(x_i)\,dx}{q(x_i)\,dx}\right)$, relative to a nonstandard uniform distribution $q$ whose height is $1/(2N\,dx)$ across the whole domain $[-N\,dx, N\,dx]$). This seems to be the concept referred to by Jaynes as the limiting density of discrete points. Then, the $dx$ values in the logarithm cancel, and the entropy value itself isn't nonstandard, but the distribution $q(x)$ is, so it's still a nonstandard construct. Since $q(x)$ is just a constant anyway, it's basically just saying that by rescaling the argument of the logarithm by a nonstandard constant, we can recover a standard entropy to be maximized. But... and this is key, we are never USING the numerical entropy value itself, except as a means to pick out a probability density which turns out to have a perfectly well defined standardization, namely the normal distribution.
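And here's the relative-entropy version of the same check: measured against the uniform distribution on the same grid, the dx inside the logarithm cancels and the value stabilizes as the grid is refined (the particular half-width a = 10 is just my stand-in choice for the domain):

```python
import numpy as np

def entropy_relative_to_uniform(dx, a=10.0):
    # -sum q_i log(q_i / u_i), with q_i the standard-normal cell probabilities
    # and u_i the uniform distribution over the same grid. The dx inside the
    # logarithm cancels, so unlike the raw grid entropy this stays limited.
    x = np.arange(-a, a, dx) + dx / 2
    q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * dx
    q /= q.sum()
    u = np.full_like(q, 1.0 / q.size)
    return -(q * np.log(q / u)).sum()

for dx in [1e-1, 1e-2, 1e-3, 1e-4]:
    # Stabilizes near 0.5*log(2*pi*e) - log(2*a) ~ -1.577, independent of dx.
    print(dx, entropy_relative_to_uniform(dx))
```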