Subjectivism in Probability


Subjectivism in probability is the belief that probabilities correspond to degrees of belief. This more or less reduces to each subject (e.g. you) assigning a weight representing something like "believability," or "how likely I think it is to be true," to each proposition or judgment about the world, e.g. "it will rain tomorrow." What is a 'degree of belief'? The primary difference from classical logic, where a statement is simply true or false, is this: it makes no sense to say that one true judgment is more true than another true judgment, but it does make sense to say that one finds one possibly true judgment more believable than another. To my mind, the real starting point is a partial order on propositions in terms of believability.

What is a partial order? Many readers will already know, but it is just a way of saying that we have a collection of items and can sometimes say one is greater or smaller than another, with the usual consistency requirements: if A is greater than B and B is greater than C, then A is greater than C, that kind of thing. However, we may not be able to compare all items. For example, given two-dimensional vectors, we could say \((x, y) < (a, b)\) if \(x < a\) and \(y < b\). If the first component of one vector is greater than the first component of another, but the inequality is flipped for the second components, then the two are not comparable. Such situations are common in optimization when we have multiple objectives we are interested in achieving: if all objectives improve, we can definitely say the change was desirable, but otherwise establishing a preference is delicate and involves some weighting of different priorities. A linear order, by contrast, is a partial order in which every pair of elements is comparable.
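To make the componentwise order concrete, here is a minimal sketch (compare is a hypothetical helper written for this essay, not a library function); note the fourth possible outcome, which a linear order would never produce:

```python
# A minimal sketch of the componentwise (product) order on 2-D vectors.

def compare(u, v):
    """Compare two 2-D vectors componentwise.

    Returns 'less', 'greater', 'equal', or 'incomparable'.
    """
    (x, y), (a, b) = u, v
    if x == a and y == b:
        return "equal"
    if x <= a and y <= b:
        return "less"
    if x >= a and y >= b:
        return "greater"
    return "incomparable"  # the inequalities point in opposite directions

print(compare((1, 2), (3, 4)))  # less
print(compare((1, 4), (3, 2)))  # incomparable: 1 < 3 but 4 > 2
```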

The relevance of the distinction lies in this simple question: given any two possibly true judgments, can you always say that one is at least as likely as the other? I would contend that the answer is no. There are many pairs of propositions where your evaluation would be plain indifference, not an assertion of equality or of the superiority of either side. The distinction matters because assigning a single well defined numerical value to plausibilities is a way of representing their order relation, and it only works, or faithfully exists, if the propositions or judgments are linearly ordered by the plausibility relation.

Suppose you have two archers and you know a priori, from an extremely believable source, that one is much the superior of the other, but you don't know which is which. We label the archers one and two, and you are given no other information about them. I ask you whether it is more likely that archer one will outscore archer two in a scheduled test of their abilities. From the subjective point of view, you have no reason to say it is more or less plausible that the first archer will outshoot the second; but from the 'objective' point of view, you know (thanks to your believable source) that the two outcomes are almost certainly not equally plausible. There is a whole category of situations of this sort, where you know that some pair of judgments greatly differ in likelihood but you do not know which way the likelihoods swing. The difficulty is that you are forced to say that all of the outcomes are equally plausible while simultaneously knowing that this is highly unlikely to be so.

When considering the way probabilities interact with action and betting, I like the example of a weighted die. You are told beforehand that the die is weighted, but you do not know what the weighting is. Assuming you are betting solely on outcomes of the die rather than on its role in some larger game, the indifference principle forces you to give no individual face preference in the weighting. Hence, if you are forced to think probabilistically, you must, initially at least, treat the die more or less as though it were fair. You can't say any one weighting is more plausible than another, so you assign a uniform plausibility to all of the potential unfair weightings. But you can't act on that information whatsoever. You are essentially forced to use the one model you know to be least plausible as your estimate of what is actually going to occur.
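A small simulation makes the tension concrete. This is a sketch under one added assumption: that "uniform plausibility over weightings" is formalized as a flat Dirichlet prior over the six face probabilities. The average weighting is exactly the fair die, even though essentially every individual weighting is unfair:

```python
# Sketch: uniform uncertainty over weightings vs. the fair-die model.
# Assumes a flat Dirichlet(1,...,1) prior as the formalization of
# "no weighting is preferred" (an assumption, not the only choice).
import numpy as np

rng = np.random.default_rng(0)
weightings = rng.dirichlet(np.ones(6), size=100_000)  # candidate weighted dice

# Averaging over all candidate weightings recovers the fair die...
print(weightings.mean(axis=0))  # ~ [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

# ...even though a typical individual weighting is far from fair.
tv_from_fair = 0.5 * np.abs(weightings - 1/6).sum(axis=1)
print(tv_from_fair.mean())      # typical total-variation distance is well above 0
```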

There is, of course, not necessarily a strict contradiction here. Statements like "given the information I have access to, it is equally plausible that archer one or archer two will win" or "given the information I have access to, all outcomes of the die roll are equally plausible" do not strictly contradict the statement "it is highly implausible that, with respect to all the information I know must exist, archer one winning is as plausible as archer two winning." The simplest way to describe this situation is that plausibilities are conditioned on information. We know in these scenarios that information exists which would significantly alter the plausibilities we assign, but we do not have direct access to it.

In mathematical language, all of these examples describe situations where refining the σ-algebra representing our current knowledge (passing to a strictly larger σ-algebra) considerably changes the measurements we make. Conditioned on the parameters of the latent distribution of the weighted die, we will find the entropy of the roll to be considerably lower than what we'd expect from a fair die. However, our primary tool for assigning plausibilities in the absence of information is the indifference principle, which more or less forces us to choose a distribution satisfying some kind of symmetry (uniform, Bernoulli \(p=\frac{1}{2}\), and so on). Then we find that some information we have about the world contradicts the conclusions of our current model of plausibilities. The tension comes from the fact that we are conditioning on two different σ-algebras, each of which produces different expectations (and distributions) for, e.g., the die rolls. The 'contradiction' is resolved by asserting that the lower entropy we know the outcome should have is conditioned on information we don't possess, while the entropy we see is conditioned only on what we know. But can we talk about a canonical σ-algebra? The idea is that we have some random variables we have measured or seen, and we can condition on the σ-algebra they generate to see how each outcome affects our probabilistic estimates. Then can we justify our use of the indifference principle to assign plausibilities to otherwise incomparable propositions?
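To make the entropy claim concrete: if \(\Theta\) is the latent weighting of the die and \(X\) a single roll, and the prior over weightings is symmetric in the faces (so that the marginal law of \(X\) is uniform), then

\[ H(X) = \log 6 \;>\; E\big[H(X \mid \Theta)\big] = E\Big[-\sum_{i=1}^{6} \Theta_i \log \Theta_i\Big], \]

with strict inequality unless \(\Theta\) is almost surely the fair weighting. The coarse σ-algebra reports maximal entropy; the refined one does not.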

Suppose that I ask you whether or not the first archer's chance of winning is greater than the chance of a fair coin coming up heads. Answering this question either way more or less amounts to asserting that you know the hidden information determining which archer is superior. In my opinion, the most sensible answer is: I don't know. But then have we not refused to assign a plausibility relation to two distinct propositions, contradicting the linear character of the plausibility order we would need in order to represent plausibilities as numbers?

One thing which is rather interesting to consider, however, is the way our model of knowledge interacts with this situation. Our recall is stochastic; our collection of prior events is often lossy. We are always conditioning on fuzzy or abstracted forms of the information we have seen, which introduces uncertainty regarding the plausibilities themselves. If the expected uncertainty (over all of our ambiguities and lapses) about whether one thing is more plausible than another is very high, does it not make sense to consider the two incomparable? To me that seems fair. If you think like I do, then single-number probabilities are not a suitable model for plausibility relations: only sufficiently conditioned plausibilities become coherent enough to force a linear ordering, and with it the reduction to single numbers. To set the pedantry aside, I just do not find it reasonable to reduce subjective plausibility relations down to a linear order.

I would also like to add that, in my opinion, the entire model of human knowledge as conditioning on subalgebras is trickier than it seems. Realistically, we are always choosing our measurements: whether we go to place A or place B gives us different information about each at that time, and we can never get the information from the road not taken. Since we aren't really ascending a well defined filtration of a well defined "universal algebra" that converges to that universal algebra, we have to take the convergence of our conditioned random variables, as we ascend the chain of σ-algebras, on somewhat precarious faith. Conditioning a random variable on an increasing sequence of σ-algebras which converges to the full σ-algebra on which the variable was originally defined leads the conditional versions to converge back to the original variable. (This is called Lévy's theorem when the convergence is strong, though it is not easy to find online. I am familiar with it from Schervish, Theory of Statistics, in the first appendix; it is a fairly straightforward application of martingale theory to the process \(Y_n = E[X \mid S_n]\), where \(S_n \uparrow S\) in the sense that \(\sigma(\bigcup_n S_n) = S\) and \(X\) is \(S\)-measurable. The martingale convergence theorems additionally require \(X\) to be \(L^1\)-integrable.) But if we don't have a single ascending chain but an ascending tree of σ-algebras branching on our choices, do we have any guarantee of a unique limit? Basically, if there is a lot of information in the expectations against the full σ-algebra which we will never be able to access, do we have any guarantee that the algebras generated by the measurements we do take converge toward the same ultimate limits, regardless of our choices?
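Stated precisely, the upward form of Lévy's theorem reads:

\[ X \in L^1, \qquad S_1 \subseteq S_2 \subseteq \cdots, \qquad S = \sigma\Big(\bigcup_n S_n\Big) \quad \Longrightarrow \quad E[X \mid S_n] \rightarrow E[X \mid S] \ \text{almost surely and in } L^1. \]

In particular, when \(X\) is \(S\)-measurable the limit is \(X\) itself, which is the sense in which conditioning on ever finer information recovers the variable.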

If you cannot convince yourself that the answer to that question is "yes," then the final random variable we are converging to is somewhat a creature of our own making, built from what we choose to measure, rather than an a priori well determined reality we continually approximate. My personal resolution is that there are subalgebras shared across many of the branching algebras which we can hope to recover in the limit along many different chains of choices. In other words, there do exist many important things, things susceptible to empirical investigation, which are not so sensitive to our choices of measurement as to be impossible to recover without ambiguity. However, proving that something is of this kind is basically impossible; it is the sort of thing we take on faith. I think the comparison with noncommuting observables is interesting as well, but I will leave it as a passing mention for now.

The point of all this discussion regarding models of probability and plausibility is that I have rather serious philosophical reservations about the numerical representation of subjective probabilities. Of course, in specific situations of technical application many of these concerns disappear; in those cases, however, the distinction between the frequentist and subjective points of view is usually unimportant. To me, probability is more or less a way to talk about areas and volumes. It is a technical tool which can be applied under specific conditions, like any other technical tool. I think people are far too prone to applying probabilistic reasoning to situations that are ill-suited to it.

Beyond that, I find the notion of subjective plausibilities difficult in and of itself. I am not sure what they correspond to, what it would mean to say that A is more plausible than B, outside the context of decision theory. If I think A more plausible than B, and a given action is more profitable under A but less profitable than some alternative under B, I would prioritize the first action over the second in accordance with the difference in plausibility between A and B. Of course, in reality there are many competing judgments and plausibilities, so a single plausibility relation dominates a decision relatively rarely. If I try to imagine how plausibility relations emerge under this interpretation, regret is the only starting point I can think of. One regrets, in some circumstance, not having taken another action, and under abstraction of the original circumstance plausibilities begin to emerge. First comes the negative sentiment toward the outcome of the action taken; then an abstracted representation of the original circumstance, insofar as it relates to the decision, is formed. When the assumptions the action was predicated upon held true, the fault lies in the action, and we reconsider the plan of action. When the assumptions did not hold, we reconsider the probability or plausibility of each (abstracted) temporal relation. By abstracted I mean that the two halves of the causal relation can be quite general or far removed from the specific circumstance, merely separated in time.

If one takes a view like this, then it seems fair to conclude a few things about subjective plausibilities. On the first count, they do not stem from knowledge or beliefs but from outcomes and sentiments regarding outcomes (psychologically speaking). On the second count, they have less need for global consistency than we demand of them. If we consider plausibilities as essentially records of outcome-decision-hypothesis relations laid down by sentiments regarding outcomes, then it seems fair that highly distinct bodies of 'knowledge' could condition the plausibilities we actually produce, and that sentiment and many other intervening factors condition them as well. Viewed this way, plausibilities seem less like relations between propositions grounded in belief or information, and more like active products with distinct contexts of validity.

There are many issues one could raise regarding the ideas presented immediately prior. Is it not the case, for example, that many of the things I described should be seen as psychological or circumstantial failures (or accidental characteristics) of reason and inference, rather than anything essential to the character of plausibilities? Well, when you are appealing to the subjective and intrinsic sensibility of an idea to give it meaning, I do not think it is reasonable to "rationalize" the subjective contents, castrate them of the many features which make simple treatment in abstract reason difficult, and then call this castrated product the "true" notion of probability or plausibility. Personally, I am not sure I really find the need to believe in probabilities, and I definitely do not advocate their becoming something of a lifestyle, as they have for a lot of people. Probabilistic reasoning of that sort often seems to me more a way of making one's intuitive judgments and beliefs feel rational than anything quantitatively useful or meaningful, and the models which are useful are so simple as to be doubtfully characteristic of reality; but that is somewhat of a separate topic.

Mathematical Appendix

I figured it is maybe worth providing some mathematical context. The model I am roughly gesturing toward with regard to long-term inference and convergence goes like this. We have a stochastic process \(X_n\), and we have a second set of variables we are interested in deducing information about (e.g. probabilities of propositions, represented in this framework by indicator functions of events). We know that we are on a particular sample \(\omega\), giving \(X_n(\omega)\) as our string of observations, so in theory we could calculate our secondary variable \(Z\) by just evaluating it at \(\omega\). But the issue is that we don't actually have direct access to \(\omega\). So what we do is consider the sequence \(Z_n = E[Z \mid X_n, X_{n-1}, \ldots, X_1]\). The idea is that, under appropriate conditions on \(Z\) and the \(X_n\), this sequence will converge back to the well defined \(Z\), both almost surely and in the strong \(L^1\) sense, i.e. \(E[|Z - Z_n|] \rightarrow 0\).

A sequence of events is defined by the values we observe: \(S_n = \bigcap_{k=1}^{n} X_k^{-1}(X_k(\omega))\), with \(S_\infty = \bigcap_n S_n\). The value of \(Z_n\) given a set of observations is the average value of \(Z\) over \(S_n\), so if the convergence holds, the limit of the \(Z_n\) is the average value of \(Z\) as we descend to \(S_\infty\).

We can define the measurable closure of \(E\) as the set of points \(x\) for which any neighborhood (set) of nonzero measure around \(x\) has intersection of nonzero measure with \(E\), and the boundary of \(E\) as the set of points lying in both the closure of \(E\) and the closure of its complement. If the boundary has measure zero, then with \(Z = \chi_E\) we get convergence of \(Z_n\) to \(1\) or \(0\) whenever \(S_\infty\) is a one-point set not on the boundary of \(E\). The exceptional case is rather unlikely (it almost never occurs), since the boundary has measure zero; for \(S_\infty\) to land in it with appreciable probability, the law of the \(X_n\) would have to fail to be absolutely continuous with respect to the base measure. If these conditions are met, we can be sure we are getting closer and closer to the true value of \(Z\) at \(\omega\), which is the idea.

I did not detail how to calculate \(E[Z \mid \ldots]\), since it is an entirely separate topic and depends on the circumstance at hand. It could be, e.g., the success or failure of a trial for each trial \(X_n\), where the conditional expectation is an obvious average. In practice one can often prove convergence to a value in the essential range of \(Z\) relatively easily for technical applications. In theory the properties of the underlying probability space are completely opaque to us, but the events we are interested in ought to come from our observables in a manner which renders this unimportant (and makes the discussion about events somewhat roundabout, considering they really should be coming from more familiar random variables).
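To see the basic mechanism in the simplest case, here is a minimal simulation sketch under the classic Beta-Bernoulli assumptions (an illustrative choice, not the only one): \(Z\) is a latent success probability with a uniform prior, each \(X_n\) is a flip with bias \(Z\), and \(Z_n\) is just the posterior mean.

```python
# Sketch: Z_n = E[Z | X_1, ..., X_n] converging back to Z (Levy's theorem).
# Assumes the Beta-Bernoulli model: Z ~ Uniform(0,1), X_n ~ Bernoulli(Z).
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform()                   # the latent Z we never observe directly
x = rng.random(10_000) < z          # observations X_1, ..., X_N

# With a uniform (Beta(1,1)) prior, E[Z | X_1..X_n] = (1 + successes) / (2 + n).
n = np.arange(1, x.size + 1)
z_n = (1 + np.cumsum(x)) / (2 + n)

for k in (10, 100, 1_000, 10_000):
    print(f"n={k:6d}  Z_n={z_n[k-1]:.4f}  |Z - Z_n|={abs(z - z_n[k-1]):.4f}")
print(f"true Z = {z:.4f}")          # Z_n approaches this as n grows
```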
The issue raised in the article is that we get access to different \(X_n\) depending on our choices, in the sense that some \(X_n\) do not take on definite observed values but are only known to lie in a given set. Specifically, \(S_n\) becomes \(\bigcap_{k=1}^{n} X_k^{-1}(V_k)\), where \(V_k = V(u_{k-1})\) is a set determined by the choices \(u_{k-1}\) made up to index \(k\). There are more elegant ways to write this, but that is the idea. Integrating over an entire set in this way effectively conditions \(Z\) on coarser information: now \(Z_n = E[Z \mid X_n \in V_n, \ldots, X_1 \in V_1]\). The issue is essentially whether the limit of the \(Z_n\) exists and is unique over an admissible class of the \(V_n\), and that more or less comes down to the properties of the limiting σ-algebra generated by the events \(X_n \in V_n\).
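And a toy illustration of the choice-dependence (a sketch, again assuming a flat Dirichlet prior over die weightings, where the "coarse" observer only ever learns whether each roll lands among the three lowest faces):

```python
# Sketch: the limit of Z_n depends on which sets V_n we observe through.
# Z is the latent probability of face 1 of a weighted die; the 'exact'
# observer sees each face, the 'coarse' one only the event {face in {1,2,3}}.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(6))        # latent weighting; Z = theta[0]
rolls = rng.choice(6, size=50_000, p=theta)  # faces encoded as 0..5

n = rolls.size
# Exact observation: flat Dirichlet posterior, E[theta_0 | X_1..X_n].
z_exact = (1 + np.sum(rolls == 0)) / (6 + n)

# Coarse observation: we only learn 1{roll in {0,1,2}}. Only
# q = theta_0 + theta_1 + theta_2 is identified; the aggregated prior on q
# is Beta(3,3), and by symmetry E[theta_0 | events] = E[q | events] / 3.
low = np.sum(rolls < 3)
z_coarse = ((3 + low) / (6 + n)) / 3

print(f"true Z       = {theta[0]:.4f}")
print(f"exact limit  = {z_exact:.4f}")   # converges to Z itself
print(f"coarse limit = {z_coarse:.4f}")  # converges to (theta_0+theta_1+theta_2)/3
```

Both estimators converge, but toward different limits measurable with respect to different limiting σ-algebras; no amount of further coarse observation recovers \(Z\) itself.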