How the impossibility of a universal learner can be seen as the non-existence of global sections of a 'learner' presheaf
The No-Free-Lunch theorem is a fundamental limitative result in the theory of machine learning. It says that there cannot be a universal learner. (We discuss the precise statement below.) I recently covered it in my course, and it made me wonder whether there are any links to another topic known as contextuality, i.e., situations where we have a family of data which is locally consistent, but globally inconsistent
In this post, I explore whether the impossibility of a universal learner can be seen as an instance of contextuality. The idea is: we can have learners on the different parts of the input space that cannot be glued together into a learner on the whole input space.
We make this precise by defining what we will call the learner presheaf and by providing a locally consistent family of learners (i.e., compatible sections of the presheaf) that cannot be globally consistent (i.e., the presheaf does not have global sections).
I’m not aware of work in this specific direction
Statistical learning theory was developed as the theory of machine learning
Just a quick comment on the broader context. There is some disconnect between theory and practice: this theory does not fully account for the practices of modern deep learning. For further reading on this, see, e.g.,
Let’s illustrate such a learner with an example. The learner gets a dataset $S$ consisting of, say, 100,000 pairs $(x,y)$ with a pixel image $x$ of an animal and the label $y$ which is $1$ or $0$ depending on whether the depicted animal is a dog or not. Based on that, the learner has to come up with a general prediction rule $h$ that takes as input an image $x$ (whether already seen or new) and produces as output a label $h(x) = y$ describing whether the learner takes this image to depict a dog. If our learner is a neural network, it updates its weights according to the backpropagation algorithm as it goes through batches of the dataset. Whatever the setting of weights it arrives at after this training process, it will compute a function $h$ mapping inputs $x$ to outputs $h(x)$.
Typically, the learner is biased—or, positively formulated, has prior knowledge—and considers only certain functions from inputs to outputs to be potential prediction rules. (This helps to avoid overfitting to the training data, since with fewer prediction rules there are fewer options that fit the data too well.) Our neural network, for example, will only consider those functions that can be realized with some set of weights.
This already motivates the formal framework of statistical learning theory. (We use the following common notation: If $M$ is a set, $M^*$ is the set of all finite sequences of elements from $M$. If $X$ and $Y$ are sets, then $Y^X$ is the set of all functions from $X$ to $Y$.)
We have a domain set $X$: the set of possible inputs. In our example, $X = \mathbb{R}^n$, i.e., we describe our pixel images as vectors $(x_0, x_1 , \ldots , x_{n-1})$ where $x_i$ is the value of the $i$-th pixel (with $n$ pixels in total).
We have a label set $Y$: the set of possible outputs. In our example, we have $Y = 2 = \lbrace 0, 1 \rbrace$, the two possible labels.
We have a hypothesis class $H \subseteq Y^X$: the prediction rules that the learner considers. In our example, this was the set of functions $h : X \to Y$ that the neural network can realize with some setting of its weights.
We have a learner $A$: a function mapping a dataset $S \in (X \times Y)^*$ to a prediction rule $A(S) \in H$. In our example, $A(S)$ is the function that the neural network realizes with the weights obtained after initialization and then doing, say, 10 repetitions (aka epochs) of going through batches of $S$ and updating the weights with backpropagation (we ignore that this is not a fully deterministic process).
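To make these four ingredients concrete, here is a minimal Python sketch; the particular domain, hypothesis class, and learner below are illustrative stand-ins, not the neural-network example:

```python
from itertools import product

# Toy stand-ins for the four ingredients (illustrative choices, not the image example).
X = [0, 1, 2, 3]                    # domain set
Y = [0, 1]                          # label set
# A hypothesis h : X -> Y is represented as a tuple of labels, indexed by the inputs.
H = [h for h in product(Y, repeat=len(X)) if sum(h) <= 2]   # a biased hypothesis class

def A(S):
    """A learner A : (X x Y)* -> H.

    Here: pick a hypothesis in H with the fewest mispredictions on the
    dataset S (ties are broken by enumeration order).
    """
    return min(H, key=lambda h: sum(1 for x, y in S if h[x] != y))

S = [(0, 1), (1, 0), (2, 1)]        # a dataset of input-output pairs
print("A(S) =", A(S))               # the prediction rule the learner outputs
```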
The last definition we need is what it means for a learner to be ‘good’, i.e., to produce correct prediction rules. Statistical learning theory accounts for the fact that the learners operate in a stochastic and noisy environment. So rather than saying that a learner will surely be completely correct, we focus on when a learner will probably be approximately correct—aka a PAC learner. For such a learner, there should be only a small probability $\delta$ that it comes up with a prediction rule with an error of more than $\epsilon$.
More precisely, for any confidence level $\delta$ and accuracy $\epsilon$, there should be a number of samples $m$, such that if the learner takes a set $S$ of $m$-many input-output samples of the world (which is governed by a probability distribution $P$ that the learner does not know), then, with probability at least $1 - \delta$ over the choice of samples, the learner’s prediction $A(S)$ has an error of less than $\epsilon$, provided there is a correct hypothesis in the first place. Here the error of a function $h : X \to Y$ is measured as the probability that the function does not produce the right label:
\[L_P (h) := P \big( \{ (x,y) \in X \times Y : h(x) \neq y \} \big).\]This is also called the true error (which is not accessible to the learner) to contrast it with the empirical error (which is accessible to the learner). The empirical error, which we will use later, is defined as the fraction of mispredictions on a given dataset $S$:
\[L_S (h) := \frac{1}{|S|} \Big\vert \big\lbrace (x,y) \in S : h(x) \neq y \big\rbrace \Big\vert.\]Now, the formal definition of PAC-learnability reads as follows. (We write $\mathsf{Prob}(X)$ for the set of probability distributions on the set $X$; and for simplicity we ignore questions of measurability, i.e., we don’t specify the $\sigma$-algebras on the sets that we work with.)
We say $A$ PAC-learns $(X,Y,H)$ if $A$ is a function from $(X \times Y)^*$ to $H$ such that, for all $\epsilon, \delta \in (0,1)$, there is $m \in \mathbb{N}$ such that, for all $P \in \mathsf{Prob}(X \times Y)$, if there is $h \in H$ such that $L_P (h) = 0$, then
\[P^m \big( \{ S \in (X \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) < \delta.\]With this framework of statistical learning theory, we can move to one of its fundamental results.
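Before doing so, here is a small simulation sketch of the two error notions and of the failure probability that appears in the PAC condition; the concrete domain, target rule, learner, and sample sizes are illustrative assumptions:

```python
import random

random.seed(0)

# Toy setting: X = {0,...,9}, Y = {0,1}, with labels given by a fixed target rule,
# so that the realizability assumption (some h with L_P(h) = 0) holds.
X = list(range(10))
target = {x: x % 2 for x in X}

def sample(m):
    """Draw a dataset S ~ P^m, where P is uniform on X with labels from `target`."""
    return [(x, target[x]) for x in random.choices(X, k=m)]

def empirical_error(h, S):
    """L_S(h): the fraction of mispredictions on the dataset S."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def true_error(h):
    """L_P(h) for this particular P (uniform inputs, labels from `target`)."""
    return sum(1 for x in X if h(x) != target[x]) / len(X)

def A(S):
    """A memorizing learner: repeat the labels seen in S, guess 0 elsewhere."""
    seen = dict(S)
    return lambda x: seen.get(x, 0)

# Empirical vs. true error for one training set.
S0 = sample(10)
h0 = A(S0)
print("L_S(h0) =", empirical_error(h0, S0), " L_P(h0) =", true_error(h0))

# Estimate P^m({S : L_P(A(S)) >= eps}) for a few sample sizes m.
eps, trials = 0.1, 2000
for m in [5, 20, 80]:
    bad = sum(1 for _ in range(trials) if true_error(A(sample(m))) >= eps)
    print(f"m={m:3d}  estimated failure probability: {bad / trials:.3f}")
```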
For a given domain set $X$ and label set $Y$, there are many learners: they can have different hypothesis classes (i.e., their inductive bias differs), and even if they agree on that, they can still generalize to different prediction rules from the very same dataset that they saw. So wouldn’t it be nice if there were a universal learner: one that, on any task, performs as well as any other learner? If so, we would get a ‘free lunch’ since we wouldn’t need to design, for each task, a learner specific to this task, but could always just blindly use the universal learner instead.
At first sight we might hope for a positive answer: After all, there are universal Turing machines that can simulate any other Turing machine. So maybe we can build a universal learner that simulates other learners, collects their prediction rules, and aggregates those into a combined prediction rule, hoping for a wisdom of the crowd effect. But—as spoiled by the section title (sorry!)—the No-Free-Lunch theorem says that such a universal learner cannot exist. In other words, for every learner, there is a task on which it fails, even though another learner succeeds.
Formally, the No-Free-Lunch theorem is stated as follows (in the version of
Let $X$ be an infinite set, let $Y = 2$, and $H = 2^X$. Then no learner PAC-learns $(X,Y,H)$. In fact, slightly stronger, for any $A : (X \times 2)^* \to H$, if we choose $\epsilon = \frac{1}{8}$ and $\delta = \frac{1}{7}$, then, for any $m \in \mathbb{N}$, there is $P \in \mathsf{Prob}(X \times Y)$ such that there is $h \in H$ with $L_P (h) = 0$ but
\[P^m \big( \{ S \in (X \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) \geq \delta.\](Note that we can hence think of $P$ as a probability distribution $D$ on $X$ together with a deterministic labeling function $h : X \to Y$.)
So learner $A$ fails on this task while the highly biased learner with hypothesis class $\lbrace h \rbrace$ succeeds.
Now let’s move to what initially seems like a completely different topic: contextuality. As already mentioned, this concerns situations where we have locally consistent data that is globally inconsistent
Let’s illustrate this with the standard example, depicted on the right. We have two agents, and of course we call them Alice and Bob. Both sit at opposite sides of a box, inside of which is a mystery system. Both have two buttons in front of them and a screen. Now, at each round of the experiment, they each can press one of their two buttons and their screen will read out either 0 or 1, indicating whether the underlying system currently has the property that is measured by the respective button. They do this many rounds and record the following statistics:
The Bell table:

| Measurements | $(0,0)$ | $(0,1)$ | $(1,0)$ | $(1,1)$ |
| --- | --- | --- | --- | --- |
| $(a_1, b_1)$ | $0$ | $1/2$ | $1/2$ | $0$ |
| $(a_2, b_1)$ | $1/8$ | $3/8$ | $3/8$ | $1/8$ |
| $(a_2, b_2)$ | $3/8$ | $1/8$ | $1/8$ | $3/8$ |
| $(a_1, b_2)$ | $1/8$ | $3/8$ | $3/8$ | $1/8$ |

Each row shows, for one choice of buttons, the observed relative frequencies of the four possible outcome pairs.
We now wonder: what do these observations say about our mystery system? (We cannot dismiss the example as imaginary, because it is physically realizable according to quantum mechanics
Naturally, we would say that, at any given time, the system is in some state, $s$, and this state determines the outcomes of each of the measurements $a_1, a_2, b_1, b_2$ if they are performed on the system in this state. There is some probability distribution on the set of all states of the system, describing how likely it is to find the system in a given state (and of course this distribution is independent of whatever Alice and Bob do). So the measurements probe objective properties of the system, which obtain independently of which measurements we perform; and the distributions over the measurement outcomes come from the distribution over system states.
But from our observations, we must conclude that our system cannot be of such a form! No probability measure on the state space of our system can give rise to the observed distributions on the measurements.
We can say this more precisely in terms of ‘locally consistent but globally inconsistent’. Above, we considered four choices of measurements: $C_1 = \lbrace a_1, b_1 \rbrace$, $C_2 = \lbrace a_2, b_1 \rbrace$, $C_3 = \lbrace a_2, b_2 \rbrace$, and $C_4 = \lbrace a_1, b_2 \rbrace$. For each $C_i$, we obtained a probability distribution $D_i$ on the measurements in $C_i$ (so $D_i$ is a probability measure on the set of the potential outcomes $\lbrace (0,0), (0,1), (1,0), (1,1) \rbrace$). One can check that these distributions are compatible in the sense that two distributions $D_i$ and $D_j$ have the same marginals on their overlap $C_i \cap C_j$. Thus, we have a locally consistent family $\lbrace D_i \rbrace$ of observation data. However, this family of data is not globally consistent, because there cannot be a distribution $D$ on all measurements $\lbrace a_1 , a_2 , b_1 , b_2 \rbrace$ such that each $D_i$ is a marginalization of $D$ (i.e., $D$ is the joint distribution).
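One can check this compatibility mechanically. Here is a minimal sketch in Python; the probability table it encodes is the one displayed above, so its exact entries should be read as an assumption of the sketch:

```python
from itertools import product

# Observed distributions, one per measurement context; entries as in the table above.
# Outcomes are pairs (value of the first listed measurement, value of the second).
tables = {
    ("a1", "b1"): {(0, 0): 0.0,   (0, 1): 0.5,   (1, 0): 0.5,   (1, 1): 0.0},
    ("a2", "b1"): {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125},
    ("a2", "b2"): {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375},
    ("a1", "b2"): {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125},
}

def marginal(context, dist, variable):
    """The marginal distribution of one variable within a context's outcome distribution."""
    i = context.index(variable)
    out = {0: 0.0, 1: 0.0}
    for outcome, p in dist.items():
        out[outcome[i]] += p
    return out

# Local consistency: any two contexts agree on the marginals of their shared variables.
# (All probabilities are dyadic, so exact float comparison is safe here.)
for c1, c2 in product(tables, repeat=2):
    for v in set(c1) & set(c2):
        assert marginal(c1, tables[c1], v) == marginal(c2, tables[c2], v)
print("all overlapping marginals agree")
```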
Consider the state space of our system with its probability measure $P$ giving rise to the observed distributions on the measurements. Consider the following events and their associated logical formulas:
| Event | Associated formula |
| --- | --- |
| $E_1$: $a_1$ and $b_1$ yield distinct values | $\varphi_1 = a_1 \leftrightarrow \neg b_1$ |
| $E_2$: $b_1$ and $a_2$ yield distinct values | $\varphi_2 = b_1 \leftrightarrow \neg a_2$ |
| $E_3$: $a_2$ and $b_2$ yield the same values | $\varphi_3 = a_2 \leftrightarrow b_2$ |
| $E_4$: $b_2$ and $a_1$ yield distinct values | $\varphi_4 = b_2 \leftrightarrow \neg a_1$ |
Looking at the observation statistics, we have $P(E_1) = \frac{1}{2} + \frac{1}{2} = 1$ and, for $i = 2,3,4$, we have $P(E_i) = \frac{3}{8} + \frac{3}{8} = \frac{3}{4}$, so $\sum P (E_i) = 1 + 3 \cdot \frac{3}{4} = 3.25$. On the other hand, basic logical reasoning shows that the four events cannot all occur simultaneously: otherwise we get the contradiction
\[a_1 \overset{\varphi_1}{\leftrightarrow} \neg b_1 \overset{\varphi_2}{\leftrightarrow} a_2 \overset{\varphi_3}{\leftrightarrow} b_2 \overset{\varphi_4}{\leftrightarrow} \neg a_1.\]So if $E_4$ obtains, then at least one of the events $E_1, E_2, E_3$ cannot obtain, so
\[P(E_4) \leq P \Big( \bigcup_{i = 1}^3 E_i^c \Big) \leq \sum_{i = 1}^3 P( E_i^c) = \sum_{i = 1}^3 \big( 1 - P( E_i) \big).\]By bringing all $P$’s to the left-hand side, we get
\[\sum_{i = 1}^4 P (E_i) \leq 3\]contradicting our observation that $\sum_{i = 1}^4 P (E_i) = 3.25$.
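The logical core of the argument, namely that the four formulas cannot all hold at once (so that any global distribution satisfies $\sum_i P(E_i) \leq 3$), can be checked by brute force over all $2^4$ assignments; a minimal sketch:

```python
from itertools import product

def satisfied(a1, a2, b1, b2):
    """Which of the four formulas hold under a given assignment of outcomes."""
    return [
        a1 != b1,   # phi_1: a_1 and b_1 yield distinct values
        b1 != a2,   # phi_2: b_1 and a_2 yield distinct values
        a2 == b2,   # phi_3: a_2 and b_2 yield the same values
        b2 != a1,   # phi_4: b_2 and a_1 yield distinct values
    ]

# Over all 16 assignments of outcomes, at most 3 of the formulas hold simultaneously.
print(max(sum(satisfied(*bits)) for bits in product([0, 1], repeat=4)))   # 3

# Hence, under any global distribution, the expected number of satisfied formulas,
# which equals sum_i P(E_i), is at most 3, contradicting the observed value 3.25.
```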
As Abramsky puts it:

> quantum mechanics predicts correlations which exceed those which can be achieved by any classical mechanism. This is the content of Bell’s theorem […] and in many ways the starting point for the whole field of quantum information.
So our system cannot be of the form we initially thought it was. Hence somehow the properties of our system depend on the choice of measurements that we perform, i.e., the measurement context. How exactly this works is formalized by the sheaf-theoretic approach to contextuality. This goes as follows (e.g.
We have a set $X$ of variables (the quantities that can be measured). In our example, $X = \lbrace a_1 , a_2 , b_1 , b_2 \rbrace$.
We have a set $\mathcal{C} \subseteq \mathcal{P}(X)$ of compatible contexts: those sets of variables that can be measured together. In our example, $\mathcal{C} = \big\lbrace \lbrace a_1 , b_1 \rbrace, \lbrace a_1 , b_2 \rbrace, \lbrace a_2 , b_1 \rbrace, \lbrace a_2 , b_2 \rbrace \big\rbrace$.
For each context $C \subseteq X$, we have a set $P(C)$ of possible data over $C$. In our example, $P(C) = \mathsf{Prob} (2^C)$, where we think of $2^C$ as the set of possible outcomes when measuring the variables in $C$. So for compatible contexts $C$, the set $P(C)$ includes the ‘local’ distributions that we have observed. For contexts $C$ outside $\mathcal{C}$, the set $P(C)$ contains distributions that we might not directly observe, but which may be constrained by the observed distributions. In particular, we saw that $P(X)$ contains any potential ‘global’ distribution. To talk about this transition between the local and the global, we also need the following.
For each $C \subseteq D$ in $\mathcal{P}(X)$, we have a function
\[\rho^D_C : P(D) \to P(C),\]called the restriction map. If $d \in P(D)$, we also write $d \restriction C := \rho^D_C(d)$. We require that $\rho^C_C$ is the identity function on $P(C)$, and if $C \subseteq D \subseteq E$, then $\rho^D_C \circ \rho^E_D = \rho^E_C$.
In our example, $\rho^D_C : P(D) \to P(C)$ takes a distribution $d$ on $2^D$ to the distribution $d \restriction C$ on $2^C$ obtained by marginalizing on the variables in $C$.
An element $d \in P(C)$ is called a (local) section and an element $d \in P(X)$ is called a global section. A family $\lbrace d_C \in P(C) : C \in \mathcal{C} \rbrace$ is called locally consistent (or compatible) if, for all $C,D \in \mathcal{C}$, we have
\[d_C \restriction C \cap D = d_D \restriction C \cap D.\]The family is called globally consistent if there is a global section $d \in P(X)$ such that, for all $C \in \mathcal{C}$, we have $d \restriction C = d_C$. A locally consistent family is called contextual if it is not globally consistent (i.e., doesn’t have a global section).
In the language of category theory, this means that $P : \mathcal{P}(X)^\mathsf{op} \to \mathsf{Set}$ is a presheaf, where $\mathcal{P}(X)^\mathsf{op}$ is the poset category of the powerset with reverse inclusion.
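To see these definitions in action, here is a sketch that represents local sections as outcome distributions, implements the restriction maps as marginalization, and uses a small linear program (via SciPy's `linprog`) to confirm that the family observed above has no global section. The table entries are the same assumed values as before:

```python
from itertools import product
from scipy.optimize import linprog

X = ["a1", "a2", "b1", "b2"]
contexts = [("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a1", "b2")]
# Local sections d_C in P(C) = Prob(2^C); entries as in the table above (an assumption).
sections = {
    ("a1", "b1"): {(0, 0): 0.0,   (0, 1): 0.5,   (1, 0): 0.5,   (1, 1): 0.0},
    ("a2", "b1"): {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125},
    ("a2", "b2"): {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375},
    ("a1", "b2"): {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125},
}

def restrict(D, d, C):
    """The restriction map rho^D_C: marginalize a distribution on 2^D to 2^C."""
    idx = [D.index(v) for v in C]
    out = {}
    for outcome, p in d.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# Example: restrict the section over {a2, b1} to its overlap {a2} with {a2, b2}.
print(restrict(("a2", "b1"), sections[("a2", "b1")], ("a2",)))   # {(0,): 0.5, (1,): 0.5}

# A global section would be a distribution on 2^X whose restriction to each context C
# equals the local section d_C.  Search for one with a linear program whose variables
# are the probabilities of the 16 assignments X -> {0, 1}.
assignments = list(product([0, 1], repeat=len(X)))
A_eq, b_eq = [], []
for C in contexts:
    for outcome, p in sections[C].items():
        A_eq.append([1.0 if tuple(g[X.index(v)] for v in C) == outcome else 0.0
                     for g in assignments])
        b_eq.append(p)
res = linprog(c=[0.0] * len(assignments), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * len(assignments), method="highs")
print("global section exists:", res.success)   # False: the family is contextual
```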
It is high time to bring together the two strands: how we can view statistical learning theory through the lens of contextuality. So assume we consider a machine learning setting, i.e., a typically infinite domain set $X$ (e.g., pixel images), the label set $Y = 2$ (i.e., binary classification), and a learning architecture $H \subseteq 2^X$. For $C \subseteq X$, we define the notation
\[H_C := \{ h \restriction C : h \in H \}.\](As usual, $h \restriction C$ is the function from $C$ to $Y$ mapping $x \in C$ to $h(x) \in Y$.)
We will consider the domain $X$ as our set of variables, and our compatible contexts are all finite subsets of $X$, i.e., $\mathcal{C} := \mathcal{P}_\mathsf{fin} (X)$. After all, any actual learning context deals with finitely many inputs $x \in X$ that are, e.g., used in training data or also in test data to check a suggested prediction rule of a learner.
Now, the trick to get a presheaf is that we don't just view $x \in X$ as an object that should be labeled. We also view $x \in X$ as a variable that we measure in terms of how it figures in learners that try to label the object $x$.
Concretely, we define the learner presheaf $P_H$ as follows:
For $C \subseteq X$, define
\[P_H(C) := \big\{ A : A \text{ PAC-learns } (C,Y,H_C) \big\},\]so each $A$ in $P_H(C)$ is a function $(C \times Y)^* \to H_C$.
If $D \supseteq C$, define the restriction map $\rho^D_C : P_H(D) \to P_H(C)$ by
\[A \mapsto [ S \mapsto A(S) \restriction C ],\]so the learner $A$ in $P_H(D)$ is mapped to the learner $\rho^D_C (A)$ in $P_H(C)$, which is defined by mapping $S$ to $A(S) \restriction C$.
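As a minimal illustration of this restriction map, here is a sketch that assumes learners are Python functions from datasets (lists of input-output pairs) to hypotheses represented as dicts:

```python
def restrict_hypothesis(h, C):
    """Restrict a hypothesis (a dict from inputs to labels) to the subset C."""
    return {x: h[x] for x in C}

def restrict_learner(A, C):
    """The restriction map rho^D_C on learners: A |-> [S |-> A(S) restricted to C]."""
    return lambda S: restrict_hypothesis(A(S), C)

# Toy example: a learner on D = {0, 1, 2} that memorizes S and labels everything else 0.
D, C = {0, 1, 2}, {0, 1}

def A(S):
    seen = dict(S)
    return {x: seen.get(x, 0) for x in D}

B = restrict_learner(A, C)
print(B([(1, 1)]))          # {0: 0, 1: 1}, a prediction rule on C only
```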
For well-definedness, we need to check that, for a learner $A$ in $P_H(D)$, the function $B := \rho^D_C (A)$ is in $P_H(C)$, i.e., PAC-learns $(C,Y,H_C)$.
First, $B$ is of the right type: it maps an element
\[S \in (C \times Y)^* \subseteq (D \times Y)^*\]to $h_S := A(S) \restriction C$, which is in $H_C$ because $A(S)$ is in $H_D$, so there is a function $h \in H$ with $A(S) = h \restriction D$, so, since $C \subseteq D$, also $A(S) \restriction C = h \restriction C$.
Second, to show that $B$ PAC-learns $(C,Y,H_C)$, let $\epsilon, \delta \in (0,1)$. Since $A$ PAC-learns $(D,Y,H_D)$, there is $m \in \mathbb{N}$ such that, for all $P \in \mathsf{Prob}(D \times Y)$, if there is $h \in H_D$ such that $L_P (h) = 0$, then
\[P^m \big( \{ S \in (D \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) < \delta.\]So, choosing the very same $m$, it suffices to show that, for $Q \in \mathsf{Prob}(C \times Y)$ such that there is $g \in H_C$ with $L_Q (g) = 0$, we have
\[Q^m \big( \{ S \in (C \times Y)^m : L_Q( B(S) ) \geq \epsilon \} \big) < \delta.\]Define the measure $P$ on $D \times Y$ by taking the distribution of $Q$ on $C \times Y$ and putting 0 weight on the rest of $D \times Y$ (i.e., $P(Z) := Q (Z \cap (C \times Y))$). Note that, for $f : D \to Y$, we have
\[\begin{aligned} L_P(f) &= P \big( (x,y) \in D \times Y : f(x) \neq y \big) \\ &= Q \big( (x,y) \in C \times Y : f(x) \neq y \big) \\ &= L_Q(f \restriction C). \end{aligned}\]Now, since $g \in H_C$, there is $\overline{h} \in H$ with $g = \overline{h} \restriction C$. Define $h := \overline{h} \restriction D \in H_D$. Since $g = h \restriction C$, we have
\[L_P(h) = L_Q(h \restriction C) = L_Q (g) = 0.\]So, by assumption, we have
\[P^m \big( \{ S \in (D \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) < \delta.\]Hence, since $L_Q( B(S) ) = L_Q( A(S) \restriction C ) = L_P( A(S) )$, we have
\[\begin{aligned} & Q^m \big( \{ S \in (C \times Y)^m : L_Q( B(S) ) \geq \epsilon \} \big) \\ =& Q^m \big( \{ S \in (C \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) \\ =& P^m \big( \{ S \in (D \times Y)^m : L_P( A(S) ) \geq \epsilon \} \big) < \delta, \end{aligned}\]as needed.
For compositionality, we have that $\rho^C_C$ is the identity function on $P_H(C)$, since it maps $A$ to the mapping $S \mapsto A(S) \restriction C = A(S)$. And if $C \subseteq D \subseteq E$, then $\rho^D_C \circ \rho^E_D = \rho^E_C$ because, for $A \in P_H(E)$ and $S \in (C \times Y)^*$, we have:
\[\begin{aligned} \rho^D_C \circ \rho^E_D (A) (S) &= \rho^D_C \big( [ S' \mapsto A(S') \restriction D ] \big) (S) \\ & = [ S' \mapsto A(S') \restriction D ] (S) \restriction C \\ &= ( A(S) \restriction D ) \restriction C \\ &= A(S) \restriction C \\ &= \rho^E_C (A) (S), \end{aligned}\]as needed.
Given the learner presheaf, do we get contextuality, i.e., local consistency but global inconsistency? Yes, as we’ll see now.
The No-Free-Lunch theorem immediately implies that, for $X$ infinite and $H = 2^X$, we do not have any global sections, i.e., $P_H(X) = \emptyset$. So we cannot have globally consistent data.
In fact, we can be more precise. The fundamental theorem of PAC learning says that there is a PAC-learner for $(X,Y,H)$ iff $H$ has finite VC-dimension (see
For a hypothesis class $H \subseteq Y^X$, the learner presheaf $P_H$ has a global section iff
\[\lbrace \vert C \vert : C \in \mathcal{C} \text{ with } H_C = Y^C \rbrace\]is bounded.
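To make this condition concrete, here is a small sketch that computes the set of sizes $\lbrace \vert C \vert : H_C = Y^C \rbrace$ for a toy finite domain and a toy hypothesis class of threshold rules (both illustrative assumptions):

```python
from itertools import combinations, product

X = list(range(6))                      # toy finite stand-in for the domain
Y = [0, 1]
# Toy hypothesis class of thresholds: h_t(x) = 1 iff x >= t (an illustrative choice).
H = [tuple(1 if x >= t else 0 for x in X) for t in range(len(X) + 1)]

def H_restricted(C):
    """H_C: the restrictions of hypotheses in H to the context C (inputs double as indices)."""
    return {tuple(h[x] for x in C) for h in H}

sizes_of_shattered_contexts = {
    len(C)
    for k in range(len(X) + 1)
    for C in combinations(X, k)
    if H_restricted(C) == set(product(Y, repeat=len(C)))
}
print(sizes_of_shattered_contexts)      # {0, 1}: bounded, so a global section exists
```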
Thus, we also get a ‘simple combinatorial condition on the cover’ characterizing the existence of global sections, like Abramsky and Brandenburger get in the quantum mechanical case (
What would locally consistent data look like? It is a family of PAC learners, one for each finite subset $C$ of $X$,
\[\lbrace A_C \in P_H (C) : C \in \mathcal{C} \rbrace\]such that they agree on their overlaps, i.e., for all finite subsets $C$ and $D$ of $X$, we have that, for all training sets $S \in ( (C \cap D) \times 2)^*$, the prediction rules outputted by the learner $A_C$ and $A_D$ coincide on $C \cap D$, i.e.,
\[A_C (S) \restriction C \cap D = A_D (S) \restriction C \cap D.\]So these PAC learners work on different finite areas of the input space, and they are all consistent with each other in the sense that they agree in their predictions on overlapping areas.
But can such a family of learners exist? After all, if it does, shouldn’t we be able to glue all those local learners together into a global learner, which, however, is impossible once $H$ has infinite VC-dimension?
Nonetheless, here is the plan to construct such a locally consistent family of learners:
1. First, fix an enumeration $C_0 , C_1, C_2, \ldots$ of $\mathcal{C}$ (without any repeats). This of course only works if $X$ is countable, which we assume for now. (If $X$ is uncountable, issues of measurability enter; then we naturally work with a standard Borel space $X$, and choose a countable generating basis $\mathcal{C}$ instead of all finite subsets.) Define $D_n := \bigcup_{i = 0}^n C_i$.
2. As an intermediate step, we recursively define, for each $n$, a learner $B_n : (D_n \times 2)^* \to H_{D_n}$ with the following two properties: (i) empirical risk minimization: for every dataset $S \in (D_n \times 2)^*$, the hypothesis $B_n(S)$ minimizes the empirical error $L_S$ over $H_{D_n}$; and (ii) backwards compatibility: for every $m \leq n$, we have $B_m = B_n \restriction D_m$, i.e., $B_m(S) = B_n(S) \restriction D_m$ for all $S \in (D_m \times 2)^*$.
3. Finally, we define, for each $C \in \mathcal{C}$, the learner $A_C := B_n \restriction C_n$, where $n$ is the unique index such that $C = C_n$. It will follow that the $A_C$’s are PAC learners and satisfy the local consistency condition.
Let’s first check that, once we do step 2, we really get step 3. First, to check that each $A_C$ is indeed a PAC-learner, we use the basic fact from statistical learning theory that empirical risk minimizers with a finite hypothesis class are PAC learners for this hypothesis class (see
Indeed, since $D_n$ is finite, the hypothesis class $H_{D_n}$ is finite, so the empirical risk minimizer $B_n$ PAC-learns $(D_n, Y, H_{D_n})$; hence its restriction $A_C = B_n \restriction C_n$ PAC-learns $(C_n, Y, H_{C_n})$ by the well-definedness of the restriction maps shown above. Second, for the local consistency condition, consider $C_j , C_k \in \mathcal{C}$ with, say, $j \leq k$, and let $S \in ((C_j \cap C_k) \times 2)^*$. Since $C_j \cap C_k \subseteq C_j \subseteq D_j \subseteq D_k$, backwards compatibility gives $B_j (S) = B_k (S) \restriction D_j$, so
\[A_{C_j} (S) \restriction C_j \cap C_k = B_j (S) \restriction C_j \cap C_k = B_k (S) \restriction C_j \cap C_k = A_{C_k} (S) \restriction C_j \cap C_k,\]as required.
So it remains to do step 2. We construct the desired $B_n$ by recursion. As useful notation, for a function $f : Z \to \mathbb{R}$ on a set $Z$, define
\[\mathrm{argmin}_{z \in Z} f(z) := \lbrace z \in Z : \forall z' \in Z . f(z) \leq f(z') \rbrace.\]For $n = 0$: Given $S \in (D_0 \times 2)^{*}$, choose as $B_0 (S)$ any element of
\[\mathrm{argmin}_{h \in H_{D_0}} L_S (h),\]which is nonempty since $H_{D_0}$ is finite.
For $n + 1$: Given $S \in ( D_{n+1} \times 2)^*$, define the prediction rule $h_S = B_{n+1} (S) \in H_{D_{n+1}}$ as follows:
Case 1: $S \not\in ( D_n \times 2)^*$. Then pick any
\[h_S \in \mathrm{argmin}_{h \in H_{D_{n+1}}} L_S (h).\]Case 2: $S \in ( D_n \times 2)^*$. Then $B_n (S) \in H_{D_n}$, so there is $\overline{h} \in H$ with $\overline{h} \restriction D_n = B_n (S)$, and we pick $h_S := \overline{h} \restriction D_{n+1}$.
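Here is a runnable sketch of this recursive construction for a toy enumeration of contexts over a countable domain. Since $H = 2^X$ cannot be stored, the sketch only ever materializes the finite restrictions $H_{D_n}$; the extension $\overline{h} \restriction D_{n+1}$ in case 2 labels the new points with $0$ (one of many valid choices for $H = 2^X$); and it is assumed that every dataset handed to $B_n$ only mentions inputs from $D_n$:

```python
from itertools import product

def empirical_error(h, S):
    """L_S(h), with hypotheses represented as dicts from inputs to labels."""
    return sum(1 for x, y in S if h[x] != y) / len(S) if S else 0.0

def hypotheses(domain):
    """H_D for H = 2^X: all labelings of the finite set `domain`."""
    dom = sorted(domain)
    return [dict(zip(dom, bits)) for bits in product([0, 1], repeat=len(dom))]

def build_learners(contexts):
    """Build B_0, B_1, ... on D_n = C_0 ∪ ... ∪ C_n, following the recursion above."""
    B, D = [], []
    for n, C in enumerate(contexts):
        D.append(set(C) | (D[-1] if D else set()))

        def B_n(S, n=n):
            xs = {x for x, _ in S}
            if n > 0 and xs <= D[n - 1]:
                # Case 2: S is already a dataset over D_{n-1}; extend B_{n-1}(S)
                # to D_n, labeling the new points with 0.
                g = B[n - 1](S)
                return {x: g.get(x, 0) for x in D[n]}
            # Case 1 (and the base case n = 0): empirical risk minimization over H_{D_n}.
            return min(hypotheses(D[n]), key=lambda h: empirical_error(h, S))

        B.append(B_n)
    return B, D

# The learners A_C := B_n restricted to C_n, plus a local-consistency spot check.
contexts = [{0, 1}, {1, 2}, {0, 2, 3}]
B, D = build_learners(contexts)
A = [lambda S, n=n: {x: B[n](S)[x] for x in contexts[n]} for n in range(len(contexts))]

S = [(1, 1), (1, 0), (1, 1)]          # a dataset over the overlap {1} of C_0 and C_1
print(A[0](S)[1], A[1](S)[1])         # the two restricted learners agree on the overlap
```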
Let’s verify that the $B_n$’s have the desired properties, i.e., empirical risk minimization and backwards compatibility. For $n = 0$ this holds by construction. So assume it holds for $B_n$, and we show it for $B_{n+1}$.
We first show backwards compatibility. Let $m \leq n + 1$ and show $ B_m = B_{n+1} \restriction D_m $. Let’s first consider the main case where $m = n$. So let $S \in ( D_n \times 2)^{*}$ and show $B_n (S) = B_{n+1} (S) \restriction D_n$. Since $S \in ( D_n \times 2)^{*}$, we are in case 2 of the definition of $B_{n+1} (S)$ and, with the notation of this case, we have
\[B_n (S) = \overline{h} \restriction D_n = (\overline{h} \restriction D_{n+1} ) \restriction D_n = B_{n+1} (S) \restriction D_n.\]For the other cases, if $m = n+1$, the claim is trivial, and if $m < n$, then, using the induction hypothesis,
\[B_{n+1} \restriction D_m = ( B_{n+1} \restriction D_n ) \restriction D_m = B_n \restriction D_m = B_m.\]Finally, we check empirical risk minimization. Given $S \in ( D_{n+1} \times 2)^{*}$, we show that $h_S = B_{n+1} (S)$ is in
\[\mathrm{argmin}_{h \in H_{D_{n+1}}} L_S (h).\]This holds by construction in case 1, so let’s check it in case 2 where $S \in ( D_n \times 2)^*$. To do so, note that for any $C \supseteq \lbrace x : \exists y . (x,y) \in S \rbrace$ we have
\[L_S (h) = \frac{1}{|S|} \Big\vert \big\lbrace (x,y) \in S : h(x) \neq y \big\rbrace \Big\vert = L_S (h \restriction C).\]Hence, for any $h' \in H_{D_{n+1}}$, we have, since $B_n$ is an empirical risk minimizer by induction hypothesis,
\[\begin{aligned} L_S (B_{n+1}(S)) &= L_S (B_{n+1}(S) \restriction D_n) = L_S ( B_n(S) ) \\ &\leq L_S ( h' \restriction D_n ) = L_S ( h' ), \end{aligned}\]using that $h' \restriction D_n \in H_{D_n}$. So we see that PAC-learnability really exhibits contextuality: We can have locally consistent learners covering the whole input space, but they cannot be glued together into one globally consistent learner.
The obvious next step is to apply the rich and general theory of sheaf-theoretic contextuality to the theory of machine learning! Specifically, the next step, as in
A central theorem of logic is the compactness theorem: if every finite subset of a theory has a model, then the whole theory has a model. The contextuality that we have found here describes a sense in which learning is not compact: every finite subset of the input space has a PAC learner, but the whole input space does not have a PAC learner. This should be related to discussions about the (non-) compactness of learning (
More generally, it would be interesting to see how our discussion connects to other category-theoretic approaches to learners (
So there are many directions to explore next, but, for now, let’s quickly recap what we’ve seen.
We introduced statistical learning theory as the classic theory of machine learning, and we stated the No-Free-Lunch theorem saying that a universal learner cannot exist. Then we introduced sheaf-theoretic contextuality, aiming to apply it to statistical learning theory.
We defined the learner presheaf and provided a locally consistent family of sections/learners that cannot be globally consistent. This makes precise the starting idea of rethinking the No-Free-Lunch theorem: that we can have learners on the different parts of the input space that cannot be glued into a learner on the whole input space.
If you found this helpful, you can cite this blogpost as
Hornischer, Levin (May 2024). The No-Free-Lunch Theorem and Sheaf-Contextuality. https://levinhornischer.github.io.
or in BibTeX format
@misc{hornischer2024, author = {Hornischer, Levin}, title = {The No-Free-Lunch Theorem and Sheaf-Contextuality}, year = {2024}, month = {May}, url = {https://levinhornischer.github.io/blog/2024/no_free_lunch/}, }