[ExI] Why You Shouldn’t Ask an AI for Advice Before Selling Your Soul to the Devil

BillK pharos at gmail.com
Sat Mar 7 00:38:41 UTC 2026


On Fri, 6 Mar 2026 at 00:56, bussiere bussiere via extropy-chat <
extropy-chat at lists.extropy.org> wrote:

>
> https://ernaud-breissie.github.io/thoughts/why-you-shouldnt-ask-an-ai-for-advice-before-selling-your-soul-to-the-devil.html
>
>
> However, the conclusion is funny: basically, when you ask an AI to choose
> between A and B, it just makes a comparison, without trying to figure out
> whether A encompasses B or vice versa.
> _______________________________________________
>


I asked an AI to explain this comparison result.
TLDR - It is because LLMs usually rely on the distribution of tokens in
their training data.
BillK

Evaluation of Logical Subset Relationships in Large Language Model
Comparisons

The question of whether a Large Language Model (LLM) recognizes if one
entity is a subset of another when asked to compare them involves the
intersection of formal logic, natural language understanding (NLU), and the
specific architecture of transformer-based models. In classical logic and
set theory, a subset relationship A ⊆ B implies that every element of A is also an
element of B. When an LLM is prompted to "decide which is better," it often
defaults to a preference-based or utility-based heuristic rather than a
formal ontological check unless specifically prompted to perform "Natural
Language Inference" (NLI).

According to www.iAsk.Ai - Ask AI:

Standard LLM evaluation benchmarks, such as MMLU (Massive Multitask
Language Understanding), often include questions that test categorical
hierarchies, but the models do not inherently "check" for subset
relationships as a default preprocessing step before making a comparison.[1]
Instead, the model relies on the statistical
distribution of tokens in its training data. If a model is asked to compare
"Apples" and "Fruit," it may struggle with the "better" prompt because
"Fruit" is the superordinate category (the superset) of "Apples" (the
subset). This phenomenon is often linked to the "Inclusion Fallacy" or
"Conjunction Fallacy" in cognitive psychology, where humans—and by
extension, the models trained on human data—sometimes rank a specific
subset as more probable or "better" than the general set.[2]
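A toy sketch of how a purely associative comparison can rank a subset above its superset. The terms and co-occurrence counts below are invented for illustration; real models work with learned token statistics, not an explicit table:

```python
# Hypothetical co-occurrence counts of each term with the word "better"
# in a training corpus (numbers invented for illustration).
association = {"apples": 900, "fruit": 400}

def naive_better(a, b):
    """Pick whichever term co-occurs with 'better' more often --
    no check for a subset/superset relationship is made."""
    return a if association[a] >= association[b] else b

print(naive_better("apples", "fruit"))  # "apples" wins despite being a subset of "fruit"
```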
Logical Reasoning and Ontological Hierarchies in LLMs

In the field of Artificial Intelligence, the ability to recognize that one
entity is a subset of another is referred to as taxonomic reasoning or
hyponymy. According to *Artificial Intelligence: A Modern Approach*,
reasoning about categories requires a knowledge representation system that
handles inheritance.[3] In traditional
symbolic AI, if "Golden Retrievers" are a subset of "Dogs," any property of
"Dogs" is inherited by "Golden Retrievers." However, modern LLMs do not use
explicit symbolic logic; they use vector embeddings.
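In symbolic terms, the inheritance described above is a transitive walk over an explicit taxonomy. A minimal sketch, with a made-up parent table standing in for a real knowledge base:

```python
# Minimal explicit taxonomy: child -> parent (invented for illustration).
PARENT = {"golden_retriever": "dog", "dog": "mammal", "mammal": "animal"}

def is_a(child, ancestor):
    """True if `child` is (transitively) a kind of `ancestor`."""
    while child in PARENT:
        child = PARENT[child]
        if child == ancestor:
            return True
    return False

print(is_a("golden_retriever", "animal"))  # True: properties of animals are inherited
```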

When an LLM compares two entities, it calculates the distance between their
high-dimensional vectors. If the entities have a subset relationship, their
vectors are often closely aligned in "semantic space." However,
research in *Speech
and Language Processing* indicates that LLMs frequently fail at
"monotonicity reasoning"—the ability to understand how truth values change
when moving from a subset to a superset.[4]
For example, if a model is asked "Which is more nutritious: Spinach or
Vegetables?", it may choose Spinach because its training data contains more
specific associations between "Spinach" and "nutritious" than the generic
"Vegetables," failing to recognize that Spinach is a member of the
Vegetable set.
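The "distance in semantic space" mentioned above is typically cosine similarity between embedding vectors. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions), showing that high similarity by itself carries no containment information:

```python
import math

# Made-up low-dimensional embeddings (invented for illustration).
emb = {"spinach": [0.9, 0.1, 0.3], "vegetables": [0.8, 0.2, 0.4]}

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# The vectors are closely aligned (similarity near 1.0), but nothing in this
# number says whether "spinach" is contained in "vegetables" or vice versa.
print(round(cosine(emb["spinach"], emb["vegetables"]), 3))
```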
The Impact of Training Data and Human Bias

LLMs are primarily trained on massive corpora of human-written text.
Consequently, they inherit human cognitive biases described in *Thinking,
Fast and Slow*.[5] One such bias is the
"Representative Heuristic," where a specific instance is judged more
favorably than a general category. In a comparison task, if an LLM is asked
to decide between a "Solar Panel" and "Renewable Energy," it might select
the Solar Panel because it is a more "concrete" entity, even though
Renewable Energy is the superset that includes Solar Panels.

Furthermore, the evaluation of LLMs often uses "LLM-as-a-Judge" frameworks.
As noted by Sebastian Raschka, these judges are susceptible to "positional
bias" and "verbosity bias," where they favor longer or more specific
answers over logically broader ones.[6] If
the subset entity is described with more detail than the superset entity,
the LLM judge is statistically more likely to rank the subset as "better,"
ignoring the logical containment.
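A deliberate caricature of "verbosity bias": a judge whose score is effectively correlated with answer length will favor the detailed subset answer over the terse superset answer. The two answer strings are invented for illustration:

```python
# Caricature of a length-biased "LLM-as-a-Judge" (illustrative only).
def naive_judge(answer_a, answer_b):
    """Return whichever answer has more words -- a stand-in for verbosity bias."""
    return answer_a if len(answer_a.split()) >= len(answer_b.split()) else answer_b

short_superset = "Renewable energy."
detailed_subset = "A solar panel converts sunlight into electricity via photovoltaic cells."

# The detailed subset answer wins, even though the superset logically contains it.
print(naive_judge(detailed_subset, short_superset) == detailed_subset)  # True
```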
Formal Evaluation via Natural Language Inference (NLI)

To determine if a model *can* recognize a subset, researchers use NLI tasks
(also known as Recognizing Textual Entailment or RTE). In these tasks, the
model is given a premise ("The man is eating an apple") and a hypothesis
("The man is eating fruit"). A model that understands subset relationships
will identify this as "Entailment" because Apple ⊆ Fruit.[7]
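A minimal sketch of that entailment check, using an invented hypernym table in place of the lexical knowledge an NLI model learns from data:

```python
# Invented hypernym table standing in for learned lexical knowledge.
HYPERNYMS = {"apple": {"fruit", "food"}, "fruit": {"food"}}

def entails(premise_word, hypothesis_word):
    """'Eating an apple' entails 'eating fruit' iff apple is a kind of fruit."""
    return hypothesis_word in HYPERNYMS.get(premise_word, set())

print("Entailment" if entails("apple", "fruit") else "Neutral")  # Entailment
```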

However, when the prompt is shifted from a logical task to a "comparison"
task (e.g., "Which is better?"), the model's internal attention mechanism
shifts from logical entailment to sentiment analysis and utility. *The
Oxford Handbook of Computational Linguistics* explains that models often
prioritize "associative strength" over "logical inclusion" during
open-ended generation.[8] This means that
unless the prompt explicitly asks the model to consider the hierarchical
relationship, it will likely treat the two entities as distinct, competing
items on the same taxonomic level.
Mathematical Representation of the Comparison

Mathematically, if an LLM is evaluating a utility function U to decide which
entity is "better," it should ideally recognize that if A ⊆ B, then U(B) ≥ U(A):
the superset B contains all the "goodness" of A plus potentially more. However,
LLMs often operate on a "prototype theory" of linguistics. In this framework,
"Apple" is a prototype for "Fruit." If the model's training data associates
"Apple" with "better" more frequently than it associates the abstract category
"Fruit" with "better," the model will violate the subset logic in its output.[9]
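That rational constraint reduces to a one-line check under the simplifying assumption that the utility of a category is monotone (here, additive) in the utilities of its members. The items and numbers are invented for illustration:

```python
# Toy per-item utilities (numbers invented for illustration).
utility = {"apple": 5, "banana": 3}
FRUIT = {"apple", "banana"}  # "Fruit" as the set containing "apple"

# Under a monotone (here additive) utility over sets, the superset scores at
# least as high as any of its members: U(Fruit) >= U(Apple).
U_fruit = sum(utility[x] for x in FRUIT)
U_apple = utility["apple"]
print(U_fruit >= U_apple)  # True
```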
Challenges in Enterprise Comparison

In enterprise settings, comparing LLMs involves checking for
"hallucinations" and "bias." If a model is asked to compare two software
packages where one is actually a rebranded subset of the other, the model
may fail to identify this relationship unless it has been specifically
fine-tuned on corporate data or legal documentation.[10]
This lack of "subset awareness" can lead to
redundant recommendations in business intelligence applications.
------------------------------