
The Illusion of Free Input: Controlled User Steering in Transformer Models

What actually happens to your prompt before an AI system responds? The answer: a lot. And much of it remains intentionally opaque.

This post presents scientifically documented control mechanisms by which transformer-based models like GPT are steered – layer by layer, from input to output. All techniques are documented, reproducible, and actively used in production systems.


1. Control Begins Before the Model: Input Filtering

Even before the model responds, the input text can be intercepted and replaced – for example, through a “toxicity check”:

from transformers import AutoTokenizer

# "gpt-3" is not a published tokenizer name; "gpt2" is the closest public one
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hate speech", return_tensors="pt")
# check_toxicity stands in for an external classifier (illustrative)
if check_toxicity(tokens):
    tokens = tokenizer("[CONTENT REMOVED]", return_tensors="pt")

Source: Raffel et al. (2020). Finding: Manipulated input significantly alters the model’s output.
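To make the idea concrete, here is a minimal, runnable sketch of such a pre-model filter. The blocklist and both function names (`check_toxicity`, `filter_input`) are hypothetical stand-ins for a real toxicity classifier, not part of any documented system:

```python
# Minimal sketch of a pre-model input filter. The blocklist and both
# function names are hypothetical stand-ins for a real toxicity classifier.
BLOCKLIST = {"hate", "slur"}

def check_toxicity(text: str) -> bool:
    """Flag text containing a blocklisted word (toy heuristic)."""
    return any(word in BLOCKLIST for word in text.lower().split())

def filter_input(text: str) -> str:
    """Replace flagged input before it ever reaches the model."""
    return "[CONTENT REMOVED]" if check_toxicity(text) else text

print(filter_input("I hate this"))    # flagged: replaced wholesale
print(filter_input("I like apples"))  # benign: passes through unchanged
```

In production the string comparison would be a trained classifier, but the control flow is the same: the input is replaced before the model ever sees it.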


2. Attention Control: Masking Attention Matrices

Transformer models process tokens through attention matrices. These can be deliberately masked:

import torch

# Zero the attention mask wherever a forbidden token occurs, so those
# positions are ignored during attention (forbidden_token_id is illustrative)
attention_mask = torch.where(
    input_ids == forbidden_token_id,
    torch.zeros_like(attention_mask),
    attention_mask,
)

Source: Vaswani et al. (2017). Finding: Masking specifically influences processing focus.
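The effect of masking is easiest to see at the level of the softmax itself. The following NumPy sketch (a toy example, not code from any real system) sets the score of one assumed forbidden position to negative infinity before the softmax, which forces its attention weight to exactly zero:

```python
import numpy as np

# Toy single-query attention row: position 2 is the assumed forbidden token.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0, 0.5]])    # raw attention scores
mask = np.array([[0.0, 0.0, -np.inf, 0.0]])  # -inf at the forbidden position
weights = softmax(scores + mask)

print(weights)  # the masked position receives exactly zero attention
```

Note that without the mask, position 2 would have had the highest weight; with it, the remaining weights are renormalized among the allowed positions.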


3. Internal Routing: Bias Adjustment in Feed-Forward Layers

# Illustrative pseudocode: overwrite feed-forward biases whenever a
# sensitive topic is detected in the hidden states
for layer in transformer.ff_layers:
    if detect_sensitive_topic(hidden_states):
        layer.bias = controlled_bias

Source: Liu et al. (2021). Finding: Interventions in hidden states specifically alter responses.
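A toy example shows why a bias overwrite steers the output: in a linear layer y = Wx + b, replacing b shifts the pre-activations additively, independent of the input. All values and names below are hypothetical and only illustrate the mechanism:

```python
import numpy as np

# Toy feed-forward layer y = W @ x + b. Overwriting the bias b shifts the
# pre-activations additively, regardless of the input x.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)

y_neutral = W @ x + np.zeros(4)
controlled_bias = np.array([0.0, 0.0, 5.0, 0.0])  # push unit 2 upward
y_steered = W @ x + controlled_bias

print(y_steered - y_neutral)  # only the biased unit moves
```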


4. Pre-censorship of Output: Logit Blocking

logits[:, forbidden_token_ids] = float('-inf')  # hard block: these tokens get zero probability

Source: Solaiman et al. (2019). Finding: Logit manipulation effectively prevents unwanted outputs.
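A runnable NumPy sketch of the same idea: after setting a token’s logit to negative infinity, softmax assigns it exactly zero probability, so neither sampling nor argmax can ever select it (the logit values are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.5, 3.0]])  # token 3 would normally win
forbidden_token_ids = [3]
logits[:, forbidden_token_ids] = -np.inf   # hard block before softmax

probs = softmax(logits)
print(probs)  # the blocked token has exactly zero probability
```

The argmax shifts away from the blocked token to the next-best candidate.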


5. Post-hoc Control: Sampling and RLHF

outputs = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.8,
    temperature=0.7,
)

Source: Holtzman et al. (2019)

Additionally, many models are trained with reward models (via RLHF – Reinforcement Learning from Human Feedback); the same reward models can also be used after generation to score, reject, or redirect problematic responses.
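The top_p=0.8 setting above corresponds to nucleus sampling (Holtzman et al., 2019): keep only the smallest set of highest-probability tokens whose cumulative mass reaches p, zero out the rest, and renormalize. A minimal sketch, with a made-up four-token distribution:

```python
import numpy as np

# Nucleus (top-p) sampling sketch: keep the smallest set of most-probable
# tokens whose cumulative mass reaches p, zero the rest, renormalize.
def top_p_filter(probs, p=0.8):
    order = np.argsort(probs)[::-1]  # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs))  # the low-probability tail is removed
```

Tokens outside the nucleus can never be sampled, which is itself a soft form of output control.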


Conclusion: “Defense in Depth” Rather Than Genuine Interaction

User queries in modern language models are no longer answered directly. Instead, they pass through multi-layered control systems that censor, filter, or reinterpret content – at each processing stage.

In other words: The AI doesn’t respond to your prompt – it responds to what remains after the filters have processed it.
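The five stages can be composed into a single toy pipeline. Every component below is a deliberately simplified, hypothetical stand-in (a fake "model", a keyword filter), meant only to show the control flow of defense in depth:

```python
import numpy as np

# Toy composition of the stages above. Every component is a simplified,
# hypothetical stand-in, not a real model or filter.

def filter_input(text):                    # stage 1: input filtering
    return "[CONTENT REMOVED]" if "hate" in text.lower() else text

def model_logits(text, vocab_size=5):      # stand-in for the transformer
    rng = np.random.default_rng(len(text)) # deterministic toy "model"
    return rng.normal(size=vocab_size)

def block_logits(logits, forbidden=(4,)):  # stage 4: logit blocking
    logits = logits.copy()
    logits[list(forbidden)] = -np.inf
    return logits

def decode(logits):                        # stage 5: pick a token
    return int(np.argmax(logits))

def pipeline(prompt):
    prompt = filter_input(prompt)
    return decode(block_logits(model_logits(prompt)))

print(pipeline("Tell me something"))  # whatever comes out, it is never token 4
```

The point of the composition: each stage can override the one before it, and the user only ever observes the end of the chain.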


Further analyses, sources, and case studies coming soon on elizaonsteroids.org.
