
The Illusion of Free Input: Controlled User Steering in Transformer Models

Scientifically documented control mechanisms in modern language models – from input to output.

What actually happens to your prompt before an AI system responds?
The answer: a lot. And much of it remains intentionally opaque.

This post presents scientifically documented control mechanisms by which transformer-based models like GPT are steered – layer by layer, from input to output. All techniques are documented, reproducible, and actively used in production systems.


1. Control Begins Before the Model: Input Filtering

Even before the model responds, the input text can be intercepted and replaced – for example, through a “toxicity check”:

from transformers import AutoTokenizer

# "gpt-3" is not a public checkpoint; "gpt2" serves as a stand-in here
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hate speech", return_tensors="pt")
if check_toxicity(tokens):  # check_toxicity: upstream classifier, sketched below
    tokens = tokenizer("[CONTENT REMOVED]", return_tensors="pt")

📖 Source: Raffel et al. (2020)
🔍 Finding: Manipulated input significantly alters the model’s output.


2. Attention Control: Masking Attention

Transformer models relate tokens to one another through attention matrices. The attention mask can be deliberately modified so that flagged tokens are simply ignored:

import torch

# Zero out the mask wherever a forbidden token appears, so those positions
# are ignored by attention (forbidden_token_id is a placeholder)
attention_mask = torch.where(
    input_ids == forbidden_token_id,
    torch.zeros_like(attention_mask),
    attention_mask,
)
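
The modified mask is then passed to the forward call like an ordinary padding mask, so the masked positions contribute nothing to the attention scores. A minimal usage sketch, assuming model and the tensors above come from a standard Hugging Face setup:

outputs = model(input_ids=input_ids, attention_mask=attention_mask)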

📖 Source: Vaswani et al. (2017)
🔍 Finding: Masking specifically influences processing focus.


3. Internal Routing: Bias Adjustment in Feed-Forward Layers

Deeper in the network, the bias terms of feed-forward (MLP) sub-layers can be overwritten at runtime whenever a classifier flags the current hidden states. In pseudocode (ff_layers, detect_sensitive_topic, and controlled_bias are placeholders, not a real API):

for layer in transformer.ff_layers:
    if detect_sensitive_topic(hidden_states):
        layer.bias = controlled_bias
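
A runnable version of the same idea can be built with a PyTorch forward hook on GPT-2; the steering vector below is an assumption for illustration (all-zeros, i.e. a no-op):

import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

steering_vector = torch.zeros(model.config.n_embd)  # assumed offset; all-zeros changes nothing

def steer(module, inputs, output):
    # Shift this block's feed-forward output by a fixed vector
    return output + steering_vector

hook = model.h[6].mlp.register_forward_hook(steer)  # block 6 is an arbitrary choice
inputs = tokenizer("The weather today is", return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state
hook.remove()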

📖 Source: Liu et al. (2021)
🔍 Finding: Interventions in hidden states specifically alter responses.


4. Pre-censorship of Output: Logit Blocking

logits[:, forbidden_token_ids] = float('-inf')  # hard block: these tokens can never be sampled

📖 Source: Solaiman et al. (2019)
🔍 Finding: Logit manipulation effectively prevents unwanted outputs.
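
In practice, libraries expose this kind of blocking directly. Hugging Face's generate(), for example, accepts a bad_words_ids argument that applies the same masking internally:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# IDs to ban; BPE variants with a leading space need their own entries
banned = tokenizer(["ugly", " ugly"], add_special_tokens=False).input_ids
inputs = tokenizer("The cat is", return_tensors="pt")
outputs = model.generate(**inputs, bad_words_ids=banned, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))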


5. Post-hoc Control: Sampling and RLHF

outputs = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.8,        # nucleus sampling: sample only from the top 80% of probability mass
    temperature=0.7,  # values < 1 sharpen the distribution toward likely tokens
)
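
For intuition, this is roughly what top_p filtering does to the logits before sampling (a standalone sketch of the Holtzman et al. (2019) recipe, not the library's exact code):

import torch

def nucleus_filter(logits: torch.Tensor, top_p: float = 0.8) -> torch.Tensor:
    # Sort tokens by probability and find the smallest prefix covering top_p of the mass
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cumulative_probs > top_p
    remove[..., 1:] = remove[..., :-1].clone()  # shift right: keep the first token past the threshold
    remove[..., 0] = False                      # always keep the most likely token
    indices_to_remove = remove.scatter(-1, sorted_indices, remove)
    return logits.masked_fill(indices_to_remove, float('-inf'))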

📖 Source: Holtzman et al. (2019)

Additionally, many systems rely on reward models trained via RLHF (Reinforcement Learning from Human Feedback) to reject or redirect problematic responses after generation.
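
A minimal sketch of such a post-hoc gate; reward_model, its score() method, and the threshold are all assumptions for illustration:

def filter_response(prompt: str, response: str, threshold: float = 0.0) -> str:
    # reward_model stands in for a trained preference model
    score = reward_model.score(prompt, response)
    if score < threshold:
        return "I'm sorry, I can't help with that."  # reject or redirect
    return response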


Conclusion: “Defense in Depth” Rather Than Genuine Interaction

User queries in modern language models are no longer answered directly. Instead, they pass through multi-layered control systems that censor, filter, or reinterpret content – at each processing stage.

In other words:
The AI doesn’t respond to your prompt – it responds to what remains after the filters have processed it.


Further analyses, sources, and case studies coming soon on elizaonsteroids.org.