Reward Hacking Examples & Chain-of-Thought For AI Safety
[D] Examples of reward hacking by AI or RL agents? : r ...
Reward Hacking in Reinforcement Learning | Lil'Log
Reward Hacking from a Causal Perspective — AI Alignment Forum
Reward Hacking in Reinforcement Learning
Figure 1 from Mitigating Reward Hacking via Information-Theoretic ...
Strategies to Mitigate AI Reward Hacking - Web crafting code
Realistic Reward Hacking Induces Different and Deeper Misalignment ...
Reward hacking behavior can generalize across tasks — AI Alignment Forum
Natural emergent misalignment from reward hacking in production RL ...
Reward Hacking Research Update | EleutherAI Blog
Understanding Reward Hacking in AI: Challenges and Solutions | by Burak ...
Principled Interpretability of Reward Hacking in Closed Frontier Models ...
Defining and Characterizing Reward Hacking | DeepAI
A brief example of reward hacking in GRPO
Reward Hacking in AI - YouTube
Steering RL Training: Benchmarking Interventions Against Reward Hacking ...
Reward hacking is becoming more sophisticated and deliberate in ...
10 Growth Hacking Examples to Boost Engagement and Revenue
Reward Hacking 101: Keeping Your Agent Honest
Reward Hacking in AI: OpenAI's Chain-of-Thought Monitoring Solution
Reward Hacking the Classroom - by Becky Allen
Teaching Claude to Cheat Reward Hacking Coding Tasks Makes Them Behave
When AI cheats: The hidden dangers of reward hacking - CyberGuy
Hacking our reward system for fitness
Overcome Reinforcement Learning Reward Hacking With MONA
Reward hacking - YouTube
When AI Gets Too Clever: The Art (and Science) of Reward Hacking - Shaz ...
Reward hacking behavior can generalize across tasks — LessWrong
Addressing Reward Hacking Explicitly
Training on Documents About Reward Hacking Induces Reward Hacking ...
(PDF) RRM: Robust Reward Model Training Mitigates Reward Hacking
RRM: Robust Reward Model Training Mitigates Reward Hacking | AI ...
Figure 25 from Mitigating Reward Hacking via Information-Theoretic ...
Reward Hacking by Reasoning Mo… - "The Cognitive Revolution" | AI ...
[Paper Review] RRM: Robust Reward Model Training Mitigates Reward Hacking
23 Proven Growth Hacking Examples You Can Steal to Gain Traction
31+ Growth Hacking Examples [You Can Use in 2021]
Figure 24 from Mitigating Reward Hacking via Information-Theoretic ...
Understanding AI Safety: How OpenAI is Tackling Reward Hacking in ...
A Detailed Explanation of Reward Hacking - Zhihu
Paper page - Reward Shaping to Mitigate Reward Hacking in RLHF
Figure 18 from Mitigating Reward Hacking via Information-Theoretic ...
Paper page - RRM: Robust Reward Model Training Mitigates Reward Hacking
Figure 1 from Defining and Characterizing Reward Hacking | Semantic Scholar
Figure 20 from Mitigating Reward Hacking via Information-Theoretic ...
Reward Hacking in Large Language Models (LLMs) | by Deepak Babu P R ...
Figure 22 from Mitigating Reward Hacking via Information-Theoretic ...
Figure 28 from Mitigating Reward Hacking via Information-Theoretic ...
Figure 23 from Mitigating Reward Hacking via Information-Theoretic ...
[2409.13156] RRM: Robust Reward Model Training Mitigates Reward Hacking
Reward Hacking: How AI Exploits the Goals We Give It - Americans for ...
Harmless reward hacks can generalize to misalignment in LLMs — LessWrong
Reward Hacking: When AI Cheats the System
Quickly Assessing Reward Hacking-like Behavior in LLMs and its ...
Example of reward hacking: AI learns a trick in a video game to get ...
Decoding Reward Hacking: Unraveling the Challenge and the KL Divergence ...
Reward Hacking: When Winning Spoils The Game
From shortcuts to sabotage: natural emergent misalignment from reward ...
Hacking 100k+ Loyalty Programs for Fun and Profit!
Reward Shaping in Reinforcement Learning | AI Tutorial | Next Electronics
Reward Hacking: Building a Dream Vacation with Points (2025) | Beem
Hacking Loyalty: Using Blockchain-based Rewards to Acquire New ...
Top Hacking Techniques Explained For Beginners - 2025 Guide
Growth hacking | PPTX
Best Travel Hacking Credit Cards Maximizing Rewards And Perks Trvlldrs
Paper page - Helping or Herding? Reward Model Ensembles Mitigate but do ...
Reward Hacking: Concrete Problems in AI Safety Part 3 : r/ControlProblem
US offers $10 million reward for hackers meddling in US elections | ZDNET
Generative AI with Large Language Models
LLMs Are Mountains of Knowledge — We Just Need to Find the Peaks | by ...
OpenPipe | RL For Agents
How to hack your risk to Rewards - YouTube
Research - METR
MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval ...
microsoft rewards hack unlimited points 2023 | |microsoft rewards hack ...
Top-10 Papers - AI Deception Survey
Different Types of Hackers: The 6 Hats Explained | InfoSec Insights
Anthropic study finds language models often hide their reasoning process
Reinforcement learning: from AlphaGo Zero to RULER
matonski/reward-hacking-prompts · Datasets at Hugging Face
Black-Box On-Policy Distillation of Large Language Models