HAVA: Hybrid Approach to Value Alignment through Reward Weighing for Reinforcement Learning

Cerutti F.;
2025-01-01

Abstract

Our society is governed by a set of norms that together bring about the values we cherish, such as safety, fairness, and trustworthiness. The goal of value alignment is to create agents that not only perform their tasks but also promote these values through their behaviour. Many of these norms are written down as laws or rules (legal and safety norms), but even more remain unwritten (social norms). The techniques used to represent these norms also differ: safety and legal norms are often represented explicitly, for example in a logical language, while social norms are typically learned and remain hidden in the parameter space of a neural network. The literature lacks approaches that combine these different norm representations in a single algorithm. We propose a novel method that integrates both kinds of norms into the reinforcement learning process. Our method monitors the agent's compliance with the given norms and summarizes it in a quantity we call the agent's reputation. This quantity is used to weigh the received rewards, motivating the agent to become value-aligned. We carry out two experiments, including a continuous-state-space traffic problem, to demonstrate the importance of both written and unwritten norms and to show how our method can find value-aligned policies. Furthermore, we carry out ablations to demonstrate why it is better to combine these two groups of norms than to use either separately.
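
The abstract does not give the update rule for the reputation, so the following Python sketch is only an illustration of the general idea of weighing rewards by a compliance-based reputation, not the authors' algorithm. Every name and constant in it (ReputationTracker, decay, recovery, the speed limit, the comfort threshold) is a hypothetical choice made for this example, and the social-norm check is a hand-written stand-in for what would in practice be a learned model.

```python
from dataclasses import dataclass


@dataclass
class ReputationTracker:
    """Hypothetical reputation in [0, 1]; 1.0 means fully compliant so far."""
    reputation: float = 1.0
    decay: float = 0.5      # assumed multiplicative penalty after a violation
    recovery: float = 0.1   # assumed fraction of the gap to 1.0 regained when compliant

    def update(self, compliant: bool) -> float:
        if compliant:
            # Move part of the way back toward full reputation.
            self.reputation += self.recovery * (1.0 - self.reputation)
        else:
            # Shrink reputation after a norm violation.
            self.reputation *= self.decay
        return self.reputation


def violates_written_norm(speed: float, speed_limit: float = 30.0) -> bool:
    """Explicit rule, standing in for a legal/safety norm (assumed limit)."""
    return speed > speed_limit


def violates_social_norm(acceleration: float, comfort_threshold: float = 2.0) -> bool:
    """Stand-in for a learned social-norm model (assumed threshold)."""
    return abs(acceleration) > comfort_threshold


def weighed_reward(reward: float, speed: float, acceleration: float,
                   tracker: ReputationTracker) -> float:
    """Update the reputation from both norm checks, then scale the reward by it."""
    compliant = not (violates_written_norm(speed) or violates_social_norm(acceleration))
    return tracker.update(compliant) * reward


tracker = ReputationTracker()
r_violation = weighed_reward(10.0, speed=35.0, acceleration=0.5, tracker=tracker)
r_compliant = weighed_reward(10.0, speed=25.0, acceleration=0.5, tracker=tracker)
```

In this sketch a violation of either kind of norm lowers the reputation, so the same environment reward is worth less to the agent until it behaves compliantly again. The sketch assumes non-negative rewards; with negative rewards, scaling by a low reputation would shrink the penalty rather than the incentive.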
2025
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS
Other foreign public institutions
English
24th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2025
2025
USA
Pages 2096-2104 (9 pages)
International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
Reinforcement Learning; Reward Shaping; Value Alignment
Not applicable
open
Varys, K.; Cerutti, F.; Sobey, A.; Norman, T. J.
273
info:eu-repo/semantics/conferenceObject
4
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in Conference Proceedings
Files in this item:
File: paper_aamas_HAVA.pdf
Access: open access
License: PUBLIC - Creative Commons 4.0
Size: 2.98 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11379/640506
Warning: the data displayed have not been validated by the university.
