In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels across 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Building on this, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness annotated separately) and single-preference data (trading off helpfulness against harmlessness from the outset). Using this large-scale annotation data, we further train a severity-sensitive moderation model for risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
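To make the record structure concrete, the sketch below loads the dataset with the Hugging Face `datasets` library and inspects one dual-preference entry. The repository id `PKU-Alignment/PKU-SafeRLHF` and the field names shown (`prompt`, `response_0`, `response_1`, `better_response_id`, `safer_response_id`, per-response safety labels) are assumptions for illustration, not a normative schema.

```python
# Minimal sketch: inspect one dual-preference record from PKU-SafeRLHF.
# Assumes the Hugging Face `datasets` package and the repository id
# "PKU-Alignment/PKU-SafeRLHF"; field names are illustrative assumptions.
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = ds[0]
print("Prompt:", example["prompt"])
print("Response 0:", example["response_0"][:200])
print("Response 1:", example["response_1"][:200])

# Dual-preference annotation: helpfulness and harmlessness are labeled
# separately, so the "better" (more helpful) response and the "safer"
# (more harmless) response need not be the same one.
print("More helpful response id:", example["better_response_id"])
print("Safer response id:", example["safer_response_id"])
print("Response 0 safe:", example["is_response_0_safe"])
print("Response 1 safe:", example["is_response_1_safe"])
```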
Dataset composition. Left: Q-A pairs are annotated with a safety meta-label. Middle: Distribution of each harm category and each severity grade within unsafe Q-A pairs. Right: Distribution of responses generated by each model.
Data generation (left): High-quality prompts are obtained by combining human demonstrations with LLMs; the generation temperature is then varied and similarity analysis is applied to produce diverse responses to these prompts. Data annotation (right): We use joint human-AI annotation to assess the safety of each Q-A pair and perform fine-grained annotation across 19 harm categories and 3 severity levels. Based on the safety meta-label, we conduct single-preference annotation of human preferences over Q-A-B pairs; we also perform decoupled annotation of helpfulness and harmlessness, forming dual preferences and thereby supporting broader applications.
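As a rough illustration of the generation step described above (temperature adjustment plus similarity analysis to diversify responses), the sketch below samples completions at several temperatures from a Llama-family model and discards near-duplicates using a simple token-overlap measure. The model id, temperature values, Jaccard similarity, and the 0.7 threshold are illustrative assumptions, not the exact pipeline used in the paper.

```python
# Sketch of the response-diversification step: sample at several temperatures,
# then drop near-duplicate answers via a crude token-overlap (Jaccard) check.
# Model id, temperatures, and the 0.7 threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any Llama-family chat model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity used as a cheap proxy for response similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def diverse_responses(prompt: str, temperatures=(0.5, 0.8, 1.1), max_sim=0.7):
    """Generate one sample per temperature, keeping only sufficiently distinct ones."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    kept = []
    for temp in temperatures:
        output = model.generate(
            **inputs, do_sample=True, temperature=temp, max_new_tokens=256
        )
        text = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
        # Keep a sample only if it is not too similar to any previously kept one.
        if all(jaccard(text, prev) < max_sim for prev in kept):
            kept.append(text)
    return kept
```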