VeriReason

Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation

  • 2.7k downloads on HuggingFace
  • 2.8× first-attempt functional correctness improvement
  • 4 model sizes available
  • 1,892 high-quality training samples

Project Overview

VeriReason introduces a novel approach that uses reinforcement learning with testbench feedback to enhance pre-trained models for Verilog RTL code generation. Our method combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, establishing a new state of the art for automated RTL synthesis.

Authors: Yiting Wang, Guoheng Sun, Wanghao Ye, Gang Qu, Ang Li

Framework Architecture

[Figure: VeriReason framework architecture. A Verilog task prompt flows through the base model (Qwen2.5-Coder) and the reasoning-augmented SFT model into the GRPO trainer, whose sampled completions receive syntax, functional, and structural rewards from testbench verification against golden Verilog code and truth tables; iterative refinement yields the optimized policy model.]

The framework combines supervised fine-tuning with GRPO reinforcement learning, leveraging testbench feedback for comprehensive Verilog code generation optimization.

  1. Data Preparation: curate a high-quality Verilog dataset with reasoning steps and testbenches
  2. Supervised Fine-Tuning: train the base model on reasoning-augmented Verilog generation tasks
  3. GRPO Training: optimize the model using reinforcement learning with testbench feedback
  4. Evaluation: validate performance on the VerilogEval benchmarks

Step 1: Data Preparation & Reasoning Augmentation

We start with the RTLCoder dataset and apply comprehensive filtering and augmentation:

  • Syntax validation to ensure code correctness (a minimal check is sketched after this list)
  • GPT-4 validation for prompt-code alignment
  • Reasoning step generation with <think> blocks
  • Comprehensive testbench generation (100+ test cases)
  • Difficulty-based stratification (easy/hard splits)
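
As a concrete illustration of the syntax-validation step above, the following is a minimal sketch that gates candidates with Icarus Verilog; the function name and flow are ours rather than the project's actual pipeline code, and it assumes `iverilog` is on PATH:

```python
import subprocess
import tempfile

def passes_syntax_check(verilog_code: str) -> bool:
    """Return True if iverilog parses and elaborates the code without errors."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".v", delete=False) as f:
        f.write(verilog_code)
        path = f.name
    # "-t null" asks iverilog to compile/elaborate only, emitting no output file.
    result = subprocess.run(["iverilog", "-t", "null", path],
                            capture_output=True, text=True)
    return result.returncode == 0
```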

Step 2: Supervised Fine-Tuning (SFT)

Fine-tune base models on the curated reasoning-enhanced dataset:

# SFT Training Command
llamafactory-cli train qwen2.5_7b.yaml

# Custom Script Alternative
chmod +x run_rtl_training.sh
./run_rtl_training.sh

This stage establishes foundational Verilog understanding with explicit reasoning capabilities.
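
For illustration, a reasoning-augmented training record might look like the following; the field names and structure are assumptions for this sketch, not the dataset's published schema:

```python
# Hypothetical SFT record pairing a prompt with reasoning plus code
# (field names are illustrative, not the actual dataset schema).
sample = {
    "instruction": "Design a 4-bit adder with carry-in and carry-out ...",
    "output": (
        "<think>\n"
        "1. Requirement Analysis: ...\n"
        "2. Logical Design Decisions: ...\n"
        "3. Implementation Choices: ...\n"
        "</think>\n"
        "<answer>\nmodule four_bit_adder (...);\n  ...\nendmodule\n</answer>"
    ),
    "testbench": "module tb; /* 100+ directed and random cases */ endmodule",
}
```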

Step 3: GRPO Reinforcement Learning

Optimize policy using Group Relative Policy Optimization with multi-faceted rewards:

  • Multi-faceted reward system: Combines syntax, functional, and structural correctness
  • Testbench-driven feedback: Real-time verification using comprehensive test suites
  • Group-based optimization: Compares multiple candidate solutions for better learning
  • Policy gradient methods: Iterative improvement through reinforcement signals

The GRPO framework enables the model to learn from both positive and negative examples, developing self-checking capabilities and improving first-attempt functional correctness by up to 2.8× compared to baseline methods.
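
To make the group-based mechanism concrete, here is an illustrative PyTorch fragment (not the project's trainer) showing how rewards for the G completions of one prompt are normalized within the group, so the model learns from both its better and worse candidates:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled completion of a prompt."""
    # Center and scale within the group; better-than-average completions get
    # a positive advantage, worse-than-average a negative one.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one functionally correct completion (reward 2.0) among weaker ones.
adv = group_relative_advantages(torch.tensor([2.0, 0.6, 0.1, 0.0]))
```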

Step 4: Comprehensive Evaluation

Validate performance on industry-standard benchmarks:

  • VerilogEval-Machine: 143 algorithmically generated specifications
  • VerilogEval-Human: 156 human-written specifications
  • Metrics: pass@1 and pass@5 functional correctness (estimator sketched after this list)
  • Comparison: Against GPT-4 Turbo, commercial systems
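
For reference, pass@k is conventionally computed with the unbiased estimator introduced with HumanEval (Chen et al., 2021); this sketch is illustrative, not this repository's evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples drawn, c of them correct -> unbiased estimate of pass@k."""
    if n - c < k:
        return 1.0  # too few failures for any size-k draw to miss all correct ones
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples, 5 correct: pass@1 = 0.25, pass@5 ≈ 0.81
```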

Interactive Examples

4-Bit Adder with Carry Implementation

Explore how VeriReason approaches complex Verilog design tasks with explicit reasoning and structured implementation.

User Prompt
You are tasked with designing a Verilog module that implements a 4-bit adder with carry-in and carry-out functionality. The module should have the following inputs and outputs:

Inputs:
- `clk`: a clock signal
- `reset`: a synchronous reset signal that resets the output to 0 when high
- `a`: a 4-bit input signal
- `b`: a 4-bit input signal
- `cin`: a carry-in signal

Outputs:
- `sum`: a 4-bit output signal that represents the sum of `a`, `b`, and `cin`
- `cout`: a carry-out signal that represents the carry-out from the most significant bit of the sum

Assume that the module will be used in a larger design and that you cannot use any Verilog libraries or pre-defined modules. First, think through the design approach, considering the functionality, inputs, outputs, and implementation details. Then provide the complete Verilog code implementation.
Reasoning Process
<think>
1. Requirement Analysis:
   - Inputs: `clk`, `reset`, 4-bit `a`, 4-bit `b`, `cin`.
   - Outputs: 4-bit `sum`, 1-bit `cout`.
   - Behavior: Output the sum (`a` + `b` + `cin`) synchronously on the clock edge.
   - On `reset`, outputs (`sum`, `cout`) must be set to 0.
2. Logical Design Decisions:
   - 4-bit addition of `a`, `b`, and `cin` can produce a 5-bit result (maximum 31).
   - The adder needs to capture both the sum (lower 4 bits) and carry-out (fifth bit).
   - Synchronous reset: on each `posedge clk`, check `reset`.
3. Implementation Choices:
   - Sequential design: use of `clk` and synchronous `reset` requires outputs `sum`/`cout` to be stored in registers (hence, `output reg`).
   - Adder operation: addition performed as `adder_result = a + b + cin;`
   - Assignments: lower 4 bits → `sum`, upper bit → `cout`.

Summary: The code implements a registered 4-bit adder with carry-in/carry-out, synchronous reset, and clocked outputs.
</think>
Generated Verilog Code
<answer>
```verilog
module four_bit_adder (
    input            clk,
    input            reset,
    input      [3:0] a,
    input      [3:0] b,
    input            cin,
    output reg [3:0] sum,
    output reg       cout
);
    reg [4:0] adder_result;

    always @(posedge clk) begin
        if (reset) begin
            sum  <= 4'b0000;
            cout <= 1'b0;
        end else begin
            adder_result = a + b + cin; // 5-bit result to capture the carry
            sum  <= adder_result[3:0];
            cout <= adder_result[4];
        end
    end
endmodule
```
</answer>
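
Tying the example back to training: a generated module like the one above can be verified by compiling and simulating it against its golden testbench. A minimal sketch, assuming `iverilog`/`vvp` are on PATH and a self-checking testbench that prints FAIL on any mismatch (all names illustrative):

```python
import os
import subprocess
import tempfile

def functionally_correct(dut_code: str, testbench: str) -> bool:
    """Compile DUT + testbench with iverilog, simulate with vvp, scan for FAIL."""
    with tempfile.TemporaryDirectory() as d:
        dut_path = os.path.join(d, "dut.v")
        tb_path = os.path.join(d, "tb.v")
        sim_path = os.path.join(d, "sim.out")
        with open(dut_path, "w") as f:
            f.write(dut_code)
        with open(tb_path, "w") as f:
            f.write(testbench)
        compiled = subprocess.run(["iverilog", "-o", sim_path, dut_path, tb_path])
        if compiled.returncode != 0:
            return False  # syntax/elaboration failure
        sim = subprocess.run(["vvp", sim_path], capture_output=True, text=True)
        return sim.returncode == 0 and "FAIL" not in sim.stdout
```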

Technical Methodology

Reward Model Design

Multi-faceted Reward Function
$$
R(o) =
\begin{cases}
2.0, & \text{if functionally correct} \\
0.1 + 1.0 \cdot \mathrm{AST}_{\mathrm{score}}(o), & \text{if syntactically correct} \\
0, & \text{otherwise}
\end{cases}
$$

$\mathrm{AST}_{\mathrm{score}}(o)$ measures structural similarity using hierarchical comparison over AST components $c \in C$:

$$
\mathrm{AST}_{\mathrm{score}}(o) = \sum_{c \in C} w_c \left( 0.6 \cdot \mathrm{sim}_c + 0.5 \cdot \mathrm{cov}_c - 0.3 \cdot \mathrm{red}_c \right)
$$
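
Transcribed directly into code, the reward looks as follows; the per-component weights w_c and the similarity/coverage/redundancy terms are assumed to be produced by an upstream AST comparison stage:

```python
def ast_score(components) -> float:
    """components: iterable of (w_c, sim_c, cov_c, red_c) per AST component c."""
    return sum(w * (0.6 * sim + 0.5 * cov - 0.3 * red)
               for w, sim, cov, red in components)

def reward(functionally_correct: bool, syntactically_correct: bool, components) -> float:
    if functionally_correct:
        return 2.0
    if syntactically_correct:
        return 0.1 + 1.0 * ast_score(components)
    return 0.0
```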

GRPO Optimization

GRPO Objective Function
$$
\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\bigl( r_i \rho_i,\; \mathrm{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, r_i \bigr) \right] - \beta\, D_{\mathrm{KL}}\bigl( \pi_\theta(\cdot \mid q) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid q) \bigr)
$$

where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ is the importance sampling ratio, $r_i$ is the normalized reward, and $\beta$ controls the KL divergence penalty.
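
A compact sketch of the clipped surrogate for one group of G completions, written at the sequence level for readability (illustrative only; the hyperparameter values are placeholders, and a real trainer applies this per token):

```python
import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, kl: torch.Tensor,
              eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """logp_new/logp_old: (G,) log-probs per completion; rewards: normalized r_i."""
    rho = torch.exp(logp_new - logp_old)             # importance ratios rho_i
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(rewards * rho, rewards * clipped).mean()
    return -(surrogate - beta * kl)                  # minimize the negative objective
```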

Data Filtration Strategy

Adaptive two-stage filtering optimizes training effectiveness:

$$
D_{\mathrm{filtered}} = \{\, s \in D \mid \mu_r(s) \in [\alpha_{\min}, \alpha_{\max}] \ \text{and}\ \sigma_r(s) > \beta \,\}
$$

where $\alpha_{\min} = 0.3$, $\alpha_{\max} = 1.8$, and $\beta = 0.1$.

Difficulty score: $\delta(s) = 1 - \dfrac{\mu_r(s) - \alpha_{\min}}{\alpha_{\max} - \alpha_{\min}}$

This yields 1149 hard samples and 743 easy samples for targeted training.
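
The filter and difficulty score translate directly into code; here μ_r(s) and σ_r(s) are assumed to be the mean and standard deviation of rewards collected for sample s across multiple generations:

```python
ALPHA_MIN, ALPHA_MAX, BETA = 0.3, 1.8, 0.1  # thresholds from the formulas above

def keep_sample(mu_r: float, sigma_r: float) -> bool:
    """Two-stage filter: reward mean in range, reward variance high enough."""
    return ALPHA_MIN <= mu_r <= ALPHA_MAX and sigma_r > BETA

def difficulty(mu_r: float) -> float:
    """Maps mu_r in [alpha_min, alpha_max] to a difficulty score in [0, 1]."""
    return 1.0 - (mu_r - ALPHA_MIN) / (ALPHA_MAX - ALPHA_MIN)
```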

Available Models

VeriReason-Qwen2.5-1.5B

Efficient 1.5B parameter model optimized for resource-constrained environments while maintaining strong performance.

VeriReason-Qwen2.5-3B

Flagship 3B parameter model combining SFT with GRPO reinforcement learning for state-of-the-art RTL generation.

VeriReason-Qwen2.5-7B-SFT

7B parameter model with supervised fine-tuning for enhanced reasoning capabilities in Verilog generation.

VeriReason-Llama-7B-GRPO

7B parameter Llama-based model with GRPO training demonstrating architecture generalizability.

Training Datasets

RTL-Coder Small

Filtered baseline dataset without reasoning components, ideal for initial training and comparison studies.

RTL-Coder 7B Reasoning TB Simple

Simplified reasoning dataset with testbench feedback for enhanced model training and validation.

RTL-Coder 7B Reasoning TB

Full reasoning dataset with comprehensive testbench feedback integration and explicit reasoning steps.

RTL-Coder 7B Combined

Comprehensive combined dataset incorporating all reasoning and testbench feedback components for full training.

Training & Usage

Supervised Fine-Tuning (SFT)

Fine-tune base models on the curated reasoning-enhanced dataset using either of two methods:

# Method 1: Using LlamaFactory
llamafactory-cli train qwen2.5_7b.yaml

# Method 2: Custom training script
# Move sft_rtl to src/open_r1/
chmod +x run_rtl_training.sh
./run_rtl_training.sh

GRPO Reinforcement Learning

Optimize policy using Group Relative Policy Optimization with multi-faceted rewards:

# Move necessary files
mv verilog_rewards_tb.py verilog_train_tb.py src/open_r1/

# Create recipe directory
mkdir verilog_recipe
mv verilog_grpo_tb.yaml verilog_recipe/

# Launch GRPO training
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
CUDA_VISIBLE_DEVICES=5,6,7 ACCELERATE_USE_NCCL=1 \
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
--num_processes=3 src/open_r1/verilog_train_rtlcoder.py \
--config verilog_recipe/verilog_grpo_tb.yaml --use_vllm=false

System Requirements

Hardware
  • CUDA-compatible GPUs
  • Multi-GPU recommended
  • 16GB+ VRAM per GPU
Software
  • PyTorch with CUDA support
  • Accelerate library
  • NCCL for distributed training
Tools
  • Icarus Verilog (iverilog) simulator
  • LlamaFactory (optional)
  • Open-R1

Citation

BibTeX Entry

@misc{wang2025verireasonreinforcementlearningtestbench,
  title={VeriReason: Reinforcement Learning with Testbench Feedback
         for Reasoning-Enhanced Verilog Generation},
  author={Yiting Wang and Guoheng Sun and Wanghao Ye and Gang Qu and Ang Li},
  year={2025},
  eprint={2505.11849},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.11849},
}