Project Overview
VeriReason introduces a novel approach that uses reinforcement learning with testbench feedback to enhance pre-trained models for Verilog RTL code generation. Our method combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, establishing a new state-of-the-art for automated RTL synthesis.
Framework Architecture

The framework combines supervised fine-tuning with GRPO reinforcement learning, using testbench feedback to optimize Verilog code generation end to end.
Step 1: Data Preparation & Reasoning Augmentation
We start with the RTLCoder dataset and apply comprehensive filtering and augmentation (a sketch of the syntax-validation stage follows the list):
- Syntax validation to ensure code correctness
- GPT-4 validation for prompt-code alignment
- Reasoning step generation with <think> blocks
- Comprehensive testbench generation (100+ test cases)
- Difficulty-based stratification (easy/hard splits)
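As an illustration of the first stage, the sketch below (assumed helper names and JSONL fields, not the repository's actual code) compiles each candidate sample with Icarus Verilog and keeps only the samples that elaborate cleanly:
import json
import os
import subprocess
import tempfile

def compiles_ok(verilog_code: str) -> bool:
    """Return True if Icarus Verilog can elaborate the snippet without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(verilog_code)
        path = f.name
    try:
        # '-t null' parses and elaborates without generating a simulation binary
        result = subprocess.run(["iverilog", "-t", "null", path],
                                capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def filter_dataset(in_path: str, out_path: str) -> None:
    """Keep only JSONL samples whose Verilog output compiles."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)  # assumed fields: {"instruction": ..., "output": <Verilog>}
            if compiles_ok(sample["output"]):
                fout.write(json.dumps(sample) + "\n")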
Step 2: Supervised Fine-Tuning (SFT)
Fine-tune base models on curated reasoning-enhanced dataset:
# SFT Training Command
llamafactory-cli train qwen2.5_7b.yaml
# Custom Script Alternative
chmod +x run_rtl_training.sh
./run_rtl_training.sh
This stage establishes foundational Verilog understanding with explicit reasoning capabilities.
Step 3: GRPO Reinforcement Learning
Optimize policy using Group Relative Policy Optimization with multi-faceted rewards:
- Multi-faceted reward system: Combines syntax, functional, and structural correctness
- Testbench-driven feedback: Real-time verification using comprehensive test suites
- Group-based optimization: Compares multiple candidate solutions for better learning
- Policy gradient methods: Iterative improvement through reinforcement signals
The GRPO framework enables the model to learn from both positive and negative examples, developing self-checking capabilities and improving first-attempt functional correctness by up to 2.8× compared to baseline methods.
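To make the two key ideas concrete, here is a minimal sketch, with illustrative weights and function names rather than VeriReason's exact reward definition: a composite reward mixing syntax, functional (testbench pass rate), and structural scores, plus the group-relative advantage that normalizes each candidate against the other completions sampled for the same prompt.
from statistics import mean, pstdev

def composite_reward(syntax_ok: bool, tests_passed: int, tests_total: int,
                     structural_score: float) -> float:
    """Illustrative mix of syntax, functional, and structural correctness."""
    functional = tests_passed / max(tests_total, 1)  # testbench pass rate in [0, 1]
    return 0.2 * float(syntax_ok) + 0.5 * functional + 0.3 * structural_score

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each candidate relative to its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: four candidate solutions generated for one prompt
rewards = [composite_reward(True, 95, 100, 0.8),
           composite_reward(True, 40, 100, 0.5),
           composite_reward(False, 0, 100, 0.1),
           composite_reward(True, 100, 100, 0.9)]
print(group_relative_advantages(rewards))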
Step 4: Comprehensive Evaluation
Validate performance on industry-standard benchmarks:
- VerilogEval-Machine: 143 algorithmically generated specifications
- VerilogEval-Human: 156 human-written specifications
- Metrics: pass@1 and pass@5 functional correctness (see the estimator sketch after this list)
- Comparison: against GPT-4 Turbo and other commercial systems
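pass@k is typically computed with the standard unbiased estimator of Chen et al. (2021); a small sketch (not necessarily the exact evaluation script used here):
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is functionally correct,
    given n generated samples of which c pass the testbench."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 20 samples per problem, 7 of which pass the testbench
print(pass_at_k(n=20, c=7, k=1), pass_at_k(n=20, c=7, k=5))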
Interactive Examples
4-Bit Adder with Carry Implementation
Explore how VeriReason approaches complex Verilog design tasks with explicit reasoning and structured implementation.
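The record below is a hand-written illustration (not an actual dataset entry, and the field names are assumptions) of the reasoning-augmented format: a natural-language specification, an explicit <think> block, and the final Verilog implementation.
example = {
    "instruction": "Design a 4-bit adder with carry-in and carry-out.",
    "output": (
        "<think>\n"
        "The module needs two 4-bit operands a and b, a 1-bit cin, a 4-bit sum,\n"
        "and a 1-bit cout. A 5-bit addition captures the carry, so assign\n"
        "{cout, sum} = a + b + cin.\n"
        "</think>\n"
        "module adder_4bit (\n"
        "    input  [3:0] a,\n"
        "    input  [3:0] b,\n"
        "    input        cin,\n"
        "    output [3:0] sum,\n"
        "    output       cout\n"
        ");\n"
        "    assign {cout, sum} = a + b + cin;\n"
        "endmodule\n"
    ),
}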
Technical Methodology
Reward Model Design
$\mathrm{AST}_{\mathrm{score}}(o) = \sum_{c \in C} w_c \cdot \left( 0.6\,\mathrm{sim}_c + 0.5\,\mathrm{cov}_c - 0.3\,\mathrm{red}_c \right)$
GRPO Optimization
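As a reference point, GRPO (Shao et al., 2024) optimizes a clipped surrogate objective with group-normalized advantages and a KL penalty toward a reference policy; a sequence-level sketch (the exact form used in this project may differ in detail):

$$
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}, \qquad
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i \hat{A}_i,\; \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \right) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
$$

where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ is the importance ratio of candidate $o_i$ among $G$ completions sampled for prompt $q$, and $r_i$ is its reward.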
Data Filtration Strategy
Adaptive two-stage filtering optimizes training effectiveness:
Difficulty score: $\delta(s) = 1 - \dfrac{\mu_r(s) - \alpha_{\min}}{\alpha_{\max} - \alpha_{\min}}$
This yields 1149 hard samples and 743 easy samples for targeted training.
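A minimal sketch of this split, assuming $\mu_r(s)$ is the mean reward obtained on sample $s$ and $\alpha_{\min}$, $\alpha_{\max}$ are the normalization bounds (threshold and field names are illustrative):
def difficulty(mu_r: float, alpha_min: float, alpha_max: float) -> float:
    """delta(s): closer to 1 means harder (lower mean reward)."""
    return 1.0 - (mu_r - alpha_min) / (alpha_max - alpha_min)

def split_by_difficulty(samples, alpha_min=0.0, alpha_max=1.0, threshold=0.5):
    """Partition samples into easy/hard buckets by difficulty score."""
    easy, hard = [], []
    for s in samples:
        bucket = hard if difficulty(s["mean_reward"], alpha_min, alpha_max) >= threshold else easy
        bucket.append(s)
    return easy, hard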
Available Models
VeriReason-Qwen2.5-1.5B
Efficient 1.5B parameter model optimized for resource-constrained environments while maintaining strong performance.
VeriReason-Qwen2.5-3B
Flagship 3B parameter model combining SFT with GRPO reinforcement learning for state-of-the-art RTL generation.
VeriReason-Qwen2.5-7B-SFT
7B parameter model with supervised fine-tuning for enhanced reasoning capabilities in Verilog generation.
VeriReason-Llama-7B-GRPO
7B parameter Llama-based model with GRPO training demonstrating architecture generalizability.
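A hedged usage sketch for any of the checkpoints above, using the Hugging Face transformers API; the repository ID is a placeholder, not a verified model ID:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/VeriReason-Qwen2.5-3B"  # placeholder; substitute the actual model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Design a 4-bit adder with carry-in and carry-out. Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))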
Training Datasets
RTL-Coder Small
Filtered baseline dataset without reasoning components, ideal for initial training and comparison studies.
RTL-Coder 7B Reasoning TB Simple
Simplified reasoning dataset with testbench feedback for enhanced model training and validation.
RTL-Coder 7B Reasoning TB
Full reasoning dataset with comprehensive testbench feedback integration and explicit reasoning steps.
RTL-Coder 7B Combined
Comprehensive combined dataset incorporating all reasoning and testbench feedback components for full training.
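Similarly, any of the datasets above can be pulled with the Hugging Face datasets library; the repository ID and split name below are placeholders:
from datasets import load_dataset

dataset = load_dataset("<org>/RTL-Coder_7b_reasoning_tb")  # placeholder dataset ID
print(dataset)                      # available splits and sizes
print(dataset["train"][0].keys())   # inspect one sample's fields (assuming a "train" split)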
Training & Usage
Supervised Fine-Tuning (SFT)
Fine-tune base models on curated reasoning-enhanced dataset using two methods:
# Method 1: Using LlamaFactory
llamafactory-cli train qwen2.5_7b.yaml

# Method 2: Custom training script
# Move sft_rtl to src/open_r1/
chmod +x run_rtl_training.sh
./run_rtl_training.sh
GRPO Reinforcement Learning
Optimize policy using Group Relative Policy Optimization with multi-faceted rewards:
# Move necessary files
mv verilog_rewards_tb.py verilog_train_tb.py src/open_r1/

# Create recipe directory
mkdir verilog_recipe
mv verilog_grpo_tb.yaml verilog_recipe/

# Launch GRPO training
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
CUDA_VISIBLE_DEVICES=5,6,7 ACCELERATE_USE_NCCL=1 \
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
  --num_processes=3 src/open_r1/verilog_train_rtlcoder.py \
  --config verilog_recipe/verilog_grpo_tb.yaml --use_vllm=false
System Requirements
Hardware
- CUDA-compatible GPUs
- Multi-GPU recommended
- 16GB+ VRAM per GPU
Software
- PyTorch with CUDA support
- Accelerate library
- NCCL for distributed training
Tools
- Icarus Verilog (iverilog) simulator
- LlamaFactory (optional)
- Open-R1
Citation
BibTeX Entry
@misc{wang2025verireasonreinforcementlearningtestbench,
      title={VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation},
      author={Yiting Wang and Guoheng Sun and Wanghao Ye and Gang Qu and Ang Li},
      year={2025},
      eprint={2505.11849},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.11849},
}