Bachelor's Project · Mohammadreza Joneidi Jafari · Supervisor: Prof. Nikoofard
This project evaluates encoder-decoder models (mlongT5 and parsT5) for simplifying complex Persian legal texts into plain language. We compare these two fine-tuned models with two Llama-based Persian LLMs. Key contributions:
- Optimizer Comparison: AdamW vs. LAMB vs. SGD (AdamW achieved the best performance).
- Unlimiformer Integration: handles long legal documents effectively for the parsT5 model (see the conceptual sketch after this list).
- Rigorous Metrics: ROUGE, BERTScore, and custom readability scores.
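For intuition only, here is a minimal conceptual sketch of the idea behind the Unlimiformer integration, not the official Unlimiformer implementation: instead of cross-attending to every encoder state of a long document, each decoder query attends only to its top-k most relevant encoder hidden states. Exact top-k selection below stands in for Unlimiformer's kNN index.

```python
# Toy, pure-PyTorch illustration (not the official Unlimiformer code).
import torch

def topk_cross_attention(query: torch.Tensor, encoder_states: torch.Tensor, k: int = 64) -> torch.Tensor:
    """query: (d,) decoder query; encoder_states: (n, d) states of an arbitrarily long input."""
    scores = encoder_states @ query                           # (n,) dot-product relevance
    top_scores, top_idx = scores.topk(min(k, scores.numel())) # keep only the k most relevant states
    weights = torch.softmax(top_scores, dim=0)                # attend over the retrieved states only
    return weights @ encoder_states[top_idx]                  # (d,) context vector

# Toy usage: a "document" of 10,000 encoded tokens attended with k = 64.
context = topk_cross_attention(torch.randn(512), torch.randn(10_000, 512))
print(context.shape)  # torch.Size([512])
```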
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore (F1) |
|---|---|---|---|---|
| ParsT5 (1 Block) | 38.08% | 15.83% | 19.41% | 73.71% |
| ParsT5 (3 Blocks) | 38.4% | 15.61% | 23.18% | 75.13% |
| mlongT5 (1 Block) | 27.94% | 1.77% | 11.22% | 64.89% |
| mlongT5 (3 Blocks) | 25.36% | 1.23% | 10.81% | 49.46% |
| PersianLlaMA-13B | 28.64% | 9.81% | 13.67% | 70.80% |
| AVA Llama_3_V2 | 30.07% | 10.33% | 16.39% | 70.87% |
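For reference, a minimal sketch of how scores like these can be computed with the Hugging Face `evaluate` package (an assumption about tooling; the prediction and reference strings below are placeholders, not the project's actual outputs):

```python
# Sketch of the metric computation; the strings are placeholders.
import evaluate

predictions = ["a model-generated simplification"]
references = ["the human-written simplification"]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # rouge1 / rouge2 / rougeL

bertscore = evaluate.load("bertscore")
# lang="fa" selects a multilingual backbone appropriate for Persian text.
print(bertscore.compute(predictions=predictions, references=references, lang="fa"))
```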
Model hosted on Hugging Face. You can access and use this model directly via the Hugging Face Hub:
Link: simplification-legal-text
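A minimal usage sketch with `transformers`; `MODEL_ID` is a placeholder for the repository linked above, and the generation settings are illustrative:

```python
# Load the fine-tuned seq2seq model from the Hugging Face Hub (MODEL_ID is a placeholder).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "<user>/simplification-legal-text"  # replace with the repository linked above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

legal_text = "..."  # a complex Persian legal passage
inputs = tokenizer(legal_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```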
- Models: mlongT5 (12-block), parsT5 (12-block), Unlimiformer (long-context).
- Optimizers: AdamW (best), LAMB, SGD (see the sketch after this list).
- Framework: PyTorch + Hugging Face `transformers`.
- Hardware: (specify GPUs/TPUs if applicable).
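A hedged sketch of how the three compared optimizers could be instantiated; the learning rates are illustrative, AdamW and SGD ship with PyTorch, and LAMB is assumed to come from the third-party `torch_optimizer` package:

```python
# Illustrative optimizer setup; the Linear layer stands in for the seq2seq model's parameters.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)

adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)  # best-performing in this project
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# LAMB is not in core PyTorch; one option is the `torch_optimizer` package (an assumption):
# from torch_optimizer import Lamb
# lamb = Lamb(model.parameters(), lr=3e-4, weight_decay=0.01)
```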
- 16,000+ Persian legal texts (decision texts, dates).
- Split: 85% train, 5% validation, 10% test (a split sketch follows this list).
- Preprocessing: scraping and manual labeling for simplification.
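A sketch of producing the 85/5/10 split with the `datasets` library; `legal_texts.csv` is a placeholder file name, not the project's actual dataset file:

```python
# Reproduce the 85% train / 5% validation / 10% test split (the file name is a placeholder).
from datasets import load_dataset

ds = load_dataset("csv", data_files="legal_texts.csv")["train"]

split = ds.train_test_split(test_size=0.10, seed=42)  # carve off 10% for test
train_val = split["train"].train_test_split(
    test_size=0.05 / 0.90, seed=42                    # 5% of the original size for validation
)

train_ds, val_ds, test_ds = train_val["train"], train_val["test"], split["test"]
print(len(train_ds), len(val_ds), len(test_ds))       # roughly 85% / 5% / 10%
```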
For questions, collaborations, or access to the full dataset, feel free to reach out:
Email: m.r.joneidi.02@gmail.com
LinkedIn: Mohammadreza Joneidi Jafari