Curriculum Learning Fine-Tuning for Mathematical Reasoning

Technologies: Python, PyTorch, Transformers, LoRA, HuggingFace, Weights & Biases
Duration: November 2025
Links: GitHub | Blog Post | Interactive Notebook

Project Overview

Investigated whether curriculum learning—training models progressively from easy to hard problems—improves LLM mathematical reasoning capabilities. Compared two curriculum strategies against baseline random-order training.

Key Results

Complexity-Based Curriculum (✅ Success):

  • +1.37% accuracy improvement on difficult problems (Phi-3 2.7B)
  • +2.34% improvement on medium difficulty problems
  • Validates multi-factor difficulty estimation approach

Answer-Length Curriculum (❌ Failure):

  • -0.39% to -3.51% degradation across model sizes
  • Shows that naive heuristics can hurt performance
  • Important negative result for the field

Model Size Dependency:

  • Medium models (2.7B) benefit most from curriculum learning
  • Tiny models (135M) show minimal improvement (insufficient capacity)
  • Suggests curriculum learning effectiveness scales with model size

Technical Implementation

Models Fine-Tuned:

  • Phi-3 (2.7B parameters) - best curriculum learning gains
  • SmolLM2 (135M parameters) - baseline for small model behavior

Dataset:

  • GSM8K: 8.5K grade school math word problems
  • Split into 3 curriculum stages: Easy (33%) → Normal (33%) → Difficult (34%) (see the split sketch after this list)
  • 500 held-out samples for evaluation
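
A minimal sketch of this preparation with the HuggingFace `datasets` library, assuming the public `openai/gsm8k` dataset and the hypothetical `estimate_difficulty` helper sketched in the next section; the held-out split shown is illustrative, not necessarily the project's exact protocol.

```python
# Load GSM8K, score each problem's difficulty, and cut into three stages.
from datasets import load_dataset

train = load_dataset("openai/gsm8k", "main", split="train")
heldout = load_dataset("openai/gsm8k", "main", split="test").select(range(500))

# estimate_difficulty is a hypothetical helper (see the next section).
scored = train.map(lambda ex: {"difficulty": estimate_difficulty(ex["answer"])})
scored = scored.sort("difficulty")

n = len(scored)
easy      = scored.select(range(0, n // 3))           # ~33%
normal    = scored.select(range(n // 3, 2 * n // 3))  # ~33%
difficult = scored.select(range(2 * n // 3, n))       # ~34%
```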

Curriculum Strategies:

  1. Naive Answer-Length Strategy:
    • Sort by answer length (short → long)
    • Hypothesis: longer answers = harder problems
    • Result: Degraded performance (wrong assumption)
  2. Complexity-Based Strategy (Novel):
    • Multi-factor scoring: complexity = solution_steps × operation_complexity (sketched after this list)
    • Accounts for reasoning chains and mathematical operations
    • Result: Improved performance on difficult problems
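
A sketch of the complexity score above: solution steps are counted from the lines of GSM8K's worked solutions, and the per-operation weights are illustrative assumptions rather than the project's exact values.

```python
import re

# Illustrative operation weights; heavier operations contribute more.
OP_WEIGHTS = {"+": 1.0, "-": 1.0, "*": 1.5, "/": 2.0}

def estimate_difficulty(answer: str) -> float:
    """complexity = solution_steps * operation_complexity (see above)."""
    # GSM8K reference solutions put one reasoning step per line and end
    # with a final "#### <result>" line, which is not a reasoning step.
    steps = [ln for ln in answer.splitlines()
             if ln.strip() and not ln.startswith("####")]
    solution_steps = max(len(steps), 1)

    # Sum weighted arithmetic operators appearing in the worked solution.
    ops = re.findall(r"[+\-*/]", answer)
    operation_complexity = sum(OP_WEIGHTS.get(op, 1.0) for op in ops) or 1.0

    return solution_steps * operation_complexity
```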

Training Approach:

  • LoRA fine-tuning for parameter efficiency (r=16, alpha=32)
  • Progressive training: Train stage 1 → merge → train stage 2 → merge → train stage 3 (see the stage-loop sketch after this list)
  • Model merging between stages preserves learned knowledge
  • Weights & Biases for experiment tracking and visualization
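
A condensed sketch of the stage loop with `peft` and `transformers`. The base checkpoint id, training arguments, and the pre-tokenized stage datasets (`easy_tok`, `normal_tok`, `difficult_tok`) are assumptions, not the project's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

for stage, data in [("easy", easy_tok), ("normal", normal_tok), ("difficult", difficult_tok)]:
    peft_model = get_peft_model(model, lora_cfg)  # fresh LoRA adapters for this stage
    Trainer(
        model=peft_model,
        args=TrainingArguments(output_dir=f"runs/{stage}", num_train_epochs=1, report_to="wandb"),
        train_dataset=data,
    ).train()
    model = peft_model.merge_and_unload()  # fold adapters into the weights before the next stage
```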

Experimental Design

Baselines:

  • Random-order training (standard fine-tuning)
  • Checkpointed evaluation at 10%, 25%, 50%, 75%, 100% of training
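
A small sketch of how those checkpoints can be scheduled, assuming a known total step count (the value shown is illustrative).

```python
# Evaluate at fixed fractions of the training run.
total_steps = 2000  # illustrative
fractions = [0.10, 0.25, 0.50, 0.75, 1.00]
checkpoint_steps = sorted({max(1, round(f * total_steps)) for f in fractions})
print(checkpoint_steps)  # -> [200, 500, 1000, 1500, 2000]
```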

Evaluation Metrics:

  • Exact match accuracy on GSM8K test set (scoring sketched after this list)
  • Per-difficulty-level breakdown
  • Training efficiency (accuracy vs. training steps)
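
A sketch of the exact-match metric, assuming the model is prompted to end its output with a GSM8K-style `#### <answer>` line.

```python
import re

def extract_final_answer(text: str) -> str | None:
    # GSM8K references end with "#### <number>"; drop thousands separators.
    m = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    return m.group(1).replace(",", "") if m else None

def exact_match(predictions: list[str], references: list[str]) -> float:
    hits = sum(
        extract_final_answer(p) is not None
        and extract_final_answer(p) == extract_final_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)
```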

Reproducibility:

  • Complete code on GitHub with documentation
  • Fixed random seeds for reproducibility (see the snippet after this list)
  • Hyperparameter configurations documented
  • Interactive Jupyter notebook with visualizations
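
A seed setup along these lines is the standard recipe; the exact seed value is illustrative.

```python
import torch
from transformers import set_seed

set_seed(42)                               # seeds Python's random, NumPy, and PyTorch in one call
torch.backends.cudnn.deterministic = True  # prefer repeatable kernels over speed
torch.backends.cudnn.benchmark = False
```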

Key Findings

  1. Curriculum design matters more than curriculum usage
    • Well-designed curriculum: +1.37% improvement
    • Poorly-designed curriculum: -3.51% degradation
  2. Multi-factor difficulty estimation is crucial
    • Complexity score (steps × operations) works
    • Simple heuristics (answer length) fail
  3. Model capacity threshold exists
    • 135M parameters: insufficient capacity for curriculum benefit
    • 2.7B parameters: sweet spot for curriculum learning
    • Suggests larger models may not need curriculum
  4. Progressive merging is essential
    • Merging model weights between stages preserves knowledge
    • Without merging, catastrophic forgetting occurs

Honest Reporting

What Worked:

  • Complexity-based curriculum improved difficult problem accuracy
  • Model merging strategy effective for knowledge retention
  • Reproducible pipeline with comprehensive logging

What Didn’t Work:

  • Answer-length curriculum degraded performance
  • SmolLM2 showed no improvement (too small)
  • Curriculum benefit smaller than expected (1-2% vs. hoped 5-10%)

Published negative results to help researchers avoid failed approaches.

Impact & Lessons

Scientific Contribution:

  • Demonstrates the viability of curriculum learning for LLM reasoning tasks
  • Identifies model size threshold for curriculum effectiveness
  • Provides negative results (answer-length heuristic) to guide future research

Practical Takeaway:

  • Spend time on curriculum design (difficulty estimation)
  • Not all curricula help—some hurt performance
  • Consider model capacity when applying curriculum learning

ML Engineering:

  • Complete end-to-end pipeline from hypothesis to deployment
  • Comprehensive evaluation and honest reporting of failures
  • Reproducible research with public code and demos

This project showcases research experimentation skills: hypothesis testing, rigorous evaluation, honest reporting of negative results, and sharing reproducible code with the community.