Curriculum Learning Fine-Tuning for Mathematical Reasoning
Technologies: Python, PyTorch, Transformers, LoRA, HuggingFace, Weights & Biases
Duration: November 2025
Links: GitHub | Blog Post | Interactive Notebook
Project Overview
Investigated whether curriculum learning—training models progressively from easy to hard problems—improves LLM mathematical reasoning capabilities. Compared two curriculum strategies against baseline random-order training.
Key Results
Complexity-Based Curriculum (✅ Success):
- +1.37% accuracy improvement on difficult problems (Phi-3 2.7B)
- +2.34% improvement on medium difficulty problems
- Validates multi-factor difficulty estimation approach
Answer-Length Curriculum (❌ Failure):
- -0.39% to -3.51% degradation across model sizes
- Shows that naive difficulty heuristics can hurt performance
- Important negative result for the field
Model Size Dependency:
- Medium models (2.7B) benefit most from curriculum learning
- Tiny models (135M) show minimal improvement (insufficient capacity)
- Suggests curriculum learning effectiveness depends on sufficient model capacity
Technical Implementation
Models Fine-Tuned:
- Phi-3 (2.7B parameters) - best curriculum learning gains
- SmolLM2 (135M parameters) - baseline for small model behavior
Dataset:
- GSM8K: 8.5K grade school math word problems
- Split into 3 curriculum stages: Easy (33%) → Normal (33%) → Difficult (34%)
- 500 held-out samples for evaluation (loading sketch below)
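A minimal sketch of the data setup, assuming the standard `gsm8k` "main" config from the HuggingFace Hub; the 500-sample holdout matches the write-up, while the seed and variable names are illustrative.

```python
from datasets import load_dataset

# GSM8K "main": grade-school word problems with step-by-step solutions
# ending in "#### <final answer>".
gsm8k = load_dataset("gsm8k", "main")

# Hold out 500 samples for evaluation; the rest is the curriculum training pool.
split = gsm8k["train"].train_test_split(test_size=500, seed=42)
train_pool, eval_set = split["train"], split["test"]
```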
Curriculum Strategies:
- Naive Answer-Length Strategy:
  - Sort by answer length (short → long)
  - Hypothesis: longer answers = harder problems
  - Result: Degraded performance (wrong assumption)
- Complexity-Based Strategy (Novel):
  - Multi-factor scoring: complexity = solution_steps × operation_complexity (scoring sketch after this list)
  - Accounts for reasoning chains and mathematical operations
  - Result: Improved performance on difficult problems
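A sketch of the complexity-based scoring and the resulting three-stage split, building on `train_pool` from the loading sketch above. Only the steps × operation-complexity formula comes from the write-up; the step counting, operation weights, and variable names are assumptions for illustration.

```python
import re

# Illustrative operation weights; the exact weighting used in the project may differ.
OP_WEIGHTS = {"+": 1.0, "-": 1.0, "*": 1.5, "/": 2.0}

def complexity_score(answer: str) -> float:
    """Multi-factor difficulty: reasoning steps x weighted operation count."""
    # GSM8K solutions put one reasoning step per line, ending with "#### <answer>".
    steps = [ln for ln in answer.splitlines() if ln.strip() and not ln.startswith("####")]
    solution_steps = max(len(steps), 1)

    # Arithmetic operations appear inside calculator annotations such as <<48/2=24>>.
    annotations = " ".join(re.findall(r"<<(.*?)>>", answer))
    ops = re.findall(r"[+\-*/]", annotations)
    operation_complexity = sum(OP_WEIGHTS.get(op, 1.0) for op in ops) or 1.0

    return solution_steps * operation_complexity

# Rank the pool by difficulty, then cut into Easy / Normal / Difficult stages (33/33/34%).
ranked = sorted(train_pool, key=lambda ex: complexity_score(ex["answer"]))
n = len(ranked)
easy_stage, normal_stage, difficult_stage = (
    ranked[: n // 3],
    ranked[n // 3 : 2 * n // 3],
    ranked[2 * n // 3 :],
)
```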
Training Approach:
- LoRA fine-tuning for parameter efficiency (r=16, alpha=32)
- Progressive training: Train stage 1 → merge → train stage 2 → merge → train stage 3 (loop sketched after this list)
- Model merging between stages preserves learned knowledge
- Weights & Biases for experiment tracking and visualization
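A condensed sketch of the progressive LoRA loop using `peft` and the HuggingFace `Trainer`. The r=16 / alpha=32 values and the train → merge → train pattern come from the write-up; the base checkpoint name, epoch count, and the tokenized stage datasets (`easy_ds`, `normal_ds`, `difficult_ds`) are placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Progressive curriculum: fine-tune a fresh LoRA adapter on each stage, then merge it
# back into the base weights before the next stage so earlier knowledge is preserved.
# easy_ds / normal_ds / difficult_ds stand for the tokenized stage datasets.
for stage_name, stage_ds in [("easy", easy_ds), ("normal", normal_ds), ("difficult", difficult_ds)]:
    lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out/{stage_name}", num_train_epochs=1, report_to="wandb"),
        train_dataset=stage_ds,
        data_collator=collator,
    )
    trainer.train()
    model = model.merge_and_unload()  # fold the adapter into the base weights
```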
Experimental Design
Baselines:
- Random-order training (standard fine-tuning)
- Checkpointed evaluation at 10%, 25%, 50%, 75%, 100% of training
Evaluation Metrics:
- Exact match accuracy on GSM8K test set (answer-extraction sketch below)
- Per-difficulty-level breakdown
- Training efficiency (accuracy vs. training steps)
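A sketch of the exact-match metric listed above, assuming the standard GSM8K convention that gold answers end in "#### <number>"; the extraction regexes and function names are illustrative.

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from '#### <number>' (gold format) or,
    failing that, the last number in a model generation."""
    m = re.search(r"####\s*(-?[\d,.]+)", text)
    value = m.group(1) if m else None
    if value is None:
        numbers = re.findall(r"-?\d[\d,.]*", text)
        value = numbers[-1] if numbers else None
    return value.replace(",", "").rstrip(".") if value else None

def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of generations whose extracted answer equals the gold answer."""
    hits = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        hits += int(p is not None and p == r)
    return hits / max(len(references), 1)
```

Running the same metric separately on the Easy / Normal / Difficult buckets yields the per-difficulty-level breakdown.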
Reproducibility:
- Complete code on GitHub with documentation
- Fixed random seeds (seed helper sketched below)
- Hyperparameter configurations documented
- Interactive Jupyter notebook with visualizations
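A small helper illustrating the seed fixing mentioned above; the seed value is illustrative.

```python
import random

import numpy as np
import torch

def fix_seeds(seed: int = 42) -> None:
    """Pin the Python, NumPy, and PyTorch RNGs so runs are repeatable
    (transformers.set_seed(seed) bundles the same calls)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```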
Key Findings
- Curriculum design matters more than curriculum usage
  - Well-designed curriculum: +1.37% improvement
  - Poorly-designed curriculum: -3.51% degradation
- Multi-factor difficulty estimation is crucial
  - Complexity score (steps × operations) works
  - Simple heuristics (answer length) fail
- Model capacity threshold exists
  - 135M parameters: insufficient capacity for curriculum benefit
  - 2.7B parameters: sweet spot for curriculum learning
  - Suggests larger models may not need curriculum
- Progressive merging is essential
  - Merging model weights between stages preserves knowledge
  - Without merging, catastrophic forgetting occurs
Honest Reporting
What Worked:
- Complexity-based curriculum improved difficult problem accuracy
- Model merging strategy effective for knowledge retention
- Reproducible pipeline with comprehensive logging
What Didn’t Work:
- Answer-length curriculum degraded performance
- SmolLM2 showed no improvement (too small)
- Curriculum benefit smaller than expected (1-2% vs. hoped 5-10%)
Published negative results to help researchers avoid failed approaches.
Resources
- Full Blog Post - Detailed analysis, visualizations, and insights
- Interactive Jupyter Notebook - Executable analysis with Plotly charts
- GitHub Repository - Complete training and evaluation code
- HuggingFace Demo - Compare baseline vs. curriculum models
- Training logs and curriculum difficulty distributions
Impact & Lessons
Scientific Contribution:
- Demonstrates curriculum learning viability for LLM reasoning tasks
- Identifies model size threshold for curriculum effectiveness
- Provides negative results (answer-length heuristic) to guide future research
Practical Takeaway:
- Spend time on curriculum design (difficulty estimation)
- Not all curricula help—some hurt performance
- Consider model capacity when applying curriculum learning
ML Engineering:
- Complete end-to-end pipeline from hypothesis to deployment
- Comprehensive evaluation and honest reporting of failures
- Reproducible research with public code and demos
This project showcases research experimentation skills: hypothesis testing, rigorous evaluation, honest reporting of negative results, and sharing reproducible code with the community.
