Similar Items:
- Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
- Pretrained Model Representations as Acquisition Signals for Active Learning of MLIPs
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
- Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction
- Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning