Abstract
While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords the instrument its distinct palette of timbres for maximal emotional impact.
Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight cascade model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.
System Overview
VioPTT comprises two modules: a transcription module that predicts note onset, offset, velocity, and frame activation at the frame level; and an articulation module that fuses per-note acoustic embeddings with transcription features to classify playing technique.
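To make the cascade concrete, here is a minimal NumPy sketch of the inference path: frame-level posteriors are segmented into notes, and each note's acoustic embedding is mean-pooled and passed to a linear technique classifier. All names, thresholds, and feature shapes are illustrative assumptions, not the released implementation.

```python
import numpy as np

TECHNIQUES = ["détaché", "flageolet", "spiccato", "pizzicato"]

def detect_notes(onset_post, frame_post, fps=100, thresh=0.5):
    """Greedy note segmentation: an onset peak opens a note, which stays
    active while the frame posterior remains above threshold."""
    notes, t, T = [], 0, len(onset_post)
    while t < T:
        if onset_post[t] > thresh:
            end = t + 1
            while end < T and frame_post[end] > thresh and onset_post[end] <= thresh:
                end += 1
            notes.append((t / fps, end / fps))  # (onset_s, offset_s)
            t = end
        else:
            t += 1
    return notes

def classify_note(embeddings, note, W, fps=100):
    """Mean-pool per-frame acoustic embeddings over the note span,
    then apply a linear playing-technique classifier W."""
    s, e = int(note[0] * fps), int(note[1] * fps)
    pooled = embeddings[s:max(e, s + 1)].mean(axis=0)
    return TECHNIQUES[int(np.argmax(pooled @ W))]
```

In the actual model the articulation module also fuses the transcription features (onset, offset, velocity, frame) with the acoustic embedding; the pooling-plus-classifier structure above is only the skeleton of that idea.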
MOSA-VPT Dataset
We introduce MOSA-VPT, a large-scale synthetic dataset of 76 hours of audio–MIDI pairs balanced across four techniques: détaché, flageolet, spiccato, and pizzicato. Audio is rendered from MOSA MIDI scores using DAWDreamer with a professional virtual violin instrument (Synchron Solo Violin I), with all room and spatial processing disabled. The dataset is fully annotation-free.
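The MIDI-to-audio rendering step can be sketched with DAWDreamer's Python API. The paths and node name below are placeholders, and the Synchron Solo Violin I plugin is a commercial product that must be installed separately; this is an illustrative sketch, not the dataset-generation script.

```python
def render_technique_excerpt(plugin_path, midi_path,
                             duration_s=30.0, sample_rate=44100, block_size=512):
    """Render a MIDI excerpt to audio through a VST instrument via DAWDreamer.

    `plugin_path` / `midi_path` are placeholders. The import is deferred so
    this sketch loads even without `pip install dawdreamer`.
    """
    import dawdreamer as daw

    engine = daw.RenderEngine(sample_rate, block_size)
    synth = engine.make_plugin_processor("violin", plugin_path)
    synth.load_midi(midi_path)            # schedule the score's note events
    engine.load_graph([(synth, [])])      # single node: no room/spatial FX
    engine.render(duration_s)
    return engine.get_audio()             # numpy array, shape (channels, samples)
```

Loading the plugin as the only node in the render graph mirrors the "all room and spatial processing disabled" condition: no reverb or spatializer stages are inserted after the instrument.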
Audio Examples
MOSA-VPT Synthetic Dataset Samples
Samples from our synthetic MOSA-VPT dataset — violin audio rendered from MOSA MIDI scores using a professional virtual violin instrument (Synchron Solo Violin I) via DAWDreamer. Each track is a 30-second excerpt rendered with a different playing technique. The dataset contains 76 hours of audio across four techniques.
| Technique | Synthetic Violin Audio (30s) |
|---|---|
| Flageolet | |
| Détaché | |
| Pizzicato | |
| Spiccato | |
Results
Pitch & Timing Transcription (Table 1)
Transcription performance on URMP and Bach10 violin tracks (Precision / Recall / F1 / onset-only F1). Our model trained from scratch with augmentation matches state-of-the-art MUSC on URMP while outperforming it on Bach10 despite using ~30% less training data.
| Model | URMP P | URMP R | URMP F1 | URMP F1 (onset) | Bach10 P | Bach10 R | Bach10 F1 | Bach10 F1 (onset) |
|---|---|---|---|---|---|---|---|---|
| Ours w/o aug | 83.4 | 81.2 | 82.2 | 92.8 | 66.7 | 71.3 | 68.9 | 79.0 |
| Ours w/ aug | 86.1 | 83.6 | 84.5 | 93.1 | 68.1 | 71.8 | 69.9 | 79.5 |
| Ours + Piano FT w/o aug | 84.4 | 79.0 | 81.3 | 91.3 | 69.5 | 73.7 | 71.5 | 80.2 |
| Ours + Piano FT w/ aug | 85.0 | 82.1 | 83.3 | 92.9 | 63.3 | 68.4 | 65.7 | 77.8 |
| *Baselines* | | | | | | | | |
| MUSC [Tamer et al., ISMIR 2023] | 86.5 | 83.1 | 84.6 | 93.0 | 65.0 | 64.8 | 64.8 | 77.0 |
| MERTech [Li et al., ICASSP 2024] | 26.6 | 33.7 | 29.8 | 30.3 | 27.6 | 53.4 | 36.4 | 36.9 |
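For reference, note-level transcription metrics conventionally count a predicted note as correct when its pitch matches and its onset falls within 50 ms of a reference note; the with-offset variant additionally requires the offset error to stay within 20% of the reference duration (minimum 50 ms). The sketch below is a simplified greedy matcher in the spirit of standard MIR evaluation tools; the paper's exact evaluation code may differ.

```python
def note_f1(ref, est, onset_tol=0.05, offset_ratio=0.2, with_offset=True):
    """Note-level P/R/F1. `ref`/`est` are lists of (onset_s, offset_s, midi_pitch).

    Setting with_offset=False gives the onset-only score (the "F1 (onset)"
    columns); with_offset=True also checks the offset tolerance.
    """
    matched, tp = set(), 0
    for on, off, pitch in est:
        for j, (ron, roff, rpitch) in enumerate(ref):
            if j in matched or rpitch != pitch or abs(on - ron) > onset_tol:
                continue
            if with_offset and abs(off - roff) > max(0.05, offset_ratio * (roff - ron)):
                continue
            matched.add(j)   # each reference note may be matched once
            tp += 1
            break
    precision = tp / len(est) if est else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because the offset check is the stricter of the two, the onset-only F1 is always at least as high as the full F1, which matches the pattern in the table.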
Playing Technique Classification — Ablation Study (Table 2)
Macro accuracy and per-class accuracy on the RWC dataset (mean ± std over 3-fold cross-validation). The no-ablation condition (all four transcription features) achieves the best macro accuracy of 77.22%, outperforming MERTech (53.36%) by a large margin.
| Condition | Macro Acc (%) | Flageolet (%) | Détaché (%) | Pizzicato (%) | Spiccato (%) |
|---|---|---|---|---|---|
| Full ablation (audio only) | 70.46 ±2.57 | 86.44 ±4.19 | 51.75 ±9.97 | 57.06 ±15.33 | 86.56 ±2.55 |
| w/o frame | 66.21 ±13.24 | 71.79 ±16.53 | 70.16 ±32.58 | 63.80 ±38.66 | 59.10 ±19.71 |
| w/o offset | 59.71 ±10.19 | 72.80 ±27.65 | 55.41 ±24.71 | 52.75 ±45.82 | 57.85 ±24.79 |
| w/o onset | 65.82 ±8.63 | 91.55 ±1.96 | 51.94 ±19.77 | 65.47 ±20.35 | 54.34 ±11.90 |
| w/o velocity | 55.59 ±3.55 | 77.07 ±22.73 | 65.67 ±26.70 | 0.16 ±0.28 | 79.45 ±2.81 |
| No ablation (all features) | 77.22 ±6.35 | 71.89 ±14.12 | 63.12 ±12.59 | 88.80 ±3.11 | 85.08 ±4.87 |
| *Baseline* | | | | | |
| MERTech [Li et al., ICASSP 2024] | 53.36 ±1.02 | 95.77 ±2.23 | 58.80 ±1.63 | 43.27 ±1.19 | 15.61 ±2.06 |
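Macro accuracy, as used above, is the unweighted mean of the four per-class accuracies, so rare techniques weigh as much as common ones. A minimal sketch (the counts in the test are made up, not taken from the paper):

```python
def macro_accuracy(correct, total):
    """Unweighted mean of per-class accuracies: each class counts equally."""
    return sum(c / t for c, t in zip(correct, total)) / len(total)

def pooled_accuracy(correct, total):
    """Plain pooled accuracy, dominated by whichever class has the most notes."""
    return sum(correct) / sum(total)
```

Under class imbalance the two diverge: a classifier that ignores a rare technique can still post a high pooled accuracy, but its macro accuracy drops sharply (compare the near-zero pizzicato score in the `w/o velocity` row).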
Citation
If you find this work useful, please cite:
```bibtex
@article{wang2025vioptt,
  title   = {VioPTT: Violin Playing Technique-Aware Transcription from Synthetic Data Augmentation},
  author  = {Wang, Ting-Kang and Peng, Yueh-Po and Su, Li and Cheung, Vincent K.M.},
  journal = {arXiv preprint arXiv:2509.23759},
  year    = {2025}
}
```