VioPTT: Violin Playing Technique-Aware
Transcription from Synthetic Data Augmentation

ICASSP 2026

Ting-Kang Wang*2,3,1     Yueh-Po Peng*4,1     Li Su2     Vincent K.M. Cheung1

1 Sony Computer Science Laboratories, Inc., Tokyo, Japan    2 Institute of Information Science, Academia Sinica, Taipei, Taiwan
3 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan    4 Original Content Center, Gamania Inc., Taipei, Taiwan
* Equal contribution


Abstract

While automatic music transcription is a well-established task in music information retrieval, most models transcribe only pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords the instrument its distinct palette of timbres for maximal emotional impact.

Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight cascade model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.

Violin Transcription · Playing Technique Classification · Synthetic Data Augmentation · Music Information Retrieval

System Overview

VioPTT comprises two modules: a transcription module that predicts note onset, offset, velocity, and frame activation at the frame level; and an articulation module that fuses per-note acoustic embeddings with transcription features to classify playing technique.
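
As a rough sketch of how such per-note fusion might look (the shapes, feature names, and stand-in linear classifier below are illustrative assumptions, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

TECHNIQUES = ["detache", "flageolet", "spiccato", "pizzicato"]

# Hypothetical frame-level outputs for a 100-frame clip:
# acoustic embeddings (100 x 64) and the four transcription feature
# streams (onset / offset / frame / velocity -> 100 x 4).
acoustic = rng.normal(size=(100, 64))
transcription = rng.normal(size=(100, 4))

def note_embedding(start, end):
    """Mean-pool both streams over a note's frame span, then
    concatenate them into one per-note vector (64 + 4 = 68 dims)."""
    a = acoustic[start:end].mean(axis=0)
    t = transcription[start:end].mean(axis=0)
    return np.concatenate([a, t])

# A stand-in linear classifier maps the fused embedding to 4 techniques.
W = rng.normal(size=(68, len(TECHNIQUES)))
emb = note_embedding(10, 35)
label = TECHNIQUES[int(np.argmax(emb @ W))]
```

In the actual model the classifier is learned rather than random; the point is only that technique prediction consumes both acoustic and transcription evidence per note.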

1. 🎻 Audio Input: raw violin recording (16 kHz, mono)
2. 🎹 Transcription Module: CRNN predicts onset, offset, frame & velocity at the frame level
3. 🏷️ Articulation Module: fuses acoustic + transcription embeddings → technique label
4. 📄 Output: MIDI + per-note technique CSV
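
The per-note technique CSV can be sketched with the standard library alone (the column names and note tuples here are hypothetical, not the released file format):

```python
import csv
import io

# Hypothetical per-note predictions: (onset_s, offset_s, midi_pitch, technique).
notes = [
    (0.12, 0.58, 67, "detache"),
    (0.60, 0.85, 79, "flageolet"),
    (0.90, 1.02, 55, "pizzicato"),
]

# Write a header row followed by one row per transcribed note.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["onset_s", "offset_s", "midi_pitch", "technique"])
writer.writerows(notes)
csv_text = buf.getvalue()
```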

Playing Techniques

Flageolet · Détaché · Pizzicato · Spiccato

MOSA-VPT Dataset

We introduce MOSA-VPT, a large-scale synthetic dataset of 76 hours of audio–MIDI pairs balanced across four techniques: détaché, flageolet, spiccato, and pizzicato. Audio is rendered from MOSA MIDI scores using DAWDreamer with a professional virtual violin instrument (Synchron Solo Violin I), with all room and spatial processing disabled. Because technique labels are derived directly from the scores, the dataset requires no manual annotation.

→ Download MOSA-VPT on Zenodo

Audio Examples

🔊 Use headphones or quality speakers for the best listening experience.

MOSA-VPT Synthetic Dataset Samples

Samples from our synthetic MOSA-VPT dataset — violin audio rendered from MOSA MIDI scores using a professional virtual violin instrument (Synchron Solo Violin I) via DAWDreamer. Each track is a 30-second excerpt rendered with a different playing technique. The dataset contains 76 hours of audio across four techniques.

Techniques (each with a 30-second synthetic violin audio clip on the project page):
- Flageolet
- Détaché
- Pizzicato
- Spiccato

Results

Pitch & Timing Transcription (Table 1)

Transcription performance on URMP and Bach10 violin tracks (Precision / Recall / F1 / onset-only F1, denoted F1no). Our model trained from scratch with augmentation matches the state-of-the-art MUSC on URMP while outperforming it on Bach10, despite using ~30% less training data.

| Model | URMP P | R | F1 | F1no | Bach10 P | R | F1 | F1no |
|---|---|---|---|---|---|---|---|---|
| Ours w/o aug | 83.4 | 81.2 | 82.2 | 92.8 | 66.7 | 71.3 | 68.9 | 79.0 |
| Ours w/ aug | 86.1 | 83.6 | 84.5 | 93.1 | 68.1 | 71.8 | 69.9 | 79.5 |
| Ours + Piano FT w/o aug | 84.4 | 79.0 | 81.3 | 91.3 | 69.5 | 73.7 | 71.5 | 80.2 |
| Ours + Piano FT w/ aug | 85.0 | 82.1 | 83.3 | 92.9 | 63.3 | 68.4 | 65.7 | 77.8 |
| MUSC [Tamer et al., ISMIR 2023] (baseline) | 86.5 | 83.1 | 84.6 | 93.0 | 65.0 | 64.8 | 64.8 | 77.0 |
| MERTech [Li et al., ICASSP 2024] (baseline) | 26.6 | 33.7 | 29.8 | 30.3 | 27.6 | 53.4 | 36.4 | 36.9 |

In the original table, bold marks the best and underline the second-best per column. "FT" = fine-tuned from a piano-pretrained checkpoint.
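
The onset-only note F1 (F1no) follows the standard note-matching convention: a predicted note counts as correct if its onset lies within a small tolerance (commonly ±50 ms) of a not-yet-matched reference onset. A minimal greedy sketch of that metric (the standard mir_eval evaluation uses a more careful optimal matching; this is illustrative only):

```python
def onset_f1(ref_onsets, est_onsets, tol=0.05):
    """Greedy onset matching within +/- tol seconds; returns (P, R, F1)."""
    ref = sorted(ref_onsets)
    est = sorted(est_onsets)
    matched = 0
    i = 0
    for e in est:
        # Skip reference onsets that are too early to ever match.
        while i < len(ref) and ref[i] < e - tol:
            i += 1
        if i < len(ref) and abs(ref[i] - e) <= tol:
            matched += 1
            i += 1  # each reference onset may be matched at most once
    p = matched / len(est) if est else 0.0
    r = matched / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 2 of 3 predictions hit 2 of 4 reference onsets.
ref = [0.10, 0.50, 1.00, 1.40]
est = [0.12, 0.53, 1.60]
p, r, f1 = onset_f1(ref, est)
```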

Playing Technique Classification — Ablation Study (Table 2)

Macro accuracy and per-class accuracy on the RWC dataset (mean ± std over 3-fold cross-validation). The no-ablation condition (all four transcription features) achieves the best macro accuracy (77.22%), outperforming MERTech (53.36%) by a large margin.

| Condition | Macro Acc (%) | Flageolet (%) | Détaché (%) | Pizzicato (%) | Spiccato (%) |
|---|---|---|---|---|---|
| Full ablation (audio only) | 70.46 ± 2.57 | 86.44 ± 4.19 | 51.75 ± 9.97 | 57.06 ± 15.33 | 86.56 ± 2.55 |
| w/o frame | 66.21 ± 13.24 | 71.79 ± 16.53 | 70.16 ± 32.58 | 63.80 ± 38.66 | 59.10 ± 19.71 |
| w/o offset | 59.71 ± 10.19 | 72.80 ± 27.65 | 55.41 ± 24.71 | 52.75 ± 45.82 | 57.85 ± 24.79 |
| w/o onset | 65.82 ± 8.63 | 91.55 ± 1.96 | 51.94 ± 19.77 | 65.47 ± 20.35 | 54.34 ± 11.90 |
| w/o velocity | 55.59 ± 3.55 | 77.07 ± 22.73 | 65.67 ± 26.70 | 0.16 ± 0.28 | 79.45 ± 2.81 |
| No ablation (all features) | 77.22 ± 6.35 | 71.89 ± 14.12 | 63.12 ± 12.59 | 88.80 ± 3.11 | 85.08 ± 4.87 |
| MERTech [Li et al., ICASSP 2024] (baseline) | 53.36 ± 1.02 | 95.77 ± 2.23 | 58.80 ± 1.63 | 43.27 ± 1.19 | 15.61 ± 2.06 |

In the original table, bold marks the best and underline the second-best per column. Results are reported as mean ± std over three data splits.
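
Macro accuracy is the unweighted mean of the four per-class accuracies, so rare techniques count as much as common ones. A small self-contained sketch on a hypothetical confusion matrix (the counts below are made up for illustration):

```python
def macro_accuracy(confusion):
    """confusion[i][j] = count of class-i notes predicted as class j.
    Per-class accuracy is the diagonal entry over the row sum; macro
    accuracy averages these values with equal weight per class."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class), per_class

# Hypothetical counts, classes ordered (flageolet, detache, pizzicato, spiccato).
conf = [
    [90, 5, 0, 5],
    [10, 70, 0, 20],
    [0, 0, 95, 5],
    [5, 15, 0, 80],
]
macro, per_class = macro_accuracy(conf)
```

Note that a model ignoring a rare class is penalized here even if its overall (micro) accuracy stays high, which is why the tables report the macro score.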

Citation

If you find this work useful, please cite:

@article{wang2025vioptt,
  title   = {VioPTT: Violin Playing Technique-Aware Transcription from Synthetic Data Augmentation},
  author  = {Wang, Ting-Kang and Peng, Yueh-Po and Su, Li and Cheung, Vincent K.M.},
  journal = {arXiv preprint arXiv:2509.23759},
  year    = {2025}
}