VioPTT: Violin Playing Technique-Aware
Transcription from Synthetic Data Augmentation

ICASSP 2026

Ting-Kang Wang*2,3,1     Yueh-Po Peng*4,1     Li Su2     Vincent K.M. Cheung1

1 Sony Computer Science Laboratories, Inc., Tokyo, Japan    2 Institute of Information Science, Academia Sinica, Taipei, Taiwan
3 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan    4 Original Content Center, Gamania Inc., Taipei, Taiwan
* Equal contribution


Abstract

While automatic music transcription is a well-established task in music information retrieval, most models transcribe only pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords the instrument its distinct palette of timbres for maximal emotional impact.

Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight cascade model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.

Violin Transcription · Playing Technique Classification · Synthetic Data Augmentation · Music Information Retrieval

System Overview

VioPTT comprises two modules: a transcription module that predicts note onset, offset, velocity, and frame activation at the frame level; and an articulation module that fuses per-note acoustic embeddings with transcription features to classify playing technique.
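
As a rough sketch of how such per-note fusion might look (the shapes, feature names, and stand-in linear classifier below are illustrative assumptions, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

TECHNIQUES = ["detache", "flageolet", "spiccato", "pizzicato"]

# Hypothetical frame-level outputs for a 100-frame clip:
# acoustic embeddings (100 x 64) and the four transcription feature
# streams (onset / offset / frame / velocity -> 100 x 4).
acoustic = rng.normal(size=(100, 64))
transcription = rng.normal(size=(100, 4))

def note_embedding(start, end):
    """Mean-pool both streams over a note's frame span, then
    concatenate them into one per-note vector (64 + 4 = 68 dims)."""
    a = acoustic[start:end].mean(axis=0)
    t = transcription[start:end].mean(axis=0)
    return np.concatenate([a, t])

# A stand-in linear classifier maps the fused embedding to 4 techniques.
W = rng.normal(size=(68, len(TECHNIQUES)))
emb = note_embedding(10, 35)
label = TECHNIQUES[int(np.argmax(emb @ W))]
```

In the actual model the classifier is learned rather than random; the point is only that technique prediction consumes both acoustic and transcription evidence per note.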

1. 🎻 Audio Input: raw violin recording (16 kHz, mono)
2. 🎹 Transcription Module: CRNN predicts onset, offset, frame & velocity at the frame level
3. 🏷️ Articulation Module: fuses acoustic + transcription embeddings → technique label
4. 📄 Output: MIDI + per-note technique CSV
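
The per-note technique CSV can be sketched with the standard library alone (the column names and note tuples here are hypothetical, not the released file format):

```python
import csv
import io

# Hypothetical per-note predictions: (onset_s, offset_s, midi_pitch, technique).
notes = [
    (0.12, 0.58, 67, "detache"),
    (0.60, 0.85, 79, "flageolet"),
    (0.90, 1.02, 55, "pizzicato"),
]

# Write a header row followed by one row per transcribed note.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["onset_s", "offset_s", "midi_pitch", "technique"])
writer.writerows(notes)
csv_text = buf.getvalue()
```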

Playing Techniques

Flageolet · Détaché · Pizzicato · Spiccato

MOSA-VPT Dataset

We introduce MOSA-VPT, a large-scale synthetic dataset of 76 hours of audio–MIDI pairs balanced across four techniques: détaché, flageolet, spiccato, and pizzicato. Audio is rendered from MOSA MIDI scores using DAWDreamer with a professional virtual violin instrument (Synchron Solo Violin I), with all room and spatial processing disabled. Because technique labels are derived directly from the scores, the dataset requires no manual annotation.

→ Download MOSA-VPT on Zenodo

Audio Examples

🔊 Use headphones or quality speakers for the best listening experience.

MOSA-VPT Synthetic Dataset Samples

Samples from our synthetic MOSA-VPT dataset — violin audio rendered from MOSA MIDI scores using a professional virtual violin instrument (Synchron Solo Violin I) via DAWDreamer. Each track is a 30-second excerpt rendered with a different playing technique. The dataset contains 76 hours of audio across four techniques.

Techniques (each with a 30-second synthetic violin audio clip on the project page):
- Flageolet
- Détaché
- Pizzicato
- Spiccato

Results

Pitch & Timing Transcription (Table 1)

Transcription performance on URMP and Bach10 violin tracks (Precision / Recall / F1 / onset-only F1, denoted F1no). Our model trained from scratch with augmentation matches the state-of-the-art MUSC on URMP while outperforming it on Bach10, despite using ~30% less training data.

| Model | URMP P | R | F1 | F1no | Bach10 P | R | F1 | F1no |
|---|---|---|---|---|---|---|---|---|
| Ours w/o aug | 83.4 | 81.2 | 82.2 | 92.8 | 66.7 | 71.3 | 68.9 | 79.0 |
| Ours w/ aug | 86.1 | 83.6 | 84.5 | 93.1 | 68.1 | 71.8 | 69.9 | 79.5 |
| Ours + Piano FT w/o aug | 84.4 | 79.0 | 81.3 | 91.3 | 69.5 | 73.7 | 71.5 | 80.2 |
| Ours + Piano FT w/ aug | 85.0 | 82.1 | 83.3 | 92.9 | 63.3 | 68.4 | 65.7 | 77.8 |
| MUSC [Tamer et al., ISMIR 2023] (baseline) | 86.5 | 83.1 | 84.6 | 93.0 | 65.0 | 64.8 | 64.8 | 77.0 |
| MERTech [Li et al., ICASSP 2024] (baseline) | 26.6 | 33.7 | 29.8 | 30.3 | 27.6 | 53.4 | 36.4 | 36.9 |

In the original table, bold marks the best and underline the second-best per column. "FT" = fine-tuned from a piano-pretrained checkpoint.
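
The onset-only note F1 (F1no) follows the standard note-matching convention: a predicted note counts as correct if its onset lies within a small tolerance (commonly ±50 ms) of a not-yet-matched reference onset. A minimal greedy sketch of that metric (the standard mir_eval evaluation uses a more careful optimal matching; this is illustrative only):

```python
def onset_f1(ref_onsets, est_onsets, tol=0.05):
    """Greedy onset matching within +/- tol seconds; returns (P, R, F1)."""
    ref = sorted(ref_onsets)
    est = sorted(est_onsets)
    matched = 0
    i = 0
    for e in est:
        # Skip reference onsets that are too early to ever match.
        while i < len(ref) and ref[i] < e - tol:
            i += 1
        if i < len(ref) and abs(ref[i] - e) <= tol:
            matched += 1
            i += 1  # each reference onset may be matched at most once
    p = matched / len(est) if est else 0.0
    r = matched / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 2 of 3 predictions hit 2 of 4 reference onsets.
ref = [0.10, 0.50, 1.00, 1.40]
est = [0.12, 0.53, 1.60]
p, r, f1 = onset_f1(ref, est)
```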

Playing Technique Classification — Ablation Study (Table 2)

Macro accuracy and per-class accuracy on the RWC dataset (mean ± std over 3-fold cross-validation). The no-ablation condition (all four transcription features) achieves the best macro accuracy (77.22%), outperforming MERTech (53.36%) by a large margin.

| Condition | Macro Acc (%) | Flageolet (%) | Détaché (%) | Pizzicato (%) | Spiccato (%) |
|---|---|---|---|---|---|
| Full ablation (audio only) | 70.46 ± 2.57 | 86.44 ± 4.19 | 51.75 ± 9.97 | 57.06 ± 15.33 | 86.56 ± 2.55 |
| w/o frame | 66.21 ± 13.24 | 71.79 ± 16.53 | 70.16 ± 32.58 | 63.80 ± 38.66 | 59.10 ± 19.71 |
| w/o offset | 59.71 ± 10.19 | 72.80 ± 27.65 | 55.41 ± 24.71 | 52.75 ± 45.82 | 57.85 ± 24.79 |
| w/o onset | 65.82 ± 8.63 | 91.55 ± 1.96 | 51.94 ± 19.77 | 65.47 ± 20.35 | 54.34 ± 11.90 |
| w/o velocity | 55.59 ± 3.55 | 77.07 ± 22.73 | 65.67 ± 26.70 | 0.16 ± 0.28 | 79.45 ± 2.81 |
| No ablation (all features) | 77.22 ± 6.35 | 71.89 ± 14.12 | 63.12 ± 12.59 | 88.80 ± 3.11 | 85.08 ± 4.87 |
| MERTech [Li et al., ICASSP 2024] (baseline) | 53.36 ± 1.02 | 95.77 ± 2.23 | 58.80 ± 1.63 | 43.27 ± 1.19 | 15.61 ± 2.06 |

In the original table, bold marks the best and underline the second-best per column. Results are reported as mean ± std over three data splits.
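
Macro accuracy is the unweighted mean of the four per-class accuracies, so rare techniques count as much as common ones. A small self-contained sketch on a hypothetical confusion matrix (the counts below are made up for illustration):

```python
def macro_accuracy(confusion):
    """confusion[i][j] = count of class-i notes predicted as class j.
    Per-class accuracy is the diagonal entry over the row sum; macro
    accuracy averages these values with equal weight per class."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class), per_class

# Hypothetical counts, classes ordered (flageolet, detache, pizzicato, spiccato).
conf = [
    [90, 5, 0, 5],
    [10, 70, 0, 20],
    [0, 0, 95, 5],
    [5, 15, 0, 80],
]
macro, per_class = macro_accuracy(conf)
```

Note that a model ignoring a rare class is penalized here even if its overall (micro) accuracy stays high, which is why the tables report the macro score.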

Citation

If you find this work useful, please cite:

@article{wang2025vioptt,
  title   = {VioPTT: Violin Playing Technique-Aware Transcription from Synthetic Data Augmentation},
  author  = {Wang, Ting-Kang and Peng, Yueh-Po and Su, Li and Cheung, Vincent K.M.},
  journal = {arXiv preprint arXiv:2509.23759},
  year    = {2025}
}