VioPTT – Violin Playing Technique-Aware Transcription Poster

1 Motivation

Automatic music transcription (AMT) has made substantial progress, yet most systems capture only pitch and timing and overlook expressive, instrument-specific nuances.

The Gap: Violin playing technique, integral to the instrument's timbral palette, remains largely unexplored in transcription because annotation is costly and requires expert knowledge.

We propose VioPTT, the first model to jointly transcribe violin pitch, onset, offset, and playing technique within a unified framework.

2 Key Contributions

Lightweight cascade architecture combining a transcription module with an articulation module for joint note + technique prediction
MOSA-VPT: a novel 76-hour synthetic violin technique dataset eliminating the need for manual annotation
State-of-the-art transcription without pretrained representations from other instruments
Strong generalization from synthetic training data to real-world violin technique recognition

3 Model Architecture

Fig. 1: Overview of our technique-aware violin transcription model.

Training Objective

ℒ = ℒ_onset(BCE) + ℒ_offset(BCE) + ℒ_frame(BCE) + ℒ_velocity(MSE) + ℒ_technique(CCE)
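
As written, the objective sums five terms with equal weight. A minimal sketch in plain Python follows. It assumes probability outputs for onset, offset, and frame, a regressed velocity, and a softmax over the four techniques. The function names, dictionary keys, and toy values are illustrative and not taken from the authors' code.

```python
import math

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over predicted probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def mse(preds, targets):
    """Mean squared error, used here for the velocity term."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cce(class_probs, true_index, eps=1e-7):
    """Categorical cross-entropy for a single technique label."""
    return -math.log(max(class_probs[true_index], eps))

def total_loss(out, tgt):
    """Unweighted sum of the five terms, mirroring the formula above."""
    return (
        bce(out["onset"], tgt["onset"])
        + bce(out["offset"], tgt["offset"])
        + bce(out["frame"], tgt["frame"])
        + mse(out["velocity"], tgt["velocity"])
        + cce(out["technique"], tgt["technique"])
    )

# Toy example: one frame, four technique classes (hypothetical values).
out = {"onset": [0.5], "offset": [0.5], "frame": [0.5],
       "velocity": [0.5], "technique": [0.25, 0.25, 0.25, 0.25]}
tgt = {"onset": [1], "offset": [1], "frame": [1],
       "velocity": [1.0], "technique": 2}
loss = total_loss(out, tgt)
```

Whether the actual training code weights the terms differently is not stated on the poster; the sketch follows the formula literally.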

4 Data & Augmentation

Datasets

MOSA (Training)
19 hrs solo violin · 15 expert players · Note-level annotations with pitch, rhythm, dynamics & articulation

MOSA-VPT (Ours — Synthetic)
76 hrs audio-MIDI pairs · 4 techniques balanced · Rendered via DAWDreamer + Synchron Solo Violin I

URMP / Bach10 (Test — Transcription)
44 chamber pieces + 10 four-part chorales with annotated pitch and timing

RWC (Test — Technique)
Real-world chromatic scales with multiple dynamics and playing styles

Technique Synthesis Pipeline

Techniques: Détaché · Flageolet · Spiccato · Pizzicato

MIDI scores → DAWDreamer (VST host) → Synchron Solo Violin I → key-switch + CC control → mono 16 kHz stems. Fully annotation-free and generalizable to any VST instrument.
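
The key-switch step in the pipeline above can be sketched on a plain note list. This is an illustration under stated assumptions, not the authors' implementation. The key-switch pitches in `KEYSWITCH`, the tuple layout, and the `lead` time are hypothetical; the real mapping depends on the sample library preset. Rendering through a VST host is omitted.

```python
# Hypothetical key-switch pitches (MIDI note numbers); the real mapping is
# defined by the sample library preset and is not given on the poster.
KEYSWITCH = {"detache": 24, "flageolet": 25, "spiccato": 26, "pizzicato": 27}

def add_keyswitch(notes, technique, lead=0.05):
    """Prefix a note list with a key-switch note and derive technique labels.

    notes: list of (pitch, start_sec, end_sec, velocity) tuples.
    Returns (events, labels): events go to the synthesizer, labels are the
    note-level annotations obtained for free from the score.
    """
    # Shift the score so the key switch sounds before the first note.
    shifted = [(p, s + lead, e + lead, v) for (p, s, e, v) in notes]
    ks_note = (KEYSWITCH[technique], 0.0, lead / 2, 100)
    labels = [(p, s, e, technique) for (p, s, e, _) in shifted]
    return [ks_note] + shifted, labels

events, labels = add_keyswitch([(69, 0.0, 0.5, 80)], "pizzicato")
```

Because the labels come directly from the score and the chosen technique, no manual annotation enters the loop.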

Pitch & Timing Augmentation

Pitch shift (±0.1 st) · +5 dB gain · 2× random band-pass (32–4096 Hz) · Reverb (room 0.35)
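
Two of these parameters convert to linear factors, which helps when judging how strong the augmentation is. The sketch below covers only those conversions. Uniform sampling of the pitch shift is an assumption (the poster gives only the ±0.1 st range), and the function names are illustrative.

```python
import random

def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)

def semitones_to_ratio(semitones):
    """Convert a pitch shift in semitones to a frequency ratio."""
    return 2.0 ** (semitones / 12.0)

def sample_pitch_ratio(max_shift_st=0.1, rng=random):
    # Assumption: the shift is drawn uniformly from [-max, +max] semitones.
    return semitones_to_ratio(rng.uniform(-max_shift_st, max_shift_st))

def apply_gain(samples, gain_db=5.0):
    """Scale a list of audio samples by the given gain in dB."""
    factor = db_to_amplitude(gain_db)
    return [factor * x for x in samples]

louder = apply_gain([0.1, -0.2], gain_db=5.0)
ratio = sample_pitch_ratio()
```

A +5 dB gain is roughly a 1.78× amplitude increase, and a 0.1 st shift changes frequency by about 0.58%, well below the semitone spacing of the pitch labels.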

5 Transcription Results

Model	URMP P	URMP R	URMP F1	URMP F1_no	Bach10 P	Bach10 R	Bach10 F1	Bach10 F1_no
Ours w/o aug	83.4	81.2	82.2	92.8	66.7	71.3	68.9	79.0
Ours w/ aug	86.1	83.6	84.5	93.1	68.1	71.8	69.9	79.5
Ours + FT w/o aug	84.4	79.0	81.3	91.3	69.5	73.7	71.5	80.2
Ours + FT w/ aug	85.0	82.1	83.3	92.9	63.3	68.4	65.7	77.8
MUSC [Tamer '24]	86.5	83.1	84.6	93.0	65.0	64.8	64.8	77.0
MERTech [Chen '24]	26.6	33.7	29.8	30.3	27.6	53.4	36.4	36.9

"FT" = fine-tuned from piano-pretrained checkpoint · "aug" = augmentation · Bold = best · Underlined = second-best

Key insight: Training from scratch with augmentation reaches state-of-the-art performance with ~30% less data than MUSC, which suggests that transfer learning yields limited gains when sufficient domain-specific data are available.

6 Technique Classification Results

Ablation	Macro Acc (%)	Flageolet (%)	Détaché (%)	Pizzicato (%)	Spiccato (%)
Full ablation	70.46 ±2.57	86.44 ±4.19	51.75 ±9.97	57.06 ±15.33	86.56 ±2.55
Frame excl.	66.21 ±13.24	71.79 ±16.53	70.16 ±32.58	63.80 ±38.66	59.10 ±19.71
Offset excl.	59.71 ±10.19	72.80 ±27.65	55.41 ±24.71	52.75 ±45.82	57.85 ±24.79
Onset excl.	65.82 ±8.63	91.55 ±1.96	51.94 ±19.77	65.47 ±20.35	54.34 ±11.90
Velocity excl.	55.59 ±3.55	77.07 ±22.73	65.67 ±26.70	0.16 ±0.28	79.45 ±2.81
No ablation	77.22 ±6.35	71.89 ±14.12	63.12 ±12.59	88.80 ±3.11	85.08 ±4.87
MERTech [Chen '24]	53.36 ±1.02	95.77 ±2.23	58.80 ±1.63	43.27 ±1.19	15.61 ±2.06

Mean ± std from 3 random data splits on RWC dataset. Bold = best, Underlined = second-best per column.
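
The Macro Acc column is consistent with an unweighted mean of the four per-class accuracies. The short check below uses the mean values from two rows of the table. The averaging rule is inferred from the numbers rather than stated on the poster, and the function name is illustrative.

```python
def macro_accuracy(per_class_acc):
    """Unweighted mean of per-class accuracies (each class counts equally)."""
    return sum(per_class_acc.values()) / len(per_class_acc)

# Per-class means copied from the table (percent).
no_ablation = {"flageolet": 71.89, "detache": 63.12,
               "pizzicato": 88.80, "spiccato": 85.08}
mertech = {"flageolet": 95.77, "detache": 58.80,
           "pizzicato": 43.27, "spiccato": 15.61}

ours_macro = macro_accuracy(no_ablation)
mertech_macro = macro_accuracy(mertech)
```

Both results agree with the reported 77.22 and 53.36 to within rounding. Macro averaging explains why MERTech's strong flageolet score does not offset its weak spiccato score.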

7 Confusion Matrix

Predictions aggregated across all folds (rows = true class, columns = predicted class)

True \ Predicted	Flag.	Dét.	Pizz.	Spic.
Flag.	69%	4%	18%	9%
Dét.	11%	63%	2%	24%
Pizz.	1%	4%	89%	6%
Spic.	3%	10%	2%	85%

          Pizzicato shows minimal misclassification (distinct acoustic character). Détaché is most confused with spiccato (both use short bow strokes).
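
The same reading can be obtained programmatically. In the sketch below, the diagonal gives per-class recall and the largest off-diagonal entry in each row gives the most frequent confusion. The values are transcribed from the confusion matrix in this section, with rows as true classes and columns as predicted classes. The helper name is illustrative.

```python
LABELS = ["Flag.", "Dét.", "Pizz.", "Spic."]

# Rows = true class, columns = predicted class, values in percent.
CONFUSION = [
    [69, 4, 18, 9],
    [11, 63, 2, 24],
    [1, 4, 89, 6],
    [3, 10, 2, 85],
]

def top_confusion(matrix, labels):
    """Map each true class to (recall, most frequent wrong label, its share)."""
    result = {}
    for i, row in enumerate(matrix):
        # Collect off-diagonal entries and keep the largest one.
        wrong = [(value, labels[j]) for j, value in enumerate(row) if j != i]
        share, label = max(wrong)
        result[labels[i]] = (row[i], label, share)
    return result

summary = top_confusion(CONFUSION, LABELS)
```

The summary also shows a confusion the note above does not mention: flageolet is most often mistaken for pizzicato (18%).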
        

8 UMAP Embedding Visualization

Fig. 3: UMAP visualization of RWC data on learned note-level embeddings. Despite training on synthetic data only, learned representations generalize to unseen real-world recordings with clear class separation.

9 Conclusions & Future Work

93.1 F1_no on URMP — matching SOTA with ~30% less data
77.2% macro accuracy on real-world technique classification
76 hrs synthetic technique data — no expert annotation needed
1st joint violin transcription + technique prediction model

Future work: Extend synthesis framework to a broader range of techniques and other bowed string instruments (viola, cello, double bass).