VioPTT — ICASSP 2026 Poster

VioPTT: Violin Playing Technique-Aware Transcription from Synthetic Data Augmentation

Ting-Kang Wang★ (2,3,1)  •  Yueh-Po Peng★ (4,1)  •  Li Su (2)  •  Vincent K.M. Cheung (1)

1 Sony Computer Science Laboratories, Inc., Tokyo, Japan   2 Institute of Information Science, Academia Sinica, Taipei, Taiwan
3 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan   4 Original Content Center, Gamania Inc., Taipei, Taiwan
★ Equal contribution; work done during internship at Sony CSL.


1 Motivation

Automatic music transcription (AMT) has achieved substantial progress, yet most systems only capture pitch and timing, overlooking expressive, instrument-specific nuances.

The Gap: Violin playing technique—integral to its timbral palette—remains largely unexplored in transcription due to the cost and expertise required for annotation.

We propose VioPTT, the first model to jointly transcribe violin pitch, onset, offset, and playing technique within a unified framework.

2 Key Contributions

  • Lightweight cascade architecture combining a transcription module with an articulation module for joint note + technique prediction
  • MOSA-VPT: a novel 76-hour synthetic violin technique dataset eliminating the need for manual annotation
  • State-of-the-art transcription without pretrained representations from other instruments
  • Strong generalization from synthetic training data to real-world violin technique recognition

3 Model Architecture

[Figure: VioPTT model architecture]

Fig. 1: Overview of our technique-aware violin transcription model.

Training Objective

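$$\mathcal{L} = \mathcal{L}_{\text{onset}} + \mathcal{L}_{\text{offset}} + \mathcal{L}_{\text{frame}} + \mathcal{L}_{\text{velocity}} + \mathcal{L}_{\text{technique}}$$

The onset, offset, and frame terms use binary cross-entropy (BCE), the velocity term uses mean squared error (MSE), and the technique term uses categorical cross-entropy (CCE).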
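As a rough illustration of how the five terms combine, the sketch below evaluates the unweighted sum in plain Python. The function names, the mean reduction inside each term, and the toy inputs are assumptions made for this example; the poster only specifies which loss type each term uses.

```python
import math

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over a flat list of probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)

def mse(preds, targets):
    """Mean squared error over a flat list of values."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cce(class_probs, true_index, eps=1e-7):
    """Categorical cross-entropy for one example with a known class index."""
    return -math.log(max(class_probs[true_index], eps))

def total_loss(onset, offset, frame, velocity, technique):
    """Unweighted sum of the five terms; each argument is a (prediction, target) pair."""
    return (
        bce(*onset)
        + bce(*offset)
        + bce(*frame)
        + mse(*velocity)
        + cce(*technique)
    )

# Toy example: one frame/pitch bin per head, four technique classes.
loss = total_loss(
    onset=([0.5], [1.0]),
    offset=([0.5], [1.0]),
    frame=([0.5], [1.0]),
    velocity=([0.5], [1.0]),
    technique=([0.25, 0.25, 0.25, 0.25], 2),
)
print(f"total loss: {loss:.4f}")
```

In a real training loop each term would be computed over full piano-roll tensors; the scalar version here is only meant to make the structure of the objective concrete.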

4 Data & Augmentation

Datasets

  • MOSA (Training): 19 hrs solo violin · 15 expert players · Note-level annotations with pitch, rhythm, dynamics & articulation
  • MOSA-VPT (Ours, Synthetic): 76 hrs audio-MIDI pairs · 4 techniques balanced · Rendered via DAWDreamer + Synchron Solo Violin I
  • URMP / Bach10 (Test, Transcription): 44 chamber pieces + 10 four-part chorales with annotated pitch and timing
  • RWC (Test, Technique): Real-world chromatic scales with multiple dynamics and playing styles

Technique Synthesis Pipeline

Techniques: Détaché · Flageolet · Spiccato · Pizzicato

MIDI scores → DAWDreamer (VST host) → Synchron Solo Violin I → key-switch + CC control → mono 16 kHz stems. Fully annotation-free and generalizable to any VST instrument.
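The key-switch step can be sketched without a VST host: before each change of technique, a short trigger note is inserted so the sampler switches articulation, and the technique label of every sounding note is known by construction. The key-switch pitches, the lead time, and the event layout below are hypothetical placeholders for illustration, not the actual Synchron Solo Violin I mapping or the authors' code.

```python
# Hypothetical key-switch pitches (MIDI note numbers); the real mapping
# depends on the instrument preset and is not given in the poster.
KEYSWITCH = {"detache": 24, "flageolet": 25, "spiccato": 26, "pizzicato": 27}

def add_keyswitches(notes, lead=0.05, ks_duration=0.02):
    """Insert a key-switch event before every technique change.

    notes: list of (onset_s, duration_s, midi_pitch, technique), sorted by onset.
    Returns a list of events in the same layout, where key-switch events carry
    the label "keyswitch" instead of a technique name.
    """
    events = []
    current = None
    for onset, duration, pitch, technique in notes:
        if technique != current:
            # Fire the key-switch slightly ahead of the note it affects.
            ks_onset = max(0.0, onset - lead)
            events.append((ks_onset, ks_duration, KEYSWITCH[technique], "keyswitch"))
            current = technique
        events.append((onset, duration, pitch, technique))
    return events

score = [
    (0.0, 0.5, 67, "detache"),
    (0.5, 0.5, 69, "detache"),
    (1.0, 0.25, 71, "pizzicato"),
]
for event in add_keyswitches(score):
    print(event)
```

Because the labels come from the same schedule that drives the synthesizer, no manual annotation pass is needed, which is the property the pipeline relies on.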

Pitch & Timing Augmentation

Pitch shift (±0.1 st) · +5 dB gain · 2× random band-pass (32–4096 Hz) · Reverb (room 0.35)
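To give a sense of scale for these settings, the short sketch below converts them into more familiar quantities: a semitone offset into a frequency ratio, a decibel gain into an amplitude factor, and the band-pass range into octaves. The helper names are made up for this example, and the conversions are the standard equal-temperament and decibel formulas rather than anything specific to the authors' code.

```python
import math

def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def db_to_amplitude(db):
    """Linear amplitude factor for a gain in decibels."""
    return 10.0 ** (db / 20.0)

def octave_span(low_hz, high_hz):
    """Width of a frequency band in octaves."""
    return math.log2(high_hz / low_hz)

shift_up = semitones_to_ratio(0.1)     # upper bound of the ±0.1 st shift
gain = db_to_amplitude(5.0)            # +5 dB gain
band = octave_span(32.0, 4096.0)       # band-pass range

print(f"+0.1 st -> frequency ratio {shift_up:.5f}")
print(f"+5 dB   -> amplitude factor {gain:.3f}")
print(f"32-4096 Hz spans {band:.1f} octaves")
```

A shift of 0.1 semitone changes frequency by well under 1%, so it perturbs tuning slightly without moving a note to a neighbouring pitch class.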

5 Transcription Results

Model               URMP                        Bach10
                    P      R      F1     F1no   P      R      F1     F1no
Ours w/o aug        83.4   81.2   82.2   92.8   66.7   71.3   68.9   79.0
Ours w/ aug         86.1   83.6   84.5   93.1   68.1   71.8   69.9   79.5
Ours + FT w/o aug   84.4   79.0   81.3   91.3   69.5   73.7   71.5   80.2
Ours + FT w/ aug    85.0   82.1   83.3   92.9   63.3   68.4   65.7   77.8
MUSC [Tamer '24]    86.5   83.1   84.6   93.0   65.0   64.8   64.8   77.0
MERTech [Chen '24]  26.6   33.7   29.8   30.3   27.6   53.4   36.4   36.9

"FT" = fine-tuned from piano-pretrained checkpoint · "aug" = augmentation · Bold = best · Underlined = second-best

Key insight: State-of-the-art performance achieved from scratch with augmentation and ~30% less data than MUSC — transfer learning gains are limited when sufficient domain-specific data are available.
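The poster does not spell out how P, R, and F1 are computed. A common convention in note-level transcription evaluation counts an estimated note as correct when its pitch matches a reference note and its onset falls within a small tolerance. The sketch below follows that convention with a 50 ms tolerance and simple greedy one-to-one matching, both of which are assumptions here; reading F1no as the variant that ignores offsets is likewise an assumption.

```python
def note_prf(reference, estimate, onset_tol=0.05):
    """Onset-only note precision, recall, and F1.

    reference, estimate: lists of (onset_seconds, midi_pitch).
    Matching is greedy and one-to-one; evaluation toolkits typically use an
    optimal bipartite matching instead, so this is a simplification.
    """
    used = set()
    matched = 0
    for onset, pitch in estimate:
        for i, (ref_onset, ref_pitch) in enumerate(reference):
            if i in used:
                continue
            if ref_pitch == pitch and abs(ref_onset - onset) <= onset_tol:
                used.add(i)
                matched += 1
                break
    precision = matched / len(estimate) if estimate else 0.0
    recall = matched / len(reference) if reference else 0.0
    if matched == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = [(0.00, 60), (0.50, 62), (1.00, 64)]
estimate = [(0.01, 60), (0.52, 62), (1.20, 64), (1.50, 65)]
p, r, f1 = note_prf(reference, estimate)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In the toy example the third estimated note has the right pitch but arrives 200 ms late, so it counts as both a false positive and a missed reference note.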

6 Technique Classification Results

Ablation        Macro Acc (%)   Flageolet (%)   Détaché (%)     Pizzicato (%)   Spiccato (%)
Full ablation   70.46 ±2.57     86.44 ±4.19     51.75 ±9.97     57.06 ±15.33    86.56 ±2.55
Frame excl.     66.21 ±13.24    71.79 ±16.53    70.16 ±32.58    63.80 ±38.66    59.10 ±19.71
Offset excl.    59.71 ±10.19    72.80 ±27.65    55.41 ±24.71    52.75 ±45.82    57.85 ±24.79
Onset excl.     65.82 ±8.63     91.55 ±1.96     51.94 ±19.77    65.47 ±20.35    54.34 ±11.90
Velocity excl.  55.59 ±3.55     77.07 ±22.73    65.67 ±26.70    0.16 ±0.28      79.45 ±2.81
No ablation     77.22 ±6.35     71.89 ±14.12    63.12 ±12.59    88.80 ±3.11     85.08 ±4.87
MERTech [16]    53.36 ±1.02     95.77 ±2.23     58.80 ±1.63     43.27 ±1.19     15.61 ±2.06

Mean ± std from 3 random data splits on RWC dataset. Bold = best, Underlined = second-best per column.
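Macro accuracy weights every technique equally regardless of how many notes it has, so it can be checked directly against the per-class columns. The sketch below averages the four per-class means from the "No ablation" row; the function name and dictionary keys are illustrative only.

```python
from statistics import mean

def macro_accuracy(per_class_accuracy):
    """Unweighted mean of per-class accuracies (in percent)."""
    return mean(per_class_accuracy.values())

# Per-class means from the "No ablation" row of the table.
no_ablation = {
    "flageolet": 71.89,
    "detache": 63.12,
    "pizzicato": 88.80,
    "spiccato": 85.08,
}
macro = macro_accuracy(no_ablation)
print(f"macro accuracy: {macro:.2f}%")
```

The result agrees with the reported 77.22% up to rounding, which is the behaviour expected when the macro score is the plain average of the four class accuracies.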

7 Confusion Matrix

Predictions aggregated across all folds

Rows: true technique. Columns: predicted technique.

True \ Predicted   Flag.   Dét.    Pizz.   Spic.
Flag.              69%     4%      18%     9%
Dét.               11%     63%     2%      24%
Pizz.              1%      4%      89%     6%
Spic.              3%      10%     2%      85%

Pizzicato is misclassified least often (89% correct), consistent with its distinct acoustic character. Détaché is most often confused with spiccato (24% of détaché notes), as both use short bow strokes.
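The same reading can be reproduced programmatically from the row-normalized matrix. The sketch below hard-codes the percentages reported in this section and, for each true technique, picks the most frequent wrong prediction; the variable and function names are illustrative.

```python
LABELS = ["flageolet", "detache", "pizzicato", "spiccato"]

# Rows are true classes, columns are predicted classes, values in percent.
CONFUSION = [
    [69, 4, 18, 9],
    [11, 63, 2, 24],
    [1, 4, 89, 6],
    [3, 10, 2, 85],
]

def top_confusions(matrix, labels):
    """For each true class, return (most frequent wrong prediction, percent)."""
    result = {}
    for i, row in enumerate(matrix):
        off_diagonal = [(value, labels[j]) for j, value in enumerate(row) if j != i]
        value, label = max(off_diagonal)
        result[labels[i]] = (label, value)
    return result

confusions = top_confusions(CONFUSION, LABELS)
for true_label, (wrong_label, percent) in confusions.items():
    print(f"{true_label}: most often mistaken for {wrong_label} ({percent}%)")
```

Besides the détaché/spiccato pair, this also surfaces that flageolet notes are most often mistaken for pizzicato (18%).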

8 UMAP Embedding Visualization

[Figure: UMAP of note-level technique embeddings]

Fig. 3: UMAP visualization of RWC data on learned note-level embeddings. Despite training on synthetic data only, learned representations generalize to unseen real-world recordings with clear class separation.

9 Conclusions & Future Work

  • 93.1 F1no on URMP — matching SOTA with ~30% less data
  • 77.2% macro accuracy on real-world technique classification
  • 76 hrs synthetic technique data — no expert annotation needed
  • 1st joint violin transcription + technique prediction model

Future work: Extend synthesis framework to a broader range of techniques and other bowed string instruments (viola, cello, double bass).