Research on Multi-Pitch Estimation of Polyphonic Music
Abstract
Multi-pitch estimation (MPE), also called multiple fundamental frequency (F0) estimation, is one of the most important and difficult problems in music information retrieval (MIR). Its main task is to estimate the pitches (F0s) of the sounding notes and their number (the polyphony) at every instant of a piece of polyphonic music; sometimes the note onsets and offsets must be estimated as well.
This thesis starts by introducing the background of MIR research, then presents the main tasks of multi-pitch estimation, its research value, and its relations to other research problems. After that, it systematically reviews some representative pitch estimation algorithms. On this basis, it proposes two new algorithms.
The first algorithm is a single-frame MPE algorithm based on maximum-likelihood spectral modeling. Unlike traditional methods that model the whole spectrum, this algorithm reduces the frequency spectrum to the peaks and the non-peak area of the amplitude spectrum, and further reduces each peak to its frequency and amplitude. Along with these reductions, the maximum-likelihood model is split into two parts: the peak likelihood and the non-peak-area likelihood. In modeling the peaks, the concepts of "true" and "false" peaks are introduced and modeled separately, to cope with errors of the peak detection method. In modeling the non-peak area, the probability that the harmonics falling in that area produced no detected peak serves as the likelihood. The two parts of the likelihood model different aspects of the spectrum and are complementary. Their parameters are learned from monophonic training data, where "true" and "false" peaks can be discriminated fairly reliably. A weighted Bayesian Information Criterion (BIC) is employed to estimate the polyphony. Finally, the algorithm is tested on random chords and musical chords, both generated from recordings of real instrument notes. The experimental results are promising.
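The abstract describes this model only in prose. As an illustration only, the following Python sketch shows how the two likelihood terms and a weighted-BIC polyphony search could fit together. The Gaussian deviation model, the constants (sigma, p_false, p_miss, the six-harmonic limit), and the exhaustive subset search are assumptions made for this sketch, not the thesis's actual parameterization, and the amplitude part of the peak likelihood is omitted.

```python
import numpy as np
from itertools import combinations

def peak_likelihood(peaks, f0s, sigma=0.3, p_false=1e-4):
    """Log-likelihood of the detected peaks: each peak is explained either
    as a 'true' peak near some harmonic of a candidate F0 (Gaussian in its
    semitone deviation) or as a 'false' peak with a small floor probability.
    The amplitude term of the thesis's model is omitted for brevity."""
    ll = 0.0
    for freq, _amp in peaks:
        best = p_false                 # fallback: the peak is a 'false' peak
        for f0 in f0s:
            h = max(1, round(freq / f0))            # nearest harmonic index
            dev = 12.0 * np.log2(freq / (h * f0))   # deviation in semitones
            best = max(best, float(np.exp(-dev ** 2 / (2 * sigma ** 2))))
        ll += np.log(best)
    return ll

def non_peak_likelihood(peaks, f0s, n_harm=6, p_miss=0.3):
    """Log-probability that the low-order harmonics of the candidate F0s
    falling in the non-peak area (no detected peak within a quarter tone)
    were all missed by the peak detector."""
    peak_freqs = np.array([f for f, _ in peaks], dtype=float)
    ll = 0.0
    for f0 in f0s:
        for h in range(1, n_harm + 1):
            target = h * f0
            detected = peak_freqs.size > 0 and bool(
                np.any(np.abs(12.0 * np.log2(peak_freqs / target)) < 0.5))
            if not detected:
                ll += np.log(p_miss)
    return ll

def estimate_f0s(peaks, candidates, max_poly=4, bic_weight=1.0):
    """Exhaustive search over candidate F0 subsets; the polyphony is chosen
    by a weighted BIC: log-likelihood minus a weighted complexity penalty."""
    n = max(len(peaks), 2)             # observations; guard keeps log(n) > 0
    best_set, best_score = (), -np.inf
    for k in range(1, max_poly + 1):
        for f0s in combinations(candidates, k):
            ll = peak_likelihood(peaks, f0s) + non_peak_likelihood(peaks, f0s)
            score = ll - 0.5 * bic_weight * k * np.log(n)
            if score > best_score:
                best_set, best_score = f0s, score
    return list(best_set)

# Toy example: spectral peaks of a C4+E4 dyad (frequency in Hz, amplitude);
# the candidate F0s are taken from the detected peak frequencies.
peaks = [(262, 1.0), (330, 0.9), (524, 0.5), (660, 0.4), (786, 0.3)]
print(estimate_f0s(peaks, candidates=[262, 330, 524, 660, 786]))  # -> [262, 330]
```

In the thesis the "true"-peak, "false"-peak, and miss distributions are learned from monophonic training data rather than fixed by hand; the sketch only shows how the two likelihood terms and the weighted BIC interact when the polyphony is unknown.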
The second algorithm is a multiple-frame MPE algorithm based on Computational Auditory Scene Analysis (CASA). In this algorithm, we simulate the grouping cues of human auditory perception to group the time-frequency components of the signal. More concretely, the concept of a partial event is defined: each partial event is a four-element vector (frequency, amplitude, onset, offset). For a piece of music to be processed, all of its partial events are extracted to form a set, in which every event is a candidate F0 event. A support-transfer algorithm is then designed to make the events vote for one another, electing the events with the highest support as F0s. The proposed algorithm is tested on random chords generated from recordings of real instrument notes, and on computer-synthesized chamber music. The results are promising.
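As with the first algorithm, the following Python sketch is illustrative only. The PartialEvent container mirrors the four-element vector defined above; the harmonicity test, the amplitude-times-overlap support weight, the single voting round, and the top-k election rule are all assumptions made for this sketch, since the abstract does not specify the actual support-transfer rule.

```python
from dataclasses import dataclass

@dataclass
class PartialEvent:
    freq: float    # Hz
    amp: float
    onset: float   # seconds
    offset: float

def temporal_overlap(a: PartialEvent, b: PartialEvent) -> float:
    """Length (in seconds) of the time span shared by two events."""
    return max(0.0, min(a.offset, b.offset) - max(a.onset, b.onset))

def is_harmonic_of(partial: float, candidate: float, tol: float = 0.03) -> bool:
    """True if 'partial' lies near an integer multiple of 'candidate'."""
    h = round(partial / candidate)
    return h >= 1 and abs(partial / candidate - h) <= tol * h

def support_transfer(events, top_k=2):
    """Let every partial event transfer support to the candidates of which it
    could be a harmonic, weighted by its amplitude and temporal overlap; the
    candidates with the highest accumulated support are elected as F0s."""
    support = [0.0] * len(events)
    for i, cand in enumerate(events):
        for ev in events:
            if ev is cand:
                continue
            ov = temporal_overlap(ev, cand)
            if ov > 0.0 and is_harmonic_of(ev.freq, cand.freq):
                support[i] += ev.amp * ov
    ranked = sorted(zip(support, events), key=lambda p: p[0], reverse=True)
    return [ev for s, ev in ranked[:top_k] if s > 0.0]

# Toy example: a C4 note (262 Hz and harmonics) overlapping an E4 note
# (330 Hz and harmonics) that starts half a second later.
events = [
    PartialEvent(262, 1.0, 0.0, 1.0), PartialEvent(524, 0.6, 0.0, 1.0),
    PartialEvent(786, 0.3, 0.0, 1.0),
    PartialEvent(330, 1.0, 0.5, 1.5), PartialEvent(660, 0.5, 0.5, 1.5),
]
for f0 in support_transfer(events):
    print(f"elected F0: {f0.freq} Hz")   # 262 Hz, then 330 Hz
```

In a full system the partial events would first be tracked across consecutive analysis frames of the spectrogram; here they are given directly, and the two elected events are the true F0s of the overlapping notes.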
