/*******************************************************************************************************************
-- Title : [Py3.5] Chunking Rule Test for RegexpParser
-- Reference : web search
-- Key word : nltk pos word_tokenize pos_tag chunk chunking natural language processing pos tagging pos_tagging
              tokenizer word tokenize chunking tokenizing
*******************************************************************************************************************/
■ Chunk Rule Test
# -*- coding: utf-8 -*-

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import RegexpParser

# ------------------------------
# -- word token & pos tagging
# ------------------------------
#sent = "Systems and methods for spectrally dispersed illumination optical coherence tomography."
#sent = "Telemetric docking station"
sent = "Hand-held medical-data capture-device having a digital infrared sensor with no analog readout ports and " \
       "optical detection of vital signs through variation amplification and interoperation with electronic medical record systems."
#sent = "Device for use in electro-biological signal measurement in the presence of a magnetic field."

word_token = word_tokenize(sent)
pos_tags = pos_tag(word_token)

print(sent)
print("... raw_text", "." * 100, "\n")
print(word_token)
print("... word_token", "." * 100, "\n")

# ------------------------------
# -- chunking
# ------------------------------
chunk_rule = r"""
NP: {<N.*><JJ><NN>*}
    {<N.*><VBG><NN>*}
    {<J.*>*<NN>*}
"""
chunk_parse = RegexpParser(chunk_rule)
chunk = chunk_parse.parse(pos_tags)
print(chunk)
chunk.draw()  # opens the NLTK tree viewer in a Tkinter window
print(",,, chunk", "," * 100, "\n")

# ------------------------------
# -- regex notes
# ------------------------------
"""
<RB.?>* = "0 or more of any tense of adverb," followed by:
<VB.?>* = "0 or more of any tense of verb," followed by:
<NNP>+  = "one or more proper nouns," followed by
<NN>?   = "zero or one singular noun."

NP: {<N.*>*<Suffix>?}   # noun phrase
VP: {<V.*>*}            # verb phrase
AP: {<A.*>*}            # adjective phrase

{<DT|JJ>}          # chunk determiners and adjectives
}<[\.VI].*>+{      # chink any tag beginning with V, I, or .
<.*>}{<DT>         # split a chunk at a determiner
<DT|JJ>{}<NN.*>    # merge a chunk ending with det/adj
                   # with one starting with a noun

parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*}   # NP
P:  {<IN>}                # preposition
V:  {<V.*>}               # verb
PP: {<P> <NP>}            # PP -> P NP
VP: {<V> <NP|PP>*}        # VP -> V (NP|PP)*
''')

{<N.*>}           # any tag starting with N (nouns)
{<P.*>}           # any tag starting with P (pronouns)
{<DT><JJR>}       # a DT immediately followed by a JJR
{<DT><J.*>}       # a DT followed by any tag starting with J
{<DT><JJ>?<NN>}   # a DT and an NN, with or without a JJ in between
"""
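The chink (}...{) operator in the notes above is easiest to see on a small pre-tagged sentence. Below is a minimal sketch along the lines of the example in the NLTK book; it is not part of the original test code, and the toy sentence and grammar are purely illustrative:

# -*- coding: utf-8 -*-
from nltk.chunk import RegexpParser

# Pre-tagged toy sentence, so the result does not depend on a tagger model.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

grammar = r"""
NP:
    {<.*>+}       # first chunk everything into one NP
    }<VBD|IN>+{   # then chink (carve out) sequences of VBD and IN
"""
parser = RegexpParser(grammar)
print(parser.parse(tagged))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))

Chunking everything first and then chinking out the verbs and prepositions is often simpler than enumerating every NP pattern: the second rule only has to name what an NP is not.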