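About human language
Characteristics of human language
- Language is symbols: human language is essentially a symbol system. Whether Chinese characters or English letters, they are all symbols used to carry and convey the meaning we want to express.
- Carriers of language: sound, vision (writing), and gesture. Whichever carrier is used, it is a continuous mode of communication.
- The brain is a kind of symbolic processor: we can view the brain's processing of language as a continuous pattern of activation.
- This suggests an idea: explore a continuous encoding pattern of thought. Many NLP algorithms are built on this idea, and it also addresses the sparsity problem.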
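To make the sparsity point concrete, here is a minimal sketch comparing a sparse one-hot encoding with a dense, continuous one. The three-word vocabulary and the dense vectors are hand-picked assumptions for illustration only; real dense vectors would be learned from data.

```python
import math

vocab = ["hotel", "motel", "banana"]

def one_hot(word):
    """Sparse encoding: a vector as long as the vocabulary with a single 1."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense vectors, chosen by hand for this example (not learned).
dense = {
    "hotel": [0.9, 0.8, 0.1],
    "motel": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Any two different one-hot vectors are orthogonal, so similarity is always 0.
sparse_sim = cosine(one_hot("hotel"), one_hot("motel"))
# Dense vectors can place related words close together.
dense_sim = cosine(dense["hotel"], dense["motel"])
dense_far = cosine(dense["hotel"], dense["banana"])
```

With one-hot vectors, "hotel" and "motel" look exactly as unrelated as "hotel" and "banana"; a continuous encoding leaves room for graded similarity.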
About NLP
NLP levels
- Two main input sources: speech or text. Speech: phonetic or phonological analysis; text: OCR (Optical Character Recognition) or tokenization. These steps produce the input to NLP.
- Morphological analysis: analyze the form of words, such as prefixes and suffixes.
- Syntactic analysis: analyze the structure of the sentence and its grammar.
- Semantic interpretation: work out the meaning of sentences.
- Discourse processing: the meaning of most sentences has to be inferred from context, so analyzing the current sentence alone is not enough; hence the field of discourse processing.
Note: cs224n focuses mainly on syntactic and semantic analysis, plus some speech signal analysis.
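As a rough illustration of the two lowest text levels above (tokenization and morphological analysis), here is a toy sketch. The affix lists and the helper names `tokenize` and `split_affixes` are assumptions made up for this example; a real morphological analyzer is far more involved.

```python
import re

# Hypothetical affix lists, nowhere near complete.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "ly", "s")

def tokenize(text):
    """Split raw text into lowercase word tokens and punctuation marks."""
    return re.findall(r"[a-z]+|[^\sa-z]", text.lower())

def split_affixes(word):
    """Peel at most one known prefix and one known suffix off a word.

    An affix is only removed if more than two characters remain afterwards,
    which keeps short words such as "red" or "is" intact.
    """
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2), ""
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2), ""
    )
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

tokens = tokenize("Unkindness is real.")
analysis = [split_affixes(t) for t in tokens if t.isalpha()]
```

Here `tokens` holds the word and punctuation tokens, and `analysis` holds one `(prefix, stem, suffix)` triple per word; syntactic and semantic analysis would build on output like this.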
NLP Applications
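- Lower level: spell checking, keyword search, finding synonyms
- Mid level: extracting information. This is the direction I am personally most interested in: letting computers read text and understand what it is about, or at least know its topic; identifying and extracting particular content from text; grading reading difficulty (work out the reading level of school text); identifying the intended audience of a document; sentiment analysis (positive or negative).
- High level: machine translation, chatbots, question answering, machine writing (exploit the knowledge of world)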
Why is NLP hard
- Language itself is hard: language is ambiguous, and moreover, humans often do not say everything (for the sake of efficiency, a lot is left out when language is actually used).
- Representing language is hard: the complexity of representing and using linguistic, situational, and world knowledge.
- Interpreting language is hard: the real meaning of language depends on the real world, common sense, and contextual knowledge.
About deep learning
Problems with traditional machine learning
- Most traditional machine learning algorithms work well because of human-designed representations and input features.
- "Machines" are only used to optimize the weights so as to best make a final prediction.
- Moreover, manually designed features are often over-specified (lack of generalization) and incomplete, and they take a long time to design and validate.
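The division of labor described above can be sketched with a tiny sentiment classifier: a human writes the feature function, and the machine only fits the weights. The word lists, the training sentences, and the choice of a perceptron are all assumptions made for this illustration.

```python
# Hand-designed lexicons: this is the "human-designed representation".
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def features(text):
    """Map a sentence to [positive-word count, negative-word count, bias]."""
    words = text.lower().split()
    return [
        sum(w in POSITIVE for w in words),
        sum(w in NEGATIVE for w in words),
        1.0,  # bias term
    ]

def train_perceptron(data, epochs=10):
    """The machine's only job: adjust the weights on the given features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in data:  # label is +1 or -1
            x = features(text)
            score = sum(w * xi for w, xi in zip(weights, x))
            if label * score <= 0:  # misclassified, so nudge the weights
                weights = [w + label * xi for w, xi in zip(weights, x)]
    return weights

def predict(weights, text):
    score = sum(w * xi for w, xi in zip(weights, features(text)))
    return 1 if score > 0 else -1

train = [
    ("good great movie", 1),
    ("i love it", 1),
    ("bad awful plot", -1),
    ("i hate it", -1),
]
weights = train_perceptron(train)
```

The classifier handles "great acting" and "awful pacing", but a sentence like "wonderful film" gets a score of zero because "wonderful" is not in the hand-written lexicon: the features are incomplete, and no amount of weight tuning fixes that.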
What is deep learning
- Subfield of machine learning and part of representation learning.
- Deep learning algorithms attempt to learn (multiple levels of) representations and an output themselves.
- We only input the raw data.
- Most of the time, deep learning means neural networks (the dominant model family).
deep learning in NLP
Core idea: represent language with vectors, and use neural networks to combine and compute over those vectors.
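A minimal sketch of that idea, under loud assumptions: the vocabulary, the dimensions, and the random weights below are all made up, and nothing is trained. It only shows the shape of the computation, with words mapped to vectors and the vectors fed through a small network.

```python
import math
import random

random.seed(0)
DIM, HIDDEN = 4, 3  # hypothetical sizes, chosen only to keep the example small
vocab = ["deep", "learning", "for", "nlp"]

# Randomly initialised embeddings and weights; in practice these are learned.
embeddings = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
W1 = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
w2 = [random.uniform(-1, 1) for _ in range(HIDDEN)]

def sentence_vector(tokens):
    """Represent a sentence as the average of its known word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def forward(tokens):
    """One hidden tanh layer followed by a sigmoid output in (0, 1)."""
    x = sentence_vector(tokens)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

score = forward(["deep", "learning", "for", "nlp"])
```

Averaging word vectors is just one simple way to combine them; the point is that both the representation (`embeddings`) and the computation (`W1`, `w2`) are numeric parameters that a learning algorithm could adjust, rather than hand-written features.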
Post Date: 2018-11-02
Copyright notice: This is an original article; please credit the source when reposting.