Syntactic parsing (computational linguistics)

From HandWiki
Short description: Automatic analysis of syntactic structure of natural language

Syntactic parsing is the automatic analysis of syntactic structure of natural language, especially syntactic relations (in dependency grammar) and labelling spans of constituents (in constituency grammar).[1] It is motivated by the problem of structural ambiguity in natural language: a sentence can be assigned multiple grammatical parses, so some kind of knowledge beyond computational grammar rules is needed to tell which parse is intended. Syntactic parsing is one of the important tasks in computational linguistics and natural language processing, and has been a subject of research since the mid-20th century with the advent of computers.

Different theories of grammar propose different formalisms for describing the syntactic structure of sentences. For computational purposes, these formalisms can be grouped under constituency grammars and dependency grammars. Parsers for either class call for different types of algorithms, and approaches to the two problems have taken different forms. The creation of human-annotated treebanks using various formalisms (e.g. Universal Dependencies) has proceeded alongside the development of new algorithms and methods for parsing.

Part-of-speech tagging (which resolves some semantic ambiguity) is a related problem, and often a prerequisite for or a subproblem of syntactic parsing. Syntactic parses can be used for information extraction (e.g. event parsing, semantic role labelling, entity labelling) and may be further used to extract formal semantic representations.

Constituency parsing

See also: Phrase structure grammar

Constituency parsing involves parsing in accordance with constituency grammar formalisms, such as Minimalism or the formalism of the Penn Treebank. This, at the very least, means telling which spans are constituents (e.g. [The man] is here.) and what kind of constituent it is (e.g. [The man] is a noun phrase) on the basis of a context-free grammar (CFG) which encodes rules for constituent formation and merging.[2]

Algorithms generally require the CFG to be converted to Chomsky Normal Form (with two children per constituent), which can be done without losing any information about the tree or reducing expressivity using the algorithm first described by Hopcroft and Ullman in 1979.[3]

CKY

The most popular algorithm for constituency parsing is the Cocke–Kasami–Younger algorithm (CKY),[4][5] which is a dynamic programming algorithm which constructs a parse in worst-case [math]\displaystyle{ \mathcal{O}\left( n^3 \cdot \left| G \right| \right) }[/math] time, on a sentence of [math]\displaystyle{ n }[/math] words and [math]\displaystyle{ \left| G \right| }[/math] is the size of a CFG given in Chomsky Normal Form.

Given the issue of ambiguity (e.g. preposition-attachment ambiguity in English) leading to multiple acceptable parses, it is necessary to be able to score the probability of parses to pick the most probable one. One way to do this is by using a probabilistic context-free grammar (PCFG) which has a probability of each constituency rule, and modifying CKY to maximise probabilities when parsing bottom-up.[6][7][8]

A further modification is the lexicalized PCFG, which assigns a head to each constituent and encodes rule for each lexeme in that head slot. Thus, where a PCFG may have a rule "NP → DT NN" (a noun phrase is a determiner and a noun) while a lexicalized PCFG will specifically have rules like "NP(dog) → DT NN(dog)" or "NP(person)" etc. In practice this leads to some performance improvements.[9][10]

More recent work does neural scoring of span probabilities (which can take into account context unlike (P)CFGs) to feed to CKY, such as by using a recurrent neural network or transformer[11] on top of word embeddings.

In 2022, Nikita Kitaev et al.[12] introduced an incremental parser that first learns discrete labels (out of a fixed vocabulary) for each input token given only the left-hand context, which are then the only inputs to a CKY chart parser with probabilities calculated using a learned neural span scorer. This approach is not only linguistically-motivated, but also competitive with previous approaches to constituency parsing. Their work won the best paper award at ACL 2022.

Transition-based

Following the success of [math]\displaystyle{ O(n) }[/math] transition-based parsing for dependency grammars, work began on adapting the approach to constituency parsing. The first such work was by Kenji Sagae and Alon Lavie in 2005, which relied on a feature-based classifier to greedily make transition decisions.[13] This was followed by the work of Yue Zhang and Stephen Clark in 2009, which added beam search to the decoder to make more globally-optimal parses.[14] The first parser of this family to outperform a chart-based parser was the one by Muhua Zhu et al. in 2013, which took on the problem of length differences of different transition sequences due to unary constituency rules (a non-existent problem for dependency parsing) by adding a padding operation.[15]

Note that transition-based parsing can be purely greedy (i.e. picking the best option at each time-step of building the tree, leading to potentially non-optimal or ill-formed trees) or use beam search to increase performance while not sacrificing efficiency.

Sequence-to-sequence

See also: Seq2seq

A different approach to constituency parsing leveraging neural sequence models was developed by Oriol Vinyals et al. in 2015.[16] In this approach, constituent parsing is modelled like machine translation: the task is sequence-to-sequence conversion from the sentence to a constituency parse, in the original paper using a deep LSTM with an attention mechanism. The gold training trees have to be linearised for this kind of model, but the conversion does not lose any information. This runs in [math]\displaystyle{ O(n) }[/math] with a beam search decoder of width 10 (but they found little benefit from greater beam size and even limiting it to greedy decoding performs well), and achieves competitive performance with traditional algorithms for context-free parsing like CKY.

Dependency parsing

See also: Dependency grammar

Dependency parsing is parsing according to a dependency grammar formalism, such as Universal Dependencies (which is also a project that produces multilingual dependency treebanks). This means assigning a head (or multiple heads in some formalisms like Enhanced Dependencies, e.g. in the case of coordination) to every token and a corresponding dependency relation for each edge, eventually constructing a tree or graph over the whole sentence.[17]

There are broadly three modern paradigms for modelling dependency parsing: transition-based, grammar-based, and graph-based.[18]

Transition-based

Many modern approaches to dependency tree parsing use transition-based parsing (the base form of this is sometimes called arc-standard) as formulated by Joakim Nivre in 2003,[19] which extends on shift-reduce parsing by keeping a running stack of tokens, and deciding from three operations for the next token encountered:

  • Template:Sc1 (current token is a child of the top of the stack, is not added to stack)
  • Template:Sc1 (current token is the parent of the top of the stack, replaces top)
  • Template:Sc1 (add current token to the stack)

The algorithm can be formulated as comparing the top two tokens of the stack (after adding the next token to the stack) or the top token on the stack and the next token in the sentence.

Training data for such an algorithm is created by using an oracle, which constructs a sequence of transitions from gold trees which are then fed to a classifier. The classifier learns which of the three operations is optimal given the current state of the stack, buffer, and current token. Modern methods use a neural classifier which is trained on word embeddings, beginning with work by Danqi Chen and Christopher Manning in 2014.[20] In the past, feature-based classifiers were also common, with features chosen from part-of-speech tags, sentence position, morphological information, etc.

This is an [math]\displaystyle{ O(n) }[/math] greedy algorithm, so it does not guarantee the best possible parse or even a necessarily valid parse, but it is efficient.[21] It is also not necessarily the case that a particular tree will have only one sequence of valid transitions that can reach it, so a dynamic oracle (which may permit multiple choices of operations) will increase performance.[22]

A modification to this is arc-eager parsing, which adds another operation: Template:Sc1 (remove the top token on the stack). Practically, this results in earlier arc-formation.

These all only support projective trees so far, wherein edges do not cross given the token ordering from the sentence. For non-projective trees, Nivre in 2009 modified arc-standard transition-based parsing to add the operation Template:Sc1 (swap the top two tokens on the stack, assuming the formulation where the next token is always added to the stack first). This increases runtime to [math]\displaystyle{ O(n^2) }[/math] in the worst-case but practically still near-linear.[23]

Grammar-based

A chart-based dynamic programming approach to projective dependency parsing was proposed by Michael Collins[24] in 1996 and further optimised by Jason Eisner[25] in the same year.[26] This is an adaptation of CKY (previously mentioned for constituency parsing) to headed dependencies, a benefit being that the only change from constituency parsing is that every constituent is headed by one of its descendant nodes. Thus, one can simply specify which child provides the head for every constituency rule in the grammar (e.g. an NP is headed by its child N) to go from constituency CKY parsing to dependency CKY parsing.

McDonald's original adaptation had a runtime of [math]\displaystyle{ O(n^5) }[/math], and Eisner's dynamic programming optimisations reduced runtime to [math]\displaystyle{ O(n^3) }[/math]. Eisner suggested three different scoring methods for calculating span probabilities in his paper.

Graph-based

Exhaustive search of the possible [math]\displaystyle{ n^2 }[/math] edges in the dependency tree, with backtracking in the case an ill-formed tree is created, gives the baseline [math]\displaystyle{ O(n^3) }[/math] runtime for graph-based dependency parsing. This approach was first formally described by Michael A. Covington in 2001, but he claimed that it was "an algorithm that has been known, in some form, since the 1960s".[27]

The problem of parsing can also be modelled as finding a maximum-probability spanning arborescence over the graph of all possible dependency edges, and then picking dependency labels for the edges in tree we find. Given this, we can use an extension of the Chu–Liu/Edmonds algorithm with an edge scorer and a label scorer. This algorithm was first described by Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič in 2005.[28] It can handle non-projective trees unlike the arc-standard transition-based parser and CKY. As before, the scorers can be neural (trained on word embeddings) or feature-based. This runs in [math]\displaystyle{ O(n^2) }[/math] with Tarjan's extension of the algorithm.[29]

Evaluation

The performance of syntactic parsers is measured using standard evaluation metrics. Both constituency and dependency parsing approaches can be evaluated for the ratio of exact matches (percentage of sentences that were perfectly parsed), and precision, recall, and F1-score calculated based on the correct constituency or dependency assignments in the parse relative to that number in reference and/or hypothesis parses. The latter are also known as the PARSEVAL metrics.[30]

Dependency parsing can also be evaluated using attachment score. Unlabelled attachment score (UAS) is the percentage of tokens with correctly assigned heads, while labelled attachment score (LAS) is the percentage of tokens with correctly assigned heads and dependency relation labels.[31]

Conversion between parses

Given that much work on English syntactic parsing depended on the Penn Treebank, which used a constituency formalism, many works on dependency parsing developed ways to deterministically convert the Penn formalism to a dependency syntax, in order to use it as training data. One of the major conversion algorithms was Penn2Malt, which reimplemented previous work on the problem.[32]

Work in the dependency-to-constituency conversion direction benefits from the faster runtime of dependency parsing algorithms. One approach is using constrained CKY parsing, ignoring spans which obviously violate the dependency parse's structure and thus reducing runtime to [math]\displaystyle{ O(n^2) }[/math].[33] Another approach is to train a classifier to find an ordering for all the dependents of every token, which results in a structure isomorphic to the constituency parse.[34]

References

  1. Jurafsky & Martin 2021.
  2. Jurafsky & Martin 2021, chapter 13.
  3. Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm". Informatica Didactica 8. http://ddi.cs.uni-potsdam.de/InformaticaDidactica/LangeLeiss2009.pdf. 
  4. Younger, Daniel H. (1967). "Recognition and parsing of context-free languages in time n3". Information and Control 10 (2): 189–208. doi:10.1016/S0019-9958(67)80007-X. 
  5. Kasami, T. (1966). An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Bedford, MA: Air Force Cambridge Research Laboratory. https://www.ideals.illinois.edu/handle/2142/74304. 
  6. Tsvetkov, Yulia; Titov, Ivan; Berg-Kirkpatrick, Taylor; Klein, Dan (2018). "Algorithms for NLP: Parsing I". Carnegie Mellon University. http://demo.clab.cs.cmu.edu/11711fa18/slides/FA18%2011-711%20lecture%2011%20-%20parsing%201.pdf. 
  7. Salomaa, Arto (1969). "Probabilistic and weighted grammars". Information and Control 15 (6): 529–544. doi:10.1016/S0019-9958(69)90554-3. 
  8. Booth, Taylor L. (1969). "Probabilistic representation of formal languages". 10th Annual Symposium on Switching and Automata Theory. https://ieeexplore.ieee.org/document/4569603. 
  9. Chen, Danqi; Narasimhan, Karthik (2019). "Constituency Parsing". https://www.cs.princeton.edu/courses/archive/fall19/cos484/lectures/lec10.pdf. 
  10. Collins, Michael (2003). "Head-Driven Statistical Models for Natural Language Parsing". Computational Linguistics 29 (4): 589–637. doi:10.1162/089120103322753356. https://aclanthology.org/J03-4003/. 
  11. Kitaev, Nikita; Klein, Dan (2018). "Constituency Parsing with a Self-Attentive Encoder". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. pp. 2676–2686. https://aclanthology.org/P18-1249/. 
  12. Kitaev, Nikita; Lu, Thomas; Klein, Dan (2022). "Learned Incremental Representations for Parsing". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. pp. 3086–3095. https://aclanthology.org/2022.acl-long.220/. 
  13. Sagae, Kenji; Lavie, Alon (2005). "A Classifier-Based Parser with Linear Run-Time Complexity". Proceedings of the Ninth International Workshop on Parsing Technology. pp. 125–132. https://aclanthology.org/W05-1513/. 
  14. Zhang, Yue; Clark, Stephen (2009). "Transition-Based Parsing of the Chinese Treebank using a Global Discriminative Model". Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09). pp. 162–171. https://aclanthology.org/W09-3825/. 
  15. Zhu, Muhua; Zhang, Yue; Chen, Wenliang; Zhang, Min; Zhu, Jingbo (2013). "Fast and Accurate Shift-Reduce Constituent Parsing". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. pp. 434–443. https://aclanthology.org/P13-1043/. 
  16. Vinyals, Oriol; Kaiser, Łukasz; Koo, Terry; Petrov, Slav; Sutskever, Ilya; Hinton, Geoffrey (2015). "Grammar as a Foreign Language". Advances in Neural Information Processing Systems 28 (NIPS 2015). https://papers.nips.cc/paper/2015/hash/277281aada22045c03945dcb2ca6f2ec-Abstract.html. 
  17. Jurafsky & Martin 2021, chapter 14.
  18. Kübler, McDonald & Nivre 2009.
  19. Nivre, Joakim (2003). "An Efficient Algorithm for Projective Dependency Parsing". Proceedings of the Eighth International Conference on Parsing Technologies. pp. 149–160. https://aclanthology.org/W03-3017/. 
  20. Chen, Danqi; Manning, Christopher (2014). "A Fast and Accurate Dependency Parser using Neural Networks". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 740–750. https://aclanthology.org/D14-1082/. 
  21. Stymne, Sara (17 December 2014). "Transition-based dependency parsing". Uppsala Universitet. https://cl.lingfil.uu.se/~sara/kurser/5LN455-2014/lectures/5LN455-F8.pdf. 
  22. Goldberg, Yoav; Nivre, Joakim (2012). "A Dynamic Oracle for Arc-Eager Dependency Parsing". Proceedings of COLING 2012. pp. 959–976. https://aclanthology.org/C12-1059/. 
  23. Nivre, Joakim (2009). "Non-Projective Dependency Parsing in Expected Linear Time". Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pp. 351–359. https://aclanthology.org/P09-1040/. 
  24. Collins, Michael John (1996). "A New Statistical Parser Based on Bigram Lexical Dependencies". 34th Annual Meeting of the Association for Computational Linguistics. pp. 184–191. https://aclanthology.org/P96-1025/. 
  25. Eisner, Jason M. (1996). "Three New Probabilistic Models for Dependency Parsing: An Exploration". COLING. https://aclanthology.org/C96-1058/. 
  26. Stymne, Sara (15 December 2014). "Collins' and Eisner's algorithms". Uppsala Universitet. https://cl.lingfil.uu.se/~sara/kurser/5LN455-2014/lectures/5LN455-F7.pdf. 
  27. Covington, Michael A. (2001). "A Fundamental Algorithm for Dependency Parsing". Proceedings of the 39th Annual ACM Southeast Conference. 
  28. McDonald, Ryan; Pereira, Fernando; Ribarov, Kiril; Hajič, Jan (2005). "Non-Projective Dependency Parsing using Spanning Tree Algorithms". Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. pp. 523–530. https://aclanthology.org/H05-1066/. 
  29. Goldberg, Yoav (2021). "Dependency Parsing". Bar Ilan University. https://u.cs.biu.ac.il/~89-680/depparsing.pdf. 
  30. Black, E.; Abney, S.; Flickenger, D.; Gdaniec, C.; Grishman, R.; Harrison, P.; Hindle, D.; Ingria, R. et al. (1991). "A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars". Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991. https://aclanthology.org/H91-1060/. 
  31. Kübler, McDonald & Nivre 2009, chapter 6.
  32. Nivre, Joakim; Hall, Johan; Nilsson, Jens (2006). "MaltParser: A Data-Driven Parser-Generator for Dependency Parsing". Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). https://aclanthology.org/L06-1084/. 
  33. Kong, Lingpeng; Rush, Alexander M.; Smith, Noah A. (2015). "Transforming Dependencies into Phrase Structures". Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 788–798. https://aclanthology.org/N15-1080/. 
  34. Fernández-González, Daniel; Martins, André F. T. (2015). "Parsing as Reduction". Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. pp. 1523–1533. https://aclanthology.org/P15-1147/. 

Further reading

Dependency parsing

  • Kübler, Sandra; McDonald, Ryan; Nivre, Joakim (2009). Graeme Hirst. ed. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.