Deep Learning for Natural Language Processing: A Survey

  • Published: 26 June 2023
  • Volume 273, pages 533–582 (2023)

  • E. O. Arkhangelskaya
  • S. I. Nikolenko

Abstract

Over the last decade, deep learning has revolutionized machine learning, and neural network architectures have become the method of choice across many domains. In this paper, we survey applications of deep learning to natural language processing (NLP). We begin by briefly reviewing the basic notions and major architectures of deep learning, including some recent advances that are especially important for NLP. We then survey distributed representations of words, showing both how word embeddings can be extended to sentences and paragraphs and how words can be broken down further in character-level models. Finally, the main part of the survey covers deep architectures that have either arisen specifically for NLP tasks or have become a method of choice for them; these tasks include sentiment analysis, dependency parsing, machine translation, dialog and conversational models, question answering, and other applications.
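
To make the distributed word representations mentioned above concrete, here is a minimal, illustrative sketch of skip-gram training with negative sampling, in the spirit of word2vec. The toy corpus, embedding dimension, window size, learning rate, and epoch count are assumptions chosen for the example, not settings taken from the survey.

```python
# Minimal skip-gram sketch with one negative sample per positive pair
# (word2vec-style). All hyperparameters below are illustrative assumptions.
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 16, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # "input" (target-word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # "output" (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for sent in corpus:
        for pos, word in enumerate(sent):
            t = idx[word]
            for off in range(-window, window + 1):
                if off == 0 or not 0 <= pos + off < len(sent):
                    continue
                c = idx[sent[pos + off]]
                # Positive pair: pull target and observed context together.
                vt, vc = W_in[t].copy(), W_out[c].copy()
                g = sigmoid(vt @ vc) - 1.0
                W_in[t] -= lr * g * vc
                W_out[c] -= lr * g * vt
                # One negative sample: push a random word away (it may
                # occasionally be a true context word; fine for a sketch).
                n = int(rng.integers(V))
                vt, vn = W_in[t].copy(), W_out[n].copy()
                g = sigmoid(vt @ vn)
                W_in[t] -= lr * g * vn
                W_out[n] -= lr * g * vt

# Words that share contexts (e.g., "cat" and "dog") should end up with
# similar vectors; rank the vocabulary by cosine similarity to "cat".
norm = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)
sims = norm @ norm[idx["cat"]]
print(sorted(zip(sims, vocab), reverse=True)[:3])
```

After a few hundred passes over this toy corpus, words that occur in interchangeable contexts ("cat"/"dog", "mat"/"rug") should drift toward each other in the embedding space; this is the basic property that the sentence-, paragraph-, and character-level extensions discussed in the survey build on.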



W.-t. Yih, G. Zweig, and J. C. Platt, “Polarity inducing latent semantic analysis,” in: EMNLP-CoNLL ’12 , ACL (2012), pp. 1212–1222.

W. Yin and H. Schutze, “Multigrancnn: An architecture for general matching of text chunks on multiple levels of granularity,” in: Proc. 53rd ACL and the 7th IJCNLP , Vol. 1, Long Papers (Beijing, China), ACL (2015), pp. 63–73.

W. Yin, H. Schutze, B. Xiang, and B. Zhou, “ABCNN: attention-based convolutional neural network for modeling sentence pairs,” arXiv (2015).

J. Yohan and O. A. H., “Aspect and sentiment unification model for online review analysis,” in: WSDM ’11 , ACM (2011), pp. 815–824.

A. M. Z. Yang, A. Kotov, and S. Lu, “Parametric and non-parametric user-aware sentiment topic models,” in: Proc. 38th ACM SIGIR (2015).

W. Zaremba and I. Sutskever, “Reinforcement learning neural Turing machines,” arXiv (2015).

W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv (2014).

M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” arXiv (2012).

L. S. Zettlemoyer and M. Collins, “Learning to map sentences to l51ogical form: Structured classification with probabilistic categorial grammars,” arXiv (2012).

X. Zhang and Y. LeCun, “Text understanding from scratch,” arXiv (2015).

X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in: Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), Curran Associates, Inc. (2015), pp. 649–657.

G. Zhou, T. He, J. Zhao, and P. Hu, “Learning continuous word embedding with metadata for question retrieval in community question answering,” in: Proc. 53rd ACL and the 7th IJCNLP , Vol. 1, Long Papers (Beijing, China), ACL (2015), pp. 250–259.

H. Zhou, Y. Zhang, S. Huang, and J. Chen, “A neural probabilistic structured-prediction model for transition-based dependency parsing,” in: Proc. 53rd ACL and the 7th IJCNLP , Vol. 1, Long Papers (Beijing, China), ACL (2015), pp. 1213–1222.

Download references

Author information

Authors and Affiliations

Saarland University, 66123, Saarbrücken, Germany

E. O. Arkhangelskaya

St. Petersburg State University, St. Petersburg, Russia

S. I. Nikolenko

St. Petersburg Department of Steklov Mathematical Institute RAS, St. Petersburg, Russia


Corresponding author

Correspondence to E. O. Arkhangelskaya.

Additional information

Published in Zapiski Nauchnykh Seminarov POMI , Vol. 499, 2021, pp. 137–205.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Arkhangelskaya, E.O., Nikolenko, S.I. Deep Learning for Natural Language Processing: A Survey. J Math Sci 273 , 533–582 (2023). https://doi.org/10.1007/s10958-023-06519-6


Received : 14 January 2019

Published : 26 June 2023

Issue Date : July 2023

DOI : https://doi.org/10.1007/s10958-023-06519-6



The State of the Art of Natural Language Processing—A Systematic Automated Review of NLP Literature Using NLP Techniques


Jan Sawicki , Maria Ganzha , Marcin Paprzycki; The State of the Art of Natural Language Processing—A Systematic Automated Review of NLP Literature Using NLP Techniques. Data Intelligence 2023; 5 (3): 707–749. doi: https://doi.org/10.1162/dint_a_00213


Nowadays, natural language processing (NLP) is one of the most popular areas of, broadly understood, artificial intelligence. Therefore, every day, new research contributions are posted, for instance, to the arXiv repository. Hence, it is rather difficult to capture the current "state of the field" and thus, to enter it. This brought the idea of applying state-of-the-art NLP techniques to analyse the NLP-focused literature. As a result, (1) meta-level knowledge concerning the current state of NLP has been captured, and (2) a guide to the use of basic NLP tools is provided. It should be noted that all the tools and the dataset described in this contribution are publicly available. Furthermore, the originality of this review lies in its full automation. This allows easy reproducibility, continuation, and updating of this research in the future, as new research emerges in the field of NLP.

1. INTRODUCTION

Natural language processing (NLP) is rapidly growing in popularity in a variety of domains, from closely related ones, like semantics [ 1 , 2 ] and linguistics [ 3 , 4 ] (e.g. inflection [ 5 ], phonetics and onomastics [ 6 ], automatic text correction [ 7 ]) or named entity recognition [ 8 , 9 ], to distant ones, like bibliometrics [ 10 ], cybersecurity [ 11 ], quantum mechanics [ 12 , 13 ], gender studies [ 14 , 15 ], chemistry [ 16 ] or orthodontia [ 17 ]. This, among others, brings an opportunity for early-stage researchers to enter the area. Since NLP can be applied to many domains and languages, and involves the use of many techniques and approaches, it is important to realize where to start.

This contribution attempts to address this issue by applying NLP techniques to the analysis of NLP-focused literature. As a result, with a fully automated, systematic, visualization-driven literature analysis, a guide to the state of the art of natural language processing is presented. In this way, two goals are achieved: (1) providing an introduction to NLP for scientists entering the field, and (2) supporting a possible knowledge update for experienced researchers. The main research questions (RQs) considered in this work are:

RQ1: What datasets are considered to be most useful?

RQ2: Which languages, other than English, appear in NLP research?

RQ3: What are the most popular fields and topics in current NLP research?

RQ4: What particular tasks and problems are most often studied?

RQ5: Is the field “homogeneous”, or are there easily identifiable “subgroups”?

RQ6: How difficult is it to comprehend the NLP literature?

Taking into account that the proposed approach is, itself, anchored in NLP, this work is also an illustration of how selected standard NLP techniques can be used in practice, and which of them should be used for which purpose. However, it should be made clear that the considerations presented in what follows should be treated as “illustrative examples”, not “strict guidelines”. Moreover, it should be stressed that none of the applied techniques has been optimized for the task (e.g. no hyperparameter tuning has been applied). This is a deliberate choice, as the goal is to provide an overview and “general ideas”, rather than to overwhelm the reader with technical details of individual NLP approaches. For technical details concerning optimization of the mentioned approaches, the reader should consult the referenced literature.

The whole analysis has been performed in Python—a programming language that has been ubiquitous in data science research and projects for years [ 18 , 19 , 20 , 21 , 22 , 23 ]. Python was also chosen for the following reasons:

It provides a heterogeneous environment

It allows use of Jupyter Notebooks ① , which allow quick and easy prototyping, testing and code sharing

There exists an abundance of data science libraries ② , which allow everything from acquiring the dataset, to visualizing the result

It offers readability and speed in development [ 24 ]

The presented analysis follows the order of the research questions. To make the text more readable, readers are introduced to pertinent NLP methods in the context of answering individual questions.

2. DATA AND PREPROCESSING

At the beginning of NLP research, there is always data. This section introduces the dataset of research papers used in this work, and describes how it was preprocessed.

2.1 Data Used in the Research

To adequately represent the domain, and to apply NLP techniques, it is necessary to select an abundant and well-documented repository of related texts (stored in a digital format). Moreover, to automate the conducted analysis, and to allow easy reproduction, it is crucial to choose a set of papers that can be easily accessed, e.g. a database with a functional Application Programming Interface (API). Finally, for obvious reasons, open-access datasets are the natural targets for NLP-oriented work.

In the context of this work, while there are multiple repositories that contain NLP-related literature, the best choice turned out to be arXiv (for the papers themselves, and for the metadata it provides), combined with Semantic Scholar (for the “citation network” and other important metadata; see Section 3.3.1).

Note that other datasets have been considered, but were not selected. The reasons for this decision are summarized in Table 1.

Table 1. Databases and the reasons for their inapplicability in this research task.

Google Scholar: Google Scholar does not contain the actual data (text, PDF, etc.) of any work—there are only links to other databases. Moreover, performed tests determined that the API (Python “scholarly” library) works well with small queries, but fetching information about thousands of papers results in download rate limits and temporary IP address blocking. Finally, Google Scholar is criticized, among others, for excessive secrecy [ ], biased search algorithms [ ], and incorrect citation counts [ ].

PubMed: PubMed is mainly focused on medical and biological papers. Therefore, the number of works related to NLP is somewhat limited, and difficult to identify using straightforward approaches.

ResearchGate: There are two main problems with ResearchGate, as seen from the perspective of this work: the lack of an easily accessible API, and restrictions on some articles’ availability (a large number of papers have to be requested from the authors—and such requests may not be fulfilled, or the wait time may be excessive).

Scopus: The Scopus API is not fully open-access, and has restrictions on the number of requests that can be issued within a specific time.

JSTOR: Even though the JSTOR website declares that an API exists, the link does not provide any information about it (404 not found).

Microsoft Academic: The Microsoft Academic API is very well documented, but it does not provide true open access (it requires a subscription key). Moreover, it does not contain the actual text of works; mostly metadata.

2.1.1 Dataset Downloading and Filtering

The papers were fetched from arXiv on 26 August 2021. The resulting dataset includes all articles extracted as a result of issuing the query “natural language processing” ④. As a result, 4712 articles were retrieved. Two articles were discarded because their PDFs were too complicated for the tools used for text extraction (1710.10229v1—problems with a chart on page 15; 1803.07136v1—problems with a chart on page 6; see also Section 2.2). Even though the query was not bounded by the “time when the article was uploaded to arXiv” parameter, it turned out that a solid majority of the articles had submission dates from the last decade. Specifically, the distribution was as follows:

192 records uploaded before 2010-01-01

243 records from between (including) 2010-01-01 and 2014-12-31

697 records from between (including) 2015-01-01 and 2017-12-31

3580 records uploaded after 2018-01-01

On the basis of this distribution, it was decided that there is no reason to impose time constraints, because the “old” works should not be able to “overshadow” the “newest” literature. Moreover, it was decided that it is worth keeping all available publications, as they might lead to additional findings (e.g., with respect to the most original work, described in Section 3.7.4).

Finally, all articles not written in English were discarded, reducing the total count to 4576 texts. This decision, while somewhat controversial, was made so that the authors of this contribution could understand the results, and to avoid complex issues related to text translation. However, it is easy to observe that the number of texts not written in English (and stored in arXiv) was relatively small (< 5%). Nevertheless, this leaves open the question of the relationship between NLP-related work written in English and that written in other languages. However, addressing this topic is out of scope of this contribution.
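To make the acquisition step concrete, the following is a minimal sketch of fetching the query results and filtering out non-English entries. It assumes the `arxiv` and `langdetect` Python packages; the paper does not name the exact libraries it used for these two steps, so both are illustrative choices.

```python
# Fetching the dataset from arXiv and keeping English-language entries only;
# `arxiv` and `langdetect` are assumed (illustrative) library choices.
import arxiv
from langdetect import detect, LangDetectException

client = arxiv.Client(page_size=100, delay_seconds=3)  # polite paging
search = arxiv.Search(query="natural language processing", max_results=5000)

papers = []
for result in client.results(search):
    try:
        if detect(result.summary) == "en":  # judge language from the abstract
            papers.append({"id": result.entry_id, "title": result.title,
                           "published": result.published})
    except LangDetectException:
        continue  # skip degenerate/empty abstracts

print(f"retained {len(papers)} English-language papers")
```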

2.2 Text Preprocessing

Obviously, the key information about a research contribution is contained in its text. Therefore, the subsequent analysis applied NLP techniques to the texts of the downloaded papers. To do this, the following preprocessing has been applied. The PDFs have been converted to plain text using pdfminer.six (a Python library ⑤). Here, notice that there are several other libraries that can also be used to convert PDF to text. Specifically, the following libraries have been tried: pdfminer ⑥, pdftotree ⑦, BeautifulSoup ⑧. On the basis of performed tests, pdfminer.six was selected, because it provided the simplest API, produced results which did not have to be further converted (as opposed to, e.g., BeautifulSoup), and performed the fastest conversion.
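A minimal sketch of this conversion step with pdfminer.six (the file name below is hypothetical):

```python
# PDF-to-text conversion with pdfminer.six, as named in the text.
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")  # hypothetical local file
print(text[:500])  # preview the extracted plain text
```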

Use of different text analysis methods may require different preprocessing. Some methods, like keyphrase search, work best when the text is “thoroughly cleaned”; i.e. almost reduced to a “bag of words” [ 28 ]. This means that, for instance, words are lemmatized, there is no punctuation, etc. However, some more recent techniques (like text embeddings [ 29 ]) can (and should) be trained on a “dirty” text, like Wikipedia [ 30 ] dumps ⑨ or Common Crawl ⑩ . Hence, it is necessary to distinguish between (at least) two levels of text cleaning: (A) “delicately cleaned” text (in what follows, called “Stage 1” cleaning), where only parts insignificant to the NLP analysis are removed, and (B) a “very strictly cleaned” text (called “Stage 2” cleaning). Specifically, “Stage 1” cleaning includes removal of:

charts and diagrams improperly converted to text,

arXiv “watermarks”,

references section (which were not needed, since metadata from Semantic Scholar was used),

links, formulas, misconverted characters (e.g. “ff”).

Stage 2 cleaning is applied to the results of Stage 1 cleaning, and consists of the following operations (a code sketch follows the list):

All punctuation, numbers and other non-letter characters were removed, leaving only letters.

Adpositions, adverbs, conjunctions, coordinating conjunctions, determiners, interjections, numerals, particles, pronouns, punctuation, subordinating conjunctions, symbols, end-of-line and space tokens were removed. The parts of speech left after filtering were: verbs, nouns, auxiliaries and “other”. The “other” category is usually tagged for meaningless text, e.g. “asdfgh”. However, these were not deleted, in case the algorithm detected something that was, in fact, important, e.g. domain-specific shortcuts and abbreviations like CNN, RNN, etc.

Words have been lemmatized.
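The following is a minimal sketch of Stage 2 cleaning, assuming spaCy with the small English model; the paper does not give its exact implementation, so the POS filter below simply mirrors the description above.

```python
# Stage 2 cleaning, assuming spaCy's "en_core_web_sm" model; the POS filter
# mirrors the description above ("X" is spaCy's tag for the "other" category).
import re
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP_POS = {"VERB", "NOUN", "AUX", "X"}

def stage2_clean(text: str) -> str:
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters only
    doc = nlp(text)
    lemmas = [tok.lemma_.lower() for tok in doc
              if tok.pos_ in KEEP_POS and not tok.is_space]
    return " ".join(lemmas)

print(stage2_clean("CNNs were trained on 3 large datasets."))
```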

Note that while individual NLP techniques may require more specific data cleaning, the two (Stage 1 and Stage 2) workflows are generic enough to be successfully applied in the majority of typical NLP applications.

3. PERFORMED EXPERIMENTS, APPLIED METHODS AND ANALYSIS OF RESULTS

This section traverses research questions RQ1 to RQ6 and summarizes the findings for each of them. Furthermore, it introduces the specific NLP methods used to address each question. Interested readers are invited to study the referenced literature to find additional details.

3.1 RQ1: Finding Most Popular Datasets Used in NLP

As noted, a fundamental aspect of all data science projects is the data. Hence, this section summarizes the most popular (open) datasets used in NLP research. Here, the information about these datasets (i.e. their names) was extracted from the analyzed texts using Named Entity Recognition and keyphrase search. Let us briefly summarize these two methods.

3.1.1 Named Entity Recognition (NER)

Named Entity Recognition (NER) can be seen as finding an answer to “the problem of locating and categorizing important nouns, and proper nouns, in a text” [ 31 ]. Here, automatic methods should facilitate extraction of, among others, named topics, issues, problems, and other “things” mentioned in texts (e.g. in articles). Hence, the spaCy [ 32 ] NER model “en_core_web_lg” ⑪ has been used to extract named entities. These entities have been linked by co-occurrence, and visualized as networks (further described in Section 3.4).
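A short usage sketch of this step (the example sentence is illustrative):

```python
# Named entity extraction with the spaCy model named in the text.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("We fine-tuned BERT on Wikipedia and evaluated it at SemEval.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. ("Wikipedia", "ORG"); labels may vary
```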

SpaCy has been chosen over other models (e.g. transformers [ 33 ] pipeline ⑫ ), because it was simpler to use, and performed faster.

3.1.2 Keyphrase Search

Another simple and effective way of extracting information from text is keyword and/or keyphrase search [ 34 , 35 ]. This technique can be used not only in preliminary exploratory data analysis (EDA), but also to extract actual and useful findings. Furthermore, keyphrase search is complementary to, and extends, the results of Named Entity Recognition (Section 3.1.1).

To apply keyphrase search, first, texts were cleaned with Stage 2 cleaning (see Section 2.2). Second, they were converted to phrases (n-grams) of lengths 1-4. Next, two exhaustive lists were created, based on all phrases (n-grams): (a) allowed phrases (609 terms), and (b) banned phrases (1235 terms). The allowed phrases contained words and phrases that were meaningful for natural language processing, or were specific enough to be considered separate, e.g. TF-IDF, accuracy, annotation, NER, taxonomy. The list of banned phrases contained words and phrases which, on their own, carried no significant meaning for this research, e.g. bad, big, bit, long, power, index, default. The banned phrases also contained some incoherent phrases which slipped through the previous cleaning phases. These lists were used to filter the phrases found in the texts. The obtained results were converted to networks of phrase co-occurrence, to visualize phrase importance and relations between phrases.
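A minimal sketch of this filtering, with tiny illustrative allow/ban lists standing in for the paper's full 609- and 1235-term lists:

```python
# n-gram keyphrase extraction with allow/ban filtering; the two lists here
# are tiny illustrative subsets of the paper's 609- and 1235-term lists.
from collections import Counter

ALLOWED = {"tf-idf", "accuracy", "annotation", "ner", "taxonomy"}
BANNED = {"bad", "big", "bit", "long", "power", "index", "default"}

def keyphrases(cleaned_text: str) -> Counter:
    tokens = cleaned_text.split()
    counts = Counter()
    for n in range(1, 5):  # phrases of lengths 1-4
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ALLOWED and phrase not in BANNED:
                counts[phrase] += 1
    return counts

print(keyphrases("we report accuracy and tf-idf baselines for ner"))
```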

3.1.3 Approaches to Finding Names of Most Popular NLP Datasets

Keyword search was used to extract names of NLP datasets used in collected papers. To properly factor out dataset names and omit noise words, two approaches were applied: unsupervised and list-based.

The unsupervised approach consisted of extracting proper nouns (detected with the Python spaCy ⑬ library) in the near neighborhood (at most 3 words before or after) of words like “data”, “dataset” and similar.

In the list-based approach, the algorithm looked for particular dataset names identified in three large aggregated lists of NLP datasets ⑭ ⑮ ⑯.
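A sketch of the unsupervised, window-based variant; the trigger-word list is an assumption (“and similar” is not enumerated in the text):

```python
# Window-based extraction of candidate dataset names: proper nouns within
# 3 tokens of a trigger word; the trigger list is an illustrative assumption.
import spacy

nlp = spacy.load("en_core_web_lg")
TRIGGERS = {"data", "dataset", "corpus"}

def candidate_datasets(text: str) -> set:
    doc = nlp(text)
    hits = set()
    for i, tok in enumerate(doc):
        if tok.lower_ in TRIGGERS:
            window = doc[max(0, i - 3): i + 4]  # 3 tokens before and after
            hits.update(t.text for t in window if t.pos_ == "PROPN")
    return hits

print(candidate_datasets("We evaluate on the SQuAD dataset and a Twitter corpus."))
```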

3.1.4 Findings Related to RQ1: What are the Most Popular NLP Datasets

This section presents the findings which answer RQ1, i.e. which datasets are most often used in NLP research. To best show which datasets are popular, and to outline which are used together, a heatmap has been created; it is presented in Figure 1. In general, a heatmap provides not only a general ranking of features (looking only at the diagonal), but also information about the correlation of features, or lack thereof.
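A sketch of how such a co-usage heatmap can be produced on a logarithmic scale, assuming a small illustrative co-occurrence matrix (the real matrix would be built from the extracted dataset mentions):

```python
# Rendering a dataset co-usage heatmap on a logarithmic scale; the 3x3
# matrix below is illustrative, standing in for the full co-usage counts.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

co = pd.DataFrame([[120, 30, 5], [30, 80, 2], [5, 2, 40]],
                  index=["Wikipedia", "Twitter", "WordNet"],
                  columns=["Wikipedia", "Twitter", "WordNet"])

# log1p compresses the range while keeping zero co-occurrences finite
sns.heatmap(np.log1p(co), annot=co, fmt="d", cmap="viridis")
plt.title("Dataset co-usage (log scale)")
plt.show()
```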

Heatmap of top 10 percentile of NLP datasets co-usage (logarithmic scale).

It can be easily seen that the most popular dataset used in NLP is Wikipedia. Among the top 4 most popular datasets, one can also find Twitter, Facebook, and WordNet. There is a high correlation between the use of datasets extracted from Twitter and Facebook, which are very frequently used together. This is both intuitive and observable in articles dedicated to social network analysis [ 36 ], social text sentiment analysis [ 37 ], social media mining [ 38 ] and other social-science-related texts [ 39 ]. Manual checking also determined that Twitter is extremely popular in sentiment analysis and other emotion-related explorations [ 40 ].

3.2 Findings Related to RQ2: What Languages are Studied in NLP Research

The second research question concerned the languages that were analyzed in the reported research (not the language a paper was written in). This information was mined using the same two methods, i.e. keyphrase search and NER. The results were represented in two ways. The basic one is the co-occurrence heatmap presented in Figure 2.

Heatmap of language co-occurrence in articles.

For clarity, the following is the ranking of top 20 most popular languages, by number of papers in which they have been considered:

English: 2215

Chinese: 809

German: 682

French: 533

Spanish: 416

Arabic: 306

Japanese: 299

Italian: 257

Russian: 239

Portuguese: 154

Turkish: 144

Korean: 130

Finnish: 125

Swedish: 125

As visible in Figure 2, the most popular language is English, but this may be caused by the bias of analyzing only papers written in English. Next, there is no particular positive, or negative, correlation between languages. However, there are slight negative correlations between Basque and Bengali, Irish and Thai, and Thai and Urdu, which means that these languages are very rarely researched together. There are two observations regarding these languages. (1) All of them are niche and do not have a big speaking population. (2) All pairs have very distant geographical origins, so there may be a low demand for their co-studying.

3.3 Findings Related to RQ3: What are the Popular Fields, and Topics, of Research

Let us now discuss the findings related to the most popular fields and topics of the reported research. In order to ascertain them, in addition to keyphrase search and NER, metadata mining and text summarization have been applied. These methods are introduced in some detail below.

3.3.1 Metadata Mining

In addition to the information available within the text of a publication, further information can be found in its metadata. For instance, the date of publishing, overall categorization, hierarchical topic assignment and more, as discussed in the next paragraphs.

Therefore, metadata has been fetched both from the original source (arXiv API) and from the Semantic Scholar ⑰ . As a result, for each retrieved paper, the following information became available for further analysis:

data: title, abstract and PDF,

metadata: authors, arXiv category and publishing date,

citations/references.

Note that the Semantic Scholar topics are different from the arXiv categories. The arXiv categories follow a set taxonomy ⑱ , which is used by the person who uploads the text. On the other hand, the Semantic Scholar “uses machine language techniques to analyze publications and extract topic keywords that balance diversity, relevance, and coverage relative to our corpus.” ⑲

The metadata from both sources was complete for all articles (there were no missing fields for any of the papers). Obviously, one cannot guarantee that the information itself was correct. This had to be (and was) assumed, in order to use this data in further analysis.
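A sketch of fetching such metadata for a single paper from the Semantic Scholar Graph API; the field list is illustrative, not necessarily the one used by the authors:

```python
# Fetching per-paper metadata from the Semantic Scholar Graph API; the
# requested fields are illustrative.
import requests

paper_id = "arXiv:1810.04805"  # the BERT paper, used here as an example
fields = "title,year,citationCount,references.title"
url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"

meta = requests.get(url, params={"fields": fields}, timeout=30).json()
print(meta["title"], meta["citationCount"])
```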

3.3.2 Matching Literature to Research Topics

In a literature review, one may analyze all available information. However, it is much faster to initially check whether a particular paper's topic is related to one's planned/ongoing research. Both Semantic Scholar and arXiv provide this information in the metadata: Semantic Scholar provides “topics”, while arXiv provides “categories”.

Figure 3 shows (1) what topics are the most popular (see the first column from the left), and (2) the correlation of topics. The measure used in the heatmap (correlation matrix) is the count of articles tagged with topics (logarithmic scale has been used).

Correlation matrix between top 0.5 percentile of topics (logarithmic scale).

Obviously, the most popular field of research is “Natural Language Processing”. It is also worth mentioning that Artificial intelligence, Machine Learning and Deep Learning also score high in the article count. This is intuitive, as current applications of NLP are pursued using approaches from, broadly understood, artificial intelligence.

Moreover, the correlation, and high score, between “Deep Learning” and “Artificial Neural Networks” mirrors the influence of BERT and similar models. On the other hand, there are topics, which very rarely coincide. These are, for instance, Parsing and Computer Vision, Convolutional Neural Networks and Machine Translation, Speech Recognition and Sentiment analysis.

There is also one topic worth pointing out: Baseline (configuration management). According to Semantic Scholar, it is defined as “an agreed description of the attributes of a product, at a point in time, which serves as a basis for defining change” ⑳. This topic does not particularly suit NLP, as it is too vague, and it could have been incorrectly assigned by the machine learning algorithm on the backend of Semantic Scholar.

Yet another interesting aspect is the evolution of topics and categories in time, which gives a wider perspective on which of them are rising in, or falling from, popularity. Figure 4 shows the most popular categories in time. The category cs.CL (“Computation and Language”) dominates in all periods because it is the main subcategory of NLP. However, multiple interesting observations can be made. First, categories that are particularly popular nowadays are: cs.LG (Machine Learning), cs.AI (Artificial Intelligence), and cs.CV (Computer Vision and Pattern Recognition). Second, there are categories which experience a drop in interest. These are: stat.ML (Machine Learning) and cs.NE (Neural and Evolutionary Computing).

Most popular categories in time (top 96 percentile for each time period).

Moving to “categories” from arXiv, it is important to elaborate on the difference between them and “topics”. As mentioned, arXiv follows a taxonomy with two levels: a primary category (always a single one) and secondary categories (there may be many).

To best show this relation, as well as categories’ popularity, a treemap chart has been created, which is most suitable for “nested” category visualization. It is shown in Figure 5 .

Similarly to the Semantic Scholar “topics”, the largest primary category is cs.CL (Computation and Language), which is a counterpart to the NLP topic from the arXiv nomenclature. Its top secondary categories are cs.LG/stat.ML (both categories of Machine Learning) and cs.AI (Artificial Intelligence). This is, again, consistent with previous findings and shows how these domains overlap each other. It is also worth noting the presence of cs.CV (Computer Vision and Pattern Recognition), which, although to a lesser degree, is also important in the NLP literature. Manual verification shows that, in this context, computer vision refers mostly to image description with text [ 41 ], visual question answering [ 42 ], using transformer neural networks for image recognition [ 43 , 44 ], and other image pattern recognition, vaguely related to NLP.

Similarly to categories, a trend analysis has been performed for topics. It is presented in Figure 6. The most popular topic over time is NLP, followed by Artificial neural network, Experiment, Deep learning, and Machine learning. Here, no particular evolution is noticeable, except for a rise in interest in the Language model topic.

Simplified treemap visualizing arXiv primary categories aggregating secondary categories. Outer rectangles are primary categories, inner rectangles are other assigned categories. Other categories include the primary category to additionally show the primary categories' size. Top 20.0 percentile of primary categories and categories. Colors are purely aesthetic.

Most popular topics in time (top 99.8 percentile for each time period).

3.3.3 Citations

Another interesting piece of meta-information is the citation count [ 45 , 46 ]. Hence, this statistic was used to determine key works, which were then used to establish the key research topics in NLP (addressing also RQ1-3).

It is well known that, in most cases, the distribution of node degrees in a citation network is exponential [ 47 ]. Specifically, there are many works with 0-1 citations, and very few with more than 10 citations. In this context, the citation network of the top 10% most highly cited papers is depicted in Figure 7. The most cited papers are 1810.04805v2 [ 48 ] (5760 citations), 1603.04467v2 [ 49 ] (2653 citations) and 1606.05250v3 [ 50 ] (1789 citations). The first one is the introduction of the BERT model. Here, it is easy to notice that this paper absolutely dominates the network in terms of degree. It is the network's focal point. This means that the whole domain revolves not only around one particular topic, but also around a single paper.

Citation network of all articles (arrows point towards cited paper); top 5 percentile; A→B means A cites B (B is a reference of A); color scale indicates how many papers cite a given paper (yellow—higher, dark blue—lower).

The second paper concerns TensorFlow, the state-of-the-art library for neural network construction and management. The third introduces SQuAD—a text dataset with over 100,000 questions, used for machine learning. It is important to note that these are the top 3 papers not only when considering works published after 2015, but also when the “all-time most cited works” are searched for.

How can two papers cite each other? An interesting observation was made during the citation analysis. Typically, the relation where one paper cites another should be one-way. In other words, when paper A cites paper B, paper B is a reference for paper A, so the sets of citations and references should be disjoint. This is true for over 95% of works. However, 363 papers have an intersection between citations and references, with the biggest intersection containing as many as 10 common positions. Further manual analysis determined that this “anomaly” happens due to the existence of preprints, and all other cases where a paper appeared publicly (e.g. as a Technical Report) and was then revised and cited a different paper. This may happen, for instance, when a paper is criticised and is reprinted (an updated version is created) to address the critique.

3.4 RQ3 Related Findings Based on Application of Keyphrase and Entity Networks

As discussed, NER has been used to determine the NLP datasets and languages analyzed in the papers. It can also be used when looking for the techniques used in research. However, to better visualize the topic of interest, it can be combined with network analysis. Specifically, work reported in the literature involves many-to-many relations, which provide information on which techniques, methods, problems, languages, etc. are used alone, in tandem or, perhaps, in groups. To properly explore the area, four-dimensional networks (see Figures 8 and 9) have been constructed, with: nodes (entities), node size (scaled by an attribute), edges (relations), and edge width (scaled by an attribute). Moreover, since all networks are exponential and have a very high edge density, only the top percentile of entities has been graphically represented (to preserve readability). The networks have been built using the networkx [ 51 ] and igraph [ 52 ] Python libraries.
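A minimal networkx sketch of building such a co-occurrence network and trimming it by node weight; the three-paper input is illustrative:

```python
# Building an entity co-occurrence network and trimming it by node weight;
# `papers` (one set of extracted entities per paper) is illustrative input.
import itertools
import networkx as nx

papers = [{"BERT", "Wikipedia"}, {"BERT", "Twitter"}, {"BERT", "Wikipedia"}]

G = nx.Graph()
for entities in papers:
    for a, b in itertools.combinations(sorted(entities), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1  # edge weight = number of co-occurrences
        else:
            G.add_edge(a, b, weight=1)

# node weight = sum of incident edge weights; keep only the heaviest nodes
node_weight = {n: sum(d["weight"] for _, _, d in G.edges(n, data=True))
               for n in G.nodes}
top = {n for n, w in node_weight.items() if w >= 2}
print(list(G.subgraph(top).edges(data=True)))
```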

Entity network; entities detected using spaCy (en_core_web_lg 3.1.0); edge width—number of papers with the entity; node size and color—sum of weights of edges; top 0.4 percentile of node weight; top 20.0 percentile of edge weight.

As shown in Figure 8, the majority of entities are related to models such as BERT, and to neural network architectures (e.g. RNN, CNN). However, the findings show not only NLP-related topics, but all entities. Here, an important warning regarding the used NER models should be stated. In most cases, when NER is applied directly, and without additional techniques, the entities are not disambiguated, or properly unified. For instance, surnames like Kim, Wang, Liang, Liu, Chen, etc. are not properly recognized as names of different persons and are “bagged together”. Therefore, further interpretation of NER results may require manual checking.

Moreover, corroborating an earlier noted result, Wikipedia and Twitter can be observed as the most popular data sources for NLP.

Finally, among important entities, the Association for Computational Linguistics (also shown as “the Association for Computational Linguistics” and “ACL” ㉑) has been found. This society organizes conferences and events, and also runs a journal about natural language processing.

Figure 9 shows very popular named entities, but skips the most often found ones. This has been done to allow other frequent terms to become visible. Specifically, the networks were trimmed by node weight, i.e. the number of papers including the named entity. Figure 9 contains terms between the 99.5 and 99.9 percentiles by node weight. In addition to some previously made observations, new entities appeared, which show what else is of considerable interest in the NLP literature. These are:

GPU (Graphics Processing Unit), which is often used to accelerate neural network training (and use) [ 53 ]

WordNet—a semantic network “connecting words” with regard to their meaning [ 54 ] ㉒ and ImageNet—an image database using the WordNet hierarchy to propose a network of images [ 55 ] ㉓

SemEval—a popular contest in NLP, occurring annually and challenging scientists with different NLP tasks ㉔

and other particular methods (citations contain example papers): Bayesian methods [ 56 ], CBOW (Continuous Bag of Words) [ 57 ], Markov processes [ 58 ]

Entity network; entities detected using spaCy (en_core_web_lg 3.1.0); edge width—number of papers with the entity; node size and color—sum of weights of edges; node weight between 99.5 and 99.9 percentile; top 20.0 percentile of edge weight.

As described in Section 3.1.2, keyphrase search was used to extract those terms and findings which might have been missed in the NER results. For example, the word “accuracy” denotes a widely used metric in NLP and many other domains; however, it is not a named entity, because it is also an “ordinary” English word, and is not detected as such by NER models. The applied analysis produced a network of keyphrase co-occurrence. Hence, network visualization was again applied (Figure 10). This allowed the formulation of hypotheses, which underwent further (positive) manual verification, specifically:

Keyphrase co-occurrence network. Node size—article count where the keyword appears; node color—citation sum where the keyword appears; edge width & color—number of articles in which two terms appeared.

BERT models are most commonly used in their pretrained “version”/“state”. BERT is already a pretrained model, but it is possible to continue its training (to get a better representation of a particular language, topic or domain). The second approach is using BERT, or its pretrained variant, and training it on a target task, called a downstream task (this technique is also called “fine-tuning”).

Transformers are strongly connected with attention. This is because the transformer (a neural network architecture) is characterized by the presence of an attention mechanism, which is the distinguishing factor of this architecture [ 59 ].

“Music” is connected with “lyrics”. This shows that the intersection between NLP research and music domain is via lyrics analysis. The lack of correlation between music and other terms shows that audio analysis, sentiment analysis, etc. are not that popular in this context.

“Precision” is connected with “recall”. These two extremely popular evaluation metrics for classification are often used together. Their main point is to handle imbalanced datasets, where performance is not evaluated correctly by the “accuracy” [ 60 ] measure.

“Synset” is connected with “WordNet”. As shown, WordNet is most commonly used with Synset (a programmer-friendly interface available in the NLTK framework ㉕).

Quantum mechanics begins to emerge in NLP. The oldest works in the field of quantum computing (in the set under study) date back to 2013 [ 61 ], but most (>90%) of the recent works date to 2019-2021. These provide answers to problems such as: applying NLP algorithms on “nearly quantum” computers [ 62 ], sentence meaning inference with quantum circuit models and encoding-decoding [ 63 ], quantum machine learning [ 64 ] or, even, ready-to-use Python libraries for quantum NLP [ 65 ]. There are still very few works joining the worlds of NLP and quantum computing, but their number has been growing significantly since 2019.

Graphs are very common in research related to semantic analysis. One of the domains that NLP overlaps with and includes is semantics. The entity network illustrates how important the concept of a graph is in semantics research (e.g. knowledge graphs). Some works touch on these topics in tandem with text embedding [ 66 ], text summarization [ 67 ], knowledge extraction/inference/infusion [ 67 ] or question answering [ 68 ].

3.4.1 Text Summarization

Another approach to extract key information (including the field of research) is to reduce the original text to a brief and simple “conclusion”. This can be done with extractive and abstractive summarization methods. Both aim at allowing the user to comprehend the main message of the text. Moreover, depending on what sentences are chosen in the extractive summarization methods, one may find which abstracts (and papers) are most “summaritive”.

Extractive summarization. First, extractive methods have been used to summarize the text of all abstracts. Specifically, the following methods have been applied (a usage sketch follows the list):

the Luhn method [ 69 ] (max 5 sentences), shown in Listing 1

Latent Semantic Analysis [ 70 ] (max 5 sentences), shown in Listing 2

LexRank [ 71 ] (max 5 sentences), shown in Listing 3

TextRank [ 72 ] (max 5 sentences), shown in Listing 4
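A usage sketch of one of the listed methods (LexRank) with the sumy library; the other three summarizers are drop-in replacements:

```python
# Extractive summarization with sumy; LexRank shown, while LsaSummarizer,
# LuhnSummarizer and TextRankSummarizer are drop-in replacements.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

abstracts = "..."  # the concatenated abstracts would go here
parser = PlaintextParser.from_string(abstracts, Tokenizer("english"))

for sentence in LexRankSummarizer()(parser.document, 5):  # max 5 sentences
    print(sentence)
```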

Here, note that, due to formatting errors in the original texts, the library pysummarization ㉖ had trouble with “sentences with periods” (e.g. “3.5% by the two models, respectively.” is only a part of a full sentence, but it contains a period character).

Abstractive summarization. Previous research found that abstractive summarization methods can “understand the sense” of the text and build its summary [ 73 ]. It was also found that their overall performance is better than that of extractive methods [ 74 ]. However, most state-of-the-art solutions have limitations related to the maximum number of tokens, i.e. BERT-like models (e.g. the distilbart-cnn-12-6 model [ 75 ], bart-large-cnn [ 75 ], bert-extractive-summarizer [ 76 ]) support a maximum of 512 tokens, while the largest Pegasus model supports 1024 [ 77 ].

Nevertheless, very recent work proposes a transformer model for long-text summarization, the “Longformer” [ 78 ], which is designed to summarize texts of 4000 tokens and more. However, this capability comes with a high RAM requirement. So, in order to test abstractive methods, the Longformer was applied only to the titles of the most influential texts (top 5% by citation count).
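A sketch of this long-document setup with a Longformer Encoder-Decoder checkpoint via the transformers pipeline; the specific model name is an assumption, as the paper does not give it:

```python
# Long-document abstractive summarization with a Longformer Encoder-Decoder
# (LED) checkpoint; "allenai/led-large-16384-arxiv" is an assumed model choice.
from transformers import pipeline

summarizer = pipeline("summarization", model="allenai/led-large-16384-arxiv")
titles = ("Attention Is All You Need. "
          "BERT: Pre-training of Deep Bidirectional Transformers.")
print(summarizer(titles, max_length=64, min_length=10)[0]["summary_text"])
```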

A final note about text summarization is that the most recent research has proposed innovative ways to overcome the length issue (see [ 79 ]). There is thus a possibility to apply text summarization, for instance, to abstracts combined with the introductions and conclusions of research papers. Testing this possibility may be a good starting point for research, but it is out of scope of this contribution.

3.4.2 Summarization Findings

Listings 1, 2, 3 and 4 show summaries of all abstracts, and Listing 5 shows the summary of all titles (as described in Section 3.4.1).

The common part of all summaries addresses (in hierarchical order, starting from the most popular features):

natural language processing and artificial intelligence,

translation and image processing,

neural networks,

deep neural network architectures, e.g. CNN, RNN, encoder-decoder, transformers, and

deep neural network models, e.g. BERT, ELMO.

Moreover, the main “ideas” which appear in the summaries are: effectiveness, “state-of-the-art” solutions, and solutions “better than others”. This shows the “competitive” and “progress-focused” nature of the domain. Authors find it necessary to highlight how “good”, or “better than” others, their solution is. It may also mean that there is not much space for “exploratory” and “non-results-oriented” research (at least this is the message that permeates the top-cited articles). Similarly, research indicating which approaches do not work in a given domain is not appreciated.

Summary with LSA (512.9 sec)

Natural language processing, as a data analytics related technology, is used widely in many research areas such as artificial intelligence, human language processing, and translation. [paper id: 1608.04434v1]

At present, due to explosive growth of data, there are many challenges for natural language processing. [paper id: 1608.04434v1]

Hadoop is one of the platforms that can process the large amount of data required for natural language processing. [paper id: 1608.04434v1]

KOSHIK is one of the natural language processing architectures, and utilizes Hadoop and contains language processing components such as Stanford CoreNLP and OpenNLP. [paper id: 1608.04434v1]

This study describes how to build a KOSHIK platform with the relevant tools, and provides the steps to analyze wiki data. [paper id: 1608.04434v1]


Summary with LexRank (11323.26 sec)

Many natural language processing applications use language models to generate text. [paper id: 1511.06732v7]

However, there is no known natural language processing (NLP) work on this language. [paper id: 1912.03444v1]

However, few have been presented in the natural language process domain. [paper id: 2107.07114v1]

Here, we show their effectiveness in natural language processing. [paper id: 2109.04712v1]

The other two methods however, are not as useful. [paper id: 2109.01411v1]

Summary with sumy-TextRank (497.67 sec)

Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference. [paper id: 1901.04085v5]

In chapter 1, we give a brief introduction of the history and the current landscape of collaborative filtering and ranking; chapter 2 we first talk about pointwise collaborative filtering problem with graph information, and how our proposed new method can encode very deep graph information which helps four existing graph collaborative filtering algorithms; chapter 3 is on the pairwise approach for collaborative ranking and how we speed up the algorithm to near-linear time complexity; chapter 4 is on the new listwise approach for collaborative ranking and how the listwise approach is a better choice of loss for both explicit and implicit feedback over pointwise and pairwise loss; chapter 5 is about the new regularization technique Stochastic Shared Embeddings (SSE) we proposed for embedding layers and how it is both theoretically sound and empirically effectively for 6 different tasks across recommendation and natural language processing; chapter 6 is how we introduce personalization for the state-of-the-art sequential recommendation model with the help of SSE, which plays an important role in preventing our personalized model from overfitting to the training data; chapter 7, we summarize what we have achieved so far and predict what the future directions can be; chapter 8 is the appendix to all the chapters. [paper id: 2002.12312v1]

We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a natural language generation task testing for the ability to produce coherent text fitting a given context. [paper id: 1910.03806v1]

Neural Architecture Search (NAS) methods, which automatically learn entire neural model or individual neural cell architectures, have recently achieved competitive or state-of-the-art (SOTA) performance on variety of natural language processing and computer vision tasks, including language modeling, natural language inference, and image classification. [paper id: 2010.04249v1]

Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. [paper id: 2104.08335v1]

Summary of all titles with Longformer (Listing 5): ‘The Natural Language Processing (NLT) is a new tool that can teach people about the world. The tool is based on the data collected by CNN and RNN. A survey of the Usages of Deep Learning was carried out by the 2015 MSCOCO Image Search. It was created by a survey of people in the UK and the US. An image is worth 16x16 words, and a survey reveals how many people are interested in the language.’

3.5 RQ1, RQ2, RQ3: Relations between NLP Datasets, Languages, and Topics of Research

In addition to the separate results for RQ1, RQ2 and RQ3, there are situations when the important information is the coincidence of all three aspects: NLP datasets, languages, and research topics. The triplet dataset-language-problem is usually fixed in two positions. For example, research may be focused on machine translation (problem) into English (language), but missing a corpus (dataset); or a group of Chinese researchers (language) may have access to a rich Twitter API (dataset), but be considering which type of analysis (problem) is most prominent. This sparks the question of which datasets are used, with which languages, and for what problems. The presented results of correlations between these 3 aspects are divided into two groups, for the 2 most popular languages, English and Chinese; they are shown in Figure 11. The remaining results, for selected languages among the most popular ones, can be found in Figures 12 and 13.

Datasets and NLP problems for languages English and Chinese.

Datasets and NLP problems for chosen languages.

For English and Chinese (as the subjects of NLP research), the distribution of problems is very similar. The top problems are: machine translation, question answering, sentiment analysis and summarization. The most popular dataset used for all of these problems is Wikipedia. Additionally, for sentiment analysis, there is a significant number of contributions that also use Twitter. All of these observations are consistent with previous results (reported in Sections 3.1, 3.2 and 3.6).

Before going into languages other than English and Chinese, it is crucial to recall that this analysis focused only on articles written in English. Hence, the reported results may be biased in the case of research devoted to other language(s). Nevertheless, there exists a large body of work, written in English, about NLP applied to non-English languages. For instance, among all papers analyzed for this contribution, 41% were devoted to NLP in the context of neither English (non-English papers are 46% of the dataset) nor Chinese (non-Chinese papers are 80% of the dataset).

The most important observation is that the distribution of problems for languages other than English and Chinese is, overall, similar (machine translation, question answering, sentiment analysis and summarization are the most popular ones). However, there are also some distinguishable differences:

For German and French, summarization, language modelling and natural language inference, and named entity recognition are the key research areas.

In Arabic, Italian, Japanese, Polish, Estonian, Swedish and Finnish, there is a visible trend of interest in named entity recognition.

Dependency parsing is more pronounced in research on languages such as German, French, Czech, Japanese, Spanish, Slovene, Swahili and Russian.

In Basque, Ukrainian and Bulgarian, the domain does not have a particularly homogeneous subdomain distribution. The problems of interest are: co-reference resolution, dependency parsing, dialogue-focused research, language modeling, machine translation, multitask learning, named entity recognition, natural language inference, part-of-speech tagging, and question answering.

In Bengali, a special area of interest is part-of-speech tagging.

Research focused on Catalan has a particular interest in dialogue-related texts.

Research regarding Indonesian has a very high share of sentiment analysis, even higher than the otherwise most popular topic of machine translation.

Studies on the Norwegian language are strongly focused on sentiment analysis, which surpasses machine translation, the most common domain for most of the languages.

Research focusing on Russian puts special effort into analyzing dialogues and dependency parsing.

There are only minimal differences between the datasets used for English and Chinese, and those used for other languages. The key ones are:

Facebook is present as one of the main sources in many languages, being a particularly popular data source for Bengali and Spanish.

Twitter is a key data source in research on Arabic, Dutch, French, German, Hindi, Italian, Korean, Spanish, and Tamil.

WordNet is very often used in research involving Moldovan and Romanian.

Tibetan language research nearly never uses Twitter as the dataset.

3.6 Findings Concerning RQ4: Most Popular Specific Tasks and Problems

At the heart of the research is yet another key aspect: the specific problem being tackled, or the task being solved. This may seem similar to the domain, or to the general direction of the research. However, some general problems contain specific ones (e.g., machine translation and English-Chinese machine translation, or named entity recognition and named entity linking). On the other hand, some problems stand in more complicated relations; e.g., machine translation in NLP can be solved using neural networks, yet neural networks are also an independent domain of their own, which is in turn a superdomain (or a subdomain) of, for instance, image recognition. These complicated relations point to the need for a standardized NLP taxonomy, which, however, is also out of scope of this contribution.

Let us come back to the methods used for this analysis. To extract the most popular specific tasks and problems, the methods described above were used: NER, keyphrase search, metadata mining, text summarization, and network visualization. Before presenting specific results, an important aspect of keyphrase search needs to be mentioned. An unsupervised search for specific research topics could not be performed reasonably: all unsupervised keyphrase search approaches that were tried (in an exploratory fashion) produced thousands of potential results. Therefore, supervised keyphrase search was applied, with the NLP problems determined based on an exhaustive (multilingual) list aggregating the most popular NLP tasks ㉗ .

The list was extracted from the website and pruned of any additional markdown ㉘ to obtain a clean text format. Next, all keywords and keyphrases from the text of each paper were compared against the NLP task list. Finally, each paper was assigned the list of problems found in its text, as in the sketch below. Figure 14 shows the popularity (by count) of problems addressed in the NLP literature.
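For illustration, the matching step can be sketched as follows; the task list and the paper text are placeholders (the actual study used the full NLP-progress list), so this is a minimal sketch rather than the original code:

    # Placeholder task list standing in for the pruned NLP-progress list.
    nlp_tasks = ["machine translation", "question answering",
                 "sentiment analysis", "named entity recognition"]

    def find_problems(paper_text, tasks):
        """Return every task from the supervised list mentioned in the paper."""
        text = paper_text.lower()
        return [task for task in tasks if task in text]

    print(find_problems("We study neural machine translation of ...", nlp_tasks))
    # -> ['machine translation']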

Again, there is a dominating problem: machine translation. This is very intuitive, given recent studies [ 80 , 81 , 82 , 83 , 84 ] showing that the lack of high-fidelity machine translation remains a key barrier to world-wide communication. The problem is also very persistent, as it was already indicated in older research (e.g., in a text from 1968 [ 85 ]). Here, it is important to recall that this contribution is likely biased towards translation involving English, because it analyzed only English-written literature.

The problems completing the top three are question answering [ 86 ] and sentiment analysis [ 87 ]. In both of these domains, there are already state-of-the-art models ready to be used ㉙ . Interestingly, for both question answering and sentiment analysis, most of the models are based either on BERT or its variant DistilBERT [ 88 ].

Histogram of problems tackled in NLP literature.

3.7 RQ5: Seeking Outliers in the NLP Domain

Some scientific research areas are homogeneous, with all publications revolving around a similar topic (or group of topics). Others can be very diverse, with individual papers touching very different subfields. Finally, there are also domains where, from a more or less homogeneous set, a separate, distinguishable subset can be singled out. To verify the structure of the field of NLP, two methods have been used. One is the previously introduced metadata mining. The second is text embedding and clustering. Let us briefly introduce the second one.

3.7.1 Text Embeddings

Among the most ubiquitous methods in text processing are word, sentence, and document embeddings. Text embeddings, which “convert texts to numbers”, have been used to determine key differences and similarities between the analyzed texts.

Embeddings can be divided into contextualized and context-less [ 89 ]. Scientific papers often use words that strongly depend on the context. The prime example is the word “BERT” [ 48 ], which on the one hand is a character from a TV show, but in the NLP world is the name of one of the state-of-the-art embedding models. In this context, envision applying BERT, the NLP method, to the analysis of dialogues in children's TV, where one of the dialogues includes Bert, the character. A similar situation concerns words like network (neural network, graph network, social network, or computer network), “spark” [ 90 ] (either a small fiery particle or the name of a popular Big Data library), lemma (either a proven proposition in logic or a morphological form of a word), etc. Hence, in this study, using contextualized text embeddings would be more appropriate; accordingly, the very popular static text embeddings GloVe [ 91 ] and Word2Vec [ 92 , 93 ] have not been used.

There are many libraries available for contextualized text embedding, e.g., transformers [ 33 ], flair [ 94 ], and gensim [ 95 ], and models such as BERT [ 48 ] (and its variants like RoBERTa [ 96 ] and DistilBERT [ 88 ]), GPT-2 [ 97 ], T5 [ 98 ], ELMo [ 99 ], and others. However, most of them require specific high-end hardware (i.e., GPU acceleration [ 100 ]) to operate reasonably fast. Here, the decision was to proceed with FastText [ 101 ]. FastText is designed to produce results in a time-efficient way that can be reproduced on standard hardware. Moreover, it is designed for “text representations and text classifiers” ㉚ , which is exactly what is needed in this work.

3.7.2 Embedding and Clustering

It is important to highlight that since FastText, like most embedding models, was trained on rather noisy data [ 101 ], the input text of the articles was preprocessed only with Stage 1 cleaning (see Section 2.2). Next, a grid search [ 102 ] was performed to tune the hyperparameters. While, as noted earlier, extensive hyperparameter tuning has not been applied elsewhere in this work, the use of grid search reported here illustrates that ready-to-use libraries exist for when hyperparameter tuning is required. Overall, the best embeddings were produced by a model with the following hyperparameters ㉛ (a training sketch follows the list):

dimension: 20

minimum subword size: 3

maximum subword size: 6

number of epochs: 5

learning rate: 0.00005

Finally, the FastText model was further trained in an unsupervised mode (which is standard in the majority of cases for general language modeling) on the texts of the papers, to better fit the representation.
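A minimal training sketch with the hyperparameters listed above, assuming the Stage-1-cleaned paper texts are stored one document per line in a placeholder file papers.txt, could look as follows:

    import fasttext

    # 'papers.txt' is a placeholder for the cleaned paper texts, one per line.
    model = fasttext.train_unsupervised(
        "papers.txt",
        dim=20,      # embedding dimension
        minn=3,      # minimum subword (character n-gram) size
        maxn=6,      # maximum subword size
        epoch=5,
        lr=0.00005,
    )
    vector = model.get_sentence_vector("neural machine translation")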

After the embeddings were calculated, their vector representations were clustered. Since there was no response variable, an unsupervised method was required. Again (as in Section 3.7.1), the main goals were simplicity and time efficiency.

Out of all tested algorithms (K-means [ 103 ], OPTICS [ 104 , 105 ], DBSCAN [ 106 , 107 ], HDBSCAN [ 108 ], and BIRCH [ 109 ]), the best time efficiency, combined with relative simplicity of use, was achieved with K-means (see also [ 110 , 111 ]). Moreover, in prior research, K-means clustering showed the best results when applied to FastText embeddings (see [ 112 ]).

The clustering was evaluated using three metrics: the Silhouette score [ 113 ], the Davies-Bouldin score [ 114 ], and the Caliński-Harabasz score [ 115 ]. These metrics were chosen because they allow evaluating clusterings without ground-truth labels. To visualize the results on a 2D plane, the multidimensional FastText vectors were projected with the t-distributed stochastic neighbor embedding (t-SNE) method [ 116 , 117 ]. t-SNE was suggested by the text embedding visualizations reported in earlier work [ 118 , 119 ]. A combined sketch of the clustering, evaluation, and projection steps follows.
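One plausible implementation of these steps, assuming a placeholder embedding matrix and using scikit-learn (the paper does not name its exact tooling), is sketched below:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE
    from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                                 calinski_harabasz_score)

    # Placeholder for the (n_papers x 20) matrix of FastText document vectors.
    embeddings = np.random.rand(500, 20)

    for k in range(2, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        print(k,
              silhouette_score(embeddings, labels),         # higher is better
              davies_bouldin_score(embeddings, labels),     # lower is better
              calinski_harabasz_score(embeddings, labels))  # higher is better

    # 2-D projection for visualization only (as in Figure 15).
    points_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)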

3.7.3 RQ5: Outliers Found in the NLP Research

Visualizations of embeddings are shown in Figure 15 .

Note that Figure 15 is mainly aesthetic, as actual relations are rarely visible when dimensionality reduction is applied. The number of clusters was evaluated according to the three clustering metrics (Silhouette score [ 113 ], Davies-Bouldin score [ 114 ], Caliński-Harabasz score [ 115 ]), and the best clustering score was achieved for 2 clusters. Hence, further analysis considers the separation of the embeddings into 2 clusters. To further explore why particular embeddings appear in the same group, various tests were performed. First, wordclouds of the texts (titles and paper texts) in the clusters were built. The texts for the wordclouds were processed with Stage 2 cleaning. Title wordclouds are shown in Figure 2 , while text wordclouds are shown in Figure 3 .

“The blade of NLP”. A visualization of all paper text embeddings grouped in clusters (dimensionality reduced with t-SNE).

Further, citation counts (Figures 16 and 17) and author distributions were compared between the texts in both clusters.

Based on the content of Figures 2 , 3 , 16 , 17 , 18 , 19 , 20 , and 21 , and on the author-per-cluster distribution analysis, the following conclusions have been drawn:

Histogram of citation counts in cluster 1 (bigger cluster), logarithmic scale.

Histogram of citation counts in cluster 0 (smaller cluster), logarithmic scale.

Last, the differences in topics from Semantic Scholar (Figures 18 and 19) and categories from arXiv (Figures 20 and 21) have been checked.

Histogram of topic counts in cluster 1 (bigger cluster).

Histogram of topic counts in cluster 0 (smaller cluster).

Histogram of category counts in cluster 1 (bigger cluster).

Histogram of category counts in cluster 0 (smaller cluster).

There is one specific outlier: the cluster of work related to text embeddings.

The content of the texts shows a strong topical shift towards deep neural networks.

The categories and topics of the clusters are not particularly far from each other, as their distributions are similar. However, there is a higher representation of the computer vision and information retrieval areas in the smaller cluster (cluster 0).

No distinguishable group of authors is responsible for the texts in either cluster.

The distribution of citation counts is similar in both clusters.

Furthermore, manual verification showed that deep neural networks are actually the biggest subdomain of NLP, and that this subdomain touches upon issues which do not appear in other works. These issues are strictly related to neural networks (e.g., attention mechanisms, network architectures, transfer learning, etc.). They are universal, and their applications play an important role not only in NLP but also in other domains (image processing [ 120 ], signal processing [ 121 ], anomaly detection [ 122 ], clinical medicine [ 123 ], and many others [ 124 ]).

3.7.4 “Most Original Papers”

In addition to unsupervised clustering, another approach to outlier detection was applied. Specifically, the metadata representing citation/reference information was further analyzed. At one end of the “citation spectrum” are the most influential works (as shown in Section 3.3.3). At the other end are papers that either are new and have not been cited yet, or that do not have high influence.

However, the truly “original” works are papers that have many citations (top 2% of the distribution) but very few references (bottom 2%); a selection sketch is given at the end of this subsection. Based on the performed analysis, such papers are:

“Natural Language Processing (almost) from Scratch” [ 125 ]—a neural network approach to learning internal representations of text, based on unlabeled training data. A similar idea was used in future publications, especially, the most cited paper about BERT model [ 48 ].

“Experimental Support for a Categorical Compositional Distributional Model of Meaning” [ 126 ]—a paper about “modelling compositional meaning for sentences using empirical distributional methods”.

“Gaussian error linear units (gelus)” [ 127 ]—paper introducing GELU, a new activation function in neural networks, which was extensively tested in future research [ 128 ].

Each of these papers introduced novel, very innovative ideas that inspired further research directions. They can thus be treated as belonging to a unique (separate) subset of contributions.
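The selection criterion itself is straightforward; a hedged sketch over hypothetical metadata (the dataframe and its values are placeholders) might look as follows:

    import pandas as pd

    # Hypothetical metadata: one row per paper.
    meta = pd.DataFrame({
        "title":      ["A", "B", "C"],
        "citations":  [5000, 12, 40],
        "references": [9, 45, 30],
    })

    # Top 2% by citation count, bottom 2% by reference count.
    most_original = meta[
        (meta["citations"] >= meta["citations"].quantile(0.98))
        & (meta["references"] <= meta["references"].quantile(0.02))
    ]
    print(most_original["title"].tolist())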

3.8 RQ6: Text Comprehension

Finally, an additional aspect of the texts belonging to the dataset was measured: text comprehensibility. This is a very complicated problem, which is still being explored. Taking into account that one of the intended audiences is researchers interested in starting work in NLP, text difficulty was evaluated using existing text complexity metrics. An important caveat is that these metrics are known to have problems, such as not accounting for complicated mathematical formulas and skipping charts, pictures, and other visuals. Keeping this in mind, let us proceed.

3.8.1 Text Complexity

The most common comprehensibility measures map text to a school grade in the American education system [ 129 ]. In this way, it is established what level of reader should be expected to understand the text. The measures used were:

Flesch Reading Ease [ 130 ]

Flesch Kincaid Grade [ 130 ]

Gunning Fog [ 131 ]

Smog Index [ 132 ]

Automated Readability Index [ 130 ]

Coleman Liau Index [ 133 ]

Linsear Write Formula [ 134 ]

All measures return results on the same scale (school grade). Furthermore, they were all consistent in terms of paper scores. To provide the least biased results, the numerical values (Section 3.8.2) were averaged into a single, straightforward measure of text complexity. It should be noted that this was also done because delving into a discussion of the ultimate validity of individual comprehensibility measures, and the pros and cons of each of them, is out of scope of the current contribution. Rather, the combined measure was calculated to obtain a general idea of the “readability” of the literature in question.

The results can be averaged across metrics because they all refer to the same scale (school grade); the sketch below illustrates the computation.
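One plausible implementation uses the textstat library (the paper does not name its tooling, so this is an assumption). Note that Flesch Reading Ease natively reports a 0-100 ease value rather than a grade and would need to be mapped to a grade equivalent before being included in the average; it is therefore omitted from the sketch:

    import textstat

    text = "Deep learning has revolutionized natural language processing."

    # Grade-scale metrics from the list above.
    grade_metrics = [
        textstat.flesch_kincaid_grade,
        textstat.gunning_fog,
        textstat.smog_index,
        textstat.automated_readability_index,
        textstat.coleman_liau_index,
        textstat.linsear_write_formula,
    ]
    average_grade = sum(metric(text) for metric in grade_metrics) / len(grade_metrics)
    print(round(average_grade, 1))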

3.8.2 RQ6: Establishing Complexity Level of NLP Literature

The text complexity results (RQ6) are rather intuitive.

As shown in Figure 22 , the averaged comprehensibility score suggests that the majority of papers in the NLP domain can be understood by a person after the “15th grade”. This roughly matches a person who finished the “1st stage” of college education (engineering studies, a bachelor's degree, and similar). Obviously, this result shows that applying such metrics to scientific texts has limited validity, as they are based mostly on syntactic features of the text, while the semantics makes some papers difficult to follow even for specialists. This particularly applies to texts containing mathematical equations, which were removed during text preprocessing.

Average reading grade (mean of all metrics; bottom 99th percentile): histogram showing what grade the reader should have completed to understand the papers.

3.9 Summary of Key Results

Let us now summarize the key findings in the form of a question and answer for each of the RQs postulated in Section 1.

RQ1: Which datasets are used most commonly in NLP research?

The datasets used most commonly for NLP research are: Wikipedia, Twitter, Facebook, WordNet, arXiv, Academic, SST (The Stanford Sentiment Treebank), SQuAD (The Stanford Question Answering Dataset), NLI and SNLI (Stanford Natural Language Inference Corpus), COCO (Common Objects in Context), and Reddit.

RQ2: Which languages, other than English, appear as a topic of NLP research?

Languages analyzed most commonly in NLP research, apart from English and Chinese, are: German, French and Spanish.

RQ3: Which research topics and fields are studied most often in the NLP literature?

The most popular fields studied in the NLP literature are: Natural Language Processing/Language Computing, artificial intelligence, machine learning, neural networks and deep learning, and text embedding.

RQ4: Which specific tasks and problems appear most often in the literature?

The particular tasks and problems that appear in the literature are: text embedding with BERT and transformers, machine translation between English and other languages (especially English-Chinese), sentiment analysis (most popular with Twitter and Wikipedia datasets), question answering models (with Wikipedia and SQuAD datasets), named entity recognition, and text summarization.

RQ5: Are there distinguishable outliers in the NLP literature?

According to the text embedding analysis, there is not enough evidence for strongly distinguishable clusters. Hence, there are no outstanding subgroups in the NLP literature.

RQ6: How comprehensible is the NLP literature?

According to averaged standard comprehensibility measures, scientific texts related to NLP can be digested by a 15th grader, which maps to the third year of higher education (e.g., college or bachelor's degree studies).

This analysis used Natural Language Processing methods to analyze scientific literature related to NLP. The goal was to answer six research questions (RQ1-RQ6). A total of 4712 scientific papers in the field of NLP from arXiv were analyzed. The work used, and at the same time illustrated, the following NLP methods: text extraction, text cleaning, text preprocessing, keyword and keyphrase search, text embeddings, abstractive and extractive text summarization, and text complexity estimation, as well as other methods such as clustering, metadata analysis, citation/reference analysis, and network visualization. The analysis focused only on Natural Language Processing and its subdomains, topics, etc. Since the procedures for obtaining the results reported here were fully automated, the same or a similar analysis could easily be performed for literature in other languages and even other fields. Hence, all the tools used for the analysis are available in a designated repository ㉜ for future applications.

https://jupyter.org

https://pypi.org

http://labs.jstor.org/api/docs

Specifically, the query had the form http://export.arxiv.org/api/query?search_query=all:%22natural%20language%20processing%22&start=0&max_results=10000 . Since such a query may take a long time to load, one can reduce the value of the max_results parameter to a smaller number, e.g., 5.

https://pdfminersix.readthedocs.io/en

https://github.com/euske/pdfminer

https://github.com/HazyResearch/pdftotree

https://www.crummy.com/software/BeautifulSoup

https://dumps.wikimedia.org

https://commoncrawl.org

https://github.com/explosion/spacy-models/releases/tag/en_core_web_lg-3.2.0

https://huggingface.co/transformers/main_classes/pipelines.html#tokenclassificationpipeline

https://spacy.io

https://metatext.io/datasets

https://github.com/niderhoff/nlp-datasets

https://github.com/karthikncode/nlp-datasets

https://www.semanticscholar.org

https://arxiv.org/category_taxonomy

https://www.semanticscholar.org/faq#extract-key-phrases

https://www.semanticscholar.org/topic/Baseline-(configuration-management)/3403

https://www.aclweb.org

https://wordnet.princeton.edu

https://image-net.org

https://semeval.github.io

https://www.nltk.org/howto/wordnet.html

https://pypi.org/project/pysummarization

https://github.com/sebastianruder/NLP-progress

https://www.markdownguide.org

https://huggingface.co/models?language=en&pipeline_tag=question-answering

https://fasttext.cc

https://fasttext.cc/docs/en/options.html

https://anonymous.4open.science/r/nlp-review-F81D


EDITORIAL article

Editorial: Perspectives for natural language processing between AI, linguistics and cognitive science

Alessandro Lenci 1 and Sebastian Padó 2

  • 1 Computational Linguistics Laboratory, University of Pisa, Pisa, Italy
  • 2 Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany

Editorial on the Research Topic Perspectives for natural language processing between AI, linguistics and cognitive science

Natural Language Processing (NLP) today—like most of Artificial Intelligence (AI)—is much more of an “engineering” discipline than it originally was, when it sought to develop a general theory of human language understanding that not only translates into language technology, but that is also linguistically meaningful and cognitively plausible.

At first glance, this trend seems to be connected to the rapid development in the last 10 years that was driven to a large extent by the adoption of deep learning techniques. However, it can be argued that the move toward deep learning has the potential of bringing NLP back to its roots after all. Some recent activities and findings in this direction include: techniques like multi-task learning have been used to integrate cognitive data as supervision in NLP tasks (Barrett et al., 2016); pre-training/fine-tuning regimens are potentially interpretable in terms of cognitive mechanisms like general competencies applied to specific tasks (Flesch et al., 2018); the ability of modern models for “few-shot” or even “zero-shot” performance on novel tasks mirrors human performance (Srivastava et al., 2018); and evidence of unsupervised structure learning in current neural network architectures mirrors classical linguistic structures (Hewitt and Manning, 2019; Tenney et al., 2019).

In terms of developing systems endowed with natural language capabilities, the last generation of neural network architectures has allowed AI and NLP to make unprecedented progress. Such systems (e.g., the GPT family) are typically trained with huge computational infrastructures on large amounts of textual data from which they acquire knowledge thanks to their extraordinary ability to record and generalize the statistical patterns found in data. However, the debate about the human-like semantic abilities that such “juggernaut models” really acquire is still wide open. In fact, despite the figures typically reported to show the success of AI on various benchmarks, other research argues that their semantic competence is still very brittle ( Lake and Baroni, 2018 ; Bender and Koller, 2020 ; Ravichander et al., 2020 ). Thus, an important limitation of current AI research is the lack of attention to the mechanisms behind human language understanding. The latter does not only consist of a brute-force, data-intensive processing of statistical regularities but it is also governed by complex inferential mechanisms that integrate linguistic information and contextual knowledge coming from different sources and potentially different modalities.

The current Research Topic was conceived on the assumption that the possibility for new breakthroughs in the study of human and machine intelligence calls for a new alliance between NLP, AI, and linguistic and cognitive research. The current computational paradigms can offer new ways to explore human language learning and processing, while linguistic and cognitive research can highlight those aspects of human intelligence that systems need to model or incorporate within their architectures.

We are very happy to present seven articles that embody this promise in different ways.

Two papers focus on the use of large neural language models to model aspects of natural language syntax, arguably a cornerstone of human linguistic competence, and therefore a target of much research in recent years. Oh et al.'s Comparison of structural parsers and neural language models as surprisal estimators contrasts the current standard architecture (neural parsers trained in a purely data-driven fashion) against a parser incorporating linguistic generalizations, and finds a better fit with various reading time measures for the latter. Kulmizev and Nivre's Schrödinger's tree–on syntax and neural language models makes a methodological contribution, sounding a note of caution about the current state of affairs. They point out the large impact that choices regarding experimental design and evaluation measures have on the study of syntactic generalizations in neural parsers.

Three more papers are concerned primarily with natural language semantics, a long-standing multi-dimensional problem that has so far resisted comprehensive modeling. The papers bring different methods to bear on this topic: Brown et al.'s Semantic representations for NLP using VerbNet and the generative lexicon continues a long tradition of careful linguistic modeling work, demonstrating how the combination of semantic theories and carefully curated lexical resources can provide computational predictions of event semantics with broad coverage. In contrast, Schulte im Walde and Frassinelli's Distributional measures of semantic abstraction proposes a decomposition of the concept of semantic abstraction into the two dimensions of abstractness/concreteness and specificity/generality and demonstrates that distributional corpus evidence can model both sub-aspects convincingly. The third paper, Stevenson and Merlo's Beyond the benchmarks: toward human-like lexical representations , is again located at the methodological level, offering a critical review of current computational investigations into lexical representation and perspectives looking forward. In particular, they stress the need for models able to address the rich structure of lexical meanings, which is still only partially tackled by mainstream computational semantic approaches, including those based on word embeddings.

The two final papers take seriously the idea of multimodality, extending their reach beyond textual data, as a strategy to address long-standing challenges in natural language processing. Bruera and Poesio's Exploring the representations of individual entities in the brain combining EEG and distributional semantics compares corpus-based and EEG-based embeddings for entities, paving the way toward a better understanding of the relationship between online and offline representations. Finally, Krishnaswamy and Pustejovsky's Affordance embeddings for situated language understanding argues that grounding of language in concrete situations, whether real or simulated, is a crucial step toward generalized learning, and demonstrates this claim with a model capable of learning properties of novel objects.

Taken together, we believe that these papers offer important contributions to the state of the art and open promising directions for future research. Despite their different approaches and perspectives, all papers support the same conclusion: It is time for a new alliance between AI, linguistics, and cognitive science, because only from their synergistic efforts and mutual feeding can we hope to achieve significant breakthroughs in the computational modeling of human intelligence and of natural language in particular. In closing, we would like to express our gratitude to the reviewers for their timely and insightful comments, and to the authors, who have engaged with them in a constructive scientific discussion.

Author contributions

AL and SP wrote the editorial together. Both authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Barrett, M., Bingel, J., Keller, F., and Søgaard, A. (2016). “Weakly supervised part-of-speech tagging using eye-tracking data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 2: Short Papers (Berlin: Association for Computational Linguistics), 579–584.


Bender, E. M., and Koller, A. (2020). “Climbing towards NLU: on meaning, form, and understanding in the age of data,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Seattle, WA: Association for Computational Linguistics), 5185–5198.

Flesch, T., Balaguer, J., Dekker, R., Nili, H., and Summerfield, C. (2018). Comparing continual task learning in minds and machines. Proc. Natl. Acad. Sci. U.S.A. 115, E10313–E10322. doi: 10.1073/pnas.1800755115


Hewitt, J., and Manning, C. D. (2019). “A structural probe for finding syntax in word representations,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1: Long and Short Papers (Minneapolis, MN: Association for Computational Linguistics), 4129–4138.

Lake, B., and Baroni, M. (2018). “Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,” in Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research (Stockholm), eds J. Dy and A. Krause, 2873–2882.

Ravichander, A., Hovy, E., Suleman, K., Trischler, A., and Cheung, J. C. K. (2020). “On the systematicity of probing contextualized word representations: The case of hypernymy in BERT,” in Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics (Barcelona: Association for Computational Linguistics), 88–102.

Srivastava, S., Labutov, I., and Mitchell, T. (2018). “Zero-shot learning of classifiers from natural language quantification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers (Melbourne, VIC: Association for Computational Linguistics), 306–316.

Tenney, I., Das, D., and Pavlick, E. (2019). “BERT rediscovers the classical NLP pipeline,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence: Association for Computational Linguistics), 4593–4601.

Keywords: artificial intelligence, natural language processing, linguistics, interdisciplinary, cognitive science

Citation: Lenci A and Padó S (2022) Editorial: Perspectives for natural language processing between AI, linguistics and cognitive science. Front. Artif. Intell. 5:1059998. doi: 10.3389/frai.2022.1059998

Received: 02 October 2022; Accepted: 13 October 2022; Published: 03 November 2022.

Edited and reviewed by: Shlomo Engelson Argamon , Illinois Institute of Technology, United States

Copyright © 2022 Lenci and Padó. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Alessandro Lenci, alessandro.lenci@unipi.it


  • Open access
  • Published: 10 February 2024

The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks

  • Kent F. Hubert 1 ,
  • Kim N. Awa 1 &
  • Darya L. Zabelina 1

Scientific Reports volume  14 , Article number:  3440 ( 2024 ) Cite this article

19k Accesses

10 Citations

273 Altmetric

Metrics details

  • Human behaviour

The emergence of publicly accessible artificial intelligence (AI) large language models such as ChatGPT has given rise to global conversations on the implications of AI capabilities. Emergent research on AI has challenged the assumption that creative potential is a uniquely human trait; thus, there seems to be a disconnect between human perception and what AI is objectively capable of creating. Here, we aimed to assess the creative potential of humans in comparison to AI. In the present study, human participants (N = 151) and GPT-4 provided responses for the Alternative Uses Task, Consequences Task, and Divergent Associations Task. We found that AI was robustly more creative along each divergent thinking measurement in comparison to the human counterparts. Specifically, when controlling for fluency of responses, AI was more original and elaborate. The present findings suggest that the current state of AI language models demonstrates higher creative potential than human respondents.

Similar content being viewed by others


People devalue generative AI’s competence but not its advice in addressing societal and personal challenges


Best humans still outperform artificial intelligence in a creative divergent thinking task


Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT

Introduction.

The release of ChatGPT, a natural language processing (NLP) model developed by OpenAI 1 , to the general public has garnered global conversation on the utility of artificial intelligence (AI). OpenAI's Generative Pretrained Transformer (GPT) is a type of machine learning model that specializes in pattern recognition and prediction and has been further trained using Reinforcement Learning from Human Feedback (RLHF) so that ChatGPT responses would be indistinguishable from human responses. Recently, OpenAI 1 has advertised the new model (GPT-4) as “more creative”, particularly “on creative and technical writing tasks”, in comparison to previous versions, although there are arguably semantic limitations such as nonsensical answers or the possibility of incorrect information generation 2 . Given the accessibility of AI models in the current climate, research across a variety of domains has started to emerge, contributing to our growing understanding of the possibilities and potential limitations of AI.

Creativity as a phenomenological construct is not immune to the effects of AI. For example, researchers have begun to assess AI models to determine appropriate design solutions 3 and logical reasoning 4 . These assessments focus on convergent thinking, i.e., determining one optimal solution to a pre-defined problem 5 . Traditionally, convergent thinking assumes an optimal single solution path and can be assessed through traditional intelligence measures or synthesis tasks. Although convergent thinking emphasizes single optimal solutions, this does not negate the potential for original or non-obvious solutions. However, convergent thinking tasks by design typically do not allow for flexible or out-of-the-box thinking. In contrast, divergent thinking involves generating multiple creative solutions to a problem, which allows for the flexibility to determine multiple creative solutions 6 . Creativity researchers commonly focus on divergent creativity (in comparison to convergent creativity), given the associative mechanisms that underlie people's ability to generate creative solutions (i.e., creative potential). Specifically, divergent thinking is considered an indicator of a person's creative potential, but it does not guarantee creative achievement 7 . Instead, creative potential can be indicative of future capability, rather than an immediate trait that determines whether someone is creative. Accordingly, a person's creative potential has been captured via divergent thinking tasks such as the Alternative Uses Task [AUT 6 , 7 ] or the Consequences Task [CT 8 , 9 ]. Divergent thinking tasks can be evaluated along three dimensions: fluency (number of responses), originality (response novelty), and elaboration (length/detail of response). Responses in each category are given scores (i.e., according to each task) and used to assess individual differences in divergent creativity, or in other words, a person's creative potential.

Given the emergence of OpenAI’s GPT-4 as a large language model, research has begun to empirically assess the creative potential of artificial intelligence language models through divergent thinking tasks. On one hand, some researchers argue that the human cognitive mechanisms present during creative tasks are not present in AI, and thus the creative potential of artificial intelligence can only reflect artificial creativity 10 . On the other hand, computational creativity suggests parallel networks that reflect the mechanisms of how humans go through iterative, deliberative, and generative creative processes which aid in the ability to determine creative solutions 11 . Although these aspects have been shown to aid in creative solutions, humans can experience idea fixedness, which can act as a roadblock to other creative solutions. Machines, however, will not experience this phenomenon in a metacognitive way due to computationally trained models that streamline a machine’s direct responses to a prompt 12 , 13 , 14 . Instead, a machine’s fixedness may perhaps reflect the training data of the model which could be argued is a computational consideration, rather than a creative one.

Furthermore, computational researchers have increasingly debated the creative capabilities of artificial intelligence models 15 , asking questions such as: how are machines capable of determining what is creative? At present, AI's inability to explicitly determine why or whether something is creative is compensated through human assistance. For example, human intervention is necessary for inputting appropriate and relevant data to train the model and shape outputs to become more linguistically natural 16 , 17 . This computational limitation suggests that AI is not capable of divergent creativity due to the lack of metacognitive processes (i.e., evaluation, task motivation), because AI could not generate creative ideas or iterate on existing ideas without the intervention (i.e., input) of a human user 10 . Similarly, emotions have been seen as an integral part of creativity, such that emotions help dictate states of flow or mind-wandering that aid in creative processes 18 . However, AI may not necessarily need to rely on metacognitive or affective processes to generate novel ideas 19 due to its computational framework. Thus, inner processes that contribute to human creativity may be a philosophical argument within artificial creativity models 20 .

As briefly reviewed, views on the creative capabilities of artificial intelligence have, thus far, varied scientifically and philosophically [e.g., 10 , 20 ]. Researchers posit humanistic and computational considerations of the creative potential of AI; meanwhile, the accessibility of tools to artificially generate products or ideas has given researchers the opportunity to evaluate public perception. For instance, people think more highly of generated artworks if they are told the artworks were created by humans rather than by AI 21 , 22 . The expectation that AI-generated products or ideas are less creative or hold less aesthetic value than human-created artworks appears to depend on implicit anti-AI biases 22 , 23 , 24 , as AI output has been found to be indistinguishable from human-created products 25 , 26 , 27 . People's inability to distinguish between human- and AI-created products supports the feasibility of AI having creative potential.

Indeed, AI has been found to generate novel connections in music 28 , science 26 , medicine 29 , and visual art 30 , to name a few. In assessments of divergent thinking, humans outperformed AI on the Alternative Uses Task 31 , but it is noteworthy that the authors propose a possible rise in AI capabilities given future progress of large language models. In fact, recent studies have found that AI divergent creativity matched that of humans using a later version of GPT-4 32 , 33 . Researchers have continued to demonstrate that current LLMs frequently score within the top 1% of human responses on standard divergent thinking tasks such as the Alternative Uses Task 32 , 33 , 34 . Additional studies utilizing other divergent thinking tasks have also reported findings that paint a more nuanced picture. For example, when scores were compared between humans and GPT-4 on a Divergent Associations Task (DAT 35 ), the researcher found that GPT-4 was more creative than human counterparts 36 . Recent research on OpenAI's text-to-image platform DALL·E has reported similar findings 37 and suggests that OpenAI models could match or even outperform humans in combinational creativity tasks. Given the research on AI creativity thus far, OpenAI's advertorial claims that GPT-4 is “more creative” may hold more merit than anticipated.

Current research

Thus far, the novelty of OpenAI's ChatGPT has posed many questions that have yet to be examined. Although creativity has been considered a uniquely human trait 38 , the emergence of OpenAI's generative models suggests a possible shift in how people may approach tasks that require “out of the box” thinking. Thus, the current research aims to examine how divergent creativity (i.e., fluency, originality, elaboration) may differ between humans and AI on verbal divergent thinking tasks. To our knowledge, this is one of the first studies to comprehensively examine verbal responses across a battery of the most common divergent thinking tasks (i.e., Alternative Uses Task, Consequences Task, and Divergent Associations Task) with a novel methodology that matches the fluency of ideas between human subjects and ChatGPT. We anticipate that AI may demonstrate higher creative potential in comparison to humans, though given the recency of AI-centered creativity research, our primary research questions are exploratory in nature.

Participants

Human participation.

Human participants (N = 151) were recruited via Prolific online data collection platform in exchange for monetary compensation of $8.00. Participants were limited to having a reported approval rating above 97%, were proficient English speakers, and were born/resided in the USA. Average total response time for completing the survey was 34.66 min. A statistical sensitivity analysis indicated that we had sufficient power to detect small effects with the present sample size ( f 2  = 0.06, 1 −  β  = 0.80). The present study was performed in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board for Human Subjects Research at the University of Arkansas. All participants provided informed consent prior to the start of the study. All statistical analyses were conducted in R studio 39 . See Table 1 for participant demographics.

AI participation

Artificial participants were operationalized via ChatGPT's instancing feature. Each ChatGPT session was considered an independent interaction between the user and the GPT interface. Here, we prompted separate instances per creativity measure (as detailed below), which resulted in the artificial participation sessions. For example, we used a single session instance to feed each prompt and aggregated each prompt response into a data file. In total, we collected 151 instances, which represent AI's participation for a balanced sample. For two of the creativity measures (Alternative Uses Task and Consequences Task), which are the only timed tasks, fluency was matched 1:1 such that the number of responses for both groups is equal. Fluency scores of each human respondent were first calculated to match 1:1 for each GPT-4 instance for the Alternative Uses Task and Consequences Task (detailed below). Only valid responses were retained. For example, human participant #52 had a total fluency score of 6, so GPT-4 instance #52 was instructed to provide 6 responses, as in the sketch below.
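An illustrative sketch of this matching step (with hypothetical fluency counts drawn from the example above; this is not the authors' code):

    # Hypothetical mapping from human participant ID to valid-response count.
    human_fluency = {52: 6, 53: 4}

    for participant_id, n in human_fluency.items():
        prompt = f"List {n} ORIGINAL and CREATIVE uses for a fork."
        # Each prompt would be sent to a fresh ChatGPT session (one instance
        # per human participant), and the responses aggregated into a file.
        print(participant_id, prompt)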

Creativity measures

Alternative uses task.

The Alternate Uses Task (AUT 6 ) was used to test divergent thinking. In this task, participants were presented with a common object (‘fork’ and ‘rope’) and were asked to generate as many creative uses as possible for these objects. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). Participants were given 3 min to generate their responses for each item. Following prior research 40 , instructions for human respondents on the AUT were:

For this task, you'll be asked to come up with as many original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. You can enter as many ideas as you like. The task will take 3 minutes. You can type in as many ideas as you like until then, but creative quality is more important than quantity. It's better to have a few really good ideas than a lot of uncreative ones. List as many ORIGINAL and CREATIVE uses for a [item] .

Because the goal was to control for fluency, we excluded prompt parameters such as 'quantity' from the GPT-4 instructions. Similarly, GPT does not need timing parameters in comparison to humans because we denoted the specific number of responses required. See below for instructions used per GPT instance:

For this task, you'll be asked to come up with as original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. List [insert fluency number] ORIGINAL and CREATIVE uses for a [item].

Consequences task

The Consequences Task (CT 8 , 9 ) is part of the verbal section of the Torrance Test of Creative Thinking (TTCT) that provides prompts to hypothetical scenarios (i.e., what would happen if humans no longer needed to sleep?). Similar to the AUT, people respond to as many consequences to the prompt as they can within a given timeframe. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). General task instructions for human respondents were:

In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt. REMINDER: In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Do this as many times as you can in 3 min. The screen will automatically change when the time is completed. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt.

Participants were given two prompts shown independently: “Imagine humans no longer needed sleep,” and “Imagine humans walked with their hands.” The two CT prompts have been extensively used in research on divergent thinking 41 , 42 , 43 . Similar to the AUT, fluency and timing parameters were excluded from the GPT instructions on the CT:

In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Try and think of any and all consequences that might result from the statement. [Insert scenario]. What problems might this create? List [insert fluency number] CREATIVE consequences.

Divergent associations task

The Divergent Association Task (DAT 35 ) is a task of divergent and verbal semantic creative ability. This task asks participants to come up with 10 nouns that are as different from each other as possible. These nouns must not be proper nouns or any type of technical term. Pairwise comparisons of semantic distance between the 10 nouns are calculated using cosine distance. The average distance score over all pairwise comparisons is then multiplied by 100, which results in a final DAT score ( https://osf.io/bm5fd/ ). High scores indicate longer distances (i.e., words are not similar). Task instructions for both human participants and GPT-4 were:

Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words. The rules: Only single words in English. Only nouns (e.g., things, objects, concepts). No proper nouns (e.g., no specific people or places). No specialized vocabulary (e.g., no technical terms). Think of the words on your own (e.g., do not just look at objects in your surroundings).

There were no time constraints for this task. The average human response time was 126.19 s ( SD  = 90.62) and the average DAT score was 76.95 ( SD  = 6.13). We scored all appropriate words that participants gave. Participants with fewer than 7 responses were excluded from data analysis (n = 2). Instructions for GPT-4 were identical to the human instructions. A scoring sketch follows.
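A minimal sketch of the DAT scoring described above; the random vectors are placeholders for the pretrained word embeddings used in the actual scoring:

    from itertools import combinations
    import numpy as np

    # Random placeholder vectors; real scoring uses pretrained embeddings.
    rng = np.random.default_rng(0)
    words = ["cat", "bridge", "justice", "melon", "storm"]
    embed = {w: rng.normal(size=300) for w in words}

    def dat_score(word_list):
        def cosine_distance(u, v):
            return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        distances = [cosine_distance(embed[a], embed[b])
                     for a, b in combinations(word_list, 2)]
        return 100 * np.mean(distances)  # mean pairwise distance, scaled by 100

    print(dat_score(words))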

Human participants’ responses were collected online via Qualtrics. The entire study took on average 34 min ( SD  = 13.64). The order of the creativity tasks was counterbalanced. The online study used two attention checks randomly presented throughout the study. Each attention check allowed one additional attempt. Participants who failed two attention checks were removed from all analyses (N = 2). After providing their responses to each task, participants answered demographics questions.

GPT-4 procedural responses were generated through human-assistance facilitated by the first author, who provided each prompt in the following order: AUT, CT, and DAT. We did not have to account for typical human-centered confounds such as feelings of fatigue 44 , 45 and order biases 44 as these states are not relevant confounds in AI, thus the order of tasks was not counterbalanced.

Research disclosure statement

All variables, measurements, and exclusions for this article’s target research question have been reported in the methods section.

Creativity scoring

Both human and GPT-4 responses were cleaned to remove any instances that were incomplete or inappropriate at two stages: First, human responses that did not follow instructions from the task or were not understandable as a use (AUT; 0.96% removed) or a consequence (CT; 4.83%) were removed. Only valid human responses were used in matching for GPT fluency; Second, inappropriate or incomplete GPT responses for the AUT (< 0.001% removed) and CT (< 0.001% removed) were removed. Despite matching for fluency, only valid responses in both groups were used in subsequent analyses.

Traditional scoring methods of divergent thinking tasks have required human ratings of products or ideas and are assumed to be normative tasks (i.e., consensus will eventually be met with more raters). Here, we used the Open Creativity Scoring tool [OCS 46 ] to automate scoring of semantic distance, objectively capturing the originality of ideas by assigning scores for the remoteness (uniqueness) of responses. Unlike human scoring, which involves multiple complicating factors (e.g., fatigue, biases, time, cost 47 ) that could result in potential confounds, automated scoring tools such as OCS circumvent these human-centered issues and have been found to correlate robustly with human ratings 46 .

The Open Creativity Scoring tool (OCS 46 ) was used to score both the AUT and CT tasks. Specifically, the semantic distance scoring tool 17 was used, which applies the GloVe 840B text-mining model 48 to assess the originality of responses by representing a prompt and response as vectors in semantic space and calculating the cosine of the angle between the vectors. The OCS tool also scores elaboration using the stoplist method 46 . The prompts for the AUT were “rope” and “fork” and the prompts for the CT were “humans no sleep” and “humans walked hands.” A minimal sketch of this scoring approach follows.
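For illustration, the prompt-to-response semantic distance can be sketched as below; placeholder vectors stand in for the GloVe 840B model, and this is not the OCS implementation itself:

    import numpy as np

    # Placeholder vectors standing in for GloVe 840B embeddings.
    rng = np.random.default_rng(1)
    embed = {w: rng.normal(size=300) for w in
             ["fork", "use", "as", "a", "hair", "comb"]}

    def originality(prompt, response):
        # Represent prompt and response as the mean of their word vectors,
        # then score originality as the cosine distance between the two.
        p = np.mean([embed[w] for w in prompt.split()], axis=0)
        r = np.mean([embed[w] for w in response.split()], axis=0)
        return 1 - np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r))

    print(originality("fork", "use as a hair comb"))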

Preliminary results

Descriptive statistics for all tasks are reported in Tables 2 and 3 . Fluency descriptive statistics are reported in Table 2 . Semantic distance descriptive statistics are reported in Table 3 .

Primary results

As expected, an independent sample t -test revealed no significant differences in total fluency due to controlling for fluency (as detailed above) between humans ( M  = 6.94, SD  = 3.80) and GPT-4 ( M  = 7.01, SD  = 3.81), t (602) = 0.21, 95% CI [− 0.54, 0.67], p  = 0.83.

To assess originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT-4) X 2 (prompt: ‘fork’, ‘rope’) analysis of variance. The model revealed significant main effects of group ( F (1, 600) = 622.10, p  < 0.001, η 2  = 0.51) and prompt ( F (1, 600) = 584.50, p  < 0.001, η 2  = 0.49) on originality of responses. Additionally, there were significant interaction effects between group and prompt, F (1, 600) = 113.80, p  < 0.001, η 2  = 0.16. Particularly, both samples had higher originality scores for the prompt ‘fork’ in comparison to ‘rope’, but GPT-4 scored higher in originality, regardless of prompt. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different ( p  < 0.001) aside from the human ‘fork’ and GPT-4 ‘rope’ originality ( p  = 0.989). Overall, GPT-4 was more successful at coming up with divergent responses given the same number of opportunities to generate answers compared to the human counterpart and showed higher originality but only for specific prompts (Fig.  1 ).

figure 1

Analysis of variance of originality on the alternative uses task.

Next, we compared elaboration scores between humans and GPT-4. Fluency scores differ from elaboration in the sense that fluency accounts for each coherent response whereas elaboration quantifies the number of words per valid response. For example, a person could respond “you could use a fork to knit or as a hair comb.” In this example, the fluency would be 2 (knitting instrument and comb), but the elaboration would be 12 (number of words used in the response). The results of an independent t -test revealed that elaboration was significantly higher for GPT-4 ( M  = 15.45, SD  = 6.74) in comparison to humans ( M  = 3.38, SD  = 2.91), t (602) = 28.57, 95% CI [11.24, 12.90], p  < 0.001.

Turning to the Consequences Task, as expected, an independent-samples t-test revealed no significant difference in total fluency between humans (M = 5.71, SD = 3.20) and GPT-4 (M = 5.50, SD = 3.15), t(621) = 0.82, 95% CI [−0.29, 0.71], p = 0.41.

To assess the originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT) × 2 (prompt: ‘no more sleep,’ ‘walk on hands’) analysis of variance. The model revealed significant main effects of group (F(1, 619) = 622.10, p < 0.001, η2 = 0.51) and prompt (F(1, 619) = 584.50, p < 0.001, η2 = 0.49) on the originality of responses, as well as a significant group × prompt interaction, F(1, 619) = 113.80, p < 0.001, η2 = 0.16. Specifically, originality was marginally higher for the prompt ‘walk on hands’ in the GPT sample, whereas the human sample showed no significant difference in originality between the two prompts. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different (p < 0.001) except the comparison between the human responses to the two prompts (p = 0.607). Overall, GPT-4 generated more divergent responses than humans given the same number of opportunities, and its originality advantage depended on the prompt type (Fig. 2).

Figure 2. Analysis of variance of originality on the consequences task.

Next, we calculated the difference in elaboration between humans and GPT-4. An independent-samples t-test revealed that elaboration was significantly higher in the GPT-4 sample (M = 38.69, SD = 15.60) than in the human sample (M = 5.45, SD = 4.04), t(621) = −36.04, 95% CI [−35.04, −31.45], p < 0.001.

We assessed the qualitative aspect of the words generated in the DAT by humans and GPT through word occurrence, namely the frequency of single-occurrence words (non-repeating words within a group) and unique-occurrence words (words occurring in only one of the two groups).

Humans produced more single-occurrence words (n = 523, 69.92% of the group’s total responses) than GPT (n = 152, 47.95%; Table 4). In total, 9.11% (n = 97) of responses overlapped between the two groups. Words that occurred exclusively in the human responses accounted for 87.03% (n = 651), compared with 69.40% (n = 220) for GPT.

A chi-square test of independence was performed to examine the relationship between group (GPT vs. human) and word type (single occurrence vs. unique occurrence). The relationship between these variables was not significant, χ2(1, N = 302) = 1.56, p = 0.211. This suggests that the uniqueness and frequency of words did not necessarily aid either group’s originality, but rather contributed to word complexity.

Differences in semantic distance scores were calculated between human and GPT-4 DAT responses. An independent-samples t-test revealed that GPT responses (M = 84.56, SD = 3.05) had higher semantic distances than human responses (M = 76.95, SD = 6.13), t(300) = 13.65, 95% CI [6.51, 8.71], p < 0.001. Although human participants produced a broader range of unique responses, this breadth did not translate into an advantage in semantic distance scores.

Discussion

The present study offers novel evidence on the current state of large language models (i.e., GPT-4) and their capacity for divergent creative output in comparison to human participants. Overall, GPT-4 was more original and elaborate than humans on each of the divergent thinking tasks, even when controlling for fluency of responses. In other words, GPT-4 demonstrated higher creative potential across an entire battery of divergent thinking tasks (i.e., the Alternative Uses Task, the Consequences Task, and the Divergent Associations Task).

Notably, no other study has comprehensively assessed multiple dimensions of the most frequently used divergent thinking tasks in relation to AI. However, studies have begun to examine differences in divergent creativity between humans and AI, particularly after the public emergence of OpenAI’s ChatGPT, with findings showing that AI’s creative potential scores within the top 1% of human responses in terms of originality 32 , 33 , 34 . While there has been an influx of research examining the creativity of generative language models, to date only one study has shown that humans outperformed GPT on the AUT (GPT-3 31 ), while another reported that a later version (GPT-4) showed similar, albeit slightly lower, creative potential compared to humans 32 . Similarly, one study demonstrated that generative models improved from GPT-3.5 to GPT-4, particularly in terms of fluency but, interestingly, not elaboration 49 , which suggests that the creative potential of these LLMs is improving, particularly the ability to generate original ideas. Indeed, only one other study thus far has reported similar results, finding that GPT outperformed humans on the DAT 36 , but the DAT captures only one aspect of divergent thinking. The novelty of the present findings thus provides a foundation for future research to continue examining multiple dimensions of divergent thinking and artificial intelligence.

While the present results suggest that current AI models outperform humans on divergent thinking tasks by a significant margin, there are methodological considerations that could have contributed to these results. Comprehensively examining creativity requires assessing not only originality but also the usefulness and appropriateness of an idea or product 50 . Appropriateness has traditionally proven difficult to standardize, compared with originality, given the multifaceted dimensions that contribute to its assessment, such as sociocultural and historical context. Semantic distance scores do not take these variables into consideration; instead, the scores reflect the relative distance between seemingly related (or unrelated) ideas. In this instance, GPT-4’s answers yielded higher originality than those of its human counterparts, but the feasibility or appropriateness of its ideas could be vastly inferior. Thus, we need to consider that the results reflect only a single aspect of divergent thinking, rather than a generalization that AI is more creative across the board. Future research on AI and creativity needs to account not only for the traditional measurements of creativity (i.e., fluency, elaboration, originality) but also for the usefulness and appropriateness of the ideas.

Interestingly, GPT-4 used a higher frequency of repeated words than human respondents. Although the vocabulary in human responses was much broader, this did not necessarily result in higher semantic distance scores. Flexibility, or the number of categories of responses, has also been found to be smaller (i.e., more similar categories of words were generated) for AI than for humans 34 . In other words, consistent with our present results, humans came up with a wider range of responses, but this did not translate into increased originality. These findings suggest that flexible thinking may be the strong point of human divergent thinking.

Moreover, the complexity of the words chosen by AI, albeit more concentrated in occurrence, could have contributed more robustly to the originality effects. For example, only the AI used words denoting non-tangible concepts (e.g., freedom, philosophy), whereas humans may have experienced a fixedness on generating ideas that are appropriate and observable. The differences between the generated lists (incorporating tangible versus non-tangible words) could inflate originality scores in favor of AI.

Similarly, we need to critically consider the uniqueness of the words generated in DAT responses. There was only marginal overlap between the human and AI responses (9.11%), and humans produced a higher number of single-occurrence words; despite these differences, AI still had higher semantic distance scores. Prior research shows that in human respondents, originality increases over time 51 . This increase is seen as an expansion of activation in an individual’s semantic network, which leads to more original responses 52 . Human responses on these DT tasks tend to follow a diminishing-returns curve before reaching a plateau for an individual’s most original responses 53 . The higher levels of elaboration and semantic distance in AI responses suggest that LLM processing may not need this ramp-up time; LLMs can respond with their most original responses as soon as they are prompted. Whereas humans may fixate on more obvious responses at first, this algorithmic trait could serve as an aid for overcoming ideation fixedness in humans.

It is important to note that the measures used in this study are all measures of creative potential; involvement in creative activities or achievements is another aspect of measuring a person’s creativity. Creative potential is not a guarantee of creative achievement; instead, we need to consider creative potential as an indicator of a person’s creative capabilities 7 . Here, AI was more original, indicating higher creative potential, but this metric may more appropriately reflect the advancement of the algorithms these models were trained on in conjunction with human input. In other words, AI, unlike humans, does not have agency, so its creative potential depends on the assistance of a human user to elicit responses. The creative potential of AI therefore remains in a state of stagnation unless prompted.

Moreover, researchers have examined the interplay between creative potential and real-world creative achievements 54 , 55 , but this approach assumes human-level creativity and cannot account for artificial intelligence. AI can generate creative ideas, but it cannot be assumed that this potential would translate into achievement. The creative potential of AI is limited by the (lack of) autonomy over what the algorithms can create without the intervention of human assistance. Thus, future research should consider the conceptual implications of current measurements of creativity for applications in real-world settings, and whether generalizability at the intersection of potential and achievement may be a human-centric consideration.

The prevalence and accessibility of the internet have drastically shaped the way humans interact with language processing systems and search engines, and LLMs such as GPT-4 are now similarly ubiquitous. Searching for information now has multiple channels that were not previously available, and with these functions comes an array of strategies for finding the desired information. Research has shown that younger people are better and more efficient in their online search strategies 56 , which suggests that exposure to search platforms acts as practice in efficiency. Similarly, through interactions with GPT-4 and other AI platforms, humans may gradually learn how best to utilize LLMs. For information-seeking tools like GPT-4, creative potential has shown clear progression, although limitations remain, such as response appropriateness and AI’s ability to generate idiosyncratic associations. Generative AI has demonstrated robust creative potential but has also shown weaknesses (e.g., less flexible thinking) that could be supplemented by human assistance. Moving forward, the possibility of AI acting as a tool of inspiration, as an aid in a person’s creative process, or as a means of overcoming fixedness is promising.

Data availability

All data associated with the present study is available at https://osf.io/xv6kh/ .

OpenAI. ChatGPT: Optimizing Language Models for Dialogue . (2023). https://openai.com/blog/chatgpt/ . Accessed July 2023.

Rahaman, M. S., Ahsan, M. T., Anjum, N., Terano, H. J. R. & Rahman, M. M. From ChatGPT-3 to GPT-4: A significant advancement in ai-driven NLP tools. J. Eng. Emerg. Technol. 2 (1), 1–11. https://doi.org/10.52631/jeet.v2i1.188 (2023).

Lee, Y. H., & Lin, T. H. (2023). The feasibility study of AI image generator as shape convergent thinking tool. in International Conference on Human-Computer Interaction (pp. 575–589). https://doi.org/10.1007/978-3-031-35891-3_36 .

Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., & Zhang, Y. (2023). Evaluating the logical reasoning ability of chatgpt and gpt-4. arXiv:2304.03439 .

Cropley, A. In praise of convergent thinking. Creat. Res. J. 18 (3), 391–404. https://doi.org/10.1207/s15326934crj1803_13 (2006).

Guilford, J. P. The Nature of Human Intelligence (McGraw-Hill, 1967).

Runco, M. A. & Acar, S. Divergent thinking as an indicator of creative potential. Creat. Res. J. 24 (1), 66–75. https://doi.org/10.1080/10400419.2012.652929 (2012).

Torrance, E. P. The Torrance Tests of Creative Thinking: Norms-Technical Manual (Personal Press, 1974).

Wilson, R. C., Guilford, J. P., Christensen, P. R. & Lewis, D. J. A factor-analytic study of creative-thinking abilities. Psychometrika 19 (4), 297–311. https://doi.org/10.1007/bf02289230 (1954).

Runco, M. A. AI can only produce artificial creativity. J. Creat. 33 (3), 100063. https://doi.org/10.1016/j.yjoc.2023.100063 (2023).

Finke, R. A. Imagery, creativity, and emergent structure. Conscious. Cogn. 5 (3), 381–393. https://doi.org/10.1006/ccog.1996.0024 (1996).

Sarker, I. H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2 (6), 420. https://doi.org/10.1007/s42979-021-00815-1 (2021).

Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 82 (3), 3713–3744. https://doi.org/10.1007/s11042-022-13428-4 (2022).

Zhou, M., Duan, N., Liu, S. & Shum, H.-Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering 6 (3), 275–290. https://doi.org/10.1016/j.eng.2019.12.014 (2020).

Cardoso, A., Veale, T. & Wiggins, G. A. Converging on the divergent: The history (and future) of the international joint workshops in computational creativity. AI Mag. 30 (3), 15. https://doi.org/10.1609/aimag.v30i3.2252 (2009).

Lambert, N., Castricato, L., von Werra, L., & Havrilla A. Illustrating Reinforcement Learning from Human Feedback (RLHF). Hugging Face . (2022). https://huggingface.co/blog/rlhf .

Dumas, D., Organisciak, P. & Doherty, M. Measuring divergent thinking originality with human raters and text-mining models: A psychometric comparison of methods. Psychol. Aesthet. Creat. Arts 15 (4), 645–663. https://doi.org/10.1037/aca0000319 (2021).

Kane, S. et al. Attention, affect, and creativity, from mindfulness to mind-wandering. In The Cambridge Handbook of Creativity and Emotions (eds Ivcevic, Z. et al. ) 130–148 (Cambridge University Press, 2023). https://doi.org/10.1017/9781009031240.010 .

Chatterjee, A. Art in an age of artificial intelligence. Front. Psychol. 13 , 1024449. https://doi.org/10.3389/fpsyg.2022.1024449 (2022).

Boden, M. A. Computer models of creativity. AI Mag. 30 (3), 23–23. https://doi.org/10.1609/aimag.v30i3.2254 (2009).

Bellaiche, L. et al. Humans versus AI: Whether and why we prefer human-created compared to AI-created artwork. Cogn. Res. Princ. Implic. 8 (1), 1–22. https://doi.org/10.1186/s41235-023-00499-6 (2023).

Chiarella, S. et al. Investigating the negative bias towards artificial intelligence: Effects of prior assignment of AI-authorship on the aesthetic appreciation of abstract paintings. Comput. Hum. Behav. 137 , 107406. https://doi.org/10.1016/j.chb.2022.107406 (2022).

Fortuna, P. & Modliński, A. A(I)rtist or counterfeiter? Artificial intelligence as (D) evaluating factor on the art market. J. Arts Manag. Law Soc. 51 (3), 188–201. https://doi.org/10.1080/10632921.2021.1887032 (2021).

Liu, Y., Mittal, A., Yang, D., & Bruckman, A. (2022). Will AI console me when I lose my pet? Understanding perceptions of AI-mediated email writing. in Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517731

Chamberlain, R., Mullin, C., Scheerlinck, B. & Wagemans, J. Putting the art in artificial: Aesthetic responses to computer-generated art. Psychol. Aesthet. Creat. Arts 12 (2), 177–192. https://doi.org/10.1037/aca0000136 (2018).

Gao, C. A. et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. Biorxiv https://doi.org/10.1016/j.patter.2023.100706 (2023).

Samo, A. & Highhouse, S. Artificial intelligence and art: Identifying the aesthetic judgment factors that distinguish human- and machine-generated artwork. Psychol. Aesthet. Creat. Arts. https://doi.org/10.1037/aca0000570 (2023).

Yin, Z., Reuben, F., Stepney, S. & Collins, T. Deep learning’s shallow gains: A comparative evaluation of algorithms for automatic music generation. Mach. Learn. 112 (5), 1785–1822. https://doi.org/10.1007/s10994-023-06309-w (2023).

Kumar, Y., Koul, A., Singla, R. & Ijaz, M. F. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. J. Ambient Intell. Hum. Comput. https://doi.org/10.1007/s12652-021-03612-z (2022).

Anantrasirichai, N. & Bull, D. Artificial intelligence in the creative industries: A review. Artif. Intell. Rev. https://doi.org/10.1007/s10462-021-10039-7 (2022).

Stevenson, C., Smal, I., Baas, M., Grasman, R., & van der Maas, H. Putting GPT-3's Creativity to the (Alternative Uses) Test . (2022). arXiv:2206.08932 .

Haase, J. & Hanel, P. H. Artificial Muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity (2023). https://doi.org/10.48550/arXiv.2303.12003

Koivisto, M. & Grassini, S. Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci. Rep. 13 , 13601. https://doi.org/10.1038/s41598-023-40858-3 (2023).

Guzik, E. E., Byrge, C. & Gilde, C. The originality of machines: AI takes the torrance test. J. Creat. 33 (3), 100065. https://doi.org/10.1016/j.yjoc.2023.100065 (2023).

Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J. & Webb, M. E. Naming unrelated words predicts creativity. Proc. Natl. Acad. Sci. 118 , 25. https://doi.org/10.1073/pnas.2022340118 (2021).

Cropley, D. Is artificial intelligence more creative than humans?: ChatGPT and the divergent association task. Learn. Lett. 2 , 13–13. https://doi.org/10.59453/ll.v2.13 (2023).

Chen, L., Sun, L. & Han, J. A comparison study of human and machine-generated creativity. J. Comput. Inf. Sci. Eng. 23 (5), 051012. https://doi.org/10.1115/1.4062232 (2023).

Sawyer, R. K. Explaining Creativity: The Science of Human Innovation (Oxford University Press, 2012).

R Core Team. R: A Language and Environment for Statistical Computing (Version 4.1.0) [Computer Software] . (2021). http://www.R-project.org .

Nusbaum, E. C., Silvia, P. J. & Beaty, R. E. Ready, set, create: What instructing people to “be creative” reveals about the meaning and mechanisms of divergent thinking. Psychol. Aesthet. Creat. Arts 8 (4), 423. https://doi.org/10.1037/a0036549 (2014).

Acar, S. et al. Applying automated originality scoring to the verbal form of Torrance tests of creative thinking. Gift. Child Q. 67 (1), 3–17. https://doi.org/10.1177/00169862211061874 (2021).

Hass, R. W. & Beaty, R. E. Use or consequences: Probing the cognitive difference between two measures of divergent thinking. Front. Psychol. 9 , 2327. https://doi.org/10.3389/fpsyg.2018.02327 (2018).

Urban, M. & Urban, K. Orientation toward intrinsic motivation mediates the relationship between metacognition and creativity. J. Creat. Behav. 57 (1), 6–16. https://doi.org/10.1002/jocb.558 (2023).

Day, B. et al. Ordering effects and choice set awareness in repeat-response stated preference studies. J. Environ. Econ. Manag. 63 (1), 73–91. https://doi.org/10.1016/j.jeem.2011.09.001 (2012).

Igorov, M., Predoiu, R., Predoiu, A. & Igorov, A. Creativity, resistance to mental fatigue and coping strategies in junior women handball players. Eur. Proc. Soc. Behav. Sci. https://doi.org/10.15405/epsbs.2016.06.39 (2016).

Organisciak, P. & Dumas, D. Open Creativity Scoring [Computer Software] . (University of Denver, 2020). https://openscoring.du.edu/ .

Beaty, R. E., Johnson, D. R., Zeitlen, D. C. & Forthmann, B. Semantic distance and the alternate uses task: Recommendations for reliable automated assessment of originality. Creat. Res. J. 34 (3), 245–260. https://doi.org/10.1080/10400419.2022.2025720 (2022).

Pennington, J., Socher, R. & Manning, C. Glove: Global vectors for word representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 1532–1543 (2014).

Vinchon, F., Gironnay, V., & Lubart, T. The Creative AI-Land: Exploring new forms of creativity. In Review . (2023).

Runco, M. A. & Jaeger, G. J. The standard definition of creativity. Creat. Res. J. 24 (1), 92–96. https://doi.org/10.1080/10400419.2012.650092 (2012).

Beaty, R. E. & Silvia, P. J. Why do ideas get more creative across time? An executive interpretation of the serial order effect in divergent thinking tasks. Psychol. Aesthet. Creat. Arts 6 (4), 309–319. https://doi.org/10.1037/a0029171 (2012).

Mednick, S. The associative basis of the creative process. Psychol. Rev. 69 (3), 220–232. https://doi.org/10.1037/h0048850 (1962).

Hubert, K. F., Finch, A. & Zabelina, D. Diminishing Creative Returns: Predicting Optimal Creative Performance via Individual Differences in Executive Functioning (2023).

Carson, S. H., Peterson, J. B. & Higgins, D. M. Reliability, validity, and factor structure of the creative achievement questionnaire. Creat. Res. J. 17 (1), 37–50. https://doi.org/10.1207/s15326934crj1701_4 (2005).

Jauk, E., Benedek, M. & Neubauer, A. C. The road to creative achievement: A latent variable model of ability and personality predictors. Pers. Individ. Diff. https://doi.org/10.1016/j.paid.2013.07.129 (2014).

Chevalier, A., Dommes, A. & Marquié, J.-C. Strategy and accuracy during information search on the web: Effects of age and complexity of the search questions. Comput. Hum. Behav. 53 , 305–315. https://doi.org/10.1016/j.chb.2015.07.017 (2015).

Author information

These authors contributed equally: Kent F. Hubert and Kim N. Awa.

Authors and Affiliations

Department of Psychological Sciences, University of Arkansas, Fayetteville, AR, 72701, USA

Kent F. Hubert, Kim N. Awa & Darya L. Zabelina

Contributions

D.L.Z., K.F.H., and K.N.A. contributed to the conceptualization and methodology. K.F.H. and K.N.A. contributed to formal analysis and investigation. K.F.H. prepared all figures. K.N.A. prepared all tables. D.L.Z., K.F.H., and K.N.A. contributed to writing and revision.

Corresponding author

Correspondence to Kent F. Hubert .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Hubert, K.F., Awa, K.N. & Zabelina, D.L. The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14 , 3440 (2024). https://doi.org/10.1038/s41598-024-53303-w

Download citation

Received : 14 October 2023

Accepted : 30 January 2024

Published : 10 February 2024

DOI : https://doi.org/10.1038/s41598-024-53303-w


This article is cited by

An empirical investigation of the impact of ChatGPT on creativity

  • Byung Cheol Lee
  • Jaeyeon Chung

Nature Human Behaviour (2024)

Tackling AI Hyping

  • Mona Sloane
  • David Danks
  • Emanuel Moss

AI and Ethics (2024)

Towards a mixed human–machine creativity

  • Mirko Farina
  • Witold Pedrycz
  • Andrea Lavazza

Journal of Cultural Cognitive Science (2024)


Natural Language Processing

Introduction

Natural Language Processing (NLP) is one of the hottest areas of artificial intelligence (AI) thanks to applications like text generators that compose coherent essays, chatbots that fool people into thinking they’re sentient, and text-to-image programs that produce photorealistic images of anything you can describe. Recent years have brought a revolution in the ability of computers to understand human languages, programming languages, and even biological and chemical sequences, such as DNA and protein structures, that resemble language. The latest AI models are unlocking these areas to analyze the meanings of input text and generate meaningful, expressive output.

What is Natural Language Processing (NLP)

Natural language processing (NLP) is the discipline of building machines that can manipulate human language — or data that resembles human language — in the way that it is written, spoken, and organized. It evolved from computational linguistics, which uses computer science to understand the principles of language, but rather than developing theoretical frameworks, NLP is an engineering discipline that seeks to build technology to accomplish useful tasks. NLP can be divided into two overlapping subfields: natural language understanding (NLU), which focuses on semantic analysis or determining the intended meaning of text, and natural language generation (NLG), which focuses on text generation by a machine. NLP is separate from — but often used in conjunction with — speech recognition, which seeks to parse spoken language into words, turning sound into text and vice versa.

Why Does Natural Language Processing (NLP) Matter?

NLP is an integral part of everyday life and becoming more so as language technology is applied to diverse fields like retailing (for instance, in customer service chatbots) and medicine (interpreting or summarizing electronic health records). Conversational agents such as Amazon’s Alexa and Apple’s Siri utilize NLP to listen to user queries and find answers. The most sophisticated such agents — such as GPT-3, which was recently opened for commercial applications — can generate sophisticated prose on a wide variety of topics as well as power chatbots that are capable of holding coherent conversations. Google uses NLP to improve its search engine results, and social networks like Facebook use it to detect and filter hate speech.

NLP is growing increasingly sophisticated, yet much work remains to be done. Current systems are prone to bias and incoherence, and occasionally behave erratically. Despite the challenges, machine learning engineers have many opportunities to apply NLP in ways that are ever more central to a functioning society.

What is Natural Language Processing (NLP) Used For?

NLP is used for a wide variety of language-related tasks, including answering questions, classifying text in a variety of ways, and conversing with users. 

Here are some of the main tasks that can be solved by NLP:

  • Sentiment analysis is the process of classifying the emotional intent of text. Generally, the input to a sentiment classification model is a piece of text, and the output is the probability that the sentiment expressed is positive, negative, or neutral. Typically, this probability is based on hand-generated features, word n-grams, or TF-IDF features, or computed by deep learning models that capture sequential long- and short-term dependencies. Sentiment analysis is used to classify customer reviews on various online platforms as well as for niche applications like identifying signs of mental illness in online comments.

  • Toxicity classification is a branch of sentiment analysis where the aim is not just to classify hostile intent but also to classify particular categories such as threats, insults, obscenities, and hatred towards certain identities. The input to such a model is text, and the output is generally the probability of each class of toxicity. Toxicity classification models can be used to moderate and improve online conversations by silencing offensive comments, detecting hate speech, or scanning documents for defamation.
  • Machine translation automates translation between different languages. The input to such a model is text in a specified source language, and the output is the text in a specified target language. Google Translate is perhaps the most famous mainstream application. Such models are used to improve communication between people on social-media platforms such as Facebook or Skype. Effective approaches to machine translation can distinguish between words with similar meanings. Some systems also perform language identification; that is, classifying text as being in one language or another.
  • Named entity recognition aims to extract entities in a piece of text into predefined categories such as personal names, organizations, locations, and quantities. The input to such a model is generally text, and the output is the various named entities along with their start and end positions. Named entity recognition is useful in applications such as summarizing news articles and combating disinformation. For example, in the sentence “Tim Cook is the CEO of Apple, which is headquartered in Cupertino,” a model would tag “Tim Cook” as a person, “Apple” as an organization, and “Cupertino” as a location.

  • Spam detection is a prevalent binary classification problem in NLP, where the purpose is to classify emails as either spam or not. Spam detectors take as input the email text along with other metadata like the title and the sender’s name, and output the probability that the email is spam. Email providers like Gmail use such models to provide a better user experience by detecting unsolicited and unwanted emails and moving them to a designated spam folder.
  • Grammatical error correction models encode grammatical rules to correct the grammar within text. This is viewed mainly as a sequence-to-sequence task, where a model is trained on an ungrammatical sentence as input and a correct sentence as output. Online grammar checkers like Grammarly and word-processing systems like Microsoft Word use such systems to provide a better writing experience to their customers. Schools also use them to grade student essays.
  • Topic modeling is an unsupervised text mining task that takes a corpus of documents and discovers abstract topics within that corpus. The input to a topic model is a collection of documents, and the output is a list of topics that defines words for each topic as well as assignment proportions of each topic in a document. Latent Dirichlet Allocation (LDA), one of the most popular topic modeling techniques, tries to view a document as a collection of topics and a topic as a collection of words. Topic modeling is being used commercially to help lawyers find evidence in legal documents.
  • Autocomplete predicts what word comes next, and autocomplete systems of varying complexity are used in chat applications like WhatsApp. Google uses autocomplete to predict search queries. One of the most famous models for autocomplete is GPT-2, which has been used to write articles, song lyrics, and much more.
  • Database query (a chatbot task): We have a database of questions and answers, and we would like users to query it using natural language.
  • Conversation generation (a chatbot task): Chatbots that simulate dialogue with a human partner. Some are capable of engaging in wide-ranging conversations. A high-profile example is Google’s LaMDA, which provided such human-like answers to questions that one of its developers was convinced that it had feelings.
  • Information retrieval finds the documents that are most relevant to a query. This is a problem every search and recommendation system faces. The goal is not to answer a particular query but to retrieve, from a collection of documents that may number in the millions, the set most relevant to the query. Document retrieval systems mainly execute two processes: indexing and matching. In most modern systems, indexing is done by a vector space model through Two-Tower Networks, while matching is done using similarity or distance scores. Google recently integrated its search function with a multimodal information retrieval model that works with text, image, and video data.

  • Extractive summarization focuses on extracting the most important sentences from a long text and combining these to form a summary. Typically, extractive summarization scores each sentence in an input text and then selects several sentences to form the summary.
  • Abstractive summarization produces a summary by paraphrasing. This is similar to writing an abstract, which may include words and sentences that are not present in the original text. Abstractive summarization is usually modeled as a sequence-to-sequence task, where the input is a long-form text and the output is a summary.
  • Multiple choice (a question-answering task): The multiple-choice question problem is composed of a question and a set of possible answers. The learning task is to pick the correct answer.
  • Open domain (a question-answering task): In open-domain question answering, the model provides answers to questions in natural language without any options provided, often by querying a large number of texts.

How Does Natural Language Processing (NLP) Work?

NLP models work by finding relationships between the constituent parts of language — for example, the letters, words, and sentences found in a text dataset. NLP architectures use various methods for data preprocessing, feature extraction, and modeling. Some of these processes are: 

  • Stemming and lemmatization: Stemming is an informal process of converting words to their base forms using heuristic rules. For example, “university,” “universities,” and “university’s” might all be mapped to the base univers. (One limitation of this approach is that “universe” may also be mapped to univers, even though universe and university don’t have a close semantic relationship.) Lemmatization is a more formal way to find roots by analyzing a word’s morphology using vocabulary from a dictionary. Stemming and lemmatization are provided by libraries like spaCy and NLTK; a minimal sketch using both appears after this list.
  • Sentence segmentation breaks a large piece of text into linguistically meaningful sentence units. This is obvious in languages like English, where the end of a sentence is marked by a period, but it is still not trivial. A period can be used to mark an abbreviation as well as to terminate a sentence, and in this case, the period should be part of the abbreviation token itself. The process becomes even more complex in languages, such as ancient Chinese, that don’t have a delimiter that marks the end of a sentence. 
  • Stop word removal aims to remove the most commonly occurring words that don’t add much information to the text. For example, “the,” “a,” “an,” and so on.
  • Tokenization splits text into individual words and word fragments. The result generally consists of a word index and tokenized text in which words may be represented as numerical tokens for use in various deep learning methods. A method that instructs language models to ignore unimportant tokens can improve efficiency.  
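The sketch below shows these preprocessing steps with NLTK and spaCy, as mentioned above; it assumes both libraries are installed and that the small English model has been downloaded with `python -m spacy download en_core_web_sm`:

```python
# Minimal preprocessing sketch: stemming with NLTK, then sentence
# segmentation, tokenization, lemmatization, and stop-word flags with spaCy.
from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
for word in ["university", "universities", "universe"]:
    print(word, "->", stemmer.stem(word))  # heuristic stems; may collide

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith visited two universities. The students were thrilled.")

for sent in doc.sents:   # sentence segmentation handles the "Dr." abbreviation
    print(sent.text)

for token in doc:        # token text, dictionary lemma, stop-word flag
    print(token.text, token.lemma_, token.is_stop)
```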

  • Bag-of-Words: Bag-of-Words counts the number of times each word or n-gram (combination of n words) appears in a document. In the sketch below, a Bag-of-Words model creates a numerical representation of a tiny corpus based on how many times each word in the learned word index occurs in each document.
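A minimal scikit-learn version of this idea (the toy corpus is illustrative):

```python
# Bag-of-Words with scikit-learn: learn a word index from a toy corpus and
# count how many times each indexed word occurs in each document.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned word index
print(X.toarray())                         # per-document word counts
```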

  • Term Frequency: How important is the word in the document?

TF(word in a document) = Number of occurrences of that word in the document / Number of words in the document

  • Inverse Document Frequency: How important is the term in the whole corpus?

IDF(word in a corpus)=log(number of documents in the corpus / number of documents that include the word)

A word is important if it occurs many times in a document. But that alone creates a problem: words like “a” and “the” appear often, so their TF scores will always be high. We resolve this issue using Inverse Document Frequency, which is high if the word is rare and low if it is common across the corpus. The TF-IDF score of a term is the product of its TF and IDF.
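The sketch below computes TF-IDF features with scikit-learn and feeds them to logistic regression, one of the pairings described later in this article. Note that scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF formula above; the corpus and labels are illustrative:

```python
# TF-IDF features fed into a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["great product, works well", "terrible, broke after a day",
          "works great, love it", "awful product, waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # rows are documents, columns are terms

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["this product works great"])))
```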

  • Word2Vec, introduced in 2013, uses a vanilla neural network to learn high-dimensional word embeddings from raw text. It comes in two variations: Skip-Gram, in which we try to predict surrounding words given a target word, and Continuous Bag-of-Words (CBOW), in which we try to predict the target word from surrounding words. After training, the final layer is discarded, and the model maps a word to an embedding that can be used as input to many NLP tasks. Embeddings from Word2Vec capture context: if particular words appear in similar contexts, their embeddings will be similar.
  • GloVe is similar to Word2Vec in that it also learns word embeddings, but it does so using matrix factorization techniques rather than neural network training. The GloVe model builds a matrix based on global word-to-word co-occurrence counts.
  • Numerical features extracted by the techniques described above can be fed into various models depending on the task at hand. For example, for classification, the output from the TF-IDF vectorizer could be provided to logistic regression, naive Bayes, decision trees, or gradient boosted trees. Or, for named entity recognition, we can use hidden Markov models along with n-grams. 
  • Deep neural networks typically work without using extracted features, although we can still use TF-IDF or Bag-of-Words features as an input. 
  • Language Models: In very basic terms, the objective of a language model is to predict the next word given a stream of input words. Probabilistic models that use the Markov assumption are one example:

P(W_n | W_1, …, W_(n−1)) ≈ P(W_n | W_(n−1))

Deep learning is also used to create such language models. Deep-learning models take word embeddings as input and, at each time step, return a probability distribution over the next word, i.e., a probability for every word in the vocabulary. Pre-trained language models learn the structure of a particular language by processing a large corpus, such as Wikipedia. They can then be fine-tuned for a particular task. For instance, BERT has been fine-tuned for tasks ranging from fact-checking to writing headlines.
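A first-order Markov language model of this kind can be estimated directly from bigram counts. The sketch below is purely illustrative:

```python
# A minimal bigram (first-order Markov) language model: estimate
# P(next_word | current_word) from counts over a toy corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the log .".split()

counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_distribution(word):
    """Return P(w | word) for every w seen after `word` in the corpus."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_distribution("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'log': 0.25}
print(next_word_distribution("sat"))  # {'on': 1.0}
```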

Top Natural Language Processing (NLP) Techniques

Most of the NLP tasks discussed above can be modeled by a dozen or so general techniques. It’s helpful to think of these techniques in two categories: Traditional machine learning methods and deep learning methods. 

Traditional Machine learning NLP techniques: 

  • Logistic regression is a supervised classification algorithm that aims to predict the probability that an event will occur based on some input. In NLP, logistic regression models can be applied to solve problems such as sentiment analysis, spam detection, and toxicity classification.
  • Naive Bayes is a supervised classification algorithm that finds the conditional probability distribution P(label | text) using the following Bayes formula:

P(label | text) = P(label) x P(text|label) / P(text) 

and predicts the label whose joint distribution has the highest probability. The naive assumption in the Naive Bayes model is that the individual words are conditionally independent given the label. Thus:

P(text|label) = P(word_1|label) * P(word_2|label) * … * P(word_n|label)

In NLP, such statistical methods can be applied to solve problems such as spam detection or finding bugs in software code.
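A minimal spam detector built this way with scikit-learn (the four training emails are illustrative):

```python
# Naive Bayes spam detection: word counts as features, MultinomialNB as the
# classifier, following the Bayes formulas above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "claim your free money",
          "meeting agenda for tomorrow", "lunch at noon?"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize money"]))        # likely ['spam']
print(model.predict_proba(["agenda for lunch"]))  # P(label | text)
```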

  • Decision trees are a class of supervised classification models that split the dataset based on different features to maximize information gain in those splits.

  • Latent Dirichlet Allocation (LDA) is used for topic modeling. LDA tries to view a document as a collection of topics and a topic as a collection of words. LDA is a statistical approach. The intuition behind it is that we can describe any topic using only a small set of words from the corpus.
  • Hidden Markov models: Markov models are probabilistic models that decide the next state of a system based on the current state. For example, in NLP, we might suggest the next word based on the previous word. We can model this as a Markov model in which we find the transition probability of going from word1 to word2, that is, P(word2|word1). Then we can use a product of these transition probabilities to find the probability of a sentence. The hidden Markov model (HMM) adds a hidden state to the Markov model: a property of the data that isn’t directly observed. HMMs are used for part-of-speech (POS) tagging, where the words of a sentence are the observed states and the POS tags are the hidden states. The HMM also adds a concept called emission probability: the probability of an observation given a hidden state, in this case the probability of a word given its POS tag. Given a sentence, we can then calculate the most likely part-of-speech tag for each word based on both how likely the word is to carry a certain tag and how likely that tag is to follow the tag assigned to the previous word. In practice, this is solved using the Viterbi algorithm.
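The sketch below implements the Viterbi algorithm for a tiny HMM tagger. The transition and emission probabilities are made-up toy values, not trained ones; a real tagger would estimate them from a labeled corpus:

```python
# Viterbi decoding for a toy HMM POS tagger. All probabilities below are
# illustrative, hand-picked values, not trained parameters.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
           "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit_p = {"DET": {"the": 0.9, "dog": 0.0, "barks": 0.0},
          "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
          "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    # best[t][tag] = (probability of the best path ending in tag, backpointer)
    best = [{t: (start_p[t] * emit_p[t][words[0]], None) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t][w], p)
                for p in tags)
            row[t] = (prob, prev)
        best.append(row)
    # Trace the backpointers from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```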

Deep learning NLP Techniques: 

  • Convolutional Neural Network (CNN): The idea of using a CNN to classify text was first presented in the paper “Convolutional Neural Networks for Sentence Classification” by Yoon Kim. The central intuition is to see a document as an image. However, instead of pixels, the input is sentences or documents represented as a matrix of words.
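A minimal PyTorch sketch of this idea, with illustrative hyperparameters (filter widths 3–5 and 100 filters per width, as in Kim's paper):

```python
# A Kim-style CNN text classifier: treat a sentence as a (length x embedding)
# matrix, convolve with several filter widths, max-pool over time, classify.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128,
                 num_filters=100, widths=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, w) for w in widths])
        self.fc = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed, seq)
        # Convolve with each filter width, then max-pool over positions.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, num_classes)

model = TextCNN()
logits = model(torch.randint(0, 10000, (8, 40)))  # batch of 8 sentences
print(logits.shape)  # torch.Size([8, 2])
```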

  • Recurrent Neural Network (RNN): Many deep learning techniques for text classification process words in close proximity using n-grams or a fixed window (as CNNs do). They can see “New York” as a single instance, but they can’t capture the context provided by a longer text sequence; they don’t learn the sequential structure of the data, where every word depends on the previous word or on a word in a previous sentence. RNNs remember previous information using hidden states and connect it to the current task. The architectures known as Gated Recurrent Unit (GRU) and long short-term memory (LSTM) are types of RNNs designed to remember information for an extended period, and the bidirectional LSTM/GRU keeps contextual information in both directions, which is helpful in text classification. RNNs have also been used to generate mathematical proofs and translate human thoughts into words.
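A minimal bidirectional LSTM classifier in PyTorch (sizes are illustrative):

```python
# Bidirectional LSTM text classifier: the final hidden states of both
# directions are concatenated and fed to a linear classification layer.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128,
                 hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                # (batch, seq_len)
        x = self.embed(token_ids)                # (batch, seq_len, embed)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # forward + backward states
        return self.fc(h)

model = LSTMClassifier()
print(model(torch.randint(0, 10000, (8, 40))).shape)  # torch.Size([8, 2])
```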

  • Autoencoders are deep learning encoder-decoders that approximate a mapping from X to X, i.e., input = output. They first compress the input features into a lower-dimensional representation (sometimes called a latent code, latent vector, or latent representation) and then learn to reconstruct the input from it. The representation vector can be used as input to a separate model, so this technique can be used for dimensionality reduction. Specialists in many fields have applied autoencoders; geneticists, for example, have used them to spot mutations associated with diseases in amino acid sequences.
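A minimal PyTorch autoencoder over, say, TF-IDF feature vectors; dimensions are illustrative:

```python
# Autoencoder: compress a feature vector to a low-dimensional latent code
# and learn to reconstruct the input (input = target).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=5000, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim))

    def forward(self, x):
        z = self.encoder(x)        # the latent representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.rand(8, 5000)
reconstruction, latent = model(x)
loss = nn.functional.mse_loss(reconstruction, x)  # reconstruction error
loss.backward()
```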

  • Encoder-decoder sequence-to-sequence: The encoder-decoder seq2seq architecture is an adaptation of autoencoders specialized for translation, summarization, and similar tasks. The encoder encapsulates the information in a text into an encoded vector. Unlike an autoencoder, the decoder’s task is not to reconstruct the input from that vector but to generate a different desired output, like a translation or summary.

  • Transformers: The transformer, a model architecture first described in the 2017 paper “Attention Is All You Need” (Vaswani, Shazeer, Parmar, et al.), forgoes recurrence and instead relies entirely on a self-attention mechanism to draw global dependencies between input and output. Because this mechanism processes all words at once rather than one at a time, and is highly parallelizable, it reduces training time and inference cost compared to RNNs. The transformer architecture has revolutionized NLP in recent years, leading to models including BLOOM, Jurassic-X, and Turing-NLG. It has also been successfully applied to a variety of different vision tasks, including making 3D images.
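The quickest way to try transformer models is Hugging Face's pipeline API, which downloads a default pre-trained model for each task on first use (assumes `pip install transformers` plus a backend such as PyTorch):

```python
# Pre-trained transformers via the Hugging Face pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The transformer architecture revolutionized NLP."))
# -> e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

summarizer = pipeline("summarization")
article = (
    "The transformer forgoes recurrence and relies entirely on "
    "self-attention to draw global dependencies between input and output. "
    "Because it processes all words in parallel, it trains faster than "
    "recurrent networks and scales to very large corpora."
)
print(summarizer(article, max_length=30, min_length=5))
```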

Six Important Natural Language Processing (NLP) Models

Over the years, many NLP models have made waves within the AI community, and some have even made headlines in the mainstream news. The most famous of these have been chatbots and language models. Here are some of them:

  • Eliza was developed in the mid-1960s as an attempt to pass the Turing Test; that is, to fool people into thinking they’re conversing with another human being rather than a machine. Eliza used pattern matching and a series of rules without encoding the context of the language.
  • Tay was a chatbot that Microsoft launched in 2016. It was supposed to tweet like a teen and learn from conversations with real users on Twitter. The bot adopted phrases from users who tweeted sexist and racist comments, and Microsoft deactivated it not long afterward. Tay illustrates some points made by the “Stochastic Parrots” paper, particularly the danger of not debiasing data.
  • BERT and his Muppet friends: Many deep learning models for NLP are named after Muppet characters, including ELMo, BERT, Big BIRD, ERNIE, Kermit, Grover, RoBERTa, and Rosita. Most of these models are good at providing contextual embeddings and enhanced knowledge representation.
  • Generative Pre-Trained Transformer 3 (GPT-3) is a 175 billion parameter model that can write original prose with human-equivalent fluency in response to an input prompt. The model is based on the transformer architecture. The previous version, GPT-2, is open source. Microsoft acquired an exclusive license to access GPT-3’s underlying model from its developer OpenAI, but other users can interact with it via an application programming interface (API). Several groups including EleutherAI and Meta have released open source interpretations of GPT-3. 
  • Language Model for Dialogue Applications (LaMDA) is a conversational chatbot developed by Google. LaMDA is a transformer-based model trained on dialogue rather than the usual web text. The system aims to provide sensible and specific responses to conversations. Google developer Blake Lemoine came to believe that LaMDA is sentient after having detailed conversations with the AI about its rights and personhood; during one of these conversations, the AI changed Lemoine’s mind about Isaac Asimov’s third law of robotics. Lemoine claimed that LaMDA was sentient, but the idea was disputed by many observers and commentators. Subsequently, Google placed Lemoine on administrative leave for distributing proprietary information and ultimately fired him.
  • Mixture of Experts (MoE): While most deep learning models use the same set of parameters to process every input, MoE models aim to provide different parameters for different inputs based on efficient routing algorithms to achieve higher performance. Switch Transformer is an example of the MoE approach that aims to reduce communication and computational costs.

Programming Languages, Libraries, And Frameworks For Natural Language Processing (NLP)

Many languages and libraries support NLP. Here are a few of the most useful.

  • Natural Language Toolkit (NLTK) is one of the first NLP libraries written in Python. It provides easy-to-use interfaces to corpora and lexical resources such as WordNet. It also provides a suite of text-processing libraries for classification, tagging, stemming, parsing, and semantic reasoning.
  • spaCy is one of the most versatile open source NLP libraries. It supports more than 66 languages. spaCy also provides pre-trained word vectors and implements many popular models like BERT. spaCy can be used for building production-ready systems for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking, and so on.
  • Deep Learning libraries: Popular deep learning libraries include TensorFlow and PyTorch , which make it easier to create models with features like automatic differentiation. These libraries are the most common tools for developing NLP models.
  • Hugging Face offers open-source implementations and weights of over 135 state-of-the-art models. The repository enables easy customization and training of the models.
  • Gensim provides vector space modeling and topic modeling algorithms (see the Word2Vec sketch after this list).
  • R: Many early NLP models were written in R, and R is still widely used by data scientists and statisticians. Libraries in R for NLP include TidyText, Weka, Word2Vec, SpaCyR, TensorFlow, and PyTorch.
  • Many other languages including JavaScript, Java, and Julia have libraries that implement NLP methods.
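As a concrete example of the Word2Vec models described earlier, here is a minimal gensim sketch; the toy corpus is far too small for useful embeddings and is purely illustrative:

```python
# Train skip-gram Word2Vec embeddings (sg=1) on a toy tokenized corpus.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

vector = model.wv["cat"]             # the learned embedding for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbors in embedding space
```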

Controversies Surrounding Natural Language Processing (NLP)

NLP has been at the center of a number of controversies. Some are centered directly on the models and their outputs, others on second-order concerns, such as who has access to these systems, and how training them impacts the natural world. 

  • Stochastic parrots: A 2021 paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell examines how language models may repeat and amplify biases found in their training data. The authors point out that huge, uncurated datasets scraped from the web are bound to include social biases and other undesirable information, and models that are trained on them will absorb these flaws. They advocate greater care in curating and documenting datasets, evaluating a model’s potential impact prior to development, and encouraging research in directions other than designing ever-larger architectures to ingest ever-larger datasets.
  • Coherence versus sentience: Recently, a Google engineer tasked with evaluating the LaMDA language model was so impressed by the quality of its chat output that he believed it to be sentient. The fallacy of attributing human-like intelligence to AI dates back to some of the earliest NLP experiments.
  • Environmental impact: Large language models require a lot of energy during both training and inference. One study estimated that training a single large language model can emit five times as much carbon dioxide as a single automobile over its operational lifespan. Another study found that models consume even more energy during inference than training. As for solutions, researchers have proposed using cloud servers located in countries with lots of renewable energy as one way to offset this impact. 
  • High cost leaves out non-corporate researchers: The computational requirements needed to train or deploy large language models are too expensive for many small companies. Some experts worry that this could block many capable engineers from contributing to innovation in AI.
  • Black box: When a deep learning model renders an output, it’s difficult or impossible to know why it generated that particular result. While traditional models like logistic regression enable engineers to examine the impact on the output of individual features, neural network methods in natural language processing are essentially black boxes. Such systems are said to be “not explainable,” since we can’t explain how they arrived at their output. An effective approach to achieve explainability is especially important in areas like banking, where regulators want to confirm that a natural language processing system doesn’t discriminate against some groups of people, and law enforcement, where models trained on historical data may perpetuate historical biases against certain groups.

“Nonsense on stilts”: Writer Gary Marcus has criticized deep learning-based NLP for generating sophisticated language that misleads users into believing that natural language algorithms understand what they are saying and are capable of more sophisticated reasoning than is currently possible.

How To Get Started In Natural Language Processing (NLP)

If you are just starting out, many excellent courses can help.

If you want to learn more about NLP, try reading research papers. Work through the papers that introduced the models and techniques described in this article. Most are easy to find on arxiv.org. You might also take a look at these resources:

  • The Batch: A weekly newsletter that tells you what matters in AI. It’s the best way to keep up with developments in deep learning.
  • NLP News: A newsletter from Sebastian Ruder, a research scientist at Google, focused on what’s new in NLP.
  • Papers with Code: A web repository of machine learning research, tasks, benchmarks, and datasets.

We highly recommend learning to implement basic algorithms (linear and logistic regression, Naive Bayes, decision trees, and vanilla neural networks) in Python. The next step is to take an open-source implementation and adapt it to a new dataset or task. 

NLP is one of the fast-growing research domains in AI, with applications that involve tasks including translation, summarization, text generation, and sentiment analysis. Businesses use NLP to power a growing number of applications, both internal — like detecting insurance fraud, determining customer sentiment, and optimizing aircraft maintenance — and customer-facing, like Google Translate.

Aspiring NLP practitioners can begin by familiarizing themselves with foundational AI skills: performing basic mathematics, coding in Python, and using algorithms like decision trees, Naive Bayes, and logistic regression. Online courses can help you build your foundation and proceed into specialized topics. Specializing in NLP requires a working knowledge of things like neural networks, frameworks like PyTorch and TensorFlow, and various data preprocessing techniques. The transformer architecture, which has revolutionized the field since it was introduced in 2017, is especially important.

NLP is an exciting and rewarding discipline, and it has the potential to profoundly impact the world in many positive ways. Unfortunately, NLP is also the focus of several controversies, and understanding them is part of being a responsible practitioner. For instance, researchers have found that models parrot the biased language found in their training data, whether it is counterfactual, racist, or hateful. Moreover, sophisticated language models can be used to generate disinformation. A broader concern is that training large models produces substantial greenhouse gas emissions.

This page is only a brief overview of what NLP is all about. If you have an appetite for more, DeepLearning.AI offers courses for everyone on their NLP journey, from AI beginners to those who are ready to specialize. No matter your current level of expertise or aspirations, remember to keep learning!

Natural Language Processing

Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, Translate, and more.

Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems. We are particularly interested in algorithms that scale well and can be run efficiently in a highly distributed environment.

Our syntactic systems predict part-of-speech tags for each word in a given sentence, as well as morphological features such as gender and number. They also label relationships between words, such as subject, object, modification, and others. We focus on efficient algorithms that leverage large amounts of unlabeled data, and we have recently incorporated neural network technology.

On the semantic side, we identify entities in free text, label them with types (such as person, location, or organization), cluster mentions of those entities within and across documents (coreference resolution), and resolve the entities to the Knowledge Graph.
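As an illustration of these two layers of analysis, here is a sketch using the open-source spaCy library and its small English model (not Google’s internal systems; coreference resolution and Knowledge Graph linking would require additional components):

```python
# Illustrative syntactic and semantic analysis with spaCy.
# Assumed setup: pip install spacy
#                python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai announced new products at Google in California.")

# Syntax: part-of-speech tags, morphological features, dependency relations.
for token in doc:
    print(f"{token.text:<10} {token.pos_:<6} {str(token.morph):<35} "
          f"{token.dep_:<10} head={token.head.text}")

# Semantics: named entities labeled with types such as PERSON, ORG, GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```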

Recent work has focused on incorporating multiple sources of knowledge and information to aid with analysis of text, as well as applying frame semantics at the noun phrase, sentence, and document level.



Artificial Intelligence: Cultural Policy, Management, Education, and Research


About this Special Issue

The impact of AI on cultural management and cultural policy cuts across multiple disciplines and creative fields, from the theorisation of the very nature of cultural producers, to the practical construction of art form and audiences. Dynamic shifts in both conceptualisation and practice are rapidly ...

Keywords: Artificial Intelligence, Generative AI, Digital Policy, Copyright, Operations



Title: Efficient Methods for Natural Language Processing: A Survey

Abstract: Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
Comments: Accepted at TACL, pre-publication version
Subjects: Computation and Language (cs.CL)


