Natural language processing

Natural language processing (NLP) is a field of computer science and artificial intelligence that deals with the interactions between computers and human (natural) languages and, in particular, is concerned with programming computers to process natural language data.

Challenges in natural language processing often involve speech recognition, natural language understanding, and natural language generation.


History

Main article: History of natural language processing

The history of NLP generally started in the 1950s. In 1950, Alan Turing published an article titled “Computing Machinery and Intelligence”, which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved the fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. [2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted “blocks worlds” with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the “patient” exceeded its very small knowledge base, ELIZA might provide a generic response, for example, responding to “My head hurts” with “Why do you say your head hurts?”.
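The kind of response shown above can be approximated with a single pattern-and-template rule. The following Python sketch is illustrative only: the rule, the `respond` function, and the fallback reply are assumptions made for this example and are not taken from Weizenbaum’s original program.

```python
import re

# Hypothetical rule: echo a first-person statement back as a question.
RULE = re.compile(r"^my (?P<rest>.+?)[.!]?$", re.IGNORECASE)

# Generic reply used when no rule matches (an assumption for this sketch).
FALLBACK = "Please tell me more."


def respond(utterance: str) -> str:
    """Return an ELIZA-style reply built only from the surface form of the input."""
    match = RULE.match(utterance.strip())
    if match:
        return f"Why do you say your {match.group('rest')}?"
    return FALLBACK


print(respond("My head hurts"))
print(respond("It is raining"))
```

The reply is produced without any model of what a head or pain is, which is why such a system can seem human-like on matching input and falls back to a generic reply otherwise.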

During the 1970s, many programmers began to write “conceptual ontologies”, which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. [3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to NLP, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The language models upon which many speech recognition systems rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was a major limitation on their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

In recent years, there has been a flurry of results showing deep learning techniques [4] [5] achieving state-of-the-art results in many natural language tasks, for example in language modeling, [6] parsing, [7] [8] and many others.

Statistical Natural Language Processing (SNLP)

Since the so-called “statistical revolution” [9] [10] in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning.

Formerly, many language-processing tasks typically involved the direct hand coding of rules, [11] [12] which is not in general robust to natural language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural “corpora”) is a set of documents, possibly with human or computer annotations.

Many different classes of machine learning algorithms have been applied to NLP tasks. These algorithms take as input a large set of “features” that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
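To make the contrast concrete, the sketch below compares a hard if-then rule with a soft decision computed from weighted features. It is a minimal illustration that assumes a log-linear (softmax) model; the feature names, the weights, and the two-tag set are invented for the example and are not learned from any corpus.

```python
import math

# Hypothetical hand-set weights: WEIGHTS[tag][feature] -> real-valued weight.
WEIGHTS = {
    "NOUN": {"prev_is_determiner": 2.0, "prev_is_to": -1.0},
    "VERB": {"prev_is_determiner": -1.5, "prev_is_to": 2.5},
}


def hard_rule(features):
    """A hard if-then rule: returns exactly one answer, with no certainty."""
    return "VERB" if "prev_is_to" in features else "NOUN"


def soft_decision(features):
    """Score each tag by summing its feature weights, then normalize (softmax)."""
    scores = {
        tag: sum(w for f, w in weights.items() if f in features)
        for tag, weights in WEIGHTS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}


# The word "book" in "to book a flight": the previous word is "to".
print(hard_rule({"prev_is_to"}))
print(soft_decision({"prev_is_to"}))
```

Both functions prefer the verb reading here, but only the soft decision reports how strongly, which is the information a downstream component can use when combining evidence.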

Systems based on machine-learning algorithms have many advantages over hand-produced rules:

  • The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
  • Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules is extremely difficult, error-prone and time-consuming.
  • Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable. By contrast, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
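One simple way in which a statistical model stays usable on words it has never seen is smoothing. The sketch below applies add-one (Laplace) smoothing to a unigram model; the toy corpus and the function name `unigram_prob` are assumptions made for illustration, and practical systems typically use more refined smoothing methods.

```python
from collections import Counter

# Toy training corpus (invented for this example).
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
vocab_size = len(counts) + 1  # reserve one slot for any unseen word


def unigram_prob(word, smooth=True):
    """Relative frequency of `word`, optionally with add-one smoothing."""
    if smooth:
        return (counts[word] + 1) / (len(corpus) + vocab_size)
    return counts[word] / len(corpus)


print(unigram_prob("the", smooth=False), unigram_prob("the"))
print(unigram_prob("zebra", smooth=False), unigram_prob("zebra"))
```

Without smoothing, the unseen word receives probability zero, which would zero out any product of probabilities it takes part in; with smoothing it receives a small non-zero share at the cost of slightly lowering the estimates for seen words.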

Major evaluations and tasks

The following is a list of some of the most commonly researched tasks in NLP. Note that some of these tasks have direct real-world applications, while others are more commonly used as subtasks.

Though NLP tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.


Syntax

Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. “open, opens, opened, opening”) as separate words. In languages such as Turkish or Meitei, [13] a highly agglutinated Indian language, however, such an approach is not possible.
Part-of-speech tagging
Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or verb (“to book a flight”); “set” can be a noun, verb or adjective; and “out” can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization, and such inflection is not readily conveyed by the characters of the written text.
Parsing (see also: Stochastic grammar)
Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing, dependency parsing and constituency parsing. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG).
Sentence breaking (also known as sentence boundary disambiguation )
Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
Word segmentation
Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
Terminology extraction
The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
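For the word segmentation task described above, a classic baseline is greedy forward maximum matching against a word list. The following is a minimal sketch under stated assumptions: the tiny `LEXICON` is invented, and unspaced English stands in for a language written without word boundaries. It does not describe any particular production segmenter.

```python
# Hypothetical word list; a real system would need a large vocabulary.
LEXICON = {"the", "them", "theme", "me", "men", "eat", "ate", "meat", "table"}
MAX_LEN = max(len(w) for w in LEXICON)


def max_match(text):
    """Greedy forward maximum matching: take the longest known word at each step."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


print(max_match("theeat"))
print(max_match("themenate"))
```

On `themenate` the greedy strategy commits to `theme` first and strands a lone `n`, even though the lexicon allows the reading “the men ate”. This is one illustration of why segmentation requires more vocabulary and morphological knowledge than longest-match lookup provides.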


Semantics

Lexical semantics
What is the computational meaning of individual words in context?
Machine translation
Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed “AI-complete”, i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.
Named entity recognition (NER)
Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, irrespective of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
Natural language generation
Convert information from computer databases or semantic intents to readable human language.
Natural language understanding
Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended meaning from the multiple possible meanings that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. The introduction and creation of a language metamodel and ontology are efficient yet empirical solutions. An explicit formalization of natural language semantics, free of confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization. [14]
Optical character recognition (OCR)
Given an image representing printed text, determine the corresponding text.
Question answering
Given a human-language question, determine its answer. Typical questions have a specific right answer (such as “What is the capital of Canada?”), but sometimes open-ended questions are also considered (such as “What is the meaning of life?”). Recent works have looked at even more complex questions. [15]
Recognizing textual entailment
Given two text fragments, determine if one being true entails the other, entails the other’s negation, or allows the other to be either true or false. [16]
Relationship extraction
Given a chunk of text, identify the relationships between named entities (e.g. who is married to whom).
Sentiment analysis
Extract subjective information, usually from a set of documents, often using online reviews to determine “polarity” about specific objects. It is especially useful for identifying trends of public opinion in social media, for the purpose of marketing.
Topic segmentation and recognition
Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of each segment.
Word sense disambiguation
Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
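As a rough illustration of selecting a sense from context, the sketch below implements a simplified overlap-based (Lesk-style) heuristic: it picks the sense whose gloss shares the most words with the surrounding sentence. The two-sense inventory for “bank” and the stopword list are invented for this example; a real system would draw its glosses from a dictionary or a resource such as WordNet.

```python
# Hypothetical sense inventory: sense label -> gloss (definition text).
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a river or lake",
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "at", "on", "in"}


def content_words(text):
    """Lowercase, split on whitespace, strip simple punctuation, drop stopwords."""
    return {w.strip(".,") for w in text.lower().split()} - STOPWORDS


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))


print(disambiguate("She went to the bank to deposit money"))
print(disambiguate("They fished from the bank of the river"))
```

The heuristic relies on exact word overlap, so “deposit” does not match “deposits” in the gloss; ties are resolved in favor of the first sense listed. Both limitations are reasons practical systems go beyond this baseline.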


Discourse

Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.
Coreference resolution
Given a sentence or larger chunk of text, determine which words (“mentions”) refer to the same objects (“entities”). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called “bridging relationships” involving referring expressions. For example, in a sentence such as “He entered John’s house through the front door”, “the front door” is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John’s house.
Discourse analysis
This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).


Speech

Speech recognition
Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed “AI-complete” (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.
Speech segmentation
Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.

Natural language processing application program interfaces

  • IBM Watson
  • Google Cloud Natural Language API
  • Amazon Lex
  • Microsoft Cognitive Services
  • Facebook’s DeepText
  • Expert System’s Cogito
  • FriendlyData
  • Paralleldots
  • Lexalytics
  • Automated Insights
  • Indico
  • MeaningCloud
  • Rosette
  • WSC iMinds
  • Rasa NLU
  • DKPro Core

See also

  • Automated essay scoring
  • Biomedical text mining
  • Compound term processing
  • Computational linguistics
  • Computer-assisted reviewing
  • Controlled natural language
  • Deep learning
  • Deep linguistic processing
  • Foreign language reading aid
  • Foreign language writing aid
  • Information extraction
  • Information retrieval
  • Language technology
  • Latent Dirichlet allocation (LDA)
  • Latent semantic indexing
  • List of natural language processing toolkits
  • Native-language identification
  • Natural language programming
  • Natural language search
  • Query expansion
  • Reification (linguistics)
  • Semantic folding
  • Speech processing
  • Spoken dialogue system
  • Text-proofing
  • Text simplification
  • Thought vector
  • Truecasing
  • Question answering
  • Word2vec


References

  1. Alisa Kongthon, Sangkeettrakarn Chatchawal, Sarawoot Kongyoung and Choochart Haruechaiyasak (2009). Implementing an online help desk based on conversational agent. Proceedings of the International Conference on Management of Emerging Digital EcoSystems (MEDES ’09). ACM, New York, NY, USA. ISBN 978-1-60558-829-2, doi:10.1145/1643823.1643908
  2. Hutchins, J. (2005). “The history of machine translation in a nutshell”.
  3. Chomskyan linguistics encourages the investigation of “corner cases” that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for NLP. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called “poverty of the stimulus” argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.
  4. Goldberg, Yoav (2016). A Primer on Neural Network Models for Natural Language Processing. Journal of Artificial Intelligence Research 57 (2016) 345-420.
  5. Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press.
  6. Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu (2016). Exploring the Limits of Language Modeling.
  7. Do Kook Choe and Eugene Charniak (EMNLP 2016). Parsing as Language Modeling.
  8. Vinyals, Oriol, et al. (NIPS 2015).
  9. Mark Johnson. How the statistical revolution changes (computational) linguistics. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics.
  10. Philip Resnik. Four revolutions. Language Log, February 5, 2011.
  11. Winograd, Terry (1971). Procedures as a Representation for Data in a Computer Program for Understanding Natural Language.
  12. Roger C. Schank and Robert P. Abelson (1977). Scripts, plans, goals, and understanding: An inquiry into human knowledge structures.
  13. Kishorjit, N., Vidya Raj RK., Nirmal Y., and Sivaji B. (2012). “Manipuri Morpheme Identification”, Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), pages 95-108, COLING 2012, Mumbai, December 2012.
  14. Yucong Duan, Christophe Cruz (2011). Semantic Formalizing of Natural Language through Conceptualization from Existence. International Journal of Innovation, Management and Technology 2 (1), pp. 37-42.
  15. “Versatile question answering systems: seeing in synthesis”, Mittal et al., IJIIDS, 5 (2), 119-142, 2011.
  16. PASCAL Recognizing Textual Entailment Challenge (RTE-7).