A common goal in software development and Artificial Intelligence, especially when applied to Digital Marketing, is to develop software applications and services that are able to understand human languages. This area of expertise is commonly referred to as Natural Language Processing (NLP): the automatic manipulation of natural language, such as speech and text, by software.
What is Natural Language
A Natural Language is a language that people regularly use to communicate with each other, such as English, French, Italian or Spanish. Natural languages evolve within the context of a culture and often have a large number of variations, such as regional dialects or vocabularies used by a profession or subculture.
Although speakers of a language share a common vocabulary, they also have distinct regional accents, colloquialisms, and even entirely different dialects, words and phrases. For instance, travelling through England from South to North, you’ll undoubtedly encounter a huge variety of accents. In some areas, there is a noticeable difference in accent between neighboring towns and cities.
This phenomenon is not peculiar to English or the UK; it exists in most languages across the world, even in some of the smallest countries. For instance, in the Netherlands there are towns and regions that have their own distinct languages and dialects, although most people broadly speak Dutch.
This is one of the reasons why it is often as challenging for computers to understand human language as it is for humans to understand each other. Natural Language Processing is a Software Engineering discipline primarily focused on enabling humans and machines to communicate with each other in natural language. This enables machines to extract data and provide it to human beings, which is ultimately the primary focus of Natural Language Processing.
Benefits of Natural Language Processing
We live in a world where millions of gigabytes of data are generated by blogs, social websites, and web pages every day. Buried within all this data are untold business opportunities, discoveries and hidden insights. Developing the capability to mine this data and extract these opportunities is seen as a business advantage, and it is the area where NLP is of most use to business.
Implementations of NLP
There are a number of successful implementations of NLP within common services most of us use every day:
- Search Engines – Google, DuckDuckGo, Bing, Baidu and Yandex all implement elements of Natural Language Processing to provide effective search facilities to their users. Their web crawlers first index websites using NLP, and the search engines themselves then use NLP to understand the search intent of the user and deliver the desired results.
- Social Website feeds – News feed algorithms are used to understand your interests using Natural Language Processing in order to present you with related posts and advertisements of interest.
- Digital Personal Assistants – Devices and services like Amazon Echo, Google Assistant, Apple Siri and Microsoft Cortana all make use of NLP to convert human speech into data that a machine can act on, and vice versa.
- Spam Filters – Email spam filters have evolved to examine the content of an email to determine whether it should be considered spam.
How to get started with NLP and Python
Python is an ideal language to get started with NLP tasks, because a number of libraries have been developed to make many NLP tasks easier. There are also a number of frameworks available that make it easier to configure your laptop. You may want to read Getting Started with Python and Artificial Intelligence on Ubuntu and Set up Anaconda, Jupyter Notebook, Tensorflow for Deep Learning.
Using a tool like Jupyter Notebook or JupyterLab is a great way to get started with NLP tasks, although a basic understanding of the Python language is essential.
In this example, we are going to use Python, Jupyter Notebook and a number of Python libraries to do some basic web scraping and analyse the content of a web page to derive a basic understanding of what it is about.
A popular library for NLP tasks is NLTK (Natural Language Toolkit), which is developed in Python and has a big community behind it.
Common terms and concepts of NLP
Before we get started with developing our simple web scraping and analysis, we should cover the common terms and concepts in NLP. Gaining an understanding of these terms will assist you in future NLP projects.
- Text corpus or corpora – A corpus is a large set of text data in any Natural Language. A corpus may consist of a single document or an entire set of documents.
- Paragraph – A paragraph is the largest piece of text handled by an NLP task.
- Sentences – A sentence encapsulates meaning and thought in context.
- Phrases and Words – Phrases are groups of consecutive words within a sentence that convey a specific meaning.
- N-grams – A contiguous sequence of n characters or words.
- Bag-of-words – A representation that captures word occurrence frequencies in the text corpus, ignoring word order.
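To make the last two terms more concrete, here is a minimal sketch using only the Python standard library. The helper names `ngrams` and `bag_of_words` and the sample sentence are illustrative assumptions, not part of any library used later in this article.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


# Hypothetical sample sentence, split on whitespace.
tokens = "the cat sat on the mat".split()

bigrams = ngrams(tokens, 2)   # word-level 2-grams
bow = bag_of_words(tokens)    # token -> occurrence count

print(bigrams)
print(bow.most_common(1))
```

The bag-of-words count is the same idea the word-frequency step later in this walkthrough relies on.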
Install prerequisite libraries
In our sample application we’re going to make use of several libraries, which we should install into our conda environment.
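The exact install command is not reproduced here. As a rough check, assuming the walkthrough relies on Beautiful Soup (module `bs4`) and NLTK (module `nltk`), both of which are named later in this article, a notebook cell along these lines reports which of them are importable in the current environment. The helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Assumed module names for the libraries mentioned in this article.
REQUIRED_MODULES = ["bs4", "nltk"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required libraries are available.")
```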
Start a new Jupyter Notebook
To develop our simple application we are going to make use of Jupyter Notebook to iteratively develop and test. Once we have created a new notebook, we add the following lines of code.
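The original listing is not reproduced here, so what follows is a minimal sketch of the request step using only the standard library's `urllib`. The function name `fetch_html`, the `User-Agent` value and the placeholder `URL` are assumptions; substitute the address of the page you want to analyse.

```python
from urllib.request import Request, urlopen

# Hypothetical placeholder; replace with the page you want to analyse.
URL = "https://example.com/"


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as text."""
    # Some sites reject requests without a browser-like User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# In a notebook cell (needs network access):
# html = fetch_html(URL)
# print(html)
```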
The code above makes a request to a URL, in this case our page which details how to get started with Python on Ubuntu, and then writes the response out. If you run the code you’ll notice it writes out the entire response, including all the HTML tags.
We can tidy this up a little and prepare the content so we can easily parse it by using BeautifulSoup, a Python library for pulling data out of HTML and XML files. We can refactor our code as follows.
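As a sketch of that refactoring, assuming the `beautifulsoup4` package is installed, the snippet below strips the markup from an HTML string and keeps only the readable text. The function name `html_to_text` and the inline sample HTML are illustrative; in the notebook you would pass in the response body fetched earlier.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style elements contain code, not readable content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Hypothetical stand-in for a downloaded page.
sample_html = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><h1>Getting started</h1><p>Python on Ubuntu</p></body></html>"
)

text = html_to_text(sample_html)
print(text)
```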
We can now start the process of trying to understand the content of the page without actually reading it. The first step in this process is to tokenize the content, which means splitting a string into multiple pieces based on a delimiter.
We can do this simply by using the following code.
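A minimal sketch of whitespace tokenization with Python's built-in `str.split()`; the sample text is a hypothetical stand-in for the page content extracted in the previous step.

```python
# Hypothetical stand-in for the text extracted from the page.
text = "Getting started with Python and Artificial Intelligence on Ubuntu"

# With no arguments, split() breaks the string on any run of whitespace.
tokens = text.split()

print(tokens)
print(len(tokens))
```

Splitting on whitespace leaves punctuation attached to words. NLTK's `word_tokenize` handles punctuation more carefully, but it needs its tokenizer data downloaded first.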
We can now start to use functionality from NLTK to extract information from our content. In this first instance, let’s remove all the stop words from the content. Stop words are the most common words in a language, in this case words like ‘and’, ‘the’, ‘it’ and ‘but’. We remove them because we are going to count the most frequently recurring words in the content in order to get an idea of what the content is about, and stop words would otherwise dominate the count.
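The sketch below shows the idea under a few assumptions. It uses a small hand-picked `STOP_WORDS` set and `collections.Counter` so that it runs without downloading anything; the optional helper `nltk_english_stopwords` shows how NLTK's English stop word list could be loaded instead. The names `STOP_WORDS`, `nltk_english_stopwords`, `top_words` and the sample tokens are illustrative.

```python
from collections import Counter

# Small illustrative stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "but", "in", "it", "of", "on", "the", "to", "with"}


def nltk_english_stopwords():
    """Optional: load NLTK's English stop words (downloads data on first use)."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words("english"))


def top_words(tokens, stop_words=STOP_WORDS, n=5):
    """Return the n most frequent alphabetic tokens that are not stop words."""
    cleaned = [
        token.lower()
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]
    return Counter(cleaned).most_common(n)


# Hypothetical tokens standing in for the tokenized page content.
tokens = (
    "Getting started with Python on Ubuntu and getting started "
    "with Python and Artificial Intelligence"
).split()

print(top_words(tokens, n=3))
```

To use NLTK's list, call `top_words(tokens, stop_words=nltk_english_stopwords())`. NLTK's `FreqDist` offers the same `most_common` method as `Counter`.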
In the code above, we remove all the stop words and then count the frequency of the remaining words in an attempt to gauge what the article is about.
If we run the code, we will see the result.
Based on the most frequently recurring words, we could deduce that the primary content of the web page is based around Getting Started, Python, Artificial Intelligence and Ubuntu.
This is a rather simple and trivial implementation of the sort of algorithm search engines and web crawlers use when they crawl web pages and index them, although as you will appreciate they undoubtedly do a whole lot more.
We have covered the basics of NLP and seen a very simple example of how it could be used in search engines to derive the basic contents of a web page and determine its most relevant content.
This is just one implementation of NLP. There are many more examples of this hugely interesting aspect of Computer and Data Science, which is why NLP is one of the cornerstone tasks in Machine Learning and Artificial Intelligence.