It is a common requirement in AI or machine learning projects to extract information from unstructured data sources such as Word documents or other binary formats. You may also need to extract information from these sources to create training data for your machine learning models.
To train a machine learning model for a Named Entity Recognition or document classification task, for example, you may first want to convert your Word documents to plain text files so you can carry out a data annotation task, in which you identify the elements of the document you want to train your model to recognize.
This is a common requirement in Natural Language Processing (NLP) projects. I recommend reading Hands-On Natural Language Processing with Python to learn more about the tasks involved in a typical NLP project.
In this post, we will explore how to use Python to convert Word documents to text files in order to make use of the data they contain. We will specifically be using Jupyter Notebook in an Anaconda environment, so if you don't have Jupyter Notebook or Anaconda installed you may want to check out How to Set up Anaconda, Jupyter Notebook, Tensorflow for Deep Learning.
What is Jupyter Notebook?
Jupyter Notebook is an open-source web application that allows a user, scientific researcher, scholar or analyst to create and share documents called notebooks, containing live code, documentation, graphs, plots, and visualizations.
Jupyter Notebook supports more than 40 programming languages, including the most frequently used ones: Python, R and Julia, to name a few. It allows the user to download the notebook in various file formats such as PDF, HTML, Python, Markdown or an .ipynb file.
Jupyter Notebooks are one of the most important tools for data scientists using Python because they're an ideal environment to develop reproducible data analysis pipelines. Data can be loaded, transformed, and modeled all inside a single Notebook, where it's quick and easy to test out code and explore ideas along the way. Furthermore, all of this can be documented "inline" using formatted text, so you can make notes for yourself or even produce a structured report.
Another bonus is that they enable you to execute code as you go and see the results directly in your web browser. You can also easily share your code with others. It's a great tool to become familiar with and use on a daily basis.
A word of warning, though: Jupyter Notebook is not the place to write production-ready code. For that you'll most likely want to use an IDE like JetBrains PyCharm, which incidentally also enables you to write and execute Jupyter Notebook files.
Rename Files using Python
When working with files in a data science project, you may want to rename a file to something non-descript, for instance replacing a descriptive file name with nothing more than a UUID. Fortunately, Python makes this task extremely easy. We just need to import a couple of modules to enable this functionality.
If you haven't yet created a new Jupyter Notebook, go ahead and do so. In your first cell, import the modules we are going to use.
import os
import uuid
We have imported:
os, which provides a portable way of using operating-system-dependent functionality, enabling you to easily manipulate directory and file paths and even open files.
uuid, which provides a convenient way to generate Universally Unique IDs (UUIDs), a.k.a. GUIDs, which are IDs that are guaranteed to be unique.
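As a quick illustration of what uuid gives us (the file name here is just for demonstration):

```python
import uuid

# uuid4() generates a random 128-bit UUID; collisions are practically impossible
unique_filename = str(uuid.uuid4()) + ".docx"

# The string form of a UUID is always 36 characters: 32 hex digits and 4 hyphens
print(unique_filename)       # e.g. 'f47ac10b-58cc-4372-a567-0e02b2c3d479.docx'
print(len(unique_filename))  # 41 (36 for the UUID plus 5 for '.docx')
```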
We are now ready to write the code that will rename all the files in a directory. In my case, I created a directory named source alongside my Jupyter Notebook and populated it with a number of files to use as source training data.
source_directory = os.path.join(os.getcwd(), "source")

for filename in os.listdir(source_directory):
    file, extension = os.path.splitext(filename)
    unique_filename = str(uuid.uuid4()) + extension
    os.rename(os.path.join(source_directory, filename),
              os.path.join(source_directory, unique_filename))
We get the file path of the source directory by using os.getcwd() to obtain our current working directory and os.path.join() to concatenate the path to the source directory we created.
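To make those helper calls concrete, here is what they return on a sample file name (the name is just an example):

```python
import os

# splitext splits a name into (root, extension), keeping the dot on the extension
root, extension = os.path.splitext("report.docx")
print(root, extension)  # report .docx

# join builds a path using the separator appropriate for the operating system
path = os.path.join("source", "report.docx")
print(path.endswith("report.docx"))  # True
```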
Once we have this path, we use the os library to list all the files in the directory and iterate through them, changing their names. As you can tell, we've made extensive use of the os library for a number of the required tasks. It is a rich module and well worth exploring to uncover the additional functionality it provides.
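As an aside, the same rename loop can also be written with the standard library's pathlib module. This is just an alternative sketch, not part of the workflow above, and it uses a throwaway temporary directory so it can be run safely anywhere:

```python
import uuid
import tempfile
from pathlib import Path

# Demo setup: a throwaway directory with a couple of sample files
source_directory = Path(tempfile.mkdtemp())
(source_directory / "report.docx").write_text("sample")
(source_directory / "notes.txt").write_text("sample")

for path in source_directory.iterdir():
    if path.is_file():
        # with_name keeps the parent directory; suffix preserves the extension
        path.rename(path.with_name(str(uuid.uuid4()) + path.suffix))

# The extensions survive, but every file stem is now a 36-character UUID
print(sorted(p.suffix for p in source_directory.iterdir()))  # ['.docx', '.txt']
```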
In our case, the goal we wanted to achieve is done. If we run the cells and examine the files in the directory, we see that the names have all been changed to UUIDs.
Extracting text from Word Documents and Writing to Text Files with Python
We can now move on to our next objective. Before we continue, we will import an additional library that will help us to extract text from Word documents: textract, which enables developers to easily extract text from almost any document. I have previously installed textract into my Anaconda environment, so there is nothing more for me to do to use it. If you're using the Anaconda Navigator, it is easy to install textract into your environment. I won't be providing details on how to do this, but feel free to let me know in the comments if you need help.
I prefer to add all imports at the top of the file, as you would in a normal source file, but you can do it wherever you like. The top of my Jupyter Notebook now looks like the below. If you do this, remember to re-run your notebook from the top.
import os
import uuid
import textract
We can now write the code to iterate through all the files in the directory, extract the text and write the contents to a text file. In my case, I have created an additional directory in my workspace named training_data, to which I will write all the text files.
training_directory = os.path.join(os.getcwd(), "training_data")

for process_file in os.listdir(source_directory):
    file, extension = os.path.splitext(process_file)

    # Create a new text file name by appending the .txt extension to the file's UUID
    dest_file_path = file + '.txt'

    # Extract the text from the file
    content = textract.process(os.path.join(source_directory, process_file))

    # Create and open the new file in "wb" (write binary) mode,
    # since textract returns the content as bytes
    write_text_file = open(os.path.join(training_directory, dest_file_path), "wb")

    # Write the content and close the newly created file
    write_text_file.write(content)
    write_text_file.close()
You'll notice the process is similar to what we did previously. I have commented the code heavily to provide the additional detail.
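The reason for the "wb" mode is that textract.process() returns bytes rather than a str. If you would rather work with the text in Python first, you can decode it and write in text mode instead. The snippet below sketches that, using a stand-in bytes value in place of actual textract output so it runs without textract installed:

```python
import os
import tempfile

# Stand-in for what textract.process() returns: the extracted text as bytes
content = "Extracted text from a Word document".encode("utf-8")

# Decode the bytes so we can write in text mode ("w") instead of binary ("wb")
text = content.decode("utf-8")

out_path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(out_path, "w", encoding="utf-8") as write_text_file:
    write_text_file.write(text)

with open(out_path, encoding="utf-8") as read_back:
    print(read_back.read())  # Extracted text from a Word document
```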
If you are going to run this on Ubuntu Linux, you may need to install antiword, an application that extracts the text of MS Word documents and which is used by textract under the hood.
sudo apt install antiword
If you execute the code, you'll notice that the training_data directory is populated with the newly created text files.
Python makes it extremely easy to write scripts that automate mundane tasks. Using Anaconda and Jupyter Notebooks provides an easy and safe playground to try out such tasks before implementing them in production code.
This workflow helps with many of the exploratory tasks involved in a typical data science project. It is well worth taking the time to read Hands-On Data Science with Anaconda: Utilize the right mix of tools to create high-performance data science applications to further explore how Anaconda and Jupyter Notebooks can help you in your data science projects.