User Guide

The User Guide

The User Guide documentation begins with some background information about NeatText, then focuses on step-by-step instructions for getting the most out of NeatText.

Getting Started

Installation of NeatText

NeatText is available on PyPI, so you can install it with pip:

pip install neattext

or, to install it for a specific Python interpreter:

python3 -m pip install neattext

Quick Start

After installing NeatText from PyPI, you can use it in two main ways: the object-oriented (OOP) way or the method/function-oriented (MOP) way.

Usage via the OOP Way (Object Oriented Way)

NeatText comes with three main classes for cleaning and preprocessing text. These classes are:

TextCleaner: for cleaning text by either removing or replacing specific noise, e.g. emails, special characters, numbers, URLs, and emojis

TextExtractor: for extracting certain terms from a text or document

TextMetrics: for checking basic word statistics or metrics such as the count of vowels, consonants, stopwords, etc.

>>> from neattext import TextCleaner,TextExtractor,TextMetrics
>>> docx = TextCleaner()
>>> docx.text = "your text goes here"
>>> docx.clean_text()

Usage via the OOP way - Object Oriented Way (General usage)

Text Preprocessing

Preprocess texts and clean text

>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx = nt.TextFrame(mytext)
>>> docx.describe()
Key      Value          
Length  : 73             
vowels  : 21             
consonants: 34             
stopwords: 4              
punctuations: 8              
special_char: 8              
tokens(whitespace): 10             
tokens(words): 14 
>>>
>>> docx.head(16)
'This is the mail'
>>> docx.tail(16)
'//example.com 😊.'
>>> 
>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com 😊.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '
>>> docx.remove_emojis()

Simple NLP Task

You can also perform some basic natural language processing (NLP) tasks such as tokenization, n-grams, text generation, etc.

>>> docx.word_tokens()

Clean Text using the Method Oriented Approach

Clean text by removing emails, numbers, stopwords, emojis, etc.
The clean_text function cleans a text in a single call; you specify what to remove by setting each option to True or False.

>>> from neattext.functions import clean_text
>>> 
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> 
>>> clean_text(mytext)
'mail example@gmail.com ,our website https://example.com .'

You can remove punctuation, stopwords, URLs, emojis, multiple_whitespaces, etc. by setting the corresponding option to True.
For example, set puncts to True to remove punctuation, or to False to keep it:

>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>> 
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>

You can switch the other options on or off in the same way:

>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>>
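
The flag-driven style of clean_text can be illustrated with a small standard-library sketch. This is not NeatText's implementation: the function name sketch_clean_text, the regular expressions, the tiny stopword set, and the lowercase flag are all assumptions made for illustration, and its defaults differ from the library's.

```python
import re

# Hypothetical patterns and stopword list; NeatText's own may differ.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
STOPWORDS = {"this", "is", "the", "our"}


def sketch_clean_text(text, urls=False, emails=False, stopwords=False, lowercase=True):
    """Apply only the cleaning steps whose flag is True."""
    if lowercase:
        text = text.lower()
    if urls:
        text = URL_RE.sub("", text)
    if emails:
        text = EMAIL_RE.sub("", text)
    if stopwords:
        # Whitespace tokenization: ",our" stays because it is not an exact stopword.
        text = " ".join(w for w in text.split() if w not in STOPWORDS)
    return text


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com"
cleaned = sketch_clean_text(sample, urls=True, stopwords=True)
untouched = sketch_clean_text(sample, lowercase=False)
```

Each flag guards one independent step, so adding a new kind of noise only requires one more keyword argument and one more `if` branch.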

Remove Punctuations [A Very Common Text Preprocessing Step]

By default (most_common=True), remove_puncts removes only the most common punctuation marks, such as full stops, commas, exclamation marks, and question marks.
Alternatively, set most_common=False to remove all known punctuation from a text.

>>> import neattext as nt 
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ")

>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ")
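
The difference between the two modes can be sketched with the standard library alone. This is only an approximation of the idea, not the library's code: the helper name sketch_remove_puncts and the exact set of "most common" marks are assumptions, and NeatText's own set may include other characters.

```python
import string

# Assumption: "most common" means these four marks; the library's set may differ.
MOST_COMMON_PUNCTS = ".,!?"


def sketch_remove_puncts(text, most_common=True):
    """Strip either a small set of common marks or every ASCII punctuation mark."""
    targets = MOST_COMMON_PUNCTS if most_common else string.punctuation
    # str.maketrans with a third argument maps each listed character to None.
    return text.translate(str.maketrans("", "", targets))


sample = "Mail example@gmail.com, site https://example.com!!!"
common_only = sketch_remove_puncts(sample)
all_puncts = sketch_remove_puncts(sample, most_common=False)
```

With the narrower set, characters such as `@`, `:` and `/` survive, so emails and URLs stay recognizable; removing all punctuation fuses them into plain words.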

Remove Emails, Numbers, Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emails()
'This is the mail  ,our WEBSITE is https://example.com 😊.'
>>>
>>> docx.remove_stopwords()
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()

Remove Special Characters

>>> docx.remove_special_characters()

Remove Emojis

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emojis()
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'

Replace Emails, Numbers, Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()

Using TextExtractor

To extract emails, phone numbers, numbers, URLs, and emojis from text:

>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['😊']
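
Extraction of this kind is typically pattern based. The sketch below shows the general idea with two deliberately simple regular expressions; the patterns and the function name sketch_extract are hypothetical and are not taken from NeatText, whose matching rules may be stricter or looser.

```python
import re

# Deliberately simple, hypothetical patterns; real-world matching is messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s,]+")


def sketch_extract(text):
    """Return the emails and URLs found in text, in order of appearance."""
    return {"emails": EMAIL_RE.findall(text), "urls": URL_RE.findall(text)}


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com now"
found = sketch_extract(sample)
```

Both patterns use non-capturing groups only, so `findall` returns the full matched strings rather than group fragments.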

Using TextMetrics

To find word statistics such as the counts of vowels, consonants, and stopwords, or overall word stats:

>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
>>> docx.memory_usage()
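
As a rough illustration of what such counts involve, here is a standard-library sketch. The function name sketch_text_metrics and the decision to count only ASCII letters are assumptions; the library may treat digits, accented letters, or symbols differently.

```python
VOWELS = set("aeiou")


def sketch_text_metrics(text):
    """Count vowels and consonants among the alphabetic ASCII characters."""
    letters = [ch for ch in text.lower() if ch.isascii() and ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    return {"vowels": vowels, "consonants": len(letters) - vowels}


stats = sketch_text_metrics("Neat text!")
```

Here the letters are n, e, a, t, t, e, x, t, which gives 3 vowels and 5 consonants; the space and the exclamation mark are ignored.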

Usage via the MOP (Method/Function Oriented Way)

If you are a fan of functions, you can also use NeatText that way through the functions sub-package. In that case, import the functions you need like this:

>>> from neattext.functions import remove_emails,remove_emojis,clean_text

You can also import the sub-package under an alias:

>>> import neattext.functions as ntf
>>> ntf.remove_emails(your_text)
>>>

>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> extract_emails(t1)
['example@gmail.com']

Alternatively you can also use this approach

>>> import neattext.functions as nfx 
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> nfx.clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> nfx.extract_emails(t1)
['example@gmail.com']

Pipeline Approach using TextPipeline

TextPipeline is a newer feature (available from version 0.1.2) that introduces the concept of a pipeline.
It operates like the clean_text function, but instead of setting options you pass, as steps, the group of functions to apply to a given text.

>>> from neattext.pipeline import TextPipeline
>>> from neattext.functions import remove_emails,remove_numbers,remove_emojis
>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU"""

>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis])
>>> p.transform(t1)
'This is the mail  ,our WEBSITE is https://example.com . This is visa     and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard    . Send it to PO Box , KNU'

Check the steps and named steps

>>> p.steps
>>> p.named_steps
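
The pipeline idea, applying a list of text-to-text functions in order, can be sketched in a few lines of plain Python. The class SketchPipeline and the helpers drop_emails and drop_digits below are hypothetical stand-ins, not NeatText's TextPipeline or its functions, and the real steps and named_steps attributes may return differently shaped values.

```python
import re


def drop_emails(text):
    """Remove anything that looks like an email address (simplified pattern)."""
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "", text)


def drop_digits(text):
    """Remove every decimal digit, leaving surrounding whitespace untouched."""
    return re.sub(r"\d", "", text)


class SketchPipeline:
    """Apply a list of text -> text functions in order."""

    def __init__(self, steps):
        self.steps = list(steps)

    @property
    def named_steps(self):
        # Names are taken from each function's __name__ attribute.
        return [step.__name__ for step in self.steps]

    def transform(self, text):
        for step in self.steps:
            text = step(text)
        return text


pipe = SketchPipeline(steps=[drop_emails, drop_digits])
result = pipe.transform("Mail me@example.com about box 555")
```

Because each step only sees the output of the previous one, order matters: removing digits before emails would alter an address such as user1@example.com before the email pattern is applied to it.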