The User Guide

The User Guide documentation begins with some background information about NeatText, then focuses on step-by-step instructions for getting the most out of NeatText.

Getting Started

Installation of NeatText

  • NeatText is available on PyPI, so you can install it with pip as follows:
pip install neattext

or, to install it for a specific Python interpreter:

python3 -m pip install neattext
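
To confirm that the install succeeded, you can check that the package is importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and is not part of NeatText.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# "neattext" is the import name used throughout this guide.
print("neattext installed:", is_installed("neattext"))
```

If this prints `False`, pip most likely installed the package for a different interpreter than the one you are running.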

Quick Start

After installing NeatText from PyPI, you can use it in two main ways: the object-oriented (OOP) way or the function-oriented (method-oriented) way.

Usage via the OOP Way (Object Oriented Way)

  • NeatText comes with 3 main classes for cleaning text and doing your text preprocessing. These classes are:

TextCleaner: For cleaning text by either removing or replacing specific noise, e.g. emails, special characters, numbers, URLs, emojis

TextExtractor: For extracting certain terms from a text or document

TextMetrics: For checking some basic word statistics or metrics such as the count of vowels, consonants, stopwords, etc.

>>> from neattext import TextCleaner,TextExtractor,TextMetrics
>>> docx = TextCleaner()
>>> docx.text = "your text goes here"
>>> docx.clean_text()

Usage via the OOP way - Object Oriented Way (General usage)

Text Preprocessing

  • Preprocess texts and clean text
>>> import neattext as nt
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx = nt.TextFrame(mytext)
>>> docx.describe()
Key      Value          
Length  : 73             
vowels  : 21             
consonants: 34             
stopwords: 4              
punctuations: 8              
special_char: 8              
tokens(whitespace): 10             
tokens(words): 14 
>>>
>>> docx.head(16)
'This is the mail'
>>> docx.tail(16)
'//example.com 😊.'
>>> 
>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com 😊.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '
>>> docx.remove_emojis()
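
The difference between the two normalization levels shown above can be pictured with a small standard-library sketch. This is not NeatText's implementation: `normalize_sketch` is a hypothetical helper, and the character class used for the deep level is an assumption chosen to reproduce the two outputs printed above.

```python
import re


def normalize_sketch(text: str, level: str = "shallow") -> str:
    """Lowercase the text; at level='deep' also drop every character
    that is not a lowercase ASCII letter, a digit, or a space."""
    lowered = text.lower()
    if level == "deep":
        return re.sub(r"[^a-z0-9 ]", "", lowered)
    return lowered


mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
print(normalize_sketch(mytext))
print(normalize_sketch(mytext, level="deep"))
```

Note that the deep level also strips the `@`, `:` and `/` characters, so emails and URLs are no longer recognizable afterwards. Extract them first if you need them.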

Simple NLP Task

You can also do some basic natural language processing tasks such as tokenization, n-grams, text generation, etc.

>>> docx.word_tokens()
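
As a rough illustration of what tokenization and n-grams produce, here is a standard-library sketch. The function names are hypothetical and the whitespace-only tokenizer is a simplifying assumption; it is not how NeatText tokenizes.

```python
from typing import List, Tuple


def word_tokens_sketch(text: str) -> List[str]:
    """Split on whitespace; real tokenizers usually handle punctuation too."""
    return text.split()


def ngrams_sketch(tokens: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = word_tokens_sketch("our website is online")
print(tokens)
print(ngrams_sketch(tokens, n=2))
```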

Clean Text using the Method Oriented Approach

  • Clean text by removing emails, numbers, stopwords, emojis, etc.
  • clean_text is a simple function for cleaning text: for each kind of noise, you specify True or False to say whether it should be removed.
>>> from neattext.functions import clean_text
>>> 
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> 
>>> clean_text(mytext)
'mail example@gmail.com ,our website https://example.com .'
  • You can remove punctuation, stopwords, URLs, emojis, multiple_whitespaces, etc. by setting the corresponding argument to True.
  • For example, you can remove or keep punctuation by setting puncts to True or False respectively
>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>> 
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>> 
  • You can also remove other unwanted items in the same way
>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>> 
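
The flag-driven design above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `clean_text_sketch`, the tiny stopword set, and the URL pattern are all hypothetical, and only two of the flags are modelled.

```python
import re

# Tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"this", "is", "the", "our"}
URL_RE = re.compile(r"https?://\S+")


def clean_text_sketch(text: str, stopwords: bool = True, urls: bool = False) -> str:
    """Lowercase the text, then strip each kind of noise whose flag is True."""
    result = text.lower()
    if urls:
        result = URL_RE.sub("", result)
    if stopwords:
        # Splitting and re-joining also collapses repeated whitespace.
        kept = [word for word in result.split() if word not in STOPWORDS]
        result = " ".join(kept)
    return result


sample = "This is the page https://example.com for OUR docs"
print(clean_text_sketch(sample))
print(clean_text_sketch(sample, urls=True))
print(clean_text_sketch(sample, stopwords=False))
```

Keeping each kind of noise behind its own boolean means a caller can switch one step on or off without touching the others.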

Remove Punctuations [A Very Common Text Preprocessing Step]

  • You can remove the most common punctuation marks (full stop, comma, exclamation mark and question mark) by setting most_common=True, which is the default
  • Alternatively, you can remove all known punctuation from a text by setting most_common=False.
>>> import neattext as nt 
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ")

>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ")
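
A minimal sketch of the two modes, using only the standard library. The helper name and the contents of `COMMON_PUNCTS` are assumptions made for illustration; the library's own set may differ (the first output above also drops the apostrophe, for instance).

```python
import string

# Assumed "most common" set, for illustration only.
COMMON_PUNCTS = ".,!?"


def remove_puncts_sketch(text: str, most_common: bool = True) -> str:
    """Delete either the common subset or every ASCII punctuation character."""
    targets = COMMON_PUNCTS if most_common else string.punctuation
    return text.translate(str.maketrans("", "", targets))


sample = "Mail me at a@b.com, please!!!"
print(remove_puncts_sketch(sample))
print(remove_puncts_sketch(sample, most_common=False))
```

With `most_common=False` the `@` sign goes too, which is why the email address is no longer recognizable in the second output.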

Remove Emails,Numbers,Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emails()
'This is the mail  ,our WEBSITE is https://example.com 😊.'
>>>
>>> docx.remove_stopwords()
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()
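
Removal methods of this kind are commonly built on regular-expression substitution. The sketch below shows the idea with two deliberately simple patterns; the helper names and both patterns are assumptions, not the expressions NeatText uses.

```python
import re

# Deliberately simple patterns; production-grade matching is more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NUMBER_RE = re.compile(r"\d+")


def remove_emails_sketch(text: str) -> str:
    """Delete anything that looks like an email address."""
    return EMAIL_RE.sub("", text)


def remove_numbers_sketch(text: str) -> str:
    """Delete every run of digits."""
    return NUMBER_RE.sub("", text)


sample = "Write to example@gmail.com before 2024"
print(repr(remove_emails_sketch(sample)))
print(repr(remove_numbers_sketch(sample)))
```

Removal leaves the surrounding whitespace in place, so you may see double or trailing spaces in the result.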

Remove Special Characters

>>> docx.remove_special_characters()

Remove Emojis

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emojis()
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
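
One common way to strip emojis is to match a handful of Unicode blocks. The following sketch assumes a small, non-exhaustive set of ranges and a hypothetical helper name; it is not the pattern NeatText ships.

```python
import re

# Covers a few common emoji blocks only; this is not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emojis_sketch(text: str) -> str:
    """Delete every character that falls in one of the ranges above."""
    return EMOJI_RE.sub("", text)


print(remove_emojis_sketch("our WEBSITE is https://example.com 😊."))
```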

Replace Emails,Numbers,Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
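
Replacing differs from removing only in what is substituted: a placeholder token takes the place of the match, so downstream code can still see that something was there. In the sketch below the helper names, the `replace_with` parameter, and the `<EMAIL>` and `<NUMBER>` tokens are all assumptions made for illustration.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NUMBER_RE = re.compile(r"\d+")


def replace_emails_sketch(text: str, replace_with: str = "<EMAIL>") -> str:
    """Swap each email-like match for a placeholder token."""
    return EMAIL_RE.sub(replace_with, text)


def replace_numbers_sketch(text: str, replace_with: str = "<NUMBER>") -> str:
    """Swap each run of digits for a placeholder token."""
    return NUMBER_RE.sub(replace_with, text)


sample = "Write to example@gmail.com before 2024"
print(replace_emails_sketch(sample))
print(replace_numbers_sketch(sample))
```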

Using TextExtractor

  • To extract emails, phone numbers, numbers, URLs and emojis from text
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['😊']
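
Extraction can be pictured as the mirror image of removal: the matches are collected rather than deleted. This sketch uses `re.findall` with simplified, assumed patterns and hypothetical helper names.

```python
import re
from typing import List

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://[^\s,]+")


def extract_emails_sketch(text: str) -> List[str]:
    """Return every email-like substring, in order of appearance."""
    return EMAIL_RE.findall(text)


def extract_urls_sketch(text: str) -> List[str]:
    """Return every http(s) URL, in order of appearance."""
    return URL_RE.findall(text)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
print(extract_emails_sketch(sample))
print(extract_urls_sketch(sample))
```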

Using TextMetrics

  • To find word statistics such as counts of vowels, consonants and stopwords
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
>>> docx.memory_usage()
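
To give a feel for what such metrics measure, here is a standard-library sketch. The helper names, the ASCII-only vowel set, and the tiny stopword list are assumptions; NeatText's own counting rules and return types may differ.

```python
from collections import Counter

VOWELS = set("aeiou")
# Tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"this", "is", "the", "our"}


def vowel_counts_sketch(text: str) -> Counter:
    """Count each ASCII vowel, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in VOWELS)


def consonant_total_sketch(text: str) -> int:
    """Count ASCII letters that are not vowels."""
    return sum(
        1 for ch in text.lower()
        if ch.isascii() and ch.isalpha() and ch not in VOWELS
    )


def stopword_total_sketch(text: str) -> int:
    """Count whitespace-separated words found in the stopword list."""
    return sum(1 for word in text.lower().split() if word in STOPWORDS)


sample = "This is the mail"
print(vowel_counts_sketch(sample))
print(consonant_total_sketch(sample))
print(stopword_total_sketch(sample))
```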

Usage via the MOP (Method/Function Oriented Way)

If you are a fan of functions, you can also use NeatText that way through the functions sub-package. In that case, import the functions you need like this:

>>> from neattext.functions import remove_emails,remove_emojis,clean_text

You can also import the sub-package under an alias.

>>> import neattext.functions as ntf
>>> ntf.remove_emails(your_text)
>>>
>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> extract_emails(t1)
['example@gmail.com']
  • Alternatively you can also use this approach
>>> import neattext.functions as nfx 
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> nfx.clean_text(t1,puncts=True,stopwords=True)
'this mail examplegmailcom website httpsexamplecom'
>>> nfx.extract_emails(t1)
['example@gmail.com']

Pipeline Approach using TextPipeline

  • This feature (available from version 0.1.2) introduces the concept of a pipeline.
  • TextPipeline operates like the clean_text function, but here you specify, as steps, the group of functions you want to apply to clean a given text.
>>> from neattext.pipeline import TextPipeline
>>> from neattext.functions import remove_emails,remove_numbers,remove_emojis
>>> t1 = """This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊. This is visa 4111 1111 1111 1111 and bitcoin 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 with mastercard 5500 0000 0000 0004. Send it to PO Box 555, KNU"""

>>> p = TextPipeline(steps=[remove_emails,remove_numbers,remove_emojis])
>>> p.transform(t1)
'This is the mail  ,our WEBSITE is https://example.com . This is visa     and bitcoin BvBMSEYstWetqTFnAumGFgxJaNVN with mastercard    . Send it to PO Box , KNU'

  • Check the steps and named steps
>>> p.steps
>>> p.named_steps
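
The pipeline idea itself is small enough to sketch with the standard library. The class below is a hypothetical stand-in, not NeatText's TextPipeline: in particular, the assumption that `named_steps` maps each function's name to the function is mine, and the two step functions are invented for the example.

```python
import re
from typing import Callable, Dict, List

Step = Callable[[str], str]


class TextPipelineSketch:
    """Apply a list of text -> text functions in order."""

    def __init__(self, steps: List[Step]):
        self.steps = list(steps)

    @property
    def named_steps(self) -> Dict[str, Step]:
        """Map each step's function name to the function itself."""
        return {step.__name__: step for step in self.steps}

    def transform(self, text: str) -> str:
        """Feed the output of each step into the next one."""
        for step in self.steps:
            text = step(text)
        return text


def remove_numbers_sketch(text: str) -> str:
    return re.sub(r"\d+", "", text)


def lowercase_sketch(text: str) -> str:
    return text.lower()


p = TextPipelineSketch(steps=[remove_numbers_sketch, lowercase_sketch])
print(p.transform("PO Box 555, KNU"))
print(list(p.named_steps))
```

Because each step is an ordinary function from string to string, the order of `steps` matters: a step that strips punctuation before an email remover runs would leave nothing for the email pattern to match.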