• Blueeye.ai Review - Data Labeling on online Data Platform

  • Short update on benchmarking popular Vietnamese NLP tools

    This is a follow-up to my previous review, prompted by new versions of these tools that can significantly change speed and accuracy. The most notable change is in Underthesea, whose segmentation speed has vastly increased, as you will see in my benchmark below. We still use the UD_Vietnamese-VTB dataset for this benchmark, which comes with its limitations, so take the accuracy results with a grain of salt.

  • NLP Benchmarking popular Vietnamese tokenizers

    I started this when I tried to build a Vietnamese chatbot for a property company. Natural language processing for Vietnamese is not that different from English, since both use alphabetical characters, a dot to end a sentence and semicolons to separate sentences. The main difference is that Vietnamese can use 2 or 3 space-separated words (syllables) to form a single noun, so it relies heavily on accurate word segmentation. Current Vietnamese annotators claim to achieve 95% accuracy on large datasets covering segmentation, POS tagging and entity tagging, which I think is very good.
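
To illustrate why segmentation matters, here is a minimal sketch comparing plain whitespace splitting with a greedy longest-match segmenter. The `segment` function and the tiny `LEXICON` are hypothetical, made up for this example. This is not how Underthesea or the other benchmarked tools work internally; it only shows the shape of the problem. Joining syllables with underscores is a common output convention for Vietnamese segmenters, used here as an assumption.

```python
# Hypothetical toy lexicon of multi-syllable words (illustration only).
LEXICON = {"bất động sản", "công ty", "khách hàng", "tư vấn"}


def segment(syllables, lexicon, max_len=3):
    """Greedy longest-match segmentation over whitespace-split syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                # Join the syllables of one word with underscores.
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


sentence = "Công ty bất động sản tư vấn khách hàng"
naive = sentence.split()             # one token per syllable
segmented = segment(naive, LEXICON)  # one token per word

print(len(naive), naive)
print(len(segmented), segmented)
```

Whitespace splitting yields nine tokens, while the segmenter groups them into four words. A greedy dictionary lookup like this breaks down on ambiguous or out-of-vocabulary input, which is one reason the dedicated tools are worth benchmarking.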