The Commons


Patent Title: Natural language determination using partial words

Assignee: IBM
Patent Number: US6216102
Issue Date: 04-10-2001
Application Number:
File Date: 09-30-1996


Abstract: The short and truncated words of a document are compared to word tables of the most frequently used words in each respective candidate language to identify the language in which the document is written. First, a plurality of words from a document is read into a computer memory. Then, words within the plurality of words which exceed a predetermined length are truncated to produce a set of short and truncated words. The set of short and truncated words is compared to words in a plurality of word tables. Each word table is associated with, and contains a selection of the most frequently used words in, a respective candidate language. Although the most frequently used words in most languages tend to be short, those which exceed the predetermined length may be truncated in the word tables. A respective count for each candidate language is incremented each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language. In some embodiments, the count may be weighted by factors related to the frequency of occurrence of the words in the respective candidate languages. The language of the document is identified as the language associated with the count having the highest value.
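
The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the truncation length of 4, the tiny word tables, and the names `MAX_LEN`, `RAW_TABLES`, `truncate`, and `identify_language` are all assumptions made for illustration. Real word tables would hold many more of each language's most frequent words, and the optional `weights` argument is a hypothetical stand-in for the frequency-based weighting mentioned for some embodiments.

```python
import re
from collections import Counter

MAX_LEN = 4  # assumed "predetermined length"; the abstract gives no value

# Hypothetical, very small word tables of frequent words per language.
RAW_TABLES = {
    "english": ["the", "of", "and", "to", "in", "that", "is", "which", "with"],
    "german": ["der", "die", "und", "in", "den", "von", "zu", "das", "nicht"],
    "french": ["de", "la", "le", "et", "les", "des", "en", "un", "que"],
}


def truncate(word, max_len=MAX_LEN):
    """Keep at most max_len leading characters of a word."""
    return word[:max_len]


# Words in the tables are truncated the same way as document words.
WORD_TABLES = {
    lang: {truncate(w) for w in words} for lang, words in RAW_TABLES.items()
}


def identify_language(text, tables=WORD_TABLES, weights=None):
    """Return (best_language, counts) for the given text.

    weights, if given, maps language -> {truncated word: weight};
    unlisted words count as 1.0. Without weights every match counts as 1.
    Ties go to the language listed first in `tables`.
    """
    counts = Counter({lang: 0 for lang in tables})
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        key = truncate(word)
        for lang, table in tables.items():
            if key in table:
                if weights:
                    counts[lang] += weights.get(lang, {}).get(key, 1.0)
                else:
                    counts[lang] += 1
    best = max(counts, key=counts.get)
    return best, dict(counts)


if __name__ == "__main__":
    print(identify_language("The cat is in the house that belongs to the king"))
    print(identify_language("Der Hund und die Katze sind nicht in dem Haus"))
```

Because both the document words and the table entries are truncated to the same length, a long frequent word such as "which" still matches through its prefix "whic". A word shared by several tables, such as "in", increments every matching language's count.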

Notes:


IBM Pledge dated 1/11/2005