Core-term

I. Introduction

Many n-grams have punctuation at the beginning and/or the end, such as:

  • - in details,
  • - in details
  • in details,
  • in (5) details,
  • (in (5) details,
  • (in (5) details),

All of the above n-grams are normalized to "in details" or "in (5) details" by stripping the initial and/or final punctuation. The normalized term is called the core-term because it is the core of the term. This process is called core-term normalization.

A core-term might retain internal punctuation, such as "in (5) details". Initial and/or final punctuation might also remain in a core-term when it belongs to the term, such as "clean room(s)".

II. Algorithm

Recursively repeat the following process until the term does not change or its length = 0:

  • Strip initial chars if they are punctuation, except for left (opening) brackets, such as ([{<
  • Strip final chars if they are punctuation, except for right (closing) brackets, such as )]}>
  • Strip enclosing brackets (), [], {}, <> at both ends (initial and final position):
    • Strip both the leading and ending chars if they are a matching bracket pair and the net bracket no* = 0
    • Strip the leading char if it is a left bracket and the net bracket no* > 0
    • Strip the ending char if it is a right bracket and the net bracket no* < 0
  • Trim

* net bracket no = total left bracket no - total right bracket no
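The net bracket number can be computed by simple character counting. The following is a minimal Python sketch; the function name `net_bracket_no` and the restriction to the four bracket pairs (), [], {}, <> are assumptions made for illustration.

```python
LEFT_BRACKETS = "([{<"
RIGHT_BRACKETS = ")]}>"


def net_bracket_no(term: str) -> int:
    """Return (total left brackets) - (total right brackets) in term."""
    left = sum(term.count(ch) for ch in LEFT_BRACKETS)
    right = sum(term.count(ch) for ch in RIGHT_BRACKETS)
    return left - right


print(net_bracket_no("(in (5) details"))  # 1
print(net_bracket_no("in (5) details)"))  # -1
```

A positive value means there are unmatched left brackets, a negative value means there are unmatched right brackets, and 0 means the counts balance.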

For example,

Term                | Net Bracket No
(in details:)       | 0
(in (5) details:)   | 0
(in (5) details     | 1
in (5) details)     | -1
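The steps in this section can be sketched in Python as follows. This is one illustrative reading of the algorithm, not the reference implementation: the names `core_term` and `_one_pass` are hypothetical, "punctuation" is assumed to mean the ASCII characters in `string.punctuation`, and the recursion is written as a loop.

```python
import string

LEFT = "([{<"
RIGHT = ")]}>"
PAIRS = dict(zip(LEFT, RIGHT))  # '(' -> ')', '[' -> ']', ...


def _net(term: str) -> int:
    """Net bracket no: total left brackets minus total right brackets."""
    return sum(term.count(c) for c in LEFT) - sum(term.count(c) for c in RIGHT)


def _one_pass(term: str) -> str:
    """Apply the four steps once (assumed reading of the algorithm)."""
    # Step 1: strip leading punctuation, but keep left brackets.
    start = 0
    while (start < len(term) and term[start] in string.punctuation
           and term[start] not in LEFT):
        start += 1
    term = term[start:]

    # Step 2: strip trailing punctuation, but keep right brackets.
    end = len(term)
    while (end > 0 and term[end - 1] in string.punctuation
           and term[end - 1] not in RIGHT):
        end -= 1
    term = term[:end]

    # Step 3: strip brackets at the ends, guided by the net bracket no.
    if term:
        net = _net(term)
        if net == 0 and len(term) >= 2 and PAIRS.get(term[0]) == term[-1]:
            term = term[1:-1]   # matching pair around the term
        elif net > 0 and term[0] in LEFT:
            term = term[1:]     # surplus left bracket
        elif net < 0 and term[-1] in RIGHT:
            term = term[:-1]    # surplus right bracket

    # Step 4: trim surrounding whitespace.
    return term.strip()


def core_term(term: str) -> str:
    """Repeat the pass until the term stops changing or becomes empty."""
    while term:
        stripped = _one_pass(term)
        if stripped == term:
            break
        term = stripped
    return term


print(core_term("(-(in details)%^)"))   # in details
print(core_term("{in (5)} details}}"))  # {in (5)} details
print(core_term("((clean room(s)))"))   # clean room(s)
```

Each pass either shortens the term or leaves it unchanged, so the loop always terminates.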

III. Examples

Input n-gram         | Core-term

Strip punctuation
-in details          | in details
In details:          | In details
#$%IN DETAILS:%^(    | IN DETAILS
(                    | ""
()                   | ""

Strip brackets
{in (5) details}     | in (5) details
{{in (5) details}    | in (5) details
{in (5) details}}    | in (5) details
{in (5)} details}}   | {in (5)} details

Strip brackets and punctuation
(in details:)        | in details
(in details:))       | in details
(-(in details)%^)    | in details
{in (5) days},       | in (5) days
in (5 days),         | in (5 days)
in ((5) days),       | in ((5) days)
((clean room(s)))    | clean room(s)
((inch(es)))         | inch(es)