
How do we correct a user who typed "malako" instead of "moloko" (milk)?


For English there has long been the soundex algorithm, which assigns the same index to strings that sound alike.

The soundex function is built into both MySQL and PHP. Does it work for Russian words? No, it only works with words written in Latin letters. However, nothing prevents us from writing our Russian words in Latin letters!

That is, we transliterate the word (not with the standard transliteration, but with one that is closer to how the word sounds) and can then apply soundex right away.
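
The idea can be sketched as follows. This is a minimal illustration, not the article's own code: the function names ru_translit and ru_soundex and the whole transliteration table are assumptions (the article does not give its table), and mb_strtolower requires the mbstring extension. Only soundex itself is the built-in PHP function mentioned above.

```php
<?php
// Hypothetical phonetic transliteration table. Every mapping here is an
// assumption made for illustration; tune it to how the words actually sound.
function ru_translit(string $word): string
{
    static $map = [
        'а' => 'a',  'б' => 'b',  'в' => 'v',  'г' => 'g',   'д' => 'd',
        'е' => 'e',  'ё' => 'o',  'ж' => 'zh', 'з' => 'z',   'и' => 'i',
        'й' => 'j',  'к' => 'k',  'л' => 'l',  'м' => 'm',   'н' => 'n',
        'о' => 'a',  // unstressed "о" is often heard (and mistyped) as "а"
        'п' => 'p',  'р' => 'r',  'с' => 's',  'т' => 't',   'у' => 'u',
        'ф' => 'f',  'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch',  'ш' => 'sh',
        'щ' => 'sch', 'ъ' => '',  'ы' => 'y',  'ь' => '',    'э' => 'e',
        'ю' => 'ju', 'я' => 'ja',
    ];

    // Lowercase first so the table only needs lowercase keys.
    return strtr(mb_strtolower($word, 'UTF-8'), $map);
}

// Sound index of a Russian word: transliterate, then apply built-in soundex().
function ru_soundex(string $word): string
{
    return soundex(ru_translit($word));
}

// Both ru_soundex('малако') and ru_soundex('молоко') give "M420".
```

Because soundex ignores vowels after the first letter, the vowel mappings matter mostly for the first character of the word; the consonant mappings carry the rest of the code.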


By the way, there is an attempt at a Russian version of soundex. Admittedly, it is still based on the same soundex, only the sounds of the Russian letters are accounted for beforehand.


When indexing a word, we compute its sound index and store it in the database next to the root. At search time we mark the query words that were not found in any document, and only for those we look for a close match by sound (if several are found, we pick the one that occurs most often on the site). If such a word is found, we display a hint.

4. Technical implementation.


(Implementation in PHP for MySQL.)


Structure of the database tables.



CREATE TABLE `indexing_link` (
   `id` int PRIMARY KEY auto_increment,
   `url` varchar(255) not null default '',
   `title` varchar(255) not null default '',
   -- TEXT columns cannot take a literal DEFAULT in MySQL
   `short` text not null
);


CREATE TABLE `indexing_word` (
   `id` int PRIMARY KEY auto_increment,
   `word` varchar(30) not null default '',
   `sound` char(4) not null default 'A000'
);

CREATE INDEX idx_word_word ON indexing_word (word(8));

CREATE INDEX idx_word_sound ON indexing_word (sound(4));


CREATE TABLE `indexing_index` (
   `id` int PRIMARY KEY auto_increment,
   `link` int not null default 0,
   `word` int not null default 0,
   `times` int not null default 0
);

CREATE INDEX idx_index_linkword ON indexing_index (link, word);


The link table contains the list of documents: the URL, the title, and the summary (the first 300 characters of the page, shown in the search results).


The word table contains the words. It includes word, which is what remains after the stemmer (what we have called the "root"), and sound, the result of the soundex function for that word.


The index table connects the other two tables. Each of its rows says that the word "word" occurred on the page "link" "times" times.
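
With these three tables, the hint lookup described earlier (the most frequent word on the site with the same sound index) might look roughly like this. It is a sketch under assumptions: the function name suggest_word is hypothetical, access goes through PDO, and the sound code of the mistyped word is assumed to be computed by the caller.

```php
<?php
// Returns the indexed word that shares the given sound code and occurs most
// often across all pages, or null when no word has that code.
// $pdo is assumed to be a PDO connection to the database with the
// indexing_word and indexing_index tables.
function suggest_word(PDO $pdo, string $sound): ?string
{
    $sql = 'SELECT w.word
              FROM indexing_word w
              JOIN indexing_index i ON i.word = w.id
             WHERE w.sound = :sound
             GROUP BY w.id, w.word
             ORDER BY SUM(i.times) DESC
             LIMIT 1';

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':sound' => $sound]);

    // fetchColumn() returns false when the result set is empty.
    $word = $stmt->fetchColumn();

    return $word === false ? null : (string) $word;
}
```

The join on indexing_index is what implements "most often meeting on a site": the candidates are ranked by the sum of their times values over all pages.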



Indexing.


Before splitting the text into words and indexing them, we should get rid of tags and other unnecessary characters.

The simplest solution is to apply regular expressions:



// Remove scripts and styles together with their contents
$tex = preg_replace('@<script[^>]*?>.*?</script>@si', ' ', $tex);

$tex = preg_replace('@<style[^>]*?>.*?</style>@si', ' ', $tex);

// Replace all remaining tags with a space
$tex = preg_replace('@<[\/\!]*?[^<>]*?>@si', ' ', $tex);


To improve search quality, we can first pick out the words found in the title, h1-h6, b, strong, and em tags and in the meta keywords and description, and raise their weight in the index for that page. For example, simply by increasing the number of occurrences of these words by 3.
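
A rough sketch of such weighting is shown here. The function name count_weighted_words, the word pattern, and the choice to add the bonus per occurrence are all assumptions; meta keywords and description are left out for brevity, and script and style blocks are assumed to have been removed already.

```php
<?php
// Counts the words of a page. Every word counts once, and each occurrence
// inside title, h1-h6, b, strong or em adds $bonus more (3 by default).
function count_weighted_words(string $html, int $bonus = 3): array
{
    $counts = [];

    // Adds every word of $text to $counts with the given weight.
    $add = function (string $text, int $weight) use (&$counts): void {
        // Replace tags with spaces so neighbouring words do not merge.
        $text = preg_replace('@<[^>]*>@u', ' ', $text);
        $text = mb_strtolower($text, 'UTF-8');
        preg_match_all('/[\p{L}\p{N}]+/u', $text, $m);
        foreach ($m[0] as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + $weight;
        }
    };

    // Bonus for words inside the emphasized tags. \b keeps "b" from
    // matching "body" and "em" from matching "embed".
    $pattern = '@<(title|h[1-6]|b|strong|em)\b[^>]*>(.*?)</\1>@siu';
    if (preg_match_all($pattern, $html, $m)) {
        foreach ($m[2] as $inner) {
            $add($inner, $bonus);
        }
    }

    // Plain count over the whole page.
    $add($html, 1);

    return $counts;
}
```

The resulting counts are what would go into the times column of indexing_index after the words have been passed through the stemmer.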