How the Yandex search engine works: the simplest description
Hello, dear friends! In this article we continue our look at the Yandex search engine. As you remember, previous articles covered the history of this great company, which holds first place among its competitors in Russia and beyond.
All this is well and good, but newcomers and seasoned site builders alike are most interested in one question: how to bring their projects to the top of the search results.
So let's look at how the Yandex search engine works, to understand which pitfalls to avoid and what to expect from the Russian search engine.
We already started on this topic in the last article. It turned out to be quite interesting and useful, so I decided to supplement it and, so to speak, deepen it.
Now, with the question "Why does a search engine index documents?" I probably got ahead of myself - the answer is a no-brainer. It remains to clarify the "how".
Website ranking algorithms
First, let's take a look at some of the algorithms that are fundamental to any search engine:
- Algorithm for direct search.
What is it? Imagine you remember reading a wonderful story in one of your books, and you start looking for it in turn: you take one book, leaf through it, don't find it, take another... The principle is clear, but this method is extremely slow. That much is also understandable.
- Reverse (inverted) search algorithm.
For this algorithm, a text file is created from every page of your blog. The file lists, in alphabetical order, ALL the words you have used, and even records each word's position in the text (its coordinates).
This is a fairly quick method, but the search now happens with a certain margin of error.
The main thing to understand here is that this algorithm searches neither the Internet nor your blog itself, but a separate text file that was created earlier, back when the robot visited you. These files (reverse indexes) are stored on Yandex servers.
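The reverse (inverted) index described above can be sketched in a few lines of Python. This is only an illustrative toy, not Yandex's real storage format; `build_inverted_index` and the sample pages are invented for the example:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map every word to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {
    "page1": "Yandex is a search engine",
    "page2": "a search robot indexes every page",
}
index = build_inverted_index(docs)
print(index["search"])  # [('page1', 3), ('page2', 1)]
```

Answering a query then means looking up the word in `index` instead of rereading every page - which is why it is fast, and why it only reflects what the pages looked like when the robot last visited.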
So, these were the basic search algorithms - that is, how Yandex simply finds the documents it needs. There shouldn't be any problems with this.
But Yandex knows not one document, and not even 100 documents: according to the latest data from my sources, Yandex knows about 11 billion documents (10,727,736,489 pages).
And out of all this quantity you need to select the documents that suit the request. More importantly, you need to rank them somehow - that is, order them by importance, or rather by usefulness to the reader.
Search Mathematical Models
Mathematical models come to the rescue here. We will now talk about the simplest ones.
Boolean model - if the word occurs in the document, the document is considered found. A simple match, nothing complicated.
But there are problems. If you, as a user, enter some popular word - or, even better, the preposition "in" ("в"), the most common word in Russian, found in EVERY document - you will be given so many results that you cannot even conceive of the number. Hence the next model appeared.
Vector model - this model determines the "weight" of a document. A mere match is not enough: the word must also occur several times, and the more often it occurs, the higher the relevance (correspondence).
It is the vector model that ALL search engines use.
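The difference between the Boolean and vector models can be shown with a toy sketch. Real engines use far more elaborate weighting (TF-IDF and beyond); the function names and the sample document here are invented for illustration:

```python
def boolean_match(query, text):
    """Boolean model: a document is 'found' if the word occurs at all."""
    return query.lower() in text.lower().split()

def term_frequency(query, text):
    """Vector model (simplified): the document's weight grows with
    how often the query word occurs in it."""
    return text.lower().split().count(query.lower())

doc = "yandex indexes documents and yandex ranks documents"
print(boolean_match("yandex", doc))   # True
print(term_frequency("yandex", doc))  # 2
```

The Boolean model can only say yes or no, so for a common word every document is a "yes"; the vector model's numeric weight is what allows documents to be ordered.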
Probabilistic model - more complex. The principle is this: the search engine has a reference page. For example, you are looking for information about the history of Yandex; suppose Yandex's standard for that query is my previous article about Yandex.
It will compare all other documents with this article. The logic is this: the more a page of your blog resembles my article, the more LIKELY it is that your page will also be useful to the reader and also tells about the history of Yandex.
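One simple way to measure "how much does this page resemble the reference document", which is the comparison the probabilistic model relies on, is cosine similarity over raw word counts. This is only a rough sketch of the idea, not Yandex's actual model:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Compare two documents by the angle between their word-count vectors:
    1.0 means identical word distributions, 0.0 means no words in common."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

reference = "history of yandex search"
candidate = "the history of yandex"
print(round(cosine_similarity(reference, candidate), 2))  # 0.75
```

The higher the score against the reference, the more likely (in this toy model) the candidate page covers the same topic.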
To reduce the number of documents that need to be shown to the user, the concept of relevance - that is, correspondence - was introduced.
How well your blog page really matches the topic is an important question when it comes to search quality.
Assessors - who they are and what they are responsible for
This relevance is also needed to assess the quality of the algorithms.
For this there is a special task force called assessors. These are special people who review search results by hand.
They have instructions on how to check sites, how to give ratings, and so on, and they manually determine whether your pages suit the search queries or not.
The quality of the search algorithms depends on the assessors' opinion. If all the assessors say the results do not match the queries, the ranking algorithm is wrong, and here only Yandex is to blame.
If the assessors say that only one site does not match the query, that site flies somewhere far away and drops in the results. More precisely, not the whole site but a single article - but that is beside the point.
Of course, assessors cannot view and evaluate ALL articles with their own hands and eyes. That much is obvious.
And other parameters come to the rescue, according to which the ranking of pages is carried out.
There are a lot of them, well, for example:
- page weight (VIC, PageRank - such "weight" metrics in general);
- domain authority;
- the relevance of the text to the request;
- relevance of the text of external links to the request;
- as well as many other ranking factors.
The assessors make their comments, and the people responsible for tuning the mathematical ranking model then edit the formula, as a result of which the search engine works better.
The main criteria for evaluating the work of the formula:
1. Accuracy of the results - the percentage of documents in the results that match the request (relevant ones). That is, the fewer pages not matching the request, the better.
2. Completeness of the results - the ratio of web pages in the results that are relevant to the given request to the total number of relevant documents in the collection (the set of pages known to the search engine).
For example, if the whole collection contains more relevant pages than the search results do, the results are incomplete. This can happen when some of the relevant web pages fall under a filter.
3. Freshness of the results - the correspondence of the actual web page to what is written in the snippet. For example, a document may have changed greatly or may no longer exist at all, yet still be present in the SERP.
How fresh the results are depends directly on how often the search robot rescans the documents in its collection.
The collection is gathered (site pages are indexed) by a special program - the search robot.
The search robot receives a list of addresses to index and copies them; the contents of the copied web pages are then sent for processing to an algorithm that converts them into reverse indexes.
Well, here "in a nutshell", if I may say so, we have covered the principles of a search engine's operation.
Let's summarize:
- A search robot comes to your blog.
- The robot builds a reverse index of the page for later retrieval.
- Using a mathematical model, the document is processed and shown in the search results according to the formulas, taking the assessors' opinion into account.
This is a very, very simplified picture - just enough for a basic understanding of how the Yandex search engine works.
I have written a lot of text now, and perhaps not everything is clear. So I suggest you return to this article a little later and watch this video.
It is an excellent guide that I studied from at one time.
I hope this information helps you better understand why your pages occupy the positions they do in the search results, and to do everything to improve them.
With that I say goodbye; if you have any questions, I am always happy to answer them in the comments. Or perhaps you would like to add something to the article?
In any case, share your opinion!
We are not as unique as we think: millions of people before us have puzzled and millions after us will puzzle the search engine with almost the same questions. On the other hand, we are too unpredictable: the formulation of our request is influenced by a huge number of factors that we do not understand. And at least for this reason, the request of each of us, no matter how banal it may be, requires an individual approach.
In fact, the entire work of the search engine "Yandex" is reduced to two simple things: to understand what a person really wants to know, and in a few seconds to find suitable documents for him among the billions of documents on the Web.
Taking fingerprints
The search engine's system is somewhat similar to the Matrix, and the search robot (a complex program that makes decisions on its own) is like Agent Smith.
In order not to search the entire Internet every time someone needs to find something out, the search engine does part of the work in advance: with the help of thousands of search robots, it checks what is on the Web and where it lies. The robots are of two types, main and fast. The main one crawls and processes the Internet as a whole, while the fast one handles documents that appeared a minute or even a couple of seconds ago. The task of the robot programs is to select information useful to users, process it, and filter out everything outdated and unnecessary. In some ways it resembles sorting garbage: paper in one container, glass in another, plastic in a third, food waste in a fourth...
The information collected by the robots forms the so-called "snapshot of the Internet". It is stored on thousands of Yandex servers and is constantly updated. The snapshot is like an index that tells you where which information can be found: for each keyword, not one but millions of pages are listed. For all updates to the snapshot to become available to users, they are transferred from the repository into the "basic search". Data from the main robot is transferred every few days, and from the fast robot in real time.
Bringing things to light
[Illustration: Eugene Tonkonogiy]
Looking for an answer to the question in its prepared base, the machine faces two main difficulties. The first difficulty is language. Before searching for an answer, the machine must understand in which language to do it. For example, for a Russian-speaking user the query "Prince Igor's druzhina" ("дружина князя Игоря") will find documents about the prince's army, while for a Ukrainian the same query will also return documents mentioning Princess Olga, his spouse, since in Ukrainian "дружина" means "wife". And in the rich Russian language the same word or its forms can mean different things: the word "стали", for example, is both a form of the noun "сталь" ("steel") and of the verb "стать" ("to become"). The second difficulty is human psychology. When we enter a request, we expect a quick and accurate answer without, of course, worrying whether our wording matches the principles of the mathematical analysis by which the machine's brain works. When a person types the word "napoleon" into the search string, what does he want: a cake recipe or the biography of the French emperor, to buy brandy or to find the address of a mental hospital?
In such situations, several technologies come into play at once. The engine can offer a few hints below the search bar that further refine the query - choose what you need: "napoleon recipe" or "Napoleon Bonaparte". If the user does not respond to the machine's prompt and does not add words to "napoleon", the "Spectrum" technology helps: without waiting for help, the machine immediately searches for information in several categories at once (the cake, the emperor, and so on...). In addition, personalization mechanisms help understand the user - the machine's knowledge of what this user searched for from his computer a day, two, three or a month ago: if you have often asked Yandex questions about cooking, the machine will first show you results saying that napoleon is a cake.
Combinations: hobby clubs
The task of a search engine is not limited to simply selecting documents that contain the words and phrases of the search query. The machine needs to understand which documents meet our contradictory requirements and why. Do we want information about the Napoleon cake, or did we perhaps attend a fitness club with that pretentious name for a couple of years, or are we concerned with the complexes of people of short stature? In any case, solving the problem requires a non-trivial approach.
The creators of the Yandex search program found this approach by delegating the choice to the machine itself. On the one hand, the soulless but very fast and intelligent machine knows nothing and wants to know nothing about us as individuals; on the other, it tries to find out as much as possible about each of us.
In addition to the user's geographic location and the linguistic analysis of his queries, the search engine uses several thousand criteria that are not at all obvious to humans.
The trick is that the machine develops and updates these criteria on its own.
It simply takes data on the preferences and behavior of millions of users and connects this "arithmetic mean" to our query history. The principles by which the Matrix compares the thousands of user-interest categories it has developed often do not fit traditional human notions of what "interests" can be in principle. There are tens of thousands of them, and they form different, sometimes funny combinations with each other. For example, one such combination might be that the search results match the interests of a person who breeds newts - not someone merely interested in newts, but someone already breeding them, though only for the first year.
Estimates. Helping hands
The Matrix, of course, decides for itself (with the help of higher mathematics) what to show users and in what order, based on tens of thousands of criteria. But the Matrix also uses living people: about 1000 Yandex employees, the so-called assessors, evaluate the search results for particular queries (of course, not every query is evaluated, and not in real time) for their compliance with the expectations of an ordinary user - who is not as rational as a machine, not as precise in wording, contradictory and emotional.
Search engines have long become an integral part of the Russian Internet. Today they are huge and complex mechanisms that represent not only a tool for finding information, but also a tempting area for business.
Most search engine users have never thought (or thought, but found no answer) about how search engines work: how user requests are processed, what these systems consist of and how they function...
This master class is designed to answer the question of how search engines work. However, you will not find here the factors that influence the ranking of documents. Even less should you count on a detailed explanation of the Yandex algorithm: according to Ilya Segalovich, director of technology and development of the Yandex search engine, it could be extracted only "under torture" from Ilya Segalovich himself...
2. The concept and functions of the search engine
A search engine is a software and hardware complex designed to search the Internet, responding to a user's request - specified as a text phrase (search query) - with a list of links to information sources, ordered by relevance (correspondence to the request). The major international search engines are Google, Yahoo and MSN; on the Russian Internet they are Yandex, Rambler and Aport.
Let's take a closer look at the concept of a search query, using the Yandex search engine as an example. The user should formulate the search query as briefly and simply as possible, in accordance with what he wants to find. Say we want to find information in Yandex on how to choose a car. We open the Yandex home page and enter the search query "how to choose a car". Our task is then to open the links to information sources returned for our request. However, we may well fail to find the information we need. If that happens, either the request needs to be rephrased, or the search engine database really contains nothing relevant to it (this can happen with very "narrow" queries, such as "how to choose a car in Arkhangelsk").
The primary task of any search engine is to deliver to people exactly the information they are looking for. Teaching users to make "correct" requests - that is, queries matching the principles of search engines - is not possible. Therefore developers create search algorithms and principles that allow users to find the information they seek.
This means the search engine must "think" the way the user thinks when looking for information. When a user sends a request, he wants to find what he needs as quickly and easily as possible. Having received the result, he judges the system by several basic parameters. Did he find what he was looking for? If not, how many times did he have to rephrase the query? How up-to-date was the information he found? How fast did the search engine process the request? How conveniently were the results presented? Was the desired result first or hundredth? How much junk came along with the useful information? Will he find the information he needs when he turns to the search engine again in a week, or in a month?
To answer all these questions well, search engine developers constantly improve their algorithms and search principles, add new functions and capabilities, and try in every way to speed up the system.
3. The main characteristics of the search engine
Let's describe the main characteristics of search engines:
- Completeness
Completeness is one of the main characteristics of a search engine: the ratio of the number of documents found for a request to the total number of documents on the Internet that satisfy that request. For example, if the Internet contains 100 pages with the phrase "how to choose a car" and only 60 of them are found for the corresponding query, the completeness of the search is 0.6. Obviously, the fuller the search, the less likely the user is to miss the document he needs - provided it exists on the Internet at all.
- Accuracy
Accuracy is another main characteristic of a search engine, determined by the degree to which the found documents match the user's request. For example, if the query "how to choose a car" returns 100 documents, 50 of which contain the phrase "how to choose a car" while the rest simply contain those words ("how to choose the right radio and install it in a car"), the search accuracy is 50/100 = 0.5. The more accurate the search, the faster the user finds the documents he needs, the less "garbage" of various kinds he encounters among them, and the less often the found documents fail to match the request.
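The two characteristics above - completeness and accuracy - are what information retrieval usually calls recall and precision. They can be computed directly from the numbers in the examples; the function names here are just for illustration:

```python
def completeness(found_relevant, total_relevant):
    """Recall: the share of all relevant pages that made it into the results."""
    return found_relevant / total_relevant

def accuracy(relevant_in_results, total_in_results):
    """Precision: the share of returned documents that are actually relevant."""
    return relevant_in_results / total_in_results

# Numbers from the text: 60 of 100 relevant pages were found,
# and 50 of the 100 returned results were relevant.
print(completeness(60, 100))  # 0.6
print(accuracy(50, 100))      # 0.5
```

The two metrics pull against each other: returning everything maximizes completeness but ruins accuracy, and returning only one safe result does the opposite.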
- Relevance
Relevance here means freshness, an equally important component of search: the time that passes between the publication of documents on the Internet and their entry into the search engine's index base. For example, the day after an interesting news story appears, a large number of users turn to search engines with the corresponding queries. Objectively, less than a day has passed since the news was published, yet the main documents are already indexed and searchable, thanks to the so-called "fast base" that large search engines maintain and update several times a day.
- Search speed
Search speed is closely related to resistance to load. For example, according to Rambler Internet Holding LLC, the Rambler search engine today receives about 60 queries per second during business hours. Such a workload requires reducing the processing time of each individual request. Here the interests of the user and the search engine coincide: the visitor wants results as quickly as possible, and the search engine must process the query as quickly as possible so as not to delay the computation of subsequent queries.
- Visibility
4. Short story search engine development
In the initial period of the development of the Internet, the number of its users was small, and the amount of available information was relatively small. For the most part, only research workers had access to the Internet. At this time, the task of searching for information on the Internet was not as urgent as it is now.
One of the first ways to organize access to the network's information resources was the creation of open site directories, in which links to resources were grouped by topic. The first such project was Yahoo.com, which opened in the spring of 1994. After the number of sites in the directory had grown significantly, the option to search within the catalog was added. This was not yet a search engine in the full sense, since the search area was limited to the resources in the directory rather than to all Internet resources.
Link directories were widely used in the past but have almost completely lost their popularity today, because even modern directories, huge as they are, cover only an insignificant part of the Internet. The largest directory on the network, DMOZ (also called the Open Directory Project), contains information about 5 million resources, while the Google search database consists of more than 8 billion documents.
In 1995 the search engines Lycos and AltaVista appeared. The latter was for many years the leader in the field of information retrieval on the Internet.
In 1997 Sergey Brin and Larry Page created the Google search engine as part of a research project at Stanford University. Google is currently the most popular search engine in the world!
In September 1997, the Yandex search engine, which is the most popular in the Russian-speaking Internet, was officially announced.
Currently there are three main international search engines - Google, Yahoo and MSN - with their own databases and search algorithms. Most other search engines (of which there are a great many) use the results of the three listed in one form or another. For example, AOL search (search.aol.com) uses the Google base, while AltaVista, Lycos and AllTheWeb use the Yahoo base.
5. The composition and principles of the search engine
In Russia the main search engine is Yandex, followed by Rambler.ru, Google.ru, Aport.ru and Mail.ru. Moreover, at the moment Mail.ru uses the Yandex search engine and database.
Almost all major search engines have their own structure that is different from others. However, it is possible to single out the main components common to all search engines. Differences in the structure can only be in the form of the implementation of mechanisms for the interaction of these components.
Indexing module
The indexing module consists of three auxiliary programs (robots):
Spider - a program for downloading web pages. The spider downloads a page and retrieves all internal links from it; the HTML code of each page is downloaded. Robots use the HTTP protocol to download pages. The spider works as follows: it sends the request "get /path/document" and some other HTTP commands to the server, and in response receives a text stream containing service information and the document itself. For each page the following is typically saved:
- the page URL;
- the date the page was downloaded;
- the HTTP response header;
- the page body (HTML code).
Crawler (a "traveling" spider) - a program that automatically follows all the links found on a page. It extracts every link present on the page; its task is to determine where the spider should go next, based either on those links or on a predefined list of addresses. By following the links it finds, the crawler searches for new documents still unknown to the search engine.
Indexer - a program that analyzes the web pages downloaded by the spiders. The indexer parses a page into its constituent parts and analyzes them with its own lexical and morphological algorithms. Various page elements are analyzed: text, headings, links, structural and style features, special service HTML tags, and so on.
Thus, the indexing module allows you to crawl a given set of resources by links, download pages encountered, extract links to new pages from received documents and perform a complete analysis of these documents.
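The link-extraction step that the spider and crawler perform can be sketched with Python's standard `html.parser` module. This is a minimal illustration (no networking, no politeness rules), not production crawler code:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, as a crawler would
    when deciding which pages to visit next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="https://example.com/news">News</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about', 'https://example.com/news']
```

A real crawler would feed each downloaded page through such a parser, resolve relative links like `/about` against the page's URL, and queue the new addresses for the spider.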
Database
A database, or an index of a search engine, is a data storage system, an information array that stores specially converted parameters of all documents downloaded and processed by the indexing module.
Search Server
The search server is an essential element of the entire system, since the quality and speed of search directly depends on the algorithms that underlie its functioning.
The search engine works as follows:
- The request received from the user undergoes morphological analysis. Then the information environment of each document in the database is generated (it will later be displayed as the snippet - the text corresponding to the request on the search results page).
- The received data is passed as input to a special ranking module. The data of all documents is processed, and for each document its own rating is calculated, characterizing its relevance to the query entered by the user based on the various components of the document stored in the search engine's index.
- Depending on the user's choice, this rating may be adjusted by additional conditions (for example, the so-called "advanced search").
- Next, a snippet is generated: for each found document, the title, a short annotation best matching the request, and a link to the document are extracted from the document table, with the found query words highlighted.
- The resulting search results are transmitted to the user in the form of a SERP (Search Engine Result Page) - search results page.
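The snippet-generation step above can be sketched as "find the first query-word hit, cut a window of text around it, and bold the matches". A toy version - the function name, window size and markup are invented for illustration:

```python
def make_snippet(text, query_words, window=8):
    """Pick the fragment around the first query-word hit
    and wrap matching words in <b> tags, SERP-style."""
    words = text.split()
    lowered = [w.lower() for w in words]
    hit = next((i for i, w in enumerate(lowered) if w in query_words), 0)
    start = max(0, hit - window // 2)
    fragment = words[start:start + window]
    return " ".join(f"<b>{w}</b>" if w.lower() in query_words else w
                    for w in fragment)

text = ("The search server builds a snippet for every found document "
        "and highlights the query words")
print(make_snippet(text, {"snippet", "query"}))
# search server builds a <b>snippet</b> for every found
```

A real engine additionally scores candidate fragments so the annotation shown is the one that best matches the whole request, not just the first hit.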
As you can see, all these components are closely related to each other and work in interaction, forming a clear, rather complex mechanism for the search engine operation, which requires a huge expenditure of resources.
6. Conclusion
Now let's summarize all of the above.
- The primary task of any search engine is to deliver people exactly the information they are looking for.
- The main characteristics of search engines:
- Completeness
- Accuracy
- Relevance
- Search speed
- Visibility
- The first full-fledged search engine was the WebCrawler project, published in 1994.
- The search engine includes the following components:
- Indexing module
- Database
- Search Server
We hope that this master class has helped you take a closer look at the concept of search engines and get to know their main functions, characteristics and principles of operation.