An in-depth look at how search engines work

To do SEO well, you must first understand how search engines work. Youyuan SEO takes you through an in-depth look at the five steps of a search engine's workflow: spider crawling, crawling and building the library, web page processing, retrieval service, and presentation of results.

How Search Engines Work

The working principle of a search engine is shown in Figure 1. It consists mainly of five parts: spider crawling, crawling and building the library, web page processing, retrieval service, and presentation of results. Each is described below.

Figure 1 How a search engine works (figure from Baidu Encyclopedia)

Step 1: Spider Crawling

The first problem a search engine must solve is how to effectively access and use the vast amount of information on the Internet. To achieve this, the data crawling system has become an indispensable component of the search engine. It is responsible for collecting, storing, and updating information on the Internet. Its crawler programs move across the Internet the way spiders crawl across a web, hence the name web spiders or search engine spiders. Each web spider has its own name, such as Baidu's BaiduSpider, Sogou's Sogou Web Spider, Google's Googlebot, and Bing's Bingbot.

When crawling web pages, a search engine runs multiple spider programs at the same time. Starting from a number of important seed sites, the spiders constantly discover and crawl new URLs through the hyperlinks on web pages, repeating the process over and over to crawl as many pages as possible. Since web pages on the Internet may be modified or deleted, or gain new hyperlinks, at any time, large search engines like Baidu must constantly revisit and update pages that were crawled in the past.

When a spider arrives at a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the rules in that file determine the scope of pages to be crawled. After entering a website, the spider crawls its pages using strategies such as depth-first, breadth-first, or best-first.
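As a minimal sketch of honoring robots.txt, Python's standard library can fetch and query the file; the site URL and page path below are hypothetical:

from urllib import robotparser

# fetch and parse the site's robots.txt (URL is hypothetical; this makes a network request)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# ask whether a given spider may crawl a given page
print(rp.can_fetch("BaiduSpider", "https://www.example.com/news/page.html"))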

1. Depth-first strategy

Early web spiders typically crawled using a depth-first strategy. Under this strategy, after crawling a web page, if the page contains other links, the spider continues along one of them to the next page, looks for new links there, and keeps crawling deeper. This continues until no uncrawled links remain, at which point the spider returns to the initial page and continues deeper through another link. Only when all links have been traversed does the entire crawl end, as shown in Figure 2.

Fig. 2 Depth-first crawling strategy

The depth-first crawl order is: website home page → A1 → A2 → … → A's n sub-columns → website home page → B1 → B2 → … → B's n sub-columns → website home page → C1 → C2 → … → C's n sub-columns → website home page → D1 → D2 → … → D's n sub-columns.
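A minimal sketch of depth-first crawling in Python, using a toy link graph in place of real pages (all page names are hypothetical):

def crawl_depth_first(page, get_links, visited=None):
    # follow one chain of links to its end before backtracking
    if visited is None:
        visited = set()
    visited.add(page)
    print("crawling:", page)
    for link in get_links(page):
        if link not in visited:
            crawl_depth_first(link, get_links, visited)

# toy link graph: home page -> columns -> sub-columns
site = {
    "home": ["A", "B"],
    "A": ["A1", "A2"], "B": ["B1", "B2"],
    "A1": [], "A2": [], "B1": [], "B2": [],
}
crawl_depth_first("home", lambda p: site[p])
# visit order: home, A, A1, A2, B, B1, B2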

2. Breadth-first strategy

The breadth-first strategy is a crawling strategy in which, upon arriving at a page, the web spider first crawls all the links on that page, and only then moves on to the links on the next layer of pages.

Fig. 3 Breadth-first crawling strategy

The breadth-first crawl order is: website home page → all first-level link pages (A, B, C, …) → all second-level link pages (A1, A2, …, B1, B2, …) → all third-level link pages (A11, A12, …, B21, B22, …).
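The same toy graph crawled breadth-first, visiting every link on a page before descending a level (a sketch, not a production crawler):

from collections import deque

def crawl_breadth_first(start, get_links):
    visited = {start}
    queue = deque([start])  # FIFO queue gives the layer-by-layer order
    while queue:
        page = queue.popleft()
        print("crawling:", page)
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(link)

site = {
    "home": ["A", "B"],
    "A": ["A1", "A2"], "B": ["B1", "B2"],
    "A1": [], "A2": [], "B1": [], "B2": [],
}
crawl_breadth_first("home", lambda p: site[p])
# visit order: home, A, B, A1, A2, B1, B2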

3. Best-first strategy

The best-first strategy is a crawling strategy in which, upon arriving at a page, the web spider collects all of its links into an address base and analyzes them, selecting the more important links to crawl first. The main factors affecting a link's importance are its PR (PageRank, a page-ranking algorithm) value, the size of its website, and its response speed. Under this strategy, the higher a link's PR value, the larger its site, and the faster its response, the sooner it is crawled.


The PR value measures a site's popularity based on the number and quality of its external and internal links, on a scale of 0 to 10. The higher the PR value, the more popular (and important) the site. For example, a PR value of 1 indicates a site that is not very popular, while a PR value of 7 to 10 indicates an extremely popular (extremely important) site. Generally speaking, a PR value of 4 indicates a reasonably popular website.
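One way to sketch a best-first crawler is a priority queue ordered by an importance score. The scoring weights and URLs below are invented for illustration; they are not any search engine's real formula:

import heapq

def importance(link):
    # invented weights: higher PR, larger site, faster response -> more important
    return link["pr"] * 10 + link["site_size"] - link["response_ms"] / 100

frontier = []  # heapq is a min-heap, so scores are negated for max-first order
for link in (
    {"url": "https://a.example/", "pr": 6, "site_size": 8, "response_ms": 120},
    {"url": "https://b.example/", "pr": 2, "site_size": 3, "response_ms": 40},
):
    heapq.heappush(frontier, (-importance(link), link["url"]))

while frontier:
    neg_score, url = heapq.heappop(frontier)
    print("crawl next:", url, "score:", -neg_score)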

Step 2: Crawling and Building the Library

Given enough time, web spiders can crawl their way through the web pages on the Internet, but these pages are extremely numerous and mixed with a large number of "spam pages". Since the search engine's resources are limited, usually only a portion of the crawled pages are stored in the database.

When a web spider arrives at a page, it first examines the content and judges whether the information is spam (for example, pages with large amounts of duplicated content or garbled text, or pages that largely duplicate content already indexed). After this check, the spider includes the valuable pages and stores their information in the original page database.

Step 3: Web Page Processing

After web spiders crawl the page data, the pages cannot be used directly for the retrieval service because the amount of data is too large. The search engine must first do a great deal of pre-processing, such as structuring the pages, word segmentation, stop-word removal, noise reduction, de-duplication, building the index library, link analysis, and data integration.

1. Structuring web pages

The page data crawled by web spiders contains not only the visible text users see in their browsers but also HTML tags, JavaScript code, navigation, friendly links, advertisements, and other content that cannot be used for ranking calculations. Structuring a web page removes this content from the page data and retains the body text, tag content, anchor text, image and video captions, and other text usable for ranking, as in the code below.

<div id="baike-title">
    <h1>
        <span class="title">Youyuan SEO</span>
    </h1>
</div>

After structuring the page, the remaining text used for ranking is "Youyuan SEO".
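A minimal sketch of this extraction step using Python's standard html.parser module; real engines use far more elaborate structuring pipelines:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep visible text; drop tags and script/style content."""
    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed('<div id="baike-title"><h1><span class="title">Youyuan SEO</span></h1></div>')
print(parser.chunks)  # ['Youyuan SEO']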

2. Word segmentation

Word segmentation is specific to Chinese search engines. Languages such as English use spaces to separate words, while Chinese has no separator between words, so the search engine must first break a sentence into individual words. For example, "佑元SEO" (Youyuan SEO) is segmented into the two words "佑元" and "SEO".

There are many segmentation methods, including dictionary-based, semantics-based, and statistics-based segmentation. Major search engines usually combine all three into a single segmentation system.
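For example, the open-source Python library jieba (one of many segmenters, not necessarily what any search engine uses internally) combines a dictionary with statistical models; the exact splits depend on its dictionary:

import jieba  # pip install jieba

print(jieba.lcut("佑元SEO"))        # e.g. ['佑元', 'SEO']
print(jieba.lcut("电脑蓝屏怎么办"))  # e.g. ['电脑', '蓝屏', '怎么办']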

3. Removing stop words

Whether in English or Chinese, page content contains some high-frequency words that carry no real meaning for the article, known as stop words, such as the Chinese "啊", "哈", "呀", "的", "地", "得" and the English "a", "an", "the", "of", "to". Because stop words have little effect on the main meaning of the page, the search engine removes them, which both makes the subject of the indexed data more prominent and eliminates a great deal of unnecessary computation.
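A toy stop-word filter; the word list here is a tiny illustrative sample, whereas real engines maintain much larger lists:

STOP_WORDS = {"的", "地", "得", "啊", "哈", "呀", "a", "an", "the", "of", "to"}

def remove_stop_words(tokens):
    # drop high-frequency words that carry no topical meaning
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["how", "to", "fix", "a", "blue", "screen"]))
# ['how', 'fix', 'blue', 'screen']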

4. Noise reduction

Page content also includes material unrelated to the page's topic, such as copyright notices, navigation bars, and advertisements. Such content is noise: it only dilutes the page's topic. Search engines therefore need to recognize and eliminate noise. The basic method of noise reduction is to divide the page into blocks based on its HTML tags, distinguishing the header, navigation, body, footer, advertising, and other areas, and then remove the content of the irrelevant areas; what remains is the main content of the page.

5. De-duplication

There is also a large amount of duplicate content on the Internet, mainly generated by websites reposting each other's content and by the use of shared web templates. If search results contain many identical items, the user experience suffers, so the search engine identifies and processes duplicate content before indexing. This process is called "de-duplication".

The basic method of de-duplication is to compute a characteristic-keyword fingerprint for each page: select the most frequently occurring keywords from the page's main content, then compute a digital fingerprint over them. If two pages' keyword fingerprints are the same, they are judged to be duplicate content, and the page is not included.

In addition, pseudo-original tricks such as simply inserting "的", "地", "得" or swapping the order of paragraphs cannot escape the search engine's de-duplication algorithms, because such operations do not change the article's characteristic keywords.
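A toy version of characteristic-keyword fingerprinting; production systems use more robust schemes (for example SimHash), but the idea is the same: reordering words or paragraphs does not change the top keywords, so the fingerprint is unchanged:

import hashlib
from collections import Counter

def keyword_fingerprint(tokens, top_k=8):
    # hash the page's most frequent keywords, sorted so word order is irrelevant
    top = sorted(kw for kw, _ in Counter(tokens).most_common(top_k))
    return hashlib.md5("|".join(top).encode("utf-8")).hexdigest()

original = ["seo", "spider", "index", "seo", "crawl"]
shuffled = ["crawl", "seo", "index", "spider", "seo"]  # "pseudo-original" reorder
print(keyword_fingerprint(original) == keyword_fingerprint(shuffled))  # True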

6. Building the index library

After a page's content has been processed with word segmentation, stop-word removal, noise reduction, and de-duplication, a collection of keywords reflecting the page's main content is obtained. The search engine records each keyword's frequency, count, format (such as title, bold, or anchor text), position, and other information on the page, uses this information to compute each keyword's importance, and sorts the keywords accordingly. The search engine then builds the page and its corresponding keywords into a forward index and stores it in the index library.

The forward index makes it quick to find which keywords a page contains, but an actual search works in the other direction: from a keyword to the pages containing it. With only a forward index, every page would have to be scanned to check whether it contains the search keywords, which requires far too much computation to return ranked results in real time. Therefore, the search engine also reconstructs the forward index into an inverted index, turning the page-to-keywords mapping into a keyword-to-pages mapping.

In the inverted index, each keyword corresponds to a list of pages. When a user searches for a keyword, all the pages containing it can be found immediately by looking the keyword up in the inverted index.
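A minimal forward-to-inverted index conversion in Python, with toy page names and keywords:

from collections import defaultdict

# forward index: page -> keywords it contains
forward = {
    "page1": ["seo", "spider", "crawl"],
    "page2": ["seo", "rank"],
    "page3": ["crawl", "rank"],
}

# invert it: keyword -> pages containing that keyword
inverted = defaultdict(set)
for page, keywords in forward.items():
    for kw in keywords:
        inverted[kw].add(page)

print(sorted(inverted["seo"]))    # ['page1', 'page2']
print(sorted(inverted["crawl"]))  # ['page1', 'page3']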

7. Link analysis

When shopping online, users not only read the seller's description of the goods but also check buyers' reviews. Search engines rank pages in a similar way: they consider the keyword density and keyword positions within the page itself, but they also introduce criteria from outside the page. Among these external criteria, link analysis is especially important: the search engine analyzes all the external links pointing to a page, since their number and quality reflect the page's quality and its relevance to the keywords.

Link analysis is time-consuming, because of the sheer number of web pages on the Internet and because the linking relationships between pages are constantly changing. Search engines must complete link analysis before building the inverted index, which affects indexing speed.
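A bare-bones power-iteration PageRank over a three-page toy graph, to show the flavor of link analysis (the damping factor 0.85 comes from the original PageRank paper; the graph and iteration count are purely illustrative):

def pagerank(links, damping=0.85, iterations=20):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # every page starts with the "random jump" share ...
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        # ... then receives a share of rank from each page linking to it
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))  # C accumulates the most rank: two pages link to it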

8. Data integration

In addition to HTML files, search engines can usually crawl and index a wide range of text-based file types, such as PDF, XLS, PPT, and TXT files. However, images, videos, animations, and other non-text content cannot yet be processed directly; the search engine can only handle them through their accompanying descriptive text.

Different data formats are stored separately, but during indexing and ranking the search engine relates the content associated with each item in order to determine its relevance and importance, forming a final, searchable database that supports search ranking.

Step 4: Retrieval Service

After a search engine builds its retrieval database, it can provide search services to users. When a user enters a search keyword, the search engine first processes it, filtering and segmenting it, then extracts the matching pages from the index library, comprehensively ranks those pages by their scores along different dimensions, and finally optimizes the results using collected user search data to produce the final search results.

1. Processing search terms

Processing search keywords is similar to processing page keywords: the search engine must also segment and denoise the user's input, splitting it into key phrases and removing words that contribute little to the search results. For example, if a user types "computer blue screen how to do ah", the search engine splits it into the three keywords "computer", "blue screen", and "how to do", as shown in Figure 4.

Figure 4 Processing search terms

2. Extracting pages

Once the keywords are determined, the search engine retrieves from the index library the pages containing those keywords, but not all of these pages participate in ranking. Search results generally number in the hundreds of thousands or even tens of millions; ranking them all would require an enormous amount of computation and be very slow, and users usually view only the first few pages of results anyway. Therefore, the search engine typically displays only 100 pages of search results; at the default of 10 results per page, it only needs to return about 1,000 results.

3. Comprehensive ranking

The search engine comprehensively ranks the participating pages based on their scores along different dimensions to produce the final search results. The criteria for comprehensive ranking mainly include the following aspects; a toy scoring sketch follows the list.

  • Relevance: The extent to which the page content matches the search terms. The search engine evaluates how well a page matches a search term based on factors such as how many of the search terms appear on the page, where they appear, and the anchor text other pages use to point to the page.
  • Authority: Authoritative websites usually provide more authentic and reliable content and therefore have an advantage in ranking. Search engines judge a website's authority based on its credibility and reputation, and rank pages from more authoritative websites higher.
  • Timeliness: Search engines consider whether the page was published recently and whether its content is the most up-to-date information. Over time, timeliness has become more and more important in ranking.
  • Richness: The diversity and comprehensiveness of the page content. Rich and varied content can satisfy not only a user's immediate need but also a wider range of user needs.
  • Weighting: Search engines also give extra weight to certain special pages. For example, official websites and special channels may be ranked higher.
  • Demotion: Search engines also lower the ranking of pages suspected of cheating, to ensure the quality and reliability of search results.
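To make the idea concrete, here is a toy composite scorer. The dimension weights are invented for illustration only; real ranking formulas are proprietary and far more complex:

# invented weights purely for illustration
WEIGHTS = {"relevance": 0.4, "authority": 0.25, "timeliness": 0.15,
           "richness": 0.1, "weighting": 0.1}

def composite_score(scores, suspected_cheating=False):
    total = sum(w * scores.get(dim, 0.0) for dim, w in WEIGHTS.items())
    return total * 0.1 if suspected_cheating else total  # demotion for cheating

candidates = [
    ("official site", {"relevance": 0.9, "authority": 0.95, "weighting": 1.0}, False),
    ("copied page",   {"relevance": 0.9, "authority": 0.2}, True),
]
ranked = sorted(candidates, key=lambda c: composite_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])  # ['official site', 'copied page']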

4. Search optimization

Finally, search engines also optimize search results based on information such as the user's IP address, the time of the search, previous search history, and pages viewed.

Generally speaking, the user's region can be obtained from the IP address, and ranking results tailored to that region can be returned according to the search habits of users there. The search time, previous search records, browsed pages, and other information reveal the user's interests and concerns, enabling more accurate, personalized search results.

Step 5: Presentation of results

At present, search engine results are presented in a variety of forms, such as summary, picture, video, software download, step-by-step, and news styles.

  • Summary style: The most basic form of presentation, showing only a title, a relevant summary, and a link, as shown in Figure 5. Enterprise websites and information websites are mostly presented in summary form.
Figure 5 Summary style
  • Picture style: A picture is displayed on top of the summary style, as shown in Figure 6.
Figure 6 Picture summary style
  • Video style: Used to display pages containing video, showing a video thumbnail along with information such as the video's duration on top of the summary style, as shown in Figure 7.
Figure 7 Video style
  • Software download style: Used to display pages offering software downloads. In addition to the title, it displays the software's icon, version, size, update time, runtime environment, and other information, along with a download button; clicking the button downloads the software directly, as shown in Figure 8.
Fig. 8 Software download style
  • Step-by-step style: Mainly used to display operational steps, showing several thumbnails and abbreviated text for the steps, as shown in Figure 9.
Figure 9 Step-by-step style
  • News style: Displays the titles of several news items, the websites where they were published, and their publication times, along with summary information for each item, as shown in Figure 10.
Figure 10 News style

Original article by Youyuan SEO. If reproduced, please credit the source: https://www.ycsu.com/en/236/
