How does Google Search work?
A search engine is both enormously complex and quite simple. In essence, all a search engine like Google or Bing does is compile a database of web pages (also known as the 'search index') and then, every time a query is made, look into that database, collect the best and most relevant pages, and display them. That's really all.
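To make that concrete, here is a minimal sketch in Python of those two halves: building a tiny in-memory index of pages and answering a query from it. The pages and URLs are invented, and a real index is distributed, disk-backed and stores far more per page; this only shows the shape of the idea.

```python
from collections import defaultdict

# Toy 'search index': a handful of pages and their text.
# A real index is distributed, disk-backed and stores much more per page.
pages = {
    "https://example.com/coffee": "how to brew great coffee at home",
    "https://example.com/tea": "a beginner's guide to brewing tea",
    "https://example.com/espresso": "espresso machines and coffee grinders compared",
}

# Map every word to the set of pages that contain it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Return the pages that contain every word in the query."""
    results = set(pages)
    for word in query.lower().split():
        results &= index.get(word, set())
    return sorted(results)

print(search("coffee"))        # the two coffee pages
print(search("brewing tea"))   # only the tea page
```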
That sounds fairly simple, but all these steps require critical and complex considerations. Most of these are determined by two things: time and money.
Even if you could theoretically build a constantly updated database containing all the billions of pages on the internet, the storage costs and bandwidth requirements alone would bankrupt practically every company in the world, not to mention the cost of searching that database millions or billions of times a day. So you have to make sure that you only store relevant information and that you can search it quickly, because every millisecond matters (Google still shows how long each search took at the top of the results page) and there's no time to search the entire database anyway.
Fundamental question
So every search engine, say you want to build one yourself, starts with a surprisingly philosophical question: 'What makes a web page good?' You have to decide what is merely a dissenting opinion and what is outright disinformation. And you need to figure out what counts as commercial advertising and how much advertising is too much. Sites that are clearly written by AI and filled with SEO junk are bad. Recipe blogs written by a real human and filled with SEO junk are usually fine.
Once you've had all these discussions and defined your boundaries, you can identify a few thousand domains that you definitely want in your search engine. You'll include news sites from CNN to Breitbart, popular discussion forums like Reddit and Twitter, helpful services like Wikipedia, broad platforms like YouTube and Amazon, and the best websites for recipes, sports, shopping, and whatever else you can find on the web. Sometimes you can partner with those sites to get that data from the site itself in a structured way without having to look at each page individually. Many major platforms make this easy and sometimes even do it for free.
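One simple, widely used form of structured access (not the full data partnerships described above, but the everyday version of the same idea) is a sitemap: an XML file that most large sites publish listing their URLs. A small sketch of reading one, where the URL is just a placeholder for a site that actually serves a standard sitemap.xml:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL: many large sites publish a sitemap listing their pages,
# often linked from robots.txt, so a crawler doesn't have to discover them all.
SITEMAP_URL = "https://www.example.com/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Sitemap entries live in the standard sitemap XML namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]
print(f"{len(urls)} URLs listed, e.g. {urls[:3]}")
```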
Crawling the pages
Once you've formulated the answer to the question "What makes a web page good?", it's time to release the spiders. These are bots that look at the content of a web page, find and follow every link on it, index all those pages, follow every link on those pages in turn, and so on. Every time a bot lands on a page, it evaluates that page against the criteria you set for a good page. Anything the bot rates as "good" is downloaded onto a server somewhere and added to your search index alongside the other pages you've designated as good.
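A heavily simplified sketch of that crawl loop, using only Python's standard library. Here `looks_good` is just a stand-in for whatever quality criteria you settled on, and a real crawler would add politeness delays, robots.txt checks, deduplication and much more.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def looks_good(url, html):
    # Stand-in for your 'what makes a web page good?' criteria.
    return len(html) > 500

def crawl(start_url, max_pages=50):
    queue, seen, index = [start_url], set(), {}
    while queue and len(index) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable or unreadable page, skip it
        if looks_good(url, html):
            index[url] = html  # 'download it onto a server somewhere'
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # follow every link, and so on
    return index

# index = crawl("https://example.com/")
```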
But the bots are not welcome everywhere. Every time a bot or crawler opens a web page, it costs the site's provider bandwidth. A search engine that tries to load and save every page of your website once per second, just to make sure the pages in its search index are up to date, can put a huge drain on the available capacity.
So most websites have a file called "robots.txt" that specifies which bots can and cannot access their content and which URLs they are allowed to crawl and index. Search engines technically don't have to respect the wishes in robots.txt, but the convention is to do so, and it's the polite thing to do. Almost all sites allow Google and Bing to crawl their pages, because findability in those search engines outweighs the bandwidth cost. But many websites block specific providers, such as shopping sites that don't want Amazon crawling and analysing their pages. Other websites set general rules: no search engines other than Google and Bing. This is one reason why different search engines can give different results.
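Python's standard library even ships a parser for this convention. A small sketch of how a well-behaved crawler would check robots.txt before fetching a page; the bot name and URLs are purely illustrative.

```python
from urllib import robotparser

# A polite crawler identifies itself and consults robots.txt before fetching.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

USER_AGENT = "MyLittleSearchBot"  # illustrative bot name

for url in ["https://www.example.com/", "https://www.example.com/private/report"]:
    if rp.can_fetch(USER_AGENT, url):
        print(f"allowed to crawl {url}")
    else:
        print(f"robots.txt asks us to skip {url}")
```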
Efficient search: rank
The crawlers return with a broad snapshot of the internet: the bots of the now-defunct Neeva, for example, crawled about 200 million URLs per day. Google won't say how many pages it crawls, but it says it knows of trillions of pages and monitors them regularly.
Then the job is to rank all those pages for every query your search engine might get. That way you can limit the number of pages you have to look through when someone does a search. You could sort the pages by topic and store them in smaller, more searchable indexes instead of one giant database: local results with other local results, shoes with shoes, news with news. The more finely you divide your searchable index, the faster you can search it.
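A toy illustration of that idea: each document lands in a smaller per-topic index, and a query only has to touch the relevant one. The keyword-based routing here is a crude, invented stand-in for the machine-learned topic models mentioned below.

```python
from collections import defaultdict

# Crude stand-in for a topic model: route a document to a shard by keyword.
TOPIC_KEYWORDS = {
    "shoes": ["sneaker", "boot", "heel", "shoe"],
    "news": ["election", "minister", "breaking", "report"],
    "local": ["restaurant", "near", "opening hours"],
}

def topic_of(text):
    text = text.lower()
    for topic, words in TOPIC_KEYWORDS.items():
        if any(word in text for word in words):
            return topic
    return "general"

shards = defaultdict(list)  # topic -> list of (url, text)

def add_document(url, text):
    shards[topic_of(text)].append((url, text))

def search(query):
    # Only scan the shard the query maps to, not the whole collection.
    shard = shards[topic_of(query)]
    return [url for url, text in shard if query.lower() in text.lower()]

add_document("https://example.com/runners", "the best sneaker deals this week")
add_document("https://example.com/vote", "breaking report on the election results")
print(search("sneaker deals"))  # searches only the 'shoes' shard
```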
Search engines make extensive use of machine learning to figure out the topic and content of a given page, but a lot of human insight is also required. Teams of raters score queries and results on a scale of zero to ten. Sometimes it's clear-cut: if someone searches for 'Facebook' and the first result is not facebook.com, then something is obviously wrong. But most of the time the ratings are aggregated, fed back into the search index and the topic model, and then the process starts all over again.
Synonyms
Actually, this is only half of the problem. You also need to consider what is known as "query understanding". This means knowing that people who search for "orange" and "national football team" are looking for the same thing (the Dutch national team is nicknamed Oranje), while those who search for "orange" and "hair" probably mean the colour. So you end up with a huge library of synonyms, matches and ways to rewrite queries. But Google likes to say that 15 per cent of searches are brand new every day, so you'll always be learning new things about how people search online, and you have to continuously update this database of synonyms and similarities.
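A tiny sketch of what such a query-rewriting layer might look like. The synonym table here is invented and absurdly small; real systems learn millions of these mappings and use the rest of the query to decide which ones actually apply.

```python
# Invented, tiny synonym table; real systems learn these mappings at scale
# and apply them depending on the rest of the query (the context).
SYNONYMS = {
    "sneakers": ["trainers", "running shoes"],
    "cheap": ["budget", "affordable"],
    "nyc": ["new york", "new york city"],
}

def expand_query(query):
    """Return the query plus rewrites with known synonyms swapped in."""
    query = query.lower()
    rewrites = {query}
    for word in query.split():
        for alt in SYNONYMS.get(word, []):
            rewrites.add(query.replace(word, alt))
    return sorted(rewrites)

print(expand_query("cheap sneakers nyc"))
# ['affordable sneakers nyc', 'budget sneakers nyc', 'cheap running shoes nyc', ...]
```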
Google can also base this on behaviour within the search engine and on data about what people click. A click on a result that isn't followed by another search or by clicks on other results is the best signal; after all, it indicates that the searcher immediately found what they were looking for and that the combination of keywords was right. And the more users click, the more you learn about what they're actually looking for.
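A sketch of how such a click signal could be aggregated from a search log. The log format and the 'satisfied click' definition here are simplified assumptions for illustration, not how Google actually stores or defines this.

```python
from collections import Counter

# Simplified, made-up search log: (query, clicked_url, came_back_and_clicked_again)
log = [
    ("youtube", "https://www.youtube.com", False),
    ("youtube", "https://www.youtube.com", False),
    ("youtube", "https://en.wikipedia.org/wiki/YouTube", True),
    ("orange hair", "https://example.com/dye-guide", False),
]

# A click the user did not 'bounce back' from counts as a satisfied click:
# the best available hint that the result matched the query.
satisfied = Counter(
    (query, url) for query, url, bounced in log if not bounced
)

for (query, url), count in satisfied.most_common():
    print(f"{count}x satisfied: '{query}' -> {url}")
```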
Speed, cost and quality
Running a search engine is a constant balancing act between speed, cost and quality. You could search the entire database every time someone searches for 'YouTube', but that would take far too long and consume far too much bandwidth and computing power. You could keep a database the size of the internet, but the storage costs would be enormous and searching it would be far too slow. You could limit yourself to the 100 most popular sites on the internet, but that wouldn't be much use to anyone. And websites change constantly, so the crawlers have to keep revisiting them and the ranking systems have to keep adapting.
Finally
In short, it is difficult and expensive to build a search engine from scratch. That's why many search engines don't: they use Bing's search index for between $10 and $25 per 1,000 transactions and then add their own features and interface on top. That's what DuckDuckGo, Yahoo and most of the other smaller search engines do, because Bing is pretty good and managing and maintaining your own search system is a lot of work. Google, however, carefully shields its search engine and data from third parties; it doesn't want to lose its hard-earned position, or even share it with others.
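For illustration, here's roughly what building on someone else's index looks like in practice. This is a sketch assuming the Bing Web Search v7 endpoint and response format and a placeholder subscription key; check Microsoft's current documentation and pricing before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Placeholder key; Bing's Web Search API is paid, billed per batch of queries.
SUBSCRIPTION_KEY = "YOUR-BING-API-KEY"
ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"  # assumed v7 endpoint

def bing_search(query):
    url = ENDPOINT + "?" + urllib.parse.urlencode({"q": query, "count": 5})
    request = urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # A reseller search engine would now re-rank these and add its own interface.
    return [(page["name"], page["url"]) for page in data["webPages"]["value"]]

# for name, url in bing_search("how does a search engine work"):
#     print(name, url)
```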
P.S. How to get your website indexed by Google
- Go to Google Search Console.
- Open the URL inspection tool.
- Paste the URL you want Google to index in the search bar.
- Wait for Google to verify the URL.
- Click the 'Request Indexing' button.
By following this procedure, you invite Google's crawlers to visit your site so that your website can be indexed by Google's search engine.