Porting my blog to Angular was a success story. There was just one catch: Google wouldn't see my articles.
So I dived deep into the anatomy of .htaccess files, learned that the Google crawler can read Angular pages, and ended up with pre-rendered HTML pages. I also learned how vital "pretty URLs" are.
Speaking of Angular: the hints in this article apply to every SPA, including those built with React and Vue.js. I'll frequently talk about Angular simply because I used Angular in my project. Plus, some of the source code in the article is Angular code.
Similarly, I always talk about "Google." Again, that's a pars pro toto. I'm totally aware there are other search engines as well, and I reckon they are using similar strategies. However, from day one, the vast majority of my readers found my blog in a Google search. So I tend to forget the other search engines.
Pretty URLs
Let's start with something simple. Well, it should have been simple, but it gave me my fair share of headaches. Mostly because I found the correct solution quickly, but a nasty typo sent me off to a long, winding journey through SEO space.
Angular is a framework for creating Single Page Applications (aka SPAs). In other words, there's only one entry point: the starting point of the application is always the index.html. Nonetheless, every view within the app has its own URL. The HTML5 history API makes this possible. Plus a little magic of node.js. More generally speaking: the web server has to support SPAs with history rewriting. It has to know that no matter how convoluted the URL is, the application is always located at https://www.example.com/index.html.
My blog runs on a standard Apache HTTP server. By default, this server does not support pretty URLs. But you can teach it with a few lines in the .htaccess file:
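I won't claim these are the literal lines from my configuration, but a minimal sketch matching the description below (and close to what the Angular documentation suggests for Apache, assuming mod_rewrite is enabled) looks like this:

```
RewriteEngine On

# First rule: every URL pointing to a real file or directory is delivered as-is
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^ - [L]

# Second rule: every other ("virtual") URL is served by the SPA entry point
RewriteRule ^ /index.html [L]
```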
Kudos go to Brent Jackson and Leo Caseiro for helping me figure this out. Plus a plethora of tutorials on .htaccess. A notable example is URL Rewriting for Beginners, which starts at a fairly elementary level but covers many advanced topics.
How it works
The first RewriteRule makes sure that the second RewriteRule only applies to virtual URLs. Every URL pointing to a real file on the web server is delivered as-is. That's important, because that includes the images, the CSS stylesheets, and the JavaScript code of the SPA itself.
The second RewriteRule delivers the index.html for every other URL. Note that this is not a redirect. The URL in the browser is still "www.example.com/whatever.html". That's important because Angular needs this URL to display the correct page. My first attempt used a browser redirect:
That's a bad idea because the Google crawler notices that the original URL has gone. If you're lucky, it continues crawling the redirected page. But in any case, it's a potential source of confusion.
Debugging .htaccess files
As a rookie, I'm often confused by why .htaccess files work the way they do. There's a nice online tool at https://htaccess.madewithlove.be/ that allows you to debug and understand your .htaccess file better. Don't be confused by the domain name. As far as I can tell, they are an IT consulting company who're simply so proud of what they're doing that the company name expresses it, too.
Bad URLs
Due to a simple typo, I didn't manage to get the rule up and running at first. As an alternative, I used a browser redirect and passed the URL as a fragment URL. Maybe you know this kind of URL from early AngularJS applications:

https://www.beyondjava.net/#/category/bootsfaces

As it turns out, the Google crawler does not accept this kind of URL. It ignores the fragment, keeping only the part in front of the hash. When I submitted a sitemap.xml full of fragment URLs, the crawler refused to accept any of them. For some reason, it still hasn't resumed its work, even after I corrected the sitemap file. I suspect the wrong URLs made it into a cache with an extended expiration time.
If you investigate the internet a bit harder, you'll find references to the "hash bang" syntax. That's adding an exclamation mark to the hash, like so:

https://www.beyondjava.net/#!/category/bootsfaces

Googlebot used this syntax for a couple of years to distinguish traditional fragments (like #scroll-to-top) from fragments controlling the behavior of an SPA. As far as I know, this syntax still works. You shouldn't use it, anyway: Google deprecated it a couple of years ago.
Always keep your URLs
The golden rule of search engine optimization (aka SEO) is never to throw away a URL. No matter how you reorganize your blog or webshop, always see to it that the old URLs still work. Otherwise, the new URLs start without history. In other words, they start with a bad page rank.
So you end up with two sets of URLs. Adding a canonical URL allows Google to make the connection between the two.
Now there's a catch. Angular is a single page application. The canonical URL is part of the head of the HTML page, and Angular doesn't support modifying the head out-of-the-box. The only exception is the title of the page. However, you can solve this with a custom directive. The idea is to define the link in the HTML template of a component, so the link is initially rendered somewhere in the body of the HTML page. The directive moves it to the head of the page, and deletes it again when the component is destroyed:
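In the component's template, this boils down to a single line. The canonicalUrl property is a hypothetical field of the component holding the old WordPress-era URL; appMoveToHead is the selector of the directive implemented in the next listing:

```
<!-- Rendered in the body first; the directive moves it to the <head>.
     canonicalUrl is a hypothetical property of the component. -->
<link rel="canonical" [appMoveToHead]="canonicalUrl">
```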
The directive is implemented like so:
```typescript
import { Directive, Renderer2, ElementRef, Inject, OnInit,
         OnDestroy, OnChanges, Input, SimpleChanges } from '@angular/core';
import { DOCUMENT } from '@angular/platform-browser';

@Directive({ selector: '[appMoveToHead]' })
export class MoveToHeadDirective implements OnInit, OnDestroy, OnChanges {
  @Input() appMoveToHead: any;
  private hasBeenAdded = false;

  constructor(private renderer: Renderer2,
              private elRef: ElementRef,
              @Inject(DOCUMENT) private document: Document) {}

  ngOnInit(): void {
    this.addLink();
    this.renderer.removeAttribute(this.elRef.nativeElement, 'movetohead');
    this.hasBeenAdded = true;
  }

  ngOnDestroy(): void {
    // Clean up: remove the link from the <head> when the component dies
    this.renderer.removeChild(this.document.head, this.elRef.nativeElement);
  }

  ngOnChanges(changes: SimpleChanges): void {
    if (this.hasBeenAdded) {
      this.renderer.removeChild(this.document.head, this.elRef.nativeElement);
    }
    this.addLink();
  }

  private addLink() {
    // Move the <link> element from the body to the <head> and set its href
    this.renderer.appendChild(this.document.head, this.elRef.nativeElement);
    const native: HTMLLinkElement = this.elRef.nativeElement;
    native.setAttribute('href', this.appMoveToHead);
  }
}
```

Kudos for this idea go to Alireza Mirian.
Don't forget your polyfills!
We're almost there. Or so I thought. After a couple of days, I noticed that the Google index didn't contain my new articles.
It took me a lot of, well, googling to find out what was going on. The first and most obvious idea is that the crawler doesn't cope with JavaScript-based SPAs. But that's not the case. It used to be true in earlier times, so you still find many resources on the internet telling you to store pre-rendered HTML pages on your server.
Nowadays, the Google crawler "understands" JavaScript. It starts your application in a browser in "headless" mode (i.e., without UI) and waits until the page is loaded and initialized. This crawler needs more resources than the simple HTML crawler, so expect it to index your page with a couple of days' delay.
As it turns out, "understanding JavaScript" doesn't mean the headless browser copes with the same HTML5 as your local browser does. In August 2017, Googlebot used Google Chrome 41 to crawl the web. That was an old version, even back in August 2017. I suppose the version is updated every once in a while, but it's a good idea to prepare for a legacy browser.
In the case of Angular 6, this means you have to activate your polyfills. That's a good idea, anyway, because you probably want to include corporate users who have to use a stone-age Internet Explorer.
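In an Angular 6 CLI project, that amounts to uncommenting the corresponding imports in src/polyfills.ts. The exact list depends on your CLI version and target browsers; a typical selection (using the core-js v2 paths that shipped with Angular 6) looks like this:

```
// src/polyfills.ts -- uncomment the core-js imports your target browsers need.
// These cover ES2015/ES2016 features missing in legacy browsers
// such as Chrome 41 or Internet Explorer 11.
import 'core-js/es6/symbol';
import 'core-js/es6/object';
import 'core-js/es6/function';
import 'core-js/es6/array';
import 'core-js/es6/string';
import 'core-js/es7/array';
import 'core-js/es7/object';
import 'core-js/es6/reflect';
import 'core-js/es7/reflect'; // required for JIT compilation
```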
Verify if Google gets it
What we need is a tool that checks what the crawler can parse and what it can't. If you haven't already done so, this is probably the time to create a Google account and register yourself as the owner of your domain. Now you can open the webmaster tools and request a "fetch as Google" for your URLs. This tool gives you a preview of what the crawler makes of your website.
In the case of Angular 6, the crawler probably crashes quickly with a JavaScript exception. You'll never see the JavaScript error. All you see is an incomplete or even blank page in the "fetch as Google" preview. Activating the polyfills usually fixes that (but of course, it depends on your website).
If you're really desperate, you can catch the JavaScript errors and print them in the application window. Granted, that's a mediocre replacement for watching the console log, but for some reason, the Google Search Console doesn't show us the console window yet. However, you should remove the diagnostic code after your debugging session. Otherwise, your customers may be confused by a cryptic error message intended for experts.
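A minimal sketch of such a debugging aid might look like the snippet below. Both function names are my own invention, not part of any API; the formatting logic is kept in a separate function so it can be tested in isolation:

```typescript
// Hypothetical debugging aid (remove before going to production!):
// render uncaught JavaScript errors into the page itself, because the
// "fetch as Google" preview shows the page, but not the console log.

export function formatError(message: string, source: string, line: number): string {
  return `JavaScript error: ${message} (${source}:${line})`;
}

export function attachErrorOverlay(doc: Document): void {
  doc.defaultView!.addEventListener('error', (event: ErrorEvent) => {
    const banner = doc.createElement('div');
    banner.style.cssText =
      'position:fixed;top:0;left:0;right:0;background:#fdd;padding:4px;z-index:9999';
    banner.textContent = formatError(event.message, event.filename, event.lineno);
    doc.body.appendChild(banner);
  });
}
```

Call attachErrorOverlay(document) once during bootstrap, and every uncaught error shows up as a banner at the top of the page.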
Fine-tuning
Now that the basic functionality of your web application is there and indexed by every major search engine, let's talk about how to make it even better.
One thing is to activate Gzip compression. We do this by adding a few lines to the .htaccess file:
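The exact lines depend on which modules your server has enabled; a typical mod_deflate snippet looks like this:

```
<IfModule mod_deflate.c>
  # Compress text-based resources before sending them over the wire
  AddOutputFilterByType DEFLATE text/html text/plain text/css
  AddOutputFilterByType DEFLATE application/javascript application/json
  AddOutputFilterByType DEFLATE image/svg+xml
</IfModule>
```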
We can use the same file to activate caching:
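Again a sketch rather than a literal copy of my configuration, assuming mod_expires is available:

```
<IfModule mod_expires.c>
  ExpiresActive On
  # Images and the favicon
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType image/gif "access plus 1 month"
  ExpiresByType image/x-icon "access plus 1 month"
  # Fonts
  ExpiresByType font/woff2 "access plus 1 month"
  ExpiresByType application/font-woff "access plus 1 month"
  # CSS and JavaScript (safe because Angular fingerprints the file names)
  ExpiresByType text/css "access plus 1 month"
  ExpiresByType application/javascript "access plus 1 month"
</IfModule>
```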
These lines activate caching for your images, the favicon, the fonts, the CSS files, and the JavaScript code. Just in case you consider one month too much for CSS and JavaScript: Angular generates a unique file name for the CSS and JavaScript files in the production build. So you can safely set the caching period to a month, a year or a decade. If you change the JavaScript code or the CSS, Angular chooses another file name, bypassing the browser cache altogether.
Down to Earth
It's all well and good that Google manages to interpret SPAs correctly. Unfortunately, after a while, I noticed that at least one other web application reads my blog, and I doubt it understands JavaScript. I'm talking about https://www.topjavablogs.com. That's a no-nonsense news aggregator popular among the readers of BeyondJava.net.
So even if the Google crawler doesn't require it, it's a good idea to store pre-rendered HTML pages on your servers. As a side effect, the first load is much faster. Mobile users will thank you.
An interesting alternative is Angular Universal. I don't cover it in this article because I haven't had an opportunity to try it myself yet.
Wrapping it up
The technical migration from WordPress to an Angular blog was easy enough. However, it was just the beginning of the journey. Angular wasn't written with blogs in mind, and search engines weren't written with single page applications in mind. In other words, several steps are necessary to optimize your SPA for the world of SEO. In this article, I concentrated on Googlebot and Angular because they are important for my blog. But I'm sure you can apply these ideas to other search engines and other frameworks and libraries as well.
Dig deeper
Brent Jackson's Gist for pretty URLs using .htaccess
Leo Caseiro's Gist for pretty URLs using .htaccess
Online tool to debug and understand .htaccess files
Alireza Mirian's moveToHead directive
Angular Universal (server side rendering for Angular)