
Porting my blog to Angular was a success story. There was just one catch: Google wouldn't see my articles.

So I dived deep into the anatomy of .htaccess files, learned that the Google crawler can read Angular pages, and ended up with pre-rendered HTML pages. I also learned how vital "pretty URLs" are.

Speaking of Angular: the tips in this article apply to every SPA, including React and Vue.js apps. I'll frequently talk about Angular simply because I used Angular in my project. Plus, some of the source code in the article is Angular code.

Similarly, I always talk about "Google." Again, that's a pars pro toto. I'm totally aware there are other search engines as well, and I reckon they are using similar strategies. However, from day one, the vast majority of my readers found my blog in a Google search. So I tend to forget the other search engines.

Pretty URLs

Let's start with something simple. Well, it should have been simple, but it gave me my fair share of headaches. Mostly because I found the correct solution quickly, but a nasty typo sent me off on a long, winding journey through SEO space.

Angular is a framework for creating Single Page Applications (aka SPAs). In other words: there's only one entry point. The application is always loaded via index.html. Nonetheless, every view within the app has its own URL. History rewriting makes this possible.

Plus a little magic of Node.js. More generally speaking: the web server has to support SPAs with history rewriting. It has to know that no matter how convoluted the URL is, the application is always located at https://www.example.com/index.html.
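To make "history rewriting" more tangible, here's a minimal client-side sketch using the plain HTML5 history API, which routers like Angular's use under the hood (the route name is just an example):

// Change the address bar to a pretty URL without a server round trip:
history.pushState({ view: 'category' }, '', '/category/bootsfaces');

// React when the user navigates back or forward:
window.addEventListener('popstate', () => {
  // re-render the view matching the current URL
  console.log('Now showing', location.pathname);
});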

My blog runs on a standard Apache HTTP server. By default, this server does not support pretty URLs. But you can teach it with a few lines in the .htaccess file:

RewriteEngine on

# If an existing asset or directory is requested, deliver it as it is
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} -f [OR]
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} -d
RewriteRule ^ - [L]

# If the requested resource doesn't exist, use index.html
Options +FollowSymLinks
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !index
RewriteRule (.*) index.html [L]

Kudos go to Brent Jackson and Leo Caseiro for helping me to figure this out. Plus a plethora of tutorials on .htaccess. A notable example is URL Rewriting for Beginners, which starts at a fairly elementary level but covers many advanced topics.

How it works

The first RewriteRule makes sure that the second RewriteRule only applies to virtual URLs. Every URL pointing to a real file on the web server is delivered as-is. That's important, because the real files include the images, the CSS stylesheets, and the JavaScript code of the SPA itself.

The second RewriteRule delivers the index.html for every other URL. Note that this is not a redirect. The URL in the browser's address bar is still "www.example.com/whatever.html". That's important because Angular needs this URL to display the correct page. My first attempt used a browser redirect:

RewriteRule (.*) index.html [R=301,L]

That's a bad idea because the Google crawler notices that the original URL has gone. If you're lucky, it continues crawling the redirected page. But in any case, it's a potential source of confusion.

Debugging .htaccess files

As a rookie, I'm often confused by the way .htaccess files work. There's a nice online tool at https://htaccess.madewithlove.be/ that allows you to debug and understand your .htaccess file better. Don't be confused by the domain name. As far as I can tell, they are an IT consulting company that is so proud of what they're doing that the company name expresses it, too.

Bad URLs

Due to a simple typo, I didn't manage to get the rule up and running at first. As an alternative, I used a browser redirect and passed the URL as a fragment. Maybe you know this kind of URL from early AngularJS applications:

https://www.beyondjava.net/#/category/bootsfaces

As it turns out, the Google crawler does not accept this kind of URL. It ignores the fragment, keeping only the part in front of the hash. When I submitted a sitemap.xml full of fragment URLs, the crawler refused to accept any of these URLs. For some reason, it still hasn't resumed its work, even after correcting the sitemap file. I suspect the wrong URLs made it into a cache with an extended expiration time.

If you search the internet a bit harder, you'll find references to the "hashbang" syntax. That's adding an exclamation mark after the hash, like so:

https://www.beyondjava.net/#!/category/bootsfaces

Googlebot used this syntax for a couple of years to distinguish traditional fragments (like #scroll-to-top) from fragments controlling the behavior of an SPA. As far as I know, this syntax still works. You shouldn't use it anyway: Google deprecated it a couple of years ago.

Always keep your URLs

The golden rule of search engine optimization (aka SEO) is never to throw away a URL. No matter how you reorganize your blog or webshop, always see to it that the old URLs still work. Otherwise, the new URLs start without history. In other words, they start with a bad page rank.

So you end up with two sets of URLs. Adding a canonical URL allows Google to make the connection between the two URLs.
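For reference, in plain HTML the canonical URL is a single link element in the head of the page (the href below is just an example):

<link rel="canonical" href="https://www.example.com/category/bootsfaces">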

Now there's a catch. Angular is a single page application. The canonical URL is part of the head of the HTML page. Angular doesn't support modifying the head out of the box. The only exception is the title of the page. However, you can solve this with a custom directive. The idea is to define the link in the HTML template of a component. So the link is rendered somewhere in the body of the HTML page. The directive moves it to the head of the page and deletes it again when the component is destroyed.

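In the component's template, the usage might look like this (a sketch: canonicalUrl is a hypothetical component property holding the canonical URL):

<link rel="canonical" [appMoveToHead]="canonicalUrl">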

The directive is implemented like so:

import { Directive, Renderer2, ElementRef, Inject, OnInit, OnDestroy, OnChanges, Input, SimpleChanges } from '@angular/core';
import { DOCUMENT } from '@angular/platform-browser';

@Directive({
  selector: '[appMoveToHead]'
})
export class MoveToHeadDirective implements OnInit, OnDestroy, OnChanges {
  // the canonical URL to write into the link's href attribute
  @Input() appMoveToHead: any;

  private hasBeenAdded = false;

  constructor(private renderer: Renderer2,
              private elRef: ElementRef,
              @Inject(DOCUMENT) private document: Document) {}

  ngOnInit(): void {
    this.addLink();
    // remove the directive attribute so it doesn't show up in the <head>
    this.renderer.removeAttribute(this.elRef.nativeElement, 'movetohead');
    this.hasBeenAdded = true;
  }

  ngOnDestroy(): void {
    // clean up when the component is destroyed
    this.renderer.removeChild(this.document.head, this.elRef.nativeElement);
  }

  ngOnChanges(changes: SimpleChanges): void {
    // replace the link when the input URL changes
    if (this.hasBeenAdded) {
      this.renderer.removeChild(this.document.head, this.elRef.nativeElement);
    }
    this.addLink();
  }

  private addLink() {
    // move the <link> element from the component's template to the <head>
    this.renderer.appendChild(this.document.head, this.elRef.nativeElement);
    const native: HTMLLinkElement = this.elRef.nativeElement;
    native.setAttribute('href', this.appMoveToHead);
  }
}

Kudos for this idea go to Alireza Mirian.

Don't forget your polyfills!

We're almost there. Or so I thought. After a couple of days, I noticed that the Google index didn't contain my new articles.

It took me a lot of, well, googling to find out what was going on. The first and most obvious idea is that the crawler doesn't cope with JavaScript-based SPAs. But that's not the case. It used to be true in earlier times, so you still find many resources on the internet telling you to store pre-rendered HTML pages on your server.

Nowadays, the Google crawler "understands" JavaScript. It starts your application in a browser in "headless" mode (i.e., without a UI) and waits until the page is loaded and initialized. This crawler needs more resources than the simple HTML crawler, so expect it to index your page with a couple of days' delay.

As it turns out, "understanding JavaScript" doesn't mean the headless browser copes with the same HTML5 as your local browser does. In August 2017, Googlebot used Google Chrome 41 to crawl the web. That was an old version, even back in August 2017. I suppose the version is updated every once in a while, but it's a good idea to prepare for a legacy browser.

In the case of Angular 6, this means you have to activate your polyfills. That's a good idea anyway, because you probably want to include corporate users who have to use a stone-age Internet Explorer.
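In an Angular 6 CLI project, activating the polyfills amounts to uncommenting imports in src/polyfills.ts. Here's an excerpt of what the activated section looks like (which polyfills you actually need depends on your code and target browsers):

// src/polyfills.ts (excerpt)
// ES6 polyfills for legacy browsers such as IE9-11 and old Chrome versions:
import 'core-js/es6/symbol';
import 'core-js/es6/object';
import 'core-js/es6/function';
import 'core-js/es6/parse-int';
import 'core-js/es6/parse-float';
import 'core-js/es6/number';
import 'core-js/es6/math';
import 'core-js/es6/string';
import 'core-js/es6/date';
import 'core-js/es6/array';
import 'core-js/es6/regexp';
import 'core-js/es6/map';
import 'core-js/es6/set';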

Verify if Google gets it

What we need is a tool that checks what the crawler can parse and what it can't. If you haven't already done so, this is probably the time to create a Google account and register yourself as the owner of your domain. Now you can open the webmaster tools and request a "fetch as Google" for your URLs. This tool gives you a preview of what the crawler makes of your website.

In the case of Angular 6, the crawler probably crashes quickly with a JavaScript exception. You'll never see the JavaScript error. All you see is an incomplete or even blank page in the "fetch as Google" preview. Activating the polyfills usually fixes that (but of course, it depends on your website).

If you're really desperate, you can catch the JavaScript errors and print them on the application window. Granted, that's a mediocre replacement for watching the console log, but for some reason, the Google Search Console doesn't show us the console window yet. However, you should remove the diagnostic code after running your debugging session. Otherwise, your customers may be confused by a cryptic error message intended for the experts.
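A minimal sketch of such a diagnostic hook, based on the standard window.onerror callback (the styling and wording are placeholders; remember to remove it again):

// Temporary debugging aid: show uncaught JavaScript errors on the page,
// because "fetch as Google" renders the page but hides the console.
window.onerror = (message, source, lineno) => {
  const box = document.createElement('pre');
  box.style.cssText =
    'position:fixed;bottom:0;left:0;right:0;background:#fee;color:#900;padding:8px;z-index:9999;';
  box.textContent = 'JavaScript error: ' + message + ' (' + source + ':' + lineno + ')';
  document.body.appendChild(box);
  return false; // keep the default error handling, too
};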

Finetuning

Now that the basic functionality of your web application is there and indexed by every major search engine, let's talk about how to make it even better.

One thing is to activate Gzip compression. We do this by adding a few lines to the .htaccess file:

SetOutputFilter DEFLATE
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/json application/ld+json

We can use the same file to activate caching:

ExpiresActive On
ExpiresByType image/jpg "access 1 month"
ExpiresByType image/jpeg "access 1 month"
ExpiresByType image/gif "access 1 month"
ExpiresByType image/png "access 1 month"
ExpiresByType image/vnd.microsoft.icon "access 1 year"
ExpiresByType font/ttf "access 1 month"
ExpiresByType font/woff "access 1 month"
ExpiresByType font/woff2 "access 1 month"
ExpiresByType image/svg+xml "access 1 month"
ExpiresByType text/css "access 1 month"
ExpiresByType application/javascript "access 1 month"

These lines activate caching for your images, the favicon, the fonts, the CSS files, and the JavaScript code. Just in case you consider one month too much for CSS and JavaScript: Angular generates a unique file name for the CSS and JavaScript files in the production build. So you can safely set the caching period to a month, a year or a decade. If you change the JavaScript code or the CSS, Angular chooses another file name, bypassing the browser cache altogether.

Down to Earth

It's all well and good that Google manages to interpret SPAs correctly. Unfortunately, after a while, I noticed that at least one other web application reads my blog - and I doubt it understands JavaScript. I'm talking about https://www.topjavablogs.com. That's a no-nonsense news aggregator popular among the readers of BeyondJava.net.

So even if the Google crawler doesn't require it, it's a good idea to store pre-rendered HTML pages on your server. As a side effect, the first page load is much faster. Mobile users will thank you.
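One way to generate those pages - just a sketch, not necessarily what I did - is a small Node.js script using Puppeteer (headless Chrome); the domain, routes, and output paths below are placeholders:

import * as puppeteer from 'puppeteer';
import { promises as fs } from 'fs';

// Render each route in a headless browser and save the resulting HTML,
// so crawlers without JavaScript support still see the full page.
async function prerender(routes: string[]): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (const route of routes) {
    await page.goto(`https://www.example.com${route}`, { waitUntil: 'networkidle0' });
    const html = await page.content();
    await fs.mkdir(`dist${route}`, { recursive: true });
    await fs.writeFile(`dist${route}/index.html`, html);
  }
  await browser.close();
}

prerender(['/category/bootsfaces']);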

An interesting alternative is Angular Universal. I don't cover it in this article because I didn't have an opportunity to try it myself yet.

Wrapping it up

The technical migration from WordPress to an Angular blog was easy enough. However, it was just the beginning of the journey. Angular hasn't been written with blogs in mind, and search engines haven't been written with single page applications in mind. In other words: there are several steps necessary to optimize your SPA for the world of SEO. In this article, I concentrated on Googlebot and Angular, because they are important for my blog. But I'm sure you can use the ideas of this article for other search engines and other frameworks and libraries as well.


Dig deeper

Brent Jackson's Gist for pretty URLs using .htaccess

Leo Caseiro's Gist for pretty URLs using .htaccess

URL Rewriting for Beginners

Online tool to debug and understand .htaccess files

Details about Googlebot

"fetch as Google" tutorial

Alireza Mirian's moveToHead directive

Angular Universal (server side rendering for Angular)

