Search Engine with Apache Nutch, MongoDB and Elasticsearch

This post describes how to build a search engine with Apache Nutch using MongoDB as the data store, then index the crawled data with Elasticsearch, and finally visualize the indexed data with Kibana.

A brief description of each component:

Apache Nutch

An open source web crawler that originated as a subproject of Apache Lucene

MongoDB

An open source document database and one of the leading NoSQL databases


Elasticsearch

A highly scalable, distributed search engine built on top of Apache Lucene

Kibana

An open source browser-based analytics and search dashboard for Elasticsearch

Data Flow

(Diagram: data flows from Nutch into MongoDB, is indexed by Elasticsearch, and is visualized in Kibana)

Prepare the stage

The installation steps were performed on a vanilla CentOS 7 with the Base group installed.


Download, configure and start MongoDB
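
On CentOS 7 this was presumably a tarball install along these lines (the version and install prefix are examples):

    # Download and unpack MongoDB (version is an example)
    curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.6.4.tgz
    tar xzf mongodb-linux-x86_64-2.6.4.tgz -C /opt
    export PATH=$PATH:/opt/mongodb-linux-x86_64-2.6.4/bin
    # Create the data directory
    mkdir -p /data/db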

Starting with version 2.6, MongoDB uses a YAML-based configuration file format. A reference for the configuration below can be found in the MongoDB documentation.
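
A minimal sketch of such a configuration (the file location and paths are examples):

    # /etc/mongod.conf
    storage:
      dbPath: /data/db
    systemLog:
      destination: file
      path: /var/log/mongodb/mongod.log
      logAppend: true
    net:
      bindIp: 127.0.0.1
      port: 27017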

Start MongoDB
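
Assuming the configuration file sketched above:

    mkdir -p /var/log/mongodb
    mongod --config /etc/mongod.conf --fork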

And ensure that MongoDB started without issues
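
For example, by checking the log and querying the server:

    tail /var/log/mongodb/mongod.log
    mongo --eval 'db.version()'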


Download, configure and start Elasticsearch
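
The version and download URL below are examples:

    # Download and unpack Elasticsearch
    curl -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.4.2.tar.gz
    tar xzf elasticsearch-1.4.2.tar.gz -C /opt
    # Start it as a daemon
    /opt/elasticsearch-1.4.2/bin/elasticsearch -d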

If Elasticsearch starts without issues, you should be able to access its RESTful web interface
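
A quick check against the default port:

    curl http://localhost:9200/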


Download and start Kibana
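
Again, the version and download URL are examples; with Kibana 4, the default configuration already points at the local Elasticsearch instance:

    curl -O https://download.elastic.co/kibana/kibana/kibana-4.0.1-linux-x64.tar.gz
    tar xzf kibana-4.0.1-linux-x64.tar.gz -C /opt
    /opt/kibana-4.0.1-linux-x64/bin/kibana &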

Kibana should now be accessible via its web interface; launch a browser and go to http://localhost:5601

Apache Nutch

Download, compile, and configure Apache Nutch
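
A sketch of the download; the 2.x version is an example, any release with the MongoDB Gora backend should do:

    wget https://archive.apache.org/dist/nutch/2.3/apache-nutch-2.3-src.tar.gz
    tar xzf apache-nutch-2.3-src.tar.gz
    cd apache-nutch-2.3
    export NUTCH_HOME=$(pwd)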

Specify MongoDB as the GORA backend in $NUTCH_HOME/conf/nutch-site.xml
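
The relevant property:

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.mongodb.store.MongoStore</value>
      <description>Default class for storing data</description>
    </property>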

Ensure the MongoDB gora-mongodb dependency is available in $NUTCH_HOME/ivy/ivy.xml by uncommenting the line below in the file
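
The revision number depends on the release:

    <dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />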

Ensure that MongoStore is set as the default datastore in $NUTCH_HOME/conf/
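
That is, in what is presumably conf/gora.properties, with connection settings matching the local MongoDB:

    gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
    gora.mongodb.servers=localhost:27017
    gora.mongodb.db=nutch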

And start compiling Nutch:
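
    ant runtime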

At this point, dependencies are resolved by Ivy; Nutch is then compiled and placed in $NUTCH_HOME/runtime/

Finally, verify that Nutch compiled and runs correctly
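
Running the wrapper script with no arguments should print its usage message listing the available commands:

    cd $NUTCH_HOME/runtime/local
    bin/nutch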

Now the stage is ready for the play

Crawl your first website

Nutch requires two configuration changes before a website can be crawled:

  1. Customize your crawl properties, where, at a minimum, you provide a name for your crawler so that external servers can recognize it
  2. Set a seed list of URLs to crawl

Customize your crawl properties

Default crawl properties can be viewed and edited within conf/nutch-default.xml; most of these can be used without modification

The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that override conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name property.

i.e. Add your agent name in the value field of the property in conf/nutch-site.xml, for example:
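
The agent name below is only an example:

    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>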

Create a URL seed list
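
The seed directory name and URL are examples:

    mkdir -p $NUTCH_HOME/runtime/local/urls
    echo 'http://nutch.apache.org/' > $NUTCH_HOME/runtime/local/urls/seed.txt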

(Optional) Configure Regular Expression Filters

Edit the file conf/regex-urlfilter.txt and replace the default rule that accepts every URL with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to a single domain, the filter will then include any URL in that domain and skip everything else, as in the sketch below.
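
Using nutch.apache.org as the example domain:

    # the default rule accepts everything:
    # +.
    # replace it with a rule limited to the target domain:
    +^http://([a-z0-9]*\.)*nutch\.apache\.org/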

Initialize the crawldb
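
Assuming the remaining commands are run from $NUTCH_HOME/runtime/local, with the seed directory created earlier:

    bin/nutch inject urls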

Generate URLs from crawldb
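
The -topN value, which caps how many URLs are selected per round, is an example:

    bin/nutch generate -topN 100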

Fetch generated URLs
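
Again with the -all batch flag:

    bin/nutch fetch -all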

Parse fetched URLs
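
Likewise:

    bin/nutch parse -all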

Update database from parsed URLs
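
Presumably:

    # on some 2.x releases this is plain "bin/nutch updatedb"
    bin/nutch updatedb -all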

Index parsed URLs
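
Assuming the Elasticsearch indexer plugin is enabled in nutch-site.xml; on older 2.x releases the equivalent was bin/nutch elasticindex <clustername> -all:

    bin/nutch index -all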

This is the first round, in which the base URL is injected, generated, fetched, parsed, updated and indexed. These steps should be repeated a few more times until the whole site is indexed; how many rounds depends on the depth of the website. For a small site it could take 2 or 3 rounds, but for large dynamic websites it could take days to finish indexing.

Now that the website has been crawled and indexed, the data can be visualized using Kibana.

From the web browser, navigate to http://localhost:5601

Kibana needs an index from Elasticsearch in order to visualize the data under that index. On the "Configure an index pattern" page, uncheck "Index contains time-based events" and put "nutch" in the index name field.

(Screenshot: configuring the "nutch" index pattern in Kibana)

Now go to the "Discover" page from the top navigation bar, and the crawled content should show up.

(Screenshot: the Kibana Discover page showing the crawled content)

Passing Red Hat EX401

Having passed Red Hat EX401, I thought I'd share my experience and the study notes I took in order to pass this exam.

First things first, I must say that I won't be sharing anything about the exam itself, as that would be a breach of the NDA. I also want to highlight that I've used Satellite and Spacewalk a lot over the past 3 years, doing plenty of administration and numerous installations, and this helped a lot in studying.

Also, I was certified on Satellite 5.6, and as Satellite 6 will be released in the next few months, this review is specifically about 5.6; version 6 will be totally different and will involve new components such as Puppet, Foreman and Katello.

As with all Red Hat exams, Red Hat EX401 is a performance-based exam: you get multiple tasks which you must perform, and the results should persist across reboots without anyone's intervention.

I took the exam without attending the training course, so it was a self-study effort, and I will do the same with all the other RHCA exams that I am planning to take.

So how did I study?

From the official documentation provided by Red Hat. The documentation is very thorough and covers everything about Satellite.

How to plan for studying?

First, review the study points for Red Hat EX401 and get very familiar with them. I was checking these points almost daily and making sure they were thoroughly covered during my studies; trust me, they are very important. I also made sure to practice them every three or four days.

The lab

Satellite is provided by Red Hat, and you will need to get in touch with them in order to obtain the software. Here comes the "but" part: there is the upstream open source project for Satellite, Spacewalk, which is freely available, and there is almost no difference between the two. Although newer versions of Spacewalk have a slightly different UI, it's essentially the same and simple to use.

I practiced on virtual machines that I created on my laptop. Mainly, I would create a virtual machine and take a snapshot, then perform an installation and take a snapshot, then perform some of the tasks in the study points and take another snapshot, and so on. Why? To be able to revert back to any snapshot and perform the tasks again.

One thing I must highlight though is that you have to be very comfortable creating RPMs from source code. I already build RPMs, so I was very familiar with the concept, but in case you haven't created RPMs before, you need to practice a lot; the sketch below shows the basic workflow.
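
The package and file names here are made up, just to illustrate the rpmbuild flow:

    # install the build tooling
    yum -y install rpm-build rpmdevtools
    # create the standard ~/rpmbuild directory tree
    rpmdev-setuptree
    # drop the source tarball and spec file into place, then build
    cp hello-1.0.tar.gz ~/rpmbuild/SOURCES/
    cp hello.spec ~/rpmbuild/SPECS/
    rpmbuild -ba ~/rpmbuild/SPECS/hello.spec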

Final preparation

The day before the exam, I imagined myself sitting in the exam room with the exam in front of me: what would I do? I have to accomplish all the study points in 4 hours, so that's what I did; I performed all the points in just 2 hours. Practicing will get you even better times.

Red Hat EX401

The exam itself isn't hard, and I actually finished in less than half the time. Three things though:

  • You have to organize and manage your time very well
  • You will have enough time to test everything again
  • Don’t panic and keep relaxed

And best of luck to everyone.

Dockerizing nginx

Now that we have images ready, let's start dockerizing some applications to play with. I will start with dockerizing nginx and serving some static HTML content from it.

To dockerize an application, you need two things:

      1. A base image which the new image will be built from
      2. A Dockerfile

Now what’s a dockerfile?

Dockerfile is a text file that describes how and what will the new image do. It contains command that you would execute normally when installing and configuring applications.

Dockerizing nginx

Now the first thing is to choose which image you want to build nginx on. I'll choose the Debian Wheezy image created previously.

So let’s start writing our Dockerfile…

I always put a description, the author, and the usage of the image at the top, so create a new file (nginx.docker) and put:
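
A minimal sketch matching the steps listed below; the maintainer details are placeholders:

    # Dockerized nginx on Debian Wheezy
    # Usage: docker build -t aossama/nginx - < nginx.docker
    FROM aossama/wheezy
    MAINTAINER Your Name <you@example.com>
    RUN apt-get update && apt-get install -y nginx
    EXPOSE 80
    EXPOSE 443
    VOLUME ["/etc/nginx/sites-enabled", "/var/log/nginx", "/usr/share/nginx/html"]
    WORKDIR /etc/nginx
    # keep nginx in the foreground so the container stays up
    CMD ["nginx", "-g", "daemon off;"]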

Now this is a simple Dockerfile which builds the image in layers, in these steps:

      1. Start a new image FROM aossama/wheezy
      2. With the maintainer's details
      3. After that, run these commands in a new container
      4. Then expose port 80
      5. And port 443
      6. The image should have these mount points available
      7. When the container starts, use this working directory by default
      8. Finally, run nginx

Now build the new image:
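
The image tag is an assumption; the Dockerfile is fed via stdin because it isn't named Dockerfile:

    docker build -t aossama/nginx - < nginx.docker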

And inspect the images; if the build finishes successfully, you should find the new image in the list.
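
    docker images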

Your image is ready to run in a container; to run it:
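
Presumably detached, using the tag assumed above:

    docker run -d aossama/nginx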

Inspect the running container and grep the IP address:
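
A sketch using the most recently started container:

    docker inspect $(docker ps -ql) | grep IPAddress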

If you open a browser and point it to this IP address, you should see the welcome page of nginx.

Build RHEL Docker Image from Scratch

This article covers how to build a RHEL6 docker image from scratch and use it in docker. In order to build this image, you should either have access to RHN or have a DVD to install the base system from. I'll be installing it from a DVD.

We’ll be using yum and rpm to initialize the image’s root directory.

Build RHEL Image

First install yum and rpm
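
On a RHEL or CentOS host these are normally already present; otherwise:

    yum -y install yum rpm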

The next step is to initialize the RPM database so that we can install all of the software we need in the image: we will need to create the directory for the database because RPM expects it to exist, and then use the RPM "initdb" command:
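
The root directory path is an example:

    export IMGROOT=/var/tmp/rhel6rootfs
    mkdir -p $IMGROOT/var/lib/rpm
    rpm --root $IMGROOT --initdb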

In order for YUM to manage and install software into our image, it needs to know which RHEL version to install; for this, the redhat-release package needs to be installed in the image.
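
Assuming the DVD is mounted on /mnt/rhel6-dvd; --nodeps is needed because the image's RPM database is still empty:

    rpm --root $IMGROOT -ivh --nodeps /mnt/rhel6-dvd/Packages/redhat-release-server-6Server-*.rpm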

Next, configure a YUM repository
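
A repository definition pointing at the mounted DVD, created on the host where yum will run:

    cat > /etc/yum.repos.d/rhel6-dvd.repo <<'EOF'
    [rhel6-dvd]
    name=RHEL 6 DVD
    baseurl=file:///mnt/rhel6-dvd
    enabled=1
    gpgcheck=0
    EOF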

Now install yum and rpm in the image
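
Using the repository defined above:

    yum --installroot=$IMGROOT -y install yum rpm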

At this point you have a small environment which you can chroot into and modify it the way you like
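
For example:

    chroot $IMGROOT /bin/bash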

You will drop into a bash shell in the newly created system. I won't make any further changes to this system except modifying the banner
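
Inside the chroot; the banner text is an example:

    echo "RHEL 6 docker image" > /etc/motd
    exit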

Import Docker Image

Now let’s import this new system into docker:

If the import goes well, you will see the ID of the newly imported image

First check the images list:
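
    docker images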

And start the new image
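
    docker run -i -t rhel6 /bin/bash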

And voilà, you just built a RHEL6 docker image, imported it, and it's ready for dockerizing applications.

Build Debian Docker Image from Scratch

Now that we have practiced some basic operations with Docker, let’s create, import and start an image. This article covers how to build a Debian docker image from scratch and use it in docker.

We’ll be using the famous debootstrap tool which installs a Debian base system into a subdirectory of another, already installed system.

Build Wheezy

First install debootstrap
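
Assuming a Debian or Ubuntu host:

    apt-get install -y debootstrap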

Then prepare the root directory
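
The path is an example:

    mkdir -p /var/tmp/wheezy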

Finally build a Wheezy system
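
The mirror URL is an example:

    debootstrap wheezy /var/tmp/wheezy http://archive.debian.org/debian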

At this point you have a small environment which you can chroot into and modify it the way you like
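
For example:

    chroot /var/tmp/wheezy /bin/bash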

You will drop into a bash shell in the newly created system. I won't make any further changes to this system except modifying the banner
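
Inside the chroot; the banner text is an example:

    echo "Debian Wheezy docker image" > /etc/motd
    exit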

Import Docker Image

Now let’s import this new system into docker:

If the import goes well, you will see the ID of the newly imported image

First check the images list:
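
    docker images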

And start the new image
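
    docker run -i -t aossama/wheezy /bin/bash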

And voilà, you just built a new docker image, imported it, and it's ready for dockerizing applications.