
Have you ever tried to scrape or harvest data from an existing website, even an Ajax-bloated one? Have you ever attempted to test JavaScript-dependent interactions within a Web application you built? If you answered yes to either question, you might be interested in PhantomJS.

PhantomJS is a headless WebKit with a JavaScript API. Headless means you can script a real WebKit-based browser without needing a full graphical interface installed.

Installation

On OS X, installation can be done using Homebrew (note that Xcode must be installed on your machine):

$ brew install phantomjs

It can take a while for the binaries to build, mostly because of their dependency on Qt4. When it's done, you can test it this way:

$ phantomjs
Usage: phantomjs [options] script.[js|coffee] [script argument [script argument ...]]
Options:
    --load-images=[yes|no]             Load all inlined images (default is 'yes').
    --load-plugins=[yes|no]            Load all plugins (i.e. 'Flash', 'Silverlight', ...) (default is 'no').
    --proxy=address:port               Set the network proxy.
    --disk-cache=[yes|no]              Enable disk cache (at desktop services cache storage location, default is 'no').
    --ignore-ssl-errors=[yes|no]       Ignore SSL errors (i.e. expired or self-signed certificate errors).

Installation instructions for other platforms and alternative methods can be found on the PhantomJS project wiki.

As a side note, there's also a Python implementation of PhantomJS, PyPhantomJS, which adds plugin support! Also, I haven't hit a single segfault with the Python version, while the standard one is a bit more unstable on my box (no troll please).

To install PyPhantomJS, let's use pip:

$ pip install PyPhantomJS

The PyPhantomJS executable is named (surprise) pyphantomjs:

$ pyphantomjs
usage: pyphantomjs [options] script.[js|coffee] [script argument [script argument ...]]
Minimalistic headless WebKit-based JavaScript-driven tool
positional arguments:
  script.[js|coffee]    The script to execute, and any args to pass to it
optional arguments:
  -h, --help            show this help message and exit
  --disk-cache {yes,no}
                        Enable disk cache (default: no)
  --ignore-ssl-errors {yes,no}
                        Ignore SSL errors (default: no)
  --load-images {yes,no}
                        Load all inlined images (default: yes)
  --load-plugins {yes,no}
                        Load all plugins (i.e. Flash, Silverlight, ...) (default: no)
  --proxy address:port  Set the network proxy
  -v, --verbose         Show verbose debug messages
  --version             show this program's version and license

Usage of the two versions is exactly the same.

Basic usage

PhantomJS scripts can be written in standard JavaScript or in CoffeeScript. It's mainly a matter of taste, but the CoffeeScript syntax looks really interesting.

So let's write our first script. We want to retrieve the weather forecast for a given city using Google:

// script: meteo.js
var page = new WebPage()
, output = { errors: [], results: null };
if (phantom.args.length == 0) {
    console.log('You must specify a city, eg. "Paris, France"');
    phantom.exit(1);
}
page.open('http://www.google.fr/search?q=meteo+' + phantom.args[0], function (status) {
    if (status !== 'success') {
        output.errors.push('Unable to access network');
    } else {
        var cells = page.evaluate(function(){
            try {
                var cells = document.querySelectorAll('.tpo tr tr')[4].querySelectorAll('td');
                return Array.prototype.map.call(cells, function(cell) {
                    return cell.innerText.replace(/[^0-9]/g, '');
                });
            } catch (e) {
                return [];
            }
        });
        // No usable data: nothing came back or the array is empty
        if (!cells || cells.length === 0) {
            output.errors.push('No valid meteo data found');
        } else {
            output.results = {
                city: phantom.args[0],
                today: {
                    afternoon: cells[1],
                    morning:   cells[2],
                },
                tomorrow: {
                    afternoon: cells[3],
                    morning:   cells[4],
                }
            };
        }
    }
    // Print the report in both cases so network errors are visible too
    console.log(JSON.stringify(output, null, '    '));
    phantom.exit();
});

Notice we use the phantom.args Array which contains the parameters passed to the script.
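One thing the script glosses over: the city goes into the URL unescaped, so spaces, commas and accents travel as-is. Here's a minimal sketch of how the URL could be built more defensively. The `buildSearchUrl` name is made up for illustration and is not part of PhantomJS; `encodeURIComponent` is plain JavaScript.

```javascript
// Hypothetical helper (not a PhantomJS API): builds the search URL used by
// meteo.js from the raw command-line argument.
function buildSearchUrl(city) {
    // encodeURIComponent escapes spaces, commas and accented characters
    return 'http://www.google.fr/search?q=' + encodeURIComponent('meteo ' + city);
}

console.log(buildSearchUrl('Montpellier, France'));
// http://www.google.fr/search?q=meteo%20Montpellier%2C%20France
```

In the script itself, you would pass `buildSearchUrl(phantom.args[0])` to `page.open()` instead of concatenating the string by hand.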

The main magic happens in the page.evaluate() method: we pass it a JavaScript function which is evaluated within the environment of the retrieved page document. It's a kind of non-persistent XSS injection, just to help you operate on the page contents =)
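The function passed to page.evaluate() has to hand back something simple (if it survives JSON, it's fine), which is why the script returns an array of strings rather than DOM nodes. The cleaning step can be tried outside PhantomJS too. Below is a small sketch using a fake array-like object in place of the real `<td>` list; the `fakeCells` values and the `extractDigits` name are made up for illustration.

```javascript
// Stand-ins for the <td> elements the real script gets from
// document.querySelectorAll(); the values are invented for this sketch.
var fakeCells = {
    0: { innerText: '' },
    1: { innerText: '29°C' },
    2: { innerText: '17°C' },
    length: 3
};

// Same technique as in meteo.js: borrow Array.prototype.map to walk an
// array-like object and keep only the digits of each cell's text.
function extractDigits(cells) {
    return Array.prototype.map.call(cells, function (cell) {
        return cell.innerText.replace(/[^0-9]/g, '');
    });
}

var values = extractDigits(fakeCells);
console.log(JSON.stringify(values)); // ["","29","17"]
```

A NodeList isn't a real Array, hence the `Array.prototype.map.call()` trick instead of calling `.map()` directly.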

Now it's time to launch the script to see how it goes:

$ phantomjs meteo.js "Montpellier, France"
{
    "errors": [],
    "results": {
        "city": "Montpellier, France",
        "today": {
            "afternoon": "29",
            "morning": "17"
        },
        "tomorrow": {
            "afternoon": "28",
            "morning": "17"
        }
    }
}

Now with an invalid city name:

$ phantomjs meteo.js "Unexistent City"
{
    "errors": [
        "No valid meteo data found"
    ],
    "results": null
}

Let's try with another city, an existing one this time:

$ phantomjs meteo.js "Paris, France"
{
    "errors": [],
    "results": {
        "city": "Paris, France",
        "today": {
            "afternoon": "21",
            "morning": "11"
        },
        "tomorrow": {
            "afternoon": "21",
            "morning": "11"
        }
    }
}

As a side note and in case you were wondering, you now understand a bit more why I moved to Montpellier ;)

I CAN HAZ SCREENSHOTS

PhantomJS also allows some nice tricks, like injecting scripts into the remote page (very useful when a remote website doesn't ship with your favorite framework, eg. jQuery) or rendering a PNG image of a captured area of the webpage. The example below saves a capture of the weather forecast area:

// script: meteoclip.js
var page = new WebPage();
page.open('http://www.google.fr/search?q=meteo+montpellier,+France', function (status) {
    if (status !== 'success') {
        // There is no output object in this script, so just log the error
        console.log('Unable to access network');
    } else {
        page.clipRect = {
            top: 127,
            left: 170,
            width: 400,
            height: 114
        };
        page.render('meteo.png');
        console.log('Capture saved');
    }
    phantom.exit();
});
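The hard-coded coordinates above will obviously break the day the page layout moves. One possible alternative, sketched here under assumptions: grab the element's bounding box inside page.evaluate() (for instance from `getBoundingClientRect()`), then turn it into a clipRect-shaped object. The `clipRectFromBounds` helper and its `padding` parameter are hypothetical, not something PhantomJS provides.

```javascript
// Hypothetical helper (not a PhantomJS API): turns a bounding box, such as
// a plain copy of element.getBoundingClientRect(), into a clipRect-shaped
// object, optionally enlarged by a padding in pixels on every side.
function clipRectFromBounds(bounds, padding) {
    padding = padding || 0;
    return {
        top: bounds.top - padding,
        left: bounds.left - padding,
        width: bounds.width + 2 * padding,
        height: bounds.height + 2 * padding
    };
}

// Made-up bounds matching the hard-coded values in meteoclip.js
var rect = clipRectFromBounds({ top: 127, left: 170, width: 400, height: 114 }, 10);
console.log(JSON.stringify(rect));
// {"top":117,"left":160,"width":420,"height":134}
```

The result could then be assigned to `page.clipRect` before calling `page.render()`.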

Running the meteoclip.js script will get you a fancy image stored in meteo.png.

There are tons of other cool topics to cover about PhantomJS, like navigation handling, automated logging in, retrieving external resources, functional testing, code organization… so maybe I'll post a bit more about it soon, who knows!

With all the hype around server-side JavaScript lately, especially around Node, I felt the need to give it a try and see how it goes. Also, getting back to work after three full weeks of unwired holidays was hard enough to deserve some playtime with cool and fun technologies.

Node is described as an Evented I/O Framework for Google's V8 JavaScript Engine. Think of it as a toolkit to produce high-performance, distributed, event-driven and scalable non-blocking network servers. Okay, whichever way I describe the project, it's buzzword-bingo™. Let's say it's mainly about catching events and reacting accordingly, to make load distribution and parallel processing easier and more effective.
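To make the "catch events and react" idea a bit less abstract, here's a tiny sketch using the `events` module that ships with Node. The `shop` object, the `order` event name and its payload are invented for the example.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical event source; the 'order' event and its payload are made up.
var shop = new EventEmitter();
var log = [];

// Two independent listeners react to the same event
shop.on('order', function (item) {
    log.push('received ' + item);
});
shop.on('order', function (item) {
    log.push('billing ' + item);
});

// emit() calls the listeners in the order they were registered
shop.emit('order', 'coffee');
console.log(log.join(', ')); // received coffee, billing coffee
```

The code that emits the event knows nothing about who is listening, which is what lets you plug new behavior in without touching the emitter.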

Installing Node

Installation on my Mac went smoothly and took about two minutes, compiling it from source. Here's how I did it (there might be easier or better ways, I don't really care):

$ mkdir tmp && cd tmp
$ git clone http://github.com/ry/node.git
$ cd node
$ ./configure && make && sudo make install

The node executable is now available on your system.

A simple example of a Node HTTP server (put the code below in a test.js file):

var http = require('http');

var server = http.createServer(function(req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write('Hello World');
  res.end();
});

server.listen(3000, "127.0.0.1");

Then launch the created webserver using the command line:

$ node test.js

And point your browser at http://127.0.0.1:3000 to see Hello World printed. Neat, huh?

Introducing Express, a Web Framework on top of Node

Express is a Web framework built on top of Node, HTTP and Connect, allowing easy creation of full-fledged Web applications. It provides routing, environment-based configuration, support for several template engines and much more.

Installation is as easy as Node's, so here we go:

$ git clone http://github.com/visionmedia/express.git
$ cd express
$ git submodule update --init
$ sudo make install && sudo make install-support

That's it. You can now write your own test application, eg. in a new hello.js file:

var app = require('express').createServer();

app.get('/', function(req, res){
    res.send('Hello World');
});

app.get('/hello/:name', function(req, res){
    res.send('Hello ' + req.param('name') + '!');
});

app.listen(3000, "127.0.0.1");

console.log('Server running at http://127.0.0.1:3000/');

Launch your webapp server from the command line:

$ node hello.js

Express will create a Node server listening on local port 3000, so point your favorite browser at http://127.0.0.1:3000/ then http://127.0.0.1:3000/hello/niko to get the picture of what the above code does. Those familiar with Web frameworks such as Rails, Django or Symfony won't feel lost.
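If you wonder how a pattern like '/hello/:name' ends up giving you a name parameter, here's a toy matcher to get the idea. To be clear, this is not how Express actually implements routing, just a simplified sketch; the `matchRoute` function is made up for illustration.

```javascript
// Toy matcher, NOT Express's actual implementation: compares a pattern such
// as '/hello/:name' with a URL path, segment by segment.
function matchRoute(pattern, path) {
    var patternParts = pattern.split('/');
    var pathParts = path.split('/');
    if (patternParts.length !== pathParts.length) {
        return null; // different number of segments: no match
    }
    var params = {};
    for (var i = 0; i < patternParts.length; i++) {
        var part = patternParts[i];
        if (part.charAt(0) === ':') {
            // Named segment: capture the value under the placeholder's name
            params[part.slice(1)] = decodeURIComponent(pathParts[i]);
        } else if (part !== pathParts[i]) {
            return null; // literal segment mismatch
        }
    }
    return params;
}

console.log(JSON.stringify(matchRoute('/hello/:name', '/hello/niko'))); // {"name":"niko"}
console.log(matchRoute('/hello/:name', '/'));                            // null
```

A real router handles far more (optional segments, wildcards, HTTP methods), but the principle of mapping named URL segments to parameters is the same.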

Express also ships with an express executable which provides useful commands. To create a new hello application skeleton, just run:

 $ express hello
   create : hello
   create : hello/app.js
   create : hello/logs
   create : hello/public/javascripts
   create : hello/pids
   create : hello/public/stylesheets
   create : hello/public/stylesheets/style.less
   create : hello/public/images
   create : hello/views/partials
   create : hello/views/layout.jade
   create : hello/views/index.jade
   create : hello/test
   create : hello/test/app.test.js

The above command just created a hello project directory. You can cd into it and launch the server through its default front controller, app.js:

$ cd hello
$ node app.js
Express server listening on port 3000

Note that the generated project skeleton assumes Jade as the template engine and the Less CSS syntax. You might want to use something else, which is perfectly possible by configuring the project differently.

For next steps, refer to the official Express documentation.

Of course, Express might not be as full-featured as older, well-established Web frameworks, but for simple needs it's pretty easy to set up and deploy and, probably just as importantly, fun to play with and learn.

Conclusion

As you can see, installing and using Node and Express is quite straightforward, even if you have to dig into the deeper Web to find docs, when they exist. JavaScript is a great, agile and well-known language, and taking advantage of it server-side definitely makes sense, if you want my opinion.

Let's see how this evolves in the future, as there aren't yet as many backend-oriented libs in JavaScript as there are in other languages like Python, Ruby or PHP. But more and more Node modules are appearing day after day, such as Mongoose or Socket.IO, which I'll definitely be playing with as soon as possible.

Thanks for your attention, have fun, take care and don't break the Web.
