SlideShare a Scribd company logo
Crawling with NodeJS
JSMeetup2@Paris 24.11.2010
@sylvinus
Crawling?
Web crawling
Grab
Process
Store
?
NodeJS
Server-side Javascript
Async / Event-driven / Reactor pattern
Small stdlib, Exploding module ecosystem
Why?
Boldly going where no one has g...
Threads vs. Async
ZOMG Server-side CSS3 selectors!
Apricot
https://github.com/silentrob/Apricot
HTML/DOM Parser, inspired by Hpricot
Sizzle + JSDOM + XUI
Problems w/ Apricot
if (file.match(/^https?:///)) {
    var urlInfo = url.parse(file, parseQueryString=false),
    host = http.createClient(((urlInfo.protocol === 'http:') ? 80 : 443),
urlInfo.hostname),
    req_url = urlInfo.pathname;
    if (urlInfo.search) {
      req_url += urlInfo.search;
    }
    var request = host.request('GET', req_url, { host: urlInfo.hostname });
    request.addListener('response', function (response) {
      var data = '';
      response.addListener('data', function (chunk) {
        data += chunk;
      });
      response.addListener("end", function() {
        fnLoaderHandle(null, data);
      });
    });
    if (request.end) {
      request.end();
    } else {
      request.close();
    }
      
  } else {
    fs.readFile(file, encoding='utf8', fnLoaderHandle);
  }
Problems w/ Apricot
No advanced HTTP client in Node’s lib
npm install request
https + redirects + buffering
Problems w/ Apricot
Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) {
doc.find("selector"); // Populates internal collection, See Sizzle selector syntax (rules)
doc.each(callback); // Itterates over the collection, applying a callback to each match
(element)
doc.remove(); // Removes all elements in the internal collection (See XUI Syntax)
doc.inner("fragment"); // See XUI Syntax
doc.outer("fragment"); // See XUI Syntax
doc.top("fragment"); // See XUI Syntax
doc.bottom("fragment"); // See XUI Syntax
doc.before("fragment"); // See XUI Syntax
doc.after("fragment"); // See XUI Syntax
doc.hasClass("class"); // See XUI Syntax
doc.addClass("class"); // See XUI Syntax
doc.removeClass("class"); // See XUI Syntax
doc.toHTML; // Returns the HTML
doc.innerHTML; // Returns the innerHTML of the body.
doc.toDOM; // Returns the DOM representation
// Most methods are chainable, so this works
doc.find("selector").addClass('foo').after(", just because");
});
Problems w/ Apricot
XUI api?!
jQuery please :)
require("jsdom").jQueryify !!
var jsdom = require("jsdom"),
window = jsdom.jsdom().createWindow();
jsdom.jQueryify(window, 'http://code.jquery.com/jquery-1.4.2.min.js' , function() {
window.$('body').append('<div class="testing">Hello World, It works</div>');
console.log(window.$('.testing').text());
});
Concurrency?
https://github.com/coopernurse/node-pool
npm install generic-pool
generic-pool
// Create a MySQL connection pool with
// a max of 10 connections and a 30 second max idle time
var poolModule = require('generic-pool');
var pool = poolModule.Pool({
name : 'mysql',
create : function(callback) {
var Client = require('mysql').Client;
var c = new Client();
c.user = 'scott';
c.password = 'tiger';
c.database = 'mydb';
c.connect();
callback(c);
},
destroy : function(client) { client.end(); },
max : 10,
idleTimeoutMillis : 30000,
log : false
});
// borrow connection - callback function is called
// once a resource becomes available
pool.borrow(function(client) {
client.query("select * from foo", [], function() {
// return object back to pool
pool.returnToPool(client);
});
});
So what?
Apricot - XUI + jQuery
+ request + generic-pool
+ qunit + ?
=
??
Simple API?
var Crawler = require("node-crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"timeout":60,
"defaultHandler":function(error,result,$) {
$("#content a:link").each(function(a) {
c.queue(a.href);
})
}
});
c.queue(["http://jamendo.com/","http://tedxparis.com", ...]);
c.queue([{
"uri":"http://parisjs.org/register",
"method":"POST"
"handler":function(error,result,$) {
$("div:contains(Thank you)").after(" very much");
}
}]);
Name contest! :)
node-crawler ?
Crawly ?
?????
Thanks!
First code on github tonight
Help & Forks welcomed
(We’re hiring HTML5/JS hackers ;-)
Also, http://html5weekend.org/

More Related Content

What's hot

The Promised Land (in Angular)
The Promised Land (in Angular)The Promised Land (in Angular)
The Promised Land (in Angular)
Domenic Denicola
 
Avoiding callback hell in Node js using promises
Avoiding callback hell in Node js using promisesAvoiding callback hell in Node js using promises
Avoiding callback hell in Node js using promises
Ankit Agarwal
 
Promise pattern
Promise patternPromise pattern
Promise pattern
Sebastiaan Deckers
 
Node.js in action
Node.js in actionNode.js in action
Node.js in action
Simon Su
 
async/await in Swift
async/await in Swiftasync/await in Swift
async/await in Swift
Peter Friese
 
Callbacks, promises, generators - asynchronous javascript
Callbacks, promises, generators - asynchronous javascriptCallbacks, promises, generators - asynchronous javascript
Callbacks, promises, generators - asynchronous javascript
Łukasz Kużyński
 
HTML5: where flash isn't needed anymore
HTML5: where flash isn't needed anymoreHTML5: where flash isn't needed anymore
HTML5: where flash isn't needed anymore
Remy Sharp
 
Promises, Promises
Promises, PromisesPromises, Promises
Promises, Promises
Domenic Denicola
 
Asynchronous programming done right - Node.js
Asynchronous programming done right - Node.jsAsynchronous programming done right - Node.js
Asynchronous programming done right - Node.js
Piotr Pelczar
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I think
Wim Godden
 
HTML5 JavaScript APIs
HTML5 JavaScript APIsHTML5 JavaScript APIs
HTML5 JavaScript APIs
Remy Sharp
 
JavaScript Promises
JavaScript PromisesJavaScript Promises
JavaScript Promises
Tomasz Bak
 
The promise of asynchronous php
The promise of asynchronous phpThe promise of asynchronous php
The promise of asynchronous php
Wim Godden
 
Javascript call ObjC
Javascript call ObjCJavascript call ObjC
Javascript call ObjC
Lin Luxiang
 
Working with AFNetworking
Working with AFNetworkingWorking with AFNetworking
Working with AFNetworking
waynehartman
 
Understanding the Node.js Platform
Understanding the Node.js PlatformUnderstanding the Node.js Platform
Understanding the Node.js Platform
Domenic Denicola
 
Node.js - A Quick Tour
Node.js - A Quick TourNode.js - A Quick Tour
Node.js - A Quick Tour
Felix Geisendörfer
 
Understanding Asynchronous JavaScript
Understanding Asynchronous JavaScriptUnderstanding Asynchronous JavaScript
Understanding Asynchronous JavaScript
jnewmanux
 
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2K
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2KZepto.js, a jQuery-compatible mobile JavaScript framework in 2K
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2K
Thomas Fuchs
 
An Introduction to Tornado
An Introduction to TornadoAn Introduction to Tornado
An Introduction to Tornado
Gavin Roy
 

What's hot (20)

The Promised Land (in Angular)
The Promised Land (in Angular)The Promised Land (in Angular)
The Promised Land (in Angular)
 
Avoiding callback hell in Node js using promises
Avoiding callback hell in Node js using promisesAvoiding callback hell in Node js using promises
Avoiding callback hell in Node js using promises
 
Promise pattern
Promise patternPromise pattern
Promise pattern
 
Node.js in action
Node.js in actionNode.js in action
Node.js in action
 
async/await in Swift
async/await in Swiftasync/await in Swift
async/await in Swift
 
Callbacks, promises, generators - asynchronous javascript
Callbacks, promises, generators - asynchronous javascriptCallbacks, promises, generators - asynchronous javascript
Callbacks, promises, generators - asynchronous javascript
 
HTML5: where flash isn't needed anymore
HTML5: where flash isn't needed anymoreHTML5: where flash isn't needed anymore
HTML5: where flash isn't needed anymore
 
Promises, Promises
Promises, PromisesPromises, Promises
Promises, Promises
 
Asynchronous programming done right - Node.js
Asynchronous programming done right - Node.jsAsynchronous programming done right - Node.js
Asynchronous programming done right - Node.js
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I think
 
HTML5 JavaScript APIs
HTML5 JavaScript APIsHTML5 JavaScript APIs
HTML5 JavaScript APIs
 
JavaScript Promises
JavaScript PromisesJavaScript Promises
JavaScript Promises
 
The promise of asynchronous php
The promise of asynchronous phpThe promise of asynchronous php
The promise of asynchronous php
 
Javascript call ObjC
Javascript call ObjCJavascript call ObjC
Javascript call ObjC
 
Working with AFNetworking
Working with AFNetworkingWorking with AFNetworking
Working with AFNetworking
 
Understanding the Node.js Platform
Understanding the Node.js PlatformUnderstanding the Node.js Platform
Understanding the Node.js Platform
 
Node.js - A Quick Tour
Node.js - A Quick TourNode.js - A Quick Tour
Node.js - A Quick Tour
 
Understanding Asynchronous JavaScript
Understanding Asynchronous JavaScriptUnderstanding Asynchronous JavaScript
Understanding Asynchronous JavaScript
 
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2K
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2KZepto.js, a jQuery-compatible mobile JavaScript framework in 2K
Zepto.js, a jQuery-compatible mobile JavaScript framework in 2K
 
An Introduction to Tornado
An Introduction to TornadoAn Introduction to Tornado
An Introduction to Tornado
 

Similar to Web Crawling with NodeJS

soft-shake.ch - Hands on Node.js
soft-shake.ch - Hands on Node.jssoft-shake.ch - Hands on Node.js
soft-shake.ch - Hands on Node.js
soft-shake.ch
 
Tornadoweb
TornadowebTornadoweb
Tornadoweb
Osman Yuksel
 
Nodejs and WebSockets
Nodejs and WebSocketsNodejs and WebSockets
Nodejs and WebSockets
Gonzalo Ayuso
 
Java script at backend nodejs
Java script at backend   nodejsJava script at backend   nodejs
Java script at backend nodejs
Amit Thakkar
 
Javascript Frameworks for Joomla
Javascript Frameworks for JoomlaJavascript Frameworks for Joomla
Javascript Frameworks for Joomla
Luke Summerfield
 
Week 4 - jQuery + Ajax
Week 4 - jQuery + AjaxWeek 4 - jQuery + Ajax
Week 4 - jQuery + Ajax
baygross
 
Express Presentation
Express PresentationExpress Presentation
Express Presentation
aaronheckmann
 
dojo.Patterns
dojo.Patternsdojo.Patterns
dojo.Patterns
Peter Higgins
 
5.node js
5.node js5.node js
5.node js
Geunhyung Kim
 
Node intro
Node introNode intro
Node intro
cloudhead
 
Sanjeev ghai 12
Sanjeev ghai 12Sanjeev ghai 12
Sanjeev ghai 12
Praveen kumar
 
Pracitcal AJAX
Pracitcal AJAXPracitcal AJAX
Pracitcal AJAX
jherr
 
Html5 For Jjugccc2009fall
Html5 For Jjugccc2009fallHtml5 For Jjugccc2009fall
Html5 For Jjugccc2009fall
Shumpei Shiraishi
 
Building Applications Using Ajax
Building Applications Using AjaxBuilding Applications Using Ajax
Building Applications Using Ajax
s_pradeep
 
jQuery: Events, Animation, Ajax
jQuery: Events, Animation, AjaxjQuery: Events, Animation, Ajax
jQuery: Events, Animation, Ajax
Constantin Titarenko
 
Introduction to Vert.x
Introduction to Vert.xIntroduction to Vert.x
Introduction to Vert.x
Yiguang Hu
 
Build Your Own CMS with Apache Sling
Build Your Own CMS with Apache SlingBuild Your Own CMS with Apache Sling
Build Your Own CMS with Apache Sling
Bob Paulin
 
NodeJS
NodeJSNodeJS
NodeJS
Alok Guha
 
An opinionated intro to Node.js - devrupt hospitality hackathon
An opinionated intro to Node.js - devrupt hospitality hackathonAn opinionated intro to Node.js - devrupt hospitality hackathon
An opinionated intro to Node.js - devrupt hospitality hackathon
Luciano Mammino
 
JavaScript performance patterns
JavaScript performance patternsJavaScript performance patterns
JavaScript performance patterns
Stoyan Stefanov
 

Similar to Web Crawling with NodeJS (20)

soft-shake.ch - Hands on Node.js
soft-shake.ch - Hands on Node.jssoft-shake.ch - Hands on Node.js
soft-shake.ch - Hands on Node.js
 
Tornadoweb
TornadowebTornadoweb
Tornadoweb
 
Nodejs and WebSockets
Nodejs and WebSocketsNodejs and WebSockets
Nodejs and WebSockets
 
Java script at backend nodejs
Java script at backend   nodejsJava script at backend   nodejs
Java script at backend nodejs
 
Javascript Frameworks for Joomla
Javascript Frameworks for JoomlaJavascript Frameworks for Joomla
Javascript Frameworks for Joomla
 
Week 4 - jQuery + Ajax
Week 4 - jQuery + AjaxWeek 4 - jQuery + Ajax
Week 4 - jQuery + Ajax
 
Express Presentation
Express PresentationExpress Presentation
Express Presentation
 
dojo.Patterns
dojo.Patternsdojo.Patterns
dojo.Patterns
 
5.node js
5.node js5.node js
5.node js
 
Node intro
Node introNode intro
Node intro
 
Sanjeev ghai 12
Sanjeev ghai 12Sanjeev ghai 12
Sanjeev ghai 12
 
Pracitcal AJAX
Pracitcal AJAXPracitcal AJAX
Pracitcal AJAX
 
Html5 For Jjugccc2009fall
Html5 For Jjugccc2009fallHtml5 For Jjugccc2009fall
Html5 For Jjugccc2009fall
 
Building Applications Using Ajax
Building Applications Using AjaxBuilding Applications Using Ajax
Building Applications Using Ajax
 
jQuery: Events, Animation, Ajax
jQuery: Events, Animation, AjaxjQuery: Events, Animation, Ajax
jQuery: Events, Animation, Ajax
 
Introduction to Vert.x
Introduction to Vert.xIntroduction to Vert.x
Introduction to Vert.x
 
Build Your Own CMS with Apache Sling
Build Your Own CMS with Apache SlingBuild Your Own CMS with Apache Sling
Build Your Own CMS with Apache Sling
 
NodeJS
NodeJSNodeJS
NodeJS
 
An opinionated intro to Node.js - devrupt hospitality hackathon
An opinionated intro to Node.js - devrupt hospitality hackathonAn opinionated intro to Node.js - devrupt hospitality hackathon
An opinionated intro to Node.js - devrupt hospitality hackathon
 
JavaScript performance patterns
JavaScript performance patternsJavaScript performance patterns
JavaScript performance patterns
 

More from Sylvain Zimmer

Developer-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing oneDeveloper-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing one
Sylvain Zimmer
 
Ranking the Web with Spark
Ranking the Web with SparkRanking the Web with Spark
Ranking the Web with Spark
Sylvain Zimmer
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...
Sylvain Zimmer
 
PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?
Sylvain Zimmer
 
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Sylvain Zimmer
 
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
Sylvain Zimmer
 
140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript
Sylvain Zimmer
 
Joshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical OverviewJoshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical Overview
Sylvain Zimmer
 
Javascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSJavascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJS
Sylvain Zimmer
 
no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4
Sylvain Zimmer
 
Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011
Sylvain Zimmer
 
Archicamp présentation
Archicamp présentationArchicamp présentation
Archicamp présentation
Sylvain Zimmer
 
Twisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecasesTwisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecases
Sylvain Zimmer
 

More from Sylvain Zimmer (13)

Developer-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing oneDeveloper-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing one
 
Ranking the Web with Spark
Ranking the Web with SparkRanking the Web with Spark
Ranking the Web with Spark
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...
 
PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?
 
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
 
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
 
140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript
 
Joshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical OverviewJoshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical Overview
 
Javascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSJavascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJS
 
no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4
 
Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011
 
Archicamp présentation
Archicamp présentationArchicamp présentation
Archicamp présentation
 
Twisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecasesTwisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecases
 

Recently uploaded

Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
DianaGray10
 
Develop Secure Enterprise Solutions with iOS Mobile App Development Services
Develop Secure Enterprise Solutions with iOS Mobile App Development ServicesDevelop Secure Enterprise Solutions with iOS Mobile App Development Services
Develop Secure Enterprise Solutions with iOS Mobile App Development Services
Damco Solutions
 
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
Fwdays
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Alliance
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
David Wilson
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
Priyanka Aash
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
BrainSell Technologies
 
Accelerating Migrations = Recommendations
Accelerating Migrations = RecommendationsAccelerating Migrations = Recommendations
Accelerating Migrations = Recommendations
isBullShit
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Snarky Security
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
OnBoard
 
Latest Tech Trends Series 2024 By EY India
Latest Tech Trends Series 2024 By EY IndiaLatest Tech Trends Series 2024 By EY India
Latest Tech Trends Series 2024 By EY India
EYIndia1
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
FIDO Alliance
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
 
UX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business GoalsUX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business Goals
FIDO Alliance
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 

Recently uploaded (20)

Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
 
Develop Secure Enterprise Solutions with iOS Mobile App Development Services
Develop Secure Enterprise Solutions with iOS Mobile App Development ServicesDevelop Secure Enterprise Solutions with iOS Mobile App Development Services
Develop Secure Enterprise Solutions with iOS Mobile App Development Services
 
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
 
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
 
Accelerating Migrations = Recommendations
Accelerating Migrations = RecommendationsAccelerating Migrations = Recommendations
Accelerating Migrations = Recommendations
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
 
Latest Tech Trends Series 2024 By EY India
Latest Tech Trends Series 2024 By EY IndiaLatest Tech Trends Series 2024 By EY India
Latest Tech Trends Series 2024 By EY India
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
 
UX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business GoalsUX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business Goals
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
 

Web Crawling with NodeJS

  • 4. ?
  • 5. NodeJS Server-side Javascript Async / Event-driven / Reactor pattern Small stdlib, Exploding module ecosystem
  • 6. Why? Boldly going where no one has g... Threads vs. Async ZOMG Server-side CSS3 selectors!
  • 8. Problems w/ Apricot if (file.match(/^https?:///)) {     var urlInfo = url.parse(file, parseQueryString=false),     host = http.createClient(((urlInfo.protocol === 'http:') ? 80 : 443), urlInfo.hostname),     req_url = urlInfo.pathname;     if (urlInfo.search) {       req_url += urlInfo.search;     }     var request = host.request('GET', req_url, { host: urlInfo.hostname });     request.addListener('response', function (response) {       var data = '';       response.addListener('data', function (chunk) {         data += chunk;       });       response.addListener("end", function() {         fnLoaderHandle(null, data);       });     });     if (request.end) {       request.end();     } else {       request.close();     }          } else {     fs.readFile(file, encoding='utf8', fnLoaderHandle);   }
  • 9. Problems w/ Apricot No advanced HTTP client in Node’s lib npm install request https + redirects + buffering
  • 10. Problems w/ Apricot Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) { doc.find("selector"); // Populates internal collection, See Sizzle selector syntax (rules) doc.each(callback); // Itterates over the collection, applying a callback to each match (element) doc.remove(); // Removes all elements in the internal collection (See XUI Syntax) doc.inner("fragment"); // See XUI Syntax doc.outer("fragment"); // See XUI Syntax doc.top("fragment"); // See XUI Syntax doc.bottom("fragment"); // See XUI Syntax doc.before("fragment"); // See XUI Syntax doc.after("fragment"); // See XUI Syntax doc.hasClass("class"); // See XUI Syntax doc.addClass("class"); // See XUI Syntax doc.removeClass("class"); // See XUI Syntax doc.toHTML; // Returns the HTML doc.innerHTML; // Returns the innerHTML of the body. doc.toDOM; // Returns the DOM representation // Most methods are chainable, so this works doc.find("selector").addClass('foo').after(", just because"); });
  • 11. Problems w/ Apricot XUI api?! jQuery please :) require("jsdom").jQueryify !! var jsdom = require("jsdom"), window = jsdom.jsdom().createWindow(); jsdom.jQueryify(window, 'http://code.jquery.com/jquery-1.4.2.min.js' , function() { window.$('body').append('<div class="testing">Hello World, It works</div>'); console.log(window.$('.testing').text()); });
  • 13. generic-pool // Create a MySQL connection pool with // a max of 10 connections and a 30 second max idle time var poolModule = require('generic-pool'); var pool = poolModule.Pool({ name : 'mysql', create : function(callback) { var Client = require('mysql').Client; var c = new Client(); c.user = 'scott'; c.password = 'tiger'; c.database = 'mydb'; c.connect(); callback(c); }, destroy : function(client) { client.end(); }, max : 10, idleTimeoutMillis : 30000, log : false }); // borrow connection - callback function is called // once a resource becomes available pool.borrow(function(client) { client.query("select * from foo", [], function() { // return object back to pool pool.returnToPool(client); }); });
  • 14. So what? Apricot - XUI + jQuery + request + generic-pool + qunit + ? = ??
  • 15. Simple API? var Crawler = require("node-crawler").Crawler; var c = new Crawler({ "maxConnections":10, "timeout":60, "defaultHandler":function(error,result,$) { $("#content a:link").each(function(a) { c.queue(a.href); }) } }); c.queue(["http://jamendo.com/","http://tedxparis.com", ...]); c.queue([{ "uri":"http://parisjs.org/register", "method":"POST" "handler":function(error,result,$) { $("div:contains(Thank you)").after(" very much"); } }]);
  • 16. Name contest! :) node-crawler ? Crawly ? ?????
  • 17. Thanks! First code on github tonight Help & Forks welcomed (We’re hiring HTML5/JS hackers ;-) Also, http://html5weekend.org/