Scraping Techniques to Extract Advertisements from Web Pages @ EuroPython 2011

Sunday, 1 July 2012 at 21:40 (@944) | Posted in Python

On January 22, 2011, Mirko Urru contacted me about his thesis, and on June 24 at 14:30 we presented his work together at EuroPython: a very nice experience!

Slides (PDF): Scraping Techniques to Extract Advertisements from Web Pages


Online Advertising is an emerging research field at the intersection of Information Retrieval, Machine Learning, Optimization, and Microeconomics. Its main goal is to choose the right ads to present to a user engaged in a given task, for example in Sponsored Search Advertising or Contextual Advertising. The former places ads on the results page that a Web search engine returns for a query. The latter places ads within the content of a generic, third-party Web page. The ads themselves are selected and served by automated systems based on the content displayed to the user.

Web scraping is the set of techniques used to automatically extract information from a website instead of copying it manually. In particular, we are interested in studying and adopting scraping techniques for: (i) accessing tags as object members; (ii) finding tags whose name, contents, or attributes match given selection criteria; (iii) accessing tag attributes with a dictionary-like syntax.
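To make these three access styles concrete, here is a minimal sketch built only on Python's standard `html.parser`. The `Node` class, the `TreeBuilder` parser, and the `parse` helper are hypothetical names written for this illustration; they are not the API of any scraping library and not necessarily what the talk used. For brevity the search matches on tag name and attributes only, not on contents.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class Node:
    """A tiny tag node (hypothetical helper written for this sketch)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this tag

    def __getattr__(self, name):
        # (i) tags as object members: node.title is the first <title> below node.
        # Python calls this only when normal attribute lookup has failed.
        if name.startswith("__"):
            raise AttributeError(name)
        found = self.find_all(name)
        if not found:
            raise AttributeError(name)
        return found[0]

    def __getitem__(self, key):
        # (iii) dictionary-like access to tag attributes: node["href"]
        return self.attrs[key]

    def find_all(self, name=None, attrs=None):
        # (ii) descendants whose name and attributes match the criteria
        wanted = attrs or {}
        matches = []
        for child in self.children:
            name_ok = name is None or child.name == name
            attrs_ok = all(child.attrs.get(k) == v for k, v in wanted.items())
            if name_ok and attrs_ok:
                matches.append(child)
            matches.extend(child.find_all(name, attrs))
        return matches


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open tag with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = parse(
    "<html><head><title>Demo</title></head><body>"
    '<a class="ad" href="http://ads.example.com/1">Buy now</a>'
    '<a href="/about">About</a></body></html>'
)
title_text = page.title.text                    # (i) member access
ad_links = page.find_all("a", {"class": "ad"})  # (ii) search by criteria
ad_href = ad_links[0]["href"]                   # (iii) dictionary-like access
```

One limitation of this sketch: a tag named like one of the node's own fields (`name`, `text`, `attrs`, `children`, `parent`) cannot be reached through member access and has to be looked up with `find_all` instead.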

In this talk, we focus on the adoption of scraping techniques in the contextual advertising field. In particular, we present a system aimed at finding the most relevant ads for a generic web page p. Starting from p, the system selects a set of its inlinks (i.e., the pages that link to p) and extracts the ads they contain. Selection is performed by querying the Google search engine, whereas extraction relies on suitable scraping techniques.
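The two steps described above, selecting inlinks and extracting their ads, can be sketched roughly as follows. This is an illustration under stated assumptions, not the system presented in the talk: an ad is assumed to be any link whose host appears in a known list (`AD_HOSTS`, with made-up hosts), and both the inlink lookup (`find_inlinks`) and the page download (`fetch`) are hypothetical callables supplied by the caller, so no search-engine API is assumed here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: ads are recognised as links that point to known ad-serving hosts.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}  # made-up hosts


class AdLinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs for links that point to an ad host."""

    def __init__(self, ad_hosts):
        super().__init__()
        self.ad_hosts = ad_hosts
        self.ads = []
        self._href = None  # href of the ad link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if urlparse(href).hostname in self.ad_hosts:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.ads.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_ads(html, ad_hosts=AD_HOSTS):
    """Extraction step: return the ad links found in one page."""
    parser = AdLinkExtractor(ad_hosts)
    parser.feed(html)
    parser.close()
    return parser.ads


def ads_for_page(page_url, find_inlinks, fetch, ad_hosts=AD_HOSTS):
    """Gather the ads shown on the pages that link to page_url.

    find_inlinks(url) returns the URLs of pages linking to url (selection step).
    fetch(url) returns the HTML of a page as a string.
    Both are hypothetical hooks; neither is implemented in this sketch.
    """
    ads = []
    for inlink in find_inlinks(page_url):
        ads.extend(extract_ads(fetch(inlink), ad_hosts))
    return ads


# Offline demonstration: canned pages stand in for the search engine and the Web.
pages = {
    "http://blog.example.org/post": (
        '<p>See <a href="http://target.example.org/p">this page</a>.</p>'
        '<a href="http://ads.example.com/click?id=7">Cheap flights</a>'
    ),
    "http://news.example.org/item": (
        '<p><a href="http://target.example.org/p">source</a></p>'
    ),
}
found = ads_for_page("http://target.example.org/p", lambda url: list(pages), pages.get)
```

Passing the two hooks as parameters keeps the extraction logic testable without network access. Ranking the collected ads by relevance to p is left out of this sketch.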

More information is available on the talk page at EuroPython.

