Low maintenance data integration (ETL)

By mtm from London.pm
Date: Wednesday, 13 August 2008 10:40
Duration: 30 minutes
Language:
Tags: dataprocessing etl sjerek

You can find more information on the speaker's site:


This is a tech talk about an existing ETL system used at Nestoria.co.uk (a vertical search engine covering 4 countries). It is the processing stage between data arriving and being inserted into the database.
http://en.wikipedia.org/wiki/Extract%2C_transform%2C_load

Lots of Perl folks have written ETL systems in the past, and lots more will have to write one in the future. Often there is no way around a custom solution.

We will look at some best practices around 24/7 availability, monitoring, data cleansing, data quality, i18n, scaling, dealing with failures and changes ... and of course CPAN modules.

Nestoria had to integrate dozens of different formats (flat files, database dumps, XML, custom), delivery methods (fetch, crawl, FTP) and update methods (complete, incremental, partial, custom). We thought we were prepared for everything, but over the years we learned some valuable lessons about corrupt files, failing servers, data quality, i18n issues and performance.


Attended by: Sébastien Aperghis-Tramoni (Maddingue), Nicholas Clark, Paul-Christophe Varoutas, Thomas Klausner (domm), Léon Brocard (acme), R Geoffrey Avery (rGeoffrey), Gabor Szabo (szabgab), Jos Boumans (kane), Salvador Fandiño (salva), Hermen Lesscher, Søren Lund (slu), Tobias Henoeckl (hoeni), Sue Spence (virtualsue), Chisel Wright, Luis Motta Campos (LMC), Sven Esbjerg, Kaare Rasmussen, Søren Døygaard, Peter Makholm (brother), Francoise Dehinbo (franky), allan juul, Darko Obradovic, Kristoffer Gleditsch (toffer), Bern, David Leadbeater (dg), Sebastian Willert, Lars Jorgensen, Nigel Metheringham (nigelm), Imran Chaudhry (icjs), Patrick Donelan (patspam), Stan Sawa, Darius Jokilehto, Bart Lateur, Henrik Hald Nørgaard, Morten Meyling, Nicholas Oxhøj (noxhoej)