<Jochen.Hayek (@) Aleph-Soft.com>
Augsburger Straße 33
D-10789 Berlin
Germany
Impressum (a.k.a. German legalese)
Responsible for these web pages ("homepage"): Aleph Soft GmbH, Jochen Hayek, Augsburger Straße 33, D-10789 Berlin
Disclaimer
With its ruling of May 12, 1998 - 312 O 85/98 - "Liability for Links", the Landgericht Hamburg (Hamburg district court) decided that by placing a link to another homepage one may share responsibility for its contents. According to the court, this can only be prevented by expressly distancing oneself from those contents. We hereby expressly distance ourselves from all contents of all external pages we link to.
These are the main requirements of the financial data systems we target:
data retrieval
loading data into data stores and database systems
file upload for sending requests
data abstraction -- reformatting data to simple CSV structures
working on the data store upon successful loading
We have already implemented interfaces to these data vendors:
Bloomberg, Thomson, Wertpapiermitteilungen
S&P, Moody's, Merrill Lynch
Citigroup, J.P.Morgan, Lehman, MSCI
Bankhaus Ellwanger & Geiger (very simple HTTP web server interface)
Deutsche Börse, STOXX (not so simple or even quite challenging HTTP web server interfaces)
The data vendors you are interested in are probably already among the ones listed in our introduction, or we will find that they are quite similar to something we have already done. So we are mostly talking about configuration and simple customization, not about new development.
...
Data vendors usually provide you with an account on a server that you can access through the public Internet; the access methods are FTP, HTTP resp. HTTPS, SSH/SCP, and RSYNC, amongst others. They also tell you which files you can retrieve and at what times.
We provide you with a smart and rather stable automation system, making use of powerful publicly available open source utilities.
Usually files are nervously awaited at the times specified by their suppliers. Our revolutionary approach to retrieving files is mirroring, which we apply every couple of minutes throughout the entire day. Mirroring is not an expensive method, as you might be tempted to think: it retrieves a directory listing and compares time stamps and file sizes, and files already retrieved do not get retrieved again. This way you don't need to worry about the cost of continuous retrieving, and you will never again have to worry about missed new files or about updated old files.
We apply different methods to find out which files were added on the remote side, and we feed these new files to the loading system discussed below.
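A minimal sketch of that compare-and-skip idea, assuming an FTP vendor and using the standard Net::FTP module -- host, account, and directory names are placeholders, and our production setup builds on the publicly available open source utilities mentioned above rather than on this little script:

    #!/usr/bin/perl
    # Mirroring sketch: retrieve only files that are new or changed.
    use strict;
    use warnings;
    use Net::FTP;
    use File::stat;

    my $ftp = Net::FTP->new('ftp.example-vendor.com') or die "connect: $@";
    $ftp->login('account', 'password') or die "login: ", $ftp->message;
    $ftp->binary;
    $ftp->cwd('/outgoing') or die "cwd: ", $ftp->message;

    for my $file ($ftp->ls) {
        my $rsize = $ftp->size($file);   # remote file size
        my $rtime = $ftp->mdtm($file);   # remote modification time
        my $local = stat($file);         # undef if we never fetched it
        # skip files we already hold with the same size and age
        next if $local && defined $rsize && $local->size == $rsize
                       && defined $rtime && $local->mtime >= $rtime;
        $ftp->get($file) or warn "get $file: ", $ftp->message;
    }
    $ftp->quit;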
Most of the data vendor accesses work through FTP, mostly with a rather flat directory structure; one in particular, namely S&P, works with a rather nested directory structure.
Some of the data vendor accesses work through HTTP resp. HTTPS. In general, and also for the data vendors we have dealt with so far, there is no such thing as an HTTP or HTML directory listing; so in order to get the directory listing that we certainly need, we have to extract the available information from several hyperlinked web pages. This is the case for J.P. Morgan Chase & Co. and also for Merrill Lynch; accessing the data files on the web server of Bankhaus Ellwanger & Geiger, on the other hand, is a piece of cake.
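A sketch of how such a listing can be synthesized from hyperlinked pages, using the well-known LWP::UserAgent and HTML::LinkExtor modules; the URL and the file name suffixes are purely illustrative:

    #!/usr/bin/perl
    # Build a "directory listing" from the links on a vendor's download page.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $base = 'https://www.example-vendor.com/downloads/';
    my $ua   = LWP::UserAgent->new;
    my $res  = $ua->get($base);
    die 'GET failed: ' . $res->status_line unless $res->is_success;

    my @files;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && $attr{href};
        my $url = URI->new_abs($attr{href}, $base);
        # keep only the links that look like data files
        push @files, $url if $url->path =~ /\.(csv|txt|zip)$/i;
    });
    $parser->parse($res->decoded_content);
    print "$_\n" for @files;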
So far, the data vendors that enforce SSH/SCP access have all been internal to the corporate group.
...
Internally we call this part the Download Spoolers.
Let us assume you have already decided on a data store or database system offered by a particular vendor like FAME or Asset Control.
Let us also assume you already have a general loading facility for your data store. So far we have made use of a data loader for FAME data stores, but loading data into Asset Control's data store shouldn't be much harder or much different.
If you have not yet decided on a commercial data store, we can easily build you one on top of the publicly available Berkeley DB, together with a data loader and other utilities.
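To give an idea of how little is needed for such a store, here is a sketch on top of Berkeley DB through Perl's classic DB_File module; the composite key layout is our own illustrative convention, not a fixed format:

    #!/usr/bin/perl
    # Minimal time-series store on Berkeley DB via DB_File.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    tie my %store, 'DB_File', 'marketdata.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "cannot tie marketdata.db: $!";

    # one observation per composite "SECURITY|FIELD|DATE" key
    $store{'DE0001234567|CLOSE|2004-06-30'} = '101.25';

    # the B-tree keeps keys sorted, so scanning one series is cheap
    while (my ($key, $value) = each %store) {
        print "$key => $value\n" if $key =~ /^DE0001234567\|CLOSE\|/;
    }
    untie %store;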
The first approach for a loader is to load the data file following its original structure as it is. That way, you have to write a completely new loader for every kind of data file to be loaded.
A far smarter approach [1] is to split the task of dealing with dedicated data file structures from a straightforward data loader, as sketched below.
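For illustration, the vendor-specific half might be a small filter that flattens a vendor file into plain CSV for the one generic loader; the block keyword and the line layout below are invented for the sketch, loosely modelled on the two-column time series files discussed further down:

    #!/usr/bin/perl
    # reformat_vendor.pl -- the vendor-specific half of the split:
    # flatten blocks of two-column time-series lines into plain CSV.
    use strict;
    use warnings;

    my $series = '';
    while (my $line = <>) {
        chomp $line;
        if ($line =~ /^SERIES\s+(\S+)/) {
            $series = $1;                          # a new time series begins
        }
        elsif ($line =~ /^(\d{8})[;,]([-\d.]+)$/) {
            print join(',', $series, $1, $2), "\n";
        }
    }

The generic loader then never needs to know about any vendor format; a pipeline like perl reformat_vendor.pl raw.dat | perl load_csv.pl --config vendor.cfg keeps the two halves cleanly apart (both script names are hypothetical).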
Well, loading data is not the entire job; you have to get the loading wrapped up in a highly automated system. This is what we provide you with in this context.
...
Internally we call this part the Response Spoolers. That may sound strange, but in the context of Bloomberg Data License the files loaded here are actually responses to requests that were generated following local needs and uploaded beforehand.
We expect the data loader
to deal with simple CSV and Fixed-Record / Fixed-Column files on the input side,
to take a configuration file,
letting you specify in simple terms the columns of a particular input data file,
and also letting you specify in simple terms the target data structure, i.e. without making use of any full-fledged programming language,
and to accept command line parameters to be made use of within the configuration file mentioned beforehand (see the sketch below).
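A configuration along those lines might look as follows; the syntax and all names are purely illustrative and not our actual configuration language:

    # columns of the input data file, from left to right
    input.columns = isin, date, price, currency

    # target data structure in the data store;
    # $(VENDOR) is filled in from a command line parameter
    target.series = $(VENDOR).$(isin).CLOSE
    target.point  = $(date) -> $(price)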
...
Requirements:
versioning of configurations along the time axis
consistency checks (e.g. header checks)
providing parameters specified in a high-level configuration, on file resp. in the data store
...
...
Some files are provided by their vendors in a form that lets you proceed directly and load them.
Other files have just a couple of extra lines before and after the header and the data lines in the body.
And still other files have more or less nested structures. We employ utilities to reformat files from vendors like Thomson and Wertpapiermitteilungen into simple, easily loadable CSV structures.
We provide you with preprocessing and reformatting for the vendors we listed above in the introduction.
You may have already worked with files from Bloomberg; only at a quite low level do they look like ordinary CSV files, but in fact their lines consist of lots of name-value pairs, so they are a little more complicated to deal with.
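A sketch of flattening such name-value lines into fixed-order CSV; the pair syntax and the field names are illustrative only, not the exact Bloomberg record format:

    #!/usr/bin/perl
    # Turn lines of NAME=VALUE pairs into CSV with a fixed column order.
    use strict;
    use warnings;

    while (my $line = <>) {
        chomp $line;
        my %pair = map { split /=/, $_, 2 } split /\|/, $line;
        print join(',', map { $pair{$_} // '' } qw(ID_ISIN DATE PX_LAST)), "\n";
    }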
You may have already worked with files from Thomson, with several time series, one after the other, with very short lines consisting of two columns only.
You may have already worked with files from Wertpapiermitteilungen. They are not fixed-record files, and they are not CSV files.
Would you mind if all of them appeared to you as simple, ordinary CSV files? We are sure you would appreciate the value of our utilities for dealing with them.
...
Internally we call this part files.pl. You may recognize that it's Perl code, not shell code.
We provide you with a nice and fairly stable automation for the upload of files, usually called Request Files.
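Reduced to its core, such an upload step might look like this; host, credentials, and directory are placeholders:

    #!/usr/bin/perl
    # Upload a locally generated request file to the vendor's incoming area.
    use strict;
    use warnings;
    use Net::FTP;

    my $request = shift @ARGV or die "usage: $0 request-file\n";

    my $ftp = Net::FTP->new('ftp.example-vendor.com') or die "connect: $@";
    $ftp->login('account', 'password') or die "login: ", $ftp->message;
    $ftp->cwd('/incoming')             or die "cwd: ",   $ftp->message;
    $ftp->put($request)                or die "put: ",   $ftp->message;
    $ftp->quit;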
...
In dedicated data loaders, we are always tempted to immediately perform steps that, strictly speaking, are not part of the loading process itself.
Because of the simple nature of the data loader described above, we may not even be able to perform these steps using that data loader. Certain compound actions will not be easy to implement with such a loader.
So, for whatever reason, there are actions that you will want to perform immediately upon successful loading. We provide you with a technique called Post Actions or Post Jobs, to be implemented in Perl code; post, because they are applied after the loading.
We provide you with implementation patterns for post action steps
to be performed one by one (Synchronous Routines)
and also for steps to be performed in parallel (Asynchronous Routines), as sketched below.
Post Actions are performed within the Data Loader Automation.
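Skeletons of both patterns, with two invented post actions standing in for real work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # two invented post actions standing in for real work
    sub update_status_flags { print "flags updated\n" }
    sub notify_downstream   { print "downstream notified\n" }

    my @post_actions = (\&update_status_flags, \&notify_downstream);

    # Synchronous Routines: one by one, each finishes before the next starts
    sub run_synchronously {
        $_->() for @post_actions;
    }

    # Asynchronous Routines: one child process per action, collected afterwards
    sub run_asynchronously {
        my @pids;
        for my $action (@post_actions) {
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) { $action->(); exit 0 }   # child does the work
            push @pids, $pid;                        # parent keeps the pid
        }
        waitpid $_, 0 for @pids;                     # wait for all children
    }

    run_synchronously();
    run_asynchronously();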
...
[1] following the ancient principle divide et impera, or in English: divide and conquer