The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will go.
Most data warehousing projects consolidate data from different source systems, and each system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even external sources fetched through third-party APIs. File formats may include, but are not limited to, JSON, XML, and binary, and the data may be semi-structured or unstructured. Streaming the extracted data and loading it on the fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format that is appropriate for transformation processing.
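The streaming approach mentioned above can be sketched in a few lines. This is a minimal illustration using two in-memory SQLite databases as stand-ins for the source and destination; the table name and batch size are assumptions for the example, not part of any particular product.

```python
import sqlite3

# Sketch of streaming extract-and-load with no intermediate storage:
# rows are read from the source and written to the destination on the fly,
# in small batches, instead of materializing the full extract first.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c")])

dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

def stream(conn, batch_size=2):
    """Yield source rows in small batches rather than all at once."""
    cur = conn.execute("SELECT id, payload FROM events")
    while batch := cur.fetchmany(batch_size):
        yield batch

for batch in stream(source):
    dest.executemany("INSERT INTO events VALUES (?, ?)", batch)

loaded = dest.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Because rows flow through in batches, memory use stays bounded even for large source tables.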
An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If it does not, the data may be rejected entirely or in part.
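A minimal sketch of this parse-and-reject step might look as follows. The record layout (customer id, email, amount) and the validation rules are hypothetical, chosen only to illustrate splitting an extract into accepted and rejected records.

```python
import re

# Hypothetical expected record structure: (customer_id, email, amount).
# Records that fail validation are rejected rather than passed downstream.
ID_PATTERN = re.compile(r"^\d+$")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    """Return True if the record matches the expected pattern."""
    cust_id, email, amount = record
    if not ID_PATTERN.match(cust_id):
        return False
    if not EMAIL_PATTERN.match(email):
        return False
    try:
        float(amount)  # amount must parse as a number
    except ValueError:
        return False
    return True

def parse_extract(records):
    """Split extracted records into accepted and rejected sets."""
    accepted, rejected = [], []
    for rec in records:
        (accepted if validate(rec) else rejected).append(rec)
    return accepted, rejected

rows = [
    ("101", "a@example.com", "19.99"),   # valid
    ("x02", "b@example.com", "5.00"),    # bad id -> rejected
    ("103", "not-an-email", "7.50"),     # bad email -> rejected
]
good, bad = parse_extract(rows)
```

In practice the rejected records would typically be written to an error table or quarantine area rather than silently dropped.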
Overview of Extraction
Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. This is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse.
The source systems for a data warehouse are typically transaction processing applications.
Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process and, indeed, in the entire data warehousing process. The source systems might be very complex and poorly documented, and thus determining which data needs to be extracted can be difficult. The data has to be extracted normally not only once, but several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date. Moreover, the source system typically cannot be modified, nor can its performance or availability be adjusted, to accommodate the needs of the data warehouse extraction process.
Extraction Methods in Data Warehouses
The extraction method you choose depends heavily on the source system and on the business needs in the target data warehouse environment. Very often there is no possibility of adding logic to the source systems to support incremental extraction of data, because of the performance impact or the increased workload on these systems. Usually these business-critical systems cannot be altered or overloaded for the extraction process, and sometimes the customer is not even allowed to add anything to an out-of-the-box application system.
The estimated amount of data to be extracted and the stage of the ETL process (initial load or ongoing maintenance) may also influence the decision of how to extract. Basically, one has to decide how to extract data logically and physically.
Logical Extraction Methods
There are two kinds of logical extraction:
- Full Extraction
- Incremental Extraction
Full Extraction – Initial Load
The data is extracted completely from the source system (historical data), so the extraction reflects all the data currently available on the source system. The source data is provided as-is, and no additional logical information (for example, timestamps) is necessary on the source site. An example of a full extraction is an export file of a distinct table, or a remote SQL statement scanning the complete source table.
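The "remote SQL statement scanning the complete source table" variant can be sketched as follows. SQLite stands in for the source system, and the table and file names are illustrative.

```python
import csv
import sqlite3

# Minimal sketch of a full extraction: a SQL statement scans the complete
# source table and the result is written as-is to an export file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 25.5), (3, 7.25)])

with open("orders_full.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount"])           # header row
    for row in conn.execute("SELECT * FROM orders"):  # full table scan
        writer.writerow(row)
```

Note that nothing in the source schema (no timestamp or flag column) is required for this method, which is exactly why it works as an initial load.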
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history is extracted. This event may be the last time of extraction, or a more complex business event such as the last booking day of a fiscal period.
To identify this delta (Δ) change, there must be a way to identify all the information that has changed since that specific time event. This information can either be provided by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table in which an additional mechanism keeps track of the changes alongside the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.
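The timestamp-column variant can be sketched as a simple filtered query. The column name `last_modified` and the sample data are assumptions about a hypothetical source schema.

```python
import sqlite3

# Sketch of timestamp-based incremental extraction: only rows whose
# last_modified is newer than the previous extraction time are pulled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Ada",   "2024-01-01T09:00:00"),
    (2, "Boris", "2024-01-05T14:30:00"),
    (3, "Chen",  "2024-01-06T08:15:00"),
])

# The "well-defined event back in history": the last extraction time.
last_extraction = "2024-01-02T00:00:00"
delta = conn.execute(
    "SELECT id, name FROM customers WHERE last_modified > ?",
    (last_extraction,),
).fetchall()
# delta holds only the rows changed since the last run
```

ISO-8601 timestamps compare correctly as strings, which keeps the example free of date parsing; a real pipeline would also persist the new high-water mark after each run.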
Some data warehouses do not use any change-capture technique as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have a significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.
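The comparison step can be sketched with two extracts keyed by primary key. The keys and row values below are illustrative only.

```python
# Sketch of change detection by comparing a full extract with the previous
# extract (no change capture on the source). Each dict maps a primary key
# to the rest of the row.
previous = {1: ("Ada", "NY"), 2: ("Boris", "LA"), 3: ("Chen", "SF")}
current  = {1: ("Ada", "NY"), 2: ("Boris", "TX"), 4: ("Dee", "CHI")}

# Set operations on the key sets classify every change.
inserted = {k: current[k] for k in current.keys() - previous.keys()}
deleted  = {k: previous[k] for k in previous.keys() - current.keys()}
updated  = {k: current[k] for k in current.keys() & previous.keys()
            if current[k] != previous[k]}
```

This makes the trade-off concrete: the source only ever serves full scans, but the warehouse must hold and compare two complete copies of each table.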
Physical Extraction Methods
Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Such an offline structure might already exist or it might be generated by an extraction routine.
There are the following methods of physical extraction:
- Online Extraction
- Offline Extraction
Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables, or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system.
With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.
Offline Extraction
This is the preferred method. The data is not extracted directly from the source system but is staged explicitly outside the original source system. Staged this way, the data serves as a source of truth: it is kept pristine, typically in blob storage. The data either already has an existing structure (for example, redo logs, archive logs, or transportable tablespaces) or was created by an extraction routine.
You should consider the following structures:
- Flat files: data in a defined, generic format. Additional information about the source object is necessary for further processing.
- Dump files: Oracle-specific format. Information about the containing objects is included.
- Redo and archive logs: information is in a special, additional dump file.
- Transportable tablespaces
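For the flat-file case, the "additional information about the source object" often travels as a separate descriptor alongside the data file. The sketch below pairs a delimited data file with a sidecar metadata file; the file names, the pipe delimiter, and the metadata fields are all illustrative assumptions, not a standard format.

```python
import csv
import json

# Sketch of a flat file in a defined, generic format: the data file itself
# carries no schema, so a sidecar metadata file describes the source object
# for further processing downstream.
rows = [(1, "Ada"), (2, "Boris")]

with open("customers.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(rows)

metadata = {
    "source_object": "CUSTOMERS",   # assumed source table name
    "columns": ["id", "name"],
    "delimiter": "|",
    "row_count": len(rows),         # lets the loader verify completeness
}
with open("customers.meta.json", "w") as f:
    json.dump(metadata, f)
```

The row count in the metadata gives the loading process a cheap completeness check before it touches the data.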
Change Data Capture
An important consideration for extraction is incremental extraction, also called CDC (Change Data Capture). If a data warehouse extracts data from an operational system on a nightly basis, then the data warehouse requires only the data that has changed since the last extraction (that is, the data that has been modified in the past 24 hours).
When it is possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process) can be much more efficient, because it must extract a much smaller volume of data. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the system. Change Data Capture is typically the most challenging technical issue in data extraction, although practical solutions exist: both open-source and commercial products can capture change data, depending on the complexity of the source systems.
Extraction Using Data Files
Most database systems provide mechanisms for exporting or unloading data from the internal database format into flat files. Extracts from mainframe systems often use COBOL programs, but many databases, as well as third-party software vendors, provide export or unload utilities.
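As one concrete example of such an export mechanism, SQLite's `iterdump()` serializes the internal database content as SQL text, analogous to the unload utilities mentioned above. The table and values are illustrative.

```python
import sqlite3

# Sketch of a database-native export: iterdump() writes the internal
# database content out as a stream of SQL statements that can recreate it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('A1', 9.99)")
conn.commit()

# Full export as SQL text (schema plus data).
dump_sql = "\n".join(conn.iterdump())
```

Other systems expose the same idea through tools such as `pg_dump`, `mysqldump`, or Oracle Data Pump, each writing the export in its own format.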
Data extraction does not necessarily mean that entire database structures are unloaded into flat files. In many cases, it may be appropriate to unload entire database tables or objects. In other cases, it may be more appropriate to unload only a subset of a given table, such as the changes in the source system since the last extraction, or the results of joining multiple tables together. Different extraction techniques vary in their capabilities to support these two scenarios. Parallel extraction techniques can use considerably more CPU and I/O resources on the source system, and the impact on the source system should be evaluated before parallelizing any extraction technique.
There are different methods for extracting data from source systems. As a data engineer, one has to understand the business use case, the source system and its complexity, the data structures, the target system, and the volume and velocity of the data in order to design a robust data extraction pipeline.
In the next post, let us discuss Data Transformation.