By default, a Spring Batch Job runs in a single thread. To increase throughput, we can parallelize the Job by partitioning it. The core architecture of Job partitioning is as follows.
- Partitioned Step : a wrapper around an ordinary Step, configured with a Partitioner and a grid size (or partition size)
- Single Step : an ordinary Step with a Reader, a Processor, and a Writer
- Partitioner : the core of partitioning; it generates a Step Execution Context for each Single Step
- Step Execution Context : usually contains the query parameters for each Step
By partitioning a Job, we can increase concurrency up to the grid size (or partition size). Let me show you an example.
CREATE TABLE TB_PARTITION_SOURCE (
    DAY_OF_WEEK NUMBER,
    COL1 NUMBER,
    COL2 VARCHAR2(10),
    PRIMARY KEY (DAY_OF_WEEK, COL1)
);

CREATE TABLE TB_PARTITION_TARGET (
    NEW_COL1 NUMBER,
    NEW_COL2 VARCHAR2(10),
    PRIMARY KEY (NEW_COL1)
);
Notice that the source table (TB_PARTITION_SOURCE) has a partition column, DAY_OF_WEEK, whose value ranges from 1 (Sunday) to 7 (Saturday).
To split the query, I use the following SQL.
SELECT COL1, COL2
FROM TB_PARTITION_SOURCE
WHERE DAY_OF_WEEK BETWEEN ? AND ?
A Partitioner is an implementation of org.springframework.batch.core.partition.support.Partitioner. Its role is to create a StepExecutionContext for each partitioned Step.
The example builds the query parameters (from, to) depending on the grid size.
| Grid size | Query parameters (from, to) |
|-----------|------------------------------|
| 3 | 1-3, 4-5, 6-7 |
| 4 | 1-2, 3-4, 5-6, 7-7 |
| 5 | 1-2, 3-4, 5-5, 6-6, 7-7 |
| 6 | 1-2, 3-3, 4-4, 5-5, 6-6, 7-7 |
| 7 | 1-1, 2-2, 3-3, 4-4, 5-5, 6-6, 7-7 |
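The range splitting in the table above can be sketched in plain Java. This is a minimal sketch of the partitioning arithmetic only (the class and method names here are mine, not from the sample project); the real DayOfWeekPartitioner would additionally wrap each (from, to) pair in an ExecutionContext and return the contexts in a Map, one entry per partition.

```java
import java.util.ArrayList;
import java.util.List;

public class DayRangeSplitter {

    /**
     * Splits days 1..7 into gridSize contiguous (from, to) ranges.
     * The first (7 % gridSize) ranges receive one extra day, which
     * reproduces the table above for grid sizes 3 through 7.
     */
    static List<int[]> split(int gridSize) {
        final int totalDays = 7;
        int base = totalDays / gridSize;      // minimum days per partition
        int remainder = totalDays % gridSize; // partitions that get one extra day
        List<int[]> ranges = new ArrayList<>();
        int from = 1;
        for (int i = 0; i < gridSize; i++) {
            int size = base + (i < remainder ? 1 : 0);
            ranges.add(new int[] { from, from + size - 1 });
            from += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : split(4)) {
            System.out.println(r[0] + "-" + r[1]); // prints 1-2, 3-4, 5-6, 7-7
        }
    }
}
```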
Example Spring Context
The partitioned Step references "SingleStep" together with a partitioner and a grid size.
SingleStep is an ordinary Step with a Reader, a Processor, and a Writer.
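The wiring might look roughly like the following sketch using the Spring Batch XML namespace. Except for "SingleStep", the bean ids and attribute values here are illustrative placeholders; the actual configuration is in job-test-partition-context.xml in the sample project.

```xml
<!-- Sketch only: ids other than "SingleStep" are hypothetical. -->
<batch:job id="testPartitionJob">
    <batch:step id="partitionedStep">
        <batch:partition step="SingleStep" partitioner="dayOfWeekPartitioner">
            <batch:handler grid-size="4" task-executor="taskExecutor"/>
        </batch:partition>
    </batch:step>
</batch:job>

<batch:step id="SingleStep">
    <batch:tasklet>
        <batch:chunk reader="itemReader"
                     processor="itemProcessor"
                     writer="itemWriter"
                     commit-interval="100"/>
    </batch:tasklet>
</batch:step>
```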
Full source code
You can download the full source code from https://github.com/tkstone/spring_batch_sample01. The files to check are:
- src/main/resources/spring/job-test-partition-context.xml – main spring context
- src/main/java/test/main/TestPartitionRun.java – main job invoker
- src/main/java/test/reader/DayOfWeekPartitioner.java – partitioner
- src/main/java/test/reader/TestPartitionParameterSetter.java – parameter setter
Some points to consider
- To use partitioning, you must write a Partitioner yourself; that is, you must build the logic that splits the source data. This is a weak point compared to Hadoop, where partitioning is done automatically.
- The source data (i.e. the table) must have a suitable partition key (range or list). If the source table's key is an auto-increment value, partitioning is not suitable. (Applying a hash function to the key can prevent index use.)
- If large-scale parallelization is required, consider using Hadoop. But for RDBMS-to-RDBMS ETL, Spring Batch can be a good choice.