At the April 27 Data Architecture call,
I volunteered to write a use case for replication of subsets of data. Below
is my attempt to capture the basic idea in a few examples.
Allen
Data Subset Replication
Replication of entire objects (e.g.,
a file, an entire database) is a natural and obvious place to start considering
replication. However, there is a real need to replicate subsets of
data. Here are a number of motivating examples (use cases):
Company A has an employee database containing
all information about its employees. The database is multiple terabytes
in size and updates are frequent. Suppose that the payroll information
is contained in a single table in that database. The payroll department
needs to have fast access to the payroll table. This table is only
a few tens of gigabytes in size and updates are infrequent. Replicating
just this subset reduces storage consumption on the payroll system, reduces
the bandwidth used to maintain the replica and reduces the processing power
used to create and handle updates to the payroll system. This is
an example of replicating a single table instead of an entire database.
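To make the single-table case concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration under assumptions: the function name replicate_table and the table names are hypothetical, SQLite stands in for whatever database Company A actually uses, and the sketch copies a one-time snapshot rather than propagating later updates.

```python
import sqlite3


def replicate_table(source, replica, table):
    """Copy one table (schema and rows) from source to replica.

    Hypothetical helper for illustration; returns the number of rows copied.
    """
    # SQLite keeps the original CREATE TABLE text in sqlite_master.
    row = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no such table: {table}")

    # Recreate only this table on the replica; other tables are never touched.
    replica.execute(row[0])

    rows = source.execute(f'SELECT * FROM "{table}"').fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        replica.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
    replica.commit()
    return len(rows)
```

The payroll system would call replicate_table(source, replica, "payroll") and end up holding only the payroll table, not the rest of the employee database.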
A Life Sciences example. Suppose
that there is a large file that describes the entire human genome held
on some server at UCLA. This file is multiple terabytes in size.
A researcher in Paris desires to perform some computation on genes
14 through 19. For efficiency the data being processed must be at
a server in Paris. Instead of expending considerable resources to
move the entire file, relatively few resources are used to
move just that portion of the file containing genes 14 through 19. Now
suppose that a second researcher in Paris desires to perform a computation
on genes 18 and 19. Instead of moving the entire file, or even moving
the subset of the file containing genes 18 and 19, that researcher can
reuse the partial replica already held in Paris since it contains genes
14 through 19 and the genes of interest are a subset of those.
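The reuse decision in this example amounts to a range-containment check. The following Python sketch shows one way it might look; the PartialReplica class, its fields, and find_reusable_replica are hypothetical names introduced for illustration, and it assumes a subset can be described as a contiguous range of gene numbers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartialReplica:
    """Hypothetical record of a partial replica: a site plus a gene range."""
    site: str
    first_gene: int
    last_gene: int

    def covers(self, first: int, last: int) -> bool:
        # The request is satisfied if its range lies inside the replica's range.
        return self.first_gene <= first and last <= self.last_gene


def find_reusable_replica(
    replicas: Iterable[PartialReplica], site: str, first: int, last: int
) -> Optional[PartialReplica]:
    """Return a replica already at `site` that covers the request, if any."""
    for replica in replicas:
        if replica.site == site and replica.covers(first, last):
            return replica
    return None  # nothing reusable; a new subset would have to be moved
```

With a replica of genes 14 through 19 in Paris, a request for genes 18 and 19 finds that replica, while a request for genes 12 through 15 does not and would require moving more data.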
A final example from the database world.
Suppose that a multi-site hospital keeps a database of its patients
and that the database contains all patient information including voluminous
information such as x-ray images and MRI scans. Thus it is very large.
One of the hospitals in the system, located in Boston, is very specialized
- it only sees local elderly cancer patients. This hospital needs
to have a replica of the database that contains only those patients - replicating
the entire database is, as above, a waste of precious resources. So
the hospital needs a replica that contains only those patient records that
match a query that might look something like:

    Patient.age > 65
    AND DISTANCE(Patient.address, "Boston") < '50 miles'
    AND (Patient.illnesses INCLUDES "cancer"
         OR Patient.pastIllnesses INCLUDES "cancer")
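The selection query can be read as a predicate applied to each patient record. Here is a rough Python sketch of that idea. The Patient class and its field names are hypothetical, and the sketch assumes the distance from Boston is already available in miles as a precomputed field, since computing it from an address is outside the scope of this example.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Patient:
    """Hypothetical patient record with only the fields the query needs."""
    name: str
    age: int
    miles_from_boston: float  # assumption: precomputed from the address
    illnesses: Set[str] = field(default_factory=set)
    past_illnesses: Set[str] = field(default_factory=set)


def belongs_in_boston_replica(p: Patient) -> bool:
    # Mirrors the three conditions of the query: age, distance, cancer history.
    return (
        p.age > 65
        and p.miles_from_boston < 50
        and ("cancer" in p.illnesses or "cancer" in p.past_illnesses)
    )


def select_subset(patients: Iterable[Patient]) -> List[Patient]:
    """Return only the records the Boston replica needs to hold."""
    return [p for p in patients if belongs_in_boston_replica(p)]
```

Only records for which the predicate is true would be copied to the Boston replica; the same predicate would also have to be evaluated against later updates to decide whether they affect the replica.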