At the April 27 Data Architecture call, I volunteered to write a use case for replication of subsets of data. Below is my attempt to capture the basic idea in a few examples.

Allen

Data Subset Replication

Replication of entire objects (e.g., a file, an entire database) is a natural and obvious place to start when considering replication. However, there is a real need to replicate subsets of data. Here are a number of motivating examples (use cases):

Company A has an employee database containing all information about its employees. The database is multiple terabytes in size and updates are frequent. Suppose that the payroll information is contained in a single table in that database. The payroll department needs to have fast access to the payroll table. This table is only a few tens of gigabytes in size and updates are infrequent. Replicating just this subset reduces storage consumption on the payroll system, reduces the bandwidth used to maintain the replica and reduces the processing power used to create and handle updates to the payroll system. This is an example of replicating a single table instead of an entire database.
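
To make the single-table case concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration under assumptions of my own: the in-memory databases, the table and column names, and the replicate_table helper are all hypothetical, and a real replication service would also have to propagate later updates rather than take a one-time copy.

```python
import sqlite3


def replicate_table(source, replica, table):
    """Copy one table (schema and rows) from source to replica.

    Hypothetical helper: a one-time snapshot, with no update propagation.
    """
    # Look up the CREATE TABLE statement stored for the requested table.
    row = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no such table: {table}")
    replica.execute(row[0])

    # Copy the rows; only this table's data ever leaves the source.
    rows = source.execute(f'SELECT * FROM "{table}"').fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        replica.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
    replica.commit()
    return len(rows)


# Stand-in for Company A's employee database (names are assumptions).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
source.execute("CREATE TABLE payroll (employee_id INTEGER, salary INTEGER)")
source.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
source.executemany("INSERT INTO payroll VALUES (?, ?)", [(1, 5000), (2, 6000)])
source.commit()

# Stand-in for the payroll department's system.
replica = sqlite3.connect(":memory:")
copied = replicate_table(source, replica, "payroll")
replica_tables = [
    r[0] for r in replica.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
```

After this runs, the replica holds the payroll table and nothing else, which is the storage and bandwidth saving the use case is after.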

A Life Sciences example. Suppose that there is a large file that describes the entire human genome held on some server at UCLA. This file is multiple terabytes in size. A researcher in Paris desires to perform some computation on genes 14 through 19. For efficiency the data being processed must be at a server in Paris. Instead of expending considerable resources to move the entire file, relatively small amounts of resources are used to move just that portion of the file containing genes 14 through 19. Now suppose that a second researcher in Paris desires to perform a computation on genes 18 and 19. Instead of moving the entire file, or even moving the subset of the file containing genes 18 and 19, that researcher can reuse the partial replica already held in Paris, since it contains genes 14 through 19 and the genes of interest are a subset of those.
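
The reuse decision in this example comes down to a containment check: does an existing partial replica at the requester's site cover the requested range? A rough Python sketch follows. The PartialReplica record, the catalog list and the find_local_replica function are names I made up for illustration, and I am assuming a subset can be described as a contiguous range of gene numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialReplica:
    """Hypothetical catalog entry: a contiguous gene range held at a site."""
    site: str
    first_gene: int
    last_gene: int

    def covers(self, first, last):
        # The replica can serve a request only if the whole requested
        # range lies inside the range it holds.
        return self.first_gene <= first and last <= self.last_gene


def find_local_replica(catalog, site, first, last):
    """Return a replica at `site` covering genes first..last, or None."""
    for entry in catalog:
        if entry.site == site and entry.covers(first, last):
            return entry
    return None


# The first researcher's transfer left genes 14 through 19 in Paris.
catalog = [PartialReplica("Paris", 14, 19)]

# Second researcher: genes 18 and 19 are already covered locally.
hit = find_local_replica(catalog, "Paris", 18, 19)

# A request for genes 12 through 15 is only partly covered, so no reuse.
miss = find_local_replica(catalog, "Paris", 12, 15)
```

A partial overlap such as the last request is treated here as a miss; whether a system should instead fetch only the missing genes is a design question this sketch does not try to answer.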

A final example from the database world. Suppose that a multi-site hospital keeps a database of its patients and that the database contains all patient information including voluminous information such as x-ray images and MRI scans. Thus it is very large. One of the hospitals in the system, located in Boston, is very specialized - it only sees local elderly cancer patients. This hospital needs to have a replica of the database that contains only those patients - replicating the entire database is, as above, a waste of precious resources. So the hospital needs a replica that contains only those patient records that match a query that might look something like: "Patient.age > 65 AND DISTANCE(Patient.address, "Boston") < '50 miles' AND (Patient.illnesses INCLUDES "cancer" OR Patient.pastIllnesses INCLUDES "cancer")"
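
The query above can be read as a predicate that decides, record by record, what belongs in the Boston replica. Here is a small Python sketch of that reading. Everything in it is an assumption for illustration: the Patient fields, the sample records, and in particular the miles_from_boston field, which stands in for the DISTANCE(...) function so that no geocoding is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    """Hypothetical patient record; field names are illustrative only."""
    name: str
    age: int
    miles_from_boston: float  # stands in for DISTANCE(Patient.address, "Boston")
    illnesses: set = field(default_factory=set)
    past_illnesses: set = field(default_factory=set)


def belongs_in_boston_replica(p):
    """Mirror the example query: elderly, local, current or past cancer."""
    return (
        p.age > 65
        and p.miles_from_boston < 50
        and ("cancer" in p.illnesses or "cancer" in p.past_illnesses)
    )


# Stand-in for the full multi-site patient database.
patients = [
    Patient("A", 70, 10.0, illnesses={"cancer"}),
    Patient("B", 80, 5.0, past_illnesses={"cancer"}),
    Patient("C", 40, 3.0, illnesses={"cancer"}),       # too young
    Patient("D", 72, 300.0, illnesses={"cancer"}),     # not local
    Patient("E", 90, 1.0, illnesses={"flu"}),          # no cancer history
]

# Only matching records would be shipped to the Boston replica.
boston_replica = [p for p in patients if belongs_in_boston_replica(p)]
replica_names = [p.name for p in boston_replica]
```

The same predicate would also have to be applied to every later update, so that records enter or leave the replica as patients age, move, or receive a new diagnosis.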