Data deduplication is a method of reducing
storage needs by eliminating redundant data.
Overview
Two types of data deduplication
are available on Tivoli® Storage
Manager: client-side data deduplication and server-side
data deduplication.
Client-side data deduplication is
a data deduplication technique that is used on the backup-archive
client to remove redundant data during backup and archive processing
before the data is transferred to the Tivoli Storage
Manager server. Using client-side data deduplication can reduce the
amount of data that is sent over a local area network.
Server-side
data deduplication is a data deduplication technique that is
done by the server. The Tivoli Storage
Manager administrator can use the DEDUPLICATION parameter
on the REGISTER NODE or UPDATE NODE server
command to specify where data deduplication takes place
(client or server).
Enhancements
With client-side
data deduplication, you can:
- Exclude specific files on a client from data deduplication.
- Enable a data deduplication cache that reduces network traffic
between the client and the server. The cache contains extents that
were sent to the server in previous incremental backup operations.
Instead of querying the server for the existence of an extent, the
client queries its cache.
Specify a size and location for a client
cache. If an inconsistency between the server and the local cache
is detected, the local cache is removed and repopulated.
Note: For
applications that use the Tivoli Storage
Manager API, the
data deduplication cache must not be used because of the potential
for backup failures caused by the cache being out of sync with the Tivoli Storage
Manager server.
If multiple concurrent Tivoli Storage
Manager client sessions
are configured, configure a separate cache for each
session.
- Enable both client-side data deduplication and compression to
reduce the amount of data that is stored by the server. Each extent
is compressed before it is sent to the server. The trade-off is between
storage savings and the processing power that is required to compress
client data. In general, if you compress and deduplicate data on the
client system, you use approximately twice as much processing
power as with data deduplication alone.
The server can work with deduplicated,
compressed data. In addition, backup-archive clients earlier than
V6.2 can restore deduplicated, compressed data.
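The cache behavior described in the list above can be illustrated with a minimal sketch. It is not the product's implementation: the class and function names are hypothetical, the cache is assumed to hold extent digests rather than the extents themselves, and the policy of ignoring new entries when the cache is full is an assumption made only to keep the example short.

```python
class DedupCache:
    """Hypothetical local cache of digests for extents already sent."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries  # stands in for the configured cache size
        self.digests = set()

    def contains(self, digest: str) -> bool:
        return digest in self.digests

    def add(self, digest: str) -> None:
        # Assumed policy: ignore new entries once the cache is full.
        if len(self.digests) < self.max_entries:
            self.digests.add(digest)

    def reset(self) -> None:
        # Discard everything; later backups repopulate the cache.
        self.digests.clear()


def extent_is_known(digest, cache, server_digests, stats) -> bool:
    """Check the cache first; query the server only on a cache miss."""
    if cache.contains(digest):
        return True
    stats["server_queries"] += 1
    known = digest in server_digests
    if known:
        cache.add(digest)
    return known


def record_sent(digest, cache, server_digests) -> None:
    """Record an extent that was just sent to the server."""
    server_digests.add(digest)
    cache.add(digest)


def check_consistency(digest, cache, server_digests) -> bool:
    """Reset the cache if it lists an extent that the server does not have."""
    if cache.contains(digest) and digest not in server_digests:
        cache.reset()
        return False
    return True


cache = DedupCache(max_entries=1000)
server_digests = set()  # hypothetical stand-in for the server's extent index
stats = {"server_queries": 0}

first_lookup = extent_is_known("d1", cache, server_digests, stats)   # miss: asks the server
record_sent("d1", cache, server_digests)
second_lookup = extent_is_known("d1", cache, server_digests, stats)  # hit: no server query

server_digests.discard("d1")  # simulate the server and cache going out of sync
consistent = check_consistency("d1", cache, server_digests)
```

In this sketch the second lookup is answered locally, which is the network saving that the cache is meant to provide, and the out-of-sync case empties the cache so that it can be rebuilt.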
Client-side data deduplication uses the
following process:
- The client creates extents. Extents are parts of
files that are compared with other file extents to identify duplicates.
- The client and server work together to identify duplicate extents.
The client sends non-duplicate extents to the server.
- Subsequent client data-deduplication operations create new extents.
Some or all of those extents might match the extents that were created
in previous data-deduplication operations and sent to the server.
Matching extents are not sent to the server again.
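The process above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method that Tivoli Storage Manager uses: fixed-size extents, SHA-256 digests, and the Server class are all hypothetical choices, and the tiny extent size is chosen only to make the example easy to follow.

```python
import hashlib

EXTENT_SIZE = 4  # hypothetical and deliberately tiny


def make_extents(data: bytes, size: int = EXTENT_SIZE) -> list:
    """Split data into fixed-size extents (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Server:
    """Hypothetical stand-in for the server's store of known extents."""

    def __init__(self) -> None:
        self.extents = {}

    def has_extent(self, digest: str) -> bool:
        return digest in self.extents

    def store_extent(self, digest: str, extent: bytes) -> None:
        self.extents[digest] = extent


def backup(data: bytes, server: Server) -> int:
    """Send only extents the server does not already hold; return the count sent."""
    sent = 0
    for extent in make_extents(data):
        digest = hashlib.sha256(extent).hexdigest()
        if not server.has_extent(digest):  # client and server identify duplicates
            server.store_extent(digest, extent)
            sent += 1
    return sent


server = Server()
first = backup(b"AAAABBBBCCCC", server)   # all three extents are new
second = backup(b"AAAABBBBDDDD", server)  # only the DDDD extent is new
```

The second backup sends a single extent because the other two match extents that were sent during the first operation.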
Benefits
Client-side
data deduplication provides several advantages:
- It can reduce the amount of data that is sent over the local area
network (LAN).
- The processing power that is required to identify duplicate data
is offloaded from the server to client nodes. Server-side data deduplication
is always enabled for deduplication-enabled storage pools. However,
files that are in the deduplication-enabled storage pools and that
were deduplicated by the client do not require additional processing.
- The processing power that is required to remove duplicate data
on the server is eliminated, allowing space savings on the server
to occur immediately.
Client-side data deduplication has a possible disadvantage.
The server does not have whole copies of client files until you
back up the primary storage pools that contain client extents to a
non-deduplicated copy storage pool. (Extents are parts of a
file that are created during the data-deduplication process.) During
storage pool backup to a non-deduplicated storage pool, client extents
are reassembled into contiguous files.
By default, primary sequential-access
storage pools that are set up for data deduplication must be backed
up to non-deduplicated copy storage pools before they can be reclaimed
and before duplicate data can be removed. The default ensures that
the server has copies of whole files at all times, in either a primary
storage pool or a copy storage pool.
Important: For
further data reduction, you can enable client-side data deduplication
and compression together. Each extent is compressed before it is sent
to the server. Compression saves space, but it increases the processing
time on the client workstation.
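As a rough illustration of combining the two techniques, the following sketch compresses an extent before it would be sent. The function names are hypothetical, zlib and SHA-256 are stand-ins for whatever the product uses, and computing the digest on the uncompressed extent is an assumption.

```python
import hashlib
import zlib


def prepare_extent(extent: bytes) -> tuple:
    """Return (digest of the raw extent, compressed payload to send)."""
    # Assumption: the extent is identified by a digest of its uncompressed bytes.
    digest = hashlib.sha256(extent).hexdigest()
    payload = zlib.compress(extent)  # extra client-side processing
    return digest, payload


def restore_extent(payload: bytes) -> bytes:
    """Recover the original extent from a compressed payload."""
    return zlib.decompress(payload)


extent = b"backup data " * 100
digest, payload = prepare_extent(extent)
```

For repetitive data such as this sample, the payload is smaller than the extent, at the cost of the additional compression work on the client.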
In a data deduplication-enabled storage
pool (file pool), only one instance of a data extent is retained. Other
instances of the same data extent are replaced with a pointer to the
retained instance.
When client-side data deduplication is enabled
and the server runs out of storage in the destination pool while
a next pool is defined, the server stops the transaction.
The Tivoli Storage
Manager client
then retries the transaction without client-side data deduplication. To
recover, the Tivoli Storage
Manager administrator
must add more scratch volumes to the original file pool, or retry
the operation with deduplication disabled.
For client-side data
deduplication, the Tivoli Storage
Manager server must be Version 6.2 or higher.
Prerequisites
When you configure
client-side data deduplication, the following requirements must be
met:
- The client and server must be at version 6.2.0 or later. Always
use the latest maintenance version.
- When a client backs up or archives a file, the data is written
to the primary storage pool that is specified by the copy group of
the management class that is bound to the data. To deduplicate the
client data, the primary storage pool must be a sequential-access
disk (FILE) storage pool that is enabled for data deduplication.
- The value of the DEDUPLICATION option on the
client must be set to YES. You can set the DEDUPLICATION option
in the client options file, in the preference editor of the IBM® Tivoli Storage Manager client
GUI, or in the client option set on the Tivoli Storage
Manager server.
Use the DEFINE CLIENTOPT command to set the DEDUPLICATION option
in a client option set. To prevent the client from overriding the
value in the client option set, specify FORCE=YES.
- Client-side data deduplication must be enabled on the server.
To enable client-side data deduplication, use the DEDUPLICATION parameter
on the REGISTER NODE or UPDATE NODE server
command. Set the value of the parameter to CLIENTORSERVER.
- Ensure that files on the client are not excluded from client-side data
deduplication processing. By default, all files are included. You
can optionally exclude specific files from client-side data deduplication
with the exclude.dedup client option.
- Files on the client must not be encrypted. Encrypted files and
files from encrypted file systems cannot be deduplicated.
- Files must be larger than 2 KB and transactions must be below
the value that is specified by the CLIENTDEDUPTXNLIMIT option.
Files that are 2 KB or smaller are not deduplicated.
The server can limit the maximum transaction size for data
deduplication by setting the CLIENTDEDUPTXNLIMIT option
on the server. See the Administrator's Guide for
details.
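The file-level requirements above can be summarized as a simple check. This is a sketch only: ClientFile and is_eligible are hypothetical names, 2 KB is assumed to mean 2048 bytes, and the transaction limit that CLIENTDEDUPTXNLIMIT sets is not modeled.

```python
from dataclasses import dataclass

MIN_SIZE_BYTES = 2 * 1024  # assumption: 2 KB is taken as 2048 bytes


@dataclass
class ClientFile:
    """Hypothetical description of a file that is about to be backed up."""
    path: str
    size_bytes: int
    encrypted: bool = False
    excluded: bool = False  # True if an exclude.dedup rule matches the file


def is_eligible(f: ClientFile) -> bool:
    """Return True if the file meets the file-level requirements listed above."""
    return f.size_bytes > MIN_SIZE_BYTES and not f.encrypted and not f.excluded


large = ClientFile("report.dat", 4096)
small = ClientFile("note.txt", 2048)  # exactly 2 KB, so not deduplicated
secret = ClientFile("vault.bin", 4096, encrypted=True)
skipped = ClientFile("image.img", 4096, excluded=True)
```

Only the first file passes the check; the others fail on size, encryption, and exclusion.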
The following operations take precedence over client-side
data deduplication:
- LAN-free data movement
- Subfile backup operations
- Simultaneous-write operations
- Data encryption
Important: Do not schedule or enable any of those
operations during client-side data deduplication. If any of those
operations occur during client-side data deduplication, client-side
data deduplication is turned off, and a message is written to the
error log.
The setting on the server ultimately determines
whether client-side data deduplication is enabled. See Table 1.
Table 1. Data deduplication settings: Client and server
Value of the client DEDUPLICATION option | Setting on the server | Data deduplication location
Yes | On either the server or the client | Client
Yes | On the server only | Server
No | On either the server or the client | Server
No | On the server only | Server
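The logic of Table 1 can also be expressed as a small function. The function name is hypothetical. CLIENTORSERVER is the parameter value named in this topic; SERVERONLY is assumed here as the value that corresponds to "On the server only".

```python
def dedup_location(client_option_yes: bool, server_setting: str) -> str:
    """Return where data deduplication takes place, following Table 1."""
    if server_setting not in ("CLIENTORSERVER", "SERVERONLY"):
        raise ValueError("unknown server setting: " + server_setting)
    # The client deduplicates only when both sides allow it.
    if client_option_yes and server_setting == "CLIENTORSERVER":
        return "client"
    return "server"


locations = [
    dedup_location(True, "CLIENTORSERVER"),
    dedup_location(True, "SERVERONLY"),
    dedup_location(False, "CLIENTORSERVER"),
    dedup_location(False, "SERVERONLY"),
]
```

The four results follow the table rows in order: client-side data deduplication occurs only in the first case.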
Encrypted files
The
Tivoli Storage
Manager server and
the backup-archive client cannot deduplicate encrypted files. If
an encrypted file is encountered during data deduplication processing,
the file is not deduplicated, and a message is logged.
Tip: You
do not have to process encrypted files separately from files that
are eligible for client-side data deduplication. Both types of files
can be processed in the same operation. However, they are sent to
the server in different transactions.
As a security precaution,
you can take one or more of the following steps:
- Enable storage-device encryption together with client-side data
deduplication.
- Use client-side data deduplication only for nodes that are secure.
- If you are uncertain about network security, enable Secure Sockets
Layer (SSL).
- If you do not want certain objects (for example, image objects)
to be processed by client-side data deduplication, you can exclude
them on the client. If an object is excluded from client-side data
deduplication and it is sent to a storage pool that is set up for
data deduplication, the object is deduplicated on the server.
- Use the SET DEDUPVERIFICATIONLEVEL command
to detect possible security attacks on the server during client-side
data deduplication. Using this command, you can specify a percentage
of client extents for the server to verify. If the server detects
a possible security attack, a message is displayed.