Pseudonymizer (ULTIMATE)
Your GitLab database contains sensitive information. To protect sensitive information when you run analytics on your database, you can use the Pseudonymizer service, which:
- Uses HMAC(SHA256) to mutate fields containing sensitive information.
- Preserves references (referential integrity) between fields.
- Exports your GitLab data, scrubbed of sensitive material.
WARNING: If the source data is available, users can compare and correlate the scrubbed data with the original.
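As a rough illustration of why keyed hashing preserves references, the following Ruby sketch applies HMAC-SHA256 to an email address that appears in two tables. The secret key, the `pseudonymize` helper, and the sample rows are assumptions made for this example. They are not the Pseudonymizer's actual implementation.

```ruby
require 'openssl'

# Hypothetical secret; a real deployment would keep this out of source control.
SECRET_KEY = 'example-secret-key'

# Replace a sensitive value with its HMAC-SHA256 hex digest.
# The same input and key always produce the same digest.
def pseudonymize(value, key = SECRET_KEY)
  OpenSSL::HMAC.hexdigest('SHA256', key, value.to_s)
end

# Sample rows (invented for illustration) that share an email address.
users  = [{ id: 1, email: 'alice@example.com' }]
emails = [{ user_id: 1, email: 'alice@example.com' }]

scrubbed_users  = users.map  { |row| row.merge(email: pseudonymize(row[:email])) }
scrubbed_emails = emails.map { |row| row.merge(email: pseudonymize(row[:email])) }

# The raw address is gone, but both tables still agree on the scrubbed value.
puts scrubbed_users.first[:email] == scrubbed_emails.first[:email] # prints true
puts scrubbed_users.first[:email].length                           # prints 64
```

Because the mapping is deterministic, anyone who holds both the original and the scrubbed data can line the two up, which is the risk described in the warning above.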
To generate a pseudonymized data set:
- Configure Pseudonymizer fields and output location.
- Enable Pseudonymizer data collection.
- Optional. Generate a data set manually.
Configure Pseudonymizer
To use the Pseudonymizer, configure both the fields you want to pseudonymize and the location to store the scrubbed data:
- Create a manifest file: This file describes the fields to include or pseudonymize.
  - Default manifest - GitLab provides a default manifest in your GitLab installation (example manifest.yml file). To use the example manifest file, use the config/pseudonymizer.yml relative path when you configure connection parameters.
  - Custom manifest - To use a custom manifest file, use the absolute path to the file when you configure the connection parameters.
- Configure connection parameters: In the configuration method appropriate for your version of GitLab, specify the object storage connection parameters (pseudonymizer.upload.connection).
For Omnibus installations:
- Edit /etc/gitlab/gitlab.rb and add the following lines, replacing the values with your own:

      gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
      gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
      gitlab_rails['pseudonymizer_upload_connection'] = {
        'provider' => 'AWS',
        'region' => 'eu-central-1',
        'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
        'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
      }

  If you are using AWS IAM profiles, omit the AWS access key and secret access key/value pairs:

      gitlab_rails['pseudonymizer_upload_connection'] = {
        'provider' => 'AWS',
        'region' => 'eu-central-1',
        'use_iam_profile' => true
      }
- Save the file and reconfigure GitLab for the changes to take effect.
For installations from source:
- Edit /home/git/gitlab/config/gitlab.yml and add or amend the following lines:

      pseudonymizer:
        manifest: config/pseudonymizer.yml
        upload:
          remote_directory: 'gitlab-elt' # bucket name
          connection:
            provider: AWS
            aws_access_key_id: AWS_ACCESS_KEY_ID
            aws_secret_access_key: AWS_SECRET_ACCESS_KEY
            region: eu-central-1
- Save the file and restart GitLab for the changes to take effect.
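Before you reconfigure or restart, it can help to sanity-check the connection settings. The Ruby sketch below checks that a connection hash shaped like the Omnibus example either enables `use_iam_profile` or carries both AWS keys. The `connection_problems` helper is hypothetical and not part of GitLab. It only mirrors the two variants shown in this section.

```ruby
# Hypothetical helper: returns a list of problems found in a connection hash.
def connection_problems(connection)
  problems = []
  problems << "missing 'provider'" if connection['provider'].to_s.empty?

  # With an IAM profile, the access key pair is omitted.
  unless connection['use_iam_profile']
    %w[aws_access_key_id aws_secret_access_key].each do |key|
      problems << "missing '#{key}'" if connection[key].to_s.empty?
    end
  end
  problems
end

with_keys = {
  'provider' => 'AWS',
  'region' => 'eu-central-1',
  'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
  'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
}
with_iam = { 'provider' => 'AWS', 'region' => 'eu-central-1', 'use_iam_profile' => true }
incomplete = { 'provider' => 'AWS', 'region' => 'eu-central-1' }

p connection_problems(with_keys)  # no problems reported
p connection_problems(with_iam)   # no problems reported
p connection_problems(incomplete) # reports both missing keys
```

A check like this only catches missing entries. It does not confirm that the credentials are valid or that the bucket exists.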
Enable Pseudonymizer data collection
To enable data collection:
- On the top bar, select Menu > Admin.
- On the left sidebar, select Settings > Metrics and Profiling, then expand Pseudonymizer data collection.
- Select Enable Pseudonymizer data collection.
- Select Save changes.
Generate data set manually
You can also run the Pseudonymizer manually:
- Set these environment variables:
  - PSEUDONYMIZER_OUTPUT_DIR - Where to store the output CSV files. Defaults to /tmp. These commands produce CSV files that can be quite large. Make sure the directory can store a file at least 10% of the size of your database.
  - PSEUDONYMIZER_BATCH - The batch size when querying the database. Defaults to 100000.
- Run the command appropriate for your application:
  - Omnibus GitLab:

        sudo gitlab-rake gitlab:db:pseudonymizer

  - Installations from source:

        sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
After you run the command, upload the output CSV files to your configured object storage. After the upload completes, delete the output files from the local disk.
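To make the defaults and the sizing guidance concrete, the following Ruby sketch reads the two environment variables with the fallbacks stated above and estimates the minimum free space from a database size. The helper names are hypothetical. How the Rake task itself reads these variables is not shown here, so treat this as an approximation.

```ruby
# Read the two settings, falling back to the documented defaults.
def pseudonymizer_settings(env = ENV)
  {
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

# Rule of thumb from the text: allow at least 10% of the database size.
# Integer arithmetic rounds up and avoids floating-point error.
def minimum_output_bytes(database_bytes)
  (database_bytes + 9) / 10
end

# Example: override only the batch size; the output directory keeps its default.
settings = pseudonymizer_settings({ 'PSEUDONYMIZER_BATCH' => '5000' })
puts settings[:output_dir]
puts settings[:batch_size]

# Example: a 50 GiB database (an invented figure).
puts minimum_output_bytes(50 * 1024**3)
```

Passing a plain hash instead of `ENV` keeps the example independent of the shell it runs in.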