Sanitizing Files Before Submission

This page is about sanitizing files before submission to the QA Team via Bugzilla, email, or other means.

This page focuses on sanitizing common file formats such as ODF (odt, ods, odp, etc.) and Microsoft Office formats such as doc(x), xls(x), and ppt(x).

Why sanitize my files?
Example files submitted to the LibreOffice project are typically made publicly accessible so that our entire community can work on correcting any problems related to them. When files are submitted to Bugzilla (our bug-tracking system), the submitter grants permission to use them under a Free Culture license (typically CC-BY-SA 4.0), and the files and their contents may be reused by people both inside and outside the LibreOffice project.

We strongly recommend that bug reporters sanitize files before submission so that no personal or private information is unintentionally shared with others.

Effective sanitization
No single technique will remove all identifying information from a digital file. There are several techniques that can be used in conjunction to remove private text and metadata from an ODF file while (hopefully) retaining the necessary structure for reproducing a bug.

Remove private metadata
User data can be viewed and cleared in the document's Properties dialog (File ▸ Properties).


 * Under the General tab:
 * Click Reset to reset the general user data like total editing time and revision number.
 * Uncheck the Apply user data checkbox.
 * Under the Description and Custom Properties tabs clear any data you don't want disseminated.
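After clearing the properties and saving, it can help to check what metadata is still stored in the file. The following is a minimal sketch, not an official tool: it assumes the file is an ODF package (a zip archive) containing a meta.xml entry, and the function name list_metadata is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the office:meta element in ODF packages.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def list_metadata(odf_file):
    """Return (element name, text) pairs stored under office:meta in meta.xml.

    odf_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odf_file) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    meta = root.find(f"{{{OFFICE_NS}}}meta")
    if meta is None:
        return []
    entries = []
    for element in meta:
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        name = element.tag.split("}", 1)[-1]
        entries.append((name, (element.text or "").strip()))
    return entries
```

Calling list_metadata("sample.odt") (a hypothetical file name) returns the remaining entries, such as creator or title, so you can see at a glance whether anything private is left. Attribute values are not listed by this sketch.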

Disable change-tracking and remove stored changes
Recording changes is a useful feature for collaboration, but during sanitization it must be disabled and any changes already recorded must be cleared.


 * On the Security tab, make sure the Record changes checkbox is unchecked.

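Then clear all recorded changes by accepting or rejecting them in the dialog for managing tracked changes, so that none remain stored in the document.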
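To confirm that no recorded changes remain in a saved ODF file, the package can be inspected directly. This is a rough sketch under stated assumptions: change records are looked up as children of elements named tracked-changes in content.xml (text:tracked-changes in text documents, table:tracked-changes in spreadsheets), and the function name count_tracked_changes is invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def count_tracked_changes(odf_file):
    """Count change records stored in content.xml of an ODF package.

    Every child of an element whose local name is "tracked-changes"
    is counted as one record. odf_file may be a path or file-like object.
    """
    with zipfile.ZipFile(odf_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    total = 0
    for element in root.iter():
        # Compare on the local name so both text: and table: variants match.
        if element.tag.split("}", 1)[-1] == "tracked-changes":
            total += len(element)
    return total
```

A result of 0 suggests that no change records are stored, but it is not proof that the file is clean, so a visual check in LibreOffice is still worthwhile.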

Remove old versions
Go to File ▸ Versions and delete any older versions of the document that may be stored there.
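Whether old versions are still embedded can also be checked from outside LibreOffice. The sketch below rests on an assumption about the package layout, namely that stored versions appear as entries under Versions/ together with a VersionList.xml entry; the function name stored_version_entries is made up for this example.

```python
import zipfile


def stored_version_entries(odf_file):
    """List package entries that look like stored document versions.

    Assumption: versions live under "Versions/" and are indexed by
    "VersionList.xml". odf_file may be a path or file-like object.
    """
    with zipfile.ZipFile(odf_file) as archive:
        return [
            name
            for name in archive.namelist()
            if name == "VersionList.xml" or name.startswith("Versions/")
        ]
```

An empty list means no such entries were found in the package.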

Reveal hidden content
Once you've revealed any hidden content, you'll need to decide whether to delete it, sanitize it, or leave it as-is.

Writer

 * Turn on the display of hidden paragraphs (an option in the View menu) to be sure all hidden paragraphs are visible.

Calc

 * Make sure there aren't any hidden sheets.

Sanitize file text
After running the regular-expression replacements described below, examine the document to make sure that all private text has been sanitized.

Writer
The basic plan is to replace all characters with an "x". We can (try to) do this with "Find & Replace".

Preferred 
 * In the Search for field put [:alpha:]
 * In the Replace with field put "x"
 * Make sure Whole words only is unchecked
 * Make sure Regular expressions is checked.
 * Click Replace All

Alternative 
 * In the Search for field put "." (a single period)
 * In the Replace with field put "x"
 * Make sure Whole words only is unchecked
 * Click Other Options to expand the dialog, and make sure Regular expressions is checked.
 * Click Replace All

Now you have a document containing lots of "x"s. With luck, the bug will still be reproducible with that document.
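The difference between the two replacements can be previewed on a short string before touching the document. This is only an illustration in Python, not what LibreOffice runs internally: Python's re module has no [:alpha:] class, so the sketch approximates it with [^\W\d_] (any word character that is not a digit or underscore), and the function names are made up.

```python
import re

# Approximation of [:alpha:]: word characters minus digits and underscore.
LETTER = re.compile(r"[^\W\d_]")


def mask_letters(text):
    """Replace every letter with "x"; digits, spaces, punctuation stay."""
    return LETTER.sub("x", text)


def mask_everything(text):
    """Replace every character except line breaks with "x"."""
    return re.sub(r".", "x", text)


print(mask_letters("Jane Doe, 42 Main St."))     # xxxx xxx, 42 xxxx xx.
print(mask_everything("Jane Doe, 42 Main St."))  # xxxxxxxxxxxxxxxxxxxxx
```

The preferred variant leaves numbers intact, so private figures such as phone numbers or amounts still need to be checked by hand. The alternative also removes spaces and punctuation, which hides more but changes the text layout more.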

Calc
Follow the instructions provided for Writer on every sheet in the spreadsheet.
 * Make sure to check "All Sheets" in the Find & Replace dialog

Sanitize media
As far as we know, there is no easy way to replace all media in an ODF file with dummy versions.

One possible technique would be to unzip the ODF file on disk and replace all images, audio, and video with dummy files. (We could turn this task into an EasyHack.)
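The unzip-and-replace idea could be sketched roughly as follows. This is an untested-in-the-field illustration, not an existing tool: it assumes embedded media sit under Pictures/ or Media/ inside the package, the function name replace_media is made up, and you must supply your own dummy files (one per file extension) as bytes.

```python
import os
import zipfile

# Assumption: embedded media are stored under these package directories.
MEDIA_PREFIXES = ("Pictures/", "Media/")


def replace_media(src, dst, placeholders):
    """Copy an ODF package from src to dst, swapping media for dummies.

    placeholders maps a lowercase extension (e.g. ".png") to dummy bytes.
    Returns (replaced, skipped): media entries that were swapped, and
    media entries left unchanged because no dummy matched their extension.
    """
    replaced, skipped = [], []
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        names = zin.namelist()
        # ODF packages keep "mimetype" as the first entry, uncompressed.
        if "mimetype" in names:
            zout.writestr("mimetype", zin.read("mimetype"),
                          compress_type=zipfile.ZIP_STORED)
        for name in names:
            if name == "mimetype":
                continue
            data = zin.read(name)
            if name.startswith(MEDIA_PREFIXES) and not name.endswith("/"):
                extension = os.path.splitext(name)[1].lower()
                if extension in placeholders:
                    data = placeholders[extension]
                    replaced.append(name)
                else:
                    skipped.append(name)
            zout.writestr(name, data)
    return replaced, skipped
```

Entries reported in the skipped list still contain the original media, so they need a dummy of their own or manual removal. Because the dummies differ in size and content from the originals, the bug may no longer reproduce; check the result in LibreOffice before attaching it.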

Sanitize formulas
There may be a lot of Math objects (formulas) in an ODF file. Simply replacing everything in the source text of a formula with 'x' would likely produce a rendering error, so some formula keywords have to be left unchanged.

One possible solution is a simplistic Python macro that is distributed as source text. To use it, export the source to plain text, rename the file so that it has a '.py' extension, and place it in the macros directory.
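To illustrate the idea of keeping keywords while masking everything else, here is a small stand-alone sketch. It is not the macro mentioned above: the keyword list is a deliberately short, incomplete sample of Math commands, the names KEEP and sanitize_formula are invented for this example, and tokens starting with % (such as %alpha) are kept on the assumption that they are symbol names.

```python
import re

# Incomplete sample of formula commands to preserve (assumption).
KEEP = frozenset({
    "over", "sum", "int", "lim", "from", "to", "sqrt",
    "left", "right", "sup", "sub", "times", "cdot", "newline",
})

# A run of letters, optionally preceded by "%" (symbol names like %alpha).
WORD = re.compile(r"%?[^\W\d_]+")


def sanitize_formula(source, keep=KEEP):
    """Replace each non-keyword word in formula source text with "x"."""
    def replace_word(match):
        word = match.group(0)
        if word.startswith("%") or word.lower() in keep:
            return word
        return "x"
    return WORD.sub(replace_word, source)


print(sanitize_formula("salary over months + sqrt{bonus}"))
# x over x + sqrt{x}
```

Any command missing from the keyword list is replaced as well, so the list has to be extended to cover the commands that actually occur in your formulas.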

More information

 * Regular and Confidential Attachments

Links

 * Removing metadata
 * Removing date stamps from comments in documents
 * Sanitization at Wikipedia
 * NSA sanitization practices