- Modern Big Data Processing with Hadoop
- V. Naresh Kumar, Prashant Shindgikar
- 2025-04-04 17:12:20
Truncation
Another variant of erasing is truncation, where we cut all the input data down to a uniform size. This is useful when we are sure that the resulting information loss is acceptable in the later stages of the pipeline.
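As a minimal sketch of size-based truncation, the following Python snippet caps every value at a fixed number of characters. The function name `truncate_fixed` and the limit of 5 are hypothetical and chosen only for illustration. The sketch also assumes that "uniform size" means a maximum length, so shorter values pass through unchanged.

```python
def truncate_fixed(value: str, max_len: int) -> str:
    """Return at most the first max_len characters of value.

    Hypothetical helper for illustration; values shorter than
    max_len are returned unchanged (no padding is applied).
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    return value[:max_len]


records = ["alice@localhost.com", "bob@localhost.com", "x"]

# Cap every record at 5 characters; anything beyond that is lost.
truncated = [truncate_fixed(r, 5) for r in records]
print(truncated)
```

Note that a blind cut like this ignores the structure of the data: the second record keeps part of its domain (`bob@l`).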
Truncation can also be intelligent, when we know the structure of the data we are dealing with. Consider this example of email addresses:
| Input | Output | What's truncated |
| --- | --- | --- |
| alice@localhost.com | alice | @localhost.com |
| bob@localhost.com | bob | @localhost.com |
| rob@localhost.com | rob | @localhost.com |
From the preceding examples, we can see that the domain portion of each email address is truncated, because all of the addresses belong to the same domain. This technique saves storage space.
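The email example can be sketched in Python as follows. This is an illustrative sketch rather than a prescribed implementation. The function name `truncate_domain` and its `shared_domain` parameter are hypothetical, and the code assumes the shared domain is known in advance. Addresses from any other domain are left untouched.

```python
def truncate_domain(email: str, shared_domain: str = "localhost.com") -> str:
    """Drop the '@<shared_domain>' suffix from an email address.

    Hypothetical helper for illustration. Addresses that do not end
    with the shared domain are returned unchanged.
    """
    suffix = "@" + shared_domain
    if email.endswith(suffix):
        return email[: -len(suffix)]
    return email


emails = ["alice@localhost.com", "bob@localhost.com", "rob@localhost.com"]

# Keep only the local part of each address.
local_parts = [truncate_domain(e) for e in emails]
print(local_parts)
```

Checking the suffix before cutting keeps the truncation targeted: only the part known to be identical across records is dropped.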