Extracting Phone Numbers With Apache Tika

Kevin Feasel

2017-07-27

Hadoop

Unni Mana knows how to get your digits:

Last time, I had difficulties detecting phone numbers from different types of documents. The challenge was that I had to use different parsers to parse and extract the phone numbers. For example, to extract phone numbers from a Word document, I had to use a library that supports Word. Also, I cannot use the same library or logic to parse a PDF file. Ultimately, I need to maintain different libraries for different document types, which, as you can imagine, can lead to many issues.

It looks like this covers international phone numbers as well. Seems pretty interesting.

