Extracting Phone Numbers With Apache Tika

Kevin Feasel

2017-07-27

Hadoop

Unni Mana knows how to get your digits:

Last time, I had difficulties detecting phone numbers from different types of documents. The challenge was that I had to use different parsers to parse and extract the phone numbers. For example, to extract phone numbers from a Word document, I had to use a library that supports Word. Also, I cannot use the same library or logic to parse a PDF file. Ultimately, I need to maintain different libraries for different document types, which, as you can image, can lead to many issues.

It looks like this covers international phone numbers as well.  Seems pretty interesting.

Related Posts

Five Books For Learning Kafka

Data Flair has a guide to five books to help you learn Apache Kafka: The book “Kafka: The Definitive Guide” is written by engineers from Confluent andLinkedIn who are responsible for developing Kafka. They explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. It contains detailed examples as well. […]

Read More

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams: Now we can query from it and show the aggregate window timestamp alongside the result: ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \ HOST, INVALID_LOGIN_COUNT \ FROM INVALID_USERS_LOGINS_PER_HOST; 1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1 1521646620000 | […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31