“After you have built out your data lake, use it. Ask it questions. You will begin to see patterns where you want to dig deeper. The Hadoop ecosystem doesn’t allow for that digging, and certainly not at a speed that is customer facing. For that, you need some sort of analytical database.” – Eva Donaldson.
I interviewed Eva Donaldson, software engineer and data architect at iContact. The main topic of the interview is her experience using HPE Vertica.
RVZ
Q1. What is the business of iContact?
Eva Donaldson: iContact is a provider of cloud-based email marketing, marketing automation and social media marketing products. We offer expert advice, design services, an award-winning Salesforce email integration and Google Analytics tracking features. We specialize in small and medium-sized businesses and nonprofits in the U.S. and internationally.
Q2. What kind of information are your customers asking for?
Eva Donaldson: Marketing analytics, including but not limited to how customers reached them, interaction with individual messages, and targeting of marketing based on customer identifiers.
Q3. What are the main technical challenges you typically face when performing email marketing for small and medium businesses?
Eva Donaldson: Our technical challenges stem largely from the sheer size and scope of data processing. We need to process multiple data points on each customer interaction, on each customer individually, and on landing page interaction.
Q4. You attempted to build a product on Infobright. Why did you choose Infobright? What was your experience?
Eva Donaldson: We started with Infobright because we were using it for log processing and review. It worked okay for that, since the logs are always referenced by date and the dates arrive in order. For anything but the most basic querying by date, Infobright failed. Tables could not be joined. Selection by any column that was not in order was impossible at the size of data we were processing. For really large datasets, some rows would just not be inserted, without warning or explanation.
Q5. After that, you deployed a solution using HPE Vertica. Why did you choose HPE Vertica? Why didn’t you instead consider another open source solution?
Eva Donaldson: Once we determined that Infobright was not the correct solution, we already knew that we needed an analytical style database. I asked anyone and everyone who was working with true analytics at scale what database backend they were using and if they were happy. Three products came to the forefront: Vertica, Teradata and Oracle. The people using Oracle who were happy were complete Oracle shops. Since we do not use Oracle for anything, this was not the solution for us. We decided to review Vertica, Teradata and Netezza. Of the three, Vertica came out the clear winner for our needs.
Vertica installs on commodity hardware, which meant we could deploy it immediately on servers we already had on hand. Scaling out is horizontal since Vertica clusters natively, which meant it fit exactly with the way we already handled our scaling practices.
After the POC with Vertica’s free version and seeing the speed and accuracy of queries, there was no doubt we had picked the right one for our needs. Continued use and expansion of the cluster have continued to prove that Vertica stands up to everything we throw at it. We have been able to easily put in a new node and to migrate nodes to beefier boxes when we needed to. Performance on queries has been unequaled. We are able to return complex analytical queries in milliseconds.
As to other open source tools, we did consider them. I looked at Greenplum and I don’t remember which other columnar data stores. There are loads of them out there. But they are all limited in one way or another, and most of them are very similar in ability to Infobright. They just don’t scale to what we needed.
The other place people always think of is Hadoop. Hadoop and all the related ecosystem is a great place to put stuff while you are wondering what questions you can ask. It is nice to have Hadoop (Hive, HBase, etc.) as a place to stick EVERYTHING without question. Then from there you can begin to do some very broad analysis to see what you have. But nothing coming out of a basic file system is going to get you the nitty-gritty analysis to answer the real questions in a timely manner. After you have built out your data lake, use it. Ask it questions. You will begin to see patterns where you want to dig deeper. The Hadoop ecosystem doesn’t allow for that digging, and certainly not at a speed that is customer facing. For that, you need some sort of analytical database.
Q6. Can you give us some technical details on how you use HPE Vertica? What are the specific features of HPE Vertica you use and for what?
Eva Donaldson: We have Vertica installed on Ubuntu 12.04 in a three node cluster. We load data via the bulk upload methods available from the JDBC driver. Querying includes many of the advanced analytical functions available in the language as well as standard SQL statements.
We use the Management Console to get insight into query performance, system health, etc. Management Console also provides a tool to suggest and build projections based on queries that have been run in the past. We run the database designer on a fairly regular basis to keep things tuned to how it is actively being used.
We do most of our loading via Pentaho DI and quite a lot of querying from that as well. We also have connectors from Pentaho reports. We have some PHP applications that reach that data as well.
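As an illustration only (this is not from the interview), the kind of analytical function mentioned in this answer can be sketched with a standard SQL window function. The sketch below assumes Python's built-in sqlite3 module as a stand-in for Vertica, and the `opens` table, its columns and its rows are hypothetical. The batched `executemany` call loosely mirrors the idea of loading many rows in one call, and the `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` query computes a running total per campaign.

```python
import sqlite3

# SQLite (in memory) stands in for the analytical database here.
# The table, column names and rows are hypothetical, not iContact's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opens (campaign TEXT, day TEXT, n_opens INTEGER)")

# Load several rows in one batched call instead of row-by-row statements.
conn.executemany(
    "INSERT INTO opens VALUES (?, ?, ?)",
    [
        ("spring", "2016-03-01", 10),
        ("spring", "2016-03-02", 5),
        ("spring", "2016-03-03", 7),
        ("summer", "2016-06-01", 4),
        ("summer", "2016-06-02", 6),
    ],
)

# Window (analytic) function: running total of opens within each campaign.
query = """
SELECT campaign, day, n_opens,
       SUM(n_opens) OVER (PARTITION BY campaign ORDER BY day) AS running_opens
FROM opens
ORDER BY campaign, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The window function syntax is standard SQL, so the query text should carry over with little change, but the loading path against a real cluster would go through the JDBC driver or an ETL tool rather than sqlite3.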
Q7. To query the database, was a standard SQL interface a requirement? Or does it not really matter which query language you use?
Eva Donaldson: Yes, we required a standard SQL interface and availability of a JDBC driver to integrate the database with our other tools and applications.
Q8. Did you perform any benchmark to measure the query performance you obtain with HPE Vertica? If yes, can you tell us how you performed such a benchmark (e.g. what workloads did you use, what kind of queries did you consider, etc.)?
Eva Donaldson: To perform benchmarks we loaded our biggest fact table and its related dimensions. We took our most expensive queries and a handful of “like to have” queries that did not work at all in Infobright and pushed them through Vertica. I no longer have the results of those tests but obviously we were pleased as we chose the product.
Q9. What about updates? Do you have any measures for updates as well?
Eva Donaldson: We do updates regularly with both UPDATE and MERGE statements. MERGE is a very powerful utility. I do not have specific times, but again Vertica performs splendidly. Updates on millions of rows perform accurately and within seconds.
Q10. What is your experience of using various Business Intelligence, Visualization and ETL tools in your environment with HPE Vertica?
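As an illustration only (this is not from the interview), the effect of a merge, where matching rows are updated and new rows are inserted in one statement, can be sketched as follows. SQLite has no MERGE statement, so this sketch swaps in its `INSERT ... ON CONFLICT DO UPDATE` upsert to show the same semantics. The `contacts` table and its rows are hypothetical.

```python
import sqlite3

# SQLite stands in for the analytical database; it lacks MERGE, so an
# upsert (INSERT ... ON CONFLICT DO UPDATE) illustrates the same outcome.
# The table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT PRIMARY KEY, opens INTEGER)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("a@example.com", 3), ("b@example.com", 1)],
)

# Staged changes: one email already exists (update), one is new (insert).
staged = [("b@example.com", 2), ("c@example.com", 5)]
conn.executemany(
    """
    INSERT INTO contacts (email, opens) VALUES (?, ?)
    ON CONFLICT(email) DO UPDATE SET opens = opens + excluded.opens
    """,
    staged,
)

result = conn.execute(
    "SELECT email, opens FROM contacts ORDER BY email"
).fetchall()
print(result)
```

In a database that supports MERGE, the same logic is typically written as a single statement with one branch for matched rows and one for unmatched rows, driven by a join between the target table and a staging table.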
Eva Donaldson: The only BI tools we use are all part of the Pentaho suite. We use Report Designer, Analyzer and Data Integration. Since Pentaho comes with Vertica connectors it was very easy to begin working with it as the backend of our jobs and reports.
Qx. Anything else you wish to add?
Eva Donaldson: If you are looking for an easy to build and maintain, performant analytical database, nothing beats Vertica, hands down. If you are working with enough data that you are wondering how to process it all, having an analytical database that can actually process the data, aggregate it, and answer complicated questions is priceless. We have gained enormous insight into our information because we can ask it questions in so many different ways and because we can get the data back in a performant manner.
———————
Eva Donaldson is a software engineer and data architect with 15+ years of experience building robust applications to both gather data and return it to assist in solving business challenges in marketing and medical environments. Her experience includes both OLAP and OLTP style databases using SQL Server, Oracle, MySQL, Infobright and HP Vertica. In addition, she has architected and developed the data consumption middle and front end tiers in PHP, C#, VB.Net and Java.