Community, GHC15

Infrastructure & Optimization for Data Sci Presentations @ GHC15

Optimization and Big Data: An Application from Car Rental Pricing

-Ezgi C. Eren, Scientist II, PROS, Inc.

Relation between optimization and big data
The speaker started out with an important point: big data comes with its own challenges and its own problems. Big data is useful only when we manage to make sense of all that enormous data.

A rental pricing and distribution system has a forecasting element, and the main goal is to maximize revenue by pricing car rentals at the most optimal and yet competitive price.
The system has different components, like revenue management, competitor prices, and channel strategies.
The complexity of the problem comes from the dataset having millions of constraints and variables, like location, brand, and car type. Their solution to predicting pricing lies in reducing the dimensions: the speaker grouped the variables into categories, like product and channel, and then optimized. It is interesting that they feed competitor prices into their forecaster, and the forecaster's results into a pricing and distribution optimizer, which then gives the price.
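The dimension-reduction idea can be sketched as follows. This is a minimal illustration, not PROS's actual method: the category names, demand numbers, and the `product_category` mapping are all invented for the example. The point is only that grouping raw (location, brand, car type, channel) variables into coarser (product category, channel) cells shrinks the number of variables the optimizer has to handle.

```python
from collections import defaultdict

# Hypothetical example data: each raw pricing variable is keyed by
# (location, brand, car_type, channel). Values are demand counts.
raw_demand = {
    ("DFW", "brandA", "compact", "web"): 120,
    ("IAH", "brandA", "compact", "web"): 80,
    ("DFW", "brandB", "suv", "web"): 60,
    ("IAH", "brandB", "suv", "web"): 40,
}

def product_category(brand, car_type):
    """Map a (brand, car_type) pair to a coarser product bucket (invented rule)."""
    return "economy" if car_type == "compact" else "premium"

def reduce_dimensions(demand):
    """Aggregate raw variables into (product_category, channel) cells."""
    reduced = defaultdict(int)
    for (loc, brand, car_type, channel), qty in demand.items():
        reduced[(product_category(brand, car_type), channel)] += qty
    return dict(reduced)

reduced = reduce_dimensions(raw_demand)
print(reduced)
# → {('economy', 'web'): 200, ('premium', 'web'): 100}
```

Four raw cells collapse into two optimization cells here; at the scale described in the talk, the same grouping step would collapse millions of variables into a tractable problem.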

The speaker also talked about how they tweak their heuristics: they set a demand limit and run the optimization, then increase the demand limit and review the results.
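The demand-limit loop can be sketched like this. The `optimize_revenue` function is a toy stand-in (the real optimizer was not shown in the talk); the sketch only illustrates the pattern of re-running the optimization at increasing demand limits and comparing the results.

```python
def optimize_revenue(demand_limit, base_price=50.0):
    """Toy stand-in for the optimizer: revenue grows with the demand we
    allow, flattening once the (invented) fleet capacity of 100 is hit."""
    served = min(demand_limit, 100)
    return served * base_price

def sweep_demand_limits(limits):
    """Re-run the optimization at each demand limit and record results."""
    return [(limit, optimize_revenue(limit)) for limit in limits]

for limit, revenue in sweep_demand_limits([25, 50, 100, 200]):
    print(f"demand limit {limit:>3} -> revenue {revenue:.0f}")
```

Viewing the sweep makes it easy to spot where raising the limit stops paying off, which is the kind of feedback the speaker described using to tune the heuristic.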

(Upload photo of optimization steps soon)

The speaker ended by pointing out that it's all about adapting optimization techniques to handle all that data.
 

High Availability and High Frequency Big Data Analytics
-Esther Kundin, Bloomberg LP.

The speaker started by explaining the problem space, showing a screenshot of the data a portfolio manager sees daily when managing investments. The data involves a lot of historic data and many critical price points, and it needs to be highly available and accessible: to invest, the portfolio manager needs to analyze the data to determine what to invest in. She also explained why this problem is important by describing how the loss of a single price point can impact a regular client's 401K or college savings.

The data needs to be accessed immediately, as investment decisions are made as soon as possible based on market trends; hence downtime has to be less than the read latency.

The speaker's team used HBase to solve the high-availability problem. HBase does random access on data very well (they are trying to analyze your investments, not the entire market), and it has fault tolerance built in. However, it takes time for HBase to realize a region server is down, and downtime means data is missing, which is a problem because none of your data can be accessed. The team worked with HBase to put cluster replication in place, so backups live on different servers and downtime does not affect data access; the data stays available even if a data center goes down. I thought it was really innovative that the speaker's team rolled out their own fix to ensure availability.

HBase access is pretty fast in general; however, they noticed that average latency goes up as the number of get requests increases. This is because speed is bounded by the slowest region server, and Java garbage collection is to blame. They used data to set heuristics to overcome this: a very large memory footprint, garbage collection synchronized across region servers via coprocessors, and reading from a backup cluster while GC is in progress.
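The read-from-backup idea can be sketched like this. This is an illustrative Python mock, not Bloomberg's actual code or the HBase client API: the `Cluster` class and its GC flag stand in for a replicated cluster and for the GC-pause signal the team coordinated via coprocessors. The row key and price value are invented.

```python
class Cluster:
    """Toy stand-in for a replicated HBase cluster."""

    def __init__(self, name, data, gc_in_progress=False):
        self.name = name
        self.data = data
        # In the talk, GC pauses were coordinated across region servers
        # via coprocessors; here a plain flag stands in for that signal.
        self.gc_in_progress = gc_in_progress

    def get(self, key):
        if self.gc_in_progress:
            raise TimeoutError(f"{self.name} paused for GC")
        return self.data[key]

def replicated_get(key, primary, backup):
    """Try the primary cluster; fall back to the backup while GC pauses it."""
    try:
        return primary.get(key)
    except TimeoutError:
        return backup.get(key)

prices = {"XYZ-2015-10-14": 42.0}  # invented row key and price
primary = Cluster("primary", prices, gc_in_progress=True)
backup = Cluster("backup", dict(prices))
print(replicated_get("XYZ-2015-10-14", primary, backup))  # → 42.0, served by the backup
```

The design choice is that a read never waits out a GC pause: the replicated cluster answers instead, keeping worst-case latency bounded by the fallback path rather than by the slowest region server.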

Questions
Q: HBase is very structured for a certain problem. If you are trying to use the same data for different access patterns, how do you use it?
A: You can't. But you can replicate the data and use separate clusters for different access patterns.

 
SEAD: Infrastructure for Managing Research Data in the Long Tail
-Sandy Payette, Research Investigator, U. Michigan – Ann Arbor
(Raw notes pasted; I'll format them when I get time.)

 
Use case: scientists and researchers.
SEAD is an NSF-funded project under a cyberinfrastructure initiative.

SEAD is focused on the middle space between computational analysis and published results. There is a cycle of improvement across results, publications, and reuse.

It serves the long tail of data, where the complexity is lower but the heterogeneity is higher: for example, individuals collecting bird-science data who keep datasets on a hard drive under their desk. They collaborate to publish, and they need infrastructure to do so.
The smaller scale is what SEAD is trying to serve.

It is a web-based infrastructure.
Example: a Mississippi flood project stores raw data, geo images, and the models used for computation.

SEAD provides managed shared spaces, and published data can be referenced by the journal, creating a cycle of cross-referencing.

SEAD plugins for R have been developed to run computations in R and put the results back.
There is a UX for dataset management and a staging area for publishing and archiving datasets.

It can also be used for historical data and surveys.
The demo UI looks pretty cool, like a WordPress site. I expected to see something cluttered, intense, and unappealing to the eye, but I was pleasantly surprised to see a neat, clean, simple UX. Kudos for that in a free tool for the research community.

Components of the system:

It has a rule-based engine that scores data repositories based on appropriateness; it tracks which repositories are a good fit for a dataset based on its metadata.
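A rule-based scoring engine of this kind might look roughly like the sketch below. The rules, repository names, and weights are all invented for illustration; the talk only described the general idea of scoring repositories against a dataset's metadata.

```python
# Invented rules: each one is (predicate over metadata, repository, weight).
RULES = [
    (lambda m: m.get("domain") == "geoscience", "GeoRepo", 2),
    (lambda m: m.get("size_gb", 0) < 10, "GeneralRepo", 1),
    (lambda m: "survey" in m.get("keywords", []), "SocialSciRepo", 2),
]

def score_repositories(metadata):
    """Return candidate repositories ranked by how well the rules
    match the dataset's metadata (higher score = better fit)."""
    scores = {}
    for predicate, repo, weight in RULES:
        if predicate(metadata):
            scores[repo] = scores.get(repo, 0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

dataset = {"domain": "geoscience", "size_gb": 3, "keywords": ["flood"]}
print(score_repositories(dataset))
# → [('GeoRepo', 2), ('GeneralRepo', 1)]
```

Encoding the "which repository is good for me" judgment as small metadata rules keeps the engine easy to extend as new repositories join the system.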

The benefits are large:
Scientific collaboration
Reuse of research publication data
Discovery of scientific data
Reproduction of scientific results, linking data to literature
Contribution to the emergence of cyberinfrastructure initiatives for open data and transparency

 

#ghc15

Notes on presentations can be found here: notes.