First Impressions of H2O.ai DAI 1.4.2

Given its minuscule size, I loaded the data straight from my local machine. The first thing I noticed about DAI is that the data visualization has improved. It’s not Tableau, but it’s a lot more interactive than in the previous 1.3.1 release. This is a nice step forward.

One new thing is that DAI now determines the type of each variable. This is helpful, though it could probably be improved a bit more. For instance, TICKER is a categorical variable with two values, but that isn’t shown here, even though DAI later detects it as categorical. Similarly, DATE isn’t expressed in terms of a start and stop date. This dataset view isn’t particularly interactive either.

Ideally DAI might guess variable types and then allow a user to override them if the guess is incorrect.

Similarly, I’d like to be able to sort and select values. I wonder about this from H2O.ai’s perspective. Presumably they don’t want to get caught investing in rebuilding BI tools that already exist. At the same time, there’s a lot of value in integrating data exploration and ML into an end-to-end pipeline. Finding the right balance between a first-party build and outsourcing to existing BI tools is probably difficult.

If you click “dataset rows” it shows more info about the dataset:

The visualizations have improved too. The kinds of visualizations available are shown in this screen:

If I click on the scatter plots, I then see a view like this:

Initially I was a little frustrated that I couldn’t select the variables to plot here. If I’d been able to, I probably would have built a candlestick plot with the dates just to see how it looked. That would have given a more fundamental understanding of the data I’ve been looking at.

However, visualization doesn’t seem to be the point here. Instead, these are a couple of graphs to help with data modeling. Judging from the help dialog, the intent seems to be to provide some guidance on how to apply ML to the data set, rather than to give insight into the data itself. That probably makes sense, as the BI space is incredibly crowded and has numerous high-quality tools.

This view is great:

It’s interactive too. You can drag the predictors around and rearrange them. The graph then bumps itself around slightly to preserve distances.

I am really surprised VOLUME isn’t showing as correlated with the label, since high-volume days are indicative of volatility.

The core of the product is focused on “driverless” AI, that is, automating core parts of the ML process, including feature engineering and model building. To do that, we select the data set and hit “predict.”

Now we get to the part where the product really shines. In this case, I picked my target as “LABEL.” The data is also a time series, so I selected DATE as the time column. I then cranked time up to 10 and interpretability down to 1 and hit “launch.”

At this point DAI is chugging away, soaking up the two GPUs on this machine. That’s a big deal. A lot of ML frameworks claim to scale out, but typically it’s either a significant refactor or you hit a hard limit on a handful of cores. The algorithms here appear to be making real use of this powerful hardware.

As the model is building we can view its progress. Here’s an example:

One thing that immediately sticks out is that the ROC looks wrong, very wrong. This is great real-time feedback from the UI.

Our model is perfect. It turns out we’re leaking data from the future. It’s very easy to predict where the market will go on a given day when you already know the close price. So, we’re going to cancel building this model and try something else.

For time series, H2O.ai DAI has an option to avoid this kind of leakage. We’re going to select predicting one day into the future, like so:

This feature is really amazing. I’ve literally spent years writing windowing logic that is very similar in functionality to what’s here. It’s a lot of work. Bugs constantly crop up and cause leakage. Feature engineering on time series is a mess because, in addition to arithmetic combinations, you probably want moving averages over a variety of windows. A lot of that is automated in H2O.ai DAI.
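For a sense of the kind of windowing DAI is automating, here’s a rough pandas sketch of the lagged and rolling features I’d otherwise be writing by hand. The column names (DATE, CLOSE, VOLUME, LABEL) are my guesses at this file’s layout, and the window sizes are arbitrary:

```python
import pandas as pd

# A sketch of the hand-rolled windowing DAI is automating. Column names
# (DATE, CLOSE, VOLUME, LABEL) are guesses at this file's layout.
df = pd.read_csv("output.csv", parse_dates=["DATE"]).sort_values("DATE")

# Lag everything by one day so features only use information available
# before the day being predicted (forgetting this is the easiest way to leak).
for window in (5, 10, 20):
    df[f"CLOSE_ma_{window}"] = df["CLOSE"].shift(1).rolling(window).mean()
    df[f"VOLUME_ma_{window}"] = df["VOLUME"].shift(1).rolling(window).mean()

# A one-day-ahead target: predict tomorrow's label from today's features.
df["LABEL_NEXT"] = df["LABEL"].shift(-1)
```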

This is where things broke down a bit. DAI returned an error:

I’m not really sure what’s going wrong here. We changed two things from the last experiment, setting “how many days to predict” and “after how many days to start predicting” to 1. With that error message in hand, the next obvious thing to do is “DOWNLOAD LOGS.” The logs download gives a zip file with a few different files in it:

There seem to be two relevant logs, one for the experiment and another for the server. The experiment log shows a failure here:

So, the issue seems to be related to cross-validation with four folds interacting poorly with the time series settings. This is further validated by inspecting the server log:

I had some guesses as to what was breaking here (spoiler, they were all wrong):

Anyway… this minimal Python script seems to run cleanly:
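A sketch of what such a minimal script looks like, assuming the h2oai_client package that ships with DAI; the address and credentials below are placeholders:

```python
# Placeholder address and credentials, not my real ones.
from h2oai_client import Client

h2oai = Client(
    address="http://<dai-host>:12345",
    username="<user>",
    password="<password>",
)
print("connected to DAI")
```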

In the real script, my IP and creds were sitting right there in plain text. This box ran unsecured for a week or so while I fiddled with things. I was a little surprised nothing bad happened. I guess the combination of the obscure OCI IP space and the odd port worked to provide the best kind of security: obscurity.

The auth in front of DAI is minimal. It segments out different users and their experiments, allowing the system to be multi-user. However, anyone can access DAI with any creds. Presumably a more full featured auth system with AD/LDAP is on the way. Right now any install on the web is essentially an ML honeypot.

We already have a data set. It’d previously been uploaded, but I can’t figure out how to get the key, so I’m just going to upload it again. I tried this and it blew up:

To work around this, I decided to get the data onto the cloud box itself. I had to install git and then clone the repo. The file we want is now at /home/opc/arbit/volatility/output.csv

Retrying the data load, we now get a “permission denied,” so that’s progress. I’m going with the nuclear option of a chmod 777 and moving the file to /. That seems to run cleanly now. This wasn’t terrible, but it does seem like some things could be improved here to make it easier to load data. Presumably the functions are available, since the functionality is in the UI. The issue is probably my inability to figure out how to invoke that functionality from Python. While the tutorial is great, I’m not sure where the full documentation is.
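My best guess at what the server-side load looks like; whether create_dataset_sync is the right call for a file already sitting on the box is an assumption on my part:

```python
from h2oai_client import Client

h2oai = Client(address="http://<dai-host>:12345",
               username="<user>", password="<password>")

# Assumption: create_dataset_sync reads a path on the DAI box itself,
# as opposed to upload_dataset_sync, which pushes a local file up.
dataset = h2oai.create_dataset_sync("/output.csv")
print(dataset.key)
```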

Next up I attempted to preview an experiment:

That gave a really helpful error about some missing parameters:

Ok, so next up is to figure out how to launch the time series experiment from the Python API.

I managed to run a few experiments using functions like this:
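Reconstructed from memory, it was roughly along these lines; the parameter names are best-effort guesses pieced together from autocomplete rather than the exact signature:

```python
from h2oai_client import Client

h2oai = Client(address="http://<dai-host>:12345",
               username="<user>", password="<password>")
dataset = h2oai.create_dataset_sync("/output.csv")

# Parameter names below are approximations; the preview and start-experiment
# functions spell some of them differently.
experiment = h2oai.start_experiment_sync(
    dataset_key=dataset.key,
    target_col="LABEL",
    is_classification=True,
    accuracy=10,
    time=10,
    interpretability=1,
    time_col="DATE",            # assumed name for the time column argument
    num_prediction_periods=1,   # "how many days to predict" (assumed name)
    num_gap_periods=1,          # "after how many days to start" (assumed name)
)
print(experiment.key)
```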

Poking around in the autocomplete for the function, it became a little more obvious what’s wrong. It looks like the variable names aren’t consistent between the preview and start experiment functions:

That gave me a familiar error:

Of course, now I had another garbage ROC curve. I also thought I knew what the two windowing variables were doing, but now I’m not so sure.

I decided to break the data into train (800 points) and test (200 points) sets. I also did one bit of cleansing, dropping three data points for the TICKER GEF. Rerunning with the previous settings, I got some more interesting feedback.
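The split itself was a few lines of pandas. A sketch, assuming the file layout described above; whether the 800/200 cut was chronological is my assumption:

```python
import pandas as pd

# File and column names are assumptions; the chronological 800/200 cut
# and dropping the stray GEF rows are the steps described above.
df = pd.read_csv("output.csv", parse_dates=["DATE"])
df = df[df["TICKER"] != "GEF"].sort_values("DATE")

train, test = df.iloc[:800], df.iloc[800:]
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```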

It’s really cool that DAI detected that TICKER is useless now. Unfortunately, its inference about VOLUME is not correct. The volume is the number of shares traded each day and likely a great source of information about volatility. It’s also not an ID column, so it was misidentified.

Sure enough, I can see how DAI might have inferred a regime change for 2017–2018. I bet we can work around that by normalizing high, low and close using the daily open.

To normalize, I modified the python script slightly to normalize and then round. I also changed the volume from a per share to a dollar total in millions. I rounded that as well. All this wraps up a bunch of guesses about the data and opinionates this training set in a particular way. It seemed a good way to get unstuck.
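The modification boiled down to something like this; the file and column names are placeholders, and the rounding precision is an arbitrary choice:

```python
import pandas as pd

# Column and file names are assumptions; rounding precision is arbitrary.
df = pd.read_csv("train.csv", parse_dates=["DATE"])

# Normalize high, low and close by the daily open, then round.
for col in ("HIGH", "LOW", "CLOSE"):
    df[col] = (df[col] / df["OPEN"]).round(4)

# Volume as an approximate dollar total in millions (shares * open / 1e6).
df["VOLUME"] = (df["VOLUME"] * df["OPEN"] / 1e6).round(1)

# Once used for scaling, a normalized open is identically 1, so drop it
# (my guess at the "stripped the units from the OHLC" step).
df = df.drop(columns=["OPEN"])
df.to_csv("train_normalized.csv", index=False)
```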

I’ve done (at least) one weird thing in that I’ve stripped the units from the OHLC while at the same time introducing dollar units to the volume. I’m not sure how I feel about that, but it seems worth trying.

One of the really cool things about DAI is that I’m thinking about all this at all, rather than struggling with date encoding or something horrible like that. The prompts about variables and what DAI guessed about them spurred me to think more deeply about the data. That seems a significant advancement in how data science tools help.

After all that, the new data set looked like this:

Of course, I got all excited about rerunning the experiment and then had trouble logging into the UI. SSH also went dark for about 10 minutes. When I got back into the box, the CPU use was low but a number of experiments were chugging away. I’m surprised how much memory they’re soaking up. I’m also surprised that I have a 57.5 GB Java process running. That feels a bit memory-leaky to me, but who knows. I didn’t really dig into the cause here.

These experiments had been running for a few days, so I didn’t feel particularly attached to them. I was a bit curious what would happen to the experiments on a reboot, so I did a sudo reboot.

DAI came back after a few minutes and the box was a lot more responsive. Judging from top and the DAI console, the running experiments entered a cancelled status on reboot. Ideally you’d probably want a pause/resume but that’s not the end of the world.

I’m pretty sure something went horribly wrong with the two 40+ hour experiments. What exactly, I’m not sure. One theory: maybe launching an experiment isn’t a fire-and-forget thing. Instead, when I closed my laptop for over a day, it might have caused the experiment to hang. If so, a bit of refactoring seems in order.

Anyway… with the machine back up, it was time to try again. I got this error again:

To test if this is the issue, I decided to trash all the dates and create new arbitrary ones. Because I’m lazy and this data set is tiny, I used the fill function in Excel (a pandas equivalent is sketched below). I relaunched the experiment like so:
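Here’s that pandas version of the Excel fill, with placeholder file names and an arbitrary start date:

```python
import pandas as pd

# The pandas equivalent of the Excel fill: replace the real dates with an
# arbitrary, strictly sequential daily series. File names are placeholders.
df = pd.read_csv("train_normalized.csv")
df["DATE"] = pd.date_range(start="2000-01-01", periods=len(df), freq="D")
df.to_csv("train_sequential.csv", index=False)
```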

DAI gave me a notification that it was dropping TICKER. That makes sense given that all our data points have a value of GF. After that, it seemed to run. Getting a bit optimistic, I started up a longer-running experiment as well. This one defaulted to Log Loss for the scorer. I’m not sure why, but given that I don’t have strong feelings (other than vaguely remembered dogma that AUC is best), I decided to let the expert inside DAI do its thing.

Back up in the experiment view, it looks like it’s time to go get a coffee (or 10):

Whoever came up with the experiment name generator deserves a round of applause. They’re pronounceable and memorable. “Pudaface” sounds like a friendly little experiment. “Noguneko” is clearly more serious. I had high hopes for it.

Both experiments wrapped up after about 40 minutes. Presumably the time estimate doesn’t take into account concurrent experiments. To be fair, that would be incredibly hard to account for.

The time is also highly unpredictable in that the number of iterations can be lower than expected if the experiment converges quickly. That seems to have happened in both these cases.

As I began poking at these two experiments, it became clear I’d made a mistake. I’d forgotten to include the test dataset. Creating a new experiment with the test dataset and the sequential dates, I saw a menu I’d never seen before:

The visual feedback here is one of the things I really like about DAI. The impact of the sequential data isn’t clear from the API, but it’s quite evident here.

Partway through the run our model is maybe ok:

Right about 97% of the labels are false and 3% are true, so a classifier that always guessed false would get 97% accuracy. We’re seeing 99.62%, so it’s actually doing a bit better. This is encouraging.
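Another way to read “a bit better” is in error rates rather than accuracies, using the numbers above:

```python
# Compare error rates rather than accuracies; numbers are from the text above.
baseline_acc = 0.97    # always guessing "false" on a 97%-false label
model_acc = 0.9962     # accuracy DAI is reporting at this point

# Error shrinks from 3% to 0.38%, roughly an 8x reduction.
print((1 - baseline_acc) / (1 - model_acc))
```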

A little more insight into all this is provided by the precision/recall view in DAI:

This view shows us getting 100% precision. Interestingly, recall is 81.25%. I was a bit surprised by that as past models I’ve built on similar data have skewed heavily the other way.
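For reference, those two numbers fall straight out of the usual confusion-matrix definitions: 100% precision with 81.25% recall means no false positives, but roughly one in five true events missed. A tiny sketch with made-up counts that happen to reproduce those percentages:

```python
def precision_recall(tp, fp, fn):
    # Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN).
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts only, not taken from the experiment: catching 13 of 16
# true events with zero false alarms gives 100% precision and 81.25% recall.
print(precision_recall(tp=13, fp=0, fn=3))
```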

These intermediate results seem particularly impressive given that we tossed a bunch of information out in our normalization. We then chucked out even more information when we forced the dates to be sequential, losing things like the impact of weekends, holidays and seasonality. Though, it seems unlikely we would have learned much about the impact of Christmas on financial markets from three years of data for one symbol…

This does all make me wonder what a custom scoring metric would look like in DAI. Though, I assume if you want to roll up your shirtsleeves and really poke around an ML method, that’s more the domain of the underlying H2O libraries than DAI.

The experiment took about three hours to run. On completion I got this:

It looks like the model peaked about an hour in and took two more hours to satisfy the convergence criteria. I’d set both time and accuracy to 10 in DAI, so it kept running longer than it might have otherwise.

The variable importance view shows that our most useful features were HIGH, the normalized high for the day, and some clusters based on that variable. I set interpretability to 1, so there’s no good explanation of what those features are. If interpretability had been set higher, the features might have been something more obvious, like an arithmetic combination of the original features.

The experiment summary shows just how much work DAI did here. Only one original feature is used, the normalized high. Another 24 engineered features are used, though only the top three seem to matter at all. Here are some of them:

As promised, the model is not particularly interpretable, but it does seem to do reasonably well. This is exactly what we asked for and DAI delivered.

I should probably try this with a more realistic data set, one that includes multiple symbols and a test set. I suspect a lot of these results are going to be a lot more meaningful once I do that.

I’m also pretty sure I’ve got a time leakage issue. That’s probably down to not understanding the time variables well enough.

I learned a lot here.

I’m really excited to see what this product evolves into and continue poking around with it.

I have mixed feelings about the continuing need for some data wrangling and feature engineering. On one hand, it would have been cool to click launch and have a magic model pop out. On the other, it’s nice that some of the insight I’ve gotten into time series financial data over 10+ years of playing with it in both an ML and professional context is still useful. People haven’t been entirely replaced yet!
