1. Introduction - 1.3 Building a model with BigQuery ML - 《Trading with Machine Learning》

Based on the results of your analysis, you can decide which manufacturers will
likely see increases in their share price due to their competitive advantage.
The first thing you need to ask yourself is, what data do I have?
This determines the form of the model you will create.
Is the data a numerical time series, or
is it a qualitative ranking like pass or fail?
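
As a rough illustration of how the kind of label drives the model choice, the sketch below suggests a BigQuery ML `model_type` from a list of label values. The helper name `suggest_model_type` and its rule of thumb are assumptions made for illustration; `linear_reg` and `logistic_reg` are existing BigQuery ML model types, but the course does not prescribe this mapping.

```python
from numbers import Real


def suggest_model_type(labels):
    """Suggest a BigQuery ML model_type for a list of label values.

    Hypothetical rule of thumb: numeric labels -> regression,
    anything else (for example "pass"/"fail") -> classification.
    """
    if not labels:
        raise ValueError("need at least one label value")
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in labels
    )
    return "linear_reg" if all_numeric else "logistic_reg"


print(suggest_model_type([4.48, 5.1, 3.9]))          # numeric benchmark scores
print(suggest_model_type(["pass", "fail", "pass"]))  # qualitative ranking
```

In practice you would also look at how many distinct values the label takes and whether the rows are ordered in time, which this sketch ignores.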

Before investing the time and effort to create a model from your data, you should consider whether it is private data to which you have exclusive access or public data that anyone can use. Private data is much more valuable because it lets you model a unique feature that you can use to predict changes in the company’s performance and price. Public data, on the other hand, may already have been used by other trading groups to create models and trading strategies. It may also be fully incorporated into the current share price and so have no value in predicting future price changes. The only thing worse than having no data is using old data to predict changes in share price or performance. It’s as if you got today’s winning lottery numbers in tomorrow’s mail: it doesn’t do you much good.

Old data has two downsides. One, it may not reflect the current business environment that the company faces. Two, it’s already incorporated into the share price, and so has no trading value.

Models, even purely statistical models, have implicit assumptions. The most common assumption is that the factors and features modeled using historical data will accurately predict future changes in share price. This assumes that the economic and market environment in which shares trade is fairly static. That’s a reasonable assumption over short time horizons, but it fails miserably over long time periods. Your model needs to be retested constantly with fresh data, and your assumptions need to be re-validated based on the current performance of your model.
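
One simple way to act on this is to compare the model’s error on fresh data with the error it had when it was last validated. The sketch below is a minimal illustration under assumptions: the function names, the 1.5x `tolerance`, and all numbers are invented for the example and are not taken from the course.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_retraining(baseline_error, fresh_actual, fresh_predicted, tolerance=1.5):
    """Flag the model when error on fresh data exceeds tolerance x baseline.

    `tolerance` is an arbitrary illustrative threshold, not a recommended value.
    """
    fresh_error = mean_absolute_error(fresh_actual, fresh_predicted)
    return fresh_error > tolerance * baseline_error


# Made-up numbers purely for illustration.
baseline = mean_absolute_error([10.0, 12.0, 11.0], [10.5, 11.5, 11.0])
print(needs_retraining(baseline, [13.0, 9.0, 15.0], [11.0, 11.5, 11.0]))
```

A check like this only tells you that the model has drifted; deciding whether the underlying assumptions still hold remains a judgment call.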

Sometimes you’re lucky enough to get both good-quality endogenous data and private exogenous data.
This allows you to model both the current price trends and potentially predict longer-term changes in share value.
Qualitative variables such as competitiveness and performance rankings can also have a large impact on long-term value.
Now we’ll look at how you can model exogenous performance data on competing chip manufacturers. Based on the results of your analysis, you will be able to decide which manufacturers are likely to see increases in their sales and share price due to their competitive advantage. Let’s say you’re an analyst at a technology fund. Specifically, your expertise is semiconductor manufacturers like Intel and AMD.
The dataset that you have to analyze and model is the performance of CPU chips or processors. This is a real public dataset that we will demo and make available to you in the course references.
This exercise in modeling CPU performance by vendor is taken from Sarah Robinson’s blog post on BigQuery ML. She’s a developer advocate for Google.
So what data are we given?
Let’s say that instead of a public dataset, this is a private dataset you created through your own research.
It has all the characteristics of the CPUs that different vendors are producing. One of your goals should be to compare the performance of these processors against each other to see if there are any clear leaders or laggards among the vendors. If you can build a winning model that forecasts chip performance before a chip hits the market, it might help your fund value chip vendors more accurately. So given all these inputs about the chip specifications, like speed, memory, operating system, cache, and so on, can you predict a benchmark score, say from zero to 100, that you could use for comparison? Here’s the complete code for creating a machine learning model with a dataset using BigQuery ML.

At the top, we give our model a name. Then since we’re predicting a numeric benchmark score, we’ll choose linear regression as our model type.
There happens to be an existing benchmark for CPUs that you can use to try to predict performance for these newer models. One commonly used CPU benchmark, 631.deepsjeng_s, is based on the computer program Deep Sjeng, which was the 2008 World Computer Speed-Chess Champion. This benchmark program stress-tests CPUs by making them play expert-level chess and compute very large decision trees and advanced move ordering. Say, for the sake of argument, that the benchmarking exercise took a week to run on new CPUs. If you were given the specs of these CPUs right as the test was starting, and you built a model that accurately predicted the score of each CPU using ML, then instead of waiting for the actual results a week later you could have valuable insights into which vendors’ CPUs were under- or overperforming. You could then use this relative performance data to project increases and decreases in sales, which could impact vendor share price. Let’s see a quick demo of running this model in BigQuery ML.
>> Hi, I’m Evan. Let’s do a test drive of BigQuery machine learning to see just how easy it is to create machine learning models with just SQL, right where your data already lives inside of BigQuery. For this example use case, we’ll be using a benchmark dataset of CPU characteristics, that is, how fast the chip inside your computer is. Given known information about it, can we predict its performance? Let’s get started. So here’s the dataset inside of BigQuery.

One of the tips and tricks that I like to use is this:
if somebody gives you a query, like the one I just gave you,
you can hold down the Command key or the Windows key on your keyboard,
and it turns all of the tables into buttons that you can click on.
Why is that useful?
Say you were somewhere else in our query and you clicked on this,
it would bring up the schema, the details and the preview of the dataset.
So if you have multiple different tables inside of this query,
you just click on this first one and you can say, what are the columns that I have?
Well, the table is called CPU spec, and it holds the integer speed results
along with some of these processing characteristics.
So, without giving too much away,
these computers are essentially going to be running very fast chess algorithms.
They’re basically trying to optimize a bunch of
moves inside of chess, which could take a human years to do.
But since computers are very good at math,
some processors were better at doing this than others.
So what information do we have for the columns?
Well, we have a vendor, like Intel or AMD,
and we have the model name of the CPU.
You have the megahertz (how fast it is), how many cores it has, how many chips,
all things that are relevant to CPU processing.
If you’re familiar with CPUs, a lot of these might look familiar.
If you’re not, they’re just all kinds of features about which we can basically say,
well, I don’t know whether the number of chips
or the amount of L2 cache in kilobytes is going to be a better feature
for our machine learning model to predict overall performance.
And the good news is you don’t have to know.
Machine learning is not about creating these IF-THEN or CASE WHEN statements.
Throw in all the features that you think might be useful, and
let the model figure out the relationship between all of your input features and
the thing you’re predicting.
So let’s see what I’ll actually be doing the prediction on.
If you scroll all the way down you’ll notice that
there are some benchmarks here.
Let me just make this a little clearer on the screen.
All the way at the bottom you’re going to see that there are columns named _600, _602,
_605, _620.
These are industry-standard benchmarks, essentially different types of
standard algorithms that CPU vendors can run their CPUs against.
It’s a standardized test.
If you’ve ever taken university or school courses, you know the idea:
the same test is run against
different hardware to assess its performance.
And it comes down to one number.
So for example, if you want to see what this data looks like in the preview in
BigQuery, we have all of these inputs: cores, chips, channels
(and again, I’m showing the limits of my understanding of hardware), what operating
system it’s running on, which compiler was used, and who’s sponsoring the test.
And then all of these columns here are how well those computer chips did in the past.
So we have the right answer from the past,
which is your label column in machine learning terms.
So we know that row 1 here, this Intel Xeon Gold 5115,
scored 4.48 on this given test, 631.deepsjeng_s, whatever that means.
Generally, I’m assuming lower is better or maybe higher is better for the spec.
You have to check the spec for that.
But mainly what I want to try to do with machine learning is this:
can we replicate, can we predict, how well one of these chips will do given
columns 1 through 20, or everything except for the test results?
Let’s train the model to try to make this prediction,
not actually running the chip through the test, but predicting its performance: given all
these characteristics, how well do you think it’s going to do?
A good analogy is students studying for a test. If one student
studies five hours a night for a week, and another one just decides,
you know what, I don’t want to study, and only studies one hour or zero hours,
you can start to make predictions of how well they’re going to do on that test.
It’s a similar argument here, except a little bit more complicated in that you have a lot of
different columns that may be interrelated, and
some carry more weight and more importance than others.
But you let machine learning figure that out, okay?
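
To make the study-hours analogy concrete, here is a minimal sketch of a one-feature linear regression fitted by ordinary least squares in plain Python. The data points are invented for illustration, and `fit_line` is a hypothetical helper, not something from the demo; BigQuery ML does the equivalent work for many columns at once.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: nightly study hours and the resulting test score.
hours = [0, 1, 2, 3, 4, 5]
scores = [50, 58, 66, 74, 82, 90]

slope, intercept = fit_line(hours, scores)
predicted = slope * 2.5 + intercept  # a student who studies 2.5 hours a night
print(f"score ~ {slope:.1f} * hours + {intercept:.1f}; 2.5 hours -> {predicted:.1f}")
```

Nobody wrote a rule such as "more than three hours means a pass"; the slope and intercept are learned from the examples, which is the point the analogy is making.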
So to review, we have a lot of input features.
They’re just columns inside of SQL.
And then we have one label: in this particular case we ignore all the rest of these
tests except for the 631.
And basically we see whether, knowing the information that we have on this from
the past, we can predict the test score for similar hardware specs in the future.
And the good news is, not to give away the ending, but
you can actually predict with reasonable accuracy, because there’s a very strong
relationship between some of these processor specs, as you can imagine, and
how well the processor performs in the test.
But again, machine learning is not about you saying, if the cores are greater than
two, and this and that, and it’s Sunday when you run the test,
then it’s going to be a greater-than-five benchmarking result.
You don’t need to do that.
You just feed the machine learning model data, and it comes out with the result.
So you can do a SELECT *.
But again, the faster way to preview data in BigQuery is clicking that preview tab.
How many different processors are we testing?
Well, we have a small number; we’ve got 1,400.
All this data is publicly available under the Fair Use License from SPEC.org.
And they actually provide a description,
if you’re wondering what this benchmark is.
It’s a program that’s based on the world computer speed-chess champion, and
it’s kind of cool.
You have to run a standardized test on something, so
why not pick something that involves a lot of math, like chess, with all those
opening moves to memorize?
All right, so back to BigQuery.
So given the fact that all of these columns are inputs,
how do you create a model to predict?
Well, you can follow along in Sarah Robinson’s blog post;
that’s where I got this entire narrative from as well.
Sarah Robinson, from developer relations at
Google, has an amazing write-up on this.
So this is just my condensed version of her original blog post.
So to create a model inside of BigQuery, the first thing you need
is some training data.
So the training data that we’re going to have is basically,
well, let’s pass in the vendor name, the model name,
the megahertz, the memory, the speed of the memory, how much memory in gigabytes,
the L1, L2, and L3 cache, the operating system, the compiler, the sponsor.
Use whatever you think is right; again, machine learning is iterative.
If you don’t get good model performance from these features (again, features are
columns), you can come back, pick different ones, and
retrain your model as often as you need.
So it’s literally just a select statement and
I’ll show you what this data looks like.
So given all of these features all the way through here,
can we predict this last column, this score?
So we’re going to try to predict that.
And again, we’re going to train the model on the past, saying: hey, rip through all
these examples and try to learn a relationship.
I’m not going to give you any more business logic or rules.
I’m just going to give you the data.
Can you learn the relationship between all the columns, from sponsor and
earlier on in the query, and can you predict this numeric field?
Because we’re predicting a numeric field, that really shortens
the list of model options that we want to use.
So a good model option for predicting a numeric field,
like next month’s sales (here it’s just a float or
a decimal figure), is going to be linear regression.
So when you’re predicting a numeric field, you
might want to start with linear regression.
You might have heard of other models, like deep neural networks and whatnot.
Absolutely, you could do those.
Linear regression happens to be really fast.
So if you get good performance on linear regression inside of BigQuery,
you might be able to say, okay, cool,
can I squeeze out a little bit more performance by using a more sophisticated
model type?
There’s absolutely nothing wrong with starting with linear regression.
So the actual syntax for this, let me zoom in a little bit, looks like this.
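
The statement shown on screen is not captured in this transcript. As a hedged stand-in, the sketch below assembles a BigQuery ML `CREATE MODEL` statement of the general shape described: a model name at the top, `linear_reg` as the model type, and a plain SELECT that supplies the feature columns plus the label. The helper `build_create_model_sql`, the dataset, table, and column names are all hypothetical placeholders, and the `WHERE` filter is an assumption rather than part of the original query.

```python
def build_create_model_sql(model_name, source_table, feature_cols, label_col):
    """Return a BigQuery ML CREATE MODEL statement for linear regression."""
    if not feature_cols:
        raise ValueError("need at least one feature column")
    # Features first, label last, mirroring "predict this last column".
    select_list = ",\n  ".join(list(feature_cols) + [label_col])
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        "OPTIONS(\n"
        "  model_type = 'linear_reg',\n"
        f"  input_label_cols = ['{label_col}']\n"
        ") AS\n"
        "SELECT\n"
        f"  {select_list}\n"
        f"FROM `{source_table}`\n"
        # Assumption: skip rows that have no recorded score for this test.
        f"WHERE {label_col} IS NOT NULL"
    )


# Hypothetical names; substitute the real dataset, table, and columns.
sql = build_create_model_sql(
    model_name="my_dataset.cpu_performance_model",
    source_table="my_dataset.cpu_benchmarks",
    feature_cols=["vendor", "model_name", "mhz", "memory_gb",
                  "l1_cache_kb", "l2_cache_kb", "l3_cache_kb",
                  "os", "compiler", "sponsor"],
    label_col="deepsjeng_score",
)
print(sql)
```

The printed statement can be pasted into the BigQuery query editor once the placeholder names are replaced. Swapping in a more sophisticated model later means changing only the `model_type` option, which is why starting with linear regression costs little.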