Scheduling For Efficient Large-Scale Machine Learning Training

[MUSIC]. Hi, everyone. Thanks for coming. I am Madan Musuvathi from the Research and Software
engineering group, and it gives me utmost pleasure to introduce Jinliang back
to Microsoft Research. He was our intern
couple of years ago, working in Parce on trying to build the first version of
distributed ML training. So Jinliang is a PhD student at CMU working with Eric Xing
and Garth Gibson. He’s an expert on large-scale ML
systems and that’s what his PhD. That’s what he’s been
working during his PhD. Some of the initial work
on parameter server led to– he contributed to ideas that eventually became a
startup called Petuum, the standard database
advisor, and then actually, he’s moved on to do
lots of cool things on dynamic scheduling of
large-scale models. That’s what he’ll be talking
about today. So Jinliang.>>Thank you Madan
for the introduction. Hi. Good morning everybody. Yes at an instance here two years
ago I had really good time. Now I’ve come back and work here. So I’m really glad I had opportunity to speak to you guys and thank you-all
very much for coming. So the title of my talk is scheduling for efficient large-scale
Machine Learning training. As you-all know so scheduling is a classic research topic
in computer systems, but because of workload it’s special important with
Machine Learning training. There are many new opportunities
that can weaken leverage in scheduling to improve the efficiency of this listening computation. So as you probably already observed, over the past couple years, Machine Learning has achieved wide success in many
applicants domains. So machine learning is basically
a set of techniques to extract knowledge or
insights from big data sets. So the extracted insights are summarized as a set of
mathematical equations, which we refer to as
the Machine Model. So Machine Learning training, it’s basically funch parameters of those mathematical equations that fit your observations about problem. So in essence, the early days
of Machine Learning training has always being a hard problem because it’s highly
compositional heavy. Today, when people continue
to pipe Machine Learning into new domains and extent improve
performance of existing problems. We are collecting bigger Datasets and but it’s more complex
than machine models. Because of this, the
computation a challenge for Machine Learning training is
becoming heavier and heavier. So not only Machine
Learning takes long time, because the Machine
Learning models have lot parameters and generally
lot of intermediate results, also consume large memory. Lastly even though distributed competition and probably of computing can help
improve training time, implementing a parallel
distributed program is hard for many Machine Learning
researchers and practitioners. Motivated by this challenge, I have devoted my PhD research to improve the efficiency of
Machine Learning training, by developing practical
Machine Learning systems that are easy to use for Machine Learning researchers
and programmers. So the key idea behind my research is leverage the structural properties of pulling completion to improve
efficiency of the application. There are a couple of challenges
faced by this approach. So first of all, what kind
of structural properties can be leveraged to improve the efficiency of Machine
Learning computation. Second, how generalizable
are those properties. Can they be generalized across different models and
different algorithms, and lastly how can we
leveraged those information without heavily burdening
the application user. By solving these challenges, I have developed two
different systems for Machine Learning training, which I’ll talk to you today. I also worked on quite a bit on TensorFlow to improve it’s memory
finished state during training. So before I go into
details about my projects, I want to clarify that
the scheduling I’m talking about is scheduling
within a Single training job. So basically when you get a hardware resource and
you have a training job, how do you make efficient use of hardware resource to improve the efficiency of a
training computation. This in contrast to cluster scheduling where
you have bunch of jobs, and you want to maximize the
throughput of your compute cluster. So when we think about
Compute Cluster, I think there are three
most precious resources. One is your narrow bandwidth, second pillar computation
and lastly is your memory. So I have done project to explore making better use
of all three of them, to both improve training time and to enable training larger models
without using button hardware. So this will be telling you today. That’s the three projects. So despite that, there are many different machine models and
different learning algorithms. On Machine Learning algorithms
typically take this common form. First, you’re going to take many
passes of your training data, and second within each training pass you’re going to process your
training data in mini-batches, and within mini-batch
you’re going to produce some updates to improve
your model quality. So basically over
time you’re going to observe your model quality
gradually improves, when you compute up mini-batch updates to update
the model parameters. Your training stops when your model quality stops changing
or when it becomes satisfactory. So this is what we call convergence. So this is the iterative convergence nature of
Machine Learning training, is what distinguished
Machine Learning training from classic computer programs. Because Machine Learning training
is basically a search process. So you can think of on theorist. There’s update function that
you want to minimize or maximize by finding a good
set of model parameters. You can think of this update function as a surface or
multidimensional space. When you are processing
mini-batches to produce updates, you’re taking small steps to move your weights along this
multidimensional surface. This training process stops when this little ball is closing
off to the minimum. Intuitively by looking at this
process, you will find that, if there is some error
during your computation, it does not kill your process. It does not mean you
are completely wrong. Because as long as error
is probably bonded you can compensate for this error by
taking a couple of more steps. So this makes a unique trade-off during Machine Learning training to trade a Capetian quality
for computing throughput. So basically, the speed of your training algorithm
depends on two factors. One, is how fast you
can take your step, which is your computing throughput, and second is how good each step is, which is the quality of competition. So a lot of training systems make
a trade-off between the two, to improve their
computing throughput. So one of the important benefit of this trade-off is that it enables an efficient way to parallelize
your machine algorithm, a very simple way to [inaudible]
your machine algorithm, which is referred to
as data parallelism. In data parallelism, what
you do is simply run many or all of your mini-batches
in parallel on the workers. So in the most basic form
of data parallelism, each worker process the mini-batch
of data, produce some updates, and all the other workers synchronize to upload their preferred
updates to a set of servers, which we call primary server, then put the new set of parameter states to begin the
next mini-batch of computation. As you can see it,
during this process, we can greatly improve the computation throughput by
processing mini-batches in parallel. However, this has a negative effect
on your computation quality. So the reason that this has a negative effect in
computation quality is that the data parallelism does not retain sequential semantics of
your training computation. So in sequential training, you process your
mini-batches sequentially, so that later mini-batches
can all observe updates produced from
earlier mini-batches. However, being processed
mini-batches in parallel, in that they are parallel form, the mini-batches that are
processed in parallel will not be able to observe updates from all other
parallel workers [inaudible] synchronization barrier. So what this means is that
the results produced by this parallel process
would be different from the result you get from
the sequential execution. Let’s suppose your model parameters
of your weights starts at W_0, after you process the
first mini-batch, you’re going to produce some
updates called Delta W_1, and by applying Delta W_1 to W_0, you get the first model
parameter state, W_1. Then sequentially
process next mini-batch, you get parameter state W_2. However, if you process second
mini-batch in parallel with W_0, you’re going to get
different set of W_2 compared to the
sequential computation. If you directly apply
this Delta W_2 to W_1, you’re going to get a slightly
different parameter states compared to sequential execution. So this is why data
parallelism does not give you the same result as
a sequential computation. So even though there
is some difference, usually you can compensate for this difference by taking
in some more steps. So that’s why data parallelism, even though it does not
give you exact results as sequential computation, it’s still worth
doing because you can greatly improve your
computation throughput. However, in order to implement
efficiently the parallelism system, there’s one problem that
we cannot overlook, which is the
communication bottleneck. So the model [inaudible] earlier, they have this property
that computation per minute batch is pretty light. However, because models
have many parameters, we communicate updates after
each mini-batch computation, you’re going to have a big
communication overhead. So basically, as you can see here, if you communicate parameter updates after each single
mini-batch computation, you spend a lot of
time on communication. The opportunity with
observed here is that the parameter updates are
element-wise additions. So what this means is that
from each mini-batch, we get a bunch of updates for
each interior parameters, and we have the opportunity
to coalesce updates from different mini-batches to reduce
the volume of computation. So this gives us the idea
to do local buffering, which basically says we can communicate after processing
N mini-batches to allow us to coalesce updates to
reduce the communication volume. Secondly, this gives the
idea of bounded staleness, such that instead of waiting
for communication to finish, we can go into computing the next set of mini-batches while
communication is still in place. This allow us to overlap
communication with the computation, to further reduce the
communication overhead, also additionally allow us to
tolerate transit stragglers. So as you can see,
with those approaches, we can improve
computation throughput. That’s what people commonly do. However, they pay a big price
for the computation consistency. So the problem that
I’m interested in is whether or not we can better leverage hardware
resource to still get the computation throughput
from parallelization without getting so much inconsistency
in parameter states. So next, I’m going to
talk about two projects. [inaudible] communication as
scattering computation to improve on the parameter consistency
during parallel computation. So one of the things we’ve observed is that when we do
data parallel computation, during communication events,
the network could be idle. So one of the straightforward
idea that people use is they can tune the
communication frequency, simply communicate all parameter
updates more frequently, to improve the freshness or improve the consistency
of the parameter states. However, this is not
an optimal approach. If you think about when communicate all updates as one big message, you basically put those
updates into network queue, and those updates cannot be modified when they’re
in the network queue. However, when those updates
are in communication, you might be finishing new mini-batches
and producing new updates. Those updates ideally could be
coalesced with updates in transit to give you higher value from the same amount
of network bandwidth. So with this insight, the idea of my work was
pretty straightforward. We got to do fine-grained
communication. So basically, we’re still going
to do our periodic summarization. But whenever we receive
spare network bandwidth, we’re going to do fine-grained
communication to utilize the spare network bandwidth to improve the consistency
in parameter states. So now, because we are doing
fine-grained communication, we’re not going to communicate to all the parameter
updates in one sitting. So this comes the question of what parameter updates
that we need to communicate when we do this
optimistic communication. So one signal we found to
be useful is communicate parameter updates based on
their relative magnitude. So basically, we’re going to look at the relative magnitude of parameter updates relative
to parameter values, and prioritize
communicating those updates by how large the relative magnitude. So now, I’m going to show you some experiment results to demonstrate the
effectiveness of this idea. So our baseline is basically, we are synchronizing the model
parameters every N time. So basically, this is the parameter system that
I developed called Bosen, that runs our 16 CPU machines. The application here
is topic modeling. So the baseline here is basically, we are just going to synchronize all the model parameters N times
during one local data pass. So the most basic form is
we’re going to synchronize, actually we process all
the local data once, so we’re going to synchronize once after we process all the local data. So as you can see, this
takes a long time to converge to a good model quality. So basically, on the y axis, I’m showing you the
time for this model to converge to a satisfactory
quality. Yes.>>What model is this?>>This is the model called
latent [inaudible] allocation. So this is useful talking
about clustering documents. So algorithm I used here
is the subgroup algorithm. It’s known as SD. But the
idea applies to SD as well. So I purposely want to show that
this works for not only SD, but also something
broad-wise in algorithms.>>So this works for sparse models?>>Yeah. So this works
for sparse models. This also works,
because the early days, the models of interest are
more sparse access models. But we’re also seeing this today, this also works on decimals
like in neural networks. Basically, people prioritize
communication of larger updates, like when you train data pilot
training neural networks, of course, [inaudible] how limited. If they use part of this partition
idea as a compression idea. Basically, there’s a recent paper
called deprecated compression. Basically, they say instead of communicating updates
for all the parameters, we only communicate
updates for top K, one percent, or maybe 0.1
percent of parameters.>>So [inaudible] one person
is based on the magnitude?>>Yeah. Basically, the relative
magnitude of the data changes.>>So is it reasonable to assume some layers converge faster
than the other layers, and because of that, you cannot plug such an
optimization like you can?>>Yes. Actually true. So basically, you can even observe that some parameters
converge faster than others. Basically, over time, let’s say some parameter stops
changing pretty early, some parameter takes a long time.>>Okay. So let’s say one of the
layers converges really quickly. Why do I even bother sending an
update [inaudible] Freeze it, freeze that layer for the
rest of the training.>>Yes. So that would be ideal. So yeah, that’s the idea. But how do you decide a
complete stop changing? So basically, you determine that by comparing the changes in
that layer with other layers.>>Okay. Got it.>>It’s that way to communicate. So basically, when you increase
communication frequency here, you’re going to see the
convergence speed is improved, and you’re going to see basically beyond a Synchronization
Predator Pass is going to see, you ‘re not going to see any
improvements with convergence rate. Now I’m going to show you with faculty communication what kind of Convergence rate we can achieve. So basically with
faculty communication, we’re going to give each Worker
Node, a Bandwidth Budget, and the Work Node is going to perform one Synchronization
Predator Pass, but it’s going to quickly communicate updates whenever received
Bandwidth is spare. So for example when we have 300 Megabits per second,
we can communicate, we can achieve a fast
Convergence rate compared to a synchronize more
is pretty bypass. Give it a higher Bandwidth Budget, will converge even faster. What I want to show you here, is that even though the
convergence time on comparing to synchronize
all parameters, and frankly the
communication is similar, however we communicate less
Bytes per Data paths to achieve the same effect because
we do find communication, each Byte we sent carries
plenty more information. So far I’m showing
you the partition is based not randomly
picking up its descent. If we pick updates based
on relative magnitude, we can see even faster convergence
under same monthly budget. You may use only 300
Megabits per second, we can achieve roughly
the same convergence rate as previously that uses
640 Megabits per second, and we converge, and we give
it a higher Bandwidth budget with relative magnitude
prioritization, we converge even faster. So this is basically about.>>How did you understand
your distribute system? How do you decide consensus
on any point time, which parameters actually you want to do an overview Software
instance you want to Synchronize?>>So this statistic is the
Parameter Server Model. Thanks. So far I have taught. Yes.>>So also your estimate like one gigantic Bandwidth between the Parameter server and the workers, so a network is more complicated, how much does that contribute
to complication of that?>>So right now, the scheme I was talking about is basically give each Node a
Bandwidth budget based on, this Node can use
this much Bandwidth. The network is complicated because the machine you have for example works
with one internet card, but network does not give you end-to-end Bandwidth
Megabits per second. So I agree with you Thaloc, finding the transfusion
Node is a hard problem, so I don’t have a really
good solution for that, it depends on architecture.>>So with this the
Bandwidth budget year look like on a Parameter server? Is this the shorted
parameter servers?>>So it’s architecture of the system is humid that
this the Server Nodes. The Sub-process and
the Work processes are located on the
same physical machine. So basically this Bandwidth
budget will be shared by the Sub-process and the Work process.>>The Bandwidth number
that you’re showing here is meaning the worker?>>[inaudible] both. Basically,
the Server plus Worker they do not use more than
640 Megabytes per second. On the silver side, there is limiting the
completion of fresh parameters. So far, I have talked
about scheduling network communication to improve the consistency in
all the parameters. I want to talk about scheduling computation to improve
consistency in parameter values. Before talking about
this, I want to take a step back to look at how we
[inaudible] a computation. Into that poly-therm, when we
[inaudible] the computation, we simply take random subset of mini-batches and
run them in parallel. Can we think of better
width pallets computation such that the [inaudible] with actually preserve the synchronous semantic of the sequential algorithm? So while pins he would observe is
that in some [inaudible] models, the parameters are spatially
accessed which means we process a mini batch of data
are being processed data sample. It does not access all
the model parameters. This gives you the
opportunity to find it, I suppose that in accessing
parameter I wrote them in parallel. So for some models this property
is easier to leverage because in some models the parameters
to access based on some data sample attributes. Here I’m assuming in
your application which referred called matrix factorization, and this is something like this modal [inaudible]
recommendation systems. So here the data sample
obviously user ratings to items. Basically, each record contains
that user ID and itemID, and the reading that these are
reading that user give to the item. The parameters that we want
to learn are basically latent vectors for the
users and for the items. When we use SGD to
let those parameters, we use SGD to learn those
parameters on the private access pattern for the vectors. Basically, for each
record [inaudible] axis user vector basically
userID, item vector is an itemID. It’s basically for each directorate
the parameters accessed, depends on the data attribute
values for these two attributes. More formally we state that this is the property such that there exists key fields in your data attributes whenever two data samples at
different in oldest key attributes, they will not access
the same parameters. If you find this property
holds true fabrication, this gives an easy way to
petition the training data to find palette that does not incur
conflicting parameter accesses. This property is true for
metropolization although truth of topic modeling grid implicit [inaudible] and so on
other applications. So when we find this property
holds true for those applications. For example in metropolization this allow us to partition
the data-set but this fields on to eliminate
conflicting parameter axis. For example, for the matrix
[inaudible] application, and we can organize the
trend data into a 2D matrix, with userIDs or itemIDs
as the matrix indices, and we can partition this matrix
along these two dimensions, such that different partitions do
not access the same parameters. For example, here the blocks
of the same color but not access the same parameters processing with not
accessing parameters. So basically this gives
us a way to penalize training algorithm without
sacrificing the sequential semantic. So this is actually a special case of automatic parallelizing compilers. So this [inaudible] special case of automatic parallelizing compilers. However, to actually leverage this property in competition
and there are some challenges. The first challenge is
that those properties only [inaudible] to some
models not other models. So if the user wants to use this, user will have to look at
the training algorithm, decide whether or not
this property holds true and not hold true we can
flip back to the poly-therm. But there’s [inaudible] for user to decide what kind of
politician he wants to do. Second, even when this
property holds true, implementing attributed
competition using this computation would not be trivial because user would need to figure out that the
computation dependence patent, figure out how to
partition your data-set, and figure out for example
how to quantitate workers to efficiently compute the schedule because you need some scheduling to assign [inaudible] works
on to your pilot workers. Facing those challenges, it inspires me to design
Automatic Parallelization system, that will ideally automatically
locate your training program and based on memory access
passion how the access or the impact competition act
as some other parameters, to decide how to place
the training algorithm. So basically, the system I
implemented is called Orion. So basically, Orion
provides abstraction of your cluster as a single
thread and a huge memory. So basically, from application
programs point view, it’s only seen as single
thread with huge memory. As the big memory that human memory, is abstracted away, It’s abstracted
as a multi-dimensional arrays. So for example, you can use this multi- dimensional
array to store more the parameters
and slower datasets. Those multi-dimensional arrays would be automatically
partitioned by the system, based on your competition
and characteristics. The system additionally
provides a product for user to pilots a single
serial training for loop, across the distributed
cluster of machines. The parallelization, will preserve the synchronous
semantics when possible, and otherwise will fall
back to the parallelizim, then user gives you the permission. So now I’m going to show you
some experiments results comparing Orion with other systems. One comparison I’m
showing is compared with Bosen which is the
system I showed before. Supposing, it’s using
scheduled communication to improve the convergence rate for this same application which is copy modeling Latent Dirichlet allocation. So here on x axis is time, and the y axis is a log likelihood which basically
it for this application. We want to maximize the training
log likelihood to get good model. When we use Orion, we
can see Orion gives you a faster convergence
of this application, and that the speed is
not that much different, but still like we get
improvement in convergence rate. What’s most significant is
about network bandwidth use. So in Bosen in order to
achieve that curve industry, you had to excessive
network communication, so that Greek curve is how
much network bandwidth you use during the training process. As you can see Bosen’s roughly using near about 1500 megabits per second, but Orion is using five times
less than that in average. Also in order to implement the
Bosen Magnitude Hippolythem, you had not implement thousands
lives of C++ program. However, in Orion, you
only need to implement a few 100 lines of Julia which
is a scripting language, provides you a nice
syntax like Python. So also I’m also showing comparison
between Orion and TensorFlow, this is the STDR with them
for Matrix Factorization. So here, we’re showing
convergence or time, and the y-axis is the training loss. So as you can see TensorFlow
converges much slower, compared to Orion implementations. This is running on a single
machine with 32 virtual CPUs. So TensorFlow runs
slower for two reasons, one is that TensorFlow is
not highly optimized for sparse competition like
Matrix Factorization has, so basically each of
that pass TensorFlow takes about twice as
much time as Orion, and second because TensorFlow
is using Tytherpalathum parallel competition
TensorFlow suffers a slower convergence
rate compared to Orion. So far, I have talked
about two projects where, I scale on network communication
and computation to improved consistency
in prompted states. So far we have focused on improving computation
across mini-batches. There’s a trend in
machine-learning that the mini-batch computation is
becoming more and more complex. A good example of this
is Deep Neural Networks whereas deep neural networks has brought different characteristics, compared to the models that could be interesting in the early days. In different neural networks, we see much heavier
computation per mini-batch, and there is a dense
parameter access from any batch because it typically
acts all the parameters. Lastly, because these two properties, when we paralyze different
Deep Neural network using their Pylotheme, people typically do synchronization
once per mini-batch. Clever, because the complex
computation per mini-batch, this gives us more
opportunity to schedule within a single mini-batch to improve the efficiency
of deep neural networks. So basically, a deep neural network, you can think of it as
a cascade of functions. Each function has its
own set parameters, and takes in the output from
period function as its input. So basically, the
intermediate states or the intermediate results from
those functions or layers, to also refer to the
functions of layers. So basically, in order to
learn those parameters, a common algorithm that people
use is called Back Propagation. The idea of Back
propagation is that you’re going to start from input data, it goes through all the
layers in the neural network, to produce a prediction of the input. Then you compare your prediction
with the actual label, to get the signals that you can
Back prop through all the layers, to compute gradients or updates
for the model parameters. One thing we can observe
in this process, is that not all of the model
parameters are used at the same time, and not all updates are
generated at the same time. This gives opportunity to schedule the communication of
the parameters updates, so that we can overlap communication even within
a single mini-batch. Back in 2014, a bunch of
students in OLA and myself, we worked on putting cafe onto
Bosen to do data polar training, for deep neural networks. One trick we had which we referred to as with Free Back Propagation, was to overlap backward computation with with narrow communication. So basically, as I said, when you train a deep neural networks during single each single mini-batch, you first perform the Forward pass then you perform
the Backward pass. The idea with Free Back
Propagation is pretty simple. [inaudible] says once you
produce updates for each layer, immediately send them of
the communicating network, such that you can
overlap Back Propagation with your update communication. In the ideal case, the computation only has to be
idle for the communication of the last layer of your
parameter updates. However, things are
not always so ideal, and where network
bandwidth is limited, the communication of your last layer could delay the communication
of your first layer. So when we look closely
at this process, during the backward
process you’re computing parameter updates for
the top layers first, then you gradually compute parameter
updates for the lower layers. However, in the Forward pass you’re going to need the parameters
for the first layer first, then gradually you’re going to need parameters for the later layers. This means, if the parameter updates communication for that
later layers takes a long time because of
limited network bandwidth, it could delay the communication of the updates of the first layer. So this gives idea of prioritizing communication based
on when the value is needed, which we referred to as
Priority-Based Parameter Propagation. So you still going
to do a forward pass and backward pass as before, however, when you generate parameter updates you’re going to
find when you’re communicating. For example, like in this
case when we generate proper updates for layer three, we’re not going to send
all of them at once, we’re going to send
them in small chunks. So then, when we generate
updates for layer two, we are going to stop sending updates for layer three and we’ll start sending
updates for layer two. This basically, doing a
forward communication, give us an opportunity to schedule communication
and based on the values when the
values are needed. Then finally, when we
get to layer one we can immediately start computing
parameter updates for layer one. Then when parameter updates, after parameter updates
for layer one is received, the Forward communication can begin, then we can overlap computing upper layers with the communication of the rest
of the parameter updates. This basically gives
us more opportunity to overlap communication with
the mini-batch computation. So I’m just going to show
you a simple experiment on how effective this is. Basically, we implemented
this idea in MXNet. MXNet is one of the deep learning
frameworks that people use. For data parts when the MXNet already implements with
Free Back Propagation. Basically, the current curve
shows you the anullah MXNet under the Throughput of ResNet-50 using
different network bandwidth. Basically, as you can see, when we increase network
bandwidth ResNet is going to get higher to higher Throughput. Now, the purple curve P3 shows you on the result we get after we
implement our optimization. What you observe is that the network parameters
is limited for example, like when network parameters is
only four gigabits per second, we see about 30-40
percent improvement in the computing throughput. Yes.>>How did you run this, how did you limit the bandwidth
of the networks?>>So basically, we give each
nodes a bandwidth budget, basically sort all the
bandwidth on each node. So to 4k bits per second,
5k bits per second. There’s Linux, I think it’s
called traffic controllers.>>Okay, got it. Right.>>Yes.>>So can you go back
to the previous slide?>>Yeah.>>So when you send a data
for L2, 2, all right.>>Yeah.>>If that gets delayed, does that mean the
forward pass for L2 gets delayed because it doesn’t
have all the data?>>Yes. It will get delayed.>>So if that happens, then you’re going to
still have a gap there?>>Yeah.>>Do you find that in practice, this type of scenario happens?>>I don’t think we specifically
meant how much really we’re getting from because
the L2 is delayed. Before I primary focus on, I don’t think we actually
measured how much these currently as they happen because this would depends on your network
architecture as well. It depends on your model, how much time you spend on computing a first layer or timespan
communicating that. So I don’t think we have
specific number of how much that one happens. Yes.>>What’s the granular
[inaudible] with which you split each layer?>>So this after a tuning parameter. I think when we do that, each chunk has about tens of
thousands of floating point numbers. So, yes, at least we are tens to hundreds of
kilobytes in each chunk.>>Does that still matter at all? I can’t really listen to
let’s say, the TCP practices.>>So if what we find is that because the software for sending each
message has some overhead, if we’re sending messages,
it become too small. The software becomes kind
of significant competitor. So, basically, we’re doing a lot
[inaudible] present each packet. So we cannot make it too small. Plague is to, basically, I think to test thousands or
200,000 for employment efforts. I think we think it’s
strange, it’s fun. Basically, so far,
I have talked about improving the competence
speed, but however, besides competition speed, there’s another problem that’s important for tuning larger models
which is the memory. So next I’m going to show you
a project that will improve memory efficiency to
allow us to bring a 10 times larger models
on the same hardware.>>So indeed you try it on some of the larger models like BERT and things like that.>>So actually BERT
is not the more that has [inaudible] lot of parameters. It has lot of parameters,
but you have to try with modal staff even
more parameters and BERT. To the model I tried
called Mixture Experts. This model, you can change
them perhaps the entire model. So what is more accurate one, I can run on a single GPU, that’s about 2.5 billion parameters. BERT has about 100
million parameters. So there’s some runtime overhead with this approach we start talking
about during the talk.>>You didn’t show the
class for all of those, you weren’t sure for less than 15?>>That was the previous one.>>So you didn’t really call
it utterances 19 paper, had evaluated on larger models? So this paper, right? This paper was not evaluated
on NeuroNet or NeuroNets. This was the previous sparse models.>>What of this one?>>This one, let me show
you the benchmarks we had.>>Sorry. System 19 paper.
That was the 19 paper.>>Yeah, this paper. We evaluated on our
ResNet on, basically, a bunch of [inaudible]
ResNet inception also valid on Recurrent
Neural Network. I forgot exactly what that one is, I think I forgot, I’d say, Recurrent Neural
Network or ResNet.>>The amount of benefit you
get from this is depends on how much computation you do per
parameter on your model, right?>>Yeah.>>How does that range of computation
changes for your benchmark? Do you have like bear
high to very low?>>So you mean for this moment or?>>No. For the previous right there, yeah, this paper. This technique.>>This technique you mean
like how much many civic get, it depends on number of
parameters in the model.>>The ratio of the computation per parameter of their model, right?>>I don’t have that.>>Good luck.>>So to motivate this problem, first, people really need logic
models to get good performance. So this is a paper published
by Google two years ago. This paper is called Proposing a
New Layer called Mixture Experts. So you don’t have to know the
details of how this layer works. However, the idea you need
to know is that, basically, this allows you to increase
number of parameters in the model by a large factor so
that you can get good performance. So on x-axis, I’m showing the
number of parameters in the model. So this application
is language modeling. On the y-axis, I’m
showing test perplexity. The test perplexity, this is a
little bit bigger on the model is. So as you can see, we increase the number of
parameters in this model, you’re going to get better test
plexi or a better model quality. However, the largest model costs
about 130 peeling parameters, and these requires
about 120 GPUs to run. So however, even though people
decides big and bigger models, the GPU memory is highly
limited, highly expensive. So here I’m showing
you a comparison of DRAM price with GPU price in
terms of megabytes per dollar. So as you can see over
the last 20 years, DRAM price has been decreasing. However, the GPU price is highly
expensive compared to DRAM price. So remember this a log-log scale. So you fetch leaf was
Server server side GPUs, almost three years more magnitude, more expensive compared
to DRAM price. So not only GPUs are expensive, GPU memory is high limited. With largest GPU, you can get
possible 32 gigabytes memory. However, this GPU compared
to it’s 16-gigabit version, it’s about $1,400 more expensive. This means, you’re paying
about one cent extra. So this means, you are paying about
8.5 cents per extra megabytes. This also allow more expensive
than if you would just buy DRAM. Remember, you’re getting
this memory without getting any extra computation. So motivated by this challenges.>>Why is it that we see such
expensive for [inaudible]>>So I think there are two reasons. One reason is that,
there’s a trade off. One is you’d want
really large memory, you will not be able to
sustain a high bandwidth, or if you want high bandwidth, the capacity of the America
cannot go really big. That’s a Techno talent, but also, I think because Arabia
is dominating this. So, basically, whatever it says, that’s the price, there’s no
competitor on this phase.>>So yes, specifically
motivated by this challenge, there has already a lot of work on reducing memory consumption or
making memory more efficient. So there are mainly two
category of techniques. One is called Gradient Checkpointing, basically trying to leverage recomputation to reduce
memory footprint. The other is called Memory Swapping, basically trying to leverage chip house memory to reduce
memory consumption on GPU. So basically, the high-level idea of Gradient Checkpointing is based
on backpropagation process. So during backpropagation, you
will need the intermediate results from the forward pass to compute the gradients for the
model parameters. So what this means is that we’ve reached the last layer
of your neural network, you have the cache, or you have to store the intermediate results from
all the previous layers, so you can backprop and
compute their gradients. This means if your
network has n layers, you have a ON memory costs stored
on the intermediate results. So the idea of back gradient
check pointing is basically it says instead of store all
the intermediate results, we’re going to store
a few check points, so that we can recompute on the outer layers and get
results when they are needed. So ideally, if your
network has n layers, you can partition network
into n partitions, and each partition has,
sorry, not n partition, square root of n partitions, each partition has a
square root of n notes, so you can reduce your memory
overhead from n to square root of n. So another idea is
called Memory Swapping. So the idea says
basically, for example, when each layer the intermediate
results after they’re generated, they are needed for
computing in the next layer, but they’re also needed during
the backware propagation paths. So basically, for this
long dependencies, instead of caching that
value in GPU memory, you can temporarily put them
onto cheaper host memory and slap them back in
before they’re used. This is called memory swapping. So those ideas work pretty
well for neural networks, and that’s linear
because linear network make it easier for you
to decide which nodes to checkpoint and which node
to swap and when to swap them. However, they are more and more
neural network architectures that’s becoming non-linear. In this case, those techniques will
have a hard time to work well.>>What do you mean by non-linear?>>So non-linear, I mean for example, if we think of actually that’s
next time I’m talking about. So you have a layer that
has a large [inaudible] , a lot of [inaudible]. So you don’t have to
compute this one by one. So this other example is
like COD mixture experts, that’s what I mentioned earlier. So this is a layer in
your neural network. So what happens is in this layer is that there are
many parameters per experts. This number never experts is hyperparameter you
can tune for example, you can change these thousands
or even hundreds of thousands. Those minibatch comes in, different data samples that minibatch can be handled
by different experts, so experts are going
to run in parallel. So this is a large
final architecture. You don’t compute the
experts like one-by-one. So that’s an example of
a non-linear structure. So in this case, if you compute other layers in parallel you’re going to have
a large memory overhead. So basically for the goal
might work is to come up with a memory efficient
approach to reduce memory consumption for both
linear and non-linear graphs. We want to implement and evaluate this on metroframework
like TensorFlow, so that we can evaluate it for different models for across the
different large set benchmarks. We want to make sure that to
take advantage of architecture, the application does not
need to make any changes. So there’s already been
a bunch of work to reduce memory consumption for
TensorFlow from places of work, do gradient check pointing, do memory swapping, and
as I mentioned they have limitations because
typically there’re limited to well for linear graphs. There’s also work for doing memory swapping for a special
operation called while loop. So of course this
technique only applies if your model uses this
operation while loop. So the first idea in our technique is to treat parallelism
for memory consumption. So as I mentioned, the layers could have
large [inaudible] like how TensorFlow schedules computation for this architecture is basically, TensorFlow is doing a
Breath-first traversal of your graph structure. So basically, TensorFlow is going
to start from a source node, and often the source node on
the finished computation, TensorFlow will look at all the
other successors of that node and schedule them by putting
them into a thread pool. So as you can see now we have
four nodes to run in parallel. This gives us the memory
consumption of five operations. Then basically, this process proceed, I’m going to schedule other operations onto the
graphs finished executing. Basically, the TensorFlow approach by doing Breadth-first traversal, the graph gives us
a high parallelism, however this also gave us
a high memory consumption. One simple idea to reduce
memory consumption during this completion process is that we can linearize
the competition graph. Basically, we can sort all the nodes in a topological sorted order, and run the operations one by one. So the idea is basically, we find a topological sorted order of all the nodes by
running one by one, and this gives us a peak memory
consumption of four operations. So the idea we propose, is basically something in the middle. So basically, we
proposed to partition the competition graph
into smaller partitions, and we find the topological sorted
order among those partitions.>>This is the forward pass or
both forward and backward pass?>>So this is the forward pass, but I’m thinking a general
competition graph.>>So in- When you are doing
training, you’re going to have a backward pass in which you still have the same problem of keeping the data around to
use in the backward problem.>>Right. So that’s true. Yes, that’s true. That’s the second
idea I’m going to talk about.>>Okay.>>So basically, you can still use the swapping idea to
assess that problem.>>Right.>>So yes, I agree with you. So specifically, for this part, we are going to be able to explore
parallelism within a partition. Then, we’re going to
[inaudible] among partitions to constrain
the memory consumption. So for the problem that I mentioned, so for a partition, it’s result could be used
in a nearby partition, but also could be used in a
partition that’s really far away. So basically, when the results being used in a partition
that’s really far away, we’re going to still swap
them out to host memory, and stamp them in before they are needed to control the memory usage. Because we find Polygon Sorting Order among the acute
partitions one-by-one, this gives us a easy
time to decide when to swap things out and when
to swap them back in. To give you idea of the
effectiveness of this technique, we’ve drawn transformer
model and we use a dimension experts as the Feed
Forward layer in that model, and we are able to reduce
memory consumption for that particular model from my
9.5 gigabyte to 6.8 gigabytes. The last idea we have is particularly for models that have large
unprimed parameters. For example, in the
transformer that I just showed you that’s about 800
million parameters. So the idea is that the intensive flow if you implement them all attends to
look in the store, or the parameters and constants as persistent testers in GPU memory. So in GPU memory. So what we propose is that we
want to store this variables in a host memory and bring them back to GPU memory when they are
needed for the competition. So in TensorFlow, we need
tensor value across devices, TensorFlow basically going to insert a pair of center receive operation to communicate that tensor variable
from CPU memory to GPU memory. So in this case, for example, if that variable is going
to be needed both in the forward pass and
in the backward pass. So TensorFlow on insert one parison the receive to communicate
that value from CPU to GPU. What this means is that once
the GPU receives that tensor, GPU has two cached that tensor value in GPU
memory for a really long time. So however, but basically what we
propose is to only basically to insert sender-receiver pairs for this variable when they are needed. So basically here that need and
two places we’re going to insert two centricity pairs to communicate
this variable from CPU to GPU. This allow us to reduce
the memory consumption for models to help bottom parameters. For example transformer
with Mixer experts. If you reduced that
memory consumption from 6.8 gigabytes to
roughly 3.3 gigabytes. So we implemented this
technique in TensorFlow core, submit a budget TensorFlow
core and there’s no change in the
TensorFlow Python API. So applications can leverage this technique with no change
in the application code. So we tested this on
machines that have media techniques GPU
with 32 virtual cores and 64 kilobytes CPU memory. So our benchmark consists
of five different models, they use different architectures. For example we have a transformer
which used attention. We have a transformer
with Mitchell experts, and we have ResNet which is
convolution neural network, we have GANs, we have
statically unrolled ions. First of all that, I want
to draw your attention is mutual this transformer
with mutual experts, which I’ve talked about earlier. So basically on the y axis, I’m showing you the peak
memory consumption of the model running on different systems with
different optimizations. The purple bar shows you
execution of Vanilla TensorFlow, and the green bar shows the execution with like
partitioning plus swapping, and the blue bar shows you, additionally we have occlusion with variable placement optimization. So as you can see for a
transformer with Mitchell experts, we can reduce memory
consumption from about 9.5 gigabytes to 3.3 gigabytes
for pigment consumption. So this is a model that
have lot parameters, about 800 million parameters. For other models, they don’t
have too many parameters. The second largest model, for resonant I think has about
60 million parameters. So next result I’m going
to draw you to is GANs. For GANs we achieved
largest memory saving. We’ve used memory consumption from about 11 gigabytes to
about 1.4 gigabytes, roughly 87 percent memory reduction. On average, we can achieve mimic consumption
about 70 or 70 percent. So this optimization brings
some overheads for run-time. Because one is we’re restricting
how much Python we can use. Second, we incur additional
GPU CPU on data communication. The largest overhead comes for the model that has
a lot of parameters. Explicitly the backs
here is that runtime overhead compared to running
on Vanilla TensorFlow.>>This is overhead for
throughput to convergence.>>Right. So this is overhead
in terms of throughput because our question does not change
the semantics of computation. You still get the same
result running on them. So this is basically the
time to complete one. Yes.>>You should finish
the third section. We should probably wait. Perfect.>>All right.>>Okay, I still have
a few more minutes.>>My question was for idea to which you are showing where you’re
swapping things up back and forth. I think I missed the part
[inaudible] increment media.>>Yeah.>>Is it just as similar to that
or there’s some data over there?>>So it’s similar. The difference is that reading and the more than
neural architecture. So I think more than your
network architecture as a linear architectural
layer by layer. So what the part is different is that we partition the linear network. So because generally the graph is not linear sequence of operations.>>Is that statement true? You said that in general
the graphs that people used for parallelism
they are not linear.>>That’s not exactly true. So the modal architecture, a
lot of them are linear. For example if you look at ResNet, the linear sequence of layers, someone knock here
just not linear for example Mitchell
experts, I mentioned. But if you look at the computer
graph, TensorFlow users, is not a linear
sequence of operations, because each layer consists
of many operations, they have different
fan-outs each layer. So if you look at
that component graphs specifically it’s not a linear graph. But if you look on you catching
a half on high level it will be a linear set of operations.>>I can’t vouch for that for you. You can’t your own except
Graph Generator for [inaudible] even though it used 24 days but it was all
just bushy and spread out. So one more question. My cheek has this month
budget of their memory, can your algorithm optimize
it such that it uses. How much memory is
the maximum amount. That’s a very good question, that’s something I’ve
planned to do in the future but right
now we don’t have that. So I can talk about that
towards the end of the talk. So that’s basically, as I was
talking about the run time overhead. So basically for the models that
have flood primaries we have about 3.4 times runtime overhead. The good case for example
for overheads are about 30 percent overheads to achieve about 40
percent memory saving. So in average the overhead
is about 2.22 times. However, if you exclude the highly
overhead like high parameter, the Mitchell experts with
high number parameters, of overhead we can achieve
with 55 percent overhead. You can achieve about 60
percent of memory reduction. So one of the direct benefit of this optimization is that it
allows us to run bigger models. For example, for the Mitchell
experts what I mentioned earlier, on Vanilla TensorFlow you can run basically four experts per layer. This gives you about 0.24 pillars
parameters on a single GPU, and if we use our
optimizations you can skip the number parameters per
modeled by roughly 10 times. I wish I could run much
larger models than this. Often it runs deeper models. For example for ResNet,
skilled ResNets up to about almost 2,000 layers. Competitor [inaudible]
TensorFlow, you can draw above 500 layers in resonate. So this also works in
the distributed setting. So we use library
called Mesh TensorFlow, this gives a way to do more of the part of them
across different GPUs. By using Mesh TensorFlow across
four different machines, TensorFlow outlaws you to run like Mixture Experts of about
120 experts per layer. This is, of course, four machines
with four different GPUs. So I want to mention that
the experts that used here, each expert is not the
same as either used here. Basically, these experts
are the smaller. So with TensorFlow, we may
secure these to 16 machines, we don’t get a linear scaling
in terms of parameters. Basically, we increase number
of machines by four times with only increased primary
size by almost twice. So only use average
system, TensorFlowMem, on the same four machines, we can scale number parameters
totaling to six billion parameters. As we optimize the Mixture Experts, we can even more increase number
of parameters on these 40 pews. Basically, I’m not
going to go into how we optimize Mixture Expert
implementation, but basically, the high-level idea is,
we’re going to partition the big tensors entitlement
to small tensors. So we have more opportunity to
do swapping between them. Yes.>>How module is this from
the primary set point? Do I get this for free?>>So for TensorFlowMem, you’re going to get a TensorFlowMem, basically, going to
get this for free. For this, you’re going to do a little bit better
application implementation.>>But I need to express
that in my graph.>>Yes. You need to basically change how you implement
your competition graph.>>Right.>>But you get this for free.>>Great.>>So this awful laws to draw longer sequence for
recruiting on network, for example, compared to
Vanilla TensorFlow, basically, it can scale up to 400 lens, less of 400 in a sequence, and we can scale to 800
with some runtime overhead. So basically, I’m showing you
here is time per mini-batch, with longer sequences, of course, you’re going to run slower, but compared to when you
want us on the same sequence minus we have about 50
percent of runtime overhead.>>So how much do you
remember it was this?>>I think it was 12
gigabytes of memory.>>You see the bad sides are sequenced then I guess
beyond a certain point, you would see an issue in terms
of the computers being used, so I guess, it’s part of those.>>So I’m not trying to say you should run really
really long sequence, but just in case people want to run, this gives you document. So in summary, I have talked about
three meter directors, one is, you can make better use of a network communication by
practicing communication based. First, doing fan of communication. Second, by practicing communication
based on relative magnitude, and based on whether the values
are being used, and second, you can carefully schedule a public
competition by leveraging with memory access pattern to reduce the inconsistency
in public states. Lastly, you can do better
scheduling and better placement in your professional growth to reduce memory consumption of GPU memory
by leveraging cheap host memory. So looking at the future, machine learning is
still faster working, there are a lot new
models being developed, there are a lot of new operations people want to use
in machine learning, if you don’t like the
recently proposed a capsule. So also because the heavy competition
machine learning people are developing in your
hardware architectures to favor machine learning. My interest for research
is that I want to continue to develop new software
systems for machine learning. What I see is that this
kind of software systems is pushing the boundaries of many
disciplines in computer science, and I want to go to explore different technologies to improve the state of art for
controlling systems. So the questions that I’m
really interested are: first, how can we support expanding
sets muscling competitions, and how can we take advantage
of new computer hardware. So there are a couple more concrete directions that I’m
basically interested in, one is programming
support and compilation. So how can we support new
operators, for example, capsule? There’s a recent paper
published at Google about comparing how different
compilers produce kernels. They’re basically comparing
the performance of different kernels produced
by different compilers. So they are automatical part
compilers like plate ML, Tensor Comprehension, for
a plate ML, basically, here, I think the graph
is not very clear, I apologize for that. This, basically, is showing
you the runtime of the kernel. So, basically, the red curve here shows you carefully cut
up my kernel implementation. So basically, a purple
curve shows play ML, shows tensile comprehension
is a green part here. So basically, what we serve
that the [inaudible] message I want to show with
this graph is that there’s still a big gap between the automatic compilers
compared to many optimized code. So I think there’s still a
a lot room for improvements for guaranteeing efficient kernels. Second, can we include, for example, control flow primitives
in that [inaudible] graph. For example, functions. So if your graph may have a lot of
repeated patterns or sub graph, for example, I can resonate, you have this retreat block
that you repeat many times. Can you think of defining
those repeated pattern? Right now, basically, people are
kind of doing in-place function. So can you think of attracting this as a function in
the community graph, which will simplify program effort, it also gives you maybe a more opportunity to optimize
the repeated patterns. Lastly, for example, have the colored lights competition
if we knew a hardware. Also, not the direction I’m
interested in is more than polytheme is how do we partition. Both the polytheme I think
involves two problems, how do we partition operations, and how do we place Func1 operations
on two different devices. There’s I think monthly
direction that’s interesting is thinking
about dynamic replacing operators on devices for dynamically
changing graph topologies. For example, when the graph
involves dynamic control flow, the graph topology is not
statically determined, and there might be
opportunity for dynamically placing operation center devices to improve the
competition efficiency. Because all of those directions
involves complex on some species, maybe we should leverage new
techniques, for example, machine learning for a
complex task-based to find if you have a good device
placement for efficiency. So a more concrete example
I want to talk about is, in the near future that could
be interesting direction, is to achieve memory-efficient
deep learning with high-capacity throughput. The direction as I mentioned is
suppose we have maybe considering, how do we maximize competence throughput within
the member constraint we have. So we’re not yet, there are many techniques to reduce memory consumption
for deep learning. So there’s scheduling,
which I talked about trading Degree of Parallelism
with memory consumption. There’s technique
gradient check pointing, and feeding competition
for American assumption, there’s memory swapping,
there’s also quantization which is accuracy for memory consumption. So it’s hard to determine what kind of
techniques you want to apply, or for each of this technique, there are many
[inaudible] for example, if one is scheduling,
it’s NP-hard problem, is NP-complete problem to find the scheduling that minimizes
its memory consumption. A second best configuration of those techniques will depends on your particular model
architecture and the hardware. Lastly, those techniques are
interdependent, which means, the order you apply those
techniques would also matter or you should apply them. Probably, a good question
I think to ask is, how can we minimize
training time or achieve a good model quality subject
to certain memory constraint? With that, I want to
conclude my talk. I think, without concluding my talk, and if you have any questions, please feel free to ask.>>I have probably a
lot of questions to go.>>[inaudible]. Thank you.>>Thank you.

Leave a Reply

Your email address will not be published. Required fields are marked *