[MUSIC] Hi, everyone. Thanks for coming. I am Madan Musuvathi from the Research in Software Engineering group, and it gives me utmost pleasure to introduce Jinliang back to Microsoft Research. He was our intern a couple of years ago, working on building the first version of distributed ML training. Jinliang is a PhD student at CMU working with Eric Xing and Garth Gibson. He's an expert on large-scale ML systems, and that's what he's been working on during his PhD. Some of his initial work on parameter servers contributed ideas that eventually became a startup called Petuum, started by his advisor. He's since moved on to do lots of cool things on dynamic scheduling of large-scale models, and that's what he'll be talking about today. So, Jinliang.>>Thank you, Madan

for the introduction. Hi, good morning, everybody. I was an intern here two years ago and had a really good time, and now I've come back to work here, so I'm really glad to have the opportunity to speak to you. Thank you all very much for coming.

The title of my talk is Scheduling for Efficient Large-Scale Machine Learning Training. As you all know, scheduling is a classic research topic in computer systems, but because of the workload, it's especially important for machine learning training: there are many new opportunities that we can leverage in scheduling to improve the efficiency of the machine learning computation.

As you have probably observed, over the past couple of years machine learning has achieved wide success in many application domains. Machine learning is basically a set of techniques to extract knowledge or insights from big datasets. The extracted insights are summarized as a set of mathematical equations, which we refer to as the machine learning model. Machine learning training is basically finding the parameters of those mathematical equations that best fit your observations about the problem.

In essence, machine learning training has always been a hard problem because it's computationally heavy. Today, as people continue to apply machine learning to new domains and to improve performance on existing problems, we are collecting bigger datasets and building more complex machine learning models. Because of this, the computational challenge of machine learning training is becoming heavier and heavier. Not only does machine learning training take a long time; because machine learning models have a lot of parameters and generate a lot of intermediate results, it also consumes a lot of memory. Lastly, even though distributed and parallel computing can help improve training time, implementing a parallel, distributed program is hard for many machine learning researchers and practitioners.

Motivated by these challenges, I have devoted my PhD research to improving the efficiency of machine learning training by developing practical machine learning systems that are easy to use for machine learning researchers and programmers. The key idea behind my research is to leverage the structural properties of the machine learning computation to improve the efficiency of the application. There are a couple of challenges faced by this approach. First, what kinds of structural properties can be leveraged to improve the efficiency of the machine learning computation? Second, how generalizable are those properties: can they be generalized across different models and different algorithms? And lastly, how can we leverage that information without heavily burdening the application user? By solving these challenges, I have developed two different systems for machine learning training, which I'll talk about today. I also worked quite a bit on TensorFlow to improve its memory efficiency during training.

Before I go into the details of my projects, I want to clarify that the scheduling I'm talking about is scheduling within a single training job: given a hardware resource and a training job, how do you make efficient use of that hardware to improve the efficiency of the training computation? This is in contrast to cluster scheduling, where you have a bunch of jobs and you want to maximize the throughput of your compute cluster.

When we think about a compute cluster, I think there are three most precious resources: one is your network bandwidth, second is your computation, and lastly is your memory. I have done projects exploring how to make better use of all three of them, both to improve training time and to enable training larger models without needing better hardware. That's what I'll be telling you about today: those three projects.

Despite there being many different machine learning models and learning algorithms, machine learning training typically takes this common form. First, you take many passes over your training data. Second, within each pass you process your training data in mini-batches, and from each mini-batch you produce some updates to improve your model quality. So over time, you observe your model quality gradually improve as you compute mini-batch updates and apply them to the model parameters. Your training stops when your model quality stops changing or when it becomes satisfactory. This is what we call convergence.
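The common form described here can be sketched as a generic mini-batch training loop. This toy sketch fits a one-parameter linear model with SGD; the model, loss, and learning rate are made up for illustration and are not from the talk:

```python
import random

def compute_update(w, batch, lr=0.1):
    # Gradient step for a toy 1-D linear model y = w * x (illustrative only).
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return -lr * g

def train(data, w=0.0, num_passes=50, batch_size=4, tol=1e-6):
    """Iterative-convergent loop: many passes over the data, processed in
    mini-batches, each producing an update that improves the model."""
    prev_loss = float("inf")
    for _ in range(num_passes):                 # take many passes over the data
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            w += compute_update(w, data[i:i + batch_size])  # one mini-batch step
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if abs(prev_loss - loss) < tol:         # convergence: quality stops changing
            break
        prev_loss = loss
    return w

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
print(round(train(data), 2))  # learns w close to 3.0
```

The stopping condition mirrors the convergence definition above: training ends when the quality metric stops changing, not after a fixed step count.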

This iterative-convergent nature is what distinguishes machine learning training from classic computer programs, because machine learning training is basically a search process. You can think of it this way: there's an objective function that you want to minimize or maximize by finding a good set of model parameters. You can think of this objective function as a surface in a multidimensional space. When you process mini-batches to produce updates, you're taking small steps that move your weights along this multidimensional surface, and the training process stops when this little ball gets close to the minimum.

Intuitively, looking at this process, you'll find that some error during your computation does not kill your progress. It does not mean you are completely wrong, because as long as the error is properly bounded, you can compensate for it by taking a couple more steps. This creates a unique trade-off in machine learning training: trading computation quality for computation throughput. Basically, the speed of your training algorithm depends on two factors: one is how fast you can take each step, which is your computation throughput, and the other is how good each step is, which is your computation quality. A lot of training systems make a trade-off between the two to improve their computation throughput.

One important benefit of this trade-off is that it enables a very simple and efficient way to parallelize your machine learning algorithm, which is referred to as data parallelism. In data parallelism, you simply run many or all of your mini-batches in parallel on the workers. In the most basic form of data parallelism, each worker processes a mini-batch of data and produces some updates; then all the workers synchronize, uploading their produced updates to a set of servers, which we call parameter servers, and pulling the new parameter state to begin the next mini-batch of computation. As you can see, during this process we can greatly improve the computation throughput by processing mini-batches in parallel. However, this has a negative effect on your computation quality.

The reason is that data parallelism does not retain the sequential semantics of your training computation. In sequential training, you process your mini-batches one after another, so later mini-batches can observe the updates produced by all earlier mini-batches. However, in the data-parallel form, the mini-batches that are processed in parallel cannot observe updates from the other parallel workers until the synchronization barrier. What this means is that the result produced by this parallel process will be different from the result you'd get from the sequential execution.

Suppose your model parameters, your weights, start at W_0. After you process the first mini-batch, you produce some updates, call them Delta W_1, and by applying Delta W_1 to W_0 you get the first parameter state, W_1. If you then sequentially process the next mini-batch, you get parameter state W_2. However, if you process the second mini-batch in parallel, starting from W_0, you're going to get a different Delta W_2 than in the sequential computation, and if you directly apply this Delta W_2 to W_1, you get a slightly different parameter state than the sequential execution would produce. This is why data parallelism does not give you the same result as a sequential computation.
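The W_0 / Delta W arithmetic being walked through here can be reproduced numerically. The `grad` function below is a made-up update rule, used only to show that merging parallel updates diverges from the sequential result:

```python
def grad(w, batch):
    # Toy update rule: move w toward the mean of the mini-batch (illustrative).
    return 0.5 * (sum(batch) / len(batch) - w)

w0 = 0.0
b1, b2 = [4.0, 6.0], [10.0, 12.0]

# Sequential: mini-batch 2 observes the update produced by mini-batch 1.
w1 = w0 + grad(w0, b1)       # Delta W_1 computed at W_0
w2_seq = w1 + grad(w1, b2)   # Delta W_2 computed at W_1

# Data-parallel: both mini-batches start from W_0, then updates are merged.
dw1 = grad(w0, b1)
dw2 = grad(w0, b2)           # stale: did not observe Delta W_1
w2_par = w0 + dw1 + dw2

print(w2_seq, w2_par)        # 6.75 8.0 — slightly different parameter states
```

The gap between the two final states is exactly the inconsistency the talk refers to; with a bounded update rule, a few extra steps can compensate for it.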

Even though there is some difference, usually you can compensate for it by taking some more steps. That's why data parallelism, even though it does not give you exactly the same result as the sequential computation, is still worth doing: you can greatly improve your computation throughput.

However, in order to implement an efficient data-parallel system, there's one problem that we cannot overlook, which is the communication bottleneck. The models I mentioned earlier have the property that the computation per mini-batch is pretty light, but because the models have many parameters, if we communicate updates after each mini-batch computation we incur a big communication overhead. Basically, as you can see here, if you communicate parameter updates after every single mini-batch, you spend a lot of time on communication.

The opportunity we observed here is that the parameter updates are element-wise additions. What this means is that from each mini-batch we get a bunch of updates for individual parameters, and we have the opportunity to coalesce updates from different mini-batches to reduce the volume of communication. This gives us the idea of local buffering, which basically says we can communicate after processing N mini-batches, allowing us to coalesce updates and reduce the communication volume. Secondly, it gives us the idea of bounded staleness: instead of waiting for communication to finish, we can go on to compute the next set of mini-batches while communication is still in flight. This allows us to overlap communication with computation to further reduce the communication overhead, and additionally allows us to tolerate transient stragglers.
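Because the updates are element-wise additions, local buffering can be sketched roughly like this (a hypothetical `UpdateBuffer`, not Bösen's actual implementation):

```python
from collections import defaultdict

class UpdateBuffer:
    """Coalesce additive parameter updates locally; flush every N mini-batches."""
    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.buffer = defaultdict(float)   # parameter index -> summed update
        self.batches_seen = 0
        self.messages_sent = 0

    def add(self, updates):
        for idx, delta in updates.items():
            self.buffer[idx] += delta      # element-wise addition coalesces updates
        self.batches_seen += 1
        if self.batches_seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        if self.buffer:
            self.messages_sent += 1        # one network message instead of N
            self.buffer.clear()

buf = UpdateBuffer(flush_every=4)
for _ in range(8):                         # 8 mini-batches of updates...
    buf.add({0: 0.1, 1: -0.2})
print(buf.messages_sent)                   # ...but only 2 messages sent
```

Coalescing is only sound because updates commute and add; the same trick would not work for arbitrary read-modify-write operations.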

As you can see, with those approaches we can improve computation throughput, and that's what people commonly do. However, they pay a big price in computation consistency. So the problem I'm interested in is whether we can better leverage hardware resources to still get the computation throughput from parallelization without incurring so much inconsistency in the parameter states. Next, I'm going to talk about two projects, scheduling communication and scheduling computation, to improve parameter consistency during parallel computation.

One thing we observed is that when we do data-parallel computation, between communication events the network can be idle. A straightforward idea people use is to tune the communication frequency: simply communicate all parameter updates more frequently to improve the freshness, or consistency, of the parameter states. However, this is not optimal. When you communicate all updates as one big message, you put those updates into the network queue, and they cannot be modified while they sit there. But while those updates are in transit, you might be finishing new mini-batches and producing new updates. Those new updates could ideally be coalesced with the updates in transit, giving you more value from the same amount of network bandwidth.

With this insight, the idea of my work was pretty straightforward: do fine-grained communication. We still do our periodic synchronization, but whenever we see spare network bandwidth, we do fine-grained communication to use that spare bandwidth to improve the consistency of the parameter states. Now, because we're doing fine-grained communication, we're not going to communicate all the parameter updates in one sitting. So the question becomes which parameter updates to communicate when we do this opportunistic communication. One signal we found to be useful is to communicate parameter updates based on their relative magnitude: we look at the magnitude of each parameter update relative to the parameter's value, and prioritize communicating those updates by how large the relative magnitude is.
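The relative-magnitude prioritization could look roughly like this (hypothetical names; the real system works on buffered update tables rather than plain dicts):

```python
def prioritize(params, updates, budget):
    """Pick the `budget` buffered updates with the largest magnitude
    relative to their current parameter value (illustrative sketch)."""
    def relative_magnitude(idx):
        denom = abs(params[idx]) + 1e-8    # avoid division by zero
        return abs(updates[idx]) / denom
    ranked = sorted(updates, key=relative_magnitude, reverse=True)
    return ranked[:budget]                 # communicate these first

params  = {0: 10.0, 1: 0.5, 2: 2.0}
updates = {0: 0.5,  1: 0.4, 2: 0.1}       # relative magnitudes: 0.05, 0.8, 0.05
print(prioritize(params, updates, budget=1))  # [1]
```

Note that parameter 1 wins even though its absolute update (0.4) is smaller than parameter 0's (0.5): what matters is how much the update would change the parameter relatively.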

Now I'm going to show you some experiment results to demonstrate the effectiveness of this idea. Our baseline synchronizes the model parameters N times per data pass. This is implemented in the parameter server system I developed, called Bösen, running on 16 CPU machines, and the application here is topic modeling. In the most basic form, we synchronize once after we process all the local data. As you can see, this takes a long time to converge to a good model quality. On the y-axis, I'm showing the time for the model to converge to a satisfactory quality. Yes?

>>What model is this?

>>This is the model called Latent Dirichlet Allocation, which is useful for, say, clustering documents. The algorithm I used here is a sampling algorithm; it's not SGD. But the idea applies to SGD as well. I purposely wanted to show that this works not only for SGD but also for a broader class of algorithms.

>>So this works for sparse models?

>>Yeah, this works for sparse models. In the early days, the models of interest were mostly sparse-access models. But we're also seeing today that this works on dense models like neural networks. People prioritize communication of larger updates when they do data-parallel training of neural networks where bandwidth is limited; they use this prioritization idea as a compression idea. There's a recent paper called Deep Gradient Compression, which says that instead of communicating updates for all the parameters, we only communicate updates for the top K, say one percent or maybe 0.1 percent, of the parameters.

>>And the top one percent is based on the magnitude?

>>Yeah. Basically, the relative magnitude of the parameter changes.

>>So is it reasonable to assume some layers converge faster than other layers, and is that why you can apply such an optimization?

>>Yes, that's actually true. You can even observe that some parameters converge faster than others: over time, some parameters stop changing pretty early, while others take a long time.

>>Okay. So let's say one of the layers converges really quickly. Why do I even bother sending an update for it? Freeze it, freeze that layer for the rest of the training.

>>Yes, that would be ideal; that's the idea. But how do you decide that a layer has completely stopped changing? Basically, you determine that by comparing the changes in that layer with the other layers.

>>Okay. Got it.

>>So, back to scheduled communication. Basically, when you increase

communication frequency here, you see the convergence speed improve, but beyond a certain number of synchronizations per data pass, you're not going to see any further improvement in convergence rate.

Now I'm going to show you what kind of convergence rate we can achieve with managed communication. With managed communication, we give each worker node a bandwidth budget. The worker node performs one synchronization per data pass, but it also eagerly communicates updates whenever there is spare bandwidth within its budget. For example, with 300 megabits per second, we achieve a faster convergence rate than synchronizing multiple times per data pass, and given a higher bandwidth budget, it converges even faster. What I want to show you here is that even though the convergence time is similar comparing synchronizing all parameters against fine-grained communication, we communicate fewer bytes per data pass to achieve the same effect: because we do fine-grained communication, each byte we send carries more information.

So far, the prioritization I've shown is based on randomly picking updates to send. If instead we pick updates based on relative magnitude, we see even faster convergence under the same bandwidth budget. Using only 300 megabits per second, we achieve roughly the same convergence rate as the previous scheme using 640 megabits per second, and given a higher bandwidth budget with relative-magnitude prioritization, we converge even faster. So that's basically it for this part.

>>In your distributed system, how do you decide, or reach consensus on, which parameters you want to synchronize at any point in time?

>>This uses the parameter server model. Thanks. Yes?

>>You're also assuming one big pipe between the parameter server and the workers, but a real network is more complicated. How much does that contribute to the complication here?

>>Right now, the scheme I'm talking about gives each node a bandwidth budget: this node can use this much bandwidth. The network is complicated because, for example, each machine has one network card, but the network does not actually give you that full end-to-end bandwidth. So I agree with you: finding the right per-node budget is a hard problem, and I don't have a really good solution for that; it depends on the architecture.

>>What does the bandwidth budget look like on a parameter server? Are the parameter servers sharded?

>>The architecture of the system is that there are server nodes, and the server processes and the worker processes are co-located on the same physical machines. So this bandwidth budget is shared by the server process and the worker process.

>>The bandwidth number that you're showing here, is that for the worker?

>>It's both. Basically, the server plus the worker together do not use more than 640 megabits per second. On the server side, it limits the communication of fresh parameters.

So far, I have talked

about scheduling network communication to improve the consistency of the model parameters. Now I want to talk about scheduling computation to improve the consistency of the parameter values.

Before that, I want to take a step back and look at how we parallelize a computation. In data parallelism, we simply take a random subset of mini-batches and run them in parallel. Can we think of a better way to parallelize the computation, such that the parallelization actually preserves the sequential semantics of the sequential algorithm? One thing we observed is that in some machine learning models, the parameters are sparsely accessed: processing a mini-batch of data, or a single data sample, does not access all of the model parameters. This gives you the opportunity to find mini-batches that access disjoint parameters and run them in parallel.

For some models, this property is easy to leverage, because the parameters accessed depend on some attributes of the data samples. Here I'm using an application called matrix factorization, which is the kind of model used in recommendation systems. The data samples here are users' ratings of items: each record contains a user ID, an item ID, and the rating that the user gave to the item. The parameters we want to learn are latent vectors for the users and for the items. When we use SGD to learn those parameters, there's a clear access pattern for the vectors: each record accesses the user vector indexed by its user ID and the item vector indexed by its item ID. So for each record, the parameters accessed depend on the values of these two data attributes.

More formally, we can state this property as: there exist key fields in your data attributes such that whenever two data samples differ in all of those key attributes, they will not access the same parameters. If this property holds true for your application, it gives you an easy way to partition the training data to find parallelism that does not incur conflicting parameter accesses. This property is true for matrix factorization, and also true for topic modeling, [inaudible], and some other applications.

For example, in matrix factorization, this allows us to partition the dataset on these key fields to eliminate conflicting parameter accesses. We can organize the training data into a 2D matrix with user IDs and item IDs as the matrix indices, and partition this matrix along both dimensions, such that different partitions do not access the same parameters. Here, blocks of the same color do not access the same parameters, so they can be processed in parallel without conflicting parameter accesses.
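The block scheduling implied by this coloring can be sketched as a standard diagonal-rotation scheme (block sizes and worker assignment are simplified here):

```python
def conflict_free_rounds(num_blocks):
    """Schedule a num_blocks x num_blocks grid of data blocks so that no two
    blocks in the same round share a row (user range) or column (item range)."""
    rounds = []
    for shift in range(num_blocks):
        # One "diagonal" of blocks: all distinct rows AND all distinct columns.
        rounds.append([(w, (w + shift) % num_blocks) for w in range(num_blocks)])
    return rounds

# 3 workers, ratings partitioned by user ID (rows) and item ID (columns):
for rnd in conflict_free_rounds(3):
    rows = {r for r, _ in rnd}
    cols = {c for _, c in rnd}
    assert len(rows) == len(cols) == 3     # no conflicting parameter accesses
    print(rnd)                              # e.g. round 0: [(0, 0), (1, 1), (2, 2)]
```

Within one round, each worker owns a distinct user range and a distinct item range, so the parameter vectors it touches are disjoint from every other worker's; after `num_blocks` rounds, every block has been processed exactly once.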

So basically, this gives us a way to parallelize the training algorithm without sacrificing the sequential semantics. This is actually a special case of automatic parallelizing compilers. However, to actually leverage this property, there are some challenges. The first challenge is that these properties only hold for some models, not others. So if the user wants to use this, they have to look at the training algorithm and decide whether or not the property holds; when it does not hold, we can fall back to data parallelism. But that puts a burden on the user to decide what kind of parallelization to do. Second, even when this property holds, implementing a distributed computation that exploits it is not trivial, because the user needs to figure out the computation's dependence pattern, figure out how to partition the dataset, and figure out, for example, how to coordinate workers to efficiently execute the schedule, since you need some scheduling to assign blocks of work to your parallel workers.

Those challenges inspired me to design an automatic parallelization system that ideally parallelizes your training program automatically: based on the memory access pattern, that is, how the computation accesses the model parameters, it decides how to parallelize the training algorithm. The system I implemented is called Orion. Orion abstracts your cluster as a single thread with a huge memory: from the application program's point of view, it just sees a single thread with huge memory. That big memory is abstracted as multi-dimensional arrays; for example, you can use these multi-dimensional arrays to store model parameters and to store datasets. Those multi-dimensional arrays are automatically partitioned by the system based on your computation's characteristics. The system additionally provides a primitive for the user to parallelize a single serial training for-loop across the distributed cluster of machines. The parallelization preserves the sequential semantics when possible, and otherwise falls back to data parallelism when the user gives it permission.

So now I'm going to show you

some experiment results comparing Orion with other systems. One comparison is with Bösen, the system I showed before, which uses scheduled communication to improve the convergence rate, on the same application: topic modeling with Latent Dirichlet Allocation. Here the x-axis is time, and the y-axis is the log-likelihood; for this application, we want to maximize the training log-likelihood to get a good model. Orion gives a faster convergence for this application; the speedup is not dramatic, but we still get an improvement in convergence rate. What's most significant is the network bandwidth usage. To achieve its curve, Bösen had to use excessive network communication; the green curve shows how much network bandwidth it used during the training process. As you can see, Bösen uses roughly 1,500 megabits per second, while Orion uses about five times less on average. Also, in order to implement the Bösen magnitude-prioritization algorithm, you had to write thousands of lines of C++. In Orion, you only need to implement a few hundred lines of Julia, which is a scripting language that gives you nice syntax like Python.

I'm also showing a comparison between Orion and TensorFlow, using the SGD algorithm for matrix factorization. Here we're showing convergence over time, and the y-axis is the training loss. As you can see, TensorFlow converges much more slowly than the Orion implementation. This is running on a single machine with 32 virtual CPUs. TensorFlow runs slower for two reasons. One is that TensorFlow is not highly optimized for sparse computation like matrix factorization, so each data pass in TensorFlow takes about twice as much time as in Orion. Second, because TensorFlow uses data parallelism, it suffers a slower convergence rate compared to Orion.

So far, I have talked about two projects where I schedule network communication and computation to improve the consistency of parameter states. So far we have focused on improving computation

across mini-batches. There's a trend in machine learning that the mini-batch computation is becoming more and more complex. A good example of this is deep neural networks, which have rather different characteristics compared to the models that were interesting in the early days. In deep neural networks, we see much heavier computation per mini-batch, and there is dense parameter access in each mini-batch, because it typically accesses all the parameters. Because of these two properties, when people parallelize deep neural network training using data parallelism, they typically synchronize once per mini-batch. However, the complex computation per mini-batch gives us more opportunity to schedule within a single mini-batch to improve the efficiency of deep neural network training.

You can think of a deep neural network as a cascade of functions. Each function has its own set of parameters and takes in the output of the previous function as its input. The intermediate states, the intermediate results from those functions, are also referred to as layers. In order to learn those parameters, a common algorithm that people use is called back-propagation. The idea of back-propagation is that you start from the input data and go through all the layers in the neural network to produce a prediction for the input. Then you compare your prediction with the actual label to get the error signal, which you back-propagate through all the layers to compute gradients, or updates, for the model parameters. One thing we can observe in this process is that not all of the model parameters are used at the same time, and not all updates are generated at the same time. This gives us the opportunity to schedule the communication of the parameter updates, so that we can overlap communication with computation even within a single mini-batch.

Back in 2014, a group of students and I worked on porting Caffe onto Bösen to do data-parallel training for deep neural networks. One trick we had, which we referred to as wait-free back-propagation, was to overlap the backward computation with the update communication. As I said, when you train a deep neural network, during each mini-batch you first perform the forward pass, then you perform

the backward pass. The idea of wait-free back-propagation is pretty simple: once you produce the updates for a layer, you immediately send them over the network, so that you can overlap back-propagation with your update communication. In the ideal case, the computation only has to sit idle for the communication of the last layer's parameter updates. However, things are not always so ideal: where network bandwidth is limited, the communication of your last layer could delay the communication of your first layer.

If we look closely at this process, during the backward pass you compute parameter updates for the top layers first, then gradually compute parameter updates for the lower layers. In the forward pass, however, you need the parameters for the first layer first, then gradually you need the parameters for the later layers. This means that if the communication of the later layers' parameter updates takes a long time because of limited network bandwidth, it can delay the communication of the first layer's updates.

This gives us the idea of prioritizing communication based on when each value is needed, which we refer to as priority-based parameter propagation. You still do the forward pass and backward pass as before; however, you change how you communicate the parameter updates you generate. For example, in this case, when we generate parameter updates for layer three, we don't send them all at once; we send them in small chunks. Then, when we generate updates for layer two, we stop sending updates for layer three and start sending updates for layer two. Doing this fine-grained communication gives us an opportunity to schedule communication based on when the values are needed. Finally, when we get to layer one, we can immediately start sending its parameter updates, and once the parameter updates for layer one are received, the forward computation can begin; we can then overlap computing the lower layers with the communication of the rest of the parameter updates. This basically gives us more opportunity to overlap communication with the mini-batch computation.
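The chunked, priority-ordered sending described here can be sketched as a small priority queue (illustrative only; the actual work integrates with MXNet's communication layer rather than a toy queue):

```python
import heapq

class PrioritySender:
    """Send gradient chunks in the order the forward pass will need the
    corresponding parameters: layer 1 first, then layer 2, and so on."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.queue = []        # (priority, sequence number, chunk)
        self.seq = 0

    def enqueue_layer(self, layer_id, grads):
        for i in range(0, len(grads), self.chunk_size):
            # A lower layer_id means the layer is needed earlier in the
            # next forward pass, so it gets higher sending priority.
            chunk = grads[i:i + self.chunk_size]
            heapq.heappush(self.queue, (layer_id, self.seq, chunk))
            self.seq += 1

    def next_chunk(self):
        return heapq.heappop(self.queue)[2] if self.queue else None

s = PrioritySender(chunk_size=2)
s.enqueue_layer(3, [0.3] * 4)   # backward pass produces layer 3's updates first...
s.enqueue_layer(2, [0.2] * 4)
s.enqueue_layer(1, [0.1] * 2)
print(s.next_chunk())           # ...but layer 1's chunk goes out first
```

Chunking is what makes the preemption possible: a large layer's gradient never monopolizes the link, because a newly produced lower-layer chunk jumps ahead at the next chunk boundary.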

you a simple experiment on how effective this is. Basically, we implemented

this idea in MXNet. MXNet is one of the deep learning

frameworks that people use. For data parts when the MXNet already implements with

Free Back Propagation. Basically, the current curve

shows you the anullah MXNet under the Throughput of ResNet-50 using

different network bandwidth. Basically, as you can see, when we increase network

bandwidth, ResNet is going to get higher and higher throughput. Now, the purple curve, P3, shows you the result we get after we

implement our optimization. What you observe is that when the network bandwidth is limited, for example, when it is only four gigabits per second, we see about a 30 to 40 percent improvement in the computing throughput. Yes.>>How did you run this, how did you limit the bandwidth

of the networks?>>So basically, we give each node a bandwidth budget, basically throttling all the bandwidth on each node, to 4 gigabits per second, 5 gigabits per second. There's a Linux tool, I think it's called traffic control (tc).>>Okay, got it. Right.>>Yes.>>So can you go back

to the previous slide?>>Yeah.>>So when you send a data

for L2, right?>>Yeah.>>If that gets delayed, does that mean the

forward pass for L2 gets delayed because it doesn’t

have all the data?>>Yes. It will get delayed.>>So if that happens, then you’re going to

still have a gap there?>>Yeah.>>Do you find that in practice, this type of scenario happens?>>I don't think we specifically measured how much delay we're getting from L2 being delayed. I don't think we actually measured how frequently these delays happen, because this would depend on your network architecture as well. It depends on your model, how much time you spend computing the first layer versus the time spent communicating it. So I don't think we have a specific number for how often that happens. Yes.>>What's the granularity

[inaudible] with which you split each layer?>>So this is a tuning parameter. I think when we do that, each chunk has about tens of thousands of floating point numbers. So, yes, at least tens to hundreds of kilobytes in each chunk.>>Does that size still matter at all, relative to, let's say, TCP packet sizes?>>What we find is that the software for sending each

message has some overhead; if the messages become too small, the software overhead becomes significant, because we're doing a lot of [inaudible] per packet. So we cannot make it too small. The range is, basically, I think tens of thousands to 200,000 floating point numbers per chunk. I think that range is fine. Basically, so far,

I have talked about improving the computation speed. However, besides computation speed, there's another problem that's important for training larger models,

which is the memory. So next I’m going to show you

a project that will improve memory efficiency to

allow us to train 10 times larger models

on the same hardware.>>So did you try it on some of the larger models like BERT and things like that?>>So actually, BERT is not the model that has the most parameters. It has a lot of parameters, but I have tried models with even more parameters than BERT. The model I tried is called Mixture of Experts. This model, you can change the number of experts to scale the entire model. The largest one I can run on a single GPU has about 2.5 billion parameters. BERT has about 100

million parameters. There's some runtime overhead with this approach, which I'll talk about later in the talk.>>You didn't show the

plots for all of those? You only showed ResNet-50?>>That was the previous one.>>So the 2019 paper, was it evaluated on larger models?>>So this paper, right? This paper was not evaluated on deep neural networks. That was the previous work on sparse models.>>What about this one?>>This one, let me show you the benchmarks we had.>>Sorry, the 2019 paper.>>That was the 2019 paper. Yeah, this paper. We evaluated on ResNet, basically a bunch of [inaudible], ResNet, Inception, and also on a recurrent neural network. I forgot exactly which one it was; I'd say a recurrent neural network or ResNet.>>The amount of benefit you

get from this depends on how much computation you do per parameter in your model, right?>>Yeah.>>How does that ratio of computation change across your benchmarks? Do you have, like, very high to very low?>>So you mean for this model, or?>>No, for the previous one right there, yeah, this paper. This technique.>>For this technique, you mean how much benefit we get? It depends on the number of parameters in the model.>>The ratio of computation per parameter of the model, right?>>I don't have that.>>Good luck.>>So to motivate this problem, first, people really need large

models to get good performance. So this is a paper published

by Google two years ago. This paper proposes a new layer called Mixture of Experts. So you don't have to know the

details of how this layer works. However, the idea you need

to know is that, basically, this allows you to increase

number of parameters in the model by a large factor so

that you can get good performance. So on the x-axis, I'm showing the

number of parameters in the model. So this application

is language modeling. On the y-axis, I’m

showing test perplexity. The lower the test perplexity, the better the model is. So as you can see, as we increase the number of

parameters in this model, you’re going to get better test

perplexity, or a better model quality. However, the largest model has about 130 billion parameters, and this requires about 120 GPUs to run. So however, even though people

design bigger and bigger models, GPU memory is highly limited and highly expensive. So here I'm showing

you a comparison of DRAM price with GPU price in

terms of megabytes per dollar. So as you can see over

the last 20 years, DRAM price has been decreasing. However, the GPU price is highly

expensive compared to DRAM price. So remember this a log-log scale. So you fetch leaf was

Server server side GPUs, almost three years more magnitude, more expensive compared

to DRAM price. So not only GPUs are expensive, GPU memory is high limited. With largest GPU, you can get

possibly 32 gigabytes of memory. However, this GPU, compared to its 16-gigabyte version, is about $1,400 more expensive. This means you are paying about 8.5 cents per extra megabyte, which is a lot more expensive than if you would just buy DRAM. Remember, you're getting

this memory without getting any extra computation. So, motivated by these challenges.>>Why is it that GPU memory is so expensive? [inaudible]>>So I think there are two reasons. One reason is that

there's a trade-off: if you want really large memory, you will not be able to sustain a high bandwidth, or if you want high bandwidth, the capacity of the memory cannot go really big. That's a technical trade-off, but also, I think it's because NVIDIA is dominating this market. So, basically, whatever it says, that's the price; there's no competitor in this space.>>So yes, specifically

motivated by this challenge, there has already been a lot of work on reducing memory consumption or making memory usage more efficient. There are mainly two categories of techniques. One is called Gradient Checkpointing, which basically leverages recomputation to reduce the memory footprint. The other is called Memory Swapping, which basically leverages cheap host memory to reduce memory consumption on the GPU. So basically, the high-level idea of Gradient Checkpointing is based

on backpropagation process. So during backpropagation, you

will need the intermediate results from the forward pass to compute the gradients for the

model parameters. So what this means is that by the time you've reached the last layer of your neural network, you have to cache, or store, the intermediate results from all the previous layers, so you can backprop and compute their gradients. This means if your network has n layers, you have an O(n) memory cost for storing the intermediate results. So the idea of gradient

checkpointing basically says: instead of storing all the intermediate results, we're going to store

a few checkpoints, so that we can recompute the other intermediate results and get them when they are needed. So ideally, if your

network has n layers, you can partition the network into, sorry, not n partitions, square root of n partitions, each partition having square root of n nodes, so you can reduce your memory overhead from n to square root of n.
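As a rough illustration, the square-root-of-n saving can be counted like this. This is my own toy sketch, not the actual checkpointing code; the function name and counts are made up:

```python
import math

def peak_activations(n_layers, checkpointing):
    """Toy count of activations held in memory during backprop over a
    linear n-layer network. Without checkpointing, all n activations
    are cached (O(n)). With sqrt(n) checkpointing, only the segment
    boundaries are kept, and each segment is recomputed on demand, so
    the peak is about 2 * sqrt(n)."""
    if not checkpointing:
        return n_layers
    seg = max(1, math.isqrt(n_layers))
    checkpoints = list(range(0, n_layers, seg))
    peak = len(checkpoints)
    # Backward pass: recompute one segment at a time from its
    # checkpoint, hold it while computing gradients, then free it.
    for start in reversed(checkpoints):
        recomputed = min(seg, n_layers - start)
        peak = max(peak, len(checkpoints) + recomputed)
    return peak

print(peak_activations(100, False), peak_activations(100, True))
```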

So another idea is called Memory Swapping. The idea is basically that each layer's intermediate results, after they're generated, are needed for

computing in the next layer, but they’re also needed during

the backward propagation pass. So basically, for these

long dependencies, instead of caching that

value in GPU memory, you can temporarily put them

onto cheaper host memory and swap them back in

before they're used. This is called memory swapping.
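The swap-out/swap-in decision can be sketched as a tiny planner. Again, this is a made-up illustration of the idea (names and thresholds are mine, not from any real system):

```python
def plan_swaps(tensor_uses, near_threshold):
    """Toy swap planner: each tensor maps to (step produced, step of
    next use). If the gap is long, e.g. an activation produced early
    in the forward pass but consumed late in the backward pass, plan a
    swap-out to host memory after production and a swap-in shortly
    before the consuming step."""
    plan = {}
    for name, (produced, used) in tensor_uses.items():
        if used - produced > near_threshold:
            plan[name] = {"swap_out_after": produced,
                          "swap_in_before": used}
    return plan

# Layer 1's activation in a 10-step schedule is reused at step 9 (its
# backward step), so it gets swapped; layer 4's is reused immediately.
uses = {"act1": (1, 9), "act4": (4, 5)}
print(plan_swaps(uses, near_threshold=2))
```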

Those ideas work pretty well for neural networks that are linear, because a linear network makes it easier for you to decide which nodes to checkpoint, which nodes to swap, and when to swap them. However, there are more and more

neural network architectures that are becoming non-linear. In this case, those techniques will have a hard time working well.>>What do you mean by non-linear?>>So by non-linear, I mean, for example, what I'm talking about next. You have a layer that has a large [inaudible], a lot of [inaudible], so you don't have to compute them one by one. So the example is the layer called Mixture of Experts, which I mentioned earlier. So this is a layer in

your neural network. What happens in this layer is that there are many parameters per expert. The number of experts is a hyperparameter you can tune; for example, you can set it to thousands or even hundreds of thousands. When a minibatch comes in, different data samples in that minibatch can be handled by different experts, so the experts are going to run in parallel. So this is a large fan-out architecture. You don't compute the experts one-by-one. So that's an example of

a non-linear structure. So in this case, if you compute other layers in parallel you’re going to have

a large memory overhead. So basically, the goal of my work is to come up with a memory-efficient approach to reduce memory consumption for both linear and non-linear graphs. We want to implement and evaluate this on a mature framework like TensorFlow, so that we can evaluate it for different models across a large set of benchmarks. We want to make sure that, to take advantage of this technique, the application does not

need to make any changes. So there’s already been

a bunch of prior work to reduce memory consumption for TensorFlow, doing gradient checkpointing and memory swapping, and as I mentioned, they have limitations because typically they only work well for linear graphs. There's also work on doing memory swapping for a special operation called while_loop; of course, that technique only applies if your model uses the while_loop operation. So the first idea in our technique is to trade parallelism

for memory consumption. So as I mentioned, the layers could have

large [inaudible]. The way TensorFlow schedules computation for this architecture is basically a breadth-first traversal of your graph structure. So basically, TensorFlow is going to start from a source node, and after the source node has finished computation, TensorFlow will look at all the successors of that node and schedule them by putting them into a thread pool. So as you can see, now we have four nodes running in parallel. This gives us a memory consumption of five operations. Then basically, this process proceeds, scheduling further operations as the graph finishes executing. Basically, the TensorFlow approach of doing breadth-first traversal of the graph gives us high parallelism; however, it also gives us a high memory consumption. One simple idea to reduce

memory consumption during this computation process is that we can linearize the computation graph. Basically, we can sort all the nodes in a topological order and run the operations one by one. So the idea is basically, we find a topological sorted order of all the nodes and run them one by one, and this gives us a peak memory consumption of four operations. So the idea we propose is basically something in the middle. So basically, we

propose to partition the computation graph

into smaller partitions, and we find the topological sorted

order among those partitions.>>This is the forward pass or

both forward and backward pass?>>So this is the forward pass, but I'm thinking of a general computation graph.>>So when you are doing training, you're going to have a backward pass, in which you still have the same problem of keeping the data around to use in the backward pass.>>Right. So that's true. Yes, that's true. That's the second

idea I'm going to talk about.>>Okay.>>So basically, you can still use the swapping idea to address that problem.>>Right.>>So yes, I agree with you. Specifically, for this part, we are going to explore parallelism within a partition. Then, we're going to [inaudible] among partitions to constrain the memory consumption. So for the problem that I mentioned: a partition's result could be used in a nearby partition, but it could also be used in a partition that's really far away. So basically, when the result is being used in a partition that's really far away, we're going to still swap

them out to host memory, and swap them back in before they are needed, to control the memory usage. Because we find a topological sorted order and execute the partitions one-by-one, this gives us an easy time deciding when to swap things out and when to swap them back in.
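The parallelism-for-memory trade just described can be sketched with a toy model. This is my own illustration, not the TensorFlow scheduler; the function name and sizes are made up:

```python
def peak_workspace(n_branches, workspace, max_parallel):
    """Toy model of trading parallelism for memory: n_branches
    independent subgraphs (e.g. experts) each need `workspace` units
    of temporary GPU memory while running. Breadth-first scheduling
    effectively runs all of them at once; executing partitions of at
    most `max_parallel` branches in topological order bounds the peak
    to max_parallel * workspace."""
    peak, running = 0, 0
    for _ in range(n_branches):
        if running == max_parallel:
            running -= 1          # a branch finishes; its memory is freed
        running += 1
        peak = max(peak, running * workspace)
    return peak

# 8 experts, 100 MB of workspace each:
print(peak_workspace(8, 100, max_parallel=8))  # breadth-first schedule
print(peak_workspace(8, 100, max_parallel=2))  # partitioned schedule
```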

To give you an idea of the effectiveness of this technique, we trained a transformer model and used Mixture of Experts as the Feed

Forward layer in that model, and we are able to reduce

the memory consumption for that particular model from 9.5 gigabytes to 6.8 gigabytes. The last idea we have is particularly for models that have a large number of parameters. For example, in the

transformer that I just showed you that’s about 800

million parameters. So the issue is that the TensorFlow implementation tends to store all the parameters and constants as persistent tensors in GPU memory. So what we propose is that we want to store these variables in host memory and bring them back to GPU memory when they are needed for the computation. So in TensorFlow, when we need a tensor value across devices, TensorFlow is basically going to insert a pair of send-receive operations to communicate that tensor variable

from CPU memory to GPU memory. So in this case, for example, if that variable is going

to be needed both in the forward pass and

in the backward pass, TensorFlow will insert one pair of send and receive to communicate that value from CPU to GPU. What this means is that once the GPU receives that tensor, the GPU has to cache that tensor value in GPU memory for a really long time. However, what we propose is to insert send-receive pairs for this variable only when they are needed. So basically, here the variable is needed in two places, so we're going to insert two send-receive pairs to communicate this variable from CPU to GPU. This allows us to reduce the memory consumption for models that have a lot of parameters. For example, for the transformer

with Mixture of Experts, we reduced the memory consumption from 6.8 gigabytes to roughly 3.3 gigabytes.
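The residency-versus-transfers trade behind this variable-placement idea can be sketched as follows. This is a toy accounting model I am adding for illustration, with made-up names; it is not the actual TensorFlow send/receive machinery:

```python
def gpu_variable_memory(step_reads, on_demand):
    """Toy model of the variable-placement idea. `step_reads` lists,
    for each step of the computation, which variables it reads. With
    persistent placement, every variable stays resident on the GPU for
    the whole run. With on-demand send-receive pairs, a variable is
    copied in only for the steps that read it, so peak residency is
    the largest per-step working set, at the cost of extra transfers."""
    if not on_demand:
        all_vars = set().union(*step_reads)
        return len(all_vars), 0              # (peak residency, transfers)
    peak = max(len(reads) for reads in step_reads)
    transfers = sum(len(reads) for reads in step_reads)
    return peak, transfers

# w1..w3 are read once in the forward pass and again in the backward pass.
steps = [{"w1"}, {"w2"}, {"w3"}, {"w3"}, {"w2"}, {"w1"}]
print(gpu_variable_memory(steps, on_demand=False))
print(gpu_variable_memory(steps, on_demand=True))
```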

So we implemented this technique in TensorFlow core and submitted a patch to TensorFlow core, and there's no change in the

TensorFlow Python API. So applications can leverage this technique with no change

in the application code. So we tested this on

machines that have an NVIDIA GPU, with 32 virtual cores and 64 gigabytes of CPU memory. So our benchmark consists

of five different models that use different architectures. For example, we have a transformer, which uses attention. We have a transformer with Mixture of Experts, and we have ResNet, which is a convolutional neural network; we have GANs; and we have statically unrolled RNNs. First of all, what I want

to draw your attention to is this transformer with Mixture of Experts, which I've talked about earlier. So basically, on the y-axis, I'm showing you the peak

memory consumption of the model running on different systems with

different optimizations. The purple bar shows you

execution on vanilla TensorFlow, and the green bar shows the execution with partitioning plus swapping, and the blue bar shows you, additionally, the execution with the variable placement optimization. So as you can see, for the

transformer with Mixture of Experts, we can reduce the peak memory consumption from about 9.5 gigabytes to 3.3 gigabytes. So this is a model that has a lot of parameters, about 800 million parameters. The other models don't have too many parameters; the second largest model, ResNet, I think has about 60 million parameters. So the next result I'm going

to draw your attention to is GANs. For GANs we achieved the largest memory saving: we reduced memory consumption from about 11 gigabytes to about 1.4 gigabytes, roughly an 87 percent memory reduction. On average, we can achieve a memory reduction of about 70 percent. So this optimization brings

some runtime overheads, because, one, we're restricting how much parallelism we can use; second, we incur additional GPU-CPU data communication. The largest overhead comes for the model that has a lot of parameters. Specifically, the bars here show the runtime overhead compared to running

on vanilla TensorFlow.>>Is this overhead for throughput or for convergence?>>Right. So this is overhead in terms of throughput, because our optimization does not change the semantics of the computation. You still get the same result running with it. So this is basically the time to complete one mini-batch. Yes.>>You should finish

the third section. We should probably wait. Perfect.>>All right.>>Okay, I still have

a few more minutes.>>My question was for idea two, which you were showing, where you're swapping things back and forth. I think I missed the part [inaudible].>>Yeah.>>Is it just similar to that, or is there some delta over there?>>So it's similar. The difference is that prior work treats the neural network architecture as a linear architecture, layer by layer. The part that is different is that we partition the general graph, because in general the graph is not a linear sequence of operations.>>Is that statement true? You said that in general

the graphs that people used for parallelism

they are not linear.>>That's not exactly true. So the model architectures, a lot of them are linear. For example, if you look at ResNet, it's a linear sequence of layers; some are not linear, for example the Mixture of Experts I mentioned. But if you look at the computation graph TensorFlow uses, it's not a linear sequence of operations, because each layer consists of many operations, and they have different fan-outs in each layer. So if you look at the actual computation graph specifically, it's not a linear graph. But if you look at it

at a high level, it may be a linear sequence of layers.>>I can vouch for that. We ran our own graph generator for [inaudible], and even though it used 24 [inaudible], it was all just bushy and spread out.>>So, one more question. My machine has this much

budget of memory; can your algorithm optimize it such that it uses at most that amount of memory?>>That's a very good question. That's something I've

planned to do in the future but right

now we don’t have that. So I can talk about that

towards the end of the talk. So that's basically, as I was saying, the runtime overhead. Basically, for the models that have a lot of parameters, we have about 3.4 times runtime overhead. In the good cases, for example,

the overheads are about 30 percent to achieve about 40 percent memory saving. So on average, the overhead is about 2.22 times. However, if you exclude the high-overhead case, the Mixture of Experts with a high number of parameters, then with 55 percent overhead you can achieve about 60

percent memory reduction. So one of the direct benefits of this optimization is that it

allows us to run bigger models. For example, for the Mixture of Experts that I mentioned earlier, on vanilla TensorFlow you can run basically four experts per layer. This gives you about 0.24 billion parameters on a single GPU, and if we use our optimizations, you can scale the number of parameters per model by roughly 10 times. It also allows us to run much deeper models. For example, for ResNet, we scaled ResNet up to almost 2,000 layers, compared to vanilla TensorFlow, which can run about 500 layers of ResNet. So this also works in

the distributed setting. So we use a library called Mesh TensorFlow; this gives a way to do model parallelism across different GPUs. By using Mesh TensorFlow across four different machines, Mesh TensorFlow allows you to run a Mixture of Experts with about 120 experts per layer. This is, of course, four machines with four GPUs. So I want to mention that

the experts used here, each expert is not the same size as the ones used there. Basically, these experts are smaller. So when we scale Mesh TensorFlow to 16 machines, we don't get a linear scaling in terms of parameters. Basically, we increase the number of machines by four times while only increasing the parameter size by almost twice. So when we use our

system, TensorFlowMem, on the same four machines, we can scale the number of parameters to a total of six billion parameters. As we optimize the Mixture of Experts implementation, we can increase the number of parameters on these GPUs even more. Basically, I'm not

going to go into how we optimize Mixture Expert

implementation, but basically, the high-level idea is,

we're going to partition the big tensors into small tensors, so we have more opportunity to do swapping between them. Yes.>>How modular is this from the programmer's standpoint? Do I get this for free?>>For TensorFlowMem, you're basically going to

get this for free. For this one, you're going to do a little bit of extra application implementation.>>But I need to express

that in my graph.>>Yes. You need to basically change how you implement

your computation graph.>>Right.>>But you get this for free.>>Great.>>So this also allows us to train longer sequences for recurrent neural networks. For example, vanilla TensorFlow can basically scale up to a sequence length of 400, and we can scale to 800 with some runtime overhead. So basically, I'm showing you

here is the time per mini-batch. With longer sequences, of course, you're going to run slower, but compared to running on the same sequence length, we have about 50 percent runtime overhead.>>So how much memory was this?>>I think it was 12 gigabytes of memory.>>As you increase the sequence length, then I guess

beyond a certain point, you would see an issue in terms

of the compute being used, so I guess it's part of those trade-offs.>>So I'm not trying to say you should run really

really long sequences, but just in case people want to, this gives you that option. So in summary, I have talked about three main directions. One is, you can make better use of network communication by prioritizing communication: first, doing fine-grained communication; second, prioritizing communication based on relative magnitude and based on when the values are needed. Second, you can carefully schedule the parallel computation by leveraging the memory access pattern to reduce the inconsistency in parameter states. Lastly, you can do better scheduling and better placement in your computation graph to reduce consumption of GPU memory by leveraging cheap host memory. So looking to the future, machine learning is

still fast evolving; there are a lot of new models being developed, and there are a lot of new operations people want to use in machine learning, for example, the recently proposed capsule. Also, because of the heavy computation in machine learning, people are developing new hardware architectures to favor machine learning. My interest for research

is that I want to continue to develop new software

systems for machine learning. What I see is that this kind of software system is pushing the boundaries of many disciplines in computer science, and I want to explore different technologies to improve the state of the art for machine learning systems. So the questions that I'm really interested in are: first, how can we support the expanding set of machine learning computations, and how can we take advantage of new computer hardware? So there are a couple of more concrete directions that I'm

basically interested in, one is programming

support and compilation. So how can we support new

operators, for example, capsule? There’s a recent paper

published at Google about comparing how different

compilers produce kernels. They’re basically comparing

the performance of different kernels produced

by different compilers. So there are automatic kernel compilers like PlaidML and Tensor Comprehensions. Basically, here, I think the graph is not very clear, I apologize for that. This, basically, is showing

you the runtime of the kernels. Basically, the red curve here shows a carefully hand-optimized kernel implementation, the purple curve shows PlaidML, and Tensor Comprehensions is the green curve here. Basically, the message I want to show with this graph is that there's still a big gap between the automatic compilers compared to manually optimized code. So I think there's still a lot of room for improvement in generating efficient kernels. Second, can we include, for example, control flow primitives

in the computation graph, for example, functions? Your graph may have a lot of repeated patterns or subgraphs; for example, in ResNet, you have this residual block that you repeat many times. Can you think of defining those repeated patterns? Right now, basically, people are kind of inlining them. So can you think of abstracting this as a function in the computation graph? This would simplify programming effort, and it also maybe gives you more opportunity to optimize the repeated patterns. Lastly, for example, how do we accelerate this computation

with new hardware? Also, another direction I'm interested in is model parallelism. Model parallelism, I think, involves two problems: how do we partition operations, and how do we place those operations onto different devices. There's, I think, one

direction that’s interesting is thinking

about dynamically placing operators on devices for dynamically

changing graph topologies. For example, when the graph

involves dynamic control flow, the graph topology is not

statically determined, and there might be

opportunities for dynamically placing operations onto devices to improve the computation efficiency. Because all of those directions

involve complex search spaces, maybe we should leverage new techniques, for example, machine learning, to search the complex space to find a good device placement for efficiency. So a more concrete example

I want to talk about, which in the near future could be an interesting direction, is to achieve memory-efficient deep learning with high computation throughput. The direction, as I mentioned, is: how do we maximize computation throughput within the memory constraint we have? As we know, there are many techniques to reduce memory consumption

for deep learning. There's scheduling, which I talked about, trading degree of parallelism for memory consumption. There's the technique of gradient checkpointing, trading computation for memory consumption. There's memory swapping, and there's also quantization, which trades accuracy for memory consumption. So it's hard to determine what kind of

techniques you want to apply, and for each of these techniques, there are many [inaudible]; for example, for scheduling, it's an NP-complete problem to find the schedule that minimizes memory consumption. Second, the best configuration of those techniques will depend on your particular model architecture and the hardware. Lastly, those techniques are interdependent, which means the order in which you apply those techniques also matters. Probably, a good question

I think to ask is, how can we minimize

training time or achieve a good model quality subject

to a certain memory constraint? With that, I want to conclude my talk. If you have any questions, please feel free to ask.>>We probably have a lot of questions.>>[inaudible]. Thank you.>>Thank you.
