
OK, I think I have a strong suspicion now what is wrong with the server


Inspiration


First let me explain what I observed

 

While mining, the dreaded errors came up once more and actions were rolled back.

So far so good: while annoying, this is to be expected with certain server problems.

Where it gets interesting is what I observed several times afterwards.

 

For instance, I saw some actions succeed and give ore, and then further actions roll back the terrain modifications.

Sometimes several were rolled back at once. This is still somewhat explainable by regular issues, albeit extra buggy ones.

 

But now it gets interesting. I left the spot of ore alone for a little while and mined on the other side of a cliff (as far as that succeeded).

When I switched back, terrain that was long gone came back on my first action and then disappeared again as I interacted with it.

And not just for one action I did, but for several, and not all at once!

 

Some educated guess at what is wrong

 

This all looks like inconsistencies in the server state: different servers or server sub-systems see a different reality and are not properly synchronized. I have observed the same when mining with another player: terrain modifications made by one were not visible to the other, while at another time and with another player we were seeing each other's actions just fine.

 

This leads me to the conclusion that the method used to accept a huge volume of player actions, likely a form of partitioning/load-balancing to divide the workload in combination with caching sub-systems, is the root cause of most of the experienced issues. Work is assigned to systems that are not up to date, which subsequently refuse the action. If a player acts on what the client shows and the server executing the action sees another reality, this triggers code that invalidates said action. It has to, in order to protect the integrity of the server and counter manipulation attempts. Yet in this case it is the server that is validating against an incorrect internal state. This results in retries by players, causing even more load and more "random" phenomena. Some of these actions might end up served by a correct server and succeed while other attempts fail, all worsening the inconsistencies over time.
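
To make the suspected mechanism concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration (the `Replica` class, the per-voxel version counter, the `replicate` step); nothing is known about how the actual servers are built. It only shows how a correct client request can be rejected by a server that validates against stale state.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One server's view of the terrain: voxel id -> version number (hypothetical)."""
    name: str
    voxels: dict = field(default_factory=dict)

    def validate_and_apply(self, voxel_id, expected_version):
        """Accept a 'mine' action only if the client saw the version this server holds."""
        current = self.voxels.get(voxel_id, 0)
        if current != expected_version:
            return False  # this server "sees another reality": the action is rolled back
        self.voxels[voxel_id] = current + 1
        return True


def replicate(source, target):
    """Hypothetical sync step; the symptoms appear when it runs late or not at all."""
    target.voxels = dict(source.voxels)


def demo():
    a = Replica("A", {"rock-1": 0})
    b = Replica("B", {"rock-1": 0})
    results = []
    # First dig is routed to A and succeeds; the client now shows version 1.
    results.append(a.validate_and_apply("rock-1", expected_version=0))
    # Second dig is load-balanced to B, which never received A's update.
    results.append(b.validate_and_apply("rock-1", expected_version=1))
    # Once B has caught up, the very same request is accepted.
    replicate(a, b)
    results.append(b.validate_and_apply("rock-1", expected_version=1))
    return results
```

In this toy model the second dig fails even though the player did nothing wrong, and a retry succeeds or fails depending purely on which replica serves it, which is the kind of "random" behavior described above.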

 

So the primary cause does not seem to be a network issue, a database issue or the sheer number of players hitting the server.

Obviously each of those can be an issue as well, and they won't help (they only muddy the waters).

But the behavior I described here is not explainable by network or database issues alone, or by a mixture of the two. It clearly has to do with how work is divided/assigned and/or with caching.


11 hours ago, Inspiration said:

This all looks like inconsistencies in the server state: different servers or server sub-systems see a different reality and are not properly synchronized. I have observed the same when mining with another player: terrain modifications made by one were not visible to the other, while at another time and with another player we were seeing each other's actions just fine.

Your observations and conclusions match mine. This extends to inventory state, too, and interactions between inventory state and "world" resources (pretty much "mining") are where the two sets of inconsistencies combine to disrupt gameplay the most.

I found last night that if I started drilling through "rock" I'd get a "Missing parameter" message on the first 'dig', but it'd go away after that. I surmise that this error message comes up because the operation is interacting with my inventory to see whether it should collect the stuff that's being removed from the world, and it (sensibly) only checks that each time I hit LMB or the automine key. I could drill tunnels through rock all night with little problem. Most of the problems I had occurred when shifting "stuff" from the world to my inventory, or vice versa in the case of repairing or refuelling.

11 hours ago, Inspiration said:

So the primary cause does not seem to be a network issue, a database issue or the sheer number of players hitting the server.

Obviously each of those can be an issue as well, and they won't help (they only muddy the waters).

But the behavior I described here is not explainable by network or database issues alone, or by a mixture of the two. It clearly has to do with how work is divided/assigned and/or with caching.

I don't know what you've seen about how NQ are using distributed cloud services to handle both the volume of compute and the issues of geographical spread. My understanding is that the network-based nature of the cloud, and the way they're dividing up and even making local copies of the world to keep local pings-to-server low, is going to have a much greater impact on the integrity of their meshed servers and 'master' databases than it normally would. It wouldn't surprise me if their problems are a combination of "trying too hard to keep things centrally accounted for with millisecond precision" and "the speed of light getting in the way".
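
As a rough illustration of the "speed of light" point, the sketch below computes a lower bound on round-trip time between two data centres. The 6000 km distance is a made-up example and the 2/3 factor is only a common rule of thumb for light in optical fibre; real routes add switching and queuing delay on top.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum
FIBER_FACTOR = 2 / 3               # rule-of-thumb slowdown in optical fibre (assumption)


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over fibre, ignoring all processing delay."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


# Hypothetical pair of regions 6000 km apart: roughly 60 ms before any work is done.
example_ms = min_round_trip_ms(6000)
```

With a floor of that order, keeping remote copies of the world consistent within a few milliseconds is physically impossible, so some staleness between regions has to be tolerated by design.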


It all depends on when the network is busy or the servers are under heavy load. Synchronizing itself also produces load, which will likely increase sharply with many players. Presumably the individual servers could no longer handle this, and the instance types had to be upgraded.

Such problems can arise when the load is distributed across different servers without taking player sessions into account. Ideally, all players within a zone should run on one server, even if game zones change and zone borders are a challenge even in normal MMOs. Zones with many players use a stronger server, zones with few players can use weaker servers, and zones without players can stop the server completely and thus save server costs. All of this should be possible in the cloud.
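
A minimal sketch of the zone idea, assuming a hypothetical pool of three servers, made-up zone ids and arbitrary sizing thresholds. It contrasts zone-aware routing (a stable hash of the zone id) with zone-unaware round-robin balancing, and adds a toy sizing rule.

```python
import zlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical pool


def route_by_zone(zone_id, servers=SERVERS):
    """Send every request for a zone to the same server (stable hash of the zone id)."""
    return servers[zlib.crc32(zone_id.encode("utf-8")) % len(servers)]


def route_round_robin(request_number, servers=SERVERS):
    """Zone-unaware balancing: consecutive requests land on different servers."""
    return servers[request_number % len(servers)]


def instance_for(player_count):
    """Toy sizing rule; the threshold of 50 players is an arbitrary assumption."""
    if player_count == 0:
        return "stopped"
    return "small" if player_count < 50 else "large"


def servers_touched(router, keys):
    """Set of servers that end up handling the given requests."""
    return {router(key) for key in keys}
```

Five requests for the same zone touch a single server under `route_by_zone`, so there is only one view of that zone's terrain to keep correct, while `route_round_robin` spreads the same five requests over all three servers. The hard part mentioned above, players crossing zone borders, is not covered by this sketch.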


Basically they are struggling with data loss/corruption caused by client-server lag and/or desynchronization between servers. This fits with recent actions to try to minimize player traffic to the servers. But there is also ample evidence of plain bugs in the game, like tutorials, client crashes, HTTP timeouts etc., likely caused by hastily written code in the last week or so. It also explains to some degree why activating more servers would not be a fix-all solution to the problems we are having.
