@scolobb The speaker discussed why deep neural networks are overparameterized (the number of parameters often exceeds the number of training examples) but still give excellent results.
@scolobb Conclusion: they do not really know, but let's find out!
@scolobb One would expect overfitting at some point: excellent performance on the training data, but bad performance on the test data.
@scolobb My guess would be that deep neural networks use *alternative* parameters on every layer that do not compete for finding the best fit. Does BERT really need 12 layers? Theory suggests it does not...
@scolobb ... but we don't know how to train more concise models yet. Dacheng Tao suggested using stochastic gradient descent instead of (full-batch) gradient descent, but I did not fully understand what that solves.
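@scolobb To make the difference concrete for myself, here is a minimal sketch, assuming the suggestion referred to stochastic (mini-batch) gradient descent as opposed to full-batch gradient descent. The toy linear model, the data generator, and all names and hyperparameters (`make_data`, `full_batch_gd`, `sgd`, `lr`, `batch_size`) are my own illustration, not anything shown in the talk. The only difference between the two loops is that SGD estimates the gradient from a small random batch, so each step is cheaper but noisier.

```python
import random


def make_data(n=200, seed=0):
    """Toy data (an assumption for illustration): y = 3x + 0.5 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def grad(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w*x + b over the given examples."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def full_batch_gd(xs, ys, lr=0.1, steps=500):
    """Gradient descent: every step uses the gradient over ALL examples."""
    w = b = 0.0
    for _ in range(steps):
        gw, gb = grad(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return w, b


def sgd(xs, ys, lr=0.1, steps=500, batch_size=8, seed=1):
    """Stochastic gradient descent: every step uses a small random batch."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = b = 0.0
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)
        gw, gb = grad(w, b, [xs[i] for i in batch], [ys[i] for i in batch])
        w -= lr * gw
        b -= lr * gb
    return w, b


xs, ys = make_data()
print("full-batch GD:", full_batch_gd(xs, ys))
print("mini-batch SGD:", sgd(xs, ys))
```

On this toy problem both end up close to the true slope and intercept. Whether the gradient noise in SGD is part of why overparameterized networks generalize is exactly the kind of open question the talk was about, so this only shows the mechanics, not an answer.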
@scolobb (now just structuring my thoughts instead of answering your question, probably)
@scolobb On Monday, Geoff Hinton said he does not believe that the brain does gradient descent, but it's the best we can do at the moment to train complex models.